Wafer icon

Wafer

Wafer is an enterprise LLM inference platform for fast open-source model access via serverless APIs and dedicated endpoints, with OpenAI-compatible workflows.

Wafer

Enterprise LLM inference platform

Wafer is an enterprise-focused platform for serving open-source large language models through both serverless and dedicated inference. Its homepage positions the service around fast APIs for open models, while the manifesto frames the company mission as maximizing intelligence per watt through AI infrastructure optimization.

The platform splits into two main offerings: Serverless access for open models with no infrastructure or deployment overhead, and Dedicated Inference for sensitive or mission-critical workloads. The site also says dedicated endpoints can be set up in less than 24 hours and that Serverless endpoints follow the OpenAI Chat Completions schema for easier client compatibility.

Core capabilities

Serverless access to open models

Access open models through Serverless inference without managing infrastructure or deployment overhead.

Dedicated inference endpoints

Use dedicated endpoints for mission-critical workloads that need tailored inference settings and predictable performance.

OpenAI-compatible API workflow

Send requests with an OpenAI Chat Completions-compatible schema, including streaming, tool use, and JSON mode on Serverless models.

Server-side cache pricing

Rely on automatic prompt-prefix caching for repeated prompts, long system prompts, multi-turn chats, and RAG-heavy workloads.

Published model lineup

Choose from the models shown on the homepage, including GLM-5.1, Kimi-K2.6, and Qwen 3.5 397B-A17B.

Workload-specific optimization

Use performance-tuned deployments designed around model, accelerator family, traffic patterns, and production constraints.

Where Wafer fits

  • Fast access to open models

    Teams that want to call open models without standing up their own inference stack can use Serverless APIs and avoid deployment overhead.

  • Production AI workloads

    Applications with sensitive data or uptime requirements can use Dedicated Inference with isolated endpoints and SLA-backed availability.

  • OpenAI-compatible integrations

    Builders of chatbots, copilots, and agents can keep existing OpenAI-style clients and switch the base URL and API key to Wafer.

  • Repeated-context prompting

    Workloads with long prompts or repeated context, such as multi-turn support or RAG, can benefit from automatic cache pricing on repeated prefixes.

  • Custom model optimization

    Model teams that need tuned performance for a specific accelerator family or workload profile can use dedicated deployments optimized around those constraints.

Pros and Cons

Pros

  • Offers both serverless and dedicated inference options.
  • Supports OpenAI Chat Completions-compatible requests for easier drop-in use.
  • Describes automatic cache billing for repeated prompt prefixes.
  • Publishes benchmark results and latency-throughput comparisons on the homepage.
  • Provides an SLA with a 99.9% monthly availability target for Dedicated Inference.

Cons

  • Pricing details are not available on the pricing page; the pricing URL currently returns a 404.
  • The public homepage shows a limited model list, with three Serverless models named explicitly and more only hinted at.
  • Integrations beyond OpenAI-compatible clients are not documented in the provided sources.

FAQ

What does Wafer do?

Wafer provides serverless inference for open models and dedicated endpoints for sensitive or production workloads.

Can Wafer work with OpenAI-compatible clients?

Yes. Wafer says its Serverless endpoints follow the OpenAI Chat Completions schema, so existing clients can switch by changing the base URL and API key.

How does caching work?

Wafer says repeated prompt prefixes are cached automatically and billed at the Cache rate shown on each model card. The cache is server-side, so there is no header or flag to enable it.

What is Wafer's dedicated offering for?

For Dedicated Inference, Wafer says it can provision custom-tuned deployments in under 24 hours and offers SLA-backed uptime with zero data retention available for compliance-bound workloads.

Which models are available on Wafer?

The homepage lists three Serverless models today: GLM-5.1, Kimi-K2.6, and Qwen 3.5 397B-A17B. The site also says more models are rolling out.

Quick Facts

Category
Enterprise LLM inference
Product type
Open-source model hosting and serving
Deployment options
Serverless and Dedicated Inference
API compatibility
OpenAI Chat Completions schema for Serverless
SLA
99.9% monthly availability target for Dedicated Inference
Website
wafer.ai
Wafer - AI Tool, Features, Use Cases & Alternatives | UStack