Serverless access to open models
Access open models through Serverless inference without managing infrastructure or deployment overhead.
Wafer is an enterprise LLM inference platform for fast open-source model access via serverless APIs and dedicated endpoints, with OpenAI-compatible workflows.
Wafer is an enterprise-focused platform for serving open-source large language models through both serverless and dedicated inference. Its homepage positions the service around fast APIs for open models, while the manifesto frames the company mission as maximizing intelligence per watt through AI infrastructure optimization.
The platform splits into two main offerings: Serverless access for open models with no infrastructure or deployment overhead, and Dedicated Inference for sensitive or mission-critical workloads. The site also says dedicated endpoints can be set up in less than 24 hours and that Serverless endpoints follow the OpenAI Chat Completions schema for easier client compatibility.
Access open models through Serverless inference without managing infrastructure or deployment overhead.
Use dedicated endpoints for mission-critical workloads that need tailored inference settings and predictable performance.
Send requests with an OpenAI Chat Completions-compatible schema, including streaming, tool use, and JSON mode on Serverless models.
Rely on automatic prompt-prefix caching for repeated prompts, long system prompts, multi-turn chats, and RAG-heavy workloads.
Choose from the models shown on the homepage, including GLM-5.1, Kimi-K2.6, and Qwen 3.5 397B-A17B.
Use performance-tuned deployments designed around model, accelerator family, traffic patterns, and production constraints.
Teams that want to call open models without standing up their own inference stack can use Serverless APIs and avoid deployment overhead.
Applications with sensitive data or uptime requirements can use Dedicated Inference with isolated endpoints and SLA-backed availability.
Builders of chatbots, copilots, and agents can keep existing OpenAI-style clients and switch the base URL and API key to Wafer.
Workloads with long prompts or repeated context, such as multi-turn support or RAG, can benefit from automatic cache pricing on repeated prefixes.
Model teams that need tuned performance for a specific accelerator family or workload profile can use dedicated deployments optimized around those constraints.
Wafer provides serverless inference for open models and dedicated endpoints for sensitive or production workloads.
Yes. Wafer says its Serverless endpoints follow the OpenAI Chat Completions schema, so existing clients can switch by changing the base URL and API key.
Wafer says repeated prompt prefixes are cached automatically and billed at the Cache rate shown on each model card. The cache is server-side, so there is no header or flag to enable it.
For Dedicated Inference, Wafer says it can provision custom-tuned deployments in under 24 hours and offers SLA-backed uptime with zero data retention available for compliance-bound workloads.
The homepage lists three Serverless models today: GLM-5.1, Kimi-K2.6, and Qwen 3.5 397B-A17B. The site also says more models are rolling out.
Pioneer AI fine-tunes open-source language models and keeps them improving in production for classification, extraction, and other tasks.
AakarDev AI helps teams manage AI provider access, project-level setups, logs, and analytics from one dashboard. It supports BYOK workflows and lists providers including OpenAI, Google Gemini, Anthropic, Groq, Mistral AI, and Perplexity AI.
Benchspan is an AI agent security platform that discovers agents, blocks prompt injection and data exfiltration in real time, and supports pre-launch red teaming. It is aimed at teams running agents in production and includes Python and TypeScript SDKs.
Edgee is an AI gateway for coding agents and LLM-powered apps. It compresses token traffic, routes requests across models, and provides observability and team controls to help reduce cost and keep sessions running.
Codex Plugins bundle reusable skills, app integrations, and MCP servers into workflows you can install in the Codex app or use from Codex CLI. They help extend Codex with connected-service tasks, reusable instructions, and shared team workflows.
Wallie is an open-source AI streamer that watches your screen, hears chat, and delivers live commentary in a configurable persona. Runs locally with your own keys.