
Mercury 2

Mercury 2 is Inception’s diffusion-based reasoning LLM for low-latency production AI workflows that involve iterative agent and retrieval steps.


What is Mercury 2?

Mercury 2 is a reasoning-focused large language model (LLM) introduced by Inception. Its core purpose is to deliver fast reasoning performance for production AI workloads—especially where latency compounds across iterative “loops” like agent steps, retrieval pipelines, and extraction jobs.

Unlike autoregressive models that generate one token at a time left-to-right, Mercury 2 is described as using a diffusion-based approach for real-time reasoning. The model generates outputs through parallel refinement, producing multiple tokens simultaneously and converging over a small number of steps.

Key Features

  • Diffusion-based, parallel refinement generation: Produces multiple tokens at once rather than sequential decoding, targeting lower end-to-end latency for interactive systems.
  • Speed optimized for production: Reported as 1,009 tokens/sec on NVIDIA Blackwell GPUs, designed to reduce perceived wait times under load.
  • Tunable reasoning: Allows configuration of reasoning behavior while maintaining the intended speed–quality balance.
  • 128K context: Supports long inputs via a 128K context window.
  • Native tool use: Includes built-in capability for invoking tools as part of reasoning workflows.
  • Schema-aligned JSON output: Can return structured outputs aligned to a schema, useful for downstream automation.
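Schema-aligned JSON output matters because downstream automation can parse responses without defensive string handling. As a minimal sketch of how a consumer might validate such output (the model call is stubbed, and the field names and schema shape are hypothetical — the page does not specify Mercury 2's actual request or response format):

```python
import json

# Hypothetical schema the model is asked to follow (illustrative only).
EXTRACTION_SCHEMA = {"invoice_id": str, "total": float, "vendor": str}

def parse_schema_aligned(raw: str, schema: dict) -> dict:
    """Parse a response expected to be schema-aligned JSON, checking
    that every required field is present with the expected type."""
    data = json.loads(raw)
    for field, expected_type in schema.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise TypeError(f"{field} should be {expected_type.__name__}")
    return data

# Stubbed model output standing in for a Mercury 2 response.
raw_response = '{"invoice_id": "INV-42", "total": 119.5, "vendor": "Acme"}'
record = parse_schema_aligned(raw_response, EXTRACTION_SCHEMA)
```

The point of schema alignment is that this validation step rarely fires: the model is steered toward output that already matches the schema, so parsing failures become exceptional rather than routine.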

How to Use Mercury 2

  1. Integrate Mercury 2 into your LLM pipeline where latency matters (e.g., agent loops, retrieval-augmented workflows, or extraction tasks).
  2. Choose a reasoning setting that fits your quality and response-time needs (the model supports tunable reasoning).
  3. Provide inputs within the 128K context window and, when needed, request schema-aligned JSON output for reliable parsing.
  4. Use tool calls for workflows that require external actions (e.g., search, database lookups, or other tool-backed steps), particularly in multi-step agent scenarios.
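The agent-loop pattern in steps 1–4 can be sketched as follows. Both the model step and the tool are stubs here — the real integration would call Mercury 2 (whose API details this page does not specify) and real tool backends — but the loop structure shows why per-call latency compounds: each job is several model calls, not one.

```python
# Stubbed "model" and "tool" illustrating a multi-step agent loop.
def model_step(state: list) -> dict:
    """Decide the next action from the conversation state.
    This stub looks up one fact, then finishes."""
    if not any(m["role"] == "tool" for m in state):
        return {"action": "tool", "tool": "lookup", "arg": "order 7"}
    return {"action": "final", "answer": "order 7 shipped"}

def lookup(arg: str) -> str:
    """Stand-in for a tool-backed step (search, database lookup, ...)."""
    return f"{arg} status: shipped"

def run_agent(question: str, max_steps: int = 5) -> str:
    state = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        decision = model_step(state)  # one inference call per iteration
        if decision["action"] == "final":
            return decision["answer"]
        state.append({"role": "tool", "content": lookup(decision["arg"])})
    raise RuntimeError("agent did not finish within max_steps")

answer = run_agent("Where is order 7?")
```

With a per-call latency of, say, 2 s, a five-step job takes 10 s end to end; cutting each call's latency directly changes how many steps are affordable inside an interactive budget.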

Use Cases

  • Coding and editing workflows: Autocomplete, next-edit suggestions, refactors, and interactive code agents where pauses can disrupt developer flow.
  • Agentic loop tasks: Systems that chain many inference calls per job (e.g., multi-step decision-making), where reducing per-call latency changes how many steps are affordable.
  • Real-time voice and interaction: Voice interfaces and interactive HCI scenarios with tight latency budgets, where faster reasoning helps keep speech-like interaction responsive.
  • Search and RAG pipelines: Multi-hop retrieval and summarization workflows where reasoning is added to the search loop without exceeding latency constraints.
  • Transcript cleanup and other iterative transformation tasks: Applications that need fast, consistent transformations and refinement behind user-facing interfaces.

FAQ

How does Mercury 2 differ from typical LLM decoding? Mercury 2 is described as diffusion-based and generating responses through parallel refinement rather than sequential, one-token-at-a-time autoregressive decoding.

What performance characteristics are stated for Mercury 2? The page reports >5x faster generation and 1,009 tokens/sec on NVIDIA Blackwell GPUs, along with guidance about optimizing for user-perceived responsiveness (including p95 latency under high concurrency).

What context length does Mercury 2 support? The page lists a 128K context window.

Can Mercury 2 produce structured outputs? Yes. It is described as supporting schema-aligned JSON output for structured responses.

Does Mercury 2 support tool use? The page states it has native tool use, intended for integrating tools into reasoning workflows.

Alternatives

  • Autoregressive reasoning LLMs: Traditional token-by-token LLMs may be simpler to integrate but typically generate sequentially, which can increase latency in multi-step loops.
  • Other diffusion- or non-autoregressive generation approaches: Alternative model architectures aimed at parallel generation may offer similar latency goals, though implementation details and output behavior can differ.
  • Smaller speed-optimized LLMs for interactive use: Models focused on low latency may trade off reasoning depth or controllability compared to a reasoning-tuned setup like Mercury 2.
  • Agent/RAG orchestration strategies that minimize calls: Instead of changing the model architecture, teams can reduce latency by restructuring workflows (e.g., fewer retrieval steps, caching, or batching), though this can limit how much reasoning is done per task.
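As one concrete instance of the call-minimizing orchestration mentioned in the last bullet, prompt-level caching avoids repeating identical inference calls. The model call below is a stub for illustration; any model, Mercury 2 included, could sit behind it.

```python
import functools

calls = 0  # counts how many times the "model" is actually invoked

def call_model(prompt: str) -> str:
    """Stub standing in for an LLM inference call."""
    global calls
    calls += 1
    return f"answer to: {prompt}"

@functools.lru_cache(maxsize=1024)
def cached_call(prompt: str) -> str:
    """Memoize exact-match prompts so repeats skip inference entirely."""
    return call_model(prompt)

cached_call("summarize doc A")
cached_call("summarize doc A")  # served from the cache; no second model call
```

Caching only helps for exact repeats (and stale entries must be managed), which is why it complements, rather than replaces, a lower-latency model in loops whose prompts change at every step.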