
MulmoChat

MulmoChat is a research prototype for multimodal AI chat on a canvas, combining conversational text with rich visual and interactive content, backed by a provider-agnostic text generation API and local image generation via ComfyUI.


What is MulmoChat?

MulmoChat is a research prototype for exploring multimodal AI chat experiences. Instead of limiting interactions to a text message stream, it aims to support conversational experiences that include rich visual and interactive content rendered directly on a canvas.

The core purpose is to demonstrate an architecture, design patterns, and UX principles for multimodal chat interfaces where visual experiences and language understanding work together within a single conversational flow.

Key Features

  • Multimodal chat on a canvas: Designed to combine conversation with visual, interactive content (for example, images and other rich visual elements) in the same user experience.
  • Research-oriented architecture & UX patterns: Includes documentation aimed at both product-oriented exploration and engineering implementation (e.g., LLM_OS.md and WHITEPAPER.md).
  • Provider-agnostic text generation API: Exposes a unified backend API that normalizes text generation responses across multiple LLM providers.
  • Text provider discovery endpoint: GET /api/text/providers returns configured providers (OpenAI, Anthropic, Google Gemini, Ollama), along with model suggestions and credential availability.
  • Unified text generation endpoint: POST /api/text/generate accepts a provider, model, and messages, returning a normalized text response regardless of vendor.
  • Local image generation integration via ComfyUI: Integrates with ComfyUI Desktop for local image generation using locally hosted models and workflows (e.g., FLUX), rather than only relying on cloud generation.
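The unified text endpoint above can be sketched from the client side. The endpoint path and parameter names (provider, model, messages) come from this description; the exact field names in the JSON schema are assumptions, not the repository's actual types.

```typescript
// Sketch of building a request body for POST /api/text/generate.
// The field names (provider, model, messages) follow the description
// above; treat the concrete shapes as illustrative assumptions.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

interface GenerateRequest {
  provider: "openai" | "anthropic" | "google" | "ollama";
  model: string;
  messages: ChatMessage[];
}

// Assemble the JSON body that would be POSTed to /api/text/generate.
function buildGenerateRequest(
  provider: GenerateRequest["provider"],
  model: string,
  messages: ChatMessage[]
): GenerateRequest {
  return { provider, model, messages };
}

// Example: a single-turn request routed to a local Ollama model.
const body = buildGenerateRequest("ollama", "llama3.2", [
  { role: "user", content: "Hello!" },
]);
console.log(JSON.stringify(body));
```

Because the response is normalized across vendors, a client built against this shape would not need provider-specific branches to read the generated text.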

How to Use MulmoChat

  1. Install dependencies: Run yarn install.
  2. Configure environment variables: Create a .env file with keys such as OPENAI_API_KEY and GEMINI_API_KEY (which keys are required depends on the features you enable), plus optional keys for map features (GOOGLE_MAP_API_KEY), AI-powered search (EXA_API_KEY), HTML generation (ANTHROPIC_API_KEY), and more.
  3. Start the development server: Run yarn dev.
  4. Use voice input (browser permission required): Click “Start Voice Chat”, allow microphone access when prompted, then speak to the AI.
  5. Test the unified text API (optional): With the dev server running, run the TypeScript scripts in server/tests/ to verify text generation against configured providers.

For local setups, the project supports Ollama (via OLLAMA_BASE_URL, defaulting to http://127.0.0.1:11434) and ComfyUI Desktop (via COMFYUI_BASE_URL, defaulting to http://127.0.0.1:8000).
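The fallback behavior for these base URLs can be illustrated with a small helper. The variable names (OLLAMA_BASE_URL, COMFYUI_BASE_URL) and defaults come from the text above; the helper itself is an illustrative sketch, not code from the repository, and it takes a plain record standing in for process.env to stay self-contained.

```typescript
// Resolve a local service base URL from an environment-style record,
// falling back to the documented default when the variable is unset.
function resolveBaseUrl(
  env: Record<string, string | undefined>,
  key: string,
  fallback: string
): string {
  const value = env[key];
  const url = value && value.trim().length > 0 ? value : fallback;
  // Drop trailing slashes so endpoint paths can be appended cleanly.
  return url.replace(/\/+$/, "");
}

// OLLAMA_BASE_URL is unset here, so the documented default applies.
const env = { COMFYUI_BASE_URL: "http://127.0.0.1:8000/" };
console.log(resolveBaseUrl(env, "OLLAMA_BASE_URL", "http://127.0.0.1:11434")); // http://127.0.0.1:11434
console.log(resolveBaseUrl(env, "COMFYUI_BASE_URL", "http://127.0.0.1:8000")); // http://127.0.0.1:8000
```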

Use Cases

  • Voice-first multimodal interaction prototypes: Use the voice chat flow to test how spoken user input can drive an AI experience that also produces generated visuals.
  • Exploring an AI-native “OS” mindset for product teams: Product strategists and designers can read the high-level paradigm documentation (LLM_OS.md) to frame interaction concepts beyond text-only chat.
  • Engineering or evaluating orchestration stacks: Developers and researchers can use the system diagrams and workflow detail in WHITEPAPER.md to understand and assess orchestration behavior for multimodal chat.
  • Extending chat capabilities with plugins: Developers can follow TOOLPLUGIN.md to implement extensions end-to-end, including TypeScript contracts and Vue views.
  • Local, controllable image generation in a chat loop: When image generation needs to run locally (model/workflow control), integrate with ComfyUI Desktop and use the local API to generate images.

FAQ

Q: What does “provider-agnostic” text generation mean in MulmoChat?
A: The project provides a unified API (POST /api/text/generate) that takes provider, model, and messages and returns a normalized text response across supported vendors.

Q: Which LLM providers does the unified text API support?
A: The repository text API documentation lists OpenAI, Anthropic, Google Gemini, and Ollama as supported providers (with provider availability depending on configured credentials).
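A client might consume the providers endpoint like this. The field names below are a hypothetical response shape inferred from the description (configured providers, model suggestions, credential availability), not the repository's actual schema.

```typescript
// Hypothetical shape for one entry in the GET /api/text/providers
// response. All field names are assumptions for illustration.
interface ProviderInfo {
  id: "openai" | "anthropic" | "google" | "ollama";
  models: string[];        // suggested model names for this provider
  hasCredentials: boolean; // whether the matching API key is configured
}

// Pick the first provider that is actually usable right now.
function firstUsable(providers: ProviderInfo[]): ProviderInfo | undefined {
  return providers.find((p) => p.hasCredentials && p.models.length > 0);
}

const providers: ProviderInfo[] = [
  { id: "openai", models: ["gpt-4o-mini"], hasCredentials: false },
  { id: "ollama", models: ["llama3.2"], hasCredentials: true },
];
console.log(firstUsable(providers)?.id); // ollama
```

A discovery endpoint like this lets the UI offer only providers whose credentials are actually present, instead of failing at generation time.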

Q: Do I need API keys for all providers?
A: No—features and provider availability depend on what you configure in your .env. Optional keys are noted for specific capabilities (e.g., maps, AI-powered search, HTML generation).

Q: How do I verify text generation works?
A: Run the provided scripts under server/tests/ (e.g., server/tests/test-text-openai.ts, test-text-anthropic.ts, etc.). These scripts report the selected model and normalized output, and log diagnostics on failure.

Q: How is local image generation handled?
A: MulmoChat integrates with ComfyUI Desktop via a local API server (configured through COMFYUI_BASE_URL). This supports local model/workflow usage rather than cloud-only generation.
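As a hedged sketch of that integration: ComfyUI's public HTTP API accepts workflow graphs at POST /prompt with a body of the form { prompt, client_id }. The helper and the tiny one-node workflow below are illustrative only and mirror ComfyUI's API, not MulmoChat's internal code.

```typescript
// Build the HTTP request a client might send to a local ComfyUI
// server's POST /prompt endpoint. The single-node workflow below is
// a placeholder, not a complete FLUX graph.
type ComfyNode = { class_type: string; inputs: Record<string, unknown> };
type ComfyWorkflow = Record<string, ComfyNode>;

function buildComfyPromptRequest(
  baseUrl: string,
  workflow: ComfyWorkflow,
  clientId: string
): { url: string; init: { method: string; headers: Record<string, string>; body: string } } {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/prompt`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prompt: workflow, client_id: clientId }),
    },
  };
}

const req = buildComfyPromptRequest(
  "http://127.0.0.1:8000",
  { "1": { class_type: "CheckpointLoaderSimple", inputs: { ckpt_name: "model.safetensors" } } },
  "mulmochat-demo"
);
console.log(req.url); // http://127.0.0.1:8000/prompt
```

Because the server runs locally, the workflow graph itself stays under your control, which is the point of the ComfyUI integration over cloud-only generation.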

Alternatives

  • Text-only chat applications: Traditional chat interfaces focus on message streams without canvas-based multimodal rendering, which simplifies implementation but doesn’t demonstrate multimodal interaction patterns.
  • General multimodal model clients (separate UI + model calls): Tools that combine images and chat typically require composing UI rendering and model calls yourself; MulmoChat focuses on a reference architecture and interaction principles.
  • Local image generation front-ends (ComfyUI or similar) without a chat UX layer: Running image workflows locally can be done outside of a conversational interface, but you won’t get the unified multimodal chat flow described here.
  • Agent frameworks with tool calling (without a specific multimodal canvas architecture): Agent tooling can orchestrate model actions and tools, but may not provide the same canvas-centered multimodal interaction patterns.