MulmoChat
MulmoChat is a research prototype for multimodal AI chat on a canvas, combining conversational text with rich visual and interactive content, backed by a provider-agnostic text generation API and local image generation via ComfyUI.
What is MulmoChat?
MulmoChat is a research prototype for exploring multimodal AI chat experiences. Instead of limiting interactions to a text message stream, it aims to support conversational experiences that include rich visual and interactive content rendered directly on a canvas.
The core purpose is to demonstrate an architecture, design patterns, and UX principles for multimodal chat interfaces where visual experiences and language understanding work together within a single conversational flow.
Key Features
- Multimodal chat on a canvas: Designed to combine conversation with visual, interactive content (for example, images and other rich visual elements) in the same user experience.
- Research-oriented architecture & UX patterns: Includes documentation aimed at both product-oriented exploration and engineering implementation (e.g., LLM_OS.md and WHITEPAPER.md).
- Provider-agnostic text generation API: Exposes a unified backend API that normalizes text generation responses across multiple LLM providers.
- Text provider discovery endpoint: GET /api/text/providers returns the configured providers (OpenAI, Anthropic, Google Gemini, Ollama), along with model suggestions and credential availability.
- Unified text generation endpoint: POST /api/text/generate accepts a provider, model, and messages, and returns a normalized text response regardless of vendor.
- Local image generation integration via ComfyUI: Integrates with ComfyUI Desktop for local image generation using locally hosted models and workflows (e.g., FLUX), rather than relying only on cloud generation.
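The two endpoints above can be sketched from the client side roughly as follows. This is a minimal sketch, not the project's actual contract: the type names, the role/content message shape, and the assumption that the normalized response carries a text field are all illustrative.

```typescript
// Sketch of the unified text API shapes described above. Only the endpoint
// path and the provider/model/messages fields come from this document; the
// rest is an assumption for illustration.
type Provider = "openai" | "anthropic" | "google" | "ollama";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface GenerateRequest {
  provider: Provider;
  model: string;
  messages: ChatMessage[];
}

// Build the body for POST /api/text/generate.
function buildGenerateRequest(
  provider: Provider,
  model: string,
  messages: ChatMessage[],
): GenerateRequest {
  return { provider, model, messages };
}

// Hypothetical client call; the { text } response field is assumed.
async function generateText(
  baseUrl: string,
  req: GenerateRequest,
): Promise<string> {
  const res = await fetch(`${baseUrl}/api/text/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  const data = (await res.json()) as { text: string };
  return data.text;
}

const body = buildGenerateRequest("ollama", "llama3", [
  { role: "user", content: "Hello" },
]);
console.log(JSON.stringify(body));
```

Because the request shape is the same for every provider, switching from a cloud vendor to a local Ollama model is a one-field change in the request body.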
How to Use MulmoChat
- Install dependencies: Run yarn install.
- Configure environment variables: Create a .env file with keys such as OPENAI_API_KEY and GEMINI_API_KEY (required depending on enabled features), plus optional keys for map features (GOOGLE_MAP_API_KEY), AI-powered search (EXA_API_KEY), HTML generation (ANTHROPIC_API_KEY), and more.
- Start the development server: Run yarn dev.
- Use voice input (browser permission required): When prompted, allow microphone access, then click “Start Voice Chat” and speak to the AI.
- Test the unified text API (optional): With the dev server running, run the TypeScript scripts in server/tests/ to verify text generation against configured providers.
For local setups, the project supports Ollama (via OLLAMA_BASE_URL, defaulting to http://127.0.0.1:11434) and ComfyUI Desktop (via COMFYUI_BASE_URL, defaulting to http://127.0.0.1:8000).
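Putting the variables above together, a .env for local development might look like the following sketch. The key names come from this document; the values are placeholders, and which keys you actually need depends on the features you enable.

```shell
# Required depending on enabled features
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=...

# Optional capabilities
GOOGLE_MAP_API_KEY=...      # map features
EXA_API_KEY=...             # AI-powered search
ANTHROPIC_API_KEY=...       # HTML generation

# Local backends (project defaults shown)
OLLAMA_BASE_URL=http://127.0.0.1:11434
COMFYUI_BASE_URL=http://127.0.0.1:8000
```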
Use Cases
- Voice-first multimodal interaction prototypes: Use the voice chat flow to test how spoken user input can drive an AI experience that also produces generated visuals.
- Experimenting with AI-native “OS” mindset for product teams: Product strategists and designers can read the high-level paradigm documentation (LLM_OS.md) to frame interaction concepts beyond text-only chat.
- Engineering or evaluating orchestration stacks: Developers and researchers can use the system diagrams and workflow detail in WHITEPAPER.md to understand and assess orchestration behavior for multimodal chat.
- Extending chat capabilities with plugins: Developers can follow TOOLPLUGIN.md to implement extensions end-to-end, including TypeScript contracts and Vue views.
- Local, controllable image generation in a chat loop: When image generation needs to run locally (model/workflow control), integrate with ComfyUI Desktop and use the local API to generate images.
FAQ
Q: What does “provider-agnostic” text generation mean in MulmoChat?
A: The project provides a unified API (POST /api/text/generate) that takes provider, model, and messages and returns a normalized text response across supported vendors.
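To illustrate what "normalized" means here: each vendor returns generated text in a different place in its payload, and a unified endpoint maps them all to one shape. The sketch below uses heavily simplified raw payloads (loosely modeled on each vendor's public API, not the project's actual code):

```typescript
// Simplified vendor payloads; real schemas have many more fields.
type RawResponse =
  | { provider: "openai"; choices: { message: { content: string } }[] }
  | { provider: "anthropic"; content: { text: string }[] }
  | { provider: "ollama"; message: { content: string } };

// Map any vendor's response to one normalized { text } shape.
function normalizeText(raw: RawResponse): { text: string } {
  switch (raw.provider) {
    case "openai":
      return { text: raw.choices[0].message.content };
    case "anthropic":
      return { text: raw.content[0].text };
    case "ollama":
      return { text: raw.message.content };
  }
}

console.log(
  normalizeText({ provider: "anthropic", content: [{ text: "Hello" }] }).text,
);
```

Callers of the unified endpoint only ever see the normalized shape, which is what makes swapping providers a configuration change rather than a code change.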
Q: Which LLM providers does the unified text API support?
A: The repository text API documentation lists OpenAI, Anthropic, Google Gemini, and Ollama as supported providers (with provider availability depending on configured credentials).
Q: Do I need API keys for all providers?
A: No—features and provider availability depend on what you configure in your .env. Optional keys are noted for specific capabilities (e.g., maps, AI-powered search, HTML generation).
Q: How do I verify text generation works?
A: Run the provided scripts under server/tests/ (e.g., server/tests/test-text-openai.ts, test-text-anthropic.ts, etc.). These scripts report the selected model and normalized output, and log diagnostics on failure.
Q: How is local image generation handled?
A: MulmoChat integrates with ComfyUI Desktop via a local API server (configured through COMFYUI_BASE_URL). This supports local model/workflow usage rather than cloud-only generation.
Alternatives
- Text-only chat applications: Traditional chat interfaces focus on message streams without canvas-based multimodal rendering, which simplifies implementation but doesn’t demonstrate multimodal interaction patterns.
- General multimodal model clients (separate UI + model calls): Tools that combine images and chat typically require composing UI rendering and model calls yourself; MulmoChat focuses on a reference architecture and interaction principles.
- Local image generation front-ends (ComfyUI or similar) without a chat UX layer: Running image workflows locally can be done outside of a conversational interface, but you won’t get the unified multimodal chat flow described here.
- Agent frameworks with tool calling (without a specific multimodal canvas architecture): Agent tooling can orchestrate model actions and tools, but may not provide the same canvas-centered multimodal interaction patterns.