MulmoChat
MulmoChat is a research prototype for multimodal AI chat on a canvas. It combines conversational text with rich visual and interactive content, and includes a provider-agnostic text generation API and local image generation through ComfyUI.
What is MulmoChat?
MulmoChat is a research prototype for exploring multimodal AI chat experiences. Instead of limiting interactions to a text message stream, it aims to support conversational experiences that include rich visual and interactive content rendered directly on a canvas.
The core purpose is to demonstrate an architecture, design patterns, and UX principles for multimodal chat interfaces where visual experiences and language understanding work together within a single conversational flow.
Key Features
- Multimodal chat on a canvas: Designed to combine conversation with visual, interactive content (for example, images and other rich visual elements) in the same user experience.
- Research-oriented architecture & UX patterns: Includes documentation aimed at both product-oriented exploration and engineering implementation (e.g., LLM_OS.md and WHITEPAPER.md).
- Provider-agnostic text generation API: Exposes a unified backend API that normalizes text generation responses across multiple LLM providers.
- Text provider discovery endpoint: GET /api/text/providers returns configured providers (OpenAI, Anthropic, Google Gemini, Ollama), along with model suggestions and credential availability.
- Unified text generation endpoint: POST /api/text/generate accepts a provider, model, and messages, returning a normalized text response regardless of vendor.
- Local image generation integration via ComfyUI: Integrates with ComfyUI Desktop for local image generation using locally hosted models and workflows (e.g., FLUX), rather than relying only on cloud generation.
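As a rough illustration of the unified endpoint described above, the following TypeScript sketch builds a POST /api/text/generate call. Only the endpoint path and the three top-level fields (provider, model, messages) come from the project description. The role/content message layout, the "openai" provider identifier, the model name, the base URL, and the helper name buildGenerateCall are all assumptions made for the example.

```typescript
// One chat turn. The role/content layout is an ASSUMPTION modeled on common
// chat APIs; the project description only says the endpoint accepts "messages".
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

interface GenerateRequest {
  provider: string; // e.g. "openai" (identifier format is assumed)
  model: string;
  messages: ChatMessage[];
}

interface HttpCall {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Hypothetical helper: turns a request object into a URL plus fetch options.
function buildGenerateCall(baseUrl: string, request: GenerateRequest): HttpCall {
  // Strip trailing slashes so the joined URL has exactly one separator.
  const root = baseUrl.replace(/\/+$/, "");
  return {
    url: `${root}/api/text/generate`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(request),
    },
  };
}

// Placeholder base URL and model name; substitute the address your dev server
// prints and a model listed by GET /api/text/providers.
const call = buildGenerateCall("http://localhost:3000/", {
  provider: "openai",
  model: "example-model",
  messages: [{ role: "user", content: "Describe a canvas-based chat UI." }],
});
```

Passing call.url and call.init to fetch would send the request. The fields of the normalized response are not specified here, so inspect the returned JSON before relying on any property.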
How to Use MulmoChat
- Install dependencies: Run yarn install.
- Configure environment variables: Create a .env file with keys such as OPENAI_API_KEY and GEMINI_API_KEY (whether each is required depends on the features you enable), plus optional keys for map features (GOOGLE_MAP_API_KEY), AI-powered search (EXA_API_KEY), HTML generation (ANTHROPIC_API_KEY), and more.
- Start the development server: Run yarn dev.
- Use voice input (browser permission required): When prompted, allow microphone access, then click “Start Voice Chat” and speak to the AI.
- Test the unified text API (optional): With the dev server running, run the TypeScript scripts in server/tests/ to verify text generation against configured providers.
For local setups, the project supports Ollama (via OLLAMA_BASE_URL, defaulting to http://127.0.0.1:11434) and ComfyUI Desktop (via COMFYUI_BASE_URL, defaulting to http://127.0.0.1:8000).
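The fallback behavior for the two local services can be sketched as follows. The variable names and default URLs are the ones stated above. The function name resolveLocalServices, the LocalServices shape, and the choice to treat blank values as unset are assumptions for illustration, not the project's actual configuration code.

```typescript
// Environment-like map: a variable may be absent (undefined).
type Env = Record<string, string | undefined>;

interface LocalServices {
  ollamaBaseUrl: string;
  comfyUiBaseUrl: string;
}

// Hypothetical helper: reads the two base URLs and applies the documented
// defaults when a variable is missing or blank.
function resolveLocalServices(env: Env): LocalServices {
  return {
    ollamaBaseUrl: env.OLLAMA_BASE_URL?.trim() || "http://127.0.0.1:11434",
    comfyUiBaseUrl: env.COMFYUI_BASE_URL?.trim() || "http://127.0.0.1:8000",
  };
}

// With nothing configured, both services fall back to their defaults.
const defaults = resolveLocalServices({});

// An explicit value overrides only the service it names.
const custom = resolveLocalServices({ OLLAMA_BASE_URL: "http://192.168.1.20:11434" });
```

In a Node.js process the same function could be called with process.env once the .env file has been loaded.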
Use Cases
- Voice-first multimodal interaction prototypes: Use the voice chat flow to test how spoken user input can drive an AI experience that also produces generated visuals.
- Experimenting with AI-native “OS” mindset for product teams: Product strategists and designers can read the high-level paradigm documentation (LLM_OS.md) to frame interaction concepts beyond text-only chat.
- Engineering or evaluating orchestration stacks: Developers and researchers can use the system diagrams and workflow detail in WHITEPAPER.md to understand and assess orchestration behavior for multimodal chat.
- Extending chat capabilities with plugins: Developers can follow TOOLPLUGIN.md to implement extensions end-to-end, including TypeScript contracts and Vue views.
- Local, controllable image generation in a chat loop: When image generation needs to run locally (model/workflow control), integrate with ComfyUI Desktop and use the local API to generate images.
FAQ
Q: What does “provider-agnostic” text generation mean in MulmoChat?
A: The project provides a unified API (POST /api/text/generate) that takes provider, model, and messages and returns a normalized text response across supported vendors.
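To make "normalized" concrete, here is a minimal TypeScript sketch of the general idea: each vendor nests generated text at a different path in its response, and a normalizer maps them all to one shape. The vendor types below are trimmed-down versions of the public response formats. The NormalizedText shape and the normalize function are hypothetical and are not taken from MulmoChat's code.

```typescript
// Trimmed-down vendor response shapes (only the fields that carry text).
type OpenAIResponse = { choices: { message: { content: string | null } }[] };
type AnthropicResponse = { content: { type: string; text?: string }[] };
type GeminiResponse = { candidates: { content: { parts: { text?: string }[] } }[] };
type OllamaResponse = { message: { content: string } };

type VendorResponse =
  | { provider: "openai"; raw: OpenAIResponse }
  | { provider: "anthropic"; raw: AnthropicResponse }
  | { provider: "gemini"; raw: GeminiResponse }
  | { provider: "ollama"; raw: OllamaResponse };

// Hypothetical normalized shape; the project's real fields may differ.
interface NormalizedText {
  provider: string;
  text: string;
}

function normalize(response: VendorResponse): NormalizedText {
  switch (response.provider) {
    case "openai":
      // Text lives on the first choice's message.
      return { provider: "openai", text: response.raw.choices[0]?.message.content ?? "" };
    case "anthropic":
      // Text is split across content blocks; keep only the text blocks.
      return {
        provider: "anthropic",
        text: response.raw.content
          .filter((block) => block.type === "text")
          .map((block) => block.text ?? "")
          .join(""),
      };
    case "gemini":
      // Text is split across the parts of the first candidate.
      return {
        provider: "gemini",
        text: (response.raw.candidates[0]?.content.parts ?? [])
          .map((part) => part.text ?? "")
          .join(""),
      };
    case "ollama":
      return { provider: "ollama", text: response.raw.message.content };
  }
}
```

A caller that only reads the text field never needs to know which vendor produced the reply, which is the practical benefit of a provider-agnostic endpoint.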
Q: Which LLM providers does the unified text API support?
A: The repository's text API documentation lists OpenAI, Anthropic, Google Gemini, and Ollama as supported providers. Which of them are available at runtime depends on the credentials you have configured.
Q: Do I need API keys for all providers?
A: No. Feature and provider availability depend on what you configure in your .env file. Optional keys are noted for specific capabilities (e.g., maps, AI-powered search, HTML generation).
Q: How do I verify text generation works?
A: Run the provided scripts under server/tests/ (e.g., server/tests/test-text-openai.ts, test-text-anthropic.ts, etc.). These scripts report the selected model and normalized output, and log diagnostics on failure.
Q: How is local image generation handled?
A: MulmoChat integrates with ComfyUI Desktop via a local API server (configured through COMFYUI_BASE_URL). This supports local model/workflow usage rather than cloud-only generation.
Alternatives
- Text-only chat applications: Traditional chat interfaces focus on message streams without canvas-based multimodal rendering, which simplifies implementation but doesn’t demonstrate multimodal interaction patterns.
- General multimodal model clients (separate UI + model calls): Tools that combine images and chat typically require composing UI rendering and model calls yourself; MulmoChat focuses on a reference architecture and interaction principles.
- Local image generation front-ends (ComfyUI or similar) without a chat UX layer: Running image workflows locally can be done outside of a conversational interface, but you won’t get the unified multimodal chat flow described here.
- Agent frameworks with tool calling (without a specific multimodal canvas architecture): Agent tooling can orchestrate model actions and tools, but may not provide the same canvas-centered multimodal interaction patterns.