Building an AI Agent That Shares a Canvas With Humans
What happens when you give an AI the same editing surface as your users?
Most AI integrations in collaborative tools feel bolted on. A sidebar chatbot that generates text. A modal that produces an image. An assistant that operates in its own sandbox, disconnected from the thing you're actually building.
I wanted to try something different. What if the AI agent wasn't next to the whiteboard -- what if it was on the whiteboard? Same document, same real-time sync, same conflict resolution. No special rendering path, no "AI-generated" badge, no refresh required. Just another collaborator that happens to understand natural language.
This is what I built with CollabBoard, and here's what I learned along the way.
The Problem With Most AI Tool Integrations
The standard pattern for adding AI to a collaborative app looks like this: user sends a prompt, AI generates a response, the app takes that response and inserts it into the document. It works, but it creates a second-class experience. The AI output appears differently than human output. Other collaborators might not see it until they refresh. Conflict resolution between AI-generated content and human edits is either nonexistent or fragile.
The root cause is architectural. The AI operates on a copy of the data, generates its output, and then the app tries to merge it back. This is the same problem distributed systems solved decades ago -- and the solution is the same too. Don't give the AI a copy. Give it access to the shared document.
Architecture: One Document, Many Writers
CollabBoard is a real-time collaborative whiteboard. Multiple users can create sticky notes, draw shapes, build frames, and drag objects around simultaneously. The board state lives in a Yjs shared document -- a CRDT (Conflict-free Replicated Data Type) that handles conflict resolution automatically.
The key architectural decision was simple: the AI agent writes to the same Yjs document as human users. When the agent creates a sticky note, it calls the same mutation path that fires when a human drags one from the toolbar. The WebSocket server broadcasts the change to every connected client. No special handling, no reconciliation step, no eventual consistency lag.
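To make the idea concrete, here's a minimal sketch of a single shared mutation path. The names (`BoardObject`, `createStickyNote`) are illustrative, and a plain `Map` stands in for the Yjs `Y.Map` — the point is that the toolbar handler and the agent's tool executor call the exact same function:

```typescript
// Sketch: one mutation path shared by humans and the agent.
// A plain Map stands in for the shared Yjs Y.Map; names are illustrative.
type BoardObject = {
  id: string;
  type: "sticky" | "shape" | "frame";
  x: number;
  y: number;
  text?: string;
};

// Stand-in for the shared document's object map.
const boardObjects = new Map<string, BoardObject>();

let nextId = 0;

// The one and only way a sticky note gets created. The toolbar drag
// handler and the AI agent's tool executor both call this function,
// so every write flows through the same sync and broadcast path.
function createStickyNote(x: number, y: number, text: string): string {
  const id = `obj-${nextId++}`;
  boardObjects.set(id, { id, type: "sticky", x, y, text });
  return id;
}

// Human path: drag from the toolbar.
const humanNote = createStickyNote(100, 100, "Ship it");
// Agent path: a tool call resolves to the same function.
const agentNote = createStickyNote(300, 100, "Write tests");
```

Because there is only one write path, there is nothing to reconcile afterwards — the CRDT layer treats both writers identically.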
This means a user in New York can watch an AI command from a user in London execute in real time -- objects appearing on the canvas exactly as if someone placed them by hand.
Two Brains Are Better Than One
Early on I noticed something obvious in retrospect: most AI commands to a whiteboard are predictable. "Create a kanban board" should always produce three columns labeled To Do, In Progress, and Done. "Build a SWOT analysis" should always produce a 2x2 grid. Sending these to an LLM every time is wasteful -- you're paying for tokens to get the same answer you could have computed for free.
So the agent has two execution paths.
The deterministic planner is the first gate every command hits. It uses pattern matching to recognize structured intents -- templates, bulk operations, layout commands. When you say "create a SWOT analysis," the planner generates all 16 tool calls (four frames, four header stickies, eight content stickies) from predefined template definitions. No API call. Sub-second execution. Zero cost.
The planner currently handles Kanban boards, SWOT analyses, Retrospectives, Lean Canvases, Roadmaps, Eisenhower Matrices, Mind Maps, bulk sticky note generation (up to 5,000), grid arrangements, and color-based object grouping.
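The planner's first gate can be sketched in a few lines. The template contents and layout coordinates here are illustrative, but the shape is the real one: match the command, expand a predefined template into tool calls, and return null when nothing matches so the LLM fallback takes over:

```typescript
// Sketch of the deterministic planner: pattern-match the command and
// expand a predefined template into tool calls, with no LLM round trip.
// Template contents and coordinates are illustrative.
type ToolCall = { tool: string; args: Record<string, unknown> };

const SWOT_QUADRANTS = ["Strengths", "Weaknesses", "Opportunities", "Threats"];

function planSwot(): ToolCall[] {
  const calls: ToolCall[] = [];
  SWOT_QUADRANTS.forEach((label, i) => {
    const x = (i % 2) * 420;            // 2x2 grid of quadrants
    const y = Math.floor(i / 2) * 320;
    calls.push({ tool: "createFrame", args: { x, y, width: 400, height: 300 } });
    calls.push({ tool: "createStickyNote", args: { x: x + 10, y: y + 10, text: label } });
    // Two content stickies per quadrant: 4 frames + 4 headers + 8 = 16 calls.
    calls.push({ tool: "createStickyNote", args: { x: x + 10, y: y + 80, text: "" } });
    calls.push({ tool: "createStickyNote", args: { x: x + 150, y: y + 80, text: "" } });
  });
  return calls;
}

// Returns null when no pattern matches, signalling the LLM fallback.
function plan(command: string): ToolCall[] | null {
  if (/\bswot\b/i.test(command)) return planSwot();
  return null;
}
```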
The LLM fallback catches everything else. When the planner returns null, the command is sent to GPT-4o-mini with function calling. The model receives a system prompt describing the board context and ten tool definitions, then responds with structured tool calls. For commands that need to reference existing objects ("change the color of that note"), the model first reads the board state, then issues mutations in a second pass.
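For readers unfamiliar with function calling, here's roughly what one of those tool definitions looks like in OpenAI's schema. The exact parameter set and color palette are illustrative, not CollabBoard's actual definition:

```typescript
// One tool definition in OpenAI's function-calling schema. The model can
// only respond with calls that validate against this shape. Parameter
// names and the color enum are illustrative.
const createStickyNoteTool = {
  type: "function" as const,
  function: {
    name: "createStickyNote",
    description: "Create a sticky note on the board at the given position.",
    parameters: {
      type: "object",
      properties: {
        x: { type: "number", description: "X coordinate on the canvas" },
        y: { type: "number", description: "Y coordinate on the canvas" },
        text: { type: "string", description: "Note contents" },
        color: {
          type: "string",
          enum: ["yellow", "pink", "blue", "green"],
          description: "Sticky note color",
        },
      },
      required: ["x", "y", "text"],
    },
  },
};
```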
In practice, the planner intercepts the majority of complex commands. The LLM handles the long tail -- the novel, one-off requests that don't fit a pattern. This split keeps costs low and latency predictable while still handling arbitrary natural language.
The Tool Calling Contract
The AI agent operates through ten tools, following OpenAI's function-calling schema:
Two are read-only. getBoardState returns a scoped view of the board -- never more than 50 objects, with a summary line showing total count versus returned count. This scoping is critical. A board with 500 objects would blow out any context window if sent in full. findObjects searches by type, color, text content, or spatial proximity.
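The scoping logic itself is tiny. A sketch, with a hypothetical summary format:

```typescript
// Sketch of getBoardState's 50-object cap. The summary line tells the
// model when it is seeing a partial view. Summary wording is illustrative.
type BoardObj = { id: string; type: string; text?: string };

const MAX_OBJECTS = 50;

function getBoardState(all: BoardObj[]): { summary: string; objects: BoardObj[] } {
  const objects = all.slice(0, MAX_OBJECTS);
  return {
    summary: `Showing ${objects.length} of ${all.length} objects`,
    objects,
  };
}
```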
Eight are mutations. createStickyNote, createShape, createFrame, and createConnector add new objects. moveObject, resizeObject, updateText, and changeColor modify existing ones.
Every tool has typed parameters. The agent can't hallucinate an invalid action -- it either produces a valid tool call or it doesn't produce one at all. This is the advantage of function calling over free-form text generation. The failure mode is "the agent didn't do anything" rather than "the agent did something unpredictable."
Mutations are executed sequentially through an HTTP bridge to the WebSocket server. The bridge applies each tool call to the live Yjs document and returns the affected object IDs. Bulk operations (ten or more mutations) are batched into a single request to reduce round trips.
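The batching rule is a one-liner in spirit. A sketch, with illustrative request shapes:

```typescript
// Sketch of the bridge batching rule: ten or more mutations go out as a
// single request; smaller sets are sent one at a time. Shapes are illustrative.
type Mutation = { tool: string; args: Record<string, unknown> };
type BridgeRequest = { mutations: Mutation[] };

const BATCH_THRESHOLD = 10;

function toBridgeRequests(mutations: Mutation[]): BridgeRequest[] {
  if (mutations.length >= BATCH_THRESHOLD) {
    return [{ mutations }]; // one round trip for the whole bulk operation
  }
  return mutations.map((m) => ({ mutations: [m] }));
}
```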
When Things Go Wrong: Resilience Patterns
Building an AI agent that mutates shared state means you need to think carefully about failure modes.
Stale object references. The agent reads board state, then issues a mutation referencing an object ID. But between the read and the write, another user might have deleted that object. The route layer detects this class of error and retries once with refreshed state.
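The retry-once pattern looks something like this. The read and mutate functions are injected stand-ins, and detecting staleness by error message is illustrative — the real route layer would classify the error however its bridge reports it:

```typescript
// Sketch of retry-once-with-refreshed-state. Reads and mutations are
// injected; staleness detection by error message is illustrative.
type Snapshot = { ids: Set<string> };

function mutateWithRetry(
  readState: () => Snapshot,
  mutate: (state: Snapshot, targetId: string) => void,
  targetId: string,
): void {
  try {
    mutate(readState(), targetId);
  } catch (err) {
    if (err instanceof Error && err.message === "stale-object") {
      // Another user may have deleted the object between the read and the
      // write: refresh the state once and try again.
      mutate(readState(), targetId);
      return;
    }
    throw err; // anything else is a real failure
  }
}
```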
Layout verification. For complex layouts like Kanban boards, the planner generates move/resize steps to arrange objects. After execution, a verification pass checks whether the layout actually matches the intent. If objects drifted (due to concurrent edits or rounding), corrective steps are issued automatically.
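A verification pass can be as simple as diffing intended positions against actual ones and emitting moves for anything outside a tolerance. Names and the tolerance value here are illustrative:

```typescript
// Sketch of the layout verification pass: compare actual positions with
// the intended layout and emit corrective moveObject steps for drift.
type Pos = { x: number; y: number };
type MoveStep = { tool: "moveObject"; args: { id: string; x: number; y: number } };

const TOLERANCE = 2; // pixels of drift to ignore (e.g. rounding)

function verifyLayout(
  intended: Record<string, Pos>,
  actual: Record<string, Pos>,
): MoveStep[] {
  const fixes: MoveStep[] = [];
  for (const [id, want] of Object.entries(intended)) {
    const got = actual[id];
    if (!got) continue; // object deleted concurrently; nothing to correct
    if (Math.abs(got.x - want.x) > TOLERANCE || Math.abs(got.y - want.y) > TOLERANCE) {
      fixes.push({ tool: "moveObject", args: { id, x: want.x, y: want.y } });
    }
  }
  return fixes;
}
```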
Bridge unavailability. The primary execution path goes through the WebSocket server's HTTP bridge. If the server is unreachable (network issues, deployment gaps), an inline fallback loads the board's Yjs snapshot from Supabase, applies mutations locally, and saves the updated snapshot back. The user's changes are preserved even if the real-time server is down.
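The fallback chain reduces to a try/catch around the bridge call, with snapshot load/apply/save as the recovery path. All four functions here are injected stand-ins, not the real Supabase or bridge clients:

```typescript
// Sketch of the execution fallback: try the HTTP bridge first; if it is
// unreachable, load the stored snapshot, apply mutations locally, and
// save the result. All functions are illustrative stand-ins.
type Result = { via: "bridge" | "snapshot" };

function executeMutations(
  sendToBridge: () => Result,
  loadSnapshot: () => unknown,
  applyLocally: (snapshot: unknown) => void,
  saveSnapshot: (snapshot: unknown) => void,
): Result {
  try {
    return sendToBridge();
  } catch {
    // Bridge unreachable: fall back to the stored snapshot so the user's
    // changes survive even while the real-time server is down.
    const snapshot = loadSnapshot();
    applyLocally(snapshot);
    saveSnapshot(snapshot);
    return { via: "snapshot" };
  }
}
```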
API key validation. Before any LLM call, the route validates that the OpenAI API key is present and properly formatted. If it's misconfigured, the user gets a specific error message instead of a generic 500.
None of these patterns are glamorous. But they're the difference between a demo that works on a good day and a product that works on every day.
Real-Time Collaboration: The Foundation
The AI agent is compelling, but it's built on top of something more fundamental: bulletproof multiplayer sync.
Board objects live in a Yjs Y.Map, synchronized via WebSocket. Yjs is a CRDT implementation -- merging edits is commutative, associative, and idempotent, so two users can edit the same object simultaneously and the result converges automatically. There's no locking, no last-write-wins, no manual merge resolution.
Cursor positions are broadcast separately via Socket.io at 20-30Hz. They're ephemeral -- not persisted, not conflict-resolved, just fire-and-forget position updates. This keeps cursor movement feeling instantaneous without polluting the document state.
Persistence works through debounced snapshotting. Every 500 milliseconds (or when the last user disconnects), the server serializes the Yjs document and writes it to Supabase Postgres as a binary snapshot. When a user loads a board, the snapshot is fetched and applied, then the WebSocket connection picks up any changes that happened since the snapshot was taken.
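The persistence logic can be sketched as a small scheduler. To keep the sketch deterministic, the 500 ms timer is modeled as an explicit tick() call and serialization is a stand-in — the real server would serialize the Yjs document and write the bytes to Postgres:

```typescript
// Sketch of debounced snapshotting: writes mark the document dirty, a
// snapshot is persisted at most once per timer tick, and the last user
// disconnecting flushes immediately. Timer and persistence are stand-ins.
class SnapshotScheduler {
  private dirty = false;
  saves = 0;

  constructor(private persist: () => void) {}

  onDocumentUpdate(): void {
    this.dirty = true;
  }

  // In the real server, a 500 ms timer calls this.
  tick(): void {
    if (this.dirty) {
      this.persist();
      this.saves++;
      this.dirty = false;
    }
  }

  onLastUserDisconnect(): void {
    // Flush immediately so nothing is lost between ticks.
    this.tick();
  }
}
```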
This architecture means the board survives every failure mode: all users leave and come back, the server restarts, the database fails over. The Yjs document is the source of truth, and it's reconstructable from any snapshot plus the update log.
Tracing: You Can't Optimize What You Can't Measure
Every AI command is traced. Not just "did it succeed or fail," but the full execution story:
- Which path did it take? Deterministic planner or LLM?
- If LLM: how many tokens were consumed? What was the latency? What was the estimated cost?
- Were there retries? Verification corrections? Fallback activations?
- What tool calls were generated, and in what order?
The tracing system supports Langfuse and LangSmith integration for production-grade observability. But even without those services configured, every command logs its execution path and cost to the console.
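A per-command trace record might look like the following. Field names are illustrative; in production the same data would be forwarded to Langfuse or LangSmith:

```typescript
// Sketch of a per-command trace record. Field names are illustrative.
type CommandTrace = {
  command: string;
  path: "planner" | "llm";
  toolCalls: string[];
  latencyMs: number;
  tokens: number;     // 0 on the planner path
  estCostUsd: number; // 0 on the planner path
  retries: number;
};

// Planner-path commands never touch the LLM, so tokens and cost are zero.
function tracePlannerCommand(
  command: string,
  toolCalls: string[],
  latencyMs: number,
): CommandTrace {
  return { command, path: "planner", toolCalls, latencyMs, tokens: 0, estCostUsd: 0, retries: 0 };
}
```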
This matters more than it might seem. When you're paying per-token for AI features, understanding your cost distribution is essential. The deterministic planner handles a SWOT analysis in 200ms with zero tokens. The same request through the LLM takes 1-2 seconds and costs a fraction of a cent. Multiply that across thousands of users and the planner pays for itself immediately.
The Design System: Neo-Brutalism
A quick note on the visual design, because it's part of the product experience. CollabBoard uses a neo-brutalism aesthetic: thick black borders, hard offset box-shadows that shift on hover, bold Space Grotesk typography, and a warm off-white background. Interactive elements have a satisfying "pressed" feel -- hover reduces the shadow and translates the element, active removes the shadow entirely.
This wasn't just a style choice. Neo-brutalism's high contrast and bold borders make UI elements unambiguous at a glance, which matters when you're scanning a dashboard of boards or navigating a dense canvas. The design system is implemented through CSS custom properties and utility classes, so it's consistent across every component without ceremony.
What I'd Do Differently
Generate Supabase types. The biggest production bug -- a TypeScript error that blocked Vercel deployments for multiple commits -- was caused by untyped Supabase client calls. Generated types from supabase gen types would have caught this at development time instead of in the CI pipeline.
Multi-step agent reasoning. The current agent is single-step: one command produces one set of tool calls. A multi-step agent that can reason across turns ("create a kanban board, then move all the red notes to the Done column, then add a frame around the whole thing") would be significantly more powerful. The architecture supports it -- the planner already chains verification steps -- but the LLM path doesn't.
Smarter scoping for getBoardState. The current 50-object cap is a blunt instrument. Spatial scoping (only return objects near a given coordinate) or semantic scoping (only return objects matching a query) would give the agent better context without hitting token limits.
The Takeaway
The most interesting thing about building an AI agent for a collaborative tool isn't the AI part. It's the integration part. The agent is only as good as its connection to the shared state. If it operates in a silo, it's a gimmick. If it participates in the same system as humans -- same data layer, same real-time sync, same conflict resolution -- it becomes a genuinely useful collaborator.
The technical ingredients (CRDTs, function calling, deterministic planning, tracing) aren't individually novel. But composed together, they create something that feels different from the typical AI integration. The AI doesn't feel like a feature. It feels like a participant.
That's the bar I think we should be aiming for.