Layercode architecture diagram

Layercode is a real-time voice agent orchestration layer built on Cloudflare Workers. It handles the entire audio transport so you can ship production-grade voice AI agents without managing WebRTC, browser audio, or speech infrastructure yourself. From your perspective as a developer, Layercode is essentially text in / text out:
  1. Layercode captures the caller’s audio, runs speech-to-text (STT), and sends the transcribed text to your backend webhook.
  2. Your backend decides what to do — calling an LLM, tools, or business logic — and responds with the text you want the user to hear.
  3. Layercode turns that text into speech (TTS) and streams it back to the user in real time.

Authentication and Session Model

Layercode routes every client through an authorize → WebSocket handshake so you can govern sessions centrally.

Client Authentication Flow

  1. Your frontend calls your backend (e.g., /api/authorize) with user context.
  2. The backend requests POST /v1/agents/web/authorize_session with agent_id and the org-scoped API key.
  3. Layercode returns a time-bounded client_session_key plus the conversation_id.
  4. The frontend connects to /v1/agents/web/websocket?client_session_key=... using the Layercode SDK.
See the REST API reference and Frontend WebSocket docs for field-level details.
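The flow above can be sketched as a small backend helper. This is illustrative only: the base URL, the Bearer authorization header, and the exact response shape are assumptions here; consult the REST API reference for the real contract.

```javascript
// Sketch of the backend side of step 2: exchange the org-scoped API key
// for a short-lived client_session_key. Base URL and auth header are
// assumptions for illustration.
async function authorizeClientSession(agentId, apiKey, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl("https://api.layercode.com/v1/agents/web/authorize_session", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ agent_id: agentId }),
  });
  if (!response.ok) {
    throw new Error(`authorize_session failed: ${response.status}`);
  }
  // Forward only the short-lived key and conversation id to the browser;
  // the org-scoped API key never leaves your backend.
  const { client_session_key, conversation_id } = await response.json();
  return { client_session_key, conversation_id };
}
```

The frontend then passes the returned client_session_key to the Layercode SDK, which opens the WebSocket connection (step 4).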

Agent Webhook Flow

  1. Layercode sends signed POST requests (HMAC via layercode-signature) to your webhook.
  2. Verify requests with verifySignature from @layercode/node-server-sdk using LAYERCODE_WEBHOOK_SECRET.
  3. Handle events such as session.start, message, session.update, and session.end. The message event includes the transcription and conversation identifiers.
  4. Respond by calling streamResponse(payload, handler) and emitting stream.tts(), stream.data(), or tool call results. Always call stream.end() even for silent turns.
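In practice you should rely on verifySignature from the SDK for step 2; conceptually it performs a timing-safe HMAC comparison over the raw request body. A rough sketch in plain Node, assuming an HMAC-SHA256 hex digest (the exact scheme is defined by the SDK, not here):

```javascript
import { createHmac, timingSafeEqual } from "node:crypto";

// Illustrative only: @layercode/node-server-sdk's verifySignature does this
// for you. The hex HMAC-SHA256 scheme is an assumption for this sketch.
function verifyWebhookSignature(rawBody, signatureHeader, secret) {
  const expected = createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signatureHeader);
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Note that verification must run over the raw request bytes, so make sure your body parser preserves them before JSON parsing.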
Minimal example of sending a welcome message to users:
import express from "express";
import { streamResponse } from "@layercode/node-server-sdk";

const app = express();
app.use(express.json());

// Webhook endpoint that Layercode calls at the start of each turn.
app.post("/agent", async (req, res) => {
  return streamResponse(req.body, async ({ stream }) => {
    stream.tts("Hi, how can I help you today?"); // queue text for TTS playback
    stream.end(); // always close the turn, even for silent responses
  });
});

Receiving messages from the client (user)

Every Layercode webhook request includes the transcribed user utterance so your backend never has to handle raw audio. A typical payload contains:
{
  "type": "message",
  "session_id": "sess_123",
  "conversation_id": "conv_456",
  "text": "What is our return policy?"
}
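Before handing the text to your model, it is worth validating the payload shape shown above. The helper below is hypothetical (not part of the Layercode SDK), just a sketch of the field checks a handler might do:

```javascript
// Hypothetical helper: extract the fields a handler needs from a webhook
// payload, ignoring non-message events and rejecting malformed ones.
function parseMessageEvent(payload) {
  if (payload?.type !== "message") return null; // e.g. session.start, session.end
  const { session_id, conversation_id, text } = payload;
  if (typeof text !== "string" || text.length === 0) {
    throw new Error("message event is missing transcribed text");
  }
  return { sessionId: session_id, conversationId: conversation_id, text };
}
```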

Generating LLM responses and replying

Once you have a response string (or stream) from your model, send it back through the stream helper. You can optionally stream interim data to the UI while you wait on the final text.
import { streamText } from "ai";
import { google } from "@ai-sdk/google";
import { streamResponse } from "@layercode/node-server-sdk";

app.post("/agent", async (req, res) => {
  const { type, text } = req.body;

  return streamResponse(req.body, async ({ stream }) => {
    // Only user messages need an LLM call; close other events silently.
    if (type !== "message") {
      stream.end();
      return;
    }

    const { textStream } = await streamText({
      model: google("gemini-2.0-flash-001"),
      messages: [
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", content: text },
      ],
    });

    // Pipe the model's token stream straight into TTS as tokens arrive.
    await stream.ttsTextStream(textStream);
    stream.end();
  });
});
That’s the full loop: Layercode gives you user text, you return assistant text. Layercode handles buffering, chunking, and converting that text back into speech for the caller.

Summary: what Layercode does and doesn’t do

What Layercode does

  • Connects browsers, mobile apps, or telephony clients to a single real-time voice pipeline.
  • Streams user audio, performs STT (Deepgram today, more providers coming), and delivers plain text to your webhook in milliseconds.
  • Accepts your text responses and converts them into low-latency speech using ElevenLabs, Cartesia, or Rime—bring your own keys or use Layercode-managed ones.
  • Manages turn taking (auto VAD or push-to-talk), jitter buffering, and session lifecycle so conversations feel natural.
  • Provides dashboards for observability, session recording, latency analytics, and agent configuration without redeploys.

What Layercode doesn’t do

  • Host your web app or backend logic — you run your own servers and own your customer state.
  • Provide the LLM or agent brain—you choose the model, prompts, and tool integrations. Layercode only transports text to and from your system.
  • Guarantee tool execution or business workflows — that remains inside your infrastructure; Layercode just keeps the audio loop in sync.
  • Support real-time speech-to-speech models; these are not currently available, and text remains the interface between Layercode and your backend.