How SSE Streaming Works in Chatbots

June 2026 · Published by Amar Kumar

When ChatGPT-style UIs print text word by word, it looks like a typing animation. It is not. The browser is receiving real tokens from your server as the LLM generates them — usually over HTTP streaming, often formatted as Server-Sent Events (SSE).

I used to assume the chatbot waited for the full answer, then faked a typewriter effect in JavaScript. Once you see a real stream implementation, that mental model breaks: the first word can appear in under a second while the model is still generating the rest of the answer.

Who is this for? Developers building chat UIs, RAG bots, or any page that needs live server updates, and anyone who has confused streaming with CSS typing effects.

The typing effect misconception

There are two completely different UX patterns that look identical to users:

| | Fake typing | Real SSE / HTTP stream |
|---|---|---|
| Server | Returns full JSON answer in one shot | Keeps connection open; sends chunks as generated |
| Browser | setInterval reveals characters | Appends each token from the network |
| First visible text | After entire LLM call finishes (e.g. 8s) | Often <500ms after send (time-to-first-token) |
| Cancel button | Cosmetic: work already done | Aborts provider stream; saves cost |
| User perception | "Why is it thinking so long then typing?" | "It's answering while it thinks" |

Figure: illustrative timeline for an 8s total generation. SSE shows text at ~0.4s; fake typing starts at ~8s.

If your chatbot feels sluggish until the "typing" begins, you are probably waiting for completion before animating. Fix the pipeline, not the animation.

What is Server-Sent Events (SSE)?

SSE is a web standard (part of the HTML specification) for server → client push over a single long-lived HTTP response served as Content-Type: text/event-stream.

// Raw SSE over the wire
data: {"type":"token","content":"The"}

data: {"type":"token","content":" answer"}

data: {"type":"token","content":" is"}

event: done
data: {"finish_reason":"stop"}

Each blank line separates events. The browser's stream parser fires as soon as a complete event arrives — no waiting for the full response body.
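To make the framing rules concrete, here is a minimal sketch of a server-side encoder for the wire format shown above. The function name `format_sse` is an assumption made for illustration, not part of any library. It writes an optional `event:` line, one `data:` line per line of payload, and the blank line that ends the event.

```python
import json
from typing import Any, Optional


def format_sse(data: Any, event: Optional[str] = None) -> str:
    """Serialize one SSE event (hypothetical helper, for illustration)."""
    # Strings are sent as-is; anything else is JSON-encoded compactly.
    payload = data if isinstance(data, str) else json.dumps(data, separators=(",", ":"))
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # A payload containing newlines must be split across several data: lines,
    # otherwise the embedded blank line would end the event early.
    for part in payload.split("\n"):
        lines.append(f"data: {part}")
    # The trailing blank line tells the client the event is complete.
    return "\n".join(lines) + "\n\n"
```

Calling `format_sse({"finish_reason": "stop"}, event="done")` produces the last event in the example above. JSON payloads never contain raw newlines, so the multi-line branch only matters if you send plain text.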

SSE vs WebSockets vs polling

| Method | Direction | Protocol | Best for | Chat fit |
|---|---|---|---|---|
| SSE | Server → client | HTTP | LLM tokens, progress bars, logs | Excellent |
| WebSocket | Bidirectional | WS upgrade | Games, collab editors, voice | Overkill for simple chat |
| Long polling | Simulated push | HTTP | Legacy browsers | Higher latency |
| Short polling | Client pulls | HTTP | Simple dashboards | Poor for token streams |

Figure: relative scores for chat use cases (higher is better for latency and simplicity).

Most RAG chatbots use POST + streaming response. WebSockets shine when the same socket carries dozens of event types both ways — not required for "user asks → assistant streams answer."

End-to-end flow in a chatbot

Browser (POST /chat) → API (RAG retrieve) → LLM (stream=True) → SSE (data: tokens) → UI appends text

| Step | Who | What happens |
|---|---|---|
| 1 | Browser | POST { message, history } to /api/chat/stream |
| 2 | Your API | Embed query, search vector DB, optional rerank (not streamed) |
| 3 | Your API → LLM | stream: true completion with context + history |
| 4 | LLM → API | Token deltas arrive over provider's HTTP stream |
| 5 | API → Browser | Each delta forwarded as data: {...}\n\n |
| 6 | Browser | Append to message bubble; on done, re-enable input |

How LLM providers stream tokens

Providers send token deltas (subword pieces), not whole words. Your server normalizes them into one text stream for the UI.

OpenAI / compatible APIs

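from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def stream_openai(messages):
    """Yield text deltas from a streaming chat completion."""
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:  # some chunks (e.g. usage-only) carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta  # caller wraps each delta in an SSE data: line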

Wire format: SSE. Each chunk arrives as a data: line carrying a JSON object, and the stream ends with data: [DONE].

Anthropic Claude

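import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def stream_claude(messages):
    """Yield text fragments from a streaming Messages API call."""
    with client.messages.stream(
        model="claude-haiku-4-5",
        messages=messages,
        max_tokens=1024,
    ) as s:
        for text in s.text_stream:
            yield text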

Events include content_block_delta with text fragments.

Google Gemini

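from google import genai

client = genai.Client()  # reads the API key from the environment

def stream_gemini(prompt):
    for chunk in client.models.generate_content_stream(
        model="gemini-2.5-flash-lite",
        contents=prompt,
    ):
        if chunk.text:  # skip chunks that carry no text
            yield chunk.text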

Your abstraction layer should emit uniform SSE JSON ({"type":"token","content":"..."}) so the frontend works with any provider.
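One way to sketch that abstraction layer is a small table of per-provider text extractors feeding a single event generator. The names `EXTRACTORS`, `uniform_events`, and `fake_openai_chunk` are assumptions made for illustration. The chunk shapes follow the snippets above (`chunk.choices[0].delta.content`, plain strings from `text_stream`, `chunk.text`), and the trailing `{"type":"done"}` event is one possible convention, not a requirement.

```python
import json
from types import SimpleNamespace
from typing import Any, Callable, Dict, Iterable, Iterator, Optional


def openai_text(chunk: Any) -> Optional[str]:
    # OpenAI-style chunk: text is at choices[0].delta.content and may be None.
    if not chunk.choices:
        return None
    return chunk.choices[0].delta.content


def anthropic_text(chunk: Any) -> Optional[str]:
    # Anthropic's text_stream already yields plain strings.
    return chunk


def gemini_text(chunk: Any) -> Optional[str]:
    # Gemini-style chunk: text is exposed as chunk.text.
    return chunk.text


EXTRACTORS: Dict[str, Callable[[Any], Optional[str]]] = {
    "openai": openai_text,
    "anthropic": anthropic_text,
    "gemini": gemini_text,
}


def uniform_events(provider: str, chunks: Iterable[Any]) -> Iterator[str]:
    """Turn any provider's chunk stream into the same SSE JSON frames."""
    extract = EXTRACTORS[provider]
    for chunk in chunks:
        text = extract(chunk)
        if text:  # drop empty or None deltas
            yield "data: " + json.dumps({"type": "token", "content": text}) + "\n\n"
    yield "data: " + json.dumps({"type": "done"}) + "\n\n"


def fake_openai_chunk(text: Optional[str]) -> SimpleNamespace:
    """Test double shaped like an OpenAI streaming chunk (illustration only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])
```

Because the frontend only ever sees `type` and `content`, swapping providers becomes a server-side change. The test double lets you exercise the generator without an API key.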

Build an SSE streaming API

FastAPI (Python)

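import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    message: str

# retrieve_chunks() and stream_llm() are your own helpers (the RAG lookup and
# a wrapper around one of the provider streams above); they are not defined here.

def token_generator(user_message: str):
    # A plain (sync) generator: Starlette iterates it in a worker thread,
    # so a blocking LLM client does not stall the event loop.
    context = retrieve_chunks(user_message)  # RAG, not streamed

    for token in stream_llm(user_message, context):
        payload = json.dumps({"type": "token", "content": token})
        yield f"data: {payload}\n\n"

    yield f"data: {json.dumps({'type': 'done'})}\n\n"

@app.post("/api/chat/stream")
async def chat_stream(body: ChatRequest):
    return StreamingResponse(
        token_generator(body.message),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "X-Accel-Buffering": "no",  # tells nginx not to buffer this response
        },
    )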

Node / Express

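app.post("/api/chat/stream", async (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");
  res.flushHeaders();

  // Same event shape as the FastAPI version, so one frontend parser fits both.
  const send = (payload) => res.write(`data: ${JSON.stringify(payload)}\n\n`);

  // Track disconnects so we stop reading (and paying for) unread tokens.
  let clientGone = false;
  res.on("close", () => { clientGone = true; });

  const stream = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: req.body.messages,
    stream: true,
  });

  for await (const chunk of stream) {
    if (clientGone) break; // leaving the loop cancels the provider stream
    const text = chunk.choices[0]?.delta?.content ?? "";
    if (text) send({ type: "token", content: text });
  }
  if (!clientGone) {
    send({ type: "done" });
    res.end();
  }
});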

Headers that matter: text/event-stream, Cache-Control: no-cache, and X-Accel-Buffering: no on nginx. Without these, users see one big blob at the end — which feels like fake typing waiting to start.

Frontend: read the stream

fetch + ReadableStream (POST chat — recommended)

EventSource only supports GET. Chat almost always POSTs the message body, so use fetch:

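async function streamChat(message, onToken, signal) {
  const res = await fetch("/api/chat/stream", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ message }),
    signal, // optional AbortSignal, e.g. from a Stop button's AbortController
  });
  if (!res.ok || !res.body) throw new Error(`Stream failed: HTTP ${res.status}`);

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries
    buffer += decoder.decode(value, { stream: true });

    // Keep the trailing partial line in the buffer until its newline arrives
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? "";

    for (const line of lines) {
      if (!line.startsWith("data: ")) continue;
      const data = JSON.parse(line.slice(6));
      if (data.type === "token") onToken(data.content);
      if (data.type === "done") return;
    }
  }
}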

Append tokens immediately — no setTimeout between characters:

let text = "";
streamChat(userMsg, (token) => {
  text += token;
  bubble.textContent = text;
});

EventSource (GET progress / live site updates)

const es = new EventSource("/api/sync/progress");
es.addEventListener("progress", (e) => {
  const { pct, stage } = JSON.parse(e.data);
  updateBar(pct, stage);
});
es.addEventListener("complete", () => es.close());

RAG + streaming

RAG adds a retrieval phase before generation. Only the LLM phase streams.

| Phase | Streamed? | Typical duration |
|---|---|---|
| Embed query | No | 50–200ms |
| Vector search | No | 50–300ms |
| Rerank (optional) | No | 100–500ms |
| LLM generation | Yes | 2–15s |

UX pattern: show "Searching docs…" during retrieval, then switch to token stream. Optional multi-event SSE:

event: status
data: {"phase":"retrieval"}

event: token
data: {"content":"Based on"}

event: citation
data: {"source":"docs/api.md","chunk_id":"api_3"}
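The multi-event stream above could be produced by a generator along these lines. This is a sketch under stated assumptions: `rag_event_stream`, `fake_retrieve`, and `fake_generate` are hypothetical names, the retriever and LLM are passed in as plain callables, and the second `status` event plus the closing `done` event are additions for illustration that the example above does not include.

```python
import json
from typing import Callable, Dict, Iterable, Iterator, List

Chunk = Dict[str, str]


def sse(event: str, data: Dict) -> str:
    """Frame one named SSE event."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def rag_event_stream(
    question: str,
    retrieve: Callable[[str], List[Chunk]],
    generate: Callable[[str, List[Chunk]], Iterable[str]],
) -> Iterator[str]:
    # Sent before retrieval starts, so the UI can show "Searching docs…".
    yield sse("status", {"phase": "retrieval"})
    chunks = retrieve(question)  # blocking; nothing streams during this call
    yield sse("status", {"phase": "generation"})
    for token in generate(question, chunks):
        yield sse("token", {"content": token})
    # Citations go last here; sending them before the tokens works too.
    for chunk in chunks:
        yield sse("citation", {"source": chunk["source"], "chunk_id": chunk["chunk_id"]})
    yield sse("done", {})


def fake_retrieve(question: str) -> List[Chunk]:
    """Stand-in for embed + vector search (illustration only)."""
    return [{"source": "docs/api.md", "chunk_id": "api_3"}]


def fake_generate(question: str, chunks: List[Chunk]) -> Iterator[str]:
    """Stand-in for a provider token stream (illustration only)."""
    yield "Based on"
    yield " the docs"
```

Because the first `status` frame is yielded before `retrieve` runs, the user sees feedback during the non-streamed phase, provided the server flushes each frame as it is produced.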

See How to Build a RAG Chatbot and economical model picks for RAG.

SSE for live site updates (not just chat)

The same transport powers sync progress, deploy logs, and indexing status on a dashboard — not only chat tokens.

event: progress
data: {"stage":"embed","pct":60,"file":"page-b.md"}

event: complete
data: {"duration_s":94,"files_indexed":12}

We document this pattern in Publish a Knowledge Base to GitHub Pages, where EventSource('/sync/stream') drives a progress bar while embeddings run in the background.
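In the browser, EventSource does the parsing for you. If a script or test needs to consume the same progress stream, a small parser is enough. The sketch below assumes the input is an iterable of text lines (for example, a streaming HTTP response read line by line). The name `parse_sse` is hypothetical, and the parser handles only the `event:` and `data:` fields plus comment lines.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from an iterable of SSE lines."""
    event = "message"  # default event name when no event: line is present
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event, if it carried any data.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a heartbeat
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # one leading space after the colon is dropped
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)
```

Each yielded `data` string can then go through `json.loads` to read fields such as `pct` or `stage`. An event that is not followed by a blank line before the stream ends is dropped, which matches how browsers treat a truncated stream.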

Figure: common SSE event types. Chat streams tokens; sync jobs stream progress percentages.

Production checklist

Server

  • Disable nginx / CDN buffering on stream routes
  • Flush after every yield
  • Heartbeat comment lines (:\n\n) on long gaps
  • Abort provider stream when client disconnects
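The heartbeat item can be sketched as an async wrapper that emits an SSE comment line whenever the upstream generator stays quiet for too long. The names `with_heartbeat`, `slow_tokens`, and `collect`, and the 15-second default, are assumptions for illustration. The wrapper keeps a single pending read alive across timeouts instead of cancelling it, so no upstream frame is lost.

```python
import asyncio
from typing import AsyncIterator, List

HEARTBEAT = ": heartbeat\n\n"  # SSE comment line; clients ignore it


async def with_heartbeat(
    source: AsyncIterator[str], interval: float = 15.0
) -> AsyncIterator[str]:
    """Forward SSE frames, inserting a comment line during long gaps."""
    iterator = source.__aiter__()
    pending = asyncio.ensure_future(iterator.__anext__())
    try:
        while True:
            done, _ = await asyncio.wait({pending}, timeout=interval)
            if not done:
                # Upstream is still working: keep proxies and clients awake.
                yield HEARTBEAT
                continue
            try:
                frame = pending.result()
            except StopAsyncIteration:
                return  # upstream finished
            yield frame
            pending = asyncio.ensure_future(iterator.__anext__())
    finally:
        # If the consumer goes away, stop waiting on the upstream read.
        pending.cancel()


async def slow_tokens() -> AsyncIterator[str]:
    """Stand-in for a slow provider stream (illustration only)."""
    yield "data: a\n\n"
    await asyncio.sleep(0.05)
    yield "data: b\n\n"


async def collect(source: AsyncIterator[str], interval: float) -> List[str]:
    return [frame async for frame in with_heartbeat(source, interval)]
```

Running `asyncio.run(collect(slow_tokens(), 0.01))` returns the two data frames with heartbeat comments in between. In production the interval would sit comfortably below the shortest idle timeout of any proxy in the path.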

Client

  • AbortController for Stop button
  • Handle event: error gracefully
  • Don't fake typewriter delay
  • Debounce markdown re-parse if needed

Observability

  • Log time-to-first-token (TTFT)
  • Track tokens/sec and stream duration
  • Alert on buffer-timeout errors
  • Rate-limit stream endpoints
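For the first two items, a pass-through generator can record timings without changing what the client receives. This is a sketch under stated assumptions: `StreamMetrics` and `measured` are hypothetical names, the clock is injectable so the wrapper can be tested, and it counts stream chunks rather than true tokenizer tokens, since one delta may hold more or less than one token.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class StreamMetrics:
    started_at: float  # set when the request arrives, before retrieval
    first_token_at: Optional[float] = None
    finished_at: Optional[float] = None
    chunks: int = 0

    @property
    def ttft(self) -> Optional[float]:
        """Seconds from request start to the first streamed chunk."""
        if self.first_token_at is None:
            return None
        return self.first_token_at - self.started_at

    @property
    def chunks_per_second(self) -> Optional[float]:
        if self.finished_at is None or self.finished_at <= self.started_at:
            return None
        return self.chunks / (self.finished_at - self.started_at)


def measured(
    tokens: Iterable[str],
    metrics: StreamMetrics,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[str]:
    """Yield tokens unchanged while filling in the metrics object."""
    try:
        for token in tokens:
            if metrics.first_token_at is None:
                metrics.first_token_at = clock()
            metrics.chunks += 1
            yield token
    finally:
        # Runs on normal completion and when the stream is closed early.
        metrics.finished_at = clock()
```

Create the metrics object with `StreamMetrics(started_at=time.monotonic())` at the top of the request handler so TTFT includes retrieval time, wrap the provider stream with `measured(...)`, and log the object once the response ends.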

FAQ

Is ChatGPT's typing effect fake?

In production streaming APIs, no — text arrives as the model generates. Some marketing demos and tutorial apps fake it with timers. Check whether your backend uses stream: true.

SSE or WebSocket for my chatbot?

SSE (or chunked HTTP) for most chat + RAG bots. WebSocket when you need constant bidirectional traffic on one connection.

Why does nothing appear until the full answer is ready?

Almost always proxy or framework buffering. Add X-Accel-Buffering: no, disable response buffering in your WSGI/ASGI middleware, and verify with curl -N.

Can I stream markdown safely?

Append raw text live; run markdown rendering on a short debounce or after done to avoid broken partial syntax (unclosed ** or code fences).

Does streaming cost more?

Same tokens billed. Streaming improves UX and lets you cancel early — which can reduce cost on aborted long answers.

Glossary

| Term | Meaning |
|---|---|
| SSE | Server-Sent Events: HTTP stream of data: lines from server to browser |
| TTFT | Time to first token: latency until first visible character |
| Token delta | Incremental text chunk from the LLM's stream |
| ReadableStream | Browser API to read a fetch response body chunk by chunk |
| EventSource | Browser API for GET-based SSE subscriptions |
| Buffering | Proxy holding chunks until response completes; kills live UX |

Stream real tokens. Skip the fake typewriter.