How SSE Streaming Works in Chatbots
When ChatGPT-style UIs print text word by word, it looks like a typing animation. It is not. The browser is receiving real tokens from your server as the LLM generates them — usually over HTTP streaming, often formatted as Server-Sent Events (SSE).
I used to assume the chatbot waited for the full answer, then faked a typewriter effect in JavaScript. Once you see a real stream implementation, that mental model breaks: the first word can appear in under a second while the model is still generating the rest of the answer.
Who is this for? Developers building chat UIs, RAG bots, or any page that needs live server updates — and anyone who confused streaming with CSS typing effects.
The typing effect misconception
There are two completely different UX patterns that look identical to users:
| | Fake typing | Real SSE / HTTP stream |
|---|---|---|
| Server | Returns full JSON answer in one shot | Keeps connection open; sends chunks as generated |
| Browser | setInterval reveals characters | Appends each token from the network |
| First visible text | After entire LLM call finishes (e.g. 8s) | Often <500ms after send (time-to-first-token) |
| Cancel button | Cosmetic — work already done | Aborts provider stream; saves cost |
| User perception | "Why is it thinking so long then typing?" | "It's answering while it thinks" |
Illustrative timeline — 8s total generation. SSE shows text at ~0.4s; fake typing starts at ~8s.
If your chatbot feels sluggish until the "typing" begins, you are probably waiting for completion before animating. Fix the pipeline, not the animation.
What is Server-Sent Events (SSE)?
SSE is a web standard for server → client push over a single HTTP connection.
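- Content-Type: text/event-stream
- Client API: EventSource (GET) or fetch + ReadableStream (POST)
- Direction: one-way, server to client. With EventSource the client sends the question in a separate POST; with fetch the question travels in the request body and the response streams back
- Reconnect: EventSource auto-reconnects if the connection drops; a fetch-based reader has to retry on its own
- Format: lines starting with data: (optional event: and id:)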
// Raw SSE over the wire
data: {"type":"token","content":"The"}

data: {"type":"token","content":" answer"}

data: {"type":"token","content":" is"}

event: done
data: {"finish_reason":"stop"}

A blank line terminates each event. The browser's stream parser dispatches an event as soon as its terminating blank line arrives, without waiting for the rest of the response body.
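To make the framing rules concrete, here is a minimal Python sketch of a parser for a complete SSE payload. The names `SSEEvent` and `parse_sse` are illustrative, not part of any library, and the parser is simplified: unlike a browser, it also flushes a final event that is missing its terminating blank line.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SSEEvent:
    event: str               # "message" when no event: line was sent
    data: str                # data: lines joined with "\n"
    id: Optional[str] = None


def parse_sse(raw: str) -> List[SSEEvent]:
    """Split a complete SSE payload into events (simplified sketch)."""
    events: List[SSEEvent] = []
    name, data_lines, last_id = "message", [], None
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    for line in text.split("\n") + [""]:  # extra "" flushes the last event
        if line == "":
            # A blank line ends the event; events without data are dropped
            if data_lines:
                events.append(SSEEvent(name, "\n".join(data_lines), last_id))
            name, data_lines = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, commonly used as a heartbeat
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # one optional space after the colon is stripped
        if field == "data":
            data_lines.append(value)
        elif field == "event":
            name = value
        elif field == "id":
            last_id = value
    return events
```

Feeding it the wire sample above should produce three `message` events followed by one `done` event.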
SSE vs WebSockets vs polling
| Method | Direction | Protocol | Best for | Chat fit |
|---|---|---|---|---|
| SSE | Server → client | HTTP | LLM tokens, progress bars, logs | Excellent |
| WebSocket | Bidirectional | WS upgrade | Games, collab editors, voice | Overkill for simple chat |
| Long polling | Simulated push | HTTP | Legacy browsers | Higher latency |
| Short polling | Client pulls | HTTP | Simple dashboards | Poor for token streams |
Relative scores for chat use cases (higher is better for latency & simplicity)
Most RAG chatbots use POST + streaming response. WebSockets shine when the same socket carries dozens of event types both ways — not required for "user asks → assistant streams answer."
End-to-end flow in a chatbot
| Step | Who | What happens |
|---|---|---|
| 1 | Browser | POST { message, history } to /api/chat/stream |
| 2 | Your API | Embed query, search vector DB, optional rerank (not streamed) |
| 3 | Your API → LLM | stream: true completion with context + history |
| 4 | LLM → API | Token deltas arrive over provider's HTTP stream |
| 5 | API → Browser | Each delta forwarded as data: {...}\n\n |
| 6 | Browser | Append to message bubble; on done, re-enable input |
How LLM providers stream tokens
Providers send token deltas (subword pieces), not whole words. Your server normalizes them into one text stream for the UI.
OpenAI / compatible APIs
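from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def stream_openai(messages):
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:  # some chunks (e.g. usage-only) carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta  # → wrap in SSE data: line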
Wire format: SSE. Each chunk arrives as a data: line carrying a JSON object, and the stream ends with data: [DONE].
Anthropic Claude
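import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def stream_claude(messages):
    with client.messages.stream(
        model="claude-haiku-4-5",
        messages=messages,
        max_tokens=1024,
    ) as s:
        for text in s.text_stream:
            yield text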
Events include content_block_delta with text fragments.
Google Gemini
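from google import genai

client = genai.Client()  # reads the API key from the environment

def stream_gemini(prompt):
    for chunk in client.models.generate_content_stream(
        model="gemini-2.5-flash-lite",
        contents=prompt,
    ):
        if chunk.text:
            yield chunk.text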
Your abstraction layer should emit uniform SSE JSON ({"type":"token","content":"..."}) so the frontend works with any provider.
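One way to sketch that abstraction layer in Python is a pair of small helpers that take any provider's text deltas and emit the same envelope. `format_sse` and `to_uniform_sse` are hypothetical names chosen for this example; the envelope matches the one used elsewhere in this post.

```python
import json
from typing import Iterable, Iterator, Optional


def format_sse(payload: dict, event: Optional[str] = None) -> str:
    """Serialize one payload as a single SSE event."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps escapes newlines, so the payload always fits on one data: line
    lines.append(f"data: {json.dumps(payload)}")
    return "\n".join(lines) + "\n\n"


def to_uniform_sse(deltas: Iterable[Optional[str]]) -> Iterator[str]:
    """Wrap any provider's text deltas in the same token/done envelope."""
    for delta in deltas:
        if delta:  # skip empty or None deltas
            yield format_sse({"type": "token", "content": delta})
    yield format_sse({"type": "done"})
```

Any of the provider generators above could be passed in as `deltas`, so the frontend only ever sees `token` and `done` payloads.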
Build an SSE streaming API
FastAPI (Python)
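import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    message: str

# retrieve_chunks() and stream_llm() are your own helpers (retrieval plus a
# provider generator like the ones above); they are not defined here.
def token_generator(user_message: str):
    # Plain (sync) generator: Starlette iterates it in a threadpool, so the
    # blocking provider call does not stall the event loop.
    context = retrieve_chunks(user_message)  # RAG, not streamed
    for token in stream_llm(user_message, context):
        payload = json.dumps({"type": "token", "content": token})
        yield f"data: {payload}\n\n"
    yield f"data: {json.dumps({'type': 'done'})}\n\n"

@app.post("/api/chat/stream")
async def chat_stream(body: ChatRequest):
    return StreamingResponse(
        token_generator(body.message),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "X-Accel-Buffering": "no",  # tells nginx not to buffer this response
        },
    )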
Node / Express
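import express from "express";
import OpenAI from "openai";

const app = express();
app.use(express.json()); // parses the JSON body into req.body
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

app.post("/api/chat/stream", async (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");
  res.flushHeaders();

  const stream = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: req.body.messages,
    stream: true,
  });
  for await (const chunk of stream) {
    const text = chunk.choices[0]?.delta?.content ?? "";
    // Same envelope as the FastAPI example so one frontend parser handles both
    if (text) res.write(`data: ${JSON.stringify({ type: "token", content: text })}\n\n`);
  }
  res.write(`data: ${JSON.stringify({ type: "done" })}\n\n`);
  res.end();
});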
Headers that matter: text/event-stream, Cache-Control: no-cache, and X-Accel-Buffering: no on nginx. Without these, users see one big blob at the end — which feels like fake typing waiting to start.
Frontend: read the stream
fetch + ReadableStream (POST chat — recommended)
EventSource only supports GET. Chat almost always POSTs the message body, so use fetch:
async function streamChat(message, onToken, signal) {
  const res = await fetch("/api/chat/stream", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ message }),
    signal, // optional AbortSignal, e.g. wired to a Stop button
  });
  if (!res.ok || !res.body) throw new Error(`Stream request failed: ${res.status}`);

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multibyte characters intact across chunk boundaries
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // keep the trailing partial line for the next chunk
    for (const line of lines) {
      if (!line.startsWith("data: ")) continue;
      const data = JSON.parse(line.slice(6));
      if (data.type === "token") onToken(data.content);
    }
  }
}
Append tokens immediately — no setTimeout between characters:
let text = "";
streamChat(userMsg, (token) => {
  text += token;
  bubble.textContent = text;
});
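
The buffering in the reader matters because network chunks do not respect line or even character boundaries. The same logic can be sketched in Python, for example for a CLI client or a server-side test. `SSELineDecoder` is a hypothetical helper, and like the JavaScript reader it assumes every payload is a single `data: ` line of JSON.

```python
import codecs
import json


class SSELineDecoder:
    """Buffers raw bytes and returns the JSON payload of each complete data: line."""

    def __init__(self):
        self._utf8 = codecs.getincrementaldecoder("utf-8")()
        self._buffer = ""

    def feed(self, chunk: bytes) -> list:
        # The incremental decoder holds back a multibyte character split across chunks
        self._buffer += self._utf8.decode(chunk)
        # Everything before the last "\n" is complete; the tail may be a partial line
        *lines, self._buffer = self._buffer.split("\n")
        payloads = []
        for line in lines:
            line = line.rstrip("\r")
            if line.startswith("data: "):
                payloads.append(json.loads(line[len("data: "):]))
        return payloads
```

Calling `feed()` once per received chunk returns only the payloads that are complete so far; a chunk that ends mid-line or mid-character simply returns fewer payloads until the rest arrives.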
EventSource (GET progress / live site updates)
const es = new EventSource("/api/sync/progress");
es.addEventListener("progress", (e) => {
  const { pct, stage } = JSON.parse(e.data);
  updateBar(pct, stage);
});
es.addEventListener("complete", () => es.close());
RAG + streaming
RAG adds a retrieval phase before generation. Only the LLM phase streams.
| Phase | Streamed? | Typical duration |
|---|---|---|
| Embed query | No | 50–200ms |
| Vector search | No | 50–300ms |
| Rerank (optional) | No | 100–500ms |
| LLM generation | Yes | 2–15s |
UX pattern: show "Searching docs…" during retrieval, then switch to token stream. Optional multi-event SSE:
event: status
data: {"phase":"retrieval"}

event: token
data: {"content":"Based on"}

event: citation
data: {"source":"docs/api.md","chunk_id":"api_3"}

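A server-side generator for this multi-event pattern might look like the following sketch. `rag_event_stream` is a hypothetical name, `retrieve` and `generate` are placeholders for your own retrieval and LLM calls, and the extra `generation` status and closing `done` event are assumptions added for illustration.

```python
import json
from typing import Callable, Iterable, Iterator, List


def rag_event_stream(
    question: str,
    retrieve: Callable[[str], List[dict]],
    generate: Callable[[str, List[dict]], Iterable[str]],
) -> Iterator[str]:
    """Yield named SSE events for the retrieval and generation phases."""

    def event(name: str, payload: dict) -> str:
        return f"event: {name}\ndata: {json.dumps(payload)}\n\n"

    # Sent before the slow retrieval work so the UI can show "Searching docs…"
    yield event("status", {"phase": "retrieval"})
    chunks = retrieve(question)  # not streamed
    yield event("status", {"phase": "generation"})
    for token in generate(question, chunks):
        yield event("token", {"content": token})
    for chunk in chunks:
        yield event("citation", {"source": chunk["source"], "chunk_id": chunk["chunk_id"]})
    yield event("done", {})
```

A client that only inspects data: lines, like the fetch reader earlier in this post, would also need to track the preceding event: line (or the payload would need a type field) to tell these events apart.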
See How to Build a RAG Chatbot and economical model picks for RAG.
SSE for live site updates (not just chat)
The same transport powers sync progress, deploy logs, and indexing status on a dashboard — not only chat tokens.
event: progress
data: {"stage":"embed","pct":60,"file":"page-b.md"}

event: complete
data: {"duration_s":94,"files_indexed":12}

We document this pattern in Publish a Knowledge Base to GitHub Pages — EventSource('/sync/stream') driving a progress bar while embeddings run in the background.
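As a rough sketch of the server side of such a progress stream, the generator below emits one progress event per processed file and a final complete event. `sync_progress_events` and the `process` callback are hypothetical; the payload fields follow the wire sample above.

```python
import json
import time
from typing import Callable, Iterator, Sequence


def sync_progress_events(
    files: Sequence[str],
    process: Callable[[str], None],
    stage: str = "embed",
) -> Iterator[str]:
    """Run process() on each file and yield an SSE progress event after each one."""
    started = time.monotonic()
    total = len(files)
    for done_count, name in enumerate(files, start=1):
        process(name)  # the actual work, e.g. embedding one file
        payload = {"stage": stage, "pct": round(100 * done_count / total), "file": name}
        yield f"event: progress\ndata: {json.dumps(payload)}\n\n"
    summary = {"duration_s": round(time.monotonic() - started), "files_indexed": total}
    yield f"event: complete\ndata: {json.dumps(summary)}\n\n"
```

The generator could be handed to a streaming response the same way as the chat token generator; the EventSource listener shown earlier would then receive the progress and complete events.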
Common SSE event types — chat streams tokens; sync jobs stream progress percentages
Production checklist
Server
- Disable nginx / CDN buffering on stream routes
- Flush after every yield
- Heartbeat comment lines (:\n\n) on long gaps
- Abort provider stream when client disconnects
Client
- AbortController for Stop button
- Handle event: error gracefully
- Don't fake typewriter delay
- Debounce markdown re-parse if needed
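
The heartbeat item from the server list can be sketched as an async wrapper around any event generator. `with_heartbeat` is a hypothetical helper and the 15-second default is an arbitrary assumption; pick an interval shorter than your proxy's idle timeout.

```python
import asyncio
from typing import AsyncIterator


async def with_heartbeat(
    events: AsyncIterator[str], interval: float = 15.0
) -> AsyncIterator[str]:
    """Pass events through, inserting an SSE comment when the source is silent."""
    iterator = events.__aiter__()
    pending = asyncio.ensure_future(iterator.__anext__())
    try:
        while True:
            # asyncio.wait does not cancel the pending read on timeout
            done, _ = await asyncio.wait({pending}, timeout=interval)
            if not done:
                yield ": ping\n\n"  # comment line: ignored by SSE parsers
                continue
            try:
                item = pending.result()
            except StopAsyncIteration:
                return
            pending = asyncio.ensure_future(iterator.__anext__())
            yield item
    finally:
        pending.cancel()  # stop reading if the consumer goes away
```

The sketch uses `asyncio.wait` rather than `asyncio.wait_for` because the latter cancels the awaited read on timeout, which would break the underlying generator mid-stream.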
Observability
- Log time-to-first-token (TTFT)
- Track tokens/sec and stream duration
- Alert on buffer-timeout errors
- Rate-limit stream endpoints
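
For the first two observability items, one possible approach is a pass-through generator that times the stream. `measure_stream` and the `report` callback are hypothetical names, and the sketch counts stream chunks rather than true model tokens, which is only an approximation of tokens/sec.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(
    tokens: Iterable[str],
    report: Callable[[dict], None],
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[str]:
    """Pass tokens through unchanged and report TTFT and throughput at the end."""
    started = clock()
    first_at = None
    count = 0
    try:
        for token in tokens:
            if first_at is None:
                first_at = clock()  # first chunk arrived: this is TTFT
            count += 1
            yield token
    finally:
        # Runs on normal completion and when the consumer stops early
        duration = clock() - started
        report({
            "ttft_s": None if first_at is None else first_at - started,
            "chunks": count,
            "duration_s": duration,
            "chunks_per_s": count / duration if duration > 0 else None,
        })
```

Wrapping the provider generator with it, for example `measure_stream(stream_llm(...), logger.info)`, would log one metrics record per request without changing what the client receives.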
FAQ
Is ChatGPT's typing effect fake?
In production streaming APIs, no — text arrives as the model generates. Some marketing demos and tutorial apps fake it with timers. Check whether your backend uses stream: true.
SSE or WebSocket for my chatbot?
SSE (or chunked HTTP) for most chat + RAG bots. WebSocket when you need constant bidirectional traffic on one connection.
Why does nothing appear until the full answer is ready?
Almost always proxy or framework buffering. Add X-Accel-Buffering: no, disable response buffering in your WSGI/ASGI middleware, and verify with curl -N.
Can I stream markdown safely?
Append raw text live; run markdown rendering on a short debounce or after done to avoid broken partial syntax (unclosed ** or code fences).
Does streaming cost more?
Same tokens billed. Streaming improves UX and lets you cancel early — which can reduce cost on aborted long answers.
Glossary
| Term | Meaning |
|---|---|
| SSE | Server-Sent Events — HTTP stream of data: lines from server to browser |
| TTFT | Time to first token — latency until first visible character |
| Token delta | Incremental text chunk from the LLM's stream |
| ReadableStream | Browser API to read a fetch response body chunk by chunk |
| EventSource | Browser API for GET-based SSE subscriptions |
| Buffering | Proxy holding chunks until response completes — kills live UX |
Related guides
- How to Build a RAG Chatbot
- Best Economical LLM Models for RAG
- Publish a Knowledge Base (SSE progress)
Stream real tokens. Skip the fake typewriter.