How SSE Streaming Works in Chatbots

June 2026 · Published by Amar Kumar

When ChatGPT-style UIs print text word by word, it looks like a typing animation. It is not. The browser is receiving real tokens from your server as the LLM generates them — usually over HTTP streaming, often formatted as Server-Sent Events (SSE).

I used to assume the chatbot waited for the full answer, then faked a typewriter effect in JavaScript. Once you see a real streaming implementation, that mental model breaks: the first word can appear in under a second while the model is still generating the rest of the answer.

Who is this for? Developers building chat UIs, RAG bots, or any page that needs live server updates, and anyone who has confused streaming with CSS typing effects.

The typing effect misconception

There are two completely different UX patterns that look identical to users:

	Fake typing	Real SSE / HTTP stream
Server	Returns full JSON answer in one shot	Keeps connection open; sends chunks as generated
Browser	`setInterval` reveals characters	Appends each token from the network
First visible text	After entire LLM call finishes (e.g. 8s)	Often <500ms after send (time-to-first-token)
Cancel button	Cosmetic — work already done	Aborts provider stream; saves cost
User perception	"Why is it thinking so long then typing?"	"It's answering while it thinks"

Illustrative timeline — 8s total generation. SSE shows text at ~0.4s; fake typing starts at ~8s.

If your chatbot feels sluggish until the "typing" begins, you are probably waiting for completion before animating. Fix the pipeline, not the animation.

What is Server-Sent Events (SSE)?

SSE is a web standard for server → client push over a single HTTP connection.

Content-Type: text/event-stream
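Client API: EventSource (GET) or fetch + ReadableStream (POST)
Direction: one-way, server to client; the client sends its question in the request body (fetch POST) or in a separate request (EventSource)
Reconnect: EventSource auto-reconnects if the connection drops; with fetch you handle retries yourself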
Format: lines starting with data: (optional event: and id:)

// Raw SSE over the wire
data: {"type":"token","content":"The"}

data: {"type":"token","content":" answer"}

data: {"type":"token","content":" is"}

event: done
data: {"finish_reason":"stop"}

Each blank line separates events. The browser's stream parser fires as soon as a complete event arrives — no waiting for the full response body.
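
A minimal Python sketch of how a server might serialize one event into this wire format. The helper name `format_sse` is hypothetical, not part of any library. It follows the format shown above, plus one rule that is easy to miss: a payload containing newlines has to be split across several data: lines.

```python
import json
from typing import Any, Optional


def format_sse(data: Any, event: Optional[str] = None, event_id: Optional[str] = None) -> str:
    """Serialize one Server-Sent Event as text (hypothetical helper)."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    if event_id is not None:
        lines.append(f"id: {event_id}")
    # Strings are sent as-is; anything else is JSON-encoded compactly.
    payload = data if isinstance(data, str) else json.dumps(data, separators=(",", ":"))
    # A payload with newlines becomes one data: line per line of text.
    for part in payload.split("\n"):
        lines.append(f"data: {part}")
    # The trailing blank line terminates the event.
    return "\n".join(lines) + "\n\n"


token_frame = format_sse({"type": "token", "content": "The"})
done_frame = format_sse({"finish_reason": "stop"}, event="done")
```

Here `token_frame` and `done_frame` reproduce the first and last events of the raw sample above.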

SSE vs WebSockets vs polling

Method	Direction	Protocol	Best for	Chat fit
SSE	Server → client	HTTP	LLM tokens, progress bars, logs	Excellent
WebSocket	Bidirectional	WS upgrade	Games, collab editors, voice	Overkill for simple chat
Long polling	Simulated push	HTTP	Legacy browsers	Higher latency
Short polling	Client pulls	HTTP	Simple dashboards	Poor for token streams

Relative scores for chat use cases (higher is better for latency & simplicity)

Most RAG chatbots use POST + streaming response. WebSockets shine when the same socket carries dozens of event types both ways — not required for "user asks → assistant streams answer."

End-to-end flow in a chatbot

Browser POST /chat → API: RAG retrieve → LLM stream=True → SSE data: tokens → UI appends text

Step	Who	What happens
1	Browser	POST `{ message, history }` to `/api/chat/stream`
2	Your API	Embed query, search vector DB, optional rerank (not streamed)
3	Your API → LLM	`stream: true` completion with context + history
4	LLM → API	Token deltas arrive over provider's HTTP stream
5	API → Browser	Each delta forwarded as `data: {...}\n\n`
6	Browser	Append to message bubble; on `done`, re-enable input
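
The steps in the table can be tried offline with stand-ins for the slow parts. In this sketch `retrieve_chunks` and `stream_llm` are fake placeholders (no vector DB, no provider call), so it only illustrates the ordering: retrieval finishes first, then each delta becomes one SSE frame.

```python
import json
from typing import Iterator, List


def retrieve_chunks(question: str) -> List[str]:
    """Step 2 stand-in: a real app would embed the query and search a vector DB."""
    return ["SSE keeps one HTTP response open and sends events as they are ready."]


def stream_llm(question: str, context: List[str]) -> Iterator[str]:
    """Steps 3-4 stand-in: yields fixed deltas instead of calling a provider."""
    yield from ["SSE", " keeps", " the", " response", " open."]


def chat_stream(question: str) -> Iterator[str]:
    """Step 5: forward each delta to the browser as one SSE frame."""
    context = retrieve_chunks(question)  # completes before any token is sent
    for delta in stream_llm(question, context):
        payload = json.dumps({"type": "token", "content": delta})
        yield f"data: {payload}\n\n"
    done = json.dumps({"type": "done"})
    yield f"data: {done}\n\n"  # step 6 trigger: the UI re-enables input


frames = list(chat_stream("How does SSE work?"))
```

Because `chat_stream` is a generator, a web framework can send each frame as soon as it is produced rather than waiting for the whole list.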

How LLM providers stream tokens

Providers send token deltas (subword pieces), not whole words. Your server normalizes them into one text stream for the UI.

OpenAI / compatible APIs

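from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def stream_openai(messages):
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:  # some chunks carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta  # → wrap in SSE data: line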

Wire format: Server-Sent Events. Each chunk arrives as a data: line carrying a JSON object, and the stream ends with data: [DONE].

Anthropic Claude

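import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def stream_claude(messages):
    with client.messages.stream(
        model="claude-haiku-4-5",
        messages=messages,
        max_tokens=1024,
    ) as s:
        # text_stream yields only the text fragments, not the raw events
        for text in s.text_stream:
            yield text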

Events include content_block_delta with text fragments.

Google Gemini

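from google import genai

client = genai.Client()  # reads the API key from the environment

def stream_gemini(prompt):
    for chunk in client.models.generate_content_stream(
        model="gemini-2.5-flash-lite",
        contents=prompt,
    ):
        if chunk.text:  # skip chunks without text
            yield chunk.text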

Your abstraction layer should emit uniform SSE JSON ({"type":"token","content":"..."}) so the frontend works with any provider.
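
One way such an abstraction layer could look, as a sketch rather than a definitive design. The `PROVIDERS` registry and the two fake generator functions are hypothetical; in a real app each entry would wrap one of the SDK calls shown above and yield plain text deltas.

```python
import json
from typing import Callable, Dict, Iterator


def fake_openai(prompt: str) -> Iterator[str]:
    """Stand-in for an OpenAI-style generator of text deltas."""
    yield from ["Hel", "lo"]


def fake_claude(prompt: str) -> Iterator[str]:
    """Stand-in for an Anthropic-style generator; may yield empty strings."""
    yield from ["Hello", "", "!"]


# Hypothetical registry: provider name -> function yielding plain text deltas.
PROVIDERS: Dict[str, Callable[[str], Iterator[str]]] = {
    "openai": fake_openai,
    "claude": fake_claude,
}


def uniform_sse(provider: str, prompt: str) -> Iterator[str]:
    """Emit the same SSE JSON shape regardless of which provider produced the text."""
    for delta in PROVIDERS[provider](prompt):
        if not delta:  # skip empty deltas so the UI never receives no-op events
            continue
        yield "data: " + json.dumps({"type": "token", "content": delta}) + "\n\n"
    yield "data: " + json.dumps({"type": "done"}) + "\n\n"


openai_frames = list(uniform_sse("openai", "hi"))
claude_frames = list(uniform_sse("claude", "hi"))
```

The frontend only ever sees `type: "token"` and `type: "done"`, so swapping providers means changing one registry entry.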

Build an SSE streaming API

FastAPI (Python)

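from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import json

app = FastAPI()

class ChatRequest(BaseModel):
    message: str

# retrieve_chunks() and stream_llm() are your own helpers (vector search and
# a provider generator like the ones above); they are not defined here.

# A plain (non-async) generator: Starlette iterates it in a worker thread,
# so a blocking LLM client does not stall the event loop.
def token_generator(user_message: str):
    context = retrieve_chunks(user_message)  # RAG step, not streamed

    for token in stream_llm(user_message, context):
        payload = json.dumps({"type": "token", "content": token})
        yield f"data: {payload}\n\n"

    yield f"data: {json.dumps({'type': 'done'})}\n\n"

@app.post("/api/chat/stream")
async def chat_stream(body: ChatRequest):
    return StreamingResponse(
        token_generator(body.message),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "X-Accel-Buffering": "no",  # ask nginx not to buffer this response
        },
    )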

Node / Express

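// Assumes `app` is an Express app with express.json() enabled
// and `openai` is an initialized OpenAI client.
app.post("/api/chat/stream", async (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");
  res.setHeader("X-Accel-Buffering", "no"); // ask nginx not to buffer this response
  res.flushHeaders();

  const stream = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: req.body.messages,
    stream: true,
  });

  for await (const chunk of stream) {
    const text = chunk.choices[0]?.delta?.content ?? "";
    // Same event shape as the FastAPI example, so one frontend parser handles both
    if (text) res.write(`data: ${JSON.stringify({ type: "token", content: text })}\n\n`);
  }
  res.write(`data: ${JSON.stringify({ type: "done" })}\n\n`);
  res.end();
});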

Headers that matter: Content-Type: text/event-stream, Cache-Control: no-cache, and X-Accel-Buffering: no when nginx sits in front of your app. If a proxy or cache buffers the response, users see one big blob at the end, which feels like fake typing waiting to start.

Frontend: read the stream

fetch + ReadableStream (POST chat — recommended)

EventSource only supports GET. Chat almost always POSTs the message body, so use fetch:

async function streamChat(message, onToken, signal) {
  const res = await fetch("/api/chat/stream", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ message }),
    signal, // optional AbortSignal so a Stop button can cancel the request
  });
  if (!res.ok || !res.body) throw new Error(`Stream request failed: ${res.status}`);

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries
    buffer += decoder.decode(value, { stream: true });

    // A network chunk can end mid-line; keep the unfinished tail for the next read
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? "";

    for (const line of lines) {
      if (!line.startsWith("data: ")) continue;
      const data = JSON.parse(line.slice(6));
      if (data.type === "token") onToken(data.content);
    }
  }
}

Append tokens immediately — no setTimeout between characters:

let text = "";
streamChat(userMsg, (token) => {
  text += token;
  bubble.textContent = text;
});
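
The same buffering idea can be sketched in Python, which is handy for testing an endpoint from a script. The `SSEParser` class below is a hypothetical helper, not a library API. Unlike the browser snippet above, it also tracks event: names, skips comment lines, and only emits an event once the terminating blank line has arrived.

```python
import codecs
import json
from typing import Dict, List


class SSEParser:
    """Incremental SSE parser: feed raw bytes, get back completed events."""

    def __init__(self) -> None:
        # Incremental decoder keeps multi-byte characters intact across chunks.
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self._buffer = ""
        self._event = "message"  # default event name when no event: line is sent
        self._data: List[str] = []

    def feed(self, chunk: bytes) -> List[Dict[str, str]]:
        events: List[Dict[str, str]] = []
        self._buffer += self._decoder.decode(chunk)
        lines = self._buffer.split("\n")
        self._buffer = lines.pop()  # last piece may be an incomplete line
        for line in lines:
            line = line.rstrip("\r")
            if line == "":  # blank line: dispatch the pending event, if any
                if self._data:
                    events.append({"event": self._event, "data": "\n".join(self._data)})
                self._event, self._data = "message", []
            elif line.startswith(":"):  # comment line, e.g. a heartbeat
                continue
            else:
                field, _, value = line.partition(":")
                if value.startswith(" "):
                    value = value[1:]  # one leading space after the colon is dropped
                if field == "event":
                    self._event = value
                elif field == "data":
                    self._data.append(value)
        return events


# Chunks deliberately split in the middle of a line and of an event.
parser = SSEParser()
events: List[Dict[str, str]] = []
for chunk in [b'data: {"type":"tok', b'en","content":"Hi"}\n', b"\n: ping\n\nevent: done\ndata: {}\n\n"]:
    events.extend(parser.feed(chunk))
first = json.loads(events[0]["data"])
```

After the loop, `events` holds two entries: the token event (default name `message`) and the `done` event; the `: ping` comment produces nothing.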

EventSource (GET progress / live site updates)

const es = new EventSource("/api/sync/progress");
es.addEventListener("progress", (e) => {
  const { pct, stage } = JSON.parse(e.data);
  updateBar(pct, stage);
});
es.addEventListener("complete", () => es.close());

RAG + streaming

RAG adds a retrieval phase before generation. Only the LLM phase streams.

Phase	Streamed?	Typical duration
Embed query	No	50–200ms
Vector search	No	50–300ms
Rerank (optional)	No	100–500ms
LLM generation	Yes	2–15s

UX pattern: show "Searching docs…" during retrieval, then switch to token stream. Optional multi-event SSE:

event: status
data: {"phase":"retrieval"}

event: token
data: {"content":"Based on"}

event: citation
data: {"source":"docs/api.md","chunk_id":"api_3"}
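
A sketch of how a server might order these named events. The helper names `sse` and `rag_stream`, the second status value `generation`, and the closing `done` event are assumptions for illustration; tokens and citations are passed in as plain lists so the example runs without a model.

```python
import json
from typing import Iterator, List, Tuple


def sse(event: str, payload: dict) -> str:
    """Format one named SSE event (hypothetical helper)."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def rag_stream(tokens: List[str], sources: List[Tuple[str, str]]) -> Iterator[str]:
    """Emit status first, then tokens, then citations, then done."""
    yield sse("status", {"phase": "retrieval"})  # UI shows "Searching docs…"
    # ... retrieval would run here, before the first token ...
    yield sse("status", {"phase": "generation"})  # UI switches to the token stream
    for token in tokens:
        yield sse("token", {"content": token})
    for source, chunk_id in sources:
        yield sse("citation", {"source": source, "chunk_id": chunk_id})
    yield sse("done", {})


frames = list(rag_stream(["Based on", " the docs"], [("docs/api.md", "api_3")]))
event_names = [frame.split("\n", 1)[0] for frame in frames]
```

Sending the first status event before retrieval starts gives the user feedback during the phase that cannot stream.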

See How to Build a RAG Chatbot and economical model picks for RAG.

SSE for live site updates (not just chat)

The same transport powers sync progress, deploy logs, and indexing status on a dashboard — not only chat tokens.

event: progress
data: {"stage":"embed","pct":60,"file":"page-b.md"}

event: complete
data: {"duration_s":94,"files_indexed":12}

We document this pattern in Publish a Knowledge Base to GitHub Pages — EventSource('/sync/stream') driving a progress bar while embeddings run in the background.

Common SSE event types — chat streams tokens; sync jobs stream progress percentages

Production checklist

Server

Disable nginx / CDN buffering on stream routes
Flush after every yield
Heartbeat comment lines (:\n\n) on long gaps
Abort provider stream when client disconnects
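
The heartbeat item can be sketched with asyncio. The wrapper name `with_heartbeat` and the 15-second default are assumptions, not a standard. It reads the real stream in a background task and emits an SSE comment frame whenever nothing has arrived for `interval` seconds, which keeps idle-timeout proxies from closing the connection.

```python
import asyncio
from typing import AsyncIterator, List

HEARTBEAT = ": heartbeat\n\n"  # SSE comment line; clients ignore it
_DONE = object()


async def with_heartbeat(source: AsyncIterator[str], interval: float = 15.0) -> AsyncIterator[str]:
    """Yield frames from `source`, inserting a comment frame after each quiet interval."""
    queue: asyncio.Queue = asyncio.Queue()

    async def pump() -> None:
        try:
            async for frame in source:
                queue.put_nowait(frame)
        finally:
            queue.put_nowait(_DONE)

    task = asyncio.create_task(pump())
    try:
        while True:
            try:
                item = await asyncio.wait_for(queue.get(), timeout=interval)
            except asyncio.TimeoutError:
                yield HEARTBEAT
                continue
            if item is _DONE:
                await task  # re-raises any error from the source
                return
            yield item
    finally:
        task.cancel()  # stop reading the source if the consumer went away


async def slow_tokens() -> AsyncIterator[str]:
    await asyncio.sleep(0.25)  # simulates a long gap before the first token
    yield "data: hello\n\n"


async def demo() -> List[str]:
    return [frame async for frame in with_heartbeat(slow_tokens(), interval=0.1)]


frames = asyncio.run(demo())
```

The `finally` block also covers the last checklist item in part: when the consumer stops iterating, the background task reading the provider stream is cancelled.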

Client

AbortController for Stop button
Handle event: error gracefully
Don't fake typewriter delay
Debounce markdown re-parse if needed

Observability

Log time-to-first-token (TTFT)
Track tokens/sec and stream duration
Alert on buffer-timeout errors
Rate-limit stream endpoints
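
A small sketch of how the first two observability items might be measured on the server. `StreamStats` and `measure` are hypothetical names. The wrapper counts provider chunks, which only approximates true token counts because one delta can carry more than one token.

```python
import time
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class StreamStats:
    ttft_s: Optional[float] = None  # time to first token, in seconds
    duration_s: float = 0.0
    chunks: int = 0

    @property
    def chunks_per_s(self) -> float:
        return self.chunks / self.duration_s if self.duration_s > 0 else 0.0


def measure(tokens: Iterable[str], stats: StreamStats) -> Iterator[str]:
    """Pass tokens through unchanged while recording timing into `stats`."""
    start = time.perf_counter()
    try:
        for token in tokens:
            if stats.ttft_s is None:
                stats.ttft_s = time.perf_counter() - start
            stats.chunks += 1
            yield token
    finally:
        # Runs on normal completion and when the consumer aborts early.
        stats.duration_s = time.perf_counter() - start


def fake_llm() -> Iterator[str]:
    time.sleep(0.05)  # simulated delay before the first token
    yield from ["Hi", " there", "!"]


stats = StreamStats()
text = "".join(measure(fake_llm(), stats))
```

After the stream ends, `stats` can be written to your logs or metrics system next to the request ID.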

FAQ

Is ChatGPT's typing effect fake?

In production streaming APIs, no — text arrives as the model generates. Some marketing demos and tutorial apps fake it with timers. Check whether your backend uses stream: true.

SSE or WebSocket for my chatbot?

SSE (or chunked HTTP) for most chat + RAG bots. WebSocket when you need constant bidirectional traffic on one connection.

Why does nothing appear until the full answer is ready?

Almost always proxy or framework buffering. Add X-Accel-Buffering: no, disable response buffering in your WSGI/ASGI middleware, and verify with curl -N.

Can I stream markdown safely?

Append raw text live; run markdown rendering on a short debounce or after done to avoid broken partial syntax (unclosed ** or code fences).
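
A rough heuristic for deciding when partial text is safe to render, sketched in Python for clarity (the same check is easy to port to the frontend). `safe_to_render` is a hypothetical helper. It only looks at code fences and `**` markers, so treat it as a starting point rather than a markdown validator.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
FENCE_LINE = re.compile(r"^" + FENCE, re.MULTILINE)
FENCED_BLOCK = re.compile(r"^" + FENCE + r".*?^" + FENCE, re.MULTILINE | re.DOTALL)


def safe_to_render(markdown: str) -> bool:
    """Return True when no code fence or bold marker is left open."""
    if len(FENCE_LINE.findall(markdown)) % 2 == 1:
        return False  # still inside an unclosed code fence
    # Ignore completed code blocks when counting bold markers.
    outside = FENCED_BLOCK.sub("", markdown)
    return outside.count("**") % 2 == 0


closed_bold = safe_to_render("Use **bold** text")
open_bold = safe_to_render("Use **bold")
open_fence = safe_to_render(FENCE + "python\nprint(1)")
closed_fence = safe_to_render(FENCE + "python\nx = 2 ** 3\n" + FENCE + "\ndone")
```

When the check fails, keep showing the raw text and retry on the next debounce tick or when the done event arrives.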

Does streaming cost more?

Same tokens billed. Streaming improves UX and lets you cancel early — which can reduce cost on aborted long answers.

Glossary

Term	Meaning
SSE	Server-Sent Events — HTTP stream of `data:` lines from server to browser
TTFT	Time to first token — latency until first visible character
Token delta	Incremental text chunk from the LLM's stream
ReadableStream	Browser API to read a fetch response body chunk by chunk
EventSource	Browser API for GET-based SSE subscriptions
Buffering	Proxy holding chunks until response completes — kills live UX

Stream real tokens. Skip the fake typewriter.