How to Publish a Knowledge Base to GitHub Pages

By Amar Kumar

A production pipeline for keeping a RAG chatbot and a public docs site in sync — with shared cloud storage for editing and GitHub Pages for the public site.

Teams often need two outputs from the same markdown: a public documentation site and an optional vector index for a RAG chatbot. This article describes a repeatable pipeline — drafts in one folder, approved pages in another, a sync job that processes only changes, and a static site build for GitHub Pages.

The pattern works with any shared folder or cloud storage — Dropbox, S3, Google Drive, or a git-backed workflow.

If you are new to RAG itself, start with our companion guide: How to Build a Production RAG Chatbot.

Who is this for? Engineers shipping a doc-backed chatbot who want a repeatable publish path — not a one-off script that re-indexes everything on every run.

Figure: Conceptual file counts by top-level folder (illustrative proportions)

Why this layout

Three problems to solve at once:

The fix is a strict separation: drafts/ for drafts, published/ as the sole sync source for published markdown, and site/ as a generated mirror for the static site.

Recommended folder layout

Pick a root folder in your storage (example: docs/). Adapt names to your team — the roles below matter more than the exact paths.

| Path | Purpose | Synced by job? |
| --- | --- | --- |
| drafts/ | Staging: work in progress, not indexed or published | No |
| published/ | Approved .md files; sole input for the sync job | Yes (read) |
| site/ | Mirror of published content, formatted for MkDocs nav | Yes (write) |
| metadata/sync-state.json | Incremental sync state (content hashes, chunk IDs) | Yes (read/write) |
| assets/ | Images and files referenced by markdown | Copied to site on sync |

docs/
├── drafts/                 ← staging (not indexed)
├── published/              ← approved .md — sync source
├── site/                   ← static site output (generated)
├── metadata/
│   └── sync-state.json
└── assets/                 ← images and attachments

Two pipelines: ingest and publish

Content enters through human or AI-assisted review. Only the publish pipeline touches the vector index and GitHub.

Offline — drafts to published (editorial)

Draft content lands in drafts/. An editor promotes approved pages into published/. Nothing in drafts/ is embedded or published.

Sync — published to vectors and GitHub Pages

The sync job reads only published/**/*.md (plus tracker state). Changed files are chunked, embedded, upserted to your vector database, copied into the MkDocs tree, built, and pushed.
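As a rough illustration of the "reads only published/**/*.md" rule, the sketch below lists candidate files. It assumes the storage root is available as a local folder (for example a synced or mounted directory); a storage API client would need its own listing call. The function name and the `root` parameter are illustrative, not part of any particular library.

```python
from pathlib import Path


def list_published_md(root: Path) -> list[Path]:
    """Return every .md file under root/published, sorted for stable diffs."""
    published = root / "published"
    # rglob walks nested folders, which matches the published/**/*.md pattern.
    # drafts/, site/, metadata/ and assets/ are never visited.
    return sorted(p for p in published.rglob("*.md") if p.is_file())
```

Restricting the walk to one folder is what keeps drafts out of the index: nothing outside published/ can reach the embedding step, whatever its file extension.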

Figure: Editorial funnel: drafts → reviewed → published → indexed → live on GitHub Pages

Sync job flow

A typical run moves through six stages. Embedding dominates wall time; listing and the git push are comparatively cheap.

Figure: Typical sync job time split by stage (% of total duration)

  1. List files: walk published/, compare against sync-state.json
  2. Load markdown: download changed .md from storage (pooled HTTP)
  3. Chunk + embed: split text, call embedding API in batches
  4. Vector DB upsert: write vectors; delete stale chunk IDs for removed sections
  5. Mirror site/: copy changed pages + assets into MkDocs tree
  6. Build + push: mkdocs build, commit site/, push to gh-pages branch
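Stage 3 calls the embedding API in batches. A minimal sketch of that batching follows. Here `embed_batch` is a hypothetical callable standing in for whichever embedding client you use, and the default batch size of 64 is an arbitrary assumption, not a recommendation from any provider.

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    chunks: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[list[float]]],
    batch_size: int = 64,
) -> list[list[float]]:
    """Embed all chunks, one API call per batch, preserving input order."""
    vectors: list[list[float]] = []
    for batch in batched(chunks, batch_size):
        result = embed_batch(batch)  # hypothetical client call
        if len(result) != len(batch):
            raise ValueError("embedding client returned a mismatched batch")
        vectors.extend(result)
    return vectors
```

Because embedding tends to dominate wall time, fewer and larger requests usually help, up to whatever per-request limit your provider enforces.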

Draft to published workflow

Any source — product notes, imports, or AI-generated drafts — should become structured markdown in drafts/, then move to published/ only after review.

Each page should have a clear title, concise body, and optional front matter (status, topic). Editors fix tone and redact sensitive details before promotion. The next sync run picks up new published files automatically.

# drafts/new-page.md
---
status: draft
---

# Page title

Body text goes here.

# After review → published/new-page.md
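A reviewer tool might want to read the front matter and confirm a page has a title before promotion. The sketch below assumes front matter is a flat list of `key: value` lines between two `---` delimiters, as in the example above; nested YAML would need a real YAML parser. Both function names are illustrative.

```python
def split_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Split a markdown page into (front matter dict, body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front matter block at all
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            meta: dict[str, str] = {}
            for line in lines[1:i]:
                if ":" in line:
                    key, value = line.split(":", 1)
                    meta[key.strip()] = value.strip()
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return meta, body
    return {}, text  # opening delimiter without a closing one


def has_title(body: str) -> bool:
    """True when the body contains a top-level markdown heading."""
    # Simplification: a "# " line inside a code sample would also match.
    return any(line.startswith("# ") for line in body.splitlines())
```

For the draft shown above, `split_front_matter` would return `{"status": "draft"}` as the metadata and a body that starts with `# Page title`.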

Incremental sync

Full re-index on every run does not scale. sync-state.json stores a content hash per published file. Only new files and files whose hash changed go through embed + upsert.

{
  "last_sync": "2026-06-12T14:30:00Z",
  "files": {
    "/published/page-a.md": {
      "hash": "sha256:a3f9...",
      "chunk_ids": ["page-a_0", "page-a_1"],
      "indexed_at": "2026-06-10T09:00:00Z"
    },
    "/published/page-b.md": {
      "hash": "sha256:b71c...",
      "chunk_ids": ["page-b_0"],
      "indexed_at": "2026-06-12T14:30:00Z"
    }
  }
}
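One way to produce hashes in the form shown above is a SHA-256 digest of the file bytes. This sketch assumes the `sha256:` prefix is followed by the hex digest, and that the tracker has been loaded into a plain dict with the same shape as the JSON; both function names are illustrative.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a stable digest in the "sha256:<hex>" form used by the tracker."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def needs_reindex(tracker: dict, path: str, data: bytes) -> bool:
    """True when the file is new or its content hash differs from the tracker."""
    entry = tracker.get("files", {}).get(path)
    return entry is None or entry["hash"] != content_hash(data)
```

Python's built-in `hash()` is not suitable here: for strings and bytes it is salted per process by default, so the same content would produce a different value on the next run.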

On delete: remove the file entry from the tracker, delete its chunk IDs from the vector index, and drop the mirrored MkDocs page. Unchanged files are skipped entirely.
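The same bookkeeping applies when a page shrinks rather than disappears: chunk IDs that existed before but are not produced by the new version should be deleted from the index. The sketch below assumes chunk IDs follow the `<file stem>_<index>` pattern seen in the tracker example. Note that stem-based IDs would collide for two files with the same name in different folders, so a real scheme may need the full path.

```python
from pathlib import PurePosixPath


def chunk_ids_for(path: str, n_chunks: int) -> list[str]:
    """Build deterministic chunk IDs such as "page-a_0", "page-a_1"."""
    stem = PurePosixPath(path).stem
    return [f"{stem}_{i}" for i in range(n_chunks)]


def stale_chunk_ids(old_ids: list[str], new_ids: list[str]) -> list[str]:
    """Return IDs stored for the previous version that the new version lacks."""
    current = set(new_ids)
    return [cid for cid in old_ids if cid not in current]
```

If a page went from three chunks to two, `stale_chunk_ids` would return only the third ID, and that is the one to delete from the vector database.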

Figure: Relative sync duration: full re-index vs incremental (most files unchanged)

HTTP connection pooling

The storage API is called dozens of times per sync — list folder, download each changed file, upload site mirror, update tracker. Without connection reuse, every call pays a fresh TCP + TLS handshake (~150–300 ms per request on top of API latency).

Use a shared HTTP session with keep-alive for all storage calls in a run. Only the first request to each host pays the handshake; later requests reuse the open connection, and the saving adds up when listing and downloading 20+ files.

Figure: Average storage API call latency: new connection vs pooled session (ms)

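import os

import requests

# Assumption: both values are supplied through environment variables
STORAGE_API_BASE = os.environ["STORAGE_API_BASE"]
STORAGE_TOKEN = os.environ["STORAGE_TOKEN"]

# One session per sync run; reuse it across all storage API calls
session = requests.Session()
session.headers.update({
    "Authorization": f"Bearer {STORAGE_TOKEN}",
    "Content-Type": "application/json",
})

def download_file(path: str) -> bytes:
    """Download one file over the shared keep-alive session."""
    response = session.get(f"{STORAGE_API_BASE}/files/download", params={"path": path}, timeout=30)
    response.raise_for_status()  # fail loudly instead of returning an error body
    return response.content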

SSE progress streaming

Sync jobs can run several minutes. Expose progress via Server-Sent Events (SSE) so the admin UI shows live status without polling.

Each stage emits a JSON event: stage name, percent complete, files processed, and optional error detail. The client opens EventSource('/sync/stream') and updates a progress bar.

// SSE events during sync
event: progress
data: {"stage":"embed","file":"page-b.md","done":3,"total":5,"pct":60}

event: progress
data: {"stage":"vector","file":"page-b.md","chunks":4,"pct":72}

event: complete
data: {"files_indexed":2,"files_skipped":38,"duration_s":94}

Stages map cleanly to the sync pipeline: list → load → embed → vector → mkdocs → github → complete.
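On the server side, each event is a small block of text in the SSE wire format: an `event:` line, a `data:` line, and a blank line to end the event. A minimal formatter might look like the sketch below. The function name is illustrative, and how the string is written to the response depends on your web framework.

```python
import json


def format_sse(event: str, data: dict) -> str:
    """Serialize one Server-Sent Event with a JSON payload."""
    # Compact separators keep the payload on a single data: line;
    # json.dumps escapes any newlines inside string values.
    payload = json.dumps(data, separators=(",", ":"))
    return f"event: {event}\ndata: {payload}\n\n"
```

The response carrying these events should use the `text/event-stream` content type so that the browser's EventSource accepts the stream.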

Sync step checklist

| Step | Action | Updates tracker? |
| --- | --- | --- |
| 1 | Load sync-state.json from storage | Read |
| 2 | List all published/**/*.md paths and content hashes | - |
| 3 | Diff: new, changed, deleted vs tracker | - |
| 4 | Download changed files (pooled HTTP) | - |
| 5 | Chunk + embed changed content only | - |
| 6 | Upsert vectors to the vector database; delete removed chunk IDs | - |
| 7 | Mirror changed pages + assets to site/ | - |
| 8 | Run mkdocs build; push site/ to GitHub Pages | - |
| 9 | Write updated hashes and chunk IDs to tracker | Write |
| 10 | Emit SSE complete event with summary stats | - |

Code snippets

Sync job pseudocode

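def run_sync(emit):
    tracker = load_tracker()                    # download sync-state.json
    session = create_storage_session()          # pooled HTTP

    published_files = list_published_md(session)
    diff = compute_diff(published_files, tracker)  # new | changed | deleted
    to_index = diff.new + diff.changed             # file entries with .path and .hash

    emit("list", total=len(to_index))

    for path in diff.deleted:
        delete_vector_chunks(tracker.files[path]["chunk_ids"])
        remove_site_page(path)
        del tracker.files[path]

    for f in to_index:
        text = download_md(session, f.path)
        chunks = chunk_and_embed(text)
        upsert_vectors(f.path, chunks)

        # Remove chunks the previous version had that the new version no longer produces
        new_ids = [c.id for c in chunks]
        old_ids = tracker.files.get(f.path, {}).get("chunk_ids", [])
        delete_vector_chunks([cid for cid in old_ids if cid not in new_ids])

        mirror_to_site(f.path, text)
        # Store the listing's content hash so the next diff compares like with like;
        # built-in hash() is salted per process and is not stable across runs
        tracker.files[f.path] = {"hash": f.hash, "chunk_ids": new_ids}

    build_and_push_github_pages()
    save_tracker(tracker)                       # persist only after a successful push
    emit("complete", indexed=len(to_index))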

Incremental diff logic

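from dataclasses import dataclass

@dataclass
class Diff:
    new: list        # file entries not yet in the tracker
    changed: list    # file entries whose content hash differs
    deleted: set     # tracked paths no longer present in published/

def compute_diff(remote_files, tracker):
    remote_paths = {f.path for f in remote_files}
    tracked_paths = set(tracker.files.keys())

    deleted = tracked_paths - remote_paths
    new = [f for f in remote_files if f.path not in tracked_paths]
    changed = [
        f for f in remote_files
        if f.path in tracked_paths
        # tracker entries are dicts, matching the sync-state.json layout
        and f.hash != tracker.files[f.path]["hash"]
    ]
    return Diff(new=new, changed=changed, deleted=deleted)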

Glossary

| Term | Meaning |
| --- | --- |
| published/ | Published markdown; the only folder the sync job reads for indexing |
| drafts/ | Staging area; content must be reviewed before promotion to published/ |
| sync-state.json | JSON state file tracking content hashes and vector chunk IDs per file |
| Incremental sync | Re-index only files whose hash changed since the last run |
| Connection pooling | Reusing TCP/TLS connections across HTTP requests to the same host |
| SSE | Server-Sent Events: a one-way stream of progress updates to the browser |
| Site mirror | Generated copy of published content laid out for MkDocs navigation |
| GitHub Pages | Static site hosting from a repo branch (typically gh-pages) |
| Editorial workflow | Review path from draft content in drafts/ to approved pages in published/ |

Treat published/ as a contract: if a file is in published/, it is searchable and publishable. Everything else stays in drafts/ until a human approves it.