---
name: local-knowledge-infrastructure
description: "Build and maintain a self-hosted knowledge infrastructure: file server + vector search + note sync + version control. Covers Caddy/Tailscale Funnel for public file serving, ChromaDB/sentence-transformers for local RAG/vector search, Obsidian vault sync, and Git version protection. Particularly useful for servers with restricted internet (China firewall)."
version: 1.0.0
author: Hermes Agent
tags: [knowledge-base, file-server, rag, vector-search, obsidian, caddy, tailscale, chromadb, local-infrastructure]
---

# Local Knowledge Infrastructure

Build a complete self-hosted knowledge system: file browsing → semantic search → note sync → version protection.

## Architecture Overview

```
┌─────────────────────────────────────────────────────┐
│                 Tailscale Funnel                     │
│    https://<hostname>.<tailnet>.ts.net/              │
└──────────────────────┬──────────────────────────────┘
                       │ HTTPS
┌──────────────────────▼──────────────────────────────┐
│                   Caddy (:8080)                      │
│    /output/ → HTML reports (generated)              │
│    /hermes/ → Skills, memories, plans (symlinked)   │
│    /obsidian/ → Notes (synced from WSL)             │
│    /projects/ → Code (synced from WSL)              │
│    /search → Semantic search API                    │
└──────────────────────┬──────────────────────────────┘
                       │
          ┌────────────┼────────────┐
          ▼            ▼            ▼
     🧠 ChromaDB   📁 ~/files/   🔄 rsync
     (vector DB)    (file store)  (WSL sync)
```

## Components

| Layer | Tool | Purpose |
|-------|------|---------|
| Web server | Caddy | File browsing, MD serving, path routing, reverse proxy |
| Public exposure | Tailscale Funnel | HTTPS access, mobile-ready, shareable links |
| Vector DB | ChromaDB | Persistent vector storage |
| Embedding | transformers (local) | Local text embedding (all-MiniLM-L6-v2, 384-dim) loaded from disk cache with no network dependency |
| Indexing | Custom Python script | Incremental file scanning via `os.walk(followlinks=True)`, chunking, hash-based change detection |
| Search API | FastAPI + uvicorn | REST API for semantic search, proxied through Caddy to /search |
| Search UI | Vanilla HTML+JS | Dark-themed search page with live preview, score display, file links |
| Sync | rsync cron | Pull Obsidian vault from WSL via `--rsync-path='wsl.exe rsync'`, every 15 minutes |
| Version | Git + GitHub (SSH) | Config/skills/memories backup. SSH is mandatory for servers behind China firewall (HTTPS to github.com blocked) |
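
The indexing row above hinges on hash-based change detection. A minimal sketch of the idea (function names and the extension filter are illustrative, not the actual `indexer.py`):

```python
import hashlib
import os
from pathlib import Path

def file_hash(path: str) -> str:
    # Content hash: a file is re-indexed only when its bytes change
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def changed_files(root: str, status: dict) -> list[str]:
    """Return files whose hash differs from the recorded status, updating status in place."""
    changed = []
    # followlinks=True so symlinked directories (e.g. hermes/ -> ~/.hermes/) are traversed
    for dirpath, _dirs, files in os.walk(root, followlinks=True):
        for name in files:
            if not name.endswith(('.md', '.txt', '.py')):  # illustrative filter
                continue
            path = os.path.join(dirpath, name)
            h = file_hash(path)
            if status.get(path) != h:
                changed.append(path)
                status[path] = h
    return changed
```

On a second run over an unchanged tree, `changed_files` returns an empty list, which is what makes incremental re-indexing cheap.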

## Phase Plan

### Phase 1: File Serving
Install Caddy → Create directory structure → Configure routes → Expose via Funnel

### Phase 2: Vector Search
Install ChromaDB + sentence-transformers → Download model → Write indexer → Full scan

### Phase 3: Search UI
FastAPI + uvicorn search API → ChromaDB query → HTML search page → Caddy reverse proxy to /search

### Phase 4: Obsidian Sync (Cloud Pull Mode)
rsync via Windows SSH + `--rsync-path='wsl.exe rsync'` → cron job every 15 minutes → Indexer picks up new files

### Phase 5: Version Protection
Git init → GitHub repo (create manually at github.com/new then git push via SSH) → cron auto-commit → Recovery script

## Storage Tiering

| Tier | Content | Location | Sync | Accessible |
|------|---------|----------|------|------------|
| L1 | Skills, config, memories | Cloud + WSL | Git | ✅ Browse + Search |
| L2 | Obsidian notes (.md) | Windows + Cloud copy | rsync every 15 min | ✅ Browse + Search |
| L3 | Code projects (light) | WSL + Cloud copy | rsync periodic | ✅ Browse + Search |
| L4 | PDFs, datasets, large projects | WSL only | None | ⚠️ Via proxied query |
| L5 | Generated HTML reports | Cloud only | N/A | ✅ Browse + Search |

## Restricted-Network Workarounds

Servers in mainland China or behind restrictive firewalls cannot reach:
- `huggingface.co` — Use `hf-mirror.com` via `HF_ENDPOINT=https://hf-mirror.com`
- `github.com` (HTTPS port 443) — `api.github.com` REST API usually works, but git protocol (push/clone) over HTTPS often times out from China. **SSH always works as an alternative.**
- `ollama.com` — Installation scripts may time out; use sentence-transformers as alternative

See `references/hf-mirror-workaround.md` for detailed setup.

### GitHub Push Workaround (When HTTPS is Blocked)

When `git push` over HTTPS times out but `api.github.com` responds instantly:

**Diagnose the network first:**
```bash
# Test API (usually works)
curl -s -o /dev/null -w '%{http_code} %{time_total}s' https://api.github.com && echo " API ok"

# Test git protocol (often blocked)
GIT_TERMINAL_PROMPT=0 git ls-remote origin HEAD  # times out? HTTPS is blocked

# Test SSH (usually works)
ssh -T -o ConnectTimeout=10 git@github.com
# If it connects but says "Permission denied", SSH is viable — just need to add the key
```

**Solution: Switch to SSH authentication**

```bash
# 1. Generate SSH key on the cloud server
ssh-keygen -t ed25519 -f ~/.ssh/github_hermes -N '' -C 'hermes-agent'

# 2. Add the public key to GitHub (MANUAL — fine-grained PATs can't manage keys via API)
cat ~/.ssh/github_hermes.pub
# Go to https://github.com/settings/keys → New SSH Key → Authentication Key → paste

# 3. Update remote URL
git remote set-url origin git@github.com:user/repo.git

# 4. Push (now works because SSH protocol bypasses HTTPS filtering)
git push -u origin main
```

**Pitfalls:**
- Fine-grained PATs (`github_pat_...`) cannot create repos or manage SSH keys via API even with all permissions. Must create repo manually at github.com/new and add SSH keys through the web UI.
- When testing SSH, verify the key is an **Authentication Key** (not Signing Key) in the GitHub settings.
- After switching to SSH, the existing credential store (`~/.git-credentials`) is unused — git uses the SSH key directly.

### Fine-Grained PAT Capabilities

GitHub fine-grained personal access tokens (`github_pat_...`) have limited default permissions:

| Operation | Classic PAT (repo scope) | Fine-grained PAT |
|-----------|-------------------------|-----------------|
| `gh repo create` | ✅ Works | ❌ Needs "Repository creation: write" in Account permissions |
| Push via git protocol | ✅ Works | ✅ Works (with Contents: write on the repo) |
| Push via `/api/repos/*/contents/*` | ✅ Works | ❌ Needs "Contents: write" on each repo |
| Manage SSH keys via API | ✅ Works | ❌ Cannot manage user keys |
| Create repos via API | ✅ Works | ❌ Needs Account permission |

**Recommendation:** For automation servers, generate a **classic PAT** with `repo` scope. Fine-grained PATs require too many manual permission tweaks and still can't do key management via API.

## Search API (FastAPI + Caddy Reverse Proxy)

The search UI is served by a FastAPI backend (port 8900) with Caddy acting as reverse proxy:

```
/search/ → Caddy handle_path /search/* → reverse_proxy 127.0.0.1:8900 → FastAPI serves HTML
/api/*   → Caddy handle /api/*        → reverse_proxy 127.0.0.1:8900 → FastAPI serves REST
```

### Search Backend

The `search_api.py` FastAPI server:
- Loads the same ChromaDB collection and transformer model as the indexer
- Provides `/api/search?q=query&top_k=20` — returns ranked results with path, title, score, preview
- Provides `/` — serves the static `index.html` search UI
- CORS is enabled for all origins
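
A sketch of the result shaping between the ChromaDB query and the JSON the UI consumes. `format_results` is a hypothetical helper, but the parallel-list layout of the input is what ChromaDB's `query()` actually returns:

```python
def format_results(raw: dict) -> list[dict]:
    """Zip ChromaDB's parallel result lists into ranked records for the UI.

    raw looks like {'metadatas': [[...]], 'documents': [[...]], 'distances': [[...]]}
    with one inner list per query text.
    """
    results = []
    for meta, doc, dist in zip(raw['metadatas'][0], raw['documents'][0], raw['distances'][0]):
        results.append({
            'path': meta.get('path', ''),    # relative path -> clickable file link
            'title': meta.get('title', ''),  # file stem shown as the result heading
            'score': round(1.0 - dist, 4),   # distance flipped into a similarity-style score (assumption)
            'preview': doc[:200],            # first 200 chars of the matching chunk
        })
    return results
```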

### Search UI

Vanilla HTML + CSS + JS, single-file page at `/home/ubuntu/files/search-ui/index.html`:
- GitHub-dark theme
- Semantic search with score display
- Clickable file paths (linked to the Caddy file browser)
- URL-based query persistence (`?q=...`)
- Loading spinner and error states

### Caddy Config for Search

```caddy
# Search UI
handle_path /search/* {
    reverse_proxy 127.0.0.1:8900
}
redir /search /search/

# Search API
handle /api/* {
    reverse_proxy 127.0.0.1:8900
}
```

**⚠️ Pitfall:** `handle_path` strips the matched prefix before proxying, while `handle` passes the full path through. If the FastAPI routes expect `/search/` → root, use `handle_path`. If they expect `/api/search`, use `handle` to preserve the path.

**⚠️ Pitfall:** Chinese characters in search queries may get garbled by the shell. When testing, use `--data-urlencode` with curl:
```bash
curl -s -G 'http://localhost:8080/api/search' --data-urlencode 'q=中文' --data-urlencode 'top_k=3'
```

```bash
# Start Caddy (use an absolute config path — the shell expands `~` with its own
# HOME before the `HOME=` prefix takes effect)
HOME=/home/ubuntu caddy run --config /home/ubuntu/files/Caddyfile

# Reload Caddy config
HOME=/home/ubuntu caddy reload --config /home/ubuntu/files/Caddyfile

# Expose with Funnel
tailscale funnel --bg 8080

# Run indexer
HF_ENDPOINT=https://hf-mirror.com /path/to/venv/bin/python ~/files/indexer.py

# Check funnel status
tailscale funnel status

# View chromadb count
/path/to/venv/bin/python -c "
import chromadb
c = chromadb.PersistentClient('/home/ubuntu/files/chroma_db')
print(c.get_collection('hermes_files').count())
"
```

## Support Files

| File | Type | Purpose |
|------|------|---------|
| `scripts/indexer.py` | Script | Incremental file scanner → chunker → ChromaDB indexer. Uses `os.walk(followlinks=True)`, hash-based change detection, 512-char chunks with 64-char overlap. |
| `templates/Caddyfile` | Template | Working Caddyfile with /output/ /hermes/ /obsidian/ /projects/ /search/ routes. |
| `references/hf-mirror-workaround.md` | Reference | Network-restricted setup guide: hf-mirror, Ollama alternatives, timeout tuning. |
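
The 512-char chunks with 64-char overlap from `scripts/indexer.py` can be sketched as below; the parameters come from the table, but the real script may differ in detail:

```python
def chunk_text(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    # Fixed-size sliding windows; the overlap keeps text that straddles a
    # chunk boundary intact in at least one chunk
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

A 600-char file yields two chunks whose trailing/leading 64 characters coincide; anything up to 512 chars stays a single chunk.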

## Pitfalls

### Indexing

- **Chunking large files** — Large files (e.g., Axolotl API references) can generate 300+ chunks; keep `batch_size` ≤ 32 for memory stability
- **Incremental indexing timeouts** — The status file is only written at the END of a successful run; partial runs (timeout) will re-index from scratch next time. Use `notify_on_complete=true` with background terminal, or save status incrementally if timeout is likely
- **Symlinks in Python** — `Path.rglob()` does NOT follow symlinks by default; use `os.walk(followlinks=True)` in `get_text_files()` instead
- **File counting inflated by symlinks** — `os.walk(followlinks=True)` also traverses symlinked directories such as `hermes/ → ~/.hermes/`, so the file count shown at startup jumps noticeably (e.g., from ~500 to ~829) once symlinked content is included
- **Metadata `path` and `title` fields** — ChromaDB metadata MUST include `path` (relative file path) and `title` (file stem) for the search UI to display clickable file links and titles. Without these, results show empty paths
- **Metadata backward compatibility** — The chroma_status.json format may change between versions (v2: `hash:str`, v3: `hash:{"hash":str,"ids":[]}`). On upgrade, check `isinstance(v, str)` and convert old format
- **ChromaDB readonly errors** — If two indexer processes run simultaneously against the same DB file, ChromaDB throws "attempt to write a readonly database" (SQLite locking). Always `pkill -f indexer.py` before clearing/restarting
- **HF_HOME / TRANSFORMERS_CACHE** — Set these environment variables in the indexer script to ensure the model is found in the local cache:
  ```python
  os.environ.setdefault('HF_HOME', '/root/.cache/huggingface')
  os.environ.setdefault('TRANSFORMERS_CACHE', '/root/.cache/huggingface')
  ```
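
The v2→v3 status-format migration from the backward-compatibility bullet, sketched (the wrapper name is illustrative):

```python
def migrate_status(status: dict) -> dict:
    # v2 stored {path: hash_str}; v3 stores {path: {'hash': str, 'ids': [chunk ids]}}
    migrated = {}
    for path, value in status.items():
        if isinstance(value, str):                       # old v2 entry
            migrated[path] = {'hash': value, 'ids': []}  # ids unknown -> rebuilt on next index
        else:
            migrated[path] = value
    return migrated
```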

### Offline Model Loading (China Firewall Workaround)

When `sentence-transformers` hangs on import or tries to download `processor_config.json` (which doesn't exist for text-only models like all-MiniLM-L6-v2), load the model with `transformers` directly from the cached snapshot:

```python
from pathlib import Path

from transformers import AutoTokenizer, AutoModel
import torch

MODEL_DIR = Path('/root/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2/snapshots/<hash>/')
tokenizer = AutoTokenizer.from_pretrained(str(MODEL_DIR), local_files_only=True)
model = AutoModel.from_pretrained(str(MODEL_DIR), local_files_only=True)

def embed(texts):
    encoded = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors='pt')
    with torch.no_grad():
        output = model(**encoded)
    # Mean pooling
    attention_mask = encoded['attention_mask']
    token_embeddings = output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    embeddings = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return torch.nn.functional.normalize(embeddings, p=2, dim=1)
```

**Why this works:** `sentence-transformers` v3+ integrates the `transformers` AutoProcessor, which tries to download absent configs (`preprocessor_config.json`, `processor_config.json`, `adapter_config.json`) even when the model already exists in the cache. Loading with `transformers` directly uses only the files that actually exist in the snapshot (`config.json`, `model.safetensors`, `tokenizer.json`, etc.).

**Pitfall:** Find the snapshot hash first:
```bash
ls /root/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2/snapshots/
```
Then verify the files exist:
```bash
ls /root/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2/snapshots/<hash>/
```

### Search API (FastAPI + Caddy Reverse Proxy)

- **Caddy route ordering matters** — More specific `handle`/`handle_path` blocks must come BEFORE the catch-all `handle { file_server browse }`. Caddy evaluates `handle` blocks top-to-bottom and stops at the first match
- **`handle` vs `handle_path`** — `handle_path /search/*` strips the `/search/` prefix before proxying (so FastAPI sees `/`). `handle /api/*` preserves the full path (FastAPI sees `/api/search?q=...`). Choose based on where your backend routes are defined
- **URL encoding** — When testing Chinese queries through the shell, use `curl -G --data-urlencode 'q=中文' ...` instead of inline URL parameters. FastAPI handles Unicode correctly, but the shell/curl layer may mangle it
- **CORS** — FastAPI must have `CORSMiddleware` with `allow_origins=['*']` for the separate web server origin
- **Start order** — Start the search API (`uvicorn search_api:app --host 127.0.0.1 --port 8900`) BEFORE Caddy, so Caddy can verify the backend is reachable on startup
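
On the URL-encoding bullet: from Python the problem disappears entirely if the stdlib builds the query string. A sketch (`search_url` is a hypothetical helper; the base URL and route are the ones assumed in this document):

```python
from urllib.parse import urlencode

def search_url(base: str, q: str, top_k: int = 20) -> str:
    # urlencode percent-encodes the UTF-8 bytes of q, so CJK queries survive intact
    return f"{base}/api/search?{urlencode({'q': q, 'top_k': top_k})}"
```

For example, `search_url('http://localhost:8080', '中文', 3)` produces `http://localhost:8080/api/search?q=%E4%B8%AD%E6%96%87&top_k=3`, matching the `--data-urlencode` curl test above.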

### Obsidian Sync (Cloud Pull Mode)

Instead of pushing from WSL (which requires running a service on the user's Windows machine), use a cron job on the cloud server to pull:

```bash
rsync -avz --delete --rsync-path='wsl.exe rsync' \
  ookii@windows-tailscale-ip:"/mnt/c/Users/ookii/Documents/ObsidianVault/" \
  /home/ubuntu/files/obsidian-vault/
```

**Setup steps:**
1. Ensure the cloud server has passwordless SSH access to the Windows machine (key in `~/.ssh/authorized_keys` or `administrators_authorized_keys`)
2. Create the sync script
3. Set up a cron job via `hermes cron create "*/15 * * * *" --name "obsidian-sync" --prompt "..." --deliver origin`
4. After each sync, clear `chroma_status.json` to trigger re-indexing of new/changed files

**Pitfall:** WSL must have `rsync` installed: `sudo apt install rsync`. The Windows machine does NOT need rsync — only WSL does, since `--rsync-path='wsl.exe rsync'` routes through WSL.

**Pitfall:** The cron job's heredoc prompt must specify the exact rsync command. If the path or IP changes, the cron job needs to be recreated.
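
Steps 2 and 4 of the setup can live in one script. A sketch, with paths matching the rsync command above; `build_rsync_cmd` and `sync_vault` are hypothetical names:

```python
import subprocess
from pathlib import Path

VAULT_SRC = 'ookii@windows-tailscale-ip:/mnt/c/Users/ookii/Documents/ObsidianVault/'
VAULT_DST = '/home/ubuntu/files/obsidian-vault/'
STATUS = Path('/home/ubuntu/files/chroma_status.json')

def build_rsync_cmd(src: str, dst: str) -> list[str]:
    # --rsync-path routes through WSL, so only WSL (not Windows) needs rsync installed
    return ['rsync', '-avz', '--delete', '--rsync-path=wsl.exe rsync', src, dst]

def sync_vault() -> bool:
    result = subprocess.run(build_rsync_cmd(VAULT_SRC, VAULT_DST))
    if result.returncode == 0:
        STATUS.unlink(missing_ok=True)  # step 4: trigger re-indexing on the next run
    return result.returncode == 0
```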

### Caddy & Web Server
- **Caddy needs HOME set** — Caddy resolves its config and data directories via `$HOME`; when running as root or from cron, export `HOME=/home/ubuntu` explicitly
- **System Caddy conflicts** — `apt install caddy` starts a :80 service; stop and disable it

### Python & Environment
- **Python venv issues** — hermes-agent's venv is at `~/.hermes/hermes-agent/venv/`; use it to avoid system package conflicts
- **`$HOME` not set in non-interactive shells** — Commands like `git config --global` and `caddy` fail without it. Explicitly export `HOME=/home/ubuntu` before running.
- **ChromaDB import hangs** — First import may download ONNX runtime; this can take 60+ seconds or fail silently. Use `timeout 60` for testing

### Network (China Firewall)
- **Chinese server network** — hf-mirror works but model download is slower; set generous timeouts (300s+)
- **Installers that time out** — Ollama's install script (`curl | sh`) 504s from China. Fall back to `sentence-transformers + ChromaDB`, which works fully offline after model download

### Git & GitHub
- **Fine-grained PAT (github_pat_...) cannot create repos** — GitHub fine-grained tokens need explicit "Repository creation: write" permission in the GitHub UI. A classic PAT with `repo` scope avoids this. Workaround: create the repo manually at github.com/new then set remote origin.
- **`gh repo create` fails silently** — If `gh auth login` succeeded but the token lacks scope, `gh` falls back to re-auth prompt. Use `export GH_TOKEN=...` to force the token, then check the error message for permission details.
- **`--push` requires at least one commit** — Run `git add && git commit` before `gh repo create --push --source .`
- **Multi-account confusion** — gh may register under a different account than the PAT owner. Check with `gh auth status` and always prefer `export GH_TOKEN=...` for deterministic auth.
