# Debug Snapshot Infrastructure

## Overview

A debug snapshot system that captures full system state (processes, logs, config, metrics, network) with one click. Designed for web services where bugs are hard to describe verbally — the snapshot provides all context needed to locate the root cause without manual reproduction.

## Architecture

```
User clicks "报告问题" ("Report a problem") in UI
          ↓
    Modal: describe the bug
          ↓
    POST /api/bug-report  { description }
          ↓
    Backend runs debug-snapshot.sh (subprocess, 30s timeout)
          ↓
    Saves: bug_<timestamp>.txt (description)
           snapshot_<timestamp>.tar.gz (system state)
          ↓
    Returns: { id, message }
          ↓
User tells agent the ID → agent reads files & locates root cause
```

## What the Snapshot Contains

| Module | Command / Source | Purpose |
|--------|-----------------|---------|
| System status | `uptime`, `free`, `df`, `ps` | CPU/MEM/DISK baseline |
| Process health | `pgrep`, `ss -tlnp` | Are services running? |
| API health | `curl /health`, `curl /api/search?q=test` | Is the service responding? |
| Config files | Caddyfile, search_api.py, indexer.py (redacted) | Current configuration audit |
| Index/DB status | ChromaDB size, file counts | Data integrity check |
| Network state | ports, connectivity to github/baidu | Network issues? |
| Recent changes | 24h file modifications | What changed before the bug? |
| Caddy logs | journalctl / log files | Request-level debugging |

## Backend Implementation (FastAPI)

### Required Endpoints

```python
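from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class BugReport(BaseModel):
    description: str  # free-text bug description from the modal

N = 0  # placeholder: replace with the real collection / chunk count

@app.get('/health')
async def health():
    return {'status': 'ok', 'collection': N, 'model': True}

@app.get('/api/stats')
async def stats():
    return {'total_chunks': N, 'status': 'ready'}

@app.post('/api/bug-report')
async def bug_report(report: BugReport):
    """Accept description, run snapshot, return report ID."""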
```

### Key Implementation Details

- Use `subprocess.run(['bash', snapshot_script_path], timeout=30)` with `capture_output=True`
- Save description + snapshot output to a `.txt` file in a `bug_reports/` directory
- Copy the generated `.tar.gz` to `bug_reports/` alongside the description
- Return `{ok: true, id: timestamp}` — agent uses `id` to locate files
- Add `.gitignore` entry for `bug_reports/`
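
The details above could be wired together roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual handler: the helper name `create_bug_report`, its arguments, and the `description / --- / output` file layout are illustrative, and the script is assumed to print its report to stdout. Copying the `.tar.gz` is left out because the location where the script writes it is not specified here.

```python
import subprocess
from datetime import datetime
from pathlib import Path


def create_bug_report(description: str, script_path: Path, bug_dir: Path,
                      timeout: int = 30) -> dict:
    """Run the snapshot script and save the description next to its output.

    Hypothetical helper: the name, arguments, and file layout are assumptions.
    """
    bug_dir.mkdir(parents=True, exist_ok=True)
    report_id = datetime.now().strftime('%Y%m%d_%H%M%S')

    try:
        result = subprocess.run(
            ['bash', str(script_path)],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        snapshot_output = result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        # Keep the report even if the snapshot could not finish in time.
        snapshot_output = f'snapshot timed out after {timeout}s'

    report_file = bug_dir / f'bug_{report_id}.txt'
    report_file.write_text(
        f'{description}\n\n---\n{snapshot_output}', encoding='utf-8'
    )
    return {'ok': True, 'id': report_id}
```

Note that `subprocess.run` blocks until the script exits. If the endpoint calls a helper like this directly, declaring the endpoint with plain `def` lets FastAPI run it in its thread pool instead of stalling the event loop.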

### The Snapshot Script

See the companion `debug-snapshot.sh` template in the `templates/` directory.

## Frontend Implementation

- Place a "🐛 报告问题" ("Report a problem") button in the sidebar footer
- Click opens a modal with:
  - Textarea for bug description (auto-resize)
  - Cancel + Submit buttons
  - Loading spinner during snapshot generation
- On success, show: report ID, download link, "close" button
- ESC key and overlay-click close the modal
- Enter/Cmd+Enter submits from the textarea

## Post-Report Workflow

When the user gives you (the agent) a report ID:

1. Read the description: `cat bug_reports/bug_<id>.txt`
2. Extract the snapshot (the target directory must exist first): `mkdir -p /tmp/bug_<id> && tar -xzf bug_reports/snapshot_<id>.tar.gz -C /tmp/bug_<id>/`
3. Read key files: system state, logs, config diffs
4. Apply Phases 1-4 of the `systematic-debugging` skill to locate the root cause
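
Steps 1 and 2 can also be done from Python. The sketch below assumes the `bug_<id>.txt` / `snapshot_<id>.tar.gz` naming shown above; the function name `open_report` and its parameters are hypothetical.

```python
import tarfile
from pathlib import Path


def open_report(report_id: str, bug_dir: Path, out_root: Path) -> tuple[str, list[str]]:
    """Return the description text and the member names of the snapshot archive."""
    description = (bug_dir / f'bug_{report_id}.txt').read_text(encoding='utf-8')

    # Create the extraction directory up front, mirroring `mkdir -p`.
    target = out_root / f'bug_{report_id}'
    target.mkdir(parents=True, exist_ok=True)

    archive = bug_dir / f'snapshot_{report_id}.tar.gz'
    with tarfile.open(archive, 'r:gz') as tar:
        names = tar.getnames()
        tar.extractall(target)
    return description, names
```

The returned member names give a quick table of contents for step 3 before opening individual files.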

## Status Tracking

Once the basic snapshot system works, add a status-tracking layer so the user can follow each report's lifecycle.

### Architecture

A single `reports.json` file in the `bug_reports/` directory tracks all reports:

```json
{
  "20260515_140702": {
    "id": "20260515_140702",
    "description": "点击搜索按钮后页面空白...",
    "status": "received",
    "status_label": "📩 已收到",
    "snapshot": "/home/ubuntu/files/bug_reports/snapshot_20260515_140702.tar.gz",
    "created_at": "2026-05-15 14:07:02",
    "updated_at": "2026-05-15 14:07:02"
  }
}
```

### Additional API Endpoints

```python
from datetime import datetime

from pydantic import BaseModel

class StatusUpdate(BaseModel):
    status: str  # 'received' | 'fixing' | 'fixed'

# User-facing labels shown in the (Chinese) UI
STATUS_MAP = {
    'received': '📩 已收到',   # "Received"
    'fixing': '🔧 正在修复',   # "Fixing"
    'fixed': '✅ 已完成',      # "Done"
}

@app.put('/api/bug-report/{report_id}/status')
def update_bug_status(report_id: str, update: StatusUpdate):
    """Called by agent when starting/finishing work on a bug."""
    if update.status not in STATUS_MAP:
        return {'ok': False, 'error': 'invalid status'}
    data = _load_reports()
    if report_id not in data:
        # Unknown ID: return an error instead of raising KeyError (HTTP 500)
        return {'ok': False, 'error': 'report not found'}
    report = data[report_id]
    report['status'] = update.status
    report['status_label'] = STATUS_MAP[update.status]
    report['updated_at'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    _save_reports(data)
    return {'ok': True, 'report': report}

@app.get('/api/bug-reports')
def list_bug_reports():
    """Returns all reports sorted newest-first with statuses."""
```
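
The listing endpoint is only described by its docstring. One way its newest-first ordering might look is sketched below; the helper name `newest_first` is hypothetical, and it relies on the `created_at` format shown in `reports.json` (`YYYY-MM-DD HH:MM:SS`), which sorts chronologically when compared as plain strings.

```python
def newest_first(data: dict) -> list:
    """Return report entries ordered by created_at, most recent first."""
    return sorted(data.values(), key=lambda r: r['created_at'], reverse=True)


sample = {
    '20260515_140702': {'id': '20260515_140702', 'created_at': '2026-05-15 14:07:02'},
    '20260516_090000': {'id': '20260516_090000', 'created_at': '2026-05-16 09:00:00'},
}
ordered_ids = [r['id'] for r in newest_first(sample)]
```

Whether the endpoint returns this list directly or wraps it in an object is not specified here, so the response shape is left open.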

### Helper Functions for reports.json Management

```python
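import json
from datetime import datetime

# BASE is assumed to be a pathlib.Path for the project root, defined elsewhere
BUG_DIR = BASE / 'bug_reports'
STATUS_FILE = BUG_DIR / 'reports.json'

def _load_reports():
    """Return all tracked reports, or an empty dict if none exist yet."""
    if STATUS_FILE.exists():
        return json.loads(STATUS_FILE.read_text(encoding='utf-8'))
    return {}

def _save_reports(data):
    BUG_DIR.mkdir(parents=True, exist_ok=True)  # first report: directory may be missing
    STATUS_FILE.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding='utf-8')

def _init_report(report_id, description, snapshot_path):
    # Load-then-add so existing entries are merged, not overwritten
    data = _load_reports()
    now = datetime.now().strftime('%Y-%m-%d %H:%M:%S')  # one timestamp for both fields
    data[report_id] = {
        'id': report_id,
        'description': description,
        'status': 'received',
        'status_label': '📩 已收到',  # "Received"
        'snapshot': str(snapshot_path or ''),
        'created_at': now,
        'updated_at': now,
    }
    _save_reports(data)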
```
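
Because every update rewrites the whole `reports.json`, a crash in the middle of a write could leave a truncated file. One possible hardening, which is an assumption on top of the helpers above rather than part of them, is to write to a temporary file in the same directory and swap it into place with `os.replace`. The function name `save_reports_atomic` is hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def save_reports_atomic(status_file: Path, data: dict) -> None:
    """Write data as JSON to a temp file, then replace status_file in one step."""
    status_file.parent.mkdir(parents=True, exist_ok=True)
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=status_file.parent, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w', encoding='utf-8') as tmp:
            json.dump(data, tmp, ensure_ascii=False, indent=2)
        os.replace(tmp_name, status_file)
    except BaseException:
        # Clean up the temp file if writing or replacing failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

This protects against partial writes only. Two requests that load, modify, and save at the same time can still overwrite each other's changes unless access is serialized, for example with a lock.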

### Agent Status Update (from CLI)

```bash
# API_BASE must hold the base URL of the running service (scheme, host, port)

# When starting to investigate
curl -X PUT "$API_BASE/api/bug-report/20260515_140702/status" \
  -H 'Content-Type: application/json' \
  -d '{"status":"fixing"}'

# When fix is complete
curl -X PUT "$API_BASE/api/bug-report/20260515_140702/status" \
  -H 'Content-Type: application/json' \
  -d '{"status":"fixed"}'
```
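
If the agent works from Python rather than a shell, the same request can be built with the standard library. This is a sketch: the helper name `build_status_request` is hypothetical, and the base URL in the example (`http://localhost:8000`) is an assumption about where the service listens.

```python
import json
import urllib.request


def build_status_request(base_url: str, report_id: str, status: str) -> urllib.request.Request:
    """Build (but do not send) the PUT request that updates a report's status."""
    body = json.dumps({'status': status}).encode('utf-8')
    return urllib.request.Request(
        url=f'{base_url.rstrip("/")}/api/bug-report/{report_id}/status',
        data=body,
        headers={'Content-Type': 'application/json'},
        method='PUT',
    )


# Assumed base URL; adjust to the actual deployment.
req = build_status_request('http://localhost:8000', '20260515_140702', 'fixing')
```

Passing `req` to `urllib.request.urlopen` sends it; building the request separately keeps the URL and payload easy to inspect first.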

### Frontend History Modal

Add a sidebar button "📋 报告记录" ("Report history") that opens a modal. The modal:

- Fetches `GET /api/bug-reports` on open
- Renders each report with: status icon, description (2-line clamp), colored status badge, timestamp, report ID
- Includes a manual refresh button
- Closes on ESC key or overlay-click

CSS for status badges:

```css
.hi-status.received {
  background: rgba(70,130,200,.12);
  color: #7ab0e8;
}
.hi-status.fixing {
  background: rgba(220,180,60,.12);
  color: #e8c54a;
}
.hi-status.fixed {
  background: rgba(126,198,153,.12);
  color: #7ec699;
}
```

### Migration: Adding to Existing System

When adding status tracking to an existing `bug_reports/` directory that already has `bug_*.txt` files:

```python
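# One-time migration
data = _load_reports()
for f in sorted(BUG_DIR.glob('bug_*.txt')):
    rid = f.stem[len('bug_'):]  # strip only the leading 'bug_' prefix
    if rid in data:
        continue  # already tracked
    # Parse the description: the text between the '描述:' ("Description:")
    # header and the '---' separator in the .txt file
    desc = f.read_text(encoding='utf-8').split('描述:\n', 1)
    desc_text = desc[1].split('\n\n---')[0].strip() if len(desc) > 1 else ''
    data[rid] = {
        'id': rid,
        'description': desc_text,
        'status': 'fixed',  # historical = already resolved
        'status_label': '✅ 已完成',  # "Done"
        # ... remaining fields as in _init_report
    }
_save_reports(data)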
```

## Pitfalls

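- **Log file not found**: uvicorn/FastAPI logs go to stderr by default, so no log file exists unless you create one. Always redirect output when starting the service, e.g. append `&> logs/search-api.log` to the start command (bash syntax).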
- **Script timeout**: The snapshot script must complete within 30s. If the system is under heavy load, some checks may hang. Add `timeout` to individual commands inside the script.
- **Sensitive data**: Config files containing API keys must be redacted before saving. Use `sed` to replace `sk-*`, `ghp_*`, and `key = "..."` patterns.
- **Startup race**: If the API auto-restarts, the snapshot may catch it while it is still loading and record a false failure. Add retries to the health check.
- **Caddyfile not at expected path**: The script must search for the actual Caddyfile location; don't hardcode.
- **async onsubmit pitfall**: When using `async function doSearch()` in HTML `onsubmit="return doSearch()"`, the async function returns a Promise (always truthy), so the browser submits the form and reloads the page. Fix: pass `event` to the handler and call `event.preventDefault()` at the top of the function.
- **Overwriting existing reports**: When `reports.json` already exists and a new report is submitted, `_init_report` must merge the new entry into the existing data rather than overwrite the file. Use `_load_reports()` → add entry → `_save_reports()`.
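
The redaction described under **Sensitive data** uses `sed` inside the snapshot script. For code paths that handle config text in Python, the same idea can be sketched with `re`. The patterns below are illustrative assumptions covering only the three shapes named above (`sk-*`, `ghp_*`, `key = "..."`), and the function name `redact` is hypothetical.

```python
import re

# Illustrative patterns only; extend them for the secrets your configs hold.
_PATTERNS = [
    (re.compile(r'sk-[A-Za-z0-9_-]+'), 'sk-[REDACTED]'),
    (re.compile(r'ghp_[A-Za-z0-9]+'), 'ghp_[REDACTED]'),
    # Keeps the `key = "` prefix and closing quote, replaces only the value.
    (re.compile(r'(key\s*=\s*")[^"]*(")', re.IGNORECASE), r'\1[REDACTED]\2'),
]


def redact(text: str) -> str:
    """Replace secret-looking substrings before the text is saved."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Running the redaction before anything is written to `bug_reports/` means an unredacted copy never reaches disk.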

## Related

- See `systematic-debugging` skill for the 4-phase process this tool supports.