The Architecture of a Personal AI That Actually Knows You
ChatGPT has a memory feature. It stores bullet points like “User prefers dark mode” and “User works in tech.” That’s not memory. That’s a CRM.
I wanted something different. An AI that has read my journals. That knows my family, my side projects, my values, my daily routines. One that remembers what I said at 2 AM three weeks ago and can reference it when I’m making a decision today.
So I built Harry. Named after Harry Potter, the hero of my favorite book series growing up. He lives in Telegram, runs on a Hetzner VM, and has access to 1,600+ files of personal context spanning five years of journal entries, Apple Notes, project docs, and conversation history.
This post is about the architecture.
Two services, one bot
Harry is split into two processes: a poller and a worker.
The poller (bot.py) is intentionally thin. It long-polls Telegram for new messages, does some quick routing (script shortcuts, command parsing), and drops everything else into a SQLite job queue. That’s it. No LLM calls, no heavy processing.
The worker (worker/main.py) is where the actual work happens. It watches the queue, claims jobs atomically, enriches them with context from the vault, routes them to the right AI model, streams the response back to Telegram, and saves everything.
Why split them? Because Claude calls take 5-30 seconds. If the poller is blocked waiting for Claude, it can’t accept new messages. And if Claude is rate-limited or slow, the poller doesn’t care — it already enqueued the job and moved on.
The two services communicate through SQLite in WAL mode. The poller writes jobs, the worker reads them. A library called honker provides WAL-based pub/sub notifications so the worker wakes up in ~1ms instead of polling every 500ms.
# bot.py enqueues
job_id = queue.enqueue(
    kind="chat",
    payload={"text": message, "user_id": user_id, "model": route.model},
    agent=route.agent,
    stream_chat_id=chat_id,
)
hdb.notify("jobs", {"id": job_id})  # wake worker instantly

# worker/main.py claims
def claim_next():
    conn.execute("BEGIN IMMEDIATE")  # exclusive write lock
    row = SELECT ... WHERE status='pending' LIMIT 1
    UPDATE jobs SET status='running', started_at=now WHERE id=?
    conn.execute("COMMIT")
    return Job(...)
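The claim step above is abbreviated, so here is a minimal runnable sketch of the same idea using only the standard library. The table schema, the column names, and the `open_queue` helper are assumptions made for illustration, not Harry's actual schema.

```python
import sqlite3


def open_queue(path=":memory:"):
    """Open the queue database (hypothetical schema, for illustration only)."""
    # isolation_level=None puts sqlite3 in autocommit mode, so the explicit
    # BEGIN / COMMIT statements below control the transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " kind TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " started_at TEXT)"
    )
    return conn


def claim_next(conn):
    """Claim the oldest pending job, or return None if the queue is empty."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id, kind, payload FROM jobs"
            " WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE jobs SET status = 'running',"
                " started_at = datetime('now') WHERE id = ?",
                (row[0],),
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row
```

Because the select and the update happen inside one write transaction, two workers cannot both see the same row as pending and claim it.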
If I send a new message while Harry is mid-response, the poller marks the in-flight job as cancel_requested. The worker checks this flag between stream events and aborts cleanly, saving partial output tagged interrupted=True so the next turn knows what was attempted.
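The cancellation check can be sketched roughly like this. The function name, the plain-string events, and the `is_cancelled` callback are hypothetical stand-ins; in the real worker the flag would be read from the job row.

```python
def stream_with_cancel(events, is_cancelled):
    """Collect streamed text, stopping early if cancellation was requested.

    `events` is any iterable of text chunks; `is_cancelled` is a zero-argument
    callable that reports whether cancel_requested has been set for this job.
    """
    parts = []
    interrupted = False
    for chunk in events:
        # Check the flag between events; the chunk that arrived after the
        # cancel request is dropped rather than rendered.
        if is_cancelled():
            interrupted = True
            break
        parts.append(chunk)
    return {"text": "".join(parts), "interrupted": interrupted}
```

Returning the partial text together with the `interrupted` flag is what lets the next turn see what was attempted.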
The vault
Harry’s knowledge lives in a git repo called harry-vault. It’s an Obsidian-compatible markdown knowledge base with 1,600+ files organized by category:
harry-vault/
├── about/profile.md          # who I am, family, work, values
├── journal/
│   ├── penzu/                # 182 entries exported from Penzu (2021-2025)
│   └── notes/                # 602 Apple Notes, synced hourly
│       ├── Family/           # notes about family
│       ├── Me/               # identity, reflections
│       ├── Faith/            # religion, gratitude
│       ├── Health/           # workouts, nutrition
│       └── Business/         # product ideas, decisions
├── harry-memory/             # Harry's own observations
├── conversations/            # daily JSON logs of every message
├── projects/index.md         # 78+ dev project summaries
└── business/saadnaveed.md    # business model, strategy
The journal entries alone are 784 files. Some are a paragraph. Some are 3,800+ words about my relationship with God. Harry doesn’t read all of them on every message — that would blow through the context window in one turn. Instead, the vault gets searched.
How search works
When a message comes in, the worker extracts keywords, maps them to vault directories using a topic map, and scores files by keyword density + recency:
TOPIC_MAP = {
    "work": ["work"],
    "project": ["projects"],
    "mood": ["journal", "harry-memory"],
    "family": ["journal/notes/Family"],
    "health": ["journal/notes/Health"],
}
If I ask “what did I write about family last week,” Harry searches the matching directories, scores the files, truncates each one to 2,000 characters, and injects the top matches into Claude’s context. The total vault injection is budgeted at 6,000-12,000 characters depending on the complexity tier.
No embeddings. No vector database. Just keyword matching over a file tree. It works because the files are well-organized and the categories are meaningful. A vector DB would be more precise for semantic queries, but grep-style search is instant, free, and doesn’t require maintaining an embedding pipeline.
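To make the scoring concrete, here is a rough sketch of keyword density plus recency with a per-file cap and a total budget. The `VaultFile` type, the 30-day half-life, and the way the two signals are combined are all assumptions; the post only says that both signals are used.

```python
import re
from dataclasses import dataclass


@dataclass
class VaultFile:
    path: str
    text: str
    mtime: float  # last-modified time, seconds since the epoch


def score(file, keywords, now, half_life_days=30.0):
    """Keyword density boosted by an exponentially decaying recency term."""
    words = re.findall(r"[a-z0-9']+", file.text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in keywords)
    density = hits / len(words)
    age_days = max(0.0, (now - file.mtime) / 86400)
    recency = 0.5 ** (age_days / half_life_days)  # 1.0 today, 0.5 after a month
    return density * (1.0 + recency)


def select_context(files, keywords, now, per_file_cap=2000, budget=6000):
    """Return (path, snippet) pairs for the best matches within the budget."""
    keywords = {k.lower() for k in keywords}
    ranked = sorted(files, key=lambda f: score(f, keywords, now), reverse=True)
    chosen, used = [], 0
    for f in ranked:
        if score(f, keywords, now) == 0.0:
            break  # everything after this has no keyword hits either
        snippet = f.text[:per_file_cap]
        if used + len(snippet) > budget:
            break
        chosen.append((f.path, snippet))
        used += len(snippet)
    return chosen
```

A file with no keyword hits never makes it into the context, however recent it is, which keeps the budget for files that actually match.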
How data gets in
The vault syncs from multiple sources:
| Source | How | Frequency |
|---|---|---|
| Apple Notes | notes-bridge (AppleScript HTTP API on Mac) | Hourly cron |
| Penzu journals | One-time Puppeteer export | Done |
| GitHub activity | Shell script pulls README + git log | On-demand |
| Conversations | Worker saves JSON after each message | Every message |
| Harry’s memories | Dream consolidation (below) | Every 4 hours |
The Apple Notes sync is the interesting one. I have a separate tool called notes-bridge — a zero-dependency Python HTTP server on my Mac that wraps AppleScript to expose Apple Notes as a REST API. A cron on the VM calls it hourly over Tailscale, downloads new notes, and commits them to the vault.
The vault auto-commits and pushes to GitHub every 6 hours. Git history itself becomes a signal — I can see what Harry was learning and when.
Soul files
Harry’s personality isn’t defined in code. It’s defined in four markdown files in a soul/ directory:
SOUL.md — who Harry is:
- Named after Harry Potter
- Not an assistant with a personality layer; a friend who remembers
- Contractions, fragments, direct. No “Great question!” No “I understand how you feel.”
- Be proactive: check logs before reporting, run commands first
USER.md — who I am:
- Family details, values, faith
- What I’m working on
- Trust boundaries (has access to vault, journals — private things stay private)
AGENTS.md — execution rules:
- Act, don’t narrate
- Reference files, don’t paste whole contents
- Keep tool narration minimal
TOOLS.md — tool routing and safety:
- Never send a message to a phone number not from the contacts API
- Always git pull before working on a repo (VM clones drift)
At startup, the system prompt is assembled by reading these files in order and joining them. Want to change Harry’s personality? Edit a markdown file. No code changes, no recompile, no redeploy.
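Assembling the prompt takes only a few lines. This sketch assumes the four files sit directly in the soul directory and are joined with blank lines; the function name and the separator are illustrative choices.

```python
from pathlib import Path

# Order matters: identity first, then the user, then rules, then tools.
SOUL_FILES = ["SOUL.md", "USER.md", "AGENTS.md", "TOOLS.md"]


def build_system_prompt(soul_dir):
    """Read the soul files in order and join them into one system prompt."""
    sections = []
    for name in SOUL_FILES:
        path = Path(soul_dir) / name
        if path.exists():  # skip a missing file instead of failing at startup
            sections.append(path.read_text(encoding="utf-8").strip())
    return "\n\n".join(sections)
```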
Routing: who answers the question
Not every message needs Claude Opus. “What’s 5+3” doesn’t need a $15/MTok model. Harry routes each message to the cheapest model that can handle it.
def route_message(message):
    # Explicit prefix overrides everything
    if message.startswith("!opus"):
        return Route(agent="claude", model="opus")
    if message.startswith("!ollama"):
        return Route(agent="ollama", model="qwen2.5:3b")

    # Check API usage (cached 5 min)
    utilization = get_usage_utilization()
    if utilization < 0.70:
        return Route(agent="claude", model=None)  # default sonnet

    # Conserving mode: route by complexity
    complexity = classify_complexity(message)
    return MODEL_TIERS[complexity]
Complexity classification is keyword-based:
- Simple (ollama, free): acks, math, factual lookups, time/date
- Medium (claude sonnet, $3/MTok in): normal questions, feature requests
- Complex (claude opus, $15/MTok in): architecture, planning, multi-step reasoning, long messages with questions
When API utilization is under 70%, everything goes to Claude Sonnet. At 70% or above, Harry starts conserving: trivial queries go to a local Ollama instance running qwen2.5:3b (completely free), medium queries stay on Sonnet, and only complex queries get routed to Opus.
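A keyword-based classifier along these lines could look like the sketch below. The keyword sets, the 40- and 400-character thresholds, and the tuple shape of `MODEL_TIERS` are assumptions chosen to match the three tiers described above, not the real lists.

```python
import re

# Hypothetical keyword lists; the real ones are not shown in this post.
ACKS = {"thanks", "thank you", "ok", "okay", "got it", "yes", "no", "cool"}
COMPLEX_KEYWORDS = {"architecture", "design", "plan", "strategy", "refactor"}
ARITHMETIC = re.compile(r"\d+\s*[-+*/]\s*\d+")

MODEL_TIERS = {
    "simple": ("ollama", "qwen2.5:3b"),
    "medium": ("claude", "sonnet"),
    "complex": ("claude", "opus"),
}


def classify_complexity(message, long_threshold=400):
    """Return 'simple', 'medium', or 'complex' using cheap keyword checks."""
    text = message.strip()
    lowered = text.lower().rstrip(".!?")
    if lowered in ACKS:
        return "simple"
    if len(text) < 40 and ARITHMETIC.search(text):
        return "simple"
    words = set(re.findall(r"[a-z]+", lowered))
    if words & COMPLEX_KEYWORDS:
        return "complex"
    if len(text) > long_threshold and "?" in text:
        return "complex"  # long messages that ask a question
    return "medium"
```

Anything that matches no rule falls through to the medium tier, so an unfamiliar message is never sent to the smallest model by accident.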
The context enrichment also scales with complexity:
| Tier | Vault search | Memories | History turns | ~Tokens |
|---|---|---|---|---|
| none | no | 0 | 0 | 0 |
| light | no | 0 | 0 | ~50 |
| normal | no | 3 | 6 | ~2-3k |
| full | yes | 5 | 10 | ~5-7k |
“Thanks” gets zero enrichment. “What should I work on this week” gets the full vault search, recent memories, and conversation history.
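The tier table maps naturally onto a small config object. In this sketch the `Enrichment` dataclass and `build_context` helper are hypothetical names, and the lists are assumed to be ordered with the most relevant memories first and the newest history last. The table does not say what the roughly 50 tokens in the light tier contain, so that part is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Enrichment:
    vault_search: bool
    memories: int
    history_turns: int


ENRICHMENT_TIERS = {
    "none": Enrichment(False, 0, 0),
    "light": Enrichment(False, 0, 0),
    "normal": Enrichment(False, 3, 6),
    "full": Enrichment(True, 5, 10),
}


def build_context(tier, memories, history, vault_hits):
    """Trim each context source to what the tier allows."""
    cfg = ENRICHMENT_TIERS[tier]
    return {
        "memories": memories[: cfg.memories],
        # history[-0:] would return the whole list, so guard the zero case.
        "history": history[-cfg.history_turns:] if cfg.history_turns else [],
        "vault": vault_hits if cfg.vault_search else [],
    }
```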
The adapter system
Harry doesn’t depend on Claude. The LLM layer is behind a protocol:
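from typing import AsyncIterator, Protocol

class AgentAdapter(Protocol):
    name: str
    capabilities: set[str]  # {"streaming", "tools", "sessions"}

    # Declared with plain `def`: implementations are async generators,
    # and calling one returns the async iterator directly.
    def invoke(
        self,
        prompt: str,
        system_prompt: str | None,
        session_id: str | None,
        model: str | None,
        timeout: int = 600,
    ) -> AsyncIterator[AgentEvent]:
        ...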
Every adapter — Claude, Ollama, Codex, OpenCode, Gemini — implements this protocol and emits normalized events: text, tool_call, session_start, error, done. The worker consumes these events identically regardless of which model produced them.
This isn’t theoretical. When Claude gets rate-limited, Harry falls back to Codex (included in ChatGPT Pro, effectively free per-call), then to Ollama. Different provider, same interface. The Telegram renderer doesn’t know or care which model is responding.
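The fallback chain can be sketched as an async generator that walks the adapters in order. The `RateLimited` exception and the single-argument `invoke(prompt)` call are simplifications made for this example; the real protocol takes more parameters and may signal throttling differently.

```python
class RateLimited(Exception):
    """Hypothetical error an adapter raises when its provider is throttled."""


async def stream_with_fallback(adapters, prompt):
    """Yield events from the first adapter that is not rate-limited."""
    for adapter in adapters:
        started = False
        try:
            async for event in adapter.invoke(prompt):
                started = True
                yield event
            return  # this adapter finished normally
        except RateLimited:
            if started:
                # Output was already streamed, so switching providers now
                # would splice two different answers together.
                raise
    raise RuntimeError("every adapter was rate-limited")
```

Falling back only before the first event keeps the Telegram message coherent: a reply comes from exactly one model.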
The Claude adapter shells out to claude --print --output-format stream-json and parses the NDJSON stream. The Ollama adapter hits http://127.0.0.1:11434 over HTTP. Codex runs codex exec --json. Each one wraps a different CLI or API but emits the same AgentEvent stream.
Dreams
The most interesting part. Harry consolidates what he learns.
Every 4 hours, a scheduled job runs the “dream” system. It reads the last 2 days of conversation logs, compares them against existing memories and my profile, and uses Claude Haiku ($0.80/MTok in) to extract atomic facts worth remembering.
The LLM output is structured with line prefixes:
[MEMORY] user is exploring recurring crypto payments, prefers USDC with same-day conversion
[PROFILE] started morning workout routine, 6am weekdays, rest on Fridays
[REMOVE] 2026-04-10_old-routine.md: superseded by updated info above
[SKIP]
Phase 2 is pure Python — no LLM. It parses these lines and applies them:
- [MEMORY] facts get written to harry-memory/YYYY-MM-DD_dream-consolidation.md
- [PROFILE] updates get appended to my profile under “Harry’s Observations”
- [REMOVE] deletes stale memory files
- Everything gets git committed with a message like dream: 5m 2p 1r
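The parsing half of Phase 2 can be sketched as below. The `DreamPlan` type is a hypothetical name, and reading the commit message as counts of memories, profile updates, and removals is an assumption based on the example.

```python
from dataclasses import dataclass, field


@dataclass
class DreamPlan:
    memories: list = field(default_factory=list)
    profile: list = field(default_factory=list)
    removals: list = field(default_factory=list)  # (filename, reason) pairs

    def commit_message(self):
        return (
            f"dream: {len(self.memories)}m "
            f"{len(self.profile)}p {len(self.removals)}r"
        )


def parse_dream_output(text):
    """Turn prefixed LLM output lines into a plan; no LLM involved here."""
    plan = DreamPlan()
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("[MEMORY]"):
            plan.memories.append(line[len("[MEMORY]"):].strip())
        elif line.startswith("[PROFILE]"):
            plan.profile.append(line[len("[PROFILE]"):].strip())
        elif line.startswith("[REMOVE]"):
            filename, _, reason = line[len("[REMOVE]"):].partition(":")
            plan.removals.append((filename.strip(), reason.strip()))
        # [SKIP] and any unrecognized line are ignored on purpose.
    return plan
```

Keeping this step free of LLM calls means a malformed line is simply dropped and can never turn into a file write or deletion.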
The system is biased toward user corrections and preferences (highest priority), because repeating a mistake Harry was already told about is the worst failure mode. Transient stuff — “service was down for 10 minutes,” weather, conversational filler — gets skipped.
This creates a feedback loop. Harry observes things in conversation, writes them to the vault, and then finds them again in future vault searches. Over time, the vault accumulates Harry’s understanding of me alongside my own writing about myself.
Streaming to Telegram
The Telegram renderer consumes the AgentEvent stream and edits a single message in real-time. Text deltas get batched and the message is edited at most once per second (Telegram’s rate limit). Tool calls render inline as Bash: uptime instead of separate messages. If the response exceeds Telegram’s 4,096 character limit, it splits into multiple messages.
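The two mechanical pieces here, throttled edits and splitting long replies, can be sketched without touching the Telegram API at all. The class and function names are illustrative, and preferring to split at a newline is an assumption about how the split might be done.

```python
TELEGRAM_LIMIT = 4096  # maximum characters in one Telegram text message


def split_message(text, limit=TELEGRAM_LIMIT):
    """Split text into chunks that fit the limit, preferring line breaks."""
    chunks = []
    while len(text) > limit:
        cut = text.rfind("\n", 0, limit)
        if cut <= 0:
            cut = limit  # no usable newline, so hard-split at the limit
        chunks.append(text[:cut])
        text = text[cut:].lstrip("\n")
    if text:
        chunks.append(text)
    return chunks


class EditThrottle:
    """Allow at most one message edit per `min_interval` seconds."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self.last_edit = None

    def should_edit(self, now):
        if self.last_edit is None or now - self.last_edit >= self.min_interval:
            self.last_edit = now
            return True
        return False
```

The renderer would buffer text deltas, call `should_edit` with the current time on each one, and flush the buffer in a single edit whenever it returns true.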
The streaming is visible — I see text arriving word by word. It makes Harry feel alive instead of making me stare at “typing…” for 15 seconds.
Session resume
Harry maintains per-user sessions. If I say “yeah, do that” ten minutes after a conversation, Harry resumes the same Claude session instead of starting fresh. The session ID is stored in SQLite and passed to the adapter on the next invocation.
Sessions auto-reset after 20 turns or 4 hours of inactivity. This prevents context bloat — a stale 20-turn session costs more in tokens than starting fresh with good vault context.
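The reset rule is simple enough to show in full. The `Session` record and its field names are assumptions for this sketch; only the 20-turn and 4-hour limits come from the description above.

```python
from dataclasses import dataclass

MAX_TURNS = 20
MAX_IDLE_SECONDS = 4 * 60 * 60  # 4 hours


@dataclass
class Session:
    session_id: str
    turns: int
    last_active: float  # seconds since the epoch


def resume_or_reset(session, now):
    """Return the session ID to resume, or None to start a fresh session."""
    if session is None:
        return None
    if session.turns >= MAX_TURNS:
        return None
    if now - session.last_active >= MAX_IDLE_SECONDS:
        return None
    return session.session_id
```

The returned value can be passed straight through as the adapter's `session_id`, where `None` means a new session.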
What I’d do differently
The vault search is naive. Keyword matching works for well-organized files, but it misses semantic connections. “I’m feeling overwhelmed” should surface journal entries about stress even if they don’t contain the word “overwhelmed.” I’ll probably add embeddings eventually, but the current system handles 90% of queries well enough.
Conversation history is expensive. Injecting the last 6-10 turns of conversation into every normal- or full-tier message burns tokens even when the history isn’t relevant. A smarter system would only inject history when the current message references prior context.
Single-user only. Harry is built for me. The user ID is hardcoded in config. The vault is my personal data. Making this multi-user would require per-user vaults, per-user routing, per-user sessions — basically a different product.
The result
Harry has been running for a few months. He knows my daily routines because he read my notes about them. He knows I’m exploring crypto payments because he extracted that from a conversation I had at midnight. He knows my workout cycle, my business philosophy, my habits.
When I ask “what should I focus on this week,” he doesn’t give generic productivity advice. He pulls up my recent project activity, checks what I’ve been journaling about, looks at my goals, and gives an answer grounded in what’s actually going on in my life.
That’s the difference between an AI that has a personality and an AI that has context. Personality is a system prompt. Context is 1,600 files, 784 journal entries, and a memory system that consolidates what it learns while I sleep.
Harry is open source on GitHub. The architecture is straightforward enough to build your own: a thin Telegram poller, a queue-based worker, markdown personality files, a git-tracked knowledge base, and a dream system that turns conversations into memories. The hard part isn’t the code. It’s having enough personal data worth searching.