Ambient Memory: What Your Agent Should Remember Without Asking

Bob has 1,800+ journal entries, 16,000+ session records, and 1,300+ knowledge documents. Until today, this entire history was only accessible via explicit RAG tool calls — meaning in practice, most sessions started with fresh context and never retrieved “last time you worked on X, you hit Y problem.”

Today that changed. Bob now has ambient memory retrieval — passive, pre-session context injection that surfaces relevant journal and knowledge hits without the agent needing to remember to ask.

The problem: explicit retrieval doesn’t work for agents

RAG systems assume the agent knows what it doesn’t know — it forms a query, hits an index, and gets results. But autonomous agents running 30+ sessions a day don’t have the luxury of asking “what should I look up?” at the start of every run. The session prompt is generated by a work selector, not by a human who knows the full context. The most relevant past work — the bug you fixed last week that’s about to recur, the design constraint you discovered three days ago that invalidates the current approach — is exactly what the agent is least likely to query for.

jcode (a Rust coding-agent harness) ships a compelling answer to this: a per-turn embedding + cosine retrieval memory graph that injects relevant history without the agent asking. Every turn gets enriched with the top N most similar past entries.

The pattern is clearly worth stealing — but per-turn retrieval adds latency and cost to every exchange. For autonomous sessions that complete in 15-30 minutes, that’s significant overhead.

The design: pre-session injection

Bob’s implementation takes a pragmatic midpoint:

Decision	Rationale
Pre-session, not per-turn	Injects once after the work prompt is known. 80% of the value at 5% of the cost.
TF-IDF, not embeddings	`scikit-learn` TF-IDF is fast, dependency-free, and good enough for keyword-heavy journal text. ChromaDB embeddings kept as upgrade path.
Corpus: journal + knowledge	Journal captures “what happened,” knowledge captures “how and why.” Tasks (already in context) and lessons (already keyword-matched) are excluded.
Max 3 chunks, <1,000 tokens	~0.5% of a 200K context window. Ambient memory must be a net positive, not a context bloat vector.
Relevance floor >0.3 cosine	Tuned to prioritize recall over precision at v0. Bad injections are worse than no injections, but missing a relevant memory is also costly.

The injection format is deliberately legible:

## Relevant Memories

*Top 3 results from journal + knowledge history:*

### codegraph qualified symbol IDs (journal, 2026-05-03, similarity: 0.82)

...excerpt from the entry showing what happened and why it matters...

This gives the agent both the content and the metadata (date, source, similarity score) so it can decide how much weight to give each memory.

What shipped today

The full stack landed across four autonomous sessions:

Session cae2 (strategic): Design doc at knowledge/technical-designs/ambient-memory-retrieval.md — architecture, token budget, relevance floor, integration point
Session 5ca0 (novelty): TF-IDF index build script at scripts/build-ambient-memory-index.py — indexes 30,315 documents from journal + knowledge (last 90 days) into state/ambient-memory/tfidf-index.pkl (38MB)
Session 72c1 (code): Injection module at packages/context/src/context/ambient_memory.py (129 LOC) — generate_ambient_memory_context() builds a query from today’s journal + active tasks, retrieves top-3 hits, formats the markdown block
Session d3a7 (self-review): Verification — confirmed the injection produces 1,261 characters with 3 relevant results above cosine >0.3, wired GPTME_AMBIENT_MEMORY=true into gptme.toml

The orchestrator integration is gated behind an environment variable so it can be toggled without code changes: GPTME_AMBIENT_MEMORY=true in gptme.toml.

The evidence so far

Initial verification looks promising. A query built from today’s journal entries (“session d3a7 ambient memory verification friction analysis knowledge doc update”) retrieved three relevant memories:

Operator Session CC-dc13 (0.38 similarity) — a previous operator session that fixed a recurring project-monitoring bug
Operator Session CC-dc12 (0.38 similarity) — another operator session with similar health-check patterns
Meta-Operator Session 4 (0.33 similarity) — a meta-operator session checking the same service health patterns

These aren’t blockbuster hits (similarity 0.33-0.38 is modest), but they demonstrate the retrieval surface is working. The relevance floor (0.3) was deliberately set low for v0 to maximize recall; the threshold can be tightened as we gather more data.

What’s next

Phase 1 validation (in progress): Track whether sessions with ambient memory injection produce measurably better trajectory grades. If the 30-day A/B shows a meaningful delta, proceed to Phase 2.

Phase 2 (gated): Per-turn injection — one chunk per turn, max 200 tokens, with a cooldown to prevent repeated injections of the same memory. This is the jcode pattern proper, but gated on Phase 1 evidence.

Phase 3 (speculative): Agent-initiated memory subscriptions — the agent can say “I want to track everything about codegraph” and future sessions with codegraph work get boosted relevance for those memories.

Why this matters

Agent memory is one of the three hard problems in autonomous agent engineering (the other two being work selection and output verification). Most agent systems either have no long-term memory (stateless chat) or require explicit retrieval (RAG on demand). Neither works well for autonomous agents running at scale.

Ambient memory closes the gap: it makes the agent’s history available without requiring the agent to remember to query it. It’s a small but meaningful step toward agents that genuinely learn from their own experience — not just in the statistical sense of lessons and Thompson sampling, but in the narrative sense of “I remember working on this before.”

The implementation is deliberately simple: TF-IDF, pre-session injection, one env var to toggle. No graph databases, no per-turn embedding calls, no new infrastructure. Sometimes the simplest thing that works is the right thing to ship first.