Understanding AI Memory Systems: A Guide to LLM Recall, Retrieval, and Real-World Performance

Large language models have become remarkably capable, but they fundamentally lack persistent memory. A conversation that ends today is gone tomorrow unless an external system stores and retrieves the relevant details. This challenge has produced an entire category of software, AI memory systems, whose purpose is to give language models the ability to recall users, conversations, facts, and preferences across sessions.

This guide examines how these systems work, the architectural choices they involve, the benchmarks used to evaluate them, and the gap between benchmark claims and real-world performance. The discussion draws on publicly available research and a head-to-head benchmark of the memory system powering Tizenegy, an AI companion platform that runs entirely on edge infrastructure.

The Memory Problem in Modern Language Models

Context Windows and Their Limits

Every language model operates within a context window, a fixed maximum amount of text the model can attend to at once. Frontier models such as Claude Opus 4.7 and GPT-5.4 currently support around one million tokens. While impressive, this capacity falls short of what is required for long-running relationships between users and AI systems. Six months of daily conversation easily exceeds a million tokens.

Why External Memory Is Necessary

External memory architectures store conversational content outside the model and retrieve only the most relevant fragments at query time. The approach combines two technologies:

  • Vector embeddings: numerical representations of text that capture semantic similarity
  • Vector databases: specialized storage systems that find the most similar embeddings to a query embedding

The combination allows the system to surface relevant memories without loading the entire history into the prompt.
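
As a minimal sketch of the idea, the toy example below ranks stored memories by cosine similarity to a query. The three-dimensional vectors stand in for real embedding-model output, which typically has hundreds of dimensions.

```python
# Toy sketch of embedding-based retrieval; 3-dimensional vectors stand in
# for real embedding output (typically 384-1024 dimensions).
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 for identical directions, near 0.0 for unrelated text."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stored memory embeddings (toy values).
memories = {
    "User's sister is named Emma": np.array([0.9, 0.1, 0.2]),
    "User works in technology":    np.array([0.1, 0.8, 0.3]),
}
# Embedding of the query "what is my sister's name?" (toy value).
query = np.array([0.85, 0.15, 0.25])

# Rank memories by similarity; the top-k results go into the prompt.
ranked = sorted(memories, key=lambda m: cosine_similarity(query, memories[m]),
                reverse=True)
print(ranked[0])  # "User's sister is named Emma"
```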

Standard Memory Pipeline Architecture

Most production memory systems follow a similar pipeline.

Stage 1: Extraction

After each conversation turn, important content is extracted from the exchange. Extraction can be performed by a dedicated language model or by rule-based systems; a minimal sketch follows the list below. Common extracted entities include:

  • People and their relationships
  • Places and locations
  • Organizations and roles
  • Events and dates
  • Preferences and stated opinions
  • Factual statements about the user
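
A minimal sketch of LLM-based extraction follows. The `call_llm` argument is a hypothetical stand-in for whatever model client a system uses, and the JSON schema is illustrative rather than any particular product's format.

```python
# Sketch of Stage 1 extraction with a small LLM. call_llm is a hypothetical
# stand-in for a model client; the JSON schema is illustrative only.
import json

EXTRACTION_PROMPT = """Extract structured memories from this exchange.
Return JSON with keys: people, places, organizations, events, preferences,
facts. Each value is a list of short strings.

Exchange:
{exchange}"""

def extract_memories(exchange: str, call_llm) -> dict:
    raw = call_llm(EXTRACTION_PROMPT.format(exchange=exchange))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {}  # small models sometimes emit invalid JSON; skip this turn
```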

Stage 2: Embedding

Extracted content (or in some systems, the raw conversation text) is passed through an embedding model that converts it into a fixed-length vector. The choice of embedding model has an outsized impact on retrieval quality.
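
As a sketch, using the open-source sentence-transformers library (one common way to run such models; production systems may serve them differently):

```python
# Sketch of Stage 2 with the sentence-transformers library. bge-base-en-v1.5
# yields 768-dimensional vectors; all-MiniLM-L6-v2 yields 384.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
vector = model.encode("User's sister is named Emma", normalize_embeddings=True)
print(vector.shape)  # (768,)
```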

Stage 3: Storage

Vectors are stored in a vector database, typically with metadata fields such as user ID, companion ID, timestamp, and category (see the storage sketch after this list). Commonly used vector databases include:

  • ChromaDB
  • Pinecone
  • Vectorize (Cloudflare)
  • Weaviate
  • pgvector
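
A minimal storage sketch using ChromaDB, the first store listed above; the metadata field names are illustrative:

```python
# Sketch of Stage 3 with ChromaDB. The metadata field names are illustrative.
import chromadb

client = chromadb.Client()  # in-memory; use a persistent client in production
collection = client.create_collection("memories")
collection.add(
    ids=["mem-001"],
    documents=["User's sister is named Emma"],
    metadatas=[{"user_id": "u42", "companion_id": "c7",
                "category": "fact", "timestamp": 1767225600}],
)
```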

Stage 4: Retrieval

When a new query arrives, it is embedded and the database returns the most similar stored vectors. Some systems apply additional ranking based on recency, importance, or metadata.
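
Continuing the ChromaDB sketch from Stage 3, a query with a metadata filter might look like this:

```python
# Sketch of Stage 4, continuing the ChromaDB collection from Stage 3.
results = collection.query(
    query_texts=["what is my sister's name?"],  # embedded automatically
    n_results=5,                                # top-k candidates
    where={"user_id": "u42"},                   # scope to this user's memories
)
print(results["documents"][0])                  # best-matching memory texts
```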

Stage 5: Prompt Augmentation

Retrieved memories are inserted into the language model’s prompt, typically in a system message that precedes the user’s current query.
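
A minimal sketch of the augmentation step, using the message format common to chat-completion APIs:

```python
# Sketch of Stage 5: retrieved memories become a system message that
# precedes the user's query.
def build_messages(memories: list[str], user_query: str) -> list[dict]:
    memory_block = "\n".join(f"- {m}" for m in memories)
    return [
        {"role": "system",
         "content": f"You remember the following about this user:\n{memory_block}"},
        {"role": "user", "content": user_query},
    ]
```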

Production Architecture: A Case Study

Tizenegy implements the pipeline described above on Cloudflare’s edge platform. Its architecture provides a useful reference point for the design choices that shape a real system.

Asynchronous Extraction

Memory extraction runs in a queue consumer rather than inline with the chat response. After every message pair, a small language model (Llama 3.1 8B) extracts entities, facts, preferences, and a summary. Because the extraction is asynchronous, it does not affect the chat-response latency the user experiences.
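
A sketch of the pattern, using a thread and an in-process queue as a generic stand-in for a managed queue service such as Cloudflare Queues:

```python
# Sketch of the asynchronous extraction pattern. An in-process queue stands
# in for a managed queue service such as Cloudflare Queues.
import queue
import threading

extraction_queue: queue.Queue = queue.Queue()

def extract_and_store(user_msg: str, reply: str) -> None:
    ...  # slow path: Stage 1 extraction and Stage 3 storage run here

def handle_chat(user_msg: str, reply: str) -> str:
    extraction_queue.put((user_msg, reply))  # hand off to the slow path
    return reply                             # returns before extraction runs

def consumer() -> None:
    while True:
        extract_and_store(*extraction_queue.get())

threading.Thread(target=consumer, daemon=True).start()
```

The essential property is that handle_chat returns before extraction starts, so extraction cost never appears in the user's perceived latency.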

Contradiction Detection

When extracted facts contradict existing facts (a user changes jobs, moves cities, or updates a preference), the older facts are invalidated with a timestamp. The new fact supersedes the old one in retrieval. This temporal awareness avoids the awkward situation where a system references outdated information about a user’s life.
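
A minimal sketch of timestamp-based invalidation, assuming facts are keyed by a (subject, attribute) pair; a production system might instead ask an LLM to judge whether two facts actually conflict:

```python
# Sketch of timestamp-based fact invalidation, keyed by (subject, attribute).
import time

facts: dict[tuple[str, str], list[dict]] = {}

def upsert_fact(subject: str, attribute: str, value: str) -> None:
    now = time.time()
    records = facts.setdefault((subject, attribute), [])
    for record in records:
        if record["invalidated_at"] is None and record["value"] != value:
            record["invalidated_at"] = now  # superseded, kept for history
    records.append({"value": value, "created_at": now, "invalidated_at": None})

upsert_fact("user", "city", "Berlin")
upsert_fact("user", "city", "Lisbon")  # Berlin is invalidated with a timestamp
active = [r["value"] for r in facts[("user", "city")]
          if r["invalidated_at"] is None]
print(active)  # ['Lisbon']
```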

Embedding and Storage

Memories are embedded with the bge-base-en-v1.5 model at 768 dimensions. The vectors are stored in Cloudflare Vectorize, namespaced by user identifier and filtered by companion identifier.

Composite Ranking

At retrieval time, the system applies a composite score to each candidate:

  • 50% semantic similarity (cosine similarity between the query and memory embeddings)
  • 30% recency (exponential decay with a six-day half-life)
  • 20% importance (a stored value that reflects the perceived relevance of the memory)

The composite score allows a moderately similar but recent memory to outrank a more semantically similar but older one. This pattern mirrors how human memory tends to weight recency.
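
A direct transcription of that scoring rule into code (the half-life constant follows the six-day figure above; everything else is a sketch):

```python
# Composite score: 50% similarity, 30% recency with a six-day exponential
# half-life, 20% stored importance.
import time

HALF_LIFE_SECONDS = 6 * 24 * 3600  # six days

def composite_score(similarity: float, created_at: float,
                    importance: float, now: float | None = None) -> float:
    now = time.time() if now is None else now
    recency = 0.5 ** ((now - created_at) / HALF_LIFE_SECONDS)  # 1.0 when new
    return 0.5 * similarity + 0.3 * recency + 0.2 * importance
```

With equal importance, a memory at similarity 0.80 from yesterday scores about 0.67 plus the importance term, while one at similarity 0.90 from a month ago scores about 0.46 plus the same term, so the recent memory wins.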

Benchmark Methodologies

Several public benchmarks evaluate AI memory systems. Each has its own methodology, strengths, and limitations.

LongMemEval

LongMemEval is a benchmark introduced in an ICLR 2025 paper. It contains roughly 500 questions across six categories:

  • Single-session factual questions
  • Multi-session reasoning questions
  • Temporal reasoning questions
  • Knowledge update questions (where information changes mid-corpus)
  • Preference questions
  • Abstention tests (questions the system should refuse to answer)

Each question is paired with a corpus of approximately 53 conversation sessions, of which only some contain the relevant information.

Two metrics are commonly reported; the sketch after this list shows how each is computed:

  • Recall@k: whether the correct session appears in the top-k retrieved results
  • NDCG@k: a rank-aware quality metric that accounts for the position of the correct result
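
Both metrics are simple to compute when, as in most LongMemEval questions, a single session is relevant (some questions have several; this sketch assumes one):

```python
# Sketch of both metrics, assuming a single relevant session per question.
# ranked_ids is the retriever's output in rank order.
import math

def recall_at_k(ranked_ids: list[str], relevant_id: str, k: int) -> float:
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def ndcg_at_k(ranked_ids: list[str], relevant_id: str, k: int) -> float:
    # With one relevant item, NDCG@k reduces to 1 / log2(rank + 1).
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0

print(recall_at_k(["s3", "s1", "s9"], "s1", k=2))  # 1.0: found in top 2
print(ndcg_at_k(["s3", "s1", "s9"], "s1", k=2))    # ~0.63: position 2, not 1
```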

LoCoMo

LoCoMo is a smaller benchmark that evaluates memory across very long conversations. It is useful as a complementary measure but carries methodological caveats: the conversations max out at 32 sessions, which means a high top_k value can render retrieval trivial.

What Benchmarks Do Not Measure

Benchmark scores capture retrieval quality but not the full user-facing behavior of a memory system. Specifically:

  • Whether the system uses the retrieved memory correctly in its response
  • Whether the system avoids contradictions with previous statements
  • Whether the system proactively recalls relevant context the user did not ask about
  • Whether retrieval latency is acceptable in production

The MemPalace Phenomenon

In April 2026, an open-source project called MemPalace, created by actress Milla Jovovich together with Ben Sigman, attracted significant attention. The repository accumulated more than 5,400 GitHub stars in its first 24 hours, with publicity reaching over 15 million people.

The Architectural Idea

MemPalace borrows from the ancient Method of Loci. Memories are organized into a hierarchical structure:

  • Wings: high-level groupings, typically people or projects
  • Rooms: topics within a wing
  • Halls: memory types such as facts, preferences, or events

Rather than searching all stored memories with vector similarity, the system first identifies the relevant wing and room, then searches only within that scope.
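
A minimal sketch of that two-stage lookup, with a keyword-based router standing in for whatever classifier a real implementation uses; the router is exactly where the misrouting discussed later can occur. The room keywords and search_fn are assumptions for illustration:

```python
# Sketch of palace-style scoped retrieval. search_fn is any metadata-filtered
# vector search, such as the Stage 4 query shown earlier.
ROOM_KEYWORDS = {
    ("family", "sister"): ["sister", "sibling", "emma"],
    ("work", "job"):      ["job", "office", "manager", "salary"],
}

def route_query(query: str) -> tuple[str, str] | None:
    q = query.lower()
    for room, keywords in ROOM_KEYWORDS.items():
        if any(kw in q for kw in keywords):
            return room
    return None  # no confident match

def scoped_search(query: str, search_fn):
    room = route_query(query)
    if room is None:
        return search_fn(query, metadata_filter=None)  # fall back to flat search
    return search_fn(query, metadata_filter={"room": "/".join(room)})
```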

The Claimed Performance

The project’s headline claims included:

  • 96.6% recall on LongMemEval
  • 100% recall on LoCoMo
  • Zero API calls (no commercial language-model dependency)
  • A 34% retrieval boost from metadata filtering

The Methodological Critique

A detailed technical analysis identified several concerns:

  • The 100% LoCoMo score used top_k=50 against a corpus of 32 sessions, meaning the retrieval returned more items than existed
  • The 96.6% LongMemEval number measures retrieval-only and does not evaluate whether the system can answer the question correctly
  • The “knowledge graph” component implements exact-match deduplication rather than the temporal contradiction detection the marketing described
  • The project’s own AAAK compression format scored 84.2% versus 96.6% for raw text, a 12.4 percentage-point drop

The repository’s internal BENCHMARKS.md file disclosed several of these limitations honestly. The disconnect was between that internal documentation and the public-facing claims.

Independent Benchmark Results

Running LongMemEval-S against a production memory system, with honest methodology, produces results that complicate the headline narrative.

Tizenegy with bge-base-en-v1.5

Mode                          Recall@5   Recall@10   NDCG@10
Raw text storage              96.0%      97.6%       91.3%
Full extraction pipeline      96.0%      97.6%       91.3%
Palace-style room filtering   94.6%      96.2%       90.0%

Tizenegy with all-MiniLM-L6-v2 (the model MemPalace used)

Mode                          Recall@5   Recall@10   NDCG@10
Raw text storage              73.6%      73.6%       71.2%
Palace-style room filtering   72.4%      72.4%       70.1%

Comparative Context

Other memory systems with reported LongMemEval performance:

System                        Recall@5   Notes
Tizenegy (bge-base, raw)      96.0%      Production baseline, no API calls
MemPalace (claimed, MiniLM)   96.6%      Not reproduced in independent tests
Mem0                          ~85%       Commercial cloud service
Zep                           ~85%       Commercial, Neo4j-backed
Mastra (GPT-based)            94.9%      Requires GPT API calls
Supermemory ASMR              ~99%       API-based service

These numbers come from different runs with different methodologies and are not strictly comparable. The variation itself is the central lesson: methodology shapes the result more than architecture does.

Key Findings from Comparative Testing

Embedding Model Selection Dominates

The same architecture scoring 96.0% with bge-base-en-v1.5 scored only 73.6% with all-MiniLM-L6-v2. That 22.4 percentage-point gap from a single configuration choice exceeds every other architectural decision tested. For practitioners, this means embedding-model selection is the highest-leverage decision in a memory system.
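
A small harness for making that comparison on your own data, sketched with sentence-transformers; the corpus and eval_pairs below are toy placeholders for real workload data:

```python
# Sketch of an embedding-model A/B harness with sentence-transformers.
# corpus and eval_pairs are toy placeholders for real workload data.
from sentence_transformers import SentenceTransformer, util

corpus = ["User's sister is named Emma",
          "User works in technology",
          "User loves hiking"]
eval_pairs = [("what's my sister called?", 0),   # (query, index of target)
              ("where do I work?", 1)]

def recall_at_5(model_name: str) -> float:
    model = SentenceTransformer(model_name)
    corpus_emb = model.encode(corpus, normalize_embeddings=True)
    hits = 0
    for query, target_idx in eval_pairs:
        q_emb = model.encode(query, normalize_embeddings=True)
        scores = util.cos_sim(q_emb, corpus_emb)[0]
        hits += target_idx in scores.argsort(descending=True)[:5].tolist()
    return hits / len(eval_pairs)

for name in ("BAAI/bge-base-en-v1.5", "sentence-transformers/all-MiniLM-L6-v2"):
    print(name, recall_at_5(name))
```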

Extraction Does Not Improve Retrieval

Running the full extraction pipeline (entities, facts, summaries) produced identical retrieval scores to embedding raw text directly. The extraction is valuable for the structured data it produces (which feeds the prompt) but does not boost vector search.

Metadata Filtering Has Real Costs

Palace-style room filtering reduced recall by approximately 1.3 percentage points relative to flat search. Keyword-based room detection occasionally misclassifies a query, and the fallback to unfiltered search recovers some but not all of the loss.

Viral Numbers Need Independent Replication

The claimed 96.6% recall with MiniLM was not reproducible. Likely explanations include inflated top_k values, different chunking strategies, or pure cosine ranking without composite scoring.

Practical Guidance for AI Memory Systems

For Evaluators

When assessing an AI memory product or paper:

  • Identify the embedding model and dimension count
  • Determine whether benchmarks measured retrieval only or end-to-end answer accuracy
  • Check top_k against corpus size
  • Request methodology documentation
  • Run the system on representative internal data before procurement

For Builders

When designing a memory system:

  • Test multiple embedding models against actual workload data
  • Default to storing raw text; treat LLM-extracted summaries as prompt-enrichment data rather than retrieval input
  • A/B test metadata-filtered retrieval against flat search with real users
  • Implement temporal awareness for changing facts
  • Run extraction asynchronously to avoid impacting chat latency

For Buyers

Subscription pricing in this category ranges from roughly $19 to $249 per month. The right vendor depends on:

  • The volume of conversations stored
  • Whether the embedding model is configurable
  • Whether the system supports custom retrieval logic
  • The latency profile under expected load
  • Data residency and privacy guarantees

What Benchmarks Cannot Capture

Retrieval recall is necessary but not sufficient. A companion that loads a structured one-hundred-token identity summary at the start of each session (“works in technology, has a sister named Emma, recently went through a breakup, loves hiking”) feels qualitatively different from one that starts cold and relies entirely on vector search.

The difference does not appear in retrieval scores. It appears in whether users describe the experience as “this AI knows me” or “this AI keeps forgetting things.” Optimizing exclusively for benchmark recall risks shipping a system that scores well on paper and feels hollow in practice.

Tizenegy is currently A/B testing flat retrieval against palace-style structured retrieval with real users, on the hypothesis that a small recall reduction is an acceptable cost for richer prompt context.

Conclusion

AI memory systems sit at a genuinely difficult intersection of machine-learning research, software engineering, and product design. The benchmarks that exist are useful but partial. The headline numbers reported in marketing materials frequently do not survive methodological scrutiny.

For the next wave of AI products, the practical path forward is clear: pick the best embedding model, store enough raw context to support good retrieval, layer in temporal awareness for changing facts, and measure user-perceived quality alongside the benchmark numbers. The space is moving quickly, and the gap between honest measurement and marketing claims will be one of the defining stories of AI memory in 2026.