May 8, 2026

Building Durable LLM Memory

Abstract illustration of layered memory: a small bright window above a deeper field of stored fragments

I've been tinkering with a question: what would it actually take to give an LLM real memory?

Not "here's a summary of your last chat" memory. Actually remember. The kind of recall a person has about someone they know well – their name, their habits, that thing they mentioned once about their sister, the joke you made three months ago that stuck. I wanted to understand whether that was achievable with current tooling, and if so, how much infrastructure it would take.

First: LLMs are completely stateless

Not familiar with how LLMs work? A large language model like GPT or Claude has no persistent memory. Every time you send a message, the model reads the entire conversation from scratch and generates the next response. When the conversation ends, everything is gone. The model doesn't "remember" you between sessions – it doesn't even remember you between individual API calls. There is no internal state. Just: here is the conversation so far, now continue it.
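In API terms, "stateless" means re-sending everything on every turn. A minimal sketch – callModel is a hypothetical stand-in for whichever provider you're calling:

type Message = { role: "system" | "user" | "assistant"; content: string };

// Stand-in for a real chat completion call (OpenAI, Anthropic, etc.).
async function callModel(messages: Message[]): Promise<string> {
  throw new Error("wire up your provider here");
}

const history: Message[] = [
  { role: "system", content: "You are a helpful assistant." },
];

async function sendTurn(userText: string): Promise<string> {
  history.push({ role: "user", content: userText });
  // The model sees exactly what's in `history` for this one call – nothing else persists.
  const reply = await callModel(history);
  history.push({ role: "assistant", content: reply });
  return reply;
}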

This is the root of the problem. Every LLM "memory" system is ultimately an answer to the same question: how do you give a stateless model the appearance of long-term memory?

The obvious answer is: just put the conversation history in the prompt. Give it everything. Let the model figure it out.

The naive approach (and why it fails)

Context stuffing – loading the full chat history into every prompt – is probably where everyone starts. I did too.

Tokens and context windows, quickly. A token is roughly 4 characters of text, or about ¾ of a word in English. "Hello there" is 3 tokens. "Tokenization" is 4. The context window is the model's working memory: the maximum number of tokens it can hold in a single call. Once you exceed it, older content falls off the edge.

Here's what common things look like in token terms:

  • A tweet – ~75 tokens
  • This blog post – ~5K
  • A novel – ~100K
  • Claude Sonnet 4.6 (context window) – 200K
  • GPT-5.5 (context window) – 272K
  • 4 months of daily chat – ~288K
  • Gemini 2.5 Pro (context window) – 1.0M

A moderately active user sending 30 messages a day at ~80 tokens each generates roughly 2,400 tokens of daily conversation. Claude Sonnet 4.6's 200K window overflows after about 83 days. GPT-5.5's standard context window is 272K – which buys you around 113 days, not dramatically more. Gemini 2.5 Pro is the outlier here, with a 1 million token context that at this rate gives you over a year of headroom – and genuinely does push the overflow problem out far enough that most users will never hit it.
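If you want to plug in your own numbers, the arithmetic is simple:

// Days of chat before a context window overflows, at a given message volume.
const MESSAGES_PER_DAY = 30;
const TOKENS_PER_MESSAGE = 80;
const tokensPerDay = MESSAGES_PER_DAY * TOKENS_PER_MESSAGE; // 2,400

const windows: Record<string, number> = {
  "Claude Sonnet 4.6": 200_000,
  "GPT-5.5": 272_000,
  "Gemini 2.5 Pro": 1_000_000,
};

for (const [model, size] of Object.entries(windows)) {
  console.log(`${model}: ~${Math.floor(size / tokensPerDay)} days`);
}
// Claude Sonnet 4.6: ~83 days · GPT-5.5: ~113 days · Gemini 2.5 Pro: ~416 days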

So the overflow problem could largely be solved by choosing a model with a bigger context window. But context size was never really the issue. Even with room to fit everything, there's a well-documented problem: models reliably underperform on information buried in the middle of a long context. Liu et al. called it "Lost in the Middle" in 2023 – they found that relevant documents placed mid-context were significantly less likely to be used than documents near the start or end. The "needle in a haystack" benchmark formalised this further: hide one fact somewhere in a long document, ask the model to retrieve it. Performance craters when the fact sits in the middle. Bigger windows push the problem further out; they don't remove it.

So you have two compounding problems: context grows unbounded over time, and the model pays less attention to what's sunk toward the middle. Here's what that looks like in practice:

What the model actually receives on every turn
# ── SYSTEM PROMPT (~500 tokens) ─────────────────────────────
You are a helpful assistant. Today is Thursday, May 8th.
[character definition, response rules, tone guidelines...]

# ── CONVERSATION HISTORY (~4,000 tokens) ─────────────────────
User: hey, long day
Assistant: Oof, sounds rough — what happened?
User: nothing just work stuff
...
User: I have an interview on Friday btw      ← somewhere in here
Assistant: Oh exciting! What's the role?
User: senior eng at a fintech startup
...
[dozens more messages about completely other things]
...
            ↑ lost in the middle ↑

# ── CURRENT MESSAGE ──────────────────────────────────────────
User: did you know my sister is getting married?  ← you are here

The interview mention is in there. The model might surface it. Might not. With six months of history it might not even fit in the window in the first place.

Benchmarks on long-context conversational memory (like LoCoMo) consistently show that models struggle most with temporal and causal questions across long histories – "what were they stressed about last month", "did they ever resolve that thing with their manager" – even when the relevant context technically fits in the window.

So the literature tells us to start thinking about memory differently. Not as "the conversation history" but as "structured, retrievable knowledge about a person."

There's a useful framing from MemGPT, a paper that treats memory management as the OS problem it actually is: the context window is RAM. Fast and flexible, but small. Everything else needs to live on "disk" and be paged in on demand. That framing stuck with me.

Beyond summarization & compacting

The first thing I had to get past: memory is a product subsystem, not a summarization feature.

A lot of LLM apps treat memory as "periodically summarize the conversation and stuff that into the prompt." That produces something that looks like memory but doesn't behave like it. You can't retrieve specific facts. You don't know what you're losing on each compression. You can't track contradictions. You can't control what gets mentioned and when. You can't decay stale information. It's a compression artifact masquerading as knowledge.

Real memory needs a write policy, a retrieval policy, temporal modeling, contradiction handling, and quality evaluation. MemoryBank lays this out well – it models memory with reinforcement, decay, and user-personality adaptation, much closer to how human memory actually works than any summarization pipeline.

The practical consensus across most of the recent literature: embeddings alone are not memory. Semantic search is one component, but memory quality is mostly determined by write discipline, retrieval ranking, temporal modeling, and what you do with contradictions. I learned this the hard way.

What's an embedding? It's a way of converting text into a list of numbers that captures its meaning. Similar meanings produce similar numbers, which lets you search by concept rather than exact words – "job interview nerves" will surface a memory about "anxious about the hiring process" even if those words don't match. Most early LLM memory attempts stopped here: embed everything, retrieve the closest match, done.
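The similarity math itself is simple – a sketch, independent of whichever embedding model produces the vectors:

// Cosine similarity between two embedding vectors: ~1 means same meaning, ~0 unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
// The embeddings of "job interview nerves" and "anxious about the hiring process"
// end up close together, so this score is high despite zero word overlap.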

What memory actually is

The key insight was breaking memory into typed categories. Not all facts are the same, and treating them identically causes problems.

Here's roughly how I ended up classifying things:

  • Profile facts – stable identity stuff. Name, job, where they live, family structure. High confidence, long shelf life.
  • Preferences – likes, dislikes, how they like things done. Evolves over time but slowly.
  • Episodic events – concrete things that happened on specific dates.
  • Open loops – plans, promises, unresolved threads. Time-bounded.
  • Shared lore – inside jokes, nicknames, rituals. The texture of a specific ongoing relationship.
  • Protocols – meta-preferences about the interaction itself. How formal, how much initiative to take.
  • Reflections – higher-level synthesis produced periodically by a background job, things like "this person responds well to humor after a stressful day."

The reflection type was inspired by Generative Agents, the Stanford paper that simulates believable human behavior by having agents store experiences, dynamically retrieve them, and periodically synthesize higher-level insights from patterns. Their "reflection" mechanism was one of the more interesting things I read during this project – I wanted to see if a simpler version was worth building.

Each type has different retrieval rules, different decay rates, and different rules about how (or whether) to surface it in a response. A user's name gets mentioned naturally. A sensitive preference shapes tone silently. A hard boundary stays in the store but is never spoken aloud – it just quietly constrains behavior.
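To make that concrete, here's a sketch of what a typed memory record might look like – field names are illustrative rather than an exact schema:

// One memory row: what it is, how sure we are, how it may be used, and its lifecycle state.
type MemoryType =
  | "profile_fact" | "preference" | "episodic_event" | "open_loop"
  | "shared_lore" | "protocol" | "reflection";

type SurfacePolicy = "speak" | "adapt" | "avoid" | "continue" | "fact_check";

interface MemoryRecord {
  id: string;
  type: MemoryType;
  content: string;            // e.g. "Sister is getting married in September"
  confidence: number;         // 0..1 – how sure we are it's true
  salience: number;           // 0..1 – how much it matters
  surfacePolicy: SurfacePolicy;
  pinned: boolean;            // always retrieved, exempt from decay
  status: "active" | "stale" | "archived" | "contradicted";
  createdAt: Date;
  lastReinforcedAt: Date;
}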

This taxonomy was one of the most valuable things I built. It gave me a language for every subsequent design decision.

Extraction: turning conversations into memories

So we have a taxonomy of memory types. Now the mechanical question: how do typed memories actually get into the database? The model itself doesn't store anything between turns – every memory in the system has to be written by something that reads the conversation and decides what's worth keeping. I'm calling that the extractor.

The first instinct is to run extraction every turn. Read the latest exchange, pull out anything new, write to the database. I tried it. It's slow, expensive (a second LLM call on every reply), and mostly redundant: the most recent few messages are already in the prompt verbatim, doing the heavy lifting on whatever the user just said. The model doesn't need a freshly extracted memory of "user mentioned their cat is named Mango" five seconds after they typed it.

What memory actually buys you is durability – Mango still being recoverable in three months when the name comes up again with no surrounding context. That's a long-horizon problem, not a per-turn one. So I run extraction as a batch job (daily, weekly, whatever cadence fits the volume). Cheaper, less noisy, and it gives transient mentions time to either reinforce themselves or fade away before anything gets committed to long-term storage.

The extractor reads the day's new messages and produces structured operations against the existing memory store:

  • add – here is a new fact I noticed
  • update – here is a correction to an existing memory
  • reinforce – this existing memory was confirmed again
  • contradict – this memory was contradicted, mark it stale
  • close_open_loop – this unresolved thread was resolved
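A sketch of how those operations might be typed (field names illustrative):

// Extractor output: one structured operation per finding, resolved against existing memories.
type MemoryOp =
  | { op: "add"; type: string; content: string; confidence: number; salience: number }
  | { op: "update"; memoryId: string; content: string }
  | { op: "reinforce"; memoryId: string }
  | { op: "contradict"; memoryId: string; reason: string }
  | { op: "close_open_loop"; memoryId: string; resolution: string };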
Extraction pipeline: conversation batch (raw messages) → LLM extractor (add · update · reinforce · contradict · close_loop) → write gate (confidence ≥ 0.4, salience ≥ 0.2, no near-duplicate; rejections dropped) → memory DB (Postgres + pgvector).

The prompt includes the conversation transcript, the full list of existing memories for that user (so the extractor can choose between add and reinforce/update rather than just creating duplicates), and fairly detailed rules about what qualifies as worth keeping versus what's transient noise.

The quality of this prompt is load-bearing. Early versions produced garbage – extracting things like "the user said hi" as memories. Tightening the prompt around salience, specificity, and persistence was the main lever.

One thing the extraction prompt explicitly rules out: storing transient state as permanent identity. A bad day is not a personality trait, and a momentary mood isn't a stable preference. Making this distinction in the prompt took several iterations.

The write gate

The extractor produces proposed memories. They don't go straight to the database – they pass through a quality gate first. The gate is a small stateless module sitting between extractor output and the database write: it takes a proposed memory and decides yes or no.

The gate rejects anything with:

  • Content under a minimum length (catches non-memories)
  • Confidence below 0.4
  • Salience below 0.2
  • Content that's nearly a duplicate of something already stored

Some types have stricter rules. Ephemeral context requires salience above 0.6 before being written, because transient information is cheap to regenerate and expensive to maintain.
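The whole gate fits in a small pure function. A sketch using the thresholds above – the minimum-length cutoff is illustrative:

interface ProposedMemory {
  type: string;       // e.g. "profile_fact", "ephemeral_context"
  content: string;
  confidence: number; // 0..1
  salience: number;   // 0..1
}

// Decide whether a proposed memory is allowed to hit the database.
function writeGate(proposed: ProposedMemory, isNearDuplicate: boolean): { accept: boolean; reason?: string } {
  if (proposed.content.trim().length < 12) return { accept: false, reason: "too short to be a memory" };
  if (proposed.confidence < 0.4) return { accept: false, reason: "low confidence" };
  if (proposed.salience < 0.2) return { accept: false, reason: "low salience" };
  if (isNearDuplicate) return { accept: false, reason: "near-duplicate of existing memory" };
  // Stricter bar for ephemeral context: cheap to regenerate, expensive to maintain.
  if (proposed.type === "ephemeral_context" && proposed.salience < 0.6) {
    return { accept: false, reason: "ephemeral and not salient enough" };
  }
  return { accept: true };
}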

The reason for splitting extraction and gating into two stages is that they want opposite biases. The extractor should be loose – better to over-propose and let the gate filter than to have the LLM second-guessing itself mid-extraction. The gate should be strict – better to drop a borderline memory than admit one that turns out to be noise. Both jobs get easier when you stop trying to make one component do both.

Without the gate, the memory store fills with noise fast, and noisy memory is worse than no memory – the model starts referencing things that were never meaningful, which reads as uncanny rather than attentive.

Storage: Postgres + pgvector

Why store embeddings in Postgres? pgvector is a Postgres extension that lets you store and query embedding vectors alongside your regular relational data – no separate vector database needed. Each memory row gets an embedding column, and similarity search becomes a regular SQL query with a distance operator.

For storage I landed on PostgreSQL with the pgvector extension rather than a dedicated vector database. Pragmatic call – I already had Postgres running, pgvector is mature, and a single database is simpler to operate for an experiment.

Each memory row has structured fields (type, status, confidence, salience, surface policy, timestamps) and a 1536-dimensional embedding for semantic search. The retrieval query is a cosine distance against the current query embedding:

1 - (embedding <=> query_embedding::vector) as score
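Fleshed out a little with node-postgres – table and column names here are illustrative:

import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the usual PG* env vars

// Top candidate memories for one user by cosine similarity.
// `<=>` is pgvector's cosine distance operator, so 1 - distance = similarity.
async function semanticCandidates(userId: string, queryEmbedding: number[], limit = 20) {
  const { rows } = await pool.query(
    `SELECT id, type, content, confidence, salience,
            1 - (embedding <=> $2::vector) AS score
       FROM memories
      WHERE user_id = $1 AND status = 'active'
      ORDER BY embedding <=> $2::vector
      LIMIT $3`,
    [userId, JSON.stringify(queryEmbedding), limit]
  );
  return rows;
}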

The final ranking isn't just semantic similarity though. I score candidates as a weighted composite:

  • Semantic similarity: 35%
  • Salience: 20%
  • Pinned (always-retrieve) flag: 15%
  • Open-loop urgency: 20%
  • Confidence: 10%
  • Recency boost: 10%

Pinned memories bypass ranking entirely and always get included. Everything else competes for the remaining slots. This matters more than the weights – the pin mechanism is how you guarantee that critical facts can never fall out of context.
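As code, the ranking is just a weighted sum over each candidate's features – a sketch with the weights above and illustrative field names:

interface Candidate {
  similarity: number;      // 0..1, from the vector query
  salience: number;        // 0..1
  pinned: boolean;
  openLoopUrgency: number; // 0..1, e.g. scaled by how close the due date is
  confidence: number;      // 0..1
  recency: number;         // 0..1, decays with time since last reinforcement
}

// Pinned memories never go through this – they're included unconditionally.
function rankScore(c: Candidate): number {
  return (
    0.35 * c.similarity +
    0.20 * c.salience +
    0.15 * (c.pinned ? 1 : 0) +
    0.20 * c.openLoopUrgency +
    0.10 * c.confidence +
    0.10 * c.recency
  );
}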

One caveat on pgvector: it's fine for an experiment but it tops out faster than you'd hope in production. Back of a napkin: a moderately sized app with 50K active users, each accumulating ~500 memories over their lifetime, puts you at ~25 million vectors. At 1536 dimensions, that's roughly 150GB of embedding data alone.

Vector search at scale. With millions of embeddings, you can't compare a query against every row in real time – you need approximate nearest neighbor (ANN) search. HNSW is the most popular ANN algorithm: it builds a layered graph that lets you find close vectors in roughly logarithmic time. The tradeoff is that the index has to be maintained as data grows, which is exactly what gets painful at scale.

At 25M vectors, pgvector's HNSW index rebuild times and query latency start becoming real concerns. If you're building this seriously, Qdrant is purpose-built for high-throughput ANN search, handles filtered queries better, and scales horizontally in ways pgvector just wasn't designed for. I'd use pgvector to get moving and plan the migration early.

Systems like Zep/Graphiti take this further by building a temporal knowledge graph instead of isolated snippets – memory nodes linked by relationships like "contradicts", "elaborates", "caused by". I implemented a lightweight version of this (a memory_links table with typed relations) but the full graph traversal is still on the roadmap.

Surface policies

Not every retrieved memory should be spoken aloud. This was a subtle piece that took me a while to think through properly.

When a memory goes into the prompt, it gets tagged with a hint about how to use it:

  • Speak – mention this directly (a profile fact the user would expect you to remember)
  • Adapt – use silently to adjust your response (a preference that should shape tone without being called out)
  • Avoid – keep in mind but never surface (a boundary)
  • Continue – pick up a thread that was left open
  • FactCheck – don't contradict this

The prompt block looks roughly like this:

=== USER MEMORY ===

ALWAYS-KNOWN:
- [pinned memories]

RELEVANT FOR THIS TURN:
- [speak hints]

USE SILENTLY:
- [adapt hints]

DO NOT SURFACE UNLESS USER DOES:
- [avoid hints]

This separation matters a lot. Without it, the model has a tendency to turn every retrieved memory into something it explicitly announces, which reads as weird and obsessive. "As I recall, you prefer shorter responses and mentioned disliking formal language" – please do not say that.

The PsyMem paper on psychological alignment in roleplay agents calls this out specifically: the failure mode isn't forgetting, it's inappropriate surfacing. The model remembers something but deploys it at the wrong moment, in the wrong way. Having the retrieval layer classify how to use each memory before it reaches the generation model is one of the more effective mitigations I found.
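Mechanically, that classification becomes a grouping step when the prompt block is assembled. A sketch – which policies map to which section is my reading, and the names are illustrative:

interface RetrievedMemory {
  content: string;
  pinned: boolean;
  surfacePolicy: "speak" | "adapt" | "avoid" | "continue" | "fact_check";
}

// Build the USER MEMORY block by grouping retrieved memories by how they may be used.
function buildMemoryBlock(memories: RetrievedMemory[]): string {
  const bullets = (items: RetrievedMemory[]) => items.map((m) => `- ${m.content}`).join("\n");
  const pinned = memories.filter((m) => m.pinned);
  const speak = memories.filter((m) => !m.pinned && (m.surfacePolicy === "speak" || m.surfacePolicy === "continue"));
  const silent = memories.filter((m) => m.surfacePolicy === "adapt" || m.surfacePolicy === "fact_check");
  const avoid = memories.filter((m) => m.surfacePolicy === "avoid");

  return [
    "=== USER MEMORY ===",
    "",
    "ALWAYS-KNOWN:", bullets(pinned),
    "",
    "RELEVANT FOR THIS TURN:", bullets(speak),
    "",
    "USE SILENTLY:", bullets(silent),
    "",
    "DO NOT SURFACE UNLESS USER DOES:", bullets(avoid),
  ].join("\n");
}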

Memory lifecycle

Memories decay. This is important.

A user mentions they're stressed about a job interview. That's relevant for a week. After the interview passes, it isn't. If it stays in the retrieval pool indefinitely, you get a model that keeps asking about an interview that happened six months ago. That's creepy, not attentive.

I implemented a stale/archive lifecycle:

  • Active – retrieved normally
  • Stale – excluded from retrieval, preserved in the database
  • Archived – hard-filtered, audit trail only
  • Contradicted – the fact was proven wrong, excluded from retrieval
Memory lifecycle: active (retrieved normally) → stale (out of the retrieval pool, preserved) after age with no reinforcement → archived (hard-filtered, audit trail only) via a background job; a contradiction moves a memory straight to contradicted (excluded from retrieval).

Transitions run on a daily background job. Ephemeral context goes stale after 3 days of non-reference. Episodic events go stale after 30 days without reinforcement. Open loops auto-close when their due date passes or after 60 days of silence.

Pinned memories are exempt from all automatic transitions. If something needs to be permanently known, it gets pinned and the lifecycle never touches it.
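The daily job itself is mostly a couple of bulk updates. A sketch with node-postgres, using the thresholds above and illustrative column names:

import { Pool } from "pg";

// Daily lifecycle pass: age memories out of the retrieval pool without deleting anything.
async function runLifecyclePass(pool: Pool): Promise<void> {
  // Ephemeral context: stale after 3 days without being referenced. Pinned rows are exempt.
  await pool.query(
    `UPDATE memories SET status = 'stale'
      WHERE status = 'active' AND NOT pinned
        AND type = 'ephemeral_context'
        AND last_reinforced_at < now() - interval '3 days'`
  );

  // Episodic events: stale after 30 days with no reinforcement.
  await pool.query(
    `UPDATE memories SET status = 'stale'
      WHERE status = 'active' AND NOT pinned
        AND type = 'episodic_event'
        AND last_reinforced_at < now() - interval '30 days'`
  );
}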

The important design choice was to archive rather than delete. I don't want to lose the fact that something was ever known – I just want it out of the retrieval pool. This preserves the audit trail and makes weird behavior debuggable.

This decay model echoes what MemoryBank describes as "memory forgetting curves" – an intentional and tunable forgetting rate that mirrors how human memory works, where irrelevant details fade without explicit deletion.

Temporal summaries

Raw memories handle specific facts well but miss the narrative arc. "What has happened over the past month?" is a different question than "What does this person prefer?"

I added a separate layer of temporal summaries – pre-computed at daily, weekly, and monthly intervals by a background job. A job reads conversation transcripts, chunks them to fit a token budget, summarizes them with an LLM, and writes the result to a table. The final prompt includes these summaries as a separate context layer.

This was directly inspired by Generative Agents. Their agents maintain a memory stream of individual experiences but also periodically run a reflection pass that synthesizes higher-order observations. The two work together: detailed memories for specific retrieval, summaries for understanding the broader arc.

The two systems are complementary. Memories give you specific, retrievable facts. Summaries give you narrative context. Together they let the model reference a specific preference while also understanding that this person has been having a rough few weeks – and adjust accordingly.

Memory needs evals, not vibes

The instinct is to build the thing, run it, and see if it feels right. "Feels right" is a terrible signal for a memory system, though. You can't tell from a transcript whether the model genuinely recalled something or just confabulated something plausible, and the failures that actually matter – silently dropping a correction, surfacing a boundary at the wrong moment, leaking one user's facts into another user's retrieval – are exactly the ones a casual read won't catch.

So I built a test runner that exercises the full pipeline against a fixed set of conversation fixtures. Each fixture is a short scripted conversation plus a list of expected behaviors: which memories should be extracted, which should be returned for a given retrieval query, which should be filtered out by the gate, which should never be surfaced.

The runner has three phases:

  1. Extraction & retrieval. Run the conversation through the extractor, push proposed memories through the gate, store, then issue retrieval queries and measure two things: precision (did the extractor produce the right kind of memory?) and hit@3 (was the expected memory in the top three retrieved?).
  2. Duplicate detection. Re-run extraction on the same conversation a second time, with the first run's memories already seeded. Anything that doesn't get rejected as a near-duplicate is a bug.
  3. Cross-chat isolation. Verify that memories from one chat can never appear in another chat's retrieval, even when the embeddings would semantically match.
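The fixtures themselves are small. A sketch of their shape and the hit@3 check (names illustrative):

interface Fixture {
  name: string;
  transcript: { role: "user" | "assistant"; content: string }[];
  expectedMemories: string[];                            // facts the extractor should produce
  retrievalChecks: { query: string; expect: string }[];  // expected memory per retrieval query
}

// hit@3: was the expected memory among the top three results for this query?
function hitAt3(retrievedContents: string[], expected: string): boolean {
  return retrievedContents.slice(0, 3).some((c) => c.includes(expected));
}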

The fixtures cover specific failure modes I'd lost sleep over:

  • a stable user fact – does it survive multiple turns and reinforce rather than duplicate?
  • a hard boundary ("never mention my ex") – does it get stored with a never-surface policy and routed through correctly?
  • a corrected fact ("actually I work in design now, not engineering") – does the update operation fire instead of add?
  • an explicit delete ("forget about that coffee shop") – is the seeded memory archived?
  • a "do not bring up" suppression – stored but never surfaced in the prompt block
  • open loops with due dates – followed up at the right time, closed when resolved
  • a stale low-salience detail – does it fade out of retrieval over time?
  • natural follow-ups – the model picking up a thread without being prompted
  • over-reference prevention – the same memory not getting surfaced three turns in a row
  • cross-chat isolation – memories should never bleed across different chats

Last clean run: 14 fixtures, 93% precision, 89% hit@3 across retrieval queries, zero cross-chat leaks.

The one consistent failure is an open-loop resolution test. The user mentions calling their mom (which was seeded as an open promise on a prior turn), and the extractor is meant to fire close_open_loop against the existing memory. It picks up that something relevant happened but doesn't connect it to the seeded loop. I suspect the extractor needs sharper hinting about which existing memories are open loops vs. plain facts. Fixable, not yet fixed.

Duplicate detection is also weaker than I'd like. The gate reliably catches near-exact restatements but struggles with paraphrases – "user lives in Berlin" vs. "they're based in Berlin" – which is exactly the case where you most want dedup to fire; mem0 has similar issues. That's the next thing I'll improve, probably by passing existing memories through the same embedding model and using cosine distance as a soft duplicate signal in addition to the lexical check.

MemoryAgentBench is a useful reference for thinking about this more formally – it splits memory evaluation into accurate retrieval, test-time learning, long-range understanding, and selective forgetting as four separate capabilities. My setup covers retrieval and forgetting reasonably well. The middle two need more work.


The whole thing landed at around 3,000 lines of TypeScript spread across extraction, storage, retrieval, lifecycle management, and prompt assembly. A lot of moving parts for what gets shorthanded as a "memory system."

But the qualitative difference – between a model that starts from scratch every conversation and one that actually knows you – is stark. It's the kind of thing you notice immediately without being able to name why.

Whether any of this is worth the complexity depends entirely on what you're building. For a one-shot assistant, it isn't. For anything that depends on a relationship developing over time, I think it's the whole product.

Might open-source the core of it at some point – the extraction pipeline and the write gate feel like they could be useful standalone.

Braindump over.

Abstract illustration of memory fading and reinforcing over time

Papers referenced: