# A Deep Survey on AI Agent Memory Benchmarks and Open Source Frameworks

# 1. Introduction

The memory capability of AI Agents is becoming a core element determining their practical value. While current Large Language Models are powerful, they often exhibit "amnesia" in long-term interaction scenarios—failing to effectively retain and utilize historical context, which forces users to repeatedly reiterate their requirements. This limitation has catalyzed the rapid development of evaluation benchmarks and technical frameworks specifically designed for Agent Memory. This report will systematically review the benchmark landscape and open-source framework ecosystem within the AI Agent Memory domain, revealing the actual state and development trends of this field through multi-dimensional technical analysis.

# 2. The Ecosystem Panorama of AI Agent Memory Benchmarks

# 2.1 Three Dimensions of the Evaluation System

Current Agent Memory benchmarks do not rely on a single evaluation standard; instead, they have evolved into three complementary evaluation dimensions. The first dimension is conversational memory capability, represented by LoCoMo and LongMemEval, which focuses on evaluating an agent's ability to retain and retrieve information during long-term, multi-session conversations [1][2]. The design philosophy of this type of benchmark stems from real-world scenarios: user interactions with AI assistants often span days or even weeks, and a single conversation may contain hundreds of turns. LoCoMo constructs ultra-long conversation scenarios averaging roughly 9,000 tokens spread across as many as 35 sessions. The challenge lies in how dispersed the information is and how it correlates over time: the agent needs to recall, in the 20th session, details that were mentioned in the 3rd, and understand the causal relationship between the two.

The second dimension is comprehensive memory capability evaluation. MemoryAgentBench pioneered four core capability dimensions: Accurate Retrieval, Test-Time Learning, Long-Range Understanding, and Conflict Resolution [3]. The value of this framework lies in breaking the simplified perception that "retrieval equals memory." For example, Test-Time Learning requires agents to dynamically learn new skills and apply them flexibly during interactions, while Conflict Resolution tests whether the agent can identify and update contradictory information—when a user declares in the 10th turn, "I changed my mind; I live in Beijing now, not Shanghai," the agent must correctly update the user's residence information in the knowledge graph. Experimental data from MemoryAgentBench reveals a stark reality: in conflict resolution tasks, GPT-4o achieves only 60% accuracy in single-hop scenarios, while in multi-hop scenarios, the accuracy of all tested methods is below 7%. This implies that current memory systems fail almost completely when dealing with dynamically changing knowledge.
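
As a concrete illustration of what the Conflict Resolution dimension probes, the following is a minimal sketch of an update rule over a toy attribute store: a newer assertion supersedes an older, contradictory one while the history is retained. This is an assumption-laden simplification, not MemoryAgentBench's implementation; in particular it sidesteps the genuinely hard step of detecting contradictions expressed in free text.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Assertion:
    value: str
    asserted_at: datetime
    superseded: bool = False

@dataclass
class ProfileMemory:
    """Toy single-valued attribute store that keeps superseded history."""
    facts: dict = field(default_factory=dict)  # (subject, attribute) -> [Assertion]

    def assert_fact(self, subject: str, attribute: str, value: str) -> None:
        history = self.facts.setdefault((subject, attribute), [])
        for old in history:
            if not old.superseded and old.value != value:
                old.superseded = True  # conflict detected: the newer statement wins
        history.append(Assertion(value, datetime.now()))

    def current(self, subject: str, attribute: str) -> str | None:
        live = [a for a in self.facts.get((subject, attribute), []) if not a.superseded]
        return live[-1].value if live else None

mem = ProfileMemory()
mem.assert_fact("user", "residence", "Shanghai")  # early turn
mem.assert_fact("user", "residence", "Beijing")   # later turn: "I changed my mind"
assert mem.current("user", "residence") == "Beijing"
```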

The third dimension is scenario-specific memory evaluation. MEMTRACK focuses on multi-platform collaboration scenarios, simulating task tracking across Linear, Slack, and Git in software development [4]. FindingDory targets the long-term memory of embodied agents, evaluating how robots integrate large-scale visual data (hundreds of images) to complete object manipulation and navigation tasks over multi-day missions [5]. The emergence of these specialized benchmarks reflects a key insight: memory capability is not an abstract general ability but is deeply coupled with specific application scenarios—the memory requirements of a chat assistant differ fundamentally from those of an embodied robot.

# 2.2 Technical Dilemmas Revealed by Benchmarks

The test results of these benchmarks expose three major dilemmas facing current technical approaches. First is the paradox of RAG (Retrieval-Augmented Generation) effectiveness. Experiments from MemoryAgentBench show that in the NIAH-MQ task, simple BM25 keyword matching achieved 100% accuracy, while complex systems relying on GPT-4o-mini achieved only 22.8% [3]. This contrast reveals a truth widely ignored by the industry: in retrieval tasks, the "comprehension ability" of LLMs can itself become a source of interference, because when the model tries to interpret the question semantically it drifts away from precise literal matching. Letta's research further confirms this: by using only file system tools like grep and search_files, they achieved 74.0% accuracy on LoCoMo, surpassing Mem0's graph model variant (68.5%) [6]. The likely reason is that file manipulation tools appear widely in LLM training data, so models are "familiar" with how to use them, whereas complex knowledge graph tools perform poorly because such usage is underrepresented in training data.
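
To make the "literal matching beats semantic understanding" point concrete, here is a minimal sketch of BM25 keyword retrieval over a conversation history, assuming the third-party rank_bm25 package is installed; the corpus and query are invented for illustration and are not taken from the NIAH-MQ task.

```python
# pip install rank-bm25
from rank_bm25 import BM25Okapi

# Each "document" is one conversation turn; in a needle-in-a-haystack task
# the answer is a literal string buried somewhere in the history.
history = [
    "We talked about the trip to Lisbon next spring.",
    "My locker code at the gym is 4127, please remember it.",
    "I prefer tea over coffee in the mornings.",
]
tokenized = [turn.lower().split() for turn in history]
bm25 = BM25Okapi(tokenized)

query = "what is my locker code".lower().split()
best = bm25.get_top_n(query, history, n=1)
print(best[0])  # the turn containing "locker code ... 4127"
```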

Second is the false promise of long context. Although GPT-4.1-mini supports a context window of roughly one million tokens, its accuracy on the ∞Bench-QA task is only 45.8%, and the computational cost grows linearly with context length [3]. This result shatters the myth that "longer context is always better." The deeper reason is the "lost-in-the-middle" phenomenon of attention in long-context models: the model remembers information at the beginning and end relatively well, but its ability to retrieve information buried in the middle drops dramatically. Long context also faces the "distracting noise" problem: when the context contains a large amount of irrelevant information, the model's retrieval precision actually decreases. Data from LongMemEval shows that the memory accuracy of commercial chat assistants drops by 30% during continuous interaction [2], implying that even when long context is technically supported, memory degradation in practical applications remains inevitable.

Third is the asymmetry of computational efficiency. MemoryAgentBench testing found that Mem0's memory construction time is 20,000 times that of BM25, and Cognee requires 3.3 hours to process a single sample[3]. This data reveals a fatal flaw in current memory systems: to achieve "intelligent" memory management (such as automatic fact extraction and knowledge graph construction), systems require massive amounts of LLM calls and graph computations, causing latency to skyrocket. In a production environment, this means that for every message a user sends, the system may take seconds or even tens of seconds to update its memory—a latency unacceptable in real-time interaction scenarios. A deeper contradiction lies in the fact that the more "intelligent" the memory system, the slower its response speed; conversely, the simpler the retrieval mechanism, the poorer its adaptability. A fundamental solution to this dilemma has not yet been found.

# 2.3 Blind Spots in Benchmark Design

Although these benchmarks have contributed significantly to advancing the field, their design itself contains blind spots worth noting. First is the disconnect between static evaluation and dynamic interaction. Although LoCoMo constructs long conversation scenarios, its evaluation is essentially a static mode of "record conversation history first, then ask questions"[1]. This differs fundamentally from real-world scenarios: in actual applications, user questions dynamically influence the conversation flow, whereas conversation history in benchmarks is pre-generated fixed text. Letta's research exploited this very blind spot—file system tools perform excellently in static retrieval tasks but may expose limitations in scenarios requiring dynamic memory updates.

Second is the "hackability" of benchmarks. When Emergence AI claimed to achieve 91%+ accuracy on LongMemEval[7], the industry's first reaction was not applause, but skepticism: does this mean the benchmark itself lacks difficulty? In fact, while LongMemEval's 500 questions are carefully curated, their scale and diversity fall far short of real-world complexity. Can a system specifically optimized for a benchmark support the demands of a production environment with its generalization capability? This issue is common in the machine learning field—SOTA models on ImageNet perform excellently on specific datasets but may suffer a sharp performance drop in real scenarios due to distribution shift.

Third is the incomparability of multi-dimensional capabilities. The four capability dimensions proposed by MemoryAgentBench seem comprehensive, but how should the weights between different dimensions be allocated? Is a system scoring 90 in accurate retrieval but 30 in conflict resolution better than a system with a balanced score of 60 across all dimensions? This issue is rarely discussed in benchmark design but is crucial in practical applications. For example, in customer service scenarios, conflict resolution capability (such as identifying order information changed by the user) might be more important than accurate retrieval (such as recalling historical conversation details). A single total score ranking masks the heterogeneity of capability distribution, leaving developers lacking a sufficient basis for decision-making during selection.

# 3. Technical Division and Trade-offs of Open Source Frameworks

# 3.1 Three Major Schools of Technical Approaches

Current open-source Agent Memory frameworks present three major schools of thought in their technical approaches, with each school implicitly harboring a different understanding of the fundamental question: "What is memory?"

The first school is the Knowledge Graph-driven approach, typified by Zep/Graphiti. Graphiti's core innovation lies in the Temporal Knowledge Graph, designed using a Bi-Temporal model: explicitly tracking the time an event occurred (valid time) and the time data was ingested (transaction time) [8]. This design addresses a fatal flaw of traditional knowledge graphs: the inability to handle the dynamic evolution of knowledge. For example, if a user says "I work at Google" on Day 1, and "I have left to join Microsoft" on Day 10, the system needs to retain both records rather than simply overwriting, while returning the latest information when queried for the "user's current employer." Graphiti builds this capability on graph databases such as Neo4j and combines semantic embeddings, BM25 keyword matching, and graph traversal into a hybrid retrieval scheme, achieving 94.8% accuracy on the DMR benchmark and surpassing MemGPT's 93.4% [8].
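
The bi-temporal idea can be illustrated with a minimal sketch: each edge carries a validity interval plus an ingestion timestamp, and a point-in-time query filters on the interval. The field names and query function here are assumptions made for this sketch, not Graphiti's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Edge:
    subject: str
    relation: str
    obj: str
    valid_from: datetime          # when the fact became true in the world
    valid_to: Optional[datetime]  # None = still true
    ingested_at: datetime         # when the system learned about it

def employer_at(edges: list[Edge], subject: str, when: datetime) -> list[str]:
    """Point-in-time query: which WORKS_AT edges were valid at `when`?"""
    return [
        e.obj for e in edges
        if e.subject == subject and e.relation == "WORKS_AT"
        and e.valid_from <= when and (e.valid_to is None or when < e.valid_to)
    ]

day1, day10 = datetime(2024, 1, 1), datetime(2024, 1, 10)
edges = [
    Edge("user", "WORKS_AT", "Google", day1, day10, ingested_at=day1),     # closed, not deleted
    Edge("user", "WORKS_AT", "Microsoft", day10, None, ingested_at=day10),
]
print(employer_at(edges, "user", datetime(2024, 1, 5)))  # ['Google']
print(employer_at(edges, "user", datetime(2024, 2, 1)))  # ['Microsoft']
```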

However, the advantage of knowledge graphs comes at a cost. Building and maintaining graphs requires extensive entity recognition, relation extraction, and consistency verification, operations that rely on frequent LLM calls. Although Zep's paper claims a 90% reduction in response latency, this is relative to baseline models—its absolute latency remains significantly higher than simple vector retrieval. A deeper issue lies in the "over-structuring" of graphs: forcing natural language into entity-relation triplets inevitably loses linguistic nuances and contextual dependencies. For instance, how is the statement "I like coffee but I don't like espresso" expressed in a graph? Forcibly splitting it into "User-Likes-Coffee" and "User-Dislikes-Espresso" loses the transitional semantics of the original sentence.

The second school is the Vector + Fact Extraction approach, represented by Mem0. Mem0's design philosophy is to automatically extract structured facts from conversations via LLMs and store them in vector form [9]. The advantage of this route is its simplicity and efficiency: it does not require complex graph computation, relying solely on semantic search in vector databases for memory retrieval. Mem0 reports 91% lower latency and roughly 90% lower token cost than a full-context baseline, along with higher accuracy than OpenAI's built-in memory [9]; this performance advantage stems from its "lightweight" architectural design.
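
The following sketch shows the general shape of this vector-plus-fact-extraction pattern. It is not Mem0's code: the LLM extractor and the embedding model are stubbed out so the example runs offline, which means only the pipeline structure (extract, embed, store, search), not the retrieval quality, is meaningful.

```python
import numpy as np

def extract_fact(turn: str) -> str:
    """Stand-in for the LLM call that compresses a turn into a short fact."""
    # e.g. "While planning my Rome trip I realised I really love Italian food"
    # might come back simply as:
    return "User likes Italian food"

def embed(text: str) -> np.ndarray:
    """Stand-in for a sentence-embedding model (deterministic toy vectors)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

store: list[tuple[str, np.ndarray]] = []

def add(turn: str) -> None:
    fact = extract_fact(turn)        # <- the turn's surrounding context is lost here
    store.append((fact, embed(fact)))

def search(query: str, k: int = 3) -> list[str]:
    q = embed(query)
    ranked = sorted(store, key=lambda item: -float(item[1] @ q))
    return [fact for fact, _ in ranked[:k]]

add("While planning my Rome trip I realised I really love Italian food")
print(search("what food does the user like?"))
```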

However, MemoryAgentBench tests mercilessly revealed a fatal flaw in this route: fact extraction leads to severe context loss [3]. When an LLM compresses a conversation into the fact "User likes Italian food," the situational information from the original dialogue (such as whether the user mentioned it while discussing travel plans or answering a dietary preference survey) is completely erased. This results in the system being unable to understand the applicable context of this fact during subsequent retrieval. Even worse, fact extraction itself relies on the "understanding" capability of LLMs, which suffer from systemic biases in interpreting user intent—they might interpret sarcasm literally or misjudge temporary statements as long-term preferences. Mem0 achieved only 68.5% accuracy on LoCoMo [6], largely due to this design flaw.

The third school is the Hybrid Architecture approach, represented by Cognee and LangMem. The ECL (Extract-Cognify-Load) pipeline proposed by Cognee attempts to combine the strengths of vector search and graph databases: preserving original data during the Extract phase, building semantic relationship graphs during the Cognify phase, and supporting multi-modal queries during the Load phase [10]. LangMem adopts a "tiered memory" strategy, distinguishing between semantic memory (facts and knowledge), procedural memory (behavioral strategies), and episodic memory (specific events), while implementing user data isolation through a namespace mechanism [11].
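
The tiered-memory-plus-namespace idea can be expressed as a small data model. The sketch below is a schematic assumption for illustration, not LangMem's API: entries are partitioned by user namespace and by memory type, so the caller can retrieve only the tier relevant to the task at hand.

```python
from collections import defaultdict
from enum import Enum

class MemoryType(Enum):
    SEMANTIC = "semantic"      # facts and knowledge ("user is vegetarian")
    EPISODIC = "episodic"      # specific events ("order #123 was delayed on May 2")
    PROCEDURAL = "procedural"  # behavioural strategies ("answer in bullet points")

class NamespacedMemory:
    """Entries are keyed by (namespace, memory type) for per-user isolation."""
    def __init__(self):
        self._store = defaultdict(list)

    def add(self, namespace: str, mtype: MemoryType, content: str) -> None:
        self._store[(namespace, mtype)].append(content)

    def get(self, namespace: str, mtype: MemoryType) -> list[str]:
        return list(self._store[(namespace, mtype)])

mem = NamespacedMemory()
mem.add("user:alice", MemoryType.SEMANTIC, "Prefers Python over Java")
mem.add("user:alice", MemoryType.EPISODIC, "Reported a login bug on 2025-03-14")
mem.add("user:bob", MemoryType.SEMANTIC, "Works in a Kotlin codebase")
print(mem.get("user:alice", MemoryType.SEMANTIC))  # Bob's data stays isolated
```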

The theoretical advantage of hybrid architecture lies in "having the best of both worlds," but the challenge in practice lies in complexity management. Cognee requires 3.3 hours to process a single sample [3]; this latency exposes the fundamental contradiction of hybrid architecture: to achieve multi-modal querying, the system must simultaneously maintain vector indices, graph structures, and original text, requiring synchronization across three storage layers for every update. Furthermore, ensuring consistency between different storage layers is a difficult problem—when vector retrieval and graph traversal return inconsistent results, which one should the system trust?

# 3.2 The Hidden Costs of Framework Selection

When selecting technology, developers are often attracted by GitHub star counts and benchmark rankings, but what truly determines a framework's applicability is its hidden costs.

Mem0's hidden cost is "Black-boxing." Its 45.4k stars and automated memory management are highly attractive[9], but behind this automation lies a complete encapsulation of memory organization methods. Developers cannot precisely control which information is extracted as facts and what is discarded. In scenarios requiring granular memory control (such as medical diagnosis or legal consulting), this black-boxing may lead to the loss of critical information. More seriously, when the system makes memory errors (e.g., mistaking "User is allergic to peanuts" for "User likes peanuts"), developers find it difficult to trace the source of the error because the entire fact extraction process is a "black box operation" by the LLM.

Zep/Graphiti's hidden cost is "Infrastructure Dependency." Its 21.9k stars and SOTA performance are impressive[12], but achieving the performance described in the paper requires deploying a Neo4j graph database, configuring vector storage, setting up LLM APIs, and performing complex parameter tuning. For small teams or prototype development, this threshold can be prohibitive. Furthermore, the operational maintenance costs of graph databases (such as backup, scaling, and monitoring) are significantly higher than simple key-value storage, potentially consuming massive engineering resources in actual production.

Letta's hidden cost is "Agent Dependency." Its 20.6k stars and "Memory-First" philosophy are highly forward-looking[13], but Letta is essentially a complete agent framework rather than a pure memory layer. This means using Letta requires accepting its entire architectural design—message routing, tool invocation, state management, etc. For teams with existing agent systems, migrating to Letta may require refactoring the entire tech stack. Additionally, while Letta's file system tools perform excellently in benchmarks, they may struggle in scenarios requiring complex reasoning (such as multi-step logic chains), as file tools are essentially string matching and cannot understand semantic associations.

LangMem's hidden cost is "Ecosystem Lock-in." As part of the LangChain ecosystem, LangMem's deep integration with LangGraph is a strength, but it also means developers must adopt the LangChain tech stack[11]. For teams using other frameworks (like Haystack or AutoGPT), introducing LangMem may lead to tech stack inconsistency. Moreover, while LangMem's managed service lowers deployment barriers, it introduces dependencies on third-party services—risks such as service interruptions and privacy leaks need to be taken into account.

# 3.3 Experimental Exploration of Emerging Frameworks

Beyond mainstream frameworks, some emerging projects are exploring the boundaries of memory systems. A-MEM introduces the Zettelkasten method, treating memory entries as knowledge cards and building a mesh knowledge network through dynamic indexing and linking[14]. The philosophical basis of this design stems from human note-taking systems: each card exists independently but forms an organic whole through bidirectional links. A-MEM's innovation lies in allowing the memory system to autonomously decide how to organize knowledge, rather than relying on predefined hierarchical structures. However, this freedom also brings the risk of "knowledge explosion"—as interactions increase, the number of links grows exponentially. How can the system be prevented from falling into the chaos of "over-association"?

ReMe focuses on procedural memory, distinguishing between four dimensions: task memory, personal memory, tool memory, and working memory[15]. This layered design reflects insights from cognitive science: different types of memory have different retention mechanisms and retrieval patterns. For example, task memory (e.g., "how to solve a certain type of programming problem") should be reused across users, while personal memory (e.g., "a user's programming style preference") should be isolated per user. Although ReMe's 867 stars are far fewer than mainstream frameworks, its modular design allows developers to combine memory types as needed, offering greater flexibility.

Cognee's ECL pipeline attempts to shift memory management from "retrieval optimization" to "knowledge modeling"[10]. Its core viewpoint is that the failure of traditional RAG lies not in poor retrieval algorithms, but in the lack of structure in the documents themselves. Through knowledge graph construction in the Cognify stage, the system transforms unstructured text into structured knowledge, making subsequent retrieval more precise. However, the fundamental problem with this route is: Who defines a "good knowledge structure"? If relying on LLM for automatic construction, the quality of the structure depends on the LLM's understanding capabilities; if relying on manual definition, the system's generalization ability will be limited. Cognee has not yet resolved this dilemma.

# 4. The Overlooked Core Question: What Is the Essence of Memory?

# 4.1 The Paradox of Memory as Compression

Currently, almost all memory systems are based on an implicit assumption: memory is the compression of information. LoCoMo's 9000-token conversation needs to be compressed into retrievable memory units, and Mem0's fact extraction is essentially an extreme compression of the dialogue. However, this assumption ignores the dual nature of memory: compression improves efficiency, but it also inevitably loses information. Information theory tells us that the limit of lossless compression is determined by entropy—for high-entropy natural language, the lossless compression rate is extremely limited. This means that any system attempting to compress long conversations into brief facts must face the difficult choice of "what to keep" and "what to discard."

Letta's file system tool works precisely because it does not compress—the raw dialogue is preserved intact as files, and retrieval is performed by searching the full text via grep[6]. This "anti-compression" strategy performs excellently in static retrieval tasks, but its cost is the linear growth of storage. When conversation history reaches millions of tokens, full-text storage and search become unsustainable. At that point, the system must introduce compression mechanisms, and once compressed, it returns to the dilemma of "what gets lost."
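
A minimal sketch of this "anti-compression" strategy, assuming raw transcripts are kept as plain-text files in a `conversations/` directory: retrieval is just a regex scan over everything ever said. This is an illustration of the pattern, not Letta's implementation.

```python
import re
from pathlib import Path

ARCHIVE = Path("conversations")  # assumed layout: one plain-text transcript per session

def grep_memory(pattern: str, context_lines: int = 1) -> list[str]:
    """Scan every stored transcript for a regex; return matching lines with context."""
    hits: list[str] = []
    if not ARCHIVE.exists():
        return hits
    for path in sorted(ARCHIVE.glob("*.txt")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for i, line in enumerate(lines):
            if re.search(pattern, line, flags=re.IGNORECASE):
                lo, hi = max(0, i - context_lines), i + context_lines + 1
                hits.append(f"{path.name}:{i + 1}: " + " / ".join(lines[lo:hi]))
    return hits

# Nothing was summarised at write time, so anything the user ever said is still
# recoverable verbatim, e.g. grep_memory(r"allergic to \w+") -- at the price of
# scanning every file (or maintaining a full-text index) on each query.
```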

Knowledge Graphs attempt to achieve "lossy but controllable" compression through structuring—retaining only entities and relationships while discarding the modifiers of original sentences. However, practice shows that this "controllability" is illusory. MemoryAgentBench's conflict resolution task reveals that when new information contradicts old information, systems often fail to correctly identify the contradictory relationship[3]. The reason lies in the fact that graph structuring assumes the certainty of knowledge—"user lives in Shanghai" and "user lives in Beijing" are seen as mutually exclusive relationships, but in reality, the user might have homes in both places or be in the process of moving. Forcing uncertainty into a deterministic structure inevitably leads to semantic distortion.

# 4.2 The Curse of Retrieval Precision

Retrieval systems face a classic "precision-recall" trade-off: improving precision means returning fewer but more relevant results, while improving recall means returning more but potentially less relevant results. In the Agent Memory scenario, this trade-off evolves into a dilemma of "over-retrieval" versus "under-retrieval".

LongMemEval's multi-session reasoning task requires agents to integrate information scattered across different sessions[2]. If the retrieval system returns only the most similar Top-1 result, critical information may be missed; if it returns the Top-10, significant noise may be introduced. Experiments from MemoryAgentBench show that in accurate retrieval tasks, when BM25's Top-K increases from 2 to 10, accuracy rises from 49.5% to 61%, but the impact on learning tasks is limited[3]. This implies that the optimal K-value is task-dependent, and there is no universal "best configuration".
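
Because the optimal K is task-dependent, a pragmatic answer is to measure it per task rather than assume a global setting. The helper below is a sketch under assumed interfaces: `retrieve` and `eval_set` are placeholders for whatever retriever and labelled data a given workload provides.

```python
def recall_at_k(retrieve, eval_set, k: int) -> float:
    """Share of questions whose gold item id appears in the retriever's top-k.

    `retrieve(question, k)` must return objects with an `.id` field, and
    `eval_set` is a list of (question, gold_id) pairs; both are placeholders.
    """
    hits = sum(gold in {r.id for r in retrieve(q, k)} for q, gold in eval_set)
    return hits / len(eval_set)

# Sweep K separately per task type instead of assuming one global setting, e.g.:
#   for k in (1, 2, 5, 10, 20):
#       print(k, recall_at_k(bm25_retrieve, retrieval_eval, k))
```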

A deeper issue lies in the fact that retrieval itself assumes query clarity. When a user asks, "What was the name of that restaurant I mentioned last time," the system can explicitly retrieve memories related to "restaurant" and "name." But when a user asks, "Help me summarize recent progress," what defines "progress"? What is the time frame? These ambiguous queries rarely appear in benchmarks but are extremely common in real-world applications. Current memory systems generally lack the ability to handle ambiguous queries because both vector similarity and keyword matching rely on explicit retrieval targets. The essence of ambiguous queries is the need for the system to "understand intent," which transcends the scope of memory systems and enters the domain of reasoning systems.

# 4.3 The Consistency Trap of Updates

The core difference between memory systems and databases is that databases pursue ACID consistency, while memory systems need to tolerate "progressive consistency." When a user says "I like coffee" on Day 1 and "I've quit coffee recently" on Day 5, how should the system update its memory? If the memory from Day 1 is completely overwritten, the historical trajectory of the user's preference change is lost; if both memories are retained, contradictions may arise when answering "what does the user drink."

Zep's temporal knowledge graph attempts to solve this problem via a bitemporal model[8], but essentially, it transforms the "consistency problem" into a "temporal query problem"—the system no longer asks "what does the user like," but "what did the user like at time T." While this transformation is elegant, it introduces new complexities: how to determine the temporal context of the query? When a user asks, "Recommend a drink for me," should the system use the current time or historical time? If the user does not specify, the system's default behavior might lead to unexpected results.

More critically, memory updates may trigger cascading modifications. When a user corrects, "I actually don't live in Shanghai," the system not only needs to update the user's residence but also re-examine all inferences related to "Shanghai"—for example, the previous inference that "the user likely enjoys Benbang cuisine" based on "the user lives in Shanghai" should also be discarded. However, current memory systems generally lack this "inference chain backtracking" capability, resulting in logical silos in the updated memory graph. The conflict resolution task in MemoryAgentBench targets this exact issue, and test results show that even GPT-4o cannot reliably perform cascading updates[3].
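
One way to make such cascading updates tractable is to record provenance: every derived memory keeps the ids of the facts it was inferred from, so retracting a premise transitively retracts everything built on it. The sketch below is a toy illustration of this "inference chain backtracking", not how any of the surveyed frameworks currently works.

```python
from dataclasses import dataclass

@dataclass
class Fact:
    fid: str
    text: str
    premises: tuple[str, ...] = ()  # ids of facts this one was derived from
    retracted: bool = False

class ProvenanceStore:
    def __init__(self):
        self.facts: dict[str, Fact] = {}

    def add(self, fid: str, text: str, premises: tuple[str, ...] = ()) -> None:
        self.facts[fid] = Fact(fid, text, tuple(premises))

    def retract(self, fid: str) -> list[str]:
        """Retract a fact and, transitively, everything derived from it."""
        removed, stack = [], [fid]
        while stack:
            current = stack.pop()
            fact = self.facts.get(current)
            if fact is None or fact.retracted:
                continue
            fact.retracted = True
            removed.append(current)
            stack.extend(f.fid for f in self.facts.values()
                         if current in f.premises and not f.retracted)
        return removed

s = ProvenanceStore()
s.add("f1", "User lives in Shanghai")
s.add("f2", "User likely enjoys Benbang cuisine", premises=("f1",))
print(s.retract("f1"))  # ['f1', 'f2'] -- the derived preference is discarded too
```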

# 4.4 The Double-Edged Sword of Tool Use

Letta's research indicates that an agent's ability to call tools is more important than the memory retrieval mechanism itself [6]. This finding overturns traditional understanding in the field: while we have been optimizing "how to store and retrieve memory," the real bottleneck may be "whether the agent knows when memory is needed."

MEMTRACK's experiments reveal that even with the integration of advanced memory systems like Mem0 and Zep, agent performance improvements are not significant [4]. The reason lies in the agent's failure to effectively utilize memory tools—it either forgets to call the memory system, calls it at inappropriate times, or misinterprets the retrieved results. This phenomenon exposes a fundamental flaw in current agent architectures: memory systems are designed as "passive components" waiting for the agent to initiate calls, rather than "actively prompting" the agent when memory is required.

The ideal memory system should be "intrusive"—when a user mentions a previously discussed topic, the system should automatically inject relevant memories into the agent's context instead of waiting for the agent to inquire. However, this design brings new risks: excessive memory injection can pollute the agent's context, leading to information overload. Finding the balance between "active prompting" and "on-demand retrieval" is a core issue that the next generation of memory systems must resolve.
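
A minimal sketch of such an "intrusive" injection hook, assuming a relevance scorer and memory store already exist: memories above a threshold are injected into the prompt before the agent is ever asked to call a tool, with a hard cap to avoid flooding the context. The threshold, cap, and scorer are placeholders, not values from any surveyed system.

```python
MAX_INJECTED = 3      # cap to avoid drowning the prompt in old memories
MIN_RELEVANCE = 0.75  # below this, stay silent rather than interrupt

def build_prompt(user_message: str, memories, score) -> str:
    """Proactively inject relevant memories instead of waiting for a tool call.

    `memories` is any iterable of stored strings; `score(msg, mem)` is a
    placeholder relevance function (embedding similarity, BM25, ...).
    """
    ranked = sorted(((score(user_message, m), m) for m in memories), reverse=True)
    injected = [m for s, m in ranked[:MAX_INJECTED] if s >= MIN_RELEVANCE]
    if not injected:
        return user_message
    context = "\n".join(f"- {m}" for m in injected)
    return f"Possibly relevant earlier context:\n{context}\n\nUser: {user_message}"
```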

# 5. Decision Framework for Technology Selection

# 5.1 Scenario-Driven Selection Principles

Faced with a multitude of frameworks, developers should base their selection on the core requirements of specific scenarios rather than benchmark rankings or the number of GitHub stars.

For customer service scenarios, the core need is "contextual coherence"—users may discuss the same issue across multiple sessions. In this case, LangMem's tiered memory design (distinguishing semantic and episodic memory) might be the best choice[11], as it allows the system to retain both the user's question history (episodic memory) and general knowledge extracted from it (semantic memory). Mem0's fact extraction carries higher risk in this scenario because customer service conversations often contain a lot of transient information (like order numbers, logistics status), which is ill-suited to be extracted as "long-term facts."

For personal assistant scenarios, the core need is "preference learning"—the system needs to understand user habits and likes over time. Zep/Graphiti's temporal knowledge graph offers significant advantages in this scenario[8], as it can track the evolutionary trajectory of user preferences (e.g., "the user used to like coffee, but recently switched to tea"). However, deployment and maintenance costs can be prohibitive—for small teams, using Mem0's managed service might be a more pragmatic choice, despite its slightly inferior memory quality.

For knowledge Q&A scenarios, the core need is "precise retrieval"—users expect the system to accurately recall details from historical conversations. Letta's filesystem tools might be surprisingly effective in this scenario[6], as grep's literal matching has extremely high accuracy when handling explicit queries (e.g., "What was that number I mentioned last time"). But for vague queries requiring semantic understanding (e.g., "Summarize the key points we discussed"), filesystem tools fall short, necessitating support from vector search or knowledge graphs.

For embodied agent scenarios, the core need is "multimodal memory"—the system needs to integrate historical information from various modalities like vision and audio. Research from FindingDory indicates that current VLMs have bottlenecks when processing large-scale image data[5], and existing frameworks (like Mem0, Zep) have not yet been optimized for multimodal scenarios. This means that in the field of embodied agents, memory systems are still in an early exploratory stage, and developers may need to develop their own solutions.

# 5.2 Phased Evolution Strategy

The selection of a memory system should not be a "one-time decision" but should adopt a phased evolution strategy, optimizing gradually as the application matures.

Prototype Phase: Prioritize solutions that are simple to deploy and quick to validate. Mem0's managed service or LangMem's SDK are ideal choices because they allow developers to rapidly integrate memory functionality with just a few lines of code. The goal of this phase is to validate "whether memory actually improves user experience," rather than pursuing the ultimate in memory quality. If user feedback indicates that memory functionality is irrelevant, then investing significant resources in optimizing the memory system is a waste.

MVP Phase: When the product enters market validation, focus needs to shift to memory reliability. At this point, Letta's filesystem tools might be an underrated choice—although its technical approach isn't "cool" enough, grep's reliability is far higher than complex systems dependent on LLMs. In a production environment, predictable mediocrity is superior to unpredictable excellence. Developers should collect real user data, analyze which types of memory retrievals are most frequent and which are most prone to errors, thereby providing data support for the next stage of optimization.

Scaling Phase: When user volume and data volume reach a certain scale, performance bottlenecks begin to emerge. At this point, high-performance solutions like Zep/Graphiti can be considered, but avoid a complete replacement. A safer strategy is to adopt a "dual-track system": continue using lightweight vector retrieval for high-frequency, simple queries (e.g., "What is the user's name"); for low-frequency, complex queries (e.g., "Analyze the trend of user preference evolution"), invoke the knowledge graph system. The key to this strategy lies in containing complexity within system boundaries, avoiding making the entire system pay the price for a minority of complex scenarios.
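
The dual-track idea reduces to a router in front of two backends. The sketch below uses a deliberately naive keyword heuristic as the routing rule; in practice this would be a learned classifier or an LLM-based router, and `vector_search` / `graph_query` stand in for the lightweight and heavyweight backends respectively.

```python
SIMPLE_PATTERNS = ("what is", "when did", "name", "where")  # naive heuristic only

def answer(query: str, vector_search, graph_query) -> str:
    """Route cheap lookups to vector retrieval, analytical queries to the graph.

    `vector_search` and `graph_query` are placeholders for the two backends
    (e.g. a hosted vector store vs. a Neo4j/Graphiti deployment).
    """
    is_simple = any(query.lower().startswith(p) for p in SIMPLE_PATTERNS)
    return vector_search(query) if is_simple else graph_query(query)
```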

Optimization Phase: Once the system is running stably, targeted optimizations can be made based on actual data. For example, analyze failed memory retrieval cases to determine if it is a problem with the retrieval algorithm (in which case, optimize the vector model or graph traversal algorithm) or a problem with memory organization (in which case, redesign memory chunking or indexing strategies). Optimization at this stage should be data-driven rather than hypothesis-driven—do not blindly migrate just because a paper claims knowledge graphs are better; instead, make decisions based on the characteristics of your own data.

# 5.3 Real-World Cost-Benefit Considerations

Beyond idealistic technical discussions, business decisions must confront real-world cost-benefit considerations.

Computational Costs: Mem0 claims to reduce Token usage costs by 90%[9], but this figure requires careful interpretation. The premise of cost reduction is a comparison against the baseline of "stuffing the entire conversation history into the context," whereas in practical applications, few adopt such an extreme approach. A more realistic comparison should be "the cost of adopting a memory system" versus "the cost of not adopting a memory system." If a memory system requires 3 LLM calls per retrieval (fact extraction, similarity calculation, reranking), while not using a memory system requires only 1 LLM call (direct answer), then the memory system actually increases costs. Only when the user experience improvement brought by memory can be translated into commercial value (such as reducing churn rate, increasing conversion rate) is the cost of the memory system justified.

Engineering Costs: The deployment of Zep/Graphiti requires configuring multiple components such as Neo4j, vector databases, and LLM APIs[12]; a small team might take weeks or even months to complete a production-grade deployment. In contrast, Mem0's managed service takes only minutes to integrate. This difference is particularly important for startups—the ability to quickly validate hypotheses is often more valuable than the technical sophistication of the solution. If a team spends 3 months deploying Zep, only to find that users do not need such complex memory features, then those 3 months of investment become sunk costs.

Privacy Costs: Using managed services means uploading user data to third-party platforms, which may be unacceptable in sensitive fields like healthcare and finance. In such cases, self-hosted open-source solutions (such as Letta, Cognee) are the only option[13][10]. However, self-hosting also means the team must assume full operational responsibility for data security, backup and recovery, monitoring, and alerting. Privacy protection and operational convenience are a zero-sum game, and developers need to make trade-offs based on the nature of the business.

Opportunity Costs: Investing resources in optimizing memory systems means being unable to invest them in other features. When a product is in its early stages, users may care more about the perfection of core features rather than the precision of memory. For example, for a coding assistant, users might prefer the system to generate code accurately rather than remembering the programming language preference mentioned last week. Premature optimization of memory systems may lead to "putting the cart before the horse"—pursuing "nice-to-have" features before the core value has been validated.

# 6. Future Directions

# 6.1 From Passive Memory to Active Cognition

Current memory systems are essentially a "passive storage + on-demand retrieval" architecture, while next-generation systems may evolve into active cognitive systems. This means the system no longer waits for explicit user requests to recall memories, but actively identifies when memory injection is needed and which memories to inject.

Zep's Temporal Knowledge Graph has already demonstrated early signs of this direction—by tracking the evolution of entity relationships, the system can proactively discover contradictory information and prompt the user [8]. However, true active cognition requires deeper reasoning capabilities: the system needs to understand the deep intent of the conversation and judge which historical information is critical to the current task and which is distracting. This capability currently far exceeds the reasoning level of LLMs and may require combining technologies such as symbolic reasoning and causal modeling.

The core challenge of active cognition lies in the balance between "excessive intervention" and "omitting key information". If the system injects memories too aggressively, the user may feel interrupted; if it is too conservative, the significance of active cognition is lost. Future research may need to introduce reinforcement learning mechanisms to let the system learn optimal intervention strategies through user feedback.

# 6.2 The Dilemma of Multimodal Memory

Research on FindingDory reveals that current memory systems perform poorly in multimodal scenarios [5]. Embodied agents need to integrate historical information from visual, auditory, tactile, and other modalities, whereas existing frameworks are primarily optimized for text dialogue. The fundamental challenge of multimodal memory lies in cross-modal alignment: how to associate an image seen by the user on Day 1 with a linguistic description on Day 5?

Current multimodal models (such as GPT-4V, Gemini) already possess certain cross-modal understanding capabilities, but limitations remain in long-term memory scenarios. For example, when a user says, "Bring me the object I took a photo of last time," the system needs to retrieve historical images, identify objects, and understand spatial relationships—a process involving image retrieval, object detection, and semantic association. Failure in any single link will lead to task failure.

Multimodal memory may require new storage paradigms. Text can be compressed into vectors, but how can the semantic information of images be effectively compressed? Directly storing raw images leads to an explosion in storage costs, but extracting image features may lose key details (such as the color or texture of an object). Future research may need to explore "hierarchical encoding"—assigning different storage strategies to information at different levels of abstraction, where frequently queried coarse-grained features reside in memory, while infrequently queried fine-grained information is loaded on demand.
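
A schematic of what such hierarchical encoding might look like, with all names invented for this sketch: a coarse caption and embedding stay in the hot index for search, while raw pixels remain in cold storage and are fetched only when a match is found.

```python
from dataclasses import dataclass

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

@dataclass
class ImageMemory:
    image_id: str
    caption: str            # coarse descriptor, always kept hot ("red mug on the desk")
    embedding: list[float]  # coarse vector used for similarity search
    blob_uri: str           # raw pixels stay in cheap cold storage

def recall_image(query_embedding: list[float], index: list[ImageMemory], fetch_blob):
    """Search only the coarse layer; pull full-resolution data on demand."""
    best = max(index, key=lambda m: dot(query_embedding, m.embedding))
    return best.caption, fetch_blob(best.blob_uri)  # fine detail only when needed
```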

# 6.3 The Gap Between Benchmarks and the Real World

All benchmarks analyzed in this report share a fundamental flaw: they evaluate system performance on static tasks rather than adaptability in dynamic interactions. LoCoMo's conversations are pre-generated, and LongMemEval's questions are manually curated; these scenarios cannot reflect the "randomness" and "creativity" of real users [1][2].

Real-world users do not behave according to the expectations of benchmark designers. They may suddenly switch topics during a conversation, use ambiguous pronouns, or pose query types never seen in benchmarks. High scores of current memory systems on benchmarks do not guarantee reliability in real-world scenarios. This gap is not uncommon in the machine learning field—ImageNet SOTA models fail miserably on adversarial examples, and NLP models topping the GLUE leaderboard still make elementary errors on real text.

Future benchmarks may need to introduce adversarial evaluation—testing not only the system's best performance but also its robustness in extreme situations. For example, when a user deliberately provides contradictory information, can the system identify it and refuse to update? When a user uses dialects, slang, or typos, does the system's retrieval capability remain effective? These "stress tests" may reflect system production readiness better than benchmark rankings.

# 6.4 Ethical Boundaries of Memory

As memory systems become more capable, a question overlooked by the technical community surfaces: Should AI possess the "right to forget"? If a system remembers a casual comment made by a user five years ago and cites it in a current conversation, is this "thoughtful" or "intrusive"?

The EU's GDPR grants users the "Right to be Forgotten," yet current memory systems generally lack the capability for "selective forgetting." Zep's Temporal Knowledge Graph can mark information as outdated but cannot completely delete it—because deleting an entity might lead to the collapse of the graph structure [8]. Mem0's fact extraction is even more untraceable—once facts are extracted and embedded into vector space, developers cannot precisely pinpoint "which vectors encode the user's sensitive information."

Future memory systems may need built-in "forgetting mechanisms"—supporting not only physical deletion (removal from the database) but also semantic deletion (ensuring deleted information no longer influences system reasoning and decision-making). Achieving this capability requires memory systems to possess stronger interpretability—the system must be able to explain "why this decision was made" and "which historical memories influenced the decision," so that the impact scope can be assessed after deleting memories.
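
As a toy illustration of the physical-deletion half of this requirement, the sketch below tags every derived vector with the id of the raw source it came from, so a "forget this" request can remove all of them at once; it is an assumption-level design, and it does not solve semantic deletion, which would additionally require re-deriving anything the erased content influenced.

```python
class ErasableVectorStore:
    """Vector store where every entry carries the id of the raw source it came
    from, so a deletion request can remove all derived vectors in one pass."""
    def __init__(self):
        self._rows = []  # (vector, text, source_id)

    def add(self, vector, text, source_id) -> None:
        self._rows.append((vector, text, source_id))

    def forget(self, source_id) -> int:
        before = len(self._rows)
        self._rows = [row for row in self._rows if row[2] != source_id]
        return before - len(self._rows)  # number of physically removed entries

# Semantic deletion -- guaranteeing the erased content no longer shapes the
# agent's answers -- additionally requires invalidating anything derived from it.
```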

# 7. Conclusion

The field of AI Agent Memory is currently in the nascent stage of rapid evolution. Technical pathways have not yet converged, benchmarking systems are still being refined, and the open-source ecosystem presents a flourishing state of diversity. Through this in-depth survey, we have discovered a significant theory-practice gap in this domain: technical dilemmas revealed by benchmarks may be amplified in actual applications, while excellent performance of frameworks on benchmarks does not guarantee reliability in production environments.

For developers, the core advice is "Scenario-First, Gradual Evolution". Do not be misled by GitHub star counts or SOTA results in papers; instead, choose technical solutions based on the core requirements of specific application scenarios. Prioritize rapid validation during the prototyping stage, and only invest resources in performance optimization during the scaling stage. The value of a memory system lies not in its technical complexity, but in whether it truly improves the user experience.

For researchers, the core challenge is "Moving Beyond the Retrieval Mindset". Current memory systems are essentially still extensions of retrieval systems, whereas true memory should possess capabilities for active cognition, dynamic evolution, and multi-modal integration. The conflict resolution and test-time learning capabilities proposed by MemoryAgentBench are in the right direction, but how to implement these capabilities in engineering practice remains an unsolved problem[3].

The most fundamental insight is: Memory is not a technical problem, but a cognitive one. Human memory is not perfect information storage, but a complex cognitive process that is emotionally colored, context-dependent, and involves selective forgetting. When we attempt to endow AI with memory, we should not simply pursue "perfect recall," but rather consider "what kind of memory is necessary for a specific task." The reason Letta's filesystem tool is effective is not because it is technically advanced, but because it precisely matches the requirements of the benchmark tasks[6]. This lesson reminds us: while pursuing technical innovation, do not forget to return to the essence of the problem.

Future AI Agent Memory systems may not represent the victory of a single technical route, but rather the organic integration of multiple technologies—flexibly selecting the most appropriate memory mechanism under different scenarios, stages, and constraints. Realizing this vision requires the joint efforts of benchmark designers, framework developers, and application practitioners, and even more so, the continuous exploration of the fundamental question: "What is good memory?"

# References

  1. LoCoMo: Long Conversational Memory Dataset
  2. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory
  3. MemoryAgentBench Dataset
  4. Introducing MEMTRACK: A Benchmark for Agent Memory
  5. FindingDory: A Benchmark to Evaluate Memory in Embodied Agents
  6. Benchmarking AI Agent Memory: Is a Filesystem All You Need?
  7. Emergence AI Broke the Agent Memory Benchmark
  8. Zep: A Temporal Knowledge Graph Architecture for Agent Memory
  9. Mem0 GitHub Repository
  10. Cognee GitHub Repository
  11. LangMem SDK Launch
  12. Graphiti GitHub Repository
  13. Letta GitHub Repository
  14. A-MEM: Agentic Memory for LLM Agents
  15. ReMe GitHub Repository