CVE-2025-6211: CWE-440 Expected Behavior Violation in run-llama/llama_index

Severity: medium | Type: vulnerability | CVE-2025-6211

A vulnerability in the DocugamiReader class of the run-llama/llama_index repository, up to version 0.12.28, involves the use of MD5 hashing of chunk text to generate IDs for document chunks. Because the ID is derived from the text, structurally distinct chunks that contain identical text receive the same ID, and one chunk overwrites the other. This can cause loss of semantically or legally important document content, breakage of parent-child chunk hierarchies, and inaccurate or hallucinated responses in AI outputs. The issue is resolved in version 0.3.1.

AI Analysis

Technical Summary

CVE-2025-6211 is a medium-severity vulnerability in the DocugamiReader class of the run-llama/llama_index repository, affecting versions up to 0.12.28. DocugamiReader generates the identifier (ID) of each document chunk by taking an MD5 hash of the chunk's text content. Because a hash function is deterministic, structurally distinct chunks that contain identical text receive identical IDs. MD5's known cryptographic weaknesses are not the root cause here: any hash computed over the text alone would collide in the same way. When two chunks share an ID, one overwrites the other, and semantically or legally important document content can be lost. The overwrite also breaks the parent-child chunk hierarchies that preserve a document's structure. As a result, AI models that rely on these chunks may produce inaccurate or hallucinated outputs, undermining the reliability of AI-driven document processing and analysis. The vulnerability does not affect confidentiality; it affects the integrity and availability of document data. The advisory states that the issue is fixed in version 0.3.1. That number is lower than 0.12.28, so it presumably refers to a separately versioned reader package rather than the core llama_index release, and the fix presumably changes how chunk IDs are generated; the advisory does not describe it. The CVSS score is 6.5 (medium), reflecting that the flaw is exploitable over the network without privileges or user interaction and that it affects integrity and availability but not confidentiality. No known exploits had been reported in the wild as of the publication date.
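The overwrite described above can be reproduced in a few lines. The following is a minimal sketch, not DocugamiReader's actual code: the helper `text_only_id`, the `path` field, and the dictionary used as a chunk store are all hypothetical stand-ins, assumed here only to illustrate what happens when an ID is derived from chunk text alone.

```python
import hashlib


def text_only_id(text: str) -> str:
    """Hypothetical stand-in for an ID derived only from the chunk text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Two structurally distinct chunks (different sections) with identical text.
# The "path" values are made up for illustration.
chunks = [
    {"path": "/contract/section[1]/clause[3]", "text": "Not applicable."},
    {"path": "/contract/section[2]/clause[1]", "text": "Not applicable."},
]

store = {}
for chunk in chunks:
    # Both chunks map to the same key, so the second write replaces the first.
    store[text_only_id(chunk["text"])] = chunk

print(len(chunks), len(store))  # prints: 2 1
```

Only the second chunk survives in `store`, and any parent that referenced the first chunk by ID would now resolve to content from a different part of the document.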

Potential Impact

For European organizations, especially those that use the DocugamiReader class of run-llama/llama_index in AI-driven document processing, this vulnerability can create operational and legal risks. Lost or overwritten document chunks can feed incomplete or misleading data into AI models, producing inaccurate outputs that may affect decision-making, compliance reporting, or legal document handling. Sectors that process sensitive or regulated documents, such as legal, financial services, healthcare, and government, are particularly at risk. The integrity breach could undermine trust in automated document analysis and lead to regulatory non-compliance or erroneous business outcomes. Broken document hierarchies may also complicate audits and forensic investigations. Although the vulnerability does not directly expose confidential data, the loss of data integrity and availability can have cascading effects on business processes and AI reliability.

Mitigation Recommendations

European organizations using DocugamiReader from run-llama/llama_index should upgrade immediately to version 0.3.1 or later, where the vulnerability is fixed. If upgrading is not immediately feasible, organizations should add a validation layer that detects and handles ID collisions, and should include structural metadata (for example, the chunk's position in the document hierarchy) in the hash input. Switching to a collision-resistant algorithm such as SHA-256 does not help on its own, because identical text still hashes to an identical ID; it is useful only in combination with structural metadata. It is also advisable to audit existing document chunk data for signs of overwriting or data loss and to maintain robust backups of the original documents. Organizations should monitor AI output quality closely for hallucinations or inaccuracies that may indicate underlying data integrity issues. Finally, integrity checks and version control for document chunks can help detect and prevent silent data corruption.
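The two interim measures above, structure-aware IDs and an audit for chunks that text-only IDs would have merged, might look roughly like the sketch below. It is an illustration under stated assumptions, not the fix shipped by the project: `structural_id`, `find_text_duplicates`, and the `path` field (a position in the document hierarchy) are hypothetical names introduced here.

```python
import hashlib
from collections import defaultdict


def structural_id(path: str, text: str) -> str:
    """Hypothetical ID that hashes the chunk's structural path together with its text."""
    h = hashlib.sha256()
    for field in (path, text):
        data = field.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


def find_text_duplicates(chunks):
    """Return {text-only MD5: [paths]} for every group of chunks sharing the same text."""
    groups = defaultdict(list)
    for chunk in chunks:
        legacy_id = hashlib.md5(chunk["text"].encode("utf-8")).hexdigest()
        groups[legacy_id].append(chunk["path"])
    return {key: paths for key, paths in groups.items() if len(paths) > 1}


# Made-up chunks for illustration: the first two share text, the third does not.
chunks = [
    {"path": "/contract/section[1]/clause[3]", "text": "Not applicable."},
    {"path": "/contract/section[2]/clause[1]", "text": "Not applicable."},
    {"path": "/contract/section[2]/clause[2]", "text": "Payment is due in 30 days."},
]

ids = {structural_id(c["path"], c["text"]) for c in chunks}
print(len(ids))                           # one distinct ID per chunk
print(find_text_duplicates(chunks))       # the group a text-only ID would have merged
```

Running `find_text_duplicates` against the original source documents, rather than against an already-ingested store, is what makes the audit meaningful: chunks that were overwritten no longer exist in the store to be counted.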