CVE-2025-3044: CWE-440 Expected Behavior Violation in run-llama run-llama/llama_index

Severity: Medium
Type: Vulnerability
Tags: cve, cve-2025-3044, cwe-440
Published: Mon Jul 07 2025 (07/07/2025, 09:54:22 UTC)
Source: CVE Database V5
Vendor/Project: run-llama
Product: run-llama/llama_index

Description

A vulnerability in the ArxivReader class of the run-llama/llama_index repository, versions up to v0.12.22.post1, allows for MD5 hash collisions when generating filenames for downloaded papers. This can lead to data loss as papers with identical titles but different contents may overwrite each other, preventing some papers from being processed for AI model training. The issue is resolved in version 0.12.28.

AI-Powered Analysis

Last updated: 07/07/2025, 10:27:25 UTC

Technical Analysis

CVE-2025-3044 is a medium-severity vulnerability in the ArxivReader class of the run-llama/llama_index repository, affecting versions up to v0.12.22.post1. ArxivReader uses an MD5 hash to generate filenames for downloaded papers. According to the CVE description, papers with identical titles but different contents end up with the same filename, which indicates that the hash is derived from the title rather than from the paper's content. MD5 is also known to be vulnerable to deliberately constructed collisions, where two different inputs produce the same hash, but identical titles would collide under any hash function. When such papers are downloaded and saved, one file overwrites the other, causing data loss.

The overwritten papers are never processed during AI model training, which can reduce the completeness and quality of the training data. The vulnerability does not directly affect confidentiality or availability; it compromises data integrity by allowing unintended overwrites. The CVSS v3.0 score is 5.3 (medium), reflecting that the issue can be triggered remotely without authentication or user interaction but affects integrity only. The issue is resolved in version 0.12.28. There are no known exploits in the wild at this time.
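The collision mechanism described above can be illustrated with a minimal sketch. This is not the library's actual code: the function names `title_based_filename` and `id_based_filename`, the sample titles, and the sample identifiers are all hypothetical, and the sketch assumes (based on the CVE description) that the vulnerable filename depends only on the title.

```python
import hashlib


def title_based_filename(title: str) -> str:
    """Hypothetical reconstruction of the vulnerable pattern: hash the title only."""
    return hashlib.md5(title.encode("utf-8")).hexdigest() + ".pdf"


def id_based_filename(paper_id: str) -> str:
    """Hypothetical alternative: derive the name from a per-paper identifier."""
    return hashlib.sha256(paper_id.encode("utf-8")).hexdigest() + ".pdf"


# Two different papers that happen to share a title (illustrative data only).
paper_a = {"id": "example-id-1", "title": "A Survey of Methods"}
paper_b = {"id": "example-id-2", "title": "A Survey of Methods"}

# Same title -> same filename, so the second download overwrites the first.
collides = title_based_filename(paper_a["title"]) == title_based_filename(paper_b["title"])

# Different identifiers -> different filenames, so both files survive.
distinct = id_based_filename(paper_a["id"]) != id_based_filename(paper_b["id"])

print(collides, distinct)
```

The point of the sketch is that the choice of hash input matters more than the choice of algorithm: any deterministic hash of the same title yields the same filename.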

Potential Impact

For European organizations using run-llama/llama_index for AI model training, particularly those relying on academic paper datasets from sources such as arXiv, this vulnerability could result in incomplete training datasets, because an overwritten paper is lost before it can be processed. This data integrity issue may lead to less accurate or biased AI models, which can affect research outcomes, product development, or decision-making processes that depend on them. Academic and research institutions and companies developing AI solutions could face setbacks in model performance or need additional verification steps to confirm dataset completeness. The vulnerability does not expose sensitive data or cause system downtime, but the loss of training data can have downstream effects on AI-driven services and products. Given the growing reliance on AI across European sectors such as healthcare, finance, and manufacturing, compromised training data integrity could indirectly affect operational effectiveness and innovation.

Mitigation Recommendations

European organizations should promptly upgrade run-llama/llama_index to version 0.12.28 or later, where the vulnerability is fixed. Until the upgrade is applied, organizations can implement additional safeguards such as:

1) Modifying the filename generation logic to use a collision-resistant hashing algorithm such as SHA-256 instead of MD5, applied to a value that is unique to each paper (for example, the file content or a paper identifier) rather than to the title alone, since identical titles produce the same hash under any algorithm.
2) Implementing checksums or content-based verification to detect and prevent overwriting files with different content.
3) Maintaining versioned backups of downloaded papers to recover any overwritten data.
4) Incorporating logging and alerting mechanisms to detect unexpected file overwrites during data ingestion.
5) Conducting regular audits of datasets to identify missing or corrupted files before training AI models.

These measures help maintain data integrity and reduce the risk of data loss until the patched version is deployed.
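As a rough illustration of the content-verification and logging safeguards, the sketch below refuses to overwrite an existing file whose content differs and logs a warning instead. The helper name `save_without_overwrite`, the logger name, and the fallback naming scheme are assumptions made for this example; they are not part of llama_index.

```python
import hashlib
import logging
from pathlib import Path

logger = logging.getLogger("ingest")  # hypothetical logger name


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def save_without_overwrite(directory: Path, filename: str, content: bytes) -> Path:
    """Write content under filename; on a name clash with different content, keep both files."""
    target = directory / filename
    if target.exists():
        if sha256_hex(target.read_bytes()) == sha256_hex(content):
            return target  # identical content already stored, nothing to do
        # Different content under the same name: log it and pick a content-derived name.
        logger.warning("Filename collision on %s; storing under a content-derived name", target.name)
        target = directory / f"{target.stem}-{sha256_hex(content)[:12]}{target.suffix}"
    target.write_bytes(content)
    return target
```

Called twice with the same filename but different bytes, the helper leaves the first file untouched and writes the second under a suffixed name, so the collision shows up in the logs rather than as a silently missing paper.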

Technical Details

Data Version: 5.1
Assigner Short Name: @huntr_ai
Date Reserved: 2025-03-31T12:26:26.971Z
CVSS Version: 3.0
State: PUBLISHED

Threat ID: 686b9cd16f40f0eb72e2e22d

Added to database: 7/7/2025, 10:09:21 AM

Last enriched: 7/7/2025, 10:27:25 AM

Last updated: 8/15/2025, 11:54:00 AM
