CVE-2024-35931: Vulnerability in Linux Linux
Description
In the Linux kernel, the following vulnerability has been resolved:
drm/amdgpu: Skip do PCI error slot reset during RAS recovery
Why:
The PCI error slot reset maybe triggered after inject ue to UMC multi times, this caused system hang.
[ 557.371857] amdgpu 0000:af:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 557.373718] [drm] PCIE GART of 512M enabled.
[ 557.373722] [drm] PTB located at 0x0000031FED700000
[ 557.373788] [drm] VRAM is lost due to GPU reset!
[ 557.373789] [drm] PSP is resuming...
[ 557.547012] mlx5_core 0000:55:00.0: mlx5_pci_err_detected Device state = 1 pci_status: 0. Exit, result = 3, need reset
[ 557.547067] [drm] PCI error: detected callback, state(1)!!
[ 557.547069] [drm] No support for XGMI hive yet...
[ 557.548125] mlx5_core 0000:55:00.0: mlx5_pci_slot_reset Device state = 1 pci_status: 0. Enter
[ 557.607763] mlx5_core 0000:55:00.0: wait vital counter value 0x16b5b after 1 iterations
[ 557.607777] mlx5_core 0000:55:00.0: mlx5_pci_slot_reset Device state = 1 pci_status: 1. Exit, err = 0, result = 5, recovered
[ 557.610492] [drm] PCI error: slot reset callback!!
...
[ 560.689382] amdgpu 0000:3f:00.0: amdgpu: GPU reset(2) succeeded!
[ 560.689546] amdgpu 0000:5a:00.0: amdgpu: GPU reset(2) succeeded!
[ 560.689562] general protection fault, probably for non-canonical address 0x5f080b54534f611f: 0000 [#1] SMP NOPTI
[ 560.701008] CPU: 16 PID: 2361 Comm: kworker/u448:9 Tainted: G OE 5.15.0-91-generic #101-Ubuntu
[ 560.712057] Hardware name: Microsoft C278A/C278A, BIOS C2789.5.BS.1C11.AG.1 11/08/2023
[ 560.720959] Workqueue: amdgpu-reset-hive amdgpu_ras_do_recovery [amdgpu]
[ 560.728887] RIP: 0010:amdgpu_device_gpu_recover.cold+0xbf1/0xcf5 [amdgpu]
[ 560.736891] Code: ff 41 89 c6 e9 1b ff ff ff 44 0f b6 45 b0 e9 4f ff ff ff be 01 00 00 00 4c 89 e7 e8 76 c9 8b ff 44 0f b6 45 b0 e9 3c fd ff ff <48> 83 ba 18 02 00 00 00 0f 84 6a f8 ff ff 48 8d 7a 78 be 01 00 00
[ 560.757967] RSP: 0018:ffa0000032e53d80 EFLAGS: 00010202
[ 560.763848] RAX: ffa00000001dfd10 RBX: ffa0000000197090 RCX: ffa0000032e53db0
[ 560.771856] RDX: 5f080b54534f5f07 RSI: 0000000000000000 RDI: ff11000128100010
[ 560.779867] RBP: ffa0000032e53df0 R08: 0000000000000000 R09: ffffffffffe77f08
[ 560.787879] R10: 0000000000ffff0a R11: 0000000000000001 R12: 0000000000000000
[ 560.795889] R13: ffa0000032e53e00 R14: 0000000000000000 R15: 0000000000000000
[ 560.803889] FS: 0000000000000000(0000) GS:ff11007e7e800000(0000) knlGS:0000000000000000
[ 560.812973] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 560.819422] CR2: 000055a04c118e68 CR3: 0000000007410005 CR4: 0000000000771ee0
[ 560.827433] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 560.835433] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 560.843444] PKRU: 55555554
[ 560.846480] Call Trace:
[ 560.849225] <TASK>
[ 560.851580] ? show_trace_log_lvl+0x1d6/0x2ea
[ 560.856488] ? show_trace_log_lvl+0x1d6/0x2ea
[ 560.861379] ? amdgpu_ras_do_recovery+0x1b2/0x210 [amdgpu]
[ 560.867778] ? show_regs.part.0+0x23/0x29
[ 560.872293] ? __die_body.cold+0x8/0xd
[ 560.876502] ? die_addr+0x3e/0x60
[ 560.880238] ? exc_general_protection+0x1c5/0x410
[ 560.885532] ? asm_exc_general_protection+0x27/0x30
[ 560.891025] ? amdgpu_device_gpu_recover.cold+0xbf1/0xcf5 [amdgpu]
[ 560.898323] amdgpu_ras_do_recovery+0x1b2/0x210 [amdgpu]
[ 560.904520] process_one_work+0x228/0x3d0
How:
In RAS recovery, mode-1 reset is issued from RAS fatal error handling and expected all the nodes in a hive to be reset. no need to issue another mode-1 during this procedure.
AI Analysis
Technical Summary
CVE-2024-35931 is a flaw in the Linux kernel's AMD GPU driver (amdgpu), in the interaction between PCI error handling and RAS (Reliability, Availability, and Serviceability) recovery. According to the upstream description, after uncorrectable errors (UE) are injected into the UMC (Unified Memory Controller) multiple times, a PCI error slot reset may be triggered while RAS recovery is still running. RAS fatal error handling already issues a mode-1 reset and expects it to reset every node in the hive, so the slot reset adds a second, redundant mode-1 reset in the middle of that procedure. In the quoted log, the driver reports "GPU reset(2) succeeded!" and then hits a general protection fault in amdgpu_device_gpu_recover, called from amdgpu_ras_do_recovery on the amdgpu-reset-hive workqueue, after which the system hangs. The practical effect is a denial of service. The fix skips the PCI error slot reset while RAS recovery is in progress. The trace was captured on a 5.15.0-91-generic Ubuntu kernel, but this entry does not list the affected or fixed kernel versions, so consult the CVE record or your distribution's advisory for exact ranges. No exploits are known in the wild, and no CVSS score has been assigned.
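The control flow described above can be illustrated with a small toy model. This is a sketch under stated assumptions, not the kernel's code: the real driver is written in C, and the names `ToyGpu`, `slot_reset`, `ras_recovery`, and `skip_during_ras` are hypothetical. Only the numeric result codes are taken from the kernel, where `pci_ers_result_t` uses 3 for "need reset" and 5 for "recovered" (the same values that appear in the mlx5_core log lines).

```python
from dataclasses import dataclass
from enum import IntEnum


class ErsResult(IntEnum):
    # Values mirror the kernel's pci_ers_result_t.
    NEED_RESET = 3
    RECOVERED = 5


@dataclass
class ToyGpu:
    """Hypothetical stand-in for a GPU; not an amdgpu data structure."""
    in_ras_recovery: bool = False
    mode1_resets: int = 0

    def mode1_reset(self) -> None:
        self.mode1_resets += 1


def slot_reset(gpu: ToyGpu, skip_during_ras: bool) -> ErsResult:
    """Toy PCI slot-reset callback.

    With skip_during_ras=True it models the idea of the fix: report the
    slot as recovered without resetting again while RAS recovery runs.
    """
    if skip_during_ras and gpu.in_ras_recovery:
        return ErsResult.RECOVERED
    gpu.mode1_reset()
    return ErsResult.RECOVERED


def ras_recovery(gpu: ToyGpu, pci_error_arrives: bool,
                 skip_during_ras: bool) -> int:
    """Run one toy RAS recovery and return how many mode-1 resets happened."""
    gpu.in_ras_recovery = True
    try:
        gpu.mode1_reset()  # the reset issued by RAS fatal error handling
        if pci_error_arrives:
            # A PCI error is reported while recovery is still in progress.
            slot_reset(gpu, skip_during_ras)
    finally:
        gpu.in_ras_recovery = False
    return gpu.mode1_resets
```

Without the guard, a PCI error that arrives mid-recovery produces two mode-1 resets in this model. With the guard, only the one issued by RAS handling takes place.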
Potential Impact
For European organizations, the main risk is loss of availability on Linux servers and workstations that use AMD GPUs for workloads such as scientific computing, data analytics, virtualization, and cloud services. A host that hits this bug during GPU error recovery can crash or hang, which means downtime for data centers, research institutions, and enterprises running AMD GPU-accelerated computing, and it can complicate incident response and maintenance. Nothing in the description suggests privilege escalation or direct data compromise. A hang could still have indirect effects, for example by interrupting security monitoring or backup jobs that run on the affected host. The reported trigger is repeated error injection into the UMC, which is typically a privileged operation, so the more likely real-world scenario is an outage during genuine hardware error recovery rather than a remote attack. No exploits are known, but unpatched systems remain exposed to accidental or deliberate triggering of this recovery path.
Mitigation Recommendations
To mitigate CVE-2024-35931, apply a kernel update that includes the fix to the amdgpu driver's PCI error handling. Because the flaw is in kernel-level GPU error recovery, a patched kernel is the only complete remedy, so system administrators should monitor vendor advisories and Linux distribution security updates for patched releases. Until the update is in place, monitor GPU and PCIe health so that hardware errors are caught before they escalate to a hang, and keep GPU and platform firmware current where updates improve error reporting and recovery. Where immediate patching is not feasible, consider isolating AMD GPU workloads on dedicated hosts or moving critical workloads to other hardware. Regular backups and high-availability configurations limit the impact of any downtime. Finally, retain detailed kernel logs and alert on PCIe error messages to speed up incident response.
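As one way to act on the logging advice, the sketch below scans kernel log lines for the messages quoted in the description. The message strings come from that log. The function names and the matching heuristic (a slot-reset callback followed later by a general protection fault, with amdgpu_ras_do_recovery present) are assumptions made for illustration and are not an official detection rule.

```python
import re

# Message fragments taken from the kernel log quoted in the description.
SIGNATURES = {
    "pci_error_detected": re.compile(r"\[drm\] PCI error: detected callback"),
    "pci_slot_reset": re.compile(r"\[drm\] PCI error: slot reset callback"),
    "gpf": re.compile(r"general protection fault"),
    "ras_recovery_frame": re.compile(r"amdgpu_ras_do_recovery"),
}

# dmesg-style timestamp prefix, e.g. "[ 557.610492]".
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def scan(lines):
    """Return (timestamp, signature_name) pairs in the order they appear."""
    hits = []
    for line in lines:
        match = TIMESTAMP.match(line)
        timestamp = float(match.group(1)) if match else None
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits.append((timestamp, name))
    return hits


def matches_reported_pattern(hits):
    """Heuristic (an assumption): slot reset seen before a GPF, with a
    RAS recovery frame somewhere in the log."""
    names = [name for _, name in hits]
    required = {"pci_slot_reset", "gpf", "ras_recovery_frame"}
    if not required <= set(names):
        return False
    return names.index("pci_slot_reset") < names.index("gpf")
```

The functions accept any iterable of strings, so they can be fed the saved output of `dmesg` or `journalctl -k`. A match only indicates that a log resembles the reported trace and does not prove that a host is affected.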
Affected Countries
Germany, France, United Kingdom, Netherlands, Sweden, Finland, Poland, Italy, Spain
Technical Details
- Data Version: 5.1
- Assigner Short Name: Linux
- Date Reserved: 2024-05-17T13:50:33.129Z
- CISA Enriched: true
- CVSS Version: null
- State: PUBLISHED
Threat ID: 682d9828c4522896dcbe21dd
Added to database: 5/21/2025, 9:08:56 AM
Last enriched: 6/29/2025, 8:12:06 AM
Last updated: 7/28/2025, 5:04:04 PM
Related Threats
- CVE-2025-8937: Command Injection in TOTOLINK N350R (Medium)
- CVE-2025-8936: SQL Injection in 1000 Projects Sales Management System (Medium)
- CVE-2025-5942: CWE-122 Heap-based Buffer Overflow in Netskope Netskope Client (Medium)
- CVE-2025-5941: CWE-125 Out-of-Bounds Read in Netskope Netskope Client (Low)
- CVE-2025-0309: Vulnerability in Netskope Netskope Client (Medium)