AF_UNIX MSG_OOB UAF & SKB-based kernel primitives
Tip
Learn and practice AWS hacking:
HackTricks Training AWS Red Team Expert (ARTE)
Learn and practice GCP hacking: HackTricks Training GCP Red Team Expert (GRTE)
Learn and practice Azure hacking:
HackTricks Training Azure Red Team Expert (AzRTE)
Support HackTricks
- Check the subscription plans!
- Join the 💬 Discord group or the Telegram group, or follow us on Twitter 🐦 @hacktricks_live.
- Share hacking tricks by submitting PRs to the HackTricks and HackTricks Cloud GitHub repos.
TL;DR
- Linux >=6.9 introduced a flawed manage_oob() refactor (5aa57d9f2d53) for AF_UNIX MSG_OOB handling. Stacked zero-length SKBs bypassed the logic that clears u->oob_skb, so a normal recv() could free the out-of-band SKB while the pointer remained live, leading to CVE-2025-38236.
- Re-triggering recv(..., MSG_OOB) dereferences the dangling struct sk_buff. With MSG_PEEK, the path unix_stream_recv_urg() -> __skb_datagram_iter() -> copy_to_user() becomes a stable 1-byte arbitrary kernel read; without MSG_PEEK, the primitive increments UNIXCB(oob_skb).consumed at offset 0x44, i.e., it bumps the upper dword of (and so adds 4 GiB to) any 64-bit value placed at offset 0x40 inside the reallocated object.
- By draining order-0/1 unmovable pages (page-table spray), force-freeing an SKB slab page into the buddy allocator, and reusing the physical page as a pipe buffer, the exploit forges SKB metadata in controlled memory to identify the dangling page and pivot the read primitive into .data, vmemmap, per-CPU, and page-table regions despite usercopy hardening.
- The same page can later be recycled as the top kernel-stack page of a freshly cloned thread. CONFIG_RANDOMIZE_KSTACK_OFFSET becomes an oracle: by probing the stack layout while pipe_write() blocks, the attacker waits until the spilled copy_page_from_iter() length (R14) lands at offset 0x40, then fires the +4 GiB increment to corrupt the stack value.
- A self-looping skb_shinfo()->frag_list keeps the UAF syscall spinning in kernel space until a cooperating thread stalls copy_from_iter() (via mprotect() over a VMA containing a single MADV_DONTNEED hole). Breaking the loop releases the increment exactly when the stack target is live, inflating the bytes argument so copy_page_from_iter() writes past the pipe buffer page into the next physical page.
- By monitoring pipe-buffer PFNs and page tables with the read primitive, the attacker ensures the following page is a PTE page, converts the OOB copy into arbitrary PTE writes, and obtains unrestricted kernel read/write/execute. Chrome mitigated reachability by blocking MSG_OOB from renderers (6711812), and Linux fixed the logic flaw in 32ca245464e1 and added a prompt for CONFIG_AF_UNIX_OOB so the feature can be disabled.
Root cause: manage_oob() assumes only one zero-length SKB
unix_stream_read_generic() expects every SKB returned by manage_oob() to have unix_skb_len() > 0. After 93c99f21db36, manage_oob() skipped the skb == u->oob_skb cleanup path whenever it first removed a zero-length SKB left behind by recv(MSG_OOB). The subsequent fix (5aa57d9f2d53) still advanced from the first zero-length SKB to skb_peek_next() without re-checking the length. With two consecutive zero-length SKBs, the function returned the second empty SKB; unix_stream_read_generic() then skipped it without calling manage_oob() again, so the true OOB SKB was dequeued and freed while u->oob_skb still pointed to it.
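The control flow described above can be mimicked with a toy model. The sketch below is not kernel code: the queue is a plain array of remaining lengths, and every name (toy_queue, toy_manage_oob_flawed, toy_manage_oob_fixed, toy_read_frees_oob) is hypothetical. It only illustrates why skipping a single empty SKB is insufficient when two of them are stacked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy receive queue: each entry is the remaining length of one SKB. */
struct toy_queue {
    const int *len;  /* len[i] == 0 models a consumed OOB SKB left queued */
    size_t count;    /* number of queued SKBs */
    size_t oob;      /* index the toy "u->oob_skb" points at */
};

/* Hypothetical model of the flawed logic: skip at most one empty SKB. */
static size_t toy_manage_oob_flawed(const struct toy_queue *q, size_t i)
{
    if (i < q->count && q->len[i] == 0 && i != q->oob)
        i++;  /* advance once; the next length is not re-checked */
    return i;
}

/* Hypothetical model of the hardened logic: skip every empty SKB. */
static size_t toy_manage_oob_fixed(const struct toy_queue *q, size_t i)
{
    while (i < q->count && q->len[i] == 0 && i != q->oob)
        i++;
    return i;
}

/* True if a normal read would consume the SKB that "oob" still references. */
static bool toy_read_frees_oob(const struct toy_queue *q,
                               size_t (*manage)(const struct toy_queue *, size_t))
{
    assert(q->count > 0 && q->oob < q->count);
    size_t i = manage(q, 0);
    if (i < q->count && i == q->oob)
        return false;  /* OOB handling stops the normal read at the mark */
    /* The reader skips empty SKBs itself, without calling manage() again. */
    while (i < q->count && q->len[i] == 0)
        i++;
    return i < q->count && i == q->oob;
}
```

With lengths {0, 0, 1} and the OOB index at 2, the flawed variant reports that the OOB SKB is consumed, while the fixed variant stops at the mark. With a single empty SKB ({0, 1}, OOB index 1) both variants behave correctly, which matches the "only one zero-length SKB" assumption named in the heading.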
Minimal trigger sequence
#include <sys/socket.h>

int main(void) {
    char byte;
    int socks[2];
    socketpair(AF_UNIX, SOCK_STREAM, 0, socks);
    // Leave two consumed, zero-length OOB SKBs at the head of the receive queue.
    for (int i = 0; i < 2; ++i) {
        send(socks[1], "A", 1, MSG_OOB);
        recv(socks[0], &byte, 1, MSG_OOB);
    }
    send(socks[1], "A", 1, MSG_OOB);   // SKB3, u->oob_skb = SKB3
    recv(socks[0], &byte, 1, 0);       // normal recv frees SKB3
    recv(socks[0], &byte, 1, MSG_OOB); // dangling u->oob_skb
    return 0;
}
Primitives exposed by unix_stream_recv_urg()
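- 1-byte arbitrary read (repeatable): state->recv_actor() ultimately performs copy_to_user(user, skb_sourced_addr, 1). If the dangling SKB is reallocated into attacker-controlled memory (or a controllable alias such as a pipe page), each recv(MSG_OOB | MSG_PEEK) copies one byte from any kernel address that __check_object_size() permits to user space without crashing. Keeping MSG_PEEK set preserves the dangling pointer for unlimited reads.
- Constrained write: When MSG_PEEK is clear, UNIXCB(oob_skb).consumed += 1 increments the 32-bit field at offset 0x44. On 0x100-aligned SKB allocations this field is the upper four bytes of an 8-byte-aligned word, which turns the primitive into a +4 GiB increment of the 64-bit value held at offset 0x40. Turning it into a kernel write requires placing a sensitive 64-bit value at that offset.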
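The arithmetic behind the constrained write can be checked in user space. The sketch below assumes a little-endian host (as on x86_64) and uses a plain byte buffer as a stand-in for the reallocated object. The offsets come from the text, while the toy_* names and the 0x100-byte buffer size are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_QWORD_OFFSET    0x40u   /* 64-bit victim value (offset from the text) */
#define TOY_CONSUMED_OFFSET 0x44u   /* 32-bit consumed field (offset from the text) */
#define TOY_OBJECT_SIZE     0x100u  /* one 0x100-byte slot, as in the text */

/* Store a 64-bit value where the text places the victim qword. */
static void toy_store_qword(unsigned char *obj, uint64_t value)
{
    memcpy(obj + TOY_QWORD_OFFSET, &value, sizeof(value));
}

/* Load the 64-bit value back from offset 0x40. */
static uint64_t toy_load_qword(const unsigned char *obj)
{
    uint64_t value;
    memcpy(&value, obj + TOY_QWORD_OFFSET, sizeof(value));
    return value;
}

/* Model of "consumed += 1": a 32-bit increment at offset 0x44. */
static void toy_bump_consumed(unsigned char *obj)
{
    uint32_t consumed;
    memcpy(&consumed, obj + TOY_CONSUMED_OFFSET, sizeof(consumed));
    consumed += 1;
    memcpy(obj + TOY_CONSUMED_OFFSET, &consumed, sizeof(consumed));
}

/* Returns how much the qword grew after one bump (little-endian hosts only). */
static uint64_t toy_increment_delta(uint64_t initial)
{
    unsigned char obj[TOY_OBJECT_SIZE] = {0};
    toy_store_qword(obj, initial);
    toy_bump_consumed(obj);
    uint64_t after = toy_load_qword(obj);
    assert((uint32_t)after == (uint32_t)initial);  /* low dword is untouched */
    return after - initial;
}
```

On such a host, toy_increment_delta() returns 1ULL << 32 for any starting value, which is the "+4 GiB" step referred to throughout this page.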
Reallocating the SKB page for arbitrary read
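- Drain order-0/1 unmovable freelists: Map a huge read-only anonymous VMA and fault in every page to force page-table allocation (order-0 unmovable). Filling about 10% of RAM with page tables ensures that, once the order-0 lists are exhausted, subsequent skbuff_head_cache allocations pull memory from fresh buddy pages.
- Spray SKBs and isolate a slab page: Use dozens of stream socketpairs and queue hundreds of small messages on each socket (about 0x100 bytes per SKB) to fill skbuff_head_cache. Free selected SKBs so that the target slab page falls entirely under attacker control, and monitor its struct page refcount with the newly obtained read primitive.
- Return the slab page to the buddy allocator: Free every object on the page, then perform enough additional allocations/frees to push the page off SLUB's per-CPU partial lists and per-CPU page lists so that it becomes an order-1 page on the buddy freelist.
- Reallocate as pipe buffer: Create hundreds of pipes; each pipe reserves at least two 0x1000-byte data pages (PIPE_MIN_DEF_BUFFERS). When the buddy allocator splits an order-1 page, one half reuses the freed SKB page. To locate which pipe and which offset alias oob_skb, write unique marker bytes into fake SKBs spread across the pipe pages and issue recv(MSG_OOB | MSG_PEEK) repeatedly until a marker comes back.
- Forge a stable SKB layout: Populate the aliased pipe page with a fake struct sk_buff whose data/head pointers and skb_shared_info structure point at the arbitrary kernel addresses of interest. Because x86_64 disables SMAP inside copy_to_user(), user-space addresses can serve as staging buffers until the kernel pointers are confirmed.
- Respect usercopy hardening: The copy succeeds against .data/.bss, vmemmap entries, per-CPU vmalloc ranges, other threads' kernel stacks, and direct-map pages that do not cross higher-order folio boundaries. Reads against .text or against specialized caches rejected by __check_heap_object() simply return -EFAULT without killing the process.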
Introspecting allocators with the read primitive
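- Break KASLR: Read any IDT descriptor from the fixed mapping CPU_ENTRY_AREA_RO_IDT_VADDR (0xfffffe0000000000) and subtract the known handler offset to recover the kernel base.
- SLUB/buddy state: Global .data symbols reveal the kmem_cache base addresses, while vmemmap entries expose each page's type flags, freelist pointer, and owning cache. Scanning the per-CPU vmalloc segments uncovers struct kmem_cache_cpu instances, which makes the next allocation address of key caches (e.g., skbuff_head_cache, kmalloc-cg-192) predictable.
- Page tables: Rather than reading mm_struct directly (blocked by usercopy), walk the global pgd_list (struct ptdesc) and match the current mm_struct via cpu_tlbstate.loaded_mm. Once the root pgd is known, the primitive can traverse every page-table level to map the PFNs of pipe buffers, page tables, and kernel stacks.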
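The KASLR step relies on the architectural layout of a 16-byte x86_64 gate descriptor, in which the handler address is split into three fields. The sketch below shows how those fields are reassembled from bytes that have already been read. The function names are hypothetical, and the handler offset passed to toy_kernel_base() is whatever the caller knows for the target kernel build.

```c
#include <assert.h>
#include <stdint.h>

/* Reassemble the 64-bit handler address from a 16-byte x86_64 gate descriptor. */
static uint64_t toy_gate_handler(const unsigned char gate[16])
{
    /* Bytes 0..1: handler bits 0..15. */
    uint64_t low = (uint64_t)gate[0] | ((uint64_t)gate[1] << 8);
    /* Bytes 6..7: handler bits 16..31 (bytes 2..5 hold selector, IST, type). */
    uint64_t mid = (uint64_t)gate[6] | ((uint64_t)gate[7] << 8);
    /* Bytes 8..11: handler bits 32..63. */
    uint64_t high = (uint64_t)gate[8] | ((uint64_t)gate[9] << 8) |
                    ((uint64_t)gate[10] << 16) | ((uint64_t)gate[11] << 24);
    return low | (mid << 16) | (high << 32);
}

/* Subtract the handler's known offset from the image base (caller-supplied). */
static uint64_t toy_kernel_base(uint64_t handler, uint64_t handler_offset)
{
    assert(handler >= handler_offset);
    return handler - handler_offset;
}
```

For example, a descriptor whose three fields hold 0x0b10, 0x81a0, and 0xffffffff decodes to 0xffffffff81a00b10. These values are made up for illustration and do not correspond to any particular kernel build.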
Recycling the SKB page as the top kernel-stack page
- Free the controlled pipe page again and confirm via vmemmap that its refcount returns to zero.
- Immediately allocate four helper pipe pages and then free them in reverse order so the buddy allocator's LIFO behavior is deterministic.
- Call clone() to spawn a helper thread; Linux stacks are four pages on x86_64, so the four most recently freed pages become its stack, with the last freed page (the former SKB page) at the highest addresses.
- Verify via page-table walk that the helper thread's top stack PFN equals the recycled SKB PFN.
- Use the arbitrary read to observe the stack layout while steering the thread into pipe_write(). CONFIG_RANDOMIZE_KSTACK_OFFSET subtracts a random 0x0–0x3f0 (aligned) from RSP per syscall; repeated writes combined with poll()/read() from another thread reveal when the writer blocks with the desired offset. When lucky, the spilled copy_page_from_iter() bytes argument (R14) sits at offset 0x40 inside the recycled page.
Placing fake SKB metadata on the stack
- Use sendmsg() on an AF_UNIX datagram socket: the kernel copies the user sockaddr_un into a stack-resident sockaddr_storage (up to 108 bytes) and the ancillary data into another on-stack buffer before the syscall blocks waiting for queue space. This allows planting a precise fake SKB structure in stack memory.
- Detect when the copy finished by supplying a 1-byte control message located in an unmapped user page; ____sys_sendmsg() faults it in, so a helper thread polling mincore() on that address learns when the destination page is present.
- Zero-initialized padding from CONFIG_INIT_STACK_ALL_ZERO conveniently fills unused fields, completing a valid SKB header without extra writes.
Timing the +4 GiB increment with a self-looping frag list
- Forge skb_shinfo(fakeskb)->frag_list to point to a second fake SKB (stored in attacker-controlled user memory) that has len = 0 and next = &self. When skb_walk_frags() iterates this list inside __skb_datagram_iter(), execution spins indefinitely because the iterator never reaches NULL and the copy loop makes no progress.
- Keep the recv syscall running inside the kernel by letting the second fake SKB self-loop. When it is time to fire the increment, simply change the second SKB's next pointer from user space to NULL. The loop exits and unix_stream_recv_urg() immediately executes UNIXCB(oob_skb).consumed += 1 once, affecting whatever object currently occupies the recycled stack page at offset 0x40.
Stalling copy_from_iter() without userfaultfd
- Map a giant anonymous RW VMA and fault it in fully.
- Punch a single-page hole with madvise(MADV_DONTNEED, hole, PAGE_SIZE) and place that address inside the iov_iter used for write(pipefd, user_buf, 0x3000).
- In parallel, call mprotect() on the entire VMA from another thread. The syscall grabs the mmap write lock and walks every PTE. When the pipe writer reaches the hole, the page fault handler blocks on the mmap lock held by mprotect(), pausing copy_from_iter() at a deterministic point while the spilled bytes value resides on the stack segment hosted by the recycled SKB page.
Turning the increment into arbitrary PTE writes
- Fire the increment: Release the frag loop while copy_from_iter() is stalled so the +4 GiB increment hits the bytes variable.
- Overflow the copy: Once the fault resumes, copy_page_from_iter() believes it can copy >4 GiB into the current pipe page. After filling the legitimate 0x2000 bytes (two pipe buffers), it executes another iteration and writes the remaining user data into whatever physical page follows the pipe buffer PFN.
- Arrange adjacency: Using allocator telemetry, force the buddy allocator to place a process-owned PTE page immediately after the target pipe buffer page (e.g., alternate between allocating pipe pages and touching new virtual ranges to trigger page-table allocation until the PFNs align inside the same 2 MiB pageblock).
- Overwrite page tables: Encode desired PTE entries in the extra 0x1000 bytes of user data so the OOB copy_from_iter() fills the neighbouring page with attacker-chosen entries, granting RW/RWX user mappings of kernel physical memory or rewriting existing entries to disable SMEP/SMAP.
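To make the stakes of a PTE-page overwrite concrete, the sketch below decodes a single x86_64 page-table entry into its frame number and the permission bits the text refers to. The bit positions are the architectural ones for 4 KiB pages. The toy_* names are hypothetical, and the helper only interprets values; it does not touch real page tables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PTE_PRESENT   (1ULL << 0)   /* entry maps a page */
#define TOY_PTE_WRITABLE  (1ULL << 1)   /* writes allowed */
#define TOY_PTE_USER      (1ULL << 2)   /* reachable from user mode */
#define TOY_PTE_NX        (1ULL << 63)  /* instruction fetch forbidden */
#define TOY_PTE_PFN_MASK  0x000ffffffffff000ULL
#define TOY_PTES_PER_PAGE 512u          /* 0x1000 bytes / 8 bytes per entry */

struct toy_pte_view {
    uint64_t pfn;      /* physical frame number the entry points at */
    bool present;
    bool writable;
    bool user;
    bool executable;   /* true when the NX bit is clear */
};

/* Decode one raw 64-bit entry into its frame number and permission bits. */
static struct toy_pte_view toy_decode_pte(uint64_t pte)
{
    struct toy_pte_view v;
    v.pfn = (pte & TOY_PTE_PFN_MASK) >> 12;
    v.present = (pte & TOY_PTE_PRESENT) != 0;
    v.writable = (pte & TOY_PTE_WRITABLE) != 0;
    v.user = (pte & TOY_PTE_USER) != 0;
    v.executable = (pte & TOY_PTE_NX) == 0;
    return v;
}

/* Decode entry idx of a 512-entry page-table page held in a local copy. */
static struct toy_pte_view toy_decode_pte_at(const uint64_t *table, unsigned idx)
{
    assert(idx < TOY_PTES_PER_PAGE);
    return toy_decode_pte(table[idx]);
}
```

An entry that is present, writable, and user-accessible is enough to expose its physical frame to user space, which is why a single overwritten page of 512 such entries yields such broad control.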
Mitigations / hardening ideas
- Kernel: Apply 32ca245464e1479bfea8592b9db227fdc1641705 (properly revalidates SKBs) and consider disabling AF_UNIX OOB entirely via CONFIG_AF_UNIX_OOB (5155cbcdbf03) unless it is strictly needed. Harden manage_oob() with additional sanity checks (e.g., loop until unix_skb_len() > 0) and audit other socket protocols for similar assumptions.
- Sandboxing: Filter MSG_OOB/MSG_PEEK flags in seccomp profiles or higher-level broker APIs (Chrome change 6711812 now blocks renderer-side MSG_OOB).
- Allocator defenses: Strengthening SLUB freelist randomization or enforcing per-cache page coloring would complicate deterministic page recycling; limiting the number of pipe buffers a process can hold also reduces reallocation reliability.
- Monitoring: Expose high-rate page-table allocation or abnormal pipe usage via telemetry, since this exploit burns large amounts of page tables and pipe buffers.
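As an illustration of the sandboxing item, the following is a minimal seccomp-BPF sketch that rejects MSG_OOB in the flags argument of four socket syscalls on x86_64. It is an assumption-laden example rather than Chrome's actual policy: it covers only recvfrom, sendto, recvmsg, and sendmsg, ignores recvmmsg/sendmmsg and the x32 ABI, and the names toy_msg_oob_filter and toy_install_msg_oob_filter are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Low 32 bits of syscall argument n (little-endian layout of seccomp_data). */
#define TOY_ARG_LO(n) (offsetof(struct seccomp_data, args) + (n) * 8)

static struct sock_filter toy_msg_oob_filter[] = {
    /*  0 */ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    /*  1 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    /*  2 */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS), /* unexpected arch */
    /*  3 */ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /*  4 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_recvfrom, 4, 0), /* -> 9  */
    /*  5 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_sendto, 3, 0),   /* -> 9  */
    /*  6 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_recvmsg, 4, 0),  /* -> 11 */
    /*  7 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_sendmsg, 3, 0),  /* -> 11 */
    /*  8 */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),  /* other syscalls */
    /*  9 */ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, TOY_ARG_LO(3)), /* recvfrom/sendto flags */
    /* 10 */ BPF_STMT(BPF_JMP | BPF_JA, 1),                     /* -> 12 */
    /* 11 */ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, TOY_ARG_LO(2)), /* recvmsg/sendmsg flags */
    /* 12 */ BPF_JUMP(BPF_JMP | BPF_JSET | BPF_K, MSG_OOB, 0, 1),
    /* 13 */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    /* 14 */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Install the filter for the calling thread; returns 0 on success, -1 on error. */
static int toy_install_msg_oob_filter(void)
{
    struct sock_fprog prog = {
        .len = sizeof(toy_msg_oob_filter) / sizeof(toy_msg_oob_filter[0]),
        .filter = toy_msg_oob_filter,
    };
    assert(prog.len <= BPF_MAXINSNS);
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

After toy_install_msg_oob_filter() succeeds, a recv() or send() carrying MSG_OOB should fail with EPERM while ordinary socket I/O continues to work. A production policy would also need to consider the remaining syscalls that accept message flags, as well as interfaces that bypass seccomp argument checks.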
References
- Project Zero – “From Chrome renderer code exec to kernel with MSG_OOB”
- Linux fix for CVE-2025-38236 (manage_oob revalidation)
- Chromium CL 6711812 – block MSG_OOB in renderers
- Commit adding CONFIG_AF_UNIX_OOB prompt

