Introduction

In my last blog post, I discussed an improvement to Stephen Fewer’s reflective DLL injection technique. The improvement entailed dynamically creating a bootstrap shellcode that allowed passing additional parameters to the reflective loader in our DLL. The reflective loader would then, after calling DllMain, call a chosen exported function of the DLL based on the parameters passed in the shellcode.

A limitation of this technique is that you could only inject into processes of the same architecture as the calling process: 32-bit processes could only inject into other 32-bit processes, and 64-bit processes only into 64-bit processes. In this post, I’ll describe the changes needed to do cross-architecture injection from 32-bit to 64-bit processes.

WoW64

In order to understand how to inject code into a 64-bit process from a 32-bit process, we have to understand the WoW64 subsystem. WoW64 (which stands for Windows 32-bit on Windows 64-bit) is a compatibility layer that was introduced in the Windows NT 5.2 kernel to allow 32-bit processes to run on a 64-bit Windows OS. It basically sets up a 32-bit execution environment, including alternate versions of the registry and portions of the file system, so that 32-bit processes can run unmodified.

The most interesting aspect of WoW64 for our purposes is how it handles system calls to the 64-bit kernel from a 32-bit process. The first thing to notice is the set of modules loaded by the process. Although the process will have kernel32.dll loaded, and likely other libraries such as user32.dll, these libraries are loaded from the Windows syswow64 directory, indicating that these are 32-bit versions of the libraries. In addition, the libraries wow64.dll, wow64win.dll, and wow64cpu.dll are loaded.

When a function is called from kernel32.dll on a 32-bit Windows OS, it will generally call into the kernel’s native interface, ntdll.dll, which will make a call to the kernel with a SYSENTER instruction. When the same process is run in WoW64 mode on a 64-bit OS, you’ll see that calls into the kernel are made via proxy, through a function in wow64cpu.dll. See here for the full details.

So let’s say our 32-bit process, running in WoW64 mode (which the processor calls “compatibility mode”, as opposed to the “64-bit mode” a native 64-bit process runs in; both are sub-modes of “long mode”), makes a call to a function in ntdll32.dll, the 32-bit version of the Windows native kernel interface. Tracing the call in a debugger, you will eventually run into the following instruction:

call dword ptr fs:[0c0h]

For 32-bit processes, the FS segment register points to the Thread Environment Block (TEB), and the pointer stored at offset 0c0h gives us the address of a function, X86SwitchTo64BitMode, in the wow64cpu.dll library. Continuing into this function leads us, in turn, to another interesting instruction:

jmp 0033:wow64cpu!CpupReturnFromSimulatedCode: (74FAB4D0)

A far jump with a segment selector of 0x33. What does this mean, exactly? Here we need to understand a little bit about memory segmentation.

Back before paging became the dominant memory management scheme, memory segmentation was the sole method used by Intel processors for addressing memory beyond the 64 KB reachable with a single 16-bit register. If we have a pointer in the form of [selector]:[offset], such as in our far jump above, the upper 13 bits of the segment selector are an index into a global structure called the Global Descriptor Table (GDT), while the low three bits hold a table indicator and a requested privilege level. The entry it selects, called a segment descriptor, is 8 bytes long and contains some flags as well as the base address and size of a segment. We then take the offset component of our pointer and add it to the base address in the segment descriptor, and we now have a linear address, which is also the physical address as long as paging is not in use.
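To make the selector format concrete, here is a minimal Python sketch that splits a 16-bit segment selector into its three architectural fields: the descriptor-table index (bits 3 to 15), the table indicator (bit 2, where 0 means GDT and 1 means LDT), and the requested privilege level (bits 0 and 1). The names `Selector` and `decode_selector` are made up for this illustration.

```python
from typing import NamedTuple


class Selector(NamedTuple):
    index: int   # which descriptor-table entry is selected
    table: str   # "GDT" or "LDT"
    rpl: int     # requested privilege level (0 = kernel, 3 = user)


def decode_selector(selector: int) -> Selector:
    """Split a 16-bit x86 segment selector into its fields."""
    if not 0 <= selector <= 0xFFFF:
        raise ValueError("a segment selector is a 16-bit value")
    index = selector >> 3                          # bits 3..15
    table = "LDT" if selector & 0b100 else "GDT"   # bit 2 (TI)
    rpl = selector & 0b11                          # bits 0..1
    return Selector(index, table, rpl)


if __name__ == "__main__":
    for sel in (0x23, 0x33):
        print(hex(sel), decode_selector(sel))
```

Decoded this way, 0x33 refers to entry 6 of the GDT with a requested privilege level of 3, so the selector is usable from ordinary user-mode code.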

Although memory segmentation is still used by Intel x86 and x86_64 processors, paging was added as a second layer of memory address translation and is the way modern operating systems do memory management. However, you’ll still see all the remnants of memory segmentation in Windows, such as the GDT and segment descriptors, although they are used in slightly different ways. One of the ways is mode switching, which brings us back to our 0x33 segment selector.

Heaven’s Gate

As we’ll see, our special 0x33 segment selector is the key to switching from WoW64 mode to 64-bit mode. Because of this special property, it has been dubbed “Heaven’s Gate” (although this is a slight misnomer since it points to a code segment descriptor rather than a call gate descriptor). If you look up the 0x33 descriptor entry in the GDT, you’ll see that it gives a base address of 0x00000000 and an unlimited size, effectively the entire memory space, which is not surprising since we’re using paging as our memory management scheme rather than memory segmentation. But what is interesting about this GDT entry is the flags portion of the descriptor. The flags portion contains 3 important bits, one of which is the L bit, which indicates whether the code segment described by the entry is 64-bit or not. GDT entry 0x33, of course, sets this particular bit to 1.
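As a rough illustration of where those flag bits live, the Python sketch below decodes an 8-byte segment descriptor following the layout in the Intel manuals. The two raw descriptor values are assumptions: they were constructed by hand to resemble typical user-mode code descriptors and were not dumped from a real system, so check them against your own debugger output. `decode_descriptor` is a hypothetical helper name.

```python
def decode_descriptor(raw: int) -> dict:
    """Decode an 8-byte code/data segment descriptor given as a 64-bit integer."""
    base = ((raw >> 16) & 0xFFFFFF) | (((raw >> 56) & 0xFF) << 24)
    limit = (raw & 0xFFFF) | (((raw >> 48) & 0xF) << 16)
    granular = bool((raw >> 55) & 1)      # G: limit is counted in 4 KB pages
    if granular:
        limit = (limit << 12) | 0xFFF
    return {
        "base": base,
        "limit": limit,
        "dpl": (raw >> 45) & 0b11,         # descriptor privilege level
        "present": bool((raw >> 47) & 1),
        "L": bool((raw >> 53) & 1),        # 64-bit code segment
        "D": bool((raw >> 54) & 1),        # default operand size is 32-bit
        "G": granular,
    }


# Hand-built, illustrative values (assumptions, not a dump of a real GDT).
DESC_32BIT_CODE = 0x00CFFB000000FFFF   # flat 4 GB segment, D=1, L=0
DESC_64BIT_CODE = 0x0020FB0000000000   # L=1, D=0

if __name__ == "__main__":
    for name, raw in (("32-bit", DESC_32BIT_CODE), ("64-bit", DESC_64BIT_CODE)):
        print(name, decode_descriptor(raw))
```

In the 64-bit example the limit field decodes to zero. That is harmless, because in 64-bit mode the processor ignores the base and limit of the code segment and treats it as spanning the whole address space; the L bit is what matters.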

Therefore, making a far jump to the code segment pointed to by the 0x33 segment descriptor in the GDT doesn’t actually change the memory we’re executing from (remember it covers the entire memory space?), but rather just loads a descriptor whose L bit tells the processor to operate in 64-bit mode. Thus, in our system call example, we see that the boundary between WoW64 mode and 64-bit mode comes from the far jump into wow64cpu!CpupReturnFromSimulatedCode using segment selector 0x33.

Cross-architecture reflective DLL injection

Now we can use this knowledge to perform reflective DLL injection from a 32-bit/WoW64 process into a 64-bit process. Our WoW64 process can’t directly inject into a 64-bit process because the 32-bit version of RtlCreateUserThread from ntdll32.dll simply fails (although the same call from a 64-bit process to a WoW64 process works fine). So we’ll need to write some 64-bit bootstrap code that can, after jumping through Heaven’s Gate, call the 64-bit version of RtlCreateUserThread from our injector process.

Instead of writing this from scratch, I used some shellcode from the Metasploit project that does exactly that. There are two parts to this: the shellcode for switching into 64-bit mode, and the 64-bit shellcode for doing remote thread creation in our target process. For brevity, I won’t explain the latter in depth, but basically it sets up 64-bit stack alignment, puts the parameters for RtlCreateUserThread into the proper registers for the x64 calling convention, and calls RtlCreateUserThread.

The more interesting shellcode is for mode switching:

[BITS 32]

WOW64_CODE_SEGMENT	EQU 0x23
X64_CODE_SEGMENT	EQU 0x33

start:
    push ebp 					; prologue, save EBP...
    mov ebp, esp				; and create a new stack frame
    push esi					; save the registers we shouldn't clobber
    push edi					;
    mov esi, [ebp+8]				; ESI = pFunction
    mov ecx, [ebp+12]				; ECX = dwParameter
    call delta					;
delta:
    pop eax					;
    add eax, (native_x64-delta)			; get the address of native_x64
    
    sub esp, 8					; alloc some space on stack for far jump
    mov edx, esp				; EDX will be the pointer to our far jump target
    mov dword [edx+4], X64_CODE_SEGMENT		; set the native x64 code segment
    mov dword [edx], eax			; set the address we want to jump to (native_x64)
    
    call go_all_native				; perform the transition into native x64 and return here when done.
    
    add esp, (8+4+8)				; remove the 8 bytes we allocated + the return address which was never popped off + the qword pushed from native_x64
    pop edi					; restore the clobbered registers
    pop esi					;
    pop ebp					; restore EBP
    retn (4*2)					; return to caller (cleaning up our two function params)
    
go_all_native:
    mov edi, [esp]				; EDI is the wow64 return address
    jmp dword far [edx]				; perform the far jump, which will return to the caller of go_all_native
    
native_x64:
[BITS 64]						; we are now executing native x64 code...
    xor rax, rax				; zero RAX
    push rdi					; save RDI (EDI being our wow64 return address)
    call rsi					; call our native x64 function (the param for our native x64 function is already in RCX)
    pop rdi					; restore RDI (EDI being our wow64 return address)
    push rax					; simply push it to alloc some space
    mov dword [rsp+4], WOW64_CODE_SEGMENT	; set the wow64 code segment 
    mov dword [rsp], edi			; set the address we want to jump to (the return address from the go_all_native call)
    jmp dword far [rsp]				; perform the far jump back to the wow64 caller...

In the start function above, we see the usual function prologue: saving non-volatile registers and creating a stack frame. Next, it puts the address of our 64-bit function (which will be the address of our remote thread creation shellcode) and its single 4-byte parameter into the ESI and ECX registers, respectively. Putting the parameter in ECX is strategic, because the x64 calling convention puts the first parameter to a function call into the RCX register (which is simply ECX extended to 64 bits). So once we get to our 64-bit function call, our parameter will already be in place.

It then gets the absolute address in memory of the 64-bit native_x64 function in preparation for our mode-switching far jump. It calculates this address using a little trick: call delta pushes its return address, which is the absolute address of the delta label itself (the instruction immediately following the call), onto the stack. This address is then popped from the stack into EAX, to which it then adds the offset between native_x64 and delta, thus calculating the absolute address of native_x64.
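The arithmetic behind this call/pop trick can be spelled out in a few lines of Python. This is only a worked example: the load address and the two label offsets are hypothetical numbers chosen for illustration, and `resolve_label` is not a function from the project.

```python
def resolve_label(popped_address: int, delta_offset: int, target_offset: int) -> int:
    """Absolute address of a label, given the address popped after `call delta`.

    popped_address is what `pop eax` yields: the runtime address of `delta`.
    delta_offset and target_offset are assemble-time offsets within the blob,
    so their difference is the constant the shellcode adds to EAX.
    """
    return (popped_address + (target_offset - delta_offset)) & 0xFFFFFFFF


if __name__ == "__main__":
    load_base = 0x00A30000        # hypothetical: where the blob was copied
    delta_offset = 0x10           # hypothetical offset of the delta label
    native_x64_offset = 0x3A      # hypothetical offset of native_x64

    eax = load_base + delta_offset            # value left in EAX by pop eax
    print(hex(resolve_label(eax, delta_offset, native_x64_offset)))
```

The point is that the result depends only on the runtime value popped into EAX and a constant fixed at assembly time, which is what makes the shellcode position independent.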

Next, it prepares the stack with the operand for our far jump. 8 bytes are reserved on the stack; the first 4 bytes are filled with the address of native_x64 we just calculated, and the next 4 contain our special 0x33 segment selector (the far jump itself only reads the low 2 bytes of that dword as the selector). go_all_native is then called so that a return address for after the 64-bit code is put onto the stack. This address is then saved in the EDI register (I’m guessing since we can’t preserve the stack across the x86/x64 boundary?), and the far jump is executed.
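To visualize that stack slot, the following Python sketch builds the same 8 bytes the stub writes and then reads them back the way an indirect 32-bit far jump does, as a 4-byte offset followed by a 2-byte selector. The target address is a hypothetical value, and both helper names are invented for this example.

```python
import struct
from typing import Tuple

X64_CODE_SEGMENT = 0x33


def build_far_pointer_slot(target: int, selector: int) -> bytes:
    """The 8 bytes written at [edx]: dword target address, then dword selector."""
    return struct.pack("<II", target, selector)


def read_far_pointer(slot: bytes) -> Tuple[int, int]:
    """Read the slot as `jmp dword far [mem]` would: 32-bit offset, 16-bit selector."""
    offset, selector = struct.unpack_from("<IH", slot)
    return selector, offset


if __name__ == "__main__":
    slot = build_far_pointer_slot(0x00A3003A, X64_CODE_SEGMENT)  # hypothetical target
    selector, offset = read_far_pointer(slot)
    print(slot.hex(), hex(selector), hex(offset))
```

Because x86 is little-endian, writing the selector as a full dword still leaves 0x33 in the two bytes the processor actually reads; the remaining two bytes are just padding.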

We are now in 64-bit mode, with 64-bit registers. This part of the code simply calls the address of our 64-bit function (remember our parameter was previously saved to ECX?), and then prepares and executes a far jump back to our return address. Note here that it uses the segment selector 0x23 to return to WoW64. If you look up this entry in the GDT, you’ll see it is the same as the 0x33 entry but with a code segment size of 4GB (the maximum a 32-bit address can reach) and the 64-bit flag turned off.

Finally, back in 32-bit mode, our non-volatile registers are restored for the caller, the stack is cleaned up as per __stdcall convention, and we return. Dope.

Putting it all together

Now that we have all the parts for doing cross-architecture reflective DLL injection, we can put it all together. Some other changes to the code that you’ll notice are that I extended the Rva2Offset function to use either the 32-bit or 64-bit NT header structures depending on the architecture of the target DLL, and added the inject_via_remotethread_wow64 function from Meterpreter, which calls the mode-switching shellcode I described above.
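For readers who want to see what an architecture-aware RVA-to-offset lookup involves, here is a minimal Python sketch. It is not the code from the library: it assumes a well-formed PE file, checks the optional header's Magic field to confirm a PE32 or PE32+ image, and locates the section table via SizeOfOptionalHeader rather than through the C NT header structures. The function name `rva_to_offset` is my own.

```python
import struct

PE32_MAGIC = 0x10B        # 32-bit optional header
PE32_PLUS_MAGIC = 0x20B   # 64-bit optional header


def rva_to_offset(image: bytes, rva: int) -> int:
    """Translate an RVA into a raw file offset for a PE32 or PE32+ image."""
    (e_lfanew,) = struct.unpack_from("<I", image, 0x3C)
    if image[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")

    # The 20-byte file header follows the 4-byte signature.
    file_header = e_lfanew + 4
    (num_sections,) = struct.unpack_from("<H", image, file_header + 2)
    (opt_size,) = struct.unpack_from("<H", image, file_header + 16)

    # Magic tells us whether the 32-bit or 64-bit header layout applies.
    optional_header = file_header + 20
    (magic,) = struct.unpack_from("<H", image, optional_header)
    if magic not in (PE32_MAGIC, PE32_PLUS_MAGIC):
        raise ValueError("unknown optional header magic")

    # The section table starts right after the optional header, whatever its size.
    section_table = optional_header + opt_size
    sections = []
    for i in range(num_sections):
        # Fields after the 8-byte name: VirtualSize, VirtualAddress,
        # SizeOfRawData, PointerToRawData.
        _, vaddr, raw_size, raw_ptr = struct.unpack_from(
            "<4I", image, section_table + 40 * i + 8)
        sections.append((vaddr, raw_size, raw_ptr))

    if sections and rva < sections[0][2]:
        return rva                        # inside the headers, mapped 1:1
    for vaddr, raw_size, raw_ptr in sections:
        if vaddr <= rva < vaddr + raw_size:
            return rva - vaddr + raw_ptr
    return 0                              # not backed by any section
```

In C the two header structures have different sizes, so the section table sits at a different offset for 32-bit and 64-bit DLLs; that is the detail an architecture-aware version has to account for.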

Finally, I modified the bootstrap shellcode I described in Part 1. Previously, the shellcode would call our reflective loader function and then call ExitThread to terminate the injected thread properly. Now it simply calls the reflective loader, and it is the reflective loader function that terminates the thread. This is because in Vista and newer, the function ExitThread in kernel32.dll is actually a forwarded export, which our reflective loader is not able to resolve. The function it forwards to is the undocumented RtlExitUserThread function in ntdll.dll. Therefore, the reflective loader now resolves the address of ExitThread and, if it is available, of RtlExitUserThread. If the latter is available, it calls that to terminate the thread; if not, it falls back to calling ExitThread.

The bootstrap shellcode is now generated in a separate function, with the option for generating WoW64->x64 shellcode. The only difference is that the opcodes for some of the instructions are different since they take 32-bit addresses instead of 64-bit addresses.

You can see the latest version of the Improved Reflective DLL Injection library on my github: https://github.com/dismantl/ImprovedReflectiveDLLInjection.

References and further reading