Supercard patching: an emulation and static analysis story

Published: June 2024

A few months ago I found myself using a bunch of SuperCard GBA flashcarts. They are disgustingly terrible in so many ways. But even my more expensive EZ Flash Omega died after only a few months of use (they are also of questionable quality...). In search of a better way to run GBA ROMs I came across a GBAtemp post where people were trying to reverse engineer them. I found it odd that in 2024 there was no alternative firmware for these!

After some other folks started creating some alternative firmware for them I realised that these carts are somewhat unique since they require some special kind of patching: WAITCNT patching.

The SuperCard

To understand why some kinds of patching are required we need to understand how the device works under the hood. A SuperCard is a relatively complex device and contains seven main components:

Lattice CPLD (like an FPGA but simpler and cheaper)
SDRAM device (32MB to hold the loaded ROM)
SRAM device (128KB where save games are usually stored)
Flash device (512KB where the firmware is stored)
Clock/oscillator
Battery (to power the SRAM when the GBA is off)
SD card (micro/mini/full) socket

Supercard PCB, back side

Supercard PCB, front side

The CPLD is in the middle of all of these components: it connects to the GamePak bus and will redirect the signals to the right component. This is necessary since the GBA bus is time-multiplexed for address and data, so the CPLD needs to juggle all these signals and timings correctly.

Something worth mentioning is that the SDRAM requires several cycles to access data. Like many DRAM devices it has a RAS cycle (Row Address Strobe) and a CAS cycle (Column Address Strobe), which complicate things further. These devices are slower than the fastest commercial GBA ROMs, probably to save money (and I suspect that a faster device wouldn't make much of a difference without reprogramming the CPLD).

For this very reason, the SuperCard requires patching games so they can work on this slower "ROM" without crashing. Bad patches or lack thereof commonly cause games to not boot and get stuck in a "white screen", therefore known as White Screen Patches.

The WAITCNT register

Commercial GBA games came in all sorts of hardware configurations. Many had EEPROMs to save game progress, many others saw the use of SRAM or FRAM (in many cases with a battery) and some even shipped flash devices. For the main ROM I cannot really comment since my knowledge is rather minimal, but I suppose there was a variety of devices used with varying speeds and sizes. For this reason the GBA has a programmable register to configure memory access timings.

The WAITCNT register controls the number of wait states that the GBA will use when accessing memory on the external bus. From the GBATEK documentation we can see it has a bunch of bits to control GamePak and SRAM timing:

 4000204h - WAITCNT - Waitstate Control (R/W)

  Bit   Expl.
  0-1   SRAM Wait Control          (0..3 = 4,3,2,8 cycles)
  2-3   Wait State 0 First Access  (0..3 = 4,3,2,8 cycles)
  4     Wait State 0 Second Access (0..1 = 2,1 cycles)
  5-6   Wait State 1 First Access  (0..3 = 4,3,2,8 cycles)
  7     Wait State 1 Second Access (0..1 = 4,1 cycles; unlike above WS0)
  8-9   Wait State 2 First Access  (0..3 = 4,3,2,8 cycles)
  10    Wait State 2 Second Access (0..1 = 8,1 cycles; unlike above WS0,WS1)
  11-12 PHI Terminal Output        (0..3 = Disable, 4.19MHz, 8.38MHz, 16.78MHz)
  13    Not used
  14    Game Pak Prefetch Buffer (Pipe) (0=Disable, 1=Enable)
  15    Game Pak Type Flag  (Read Only) (0=GBA, 1=CGB) (IN35 signal)

In general the GBA has three ROM address spaces. These are simply three ROM mirrors that have different timing patterns. By default, the base ROM address (0x08000000) has a 4-2 wait state pattern. This means the first access requires 4+1 cycles and any subsequent access only 2+1 cycles. However many games that feature faster ROMs will reconfigure this by lowering the waitstate count. Many will use 2-1 instead of the default 4-2.
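To make the bit layout concrete, here is a minimal Python sketch that splits a WAITCNT value into its wait state fields, using only the cycle tables from the GBATEK excerpt above. The function name `decode_waitcnt` and the value `0x4317` are just illustrative choices, not something taken from any particular game or tool.

```python
# Cycle tables copied from the GBATEK excerpt above.
FIRST_ACCESS = (4, 3, 2, 8)
# Second-access cycles differ per wait state region (WS0, WS1, WS2).
SECOND_ACCESS = {0: (2, 1), 1: (4, 1), 2: (8, 1)}


def decode_waitcnt(value):
    """Split a 16-bit WAITCNT value into its wait state fields."""
    fields = {"sram": FIRST_ACCESS[value & 3]}
    for ws in range(3):
        shift = 2 + ws * 3  # WS0 starts at bit 2, WS1 at bit 5, WS2 at bit 8
        first = FIRST_ACCESS[(value >> shift) & 3]
        second = SECOND_ACCESS[ws][(value >> (shift + 2)) & 1]
        fields[f"ws{ws}"] = (first, second)
    fields["prefetch"] = bool(value & (1 << 14))
    return fields


if __name__ == "__main__":
    print(decode_waitcnt(0x0000))  # reset value: WS0 is 4-2
    print(decode_waitcnt(0x4317))  # example value: WS0 becomes 3-1, prefetch on
```

A patcher can use this kind of decoding to tell whether a value a game is about to write would actually speed up ROM access compared to the default.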

Any game that updates the WAITCNT register to a faster access speed on a slower device will crash (since the data read back will likely be incorrect or outright garbage), thus causing the "white screen" effect (since games usually perform this task during the initial boot/initialization). Some other games also tweak the SRAM access waitstates if they ship faster/slower chips.

WAITCNT patching

In order to make games work (albeit with slowdowns or some other caveats) we must prevent games from updating the WAITCNT register to program a higher speed memory access. Doing so is not trivial in many cases. In fact the SuperCard firmware doesn't do it and relies on a PC patching software (that we all hate since it's ancient and Windows-only). The patcher also performs other stuff such as patching Flash/EEPROM games to use SRAM (since the SuperCard only ships SRAM!) and some other features.

Good old Supercard Windows patcher screenshot

Supercard Windows patcher screenshot (2)

How do we find out which games update WAITCNT, how and where do they do it, and patch them to avoid it? Well this is where things get interesting!

Hacky hack!

Some homebrew software (such as TWiLight Menu) has encountered similar patching needs. They have a super simple and interesting approach to the problem: patch every 0x04000204 constant with a zero, unless some other obscure conditions are met. The code is pretty cool although obviously wrong.

This works sometimes due to the fact that many code paths (usually in Thumb mode) will load said constant (using LDR rX, [PC + Y]) and use it to write the register. Patching the address to zero causes the write to land in a read-only area, which nullifies the access. In ARM mode the code path usually involves loading the 0x04000000 constant instead, so the trick doesn't work there.
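As an illustration of the idea only, here is a minimal Python sketch of the "zero the constant" hack. It is not TWiLight Menu's actual code and it leaves out the extra conditions mentioned above; the function name `zero_waitcnt_constants` is made up for this example, and it assumes literal pool entries are word-aligned.

```python
import struct

WAITCNT_ADDR = 0x04000204


def zero_waitcnt_constants(rom):
    """Replace every word-aligned 0x04000204 constant with zero.

    Returns the patched ROM and the list of patched offsets.
    Assumption: literal pool entries are 4-byte aligned, so only
    aligned offsets are checked.
    """
    needle = struct.pack("<I", WAITCNT_ADDR)
    patched = bytearray(rom)
    offsets = []
    for off in range(0, len(rom) - 3, 4):
        if rom[off:off + 4] == needle:
            patched[off:off + 4] = b"\x00\x00\x00\x00"
            offsets.append(off)
    return bytes(patched), offsets
```

Note that this blindly patches data that merely happens to look like the address, which is one of the reasons the approach is "obviously wrong" in general.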

Emulation based approach

A very simple solution is to use an emulator, make it capture said accesses (just capture any writes to 0x04000204) and somehow do this for every single game we want to patch.

Interestingly enough this is actually feasible since most games write WAITCNT at startup (so we do not really need to play the game to capture these events). It is also likely to be incomplete for a subset of games, since many also update WAITCNT during SRAM load/save, at a later point in time, or just feature multiple games in one.

This approach, albeit incomplete, is useful to cross-validate any other approach and can also serve as a final validation (we can run patched games and verify they are indeed correctly patched!).
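The capture side can be sketched as a small write hook. This is a hypothetical Python model, not code from any real emulator: the class name `WriteWatch` and the `on_write(pc, addr, value, width)` callback signature are assumptions about how an emulator's bus could report writes.

```python
class WriteWatch:
    """Record every bus write that touches the 16-bit WAITCNT register."""

    def __init__(self, target=0x04000204, size=2):
        self.target = target
        self.size = size
        self.hits = []

    def on_write(self, pc, addr, value, width):
        # width is the access size in bytes (1, 2 or 4). A write counts
        # if its byte range overlaps [target, target + size).
        if addr < self.target + self.size and addr + width > self.target:
            self.hits.append((pc, addr, value, width))


if __name__ == "__main__":
    watch = WriteWatch()
    watch.on_write(0x080000C8, 0x04000204, 0x4317, 2)  # recorded
    watch.on_write(0x080000D0, 0x04000208, 0x0001, 2)  # ignored (different register)
    print(watch.hits)
```

Logging the program counter alongside the value is what makes the capture useful for patching, since it points at the instruction that has to be neutralised.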

Static-analysis and "emulation"

I had this idea of finding the accesses by performing some sort of static analysis on the ARM/Thumb code. In most cases the register write is pretty simple, usually one of the following possibilities:

A couple of LDR instructions followed by an STR/STRH instruction (Thumb)
A couple of MOV/ORR instructions followed by a STR instruction (ARM)
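To make the Thumb case concrete, here is a minimal Python sketch of a matcher for it. It looks for a STR/STRH with an immediate offset whose base register was loaded from the literal pool a few instructions earlier, and checks whether the pool constant plus the offset equals the WAITCNT address. The function name and the `window` parameter are made up for this example; it treats the whole ROM as Thumb code and ignores everything that happens between the load and the store, so it is a sketch of the pattern, not a reliable detector.

```python
import struct

WAITCNT_ADDR = 0x04000204


def find_thumb_waitcnt_stores(rom, window=4):
    """Return offsets of Thumb STR/STRH instructions that appear to hit WAITCNT."""
    count = len(rom) // 2
    halfwords = struct.unpack_from(f"<{count}H", rom)
    hits = []
    for i, op in enumerate(halfwords):
        top = op & 0xF800
        if top == 0x8000:      # STRH Rd, [Rb, #imm5*2]
            scale = 2
        elif top == 0x6000:    # STR Rd, [Rb, #imm5*4]
            scale = 4
        else:
            continue
        base = (op >> 3) & 7
        offset = ((op >> 6) & 0x1F) * scale
        # Walk back a few instructions looking for LDR base, [PC, #imm8*4].
        for j in range(i - 1, max(i - 1 - window, -1), -1):
            prev = halfwords[j]
            if (prev & 0xF800) == 0x4800 and ((prev >> 8) & 7) == base:
                pool = ((j * 2 + 4) & ~3) + (prev & 0xFF) * 4
                if pool + 4 <= len(rom):
                    (const,) = struct.unpack_from("<I", rom, pool)
                    if const + offset == WAITCNT_ADDR:
                        hits.append(i * 2)
                break  # only the most recent load into the base register counts
    return hits
```

Matching on the base register number rather than a fixed register is the "loose match" part: the compiler is free to pick any low register for the address.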

A simple regex-like pattern finder could be written to find these sequences with some loose-match for register numbers. And it worked well, but only for a subset of the games I tested. After some deeper inspection I found more and more weird cases, likely hand-written assembly or poor C:

Use two loads and/or a STR rX, [rY + rZ] instruction.
Load a different address (ie. 0x04000200) with a +4 offset.
Do some more complicated address math (likely to write other I/O regs).
Have a bunch of unrelated code between address loading and register update.

This is non-trivial pattern matching that requires some complex search. I might need to create manual patches for many games, terrible, I know.

With some perseverance and after inspecting many more games, I realized that most of them had something in common: it was rather easy for a human to find the relevant patching site. I would normally do:

Locate interesting sites (I search for a constant between 0x04000000 and 0x04000400 in Thumb mode, or a mov instruction with a 0x04000000 constant in ARM mode).
In case of a pool constant, I would go back to find the LDR instructions that use it.
"Emulate" the code in my head. This involves tracking the known values, since most registers will have unknown values.

This is something an emulator should be able to do... or almost. The main difference with an emulator is that we do not emulate the binary from the beginning: we ignore loops and other flow instructions and just focus on finding a sequence of instructions that writes the desired address. Granted, we will miss some stuff if we ignore loops and function calls, but the original assumption is that the majority of WAITCNT writes are pretty obvious and don't really involve passing many function arguments around.
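The first two steps of the manual procedure above are mechanical, so they can be sketched in a few lines of Python. The function names are made up for this example, and the code scans the whole ROM without knowing which parts are code, Thumb or ARM, so false positives are expected; it only narrows down where to start looking.

```python
import struct


def find_io_pool_loads(rom, lo=0x04000000, hi=0x04000400):
    """Find Thumb PC-relative LDRs whose literal is an I/O register address.

    Returns (ldr_offset, dest_register, constant) tuples.
    """
    sites = []
    for pool in range(0, len(rom) - 3, 4):
        (const,) = struct.unpack_from("<I", rom, pool)
        if not lo <= const < hi:
            continue
        # LDR Rd, [PC, #imm8*4] reaches at most 1020 bytes past the aligned PC+4.
        for pc in range(max(0, pool - 1024), pool, 2):
            (op,) = struct.unpack_from("<H", rom, pc)
            if (op & 0xF800) != 0x4800:
                continue
            if ((pc + 4) & ~3) + (op & 0xFF) * 4 == pool:
                sites.append((pc, (op >> 8) & 7, const))
    return sites


def find_arm_io_movs(rom, const=0x04000000):
    """Find ARM 'MOV Rd, #imm' instructions whose immediate equals const.

    Returns (offset, dest_register) tuples.
    """
    sites = []
    for pc in range(0, len(rom) - 3, 4):
        (op,) = struct.unpack_from("<I", rom, pc)
        if (op & 0x0FE00000) != 0x03A00000:  # data-processing MOV, immediate form
            continue
        imm8 = op & 0xFF
        rot = ((op >> 8) & 0xF) * 2          # immediate is imm8 rotated right by rot
        value = ((imm8 >> rot) | (imm8 << (32 - rot))) & 0xFFFFFFFF
        if value == const:
            sites.append((pc, (op >> 12) & 0xF))
    return sites
```

The third step, tracking which values are known from each of these sites onwards, is the part that needs more than pattern matching.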

An attempt at a symbolic emulator

Reverse engineering tools like IDA or Ghidra perform static analysis on binaries and help with reversing. They usually do things like tracking constants, performing constant propagation, guessing function signatures, etc. We need to mimic this in order to find the correct writes; in particular, constant propagation is key (since it covers most of our cases). Some of these tools are also used by security researchers to find vulnerabilities in a semi-automated fashion.

To implement this I just went ahead and wrote a simple ARM7TDMI emulator (in Python, because why not; I've lost count of how many ARM emulators I've already written) that has some interesting features. The main one is that value propagation and calculation is aware of unknown values. That means that registers and memory locations can have a known or unknown value (I use None as a token for an unknown value). Any regular data operation takes input values into consideration, propagating unknown values as expected (this is similar to Verilog's X values and their propagation). It doesn't do fine-grained propagation (i.e. certain operations can produce known values even with unknown inputs, imagine: and r1, r2, 0) since it is not really necessary.
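A minimal sketch of what this kind of propagation can look like, assuming None stands for "unknown" as described above. The helper name `lift` and the operation names are made up for this example and are not taken from the actual analyzer.

```python
MASK32 = 0xFFFFFFFF


def lift(fn):
    """Wrap a 32-bit binary operation so that None (unknown) is contagious."""
    def wrapped(a, b):
        if a is None or b is None:
            return None
        return fn(a, b) & MASK32
    return wrapped


add = lift(lambda a, b: a + b)
orr = lift(lambda a, b: a | b)
and_ = lift(lambda a, b: a & b)

if __name__ == "__main__":
    print(hex(orr(0x04000000, 0x204)))  # both inputs known, result known
    print(add(None, 4))                 # unknown input, unknown result
    print(and_(None, 0))                # coarse: None, although the real result is 0
```

The last line shows the coarse behaviour: a finer model would know that anything ANDed with zero is zero, but for locating I/O register writes that extra precision is rarely needed.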

With this "emulator" we can go ahead and find WAITCNT writes:

Find an interesting constant (this is really an optimization) or location.
Execute some instructions (usually 2 to 4 thousand).
Record any target address writes.
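The steps above can be sketched as a tiny straight-line tracer. This is a heavily simplified Python illustration, not the actual analyzer: it only understands a handful of Thumb encodings, stops at the first BX or POP {..., PC}, and forgets every register as soon as it sees an instruction it does not model. The function name `trace_thumb` and its parameters are made up for this example.

```python
import struct

WAITCNT_ADDR = 0x04000204
MASK32 = 0xFFFFFFFF


def trace_thumb(rom, start, max_steps=64, target=WAITCNT_ADDR):
    """Run Thumb code linearly from `start`, tracking known register values.

    Returns a list of (store_offset, stored_value_or_None) for stores to `target`.
    """
    regs = [None] * 8  # r0-r7, None means unknown
    writes = []
    pc = start
    for _ in range(max_steps):
        if pc + 2 > len(rom):
            break
        (op,) = struct.unpack_from("<H", rom, pc)
        top = op & 0xF800
        if top == 0x4800:    # LDR Rd, [PC, #imm8*4]
            pool = ((pc + 4) & ~3) + (op & 0xFF) * 4
            known = pool + 4 <= len(rom)
            regs[(op >> 8) & 7] = struct.unpack_from("<I", rom, pool)[0] if known else None
        elif top == 0x2000:  # MOV Rd, #imm8
            regs[(op >> 8) & 7] = op & 0xFF
        elif top == 0x3000:  # ADD Rd, #imm8
            rd = (op >> 8) & 7
            regs[rd] = None if regs[rd] is None else (regs[rd] + (op & 0xFF)) & MASK32
        elif top == 0x0000:  # LSL Rd, Rs, #imm5
            rs = regs[(op >> 3) & 7]
            regs[op & 7] = None if rs is None else (rs << ((op >> 6) & 0x1F)) & MASK32
        elif top in (0x6000, 0x8000):  # STR / STRH Rd, [Rb, #imm5*scale]
            scale = 4 if top == 0x6000 else 2
            base = regs[(op >> 3) & 7]
            if base is not None and base + ((op >> 6) & 0x1F) * scale == target:
                writes.append((pc, regs[op & 7]))
        elif (op & 0xFF00) in (0x4700, 0xBD00):  # BX Rm / POP {..., PC}
            break            # function return: stop tracing
        else:
            regs = [None] * 8  # unmodelled instruction: forget everything
        pc += 2
    return writes
```

Starting the trace at each "interesting" site and collecting the returned offsets gives the list of candidate patch locations; the stored value, when known, tells whether the write actually lowers the wait states.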

This approach leads to amazing results! It obviously required some tweaks due to the lack of proper flow emulation, conditional instructions and flag emulation, but it does work for most games! Some of the smart tweaks that I added:

Tracking invalid instructions and function returns: This way we do not propagate values from one function to the next (which leads to accidental false positives).
Enabling "backtracking" on certain branches: This means running some code snippets more than once with a different set of initial states (to simulate true/false branches and even potentially function calls with known params).
Guessing instruction validity: in Thumb mode there's not that many invalid instructions and sometimes false positives can occur. I track very suspicious instructions (like str r0, [r1, r1]) to suppress false positives.

I validated this with a patched gpsp that captures WAITCNT writes and can also do some frame-to-frame comparison. The result seems to be that all games continue working normally and that the WAITCNT register is never updated. Fun fact: it takes around 6 hours for the analyzer to go through the ROMs using a single thread (this runtime was significantly lower when backtracking was out of the equation).

Next steps

For proper Supercard patching we also need Flash/EEPROM patching as well as other kinds of patches. However these are not as complicated to find and therefore should not need this level of complexity.

In the meantime I'm using and testing them with SuperFW and just published the code and patches for anyone to use. Find them here:

https://github.com/davidgfnet/gba-patch-gen

I'm going to use this approach for some other stuff, like finding IRQ routines (games write a pointer to the hardcoded address 0x03007FFC) so I can patch in my own IRQ handler. This approach should be more reliable than blindly looking for a constant. Stay tuned!