The thing I had worried about ever since I started working on my own toy garbage collector has happened: I hit my first real garbage collector bug.
As expected, this first garbage collector bug was difficult to track down, requiring quite a few hours of staring at stack traces, memory dumps, and even instruction execution traces. And it was not obvious at all that the problem was related to object allocation and garbage collection. All of this is why I wrote a while back that garbage collectors are scary.
Here is how the story unfolded. While working on a support library for byte strings, one of the exception tests hit a curious bug: a 64-bit integer field in an expected exception was not initialized correctly. Instead of the value -1, it contained a number like -4294967264 (0xffffffff00000020 in hexadecimal). This was rather puzzling because single-stepping through the implementation of the alloc instruction showed that the correct value was written:
// _long values.
count = fi & 0x1f000;
if (count) {
  count >>= 12;
  auto p = align_pointer_and_cast<uint64_t>(ptr);
  for (unsigned i = 0; i < count; ++i) {
    FETCH_SOURCE_REG();
    *p = ul3_regs[source_reg].l;
    ++p;
  }
  ptr = (char *) p;
}
And yet, at the end of the allocation, the object somehow did not have the right data. A hardware watchpoint on the memory address (activated using GDB's watch -l command) did not show any unexpected writes to the location, which made this rather mysterious. The issue persisted at the -O0 C++ compiler optimization level, so it was clearly not an aliasing violation.
The answer is related to the pointer alignment step. In my virtual instruction set, the alloc instruction is a bit of an outlier in that it is variable-length: it contains a register list for the initial field values. To execute the instruction, the interpreter iterates through the list, guided by type metadata that provides the width of each field stored into the newly allocated object. If a 64-bit field (a _long field) follows other, smaller fields (or comes directly after the 32-bit object header), there may be a padding gap. This gap is not encoded in the type metadata; instead, it is handled implicitly. Hence the align_pointer_and_cast function call in the code snippet above. (The whole design of the alloc instruction, with a nested bytecode interpreter of a different kind, is a bit dubious in retrospect; it might deserve an article of its own.)
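The helper itself is not shown above. For illustration, here is a minimal sketch of what an alignment helper like align_pointer_and_cast plausibly does (my reconstruction, not necessarily the actual code):

#include <cstdint>

// Sketch of an alignment helper: round the raw allocation pointer up
// to the alignment of T and reinterpret it as a T pointer.  Note that
// this aligns in absolute terms, so it only lands on the intended
// field offset if the object itself starts out 8-byte aligned, which
// is exactly what the bug described below violated.
template <typename T>
T *align_pointer_and_cast(char *ptr)
{
  auto p = reinterpret_cast<std::uintptr_t>(ptr);
  p = (p + alignof(T) - 1) & ~std::uintptr_t(alignof(T) - 1);
  return reinterpret_cast<T *>(p);
}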
Of course, this goes wrong rather spectacularly if the start address of the newly allocated object is not itself 64-bit aligned. This is exactly what happened. As part of the byte string library, I had added an internal function that performed an uninitialized byte array allocation. (The usual byte array allocation instruction produces a zero-initialized array.) It turned out that this new function did not round up the next allocation pointer to the next multiple of 8. Instead, it left it pointing right after the end of the newly allocated array. As a result, subsequent allocations could place objects at addresses that are not multiples of eight bytes.
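In bump-allocator terms, the missing step looks something like this (a simplified sketch; heap_next, heap_limit, and write_array_header are hypothetical names, and the real allocator interface differs):

#include <cstddef>

// Hypothetical allocator state and helpers for illustration only.
extern char *heap_next;
extern char *heap_limit;
constexpr std::size_t header_size = 4; // 32-bit object header
void write_array_header(char *obj, std::size_t length);

char *allocate_byte_array_uninitialized(std::size_t length)
{
  std::size_t size = header_size + length;
  // Round the total size up to the next multiple of 8 so that the
  // next allocation starts out 64-bit aligned.  Omitting this
  // round-up is exactly the bug described above.
  size = (size + 7) & ~std::size_t(7);
  if (static_cast<std::size_t>(heap_limit - heap_next) < size)
    return nullptr; // out of space; the real slow path runs the GC
  char *obj = heap_next;
  heap_next += size;
  write_array_header(obj, length);
  return obj;
}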
Apart from the alloc instruction fiasco, this worked surprisingly well, especially considering that in the current implementation, object pointers need to have their least-significant bit set; otherwise, the garbage collector treats them as non-pointers. (This allows the collector to ignore null pointers and fixnum values at the same time.) Misaligned pointers would not survive garbage collection, which leads to the next question.
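Under this encoding, the collector's pointer test is a single bit check; roughly (a sketch based on the description above, with hypothetical helper names):

#include <cstdint>

// Object references have their least-significant bit set, so a clear
// bit covers null pointers and fixnums with the same test.
inline bool is_object_pointer(std::uintptr_t word)
{
  return (word & 1) != 0;
}

// Hypothetical conversions between tagged and untagged form.
inline char *untag(std::uintptr_t word)
{
  return reinterpret_cast<char *>(word - 1);
}

inline std::uintptr_t tag(char *obj)
{
  return reinterpret_cast<std::uintptr_t>(obj) + 1;
}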
As mentioned in my previous article, it is possible to run garbage collectors in a GC-a-lot mode, where every allocation triggers a garbage collection. Why did GC-a-lot mode not catch this issue sooner? In GC-a-lot mode, after a successful garbage collection, the allocation limit (for the pointer-bumping allocator) is set so that only the next allocation (the one that triggered the collection) succeeds. The allocation after that will again invoke the garbage collector. What I had not considered is that the initial heap size is not zero. While the initial heap is quite small, it is still large enough that garbage collection is never triggered for most tests. As a result, GC-a-lot mode never kicks in.
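In sketch form, the limit handling works roughly like this (hypothetical names; the actual slow path does more work, such as resizing the heap):

#include <cstddef>

extern char *heap_next;
extern char *heap_limit;
extern bool gc_a_lot_mode;
void run_garbage_collection();

// After a collection in GC-a-lot mode, the limit admits exactly the
// triggering allocation, so the very next allocation fails again and
// re-enters the collector.
char *allocate_slow_path(std::size_t size)
{
  run_garbage_collection();
  if (gc_a_lot_mode)
    heap_limit = heap_next + size; // room for this allocation only
  char *obj = heap_next;
  heap_next += size;
  return obj;
}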
I fixed this by special-casing heap initialization. If GC-a-lot mode is enabled from the start, the initial heap is as small as possible, containing only the relocated version of the initial heap data stored in the program image.
I also added a check at the start of each garbage collection that the next allocation pointer is a multiple of eight. This assertion cannot be compiled out; it is active in regular mode, not just in GC-a-lot mode. The test suite now automatically disables tests that are incompatible with GC-a-lot mode (mostly tests that do not complete in any reasonable amount of time due to the increased execution overhead). Part of the problem, I think, was that running the entire test suite in GC-a-lot mode was not a routine activity.
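An always-on check of this kind might look like this (a sketch; heap_next is again a stand-in name):

#include <cstdint>
#include <cstdlib>

extern char *heap_next;

// Unlike assert(), this check does not depend on NDEBUG and cannot
// be compiled out, so it stays active in regular builds as well.
void check_allocation_pointer_alignment()
{
  if (reinterpret_cast<std::uintptr_t>(heap_next) % 8 != 0)
    std::abort();
}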
As usual, if you think you are testing something, but you actually are not, once you start testing it, all kinds of odd issues show up.
The implementation of the array allocation instructions computed the padded array size in bytes (rounded up to the next multiple of eight) and used it on the fast path. By mistake, the original, unpadded size was used on the slow path. This is essentially the same issue as in the uninitialized byte array allocation function that started all this, just in existing code in the virtual machine implementation.
There were other secondary allocation functions that could leave the next allocation pointer misaligned, and with it subsequent heap objects.
Exception allocations for out-of-bounds array accesses used the wrong program counter (PC) to derive the stack map at the allocation site, leading to missed pointers during garbage collection (or pointers identified as non-pointers).
Reviewing PC handling during stack unwinding, I found a somewhat related issue. The exception unwinder used the return address of a call instruction without adjustment when ascending to the caller. As a result, the exception handler for the instruction following the call instruction was used instead of the handler for the call instruction itself.
PC adjustment for call instructions does not matter directly for garbage collection because call instructions do not alter the stack contents, except for the part that is shared with the callee, and that part is covered by the callee's stack map information. However, allocating instructions need a precise PC value at the time of the allocation, as seen before with exception allocation.
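One conventional remedy (a sketch, not necessarily how my fix is implemented) is to back up the return address by one before the range lookup, so that the lookup lands inside the call instruction:

#include <cstdint>

struct handler_info; // hypothetical lookup result
const handler_info *find_handler_covering(std::uintptr_t pc);

// The return address points at the instruction after the call, so
// looking it up directly selects that instruction's handler.  Backing
// up by one byte lands inside the call instruction itself.
const handler_info *find_handler_for_frame(std::uintptr_t return_address)
{
  return find_handler_covering(return_address - 1);
}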
Fixing this in a consistent way revealed that while most instructions update the program counter right after instruction decoding (this simplifies PC changes by the instruction itself), the alloc instruction did not. The fix was to artificially increment the PC around the garbage collector call on the alloc slow path.
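A sketch of what this bracketing might look like on the slow path (hypothetical names for the interpreter state):

#include <cstddef>

struct interpreter_state { const unsigned char *pc; /* ... */ };
char *run_garbage_collection_and_allocate(interpreter_state &, std::size_t);

// Temporarily advance the PC past the variable-length alloc
// instruction so that the collector sees a PC consistent with the
// convention used by the other instructions (PC updated right after
// decoding), then restore it for the interpreter loop.
char *alloc_slow_path(interpreter_state &state, std::size_t size,
                      std::size_t insn_length)
{
  state.pc += insn_length;
  char *obj = run_garbage_collection_and_allocate(state, size);
  state.pc -= insn_length;
  return obj;
}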
There was another defect revealed by a test case with a very large stack frame: an off-by-one error in the bitmap traversal code used for stack pointer identification. The bitmap is encoded as an array of words. Each word has 31 usable bits; one bit is taken up by the bitmap end marker. This means that if the stack map iterator switches words, it has to subtract two from 32 (one for the marker bit, one for the bit that was removed during the iteration), but it only subtracted one. Again, this caused stack pointer misidentification.
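To make the encoding concrete, here is an iterator sketch with the corrected bookkeeping (my own structure for illustration; the real code is organized differently):

#include <cstdint>

// Each 32-bit bitmap word carries 31 payload bits plus one end
// marker bit.
struct bitmap_iterator {
  const std::uint32_t *word;
  std::uint32_t bits;  // unconsumed payload bits of the current word
  unsigned remaining;  // payload bits left in the current word

  explicit bitmap_iterator(const std::uint32_t *start)
    : word(start), bits(*start), remaining(31) {}

  // Returns the next payload bit.
  bool next()
  {
    if (remaining == 0) {
      bits = *++word;
      remaining = 31; // 32 minus the end marker bit
    }
    bool result = (bits & 1) != 0;
    bits >>= 1;
    --remaining; // together with the word switch above, 32 minus two;
                 // forgetting one of the two was the off-by-one
    return result;
  }
};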
I did not correctly migrate the optional heap verifier (a separate function for debugging purposes, not integrated into the garbage collector) to the new pointer encoding that has the least-significant bit set in all object pointers. Most reference arrays would fail verification as a result, but there was no test case that attempted heap verification with reference arrays. I need to maintain an omnibus test that runs heap verification after allocating every object type whose implementation differs slightly, either in the allocation path or in the resulting heap layout.
This is already quite an impressive list for a subsystem that was believed to be relatively bug-free, at least as far as such core correctness problems are concerned.
This was just the initial bug haul. The issue that triggered all this (the misaligned next allocation pointer) is quite close to the garbage collector core, but the other defects are mostly related to identifying the right stack map and processing it correctly. The bugs did not quite end there, though. During subsequent library development, two more issues showed up:
The allocation instructions in the bytecode contain the value of the type information field that needs to be written to the header of the new object. The array allocation instructions do not, but the implementation still used the right type information value. However, a bunch of secondary allocation functions, for things like exception allocation, debug backtrace generation, or indeed the original helper function for allocating uninitialized byte arrays, did not use the right type information. The incorrect values used by those secondary functions worked most of the time, to some degree even in GC-a-lot mode, but some of the values ended up looking like headers of GC-forwarded objects. The fix was to use the right type information field values, and to add a check during heap verification that the heap does not contain any forwarded object pointers.
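The added verification check can be quite small; a sketch, assuming (hypothetically) that forwarded objects are recognizable by a dedicated header bit:

#include <cstdint>

constexpr std::uint32_t header_forwarded_bit = 1; // assumed encoding

// Outside of a collection, no live object may look forwarded.  A
// header resembling a forwarding marker means that some allocation
// path wrote a bogus type information value.
bool verify_object_header(std::uint32_t header)
{
  return (header & header_forwarded_bit) == 0;
}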
Longer term, how the assembler and the virtual machine implementation share well-known type information values probably needs to be reworked, as more and more shared types are added. This issue is again somewhat close to the core garbage collection implementation.
The final bug was again related to stack maps. The assembler iterates through a procedure multiple times, gathering more and more type information. Instructions that can be reached in different ways (fall-through from the previous instruction, a jump to a label on the instruction, or an exception handling control transfer to the instruction) may see different type information in subsequent iterations, but eventually, no type changes are observed, and the type information is complete. (I have a soft spot for such an implementation pattern, which produces an ascending chain of states that eventually becomes stationary.) To generate stack maps, the assembler runs the type checker one final time and copies out the identified types at the start of each instruction. Stack map generation then looks at the types to determine the set of registers that may contain pointers. In this case, the bug was in the assembler: during that final pass, when processing a label, the type information at the label was not propagated forward to subsequent instructions. For example, if a procedure builds a linked list in a loop, the list head starts out as zero (not a pointer), but is overwritten with a pointer (from the latest object allocation) in the loop body. As a result of the bug, the extracted type information at the object allocation in the loop body was based on the original zero register value and did not reflect that, during the second and subsequent iterations, the register would hold a proper object pointer. The net effect was a wrong stack map that did not treat registers as pointers when it should have. It was simple to fix: just use the type information from the label (instead of only that from the previous instruction) when propagating types in the final type extraction pass.
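In sketch form, the corrected final pass merges the label's accumulated type state into the flow state before extracting the types (hypothetical names and data structures; the real assembler is organized differently):

#include <map>
#include <string>
#include <vector>

struct type_state { /* register-to-type mapping, elided */ };
type_state merge(const type_state &a, const type_state &b);
struct instruction {
  const std::string *label; // non-null if a label points here
  type_state transfer(const type_state &in) const;
};

void extract_stack_map_types(std::vector<instruction> &insns,
                             std::map<std::string, type_state> &at_label,
                             std::vector<type_state> &out)
{
  type_state state;
  for (auto &insn : insns) {
    if (insn.label)
      state = merge(state, at_label[*insn.label]); // the fix
    out.push_back(state); // types at the start of this instruction
    state = insn.transfer(state);
  }
}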
This is yet another stack map bug. I think addressing the underlying conceptual issue would require splitting the assembler in two: a high-level macro assembler that contains the type inference algorithm (as today), but does not generate bytecode directly, and a second-stage low-level assembler that consumes fully type-annotated input from the first stage and merely performs one-pass type verification based on those annotations. This way, bugs in the somewhat hairy type inference algorithm could no longer result in buggy programs.
At least one of the important bugs was already present when I wrote the initial article about garbage collector scariness. Against my hopes, GC-a-lot mode did not reveal the bug at that time, mainly because the mode as implemented did not work as expected for short tests with little heap usage. This leaves an uneasy feeling about what kinds of bugs might still be lurking in the shadows. I also encountered a couple of type safety bugs in my assembler, but that was not unexpected given the ad-hoc nature of its type checker.
At least there is a silver lining. As part of debugging the initial slew of bugs, I finally wrote some basic debugging information support. Backtraces for failed test assertions now include procedure names, not just program counter values from the executed program image. This makes programming in this system a bit more tractable, and even fun.
2024-10-20: published