Franklin Pezzuti Dyer


Garbage collection by reference counting in a simple Lisp-like language

I first learned to program in Python and Java, so when I started programming in C and had to deal with manual memory management for the first time, it made me wonder about all of the bookkeeping that must have been happening behind the scenes in those higher-level languages. Since then, I've wanted to understand garbage collection more deeply, as part of my long journey to understand computer programming at a lower level.

In this post, I'll describe a recent project of mine in which I wrote a very basic programming language that is similar to Lisp/Scheme and uses reference counting for garbage collection. Mind you, this is not the kind of garbage collection algorithm that Java uses. Apparently Python does use reference counting, but it also needs to detect reference cycles, which is a complexity that we won't have to deal with in our language since it doesn't allow mutation. In any case, this project has been one step in the direction of understanding garbage collection more generally.

You can find my source code in this GitHub repo.

The Pairadise language

I was trying to think of a project that would give me some practical experience implementing reference counting, but I don't enjoy writing parsers. So I settled on a very simple Lisp-like language of my own invention whose only values are NULL and pairs of other values. The only built-ins for this language are the handful of constructs you'll see in the example programs below: the empty value empty, the pair constructor pair, case analysis on a value with cond, recursion with fix, bindings with let, the input expression in, and lambda abstractions written like \x0.

If you're familiar with the formal syntax for describing the statics and dynamics of a programming language (as in Harper's Practical Foundations for Programming Languages), you might find it easier to read than my informal descriptions. If not, you can just skip past the gobbledygook below. I'm only just beginning to learn about this formalism myself, so I might have gotten them wrong... but here is how I would describe the statics of this language:

And here is how I would describe the dynamics (except for the expression in, which I'm not sure how to describe since it is not referentially transparent, drawing its value from input):

If anyone reading this knows type systems better than I do, please correct my attempt. :-)

Anyways. I don't like coming up with funny names for things, but I'm giving this language a name because it's too awkward to keep saying "my toy programming language". For the sake of this blog post I'm (sarcastically) dubbing the language Pairadise because all of its values are based on pairs, and programming in it absolutely does not make you feel like you are in paradise.

To give you an example of what a program looks like in Pairadise (ha), here's a program that takes a number as input and doubles it. In this language, I represent numbers as pair trees that are "heavy on the right", with zero being *, one being (*,*), two being (*,(*,*)), three being (*,(*,(*,*))) and so on.


let(
    in,
    \x0 cond(
        fix(
            pair(x0,empty),
            \x1 cond(
                x1,
                empty,
                \x2,x3 cond(
                    x2,
                    pair(empty,x3),
                    \x4,x5 pair(x5,pair(empty,pair(empty,x3)))
                )
            )
        ),
        empty,
        \x1,x2 x2
    )
)

Yeah, it's ugly, I know. But I like the challenge of constructing programs out of the simplest building blocks possible, which has the added benefit of requiring me to do the least amount of tedious parsing possible! Here's another program, which computes the triangular numbers $1+2+\cdots + n$:


let(
    in,
    \x0 cond(
        fix(
            pair(x0,empty),
            \x1 cond(
                x1,
                empty,
                \x2,x3 cond(
                    x2,
                    x1,
                    \x4,x5 pair(
                        x5,
                        cond(
                            fix(
                                pair(x2,x3),
                                \x6 cond(
                                    x6,
                                    empty,
                                    \x7,x8 cond(
                                        x7,
                                        x6,
                                        \x9,x10 pair(x10,pair(empty,x8))
                                    )
                                )
                            ),
                            empty,
                            \x6,x7 x7
                        )
                    )
                )
            )
        ),
        empty,
        \x1,x2 x2
    )
)

Yowza! If you're looking for a challenge and are willing to indulge this bare-bones language, see if you can write a Pairadise program that takes an arbitrary value as input and mirrors it. For example, (*,(*,*)) should become ((*,*),*), and (*,((*,*),*)) should become ((*,(*,*)),*). (My solution has 405 characters excluding whitespace.)

Algorithm for reference counting

The basic idea behind reference counting is as follows: when a structure is allocated in memory, up until the time when it is freed, a running count is maintained of the number of pointers to it that exist at any given time. When it is created, its reference count equals 1. From that point onwards, no pointer to that structure may be copied without first incrementing its reference count, and no pointer to that structure may be erased without first decrementing its reference count. Once the reference count reaches zero, the structure is deallocated - and this can occur without risk of leaving behind a dangling pointer, because there should be no pointers left to dereference once the reference count becomes zero.

Of course, we cannot literally wait until after the reference count has reached zero to free the memory, because once the last pointer is gone there is nothing left to free it through. Instead, we free the memory at the moment when its last remaining pointer is about to be erased.

In Pairadise, the only values are pairs and the empty value. Each pair may contain pointers to up to two other pairs, meaning that when a pair is finally freed due to its reference count reaching zero, up to two more pointers will be erased. This could result in yet more pairs being freed, which could result in even more pointers being erased and more pairs being freed, and so on.
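Concretely, a heap-allocated pair might look something like the struct below. This is only an illustrative sketch: the field names are my own and need not match the ones used in the actual interpreter; the empty value is taken to be a NULL pointer, in line with the language's two kinds of values.


/* Illustrative sketch only; the real struct in the repo may differ. */
typedef struct pair_val {
    struct pair_val* first;   /* first component (NULL for the empty value) */
    struct pair_val* second;  /* second component (NULL for the empty value) */
    int refcount;             /* number of live pointers to this pair */
} pair_val;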

Here's an example. If we had a value corresponding to the nested pair expression ((*,*),((*,*),(*,*))), the memory allocated for it might look like this:

Fig 1

If the pointer to this value were suddenly deleted, this would trigger a recursive deletion of every structure in this tree. These pair structures would be freed in the following sequence:

Fig 2

Fig 3

In this example, the pairs involved in the structure and their respective pointers comprise a tree. However, I allow memory to be shared between values in my implementation, meaning that values might not always be trees - they could be arbitrary directed acyclic graphs. For example, in the value (((*,*),(*,*)),((*,*),(*,*))), all of the pairs could be stored disjointly in memory:

Fig 4

or the subvalues of shape ((*,*),(*,*)) could overlap in memory:

Fig 5

or all of the subvalues of shape (*,*) could overlap in memory instead:

Fig 6

And so on.
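To make the sharing picture concrete, here is a hypothetical fragment using the struct fields sketched above and raw malloc purely for illustration (the interpreter itself only builds pairs through the helper functions described in the next section). It shows a single (*,*) cell shared by both components of its parent, and where the reference counts end up.


/* Illustration only: one (*,*) cell shared by both slots of its parent. */
pair_val* shared = malloc(sizeof(pair_val));   /* a single (*,*) cell */
shared->first = NULL;
shared->second = NULL;
shared->refcount = 1;                          /* the local pointer 'shared' */

pair_val* parent = malloc(sizeof(pair_val));   /* the value ((*,*),(*,*)) */
parent->first = shared;                        /* two more pointers to 'shared'... */
parent->second = shared;
shared->refcount += 2;                         /* ...so its count becomes 3 */
parent->refcount = 1;

/* Erasing parent's two component pointers later drops shared's count by 2;
   shared is only freed once its count reaches 0, so sharing is safe. */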

Implementation of reference counting

The reference counting algorithm is straightforward enough to describe, but when implementing it in C it's easy to lose track of pointers: I had to debug several memory leaks and double-free bugs.

In the end, I found that the best way of keeping track was to write just a few helper functions that manipulate pair pointers while maintaining the correct reference counts, and then use them almost exclusively for handling pair pointers. I used the following three functions:


void delete_pair_pointer(pair_val** pointer);
void assign_pair_pointer(pair_val** pointer, pair_val* pointee);
pair_val* new_pair_val();

The delete_pair_pointer function "deletes" a pair pointer passed by reference by replacing its value with NULL and updating the refcount of the corresponding pair, freeing it and recursing when appropriate. The assign_pair_pointer function copies a pair pointer into another pair pointer, incrementing the refcount. Finally, new_pair_val is the only function that I allowed myself to use for creating brand new pair pointers, and it is the only place where malloc is called to allocate pair values. It initializes the reference count of the new pair value to 1, corresponding to the single pointer returned by new_pair_val.
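To give a sense of what these helpers involve, here is a rough sketch of how they might be implemented in terms of the struct fields guessed at earlier. This is my reconstruction from the descriptions above (with error handling omitted), not the actual code from the repo, which may differ in detail.


#include <stdlib.h>

/* Sketches reconstructed from the descriptions above, assuming the
   pair_val struct fields guessed at earlier; the real code may differ. */

pair_val* new_pair_val() {
    pair_val* pv = malloc(sizeof(pair_val));
    pv->first = NULL;
    pv->second = NULL;
    pv->refcount = 1;         // accounts for the one pointer being returned
    return pv;
}

void assign_pair_pointer(pair_val** pointer, pair_val* pointee) {
    if (pointee != NULL)
        pointee->refcount++;  // a new pointer to the pointee now exists
    *pointer = pointee;
}

void delete_pair_pointer(pair_val** pointer) {
    pair_val* pv = *pointer;
    *pointer = NULL;          // the pointer itself is erased
    if (pv == NULL)
        return;
    pv->refcount--;
    if (pv->refcount == 0) {
        // No pointers are left, so erase the two component pointers
        // (possibly cascading recursively) and free the pair.
        delete_pair_pointer(&pv->first);
        delete_pair_pointer(&pv->second);
        free(pv);
    }
}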

To give you an example, under my self-imposed restriction to use these functions exclusively, the following code would not be allowed:


pair_val *pv1, *pv2;
pv1 = new_pair_val();
pv2 = pv1;

because the assignment pv2 = pv1 duplicates the pointer in pv1 without updating the reference count accordingly. Instead, the following code should be used to accomplish the same result while maintaining the correct refcounts:


pair_val *pv1, *pv2;
pv1 = new_pair_val();
assign_pair_pointer(&pv2, pv1);

There is a sneaky "gotcha" that makes it a little bit tricky to use these functions exclusively for the creation and destruction of pair_val pointers: a pair_val* can be lost just by going out of scope. If a non-null pair_val* goes out of scope, it is lost without its refcount being decremented. For example, in this code:


if (condition) {
    pair_val* pv = new_pair_val();
    // Do some things with pv...
}

at the end of the if block, pv goes out of scope, destroying the pointer it holds. This could result in a pair_val residing in memory somewhere with a refcount of $1$ but no pointers pointing to it, causing a memory leak, since this pair_val could never be freed. Instead, non-null pair_val* variables must be deleted using delete_pair_pointer before they go out of scope.


if (condition) {
    pair_val* pv = new_pair_val();
    // Do some things with pv...
    delete_pair_pointer(&pv);
    // Now pv is NULL when it goes out of scope
}

The one exception is when a pair_val* variable goes out of scope in the body of a function that returns its value. The local variable holding the pointer is lost when the function returns, but a copy of that pointer is handed back to the caller, so there is a net zero gain/loss of pointers to its memory. For example, if we have a function that looks like this:


pair_val* f() {
    pair_val* pv = new_pair_val();
    // Do some stuff with pv...
    return pv;
}

then when f() is called, a pointer is lost as pv goes out of scope, but one is also gained from pv being returned. Note that this means it is illegal to call f() on its own without assigning its value to some variable, because the return value of f would be lost, leaving an incorrect refcount. Instead, whenever f is called, it must appear on the right-hand side of an assignment somevar = f().
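So a typical call site under these rules would look something like the fragment below (f and somevar are just the placeholder names from above): the plain assignment is fine because the returned pointer already carries its reference in the count, and the variable still has to be released before it goes out of scope.


pair_val* somevar = f();        // the returned reference is transferred to somevar
// Do something with somevar...
delete_pair_pointer(&somevar);  // release it before somevar goes out of scope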

I won't go through my code line-by-line, but you can see the most important bits of it here. While working on it, I couldn't help thinking of how nice it would be for all of these constraints to be enforced by the compiler, so that I wouldn't have to spend time debugging segfaults or memory leaks caused by careless violations of these rules. This makes me appreciate languages like Rust that use linear type systems to track ownership at compile-time.

Some pretty graphs of pair value lifetimes

Since every malloc of a pair_val* occurs in new_pair_val, and every free of a pair_val* occurs in delete_pair_pointer, it is very easy to have the interpreter log every allocation and deallocation. Just for shits and giggles, I've written a simple Python script to convert a log of these debug statements into a Gantt chart that shows the lifetimes of each chunk of allocated memory, some of which are allocated and freed multiple times. For example, this is what the chart looks like when my natural-number doubling program is called on (*,(*,(*,*))):

Fig 7

Here's what the malloc lifetimes look like when my triangular-number program is called on the input (*,(*,(*,(*,*)))):

Fig 8

And here's what the malloc lifetimes look like when my tree-reflecting program is called on the input (*,(((*,*),(*,*)),*)):

Fig 9

