Erlang and the Great Computer Language Shootout

Brief Adventures Optimizing Code in a Very High Level Language

James Hague
October 23, 2002
james [at] dadgum.com

Updated October 28, 2002

Doug Bagley's Great Computer Language Shootout is a neat idea: a collection of little benchmarks written in several dozen languages for the purposes of comparing speed, memory usage, and lines of code. The concurrent functional language Erlang wasn't doing well in the rankings, so I decided to try to improve the Erlang benchmark performance. Now micro-benchmarks can be misleading or even worthless. Why bother computing the primes between 2 and 8192? We already know what they are! And isn't the point to write clean code, and not to trade increased performance for complexity and obscurity?

The "clean code" aspect is what made this interesting for me. Some of the programs written in other very high level languages didn't look at all like they would if the author were presenting them in an introductory textbook. A pretty version of the code might use higher order functions, for example, but the speed-is-everything code uses unchecked array types and a programming style that makes ML look like C. To me, the point of higher level languages is to leave as much to the computer as possible. Dynamic typing? You bet. Garbage collection? Definitely. No worrying about how to represent an integer, even if that integer is too large to fit in a 32-bit word? Exactly. So I don't see it as a failing that Erlang isn't in the top spot. I see it as an amazing achievement that such an expressive language fares so well.

The following notes are about how I approached some of the benchmarks. Not all of these improvements have been submitted to the shootout yet. Note that Doug stopped maintaining his version of the shootout, but a new version lives on.

Array Access Benchmark

This program didn't exist in Erlang, so I wrote one. Array handling is one of the weaker aspects of the current Erlang implementation. Tuples can be indexed and updated like arrays, but the updating is not in place. Changing element 50 of a 1,000,000 element tuple results in a new 1,000,000 element tuple. I suspect this is why no one attempted the array access benchmark in Erlang.

My first attempt was to use the process dictionary, which is just a hash table. The benchmark uses two arrays, x and y, so I chose to represent an array element as {x,5}, for example. This put the benchmark on the map, but unfortunately the performance was down at the bottom of the list with TCL.

The benchmark works like this: fill array x with integers, fill array y with zeroes, then add array x to array y one element at a time. Using the process dictionary, the array addition step required three hashes per element: one to read a value from x, one to read a value from y, and one to write the sum back into y. By switching from the process dictionary to an Erlang Term Storage (ETS) table, I was able to use the ets:update_counter function to reduce this to two hashes per element. This resulted in a 30% speedup, moving Erlang up past Ruby and a few other languages.

Finally, I noticed that the x array was never updated after being initialized. If it were a tuple, then it could be indexed directly with element/2, dropping to only one hash per element. This reduced the y = y + x loop to:

calc(X, 0) -> ok;
calc(X, N) ->
    ets:update_counter(y, N, element(N, X)),
    calc(X, N - 1).
I waffled a bit over whether or not this met the spirit of the benchmark, but I think it does. The critical portion of the benchmark, the above loop, is implemented with indexing, not using lists. This version took only 1/3 the execution time of the original, putting Erlang ahead of Guile, Perl, Icon, and Python, which is respectable for a language with poor array handling.

Hash Access Benchmark

This is an odd benchmark, in that half of the program is a format_hex function, and there are list_to_atom calls in both loops. It took me a while realize that the list_to_atom calls were only there because an atom is presumably the closest analog to a hash key in languages like Perl. But in Erlang you can use any data object as the key for an ETS table. It's faster to use a list like "1234" as a key directly, than it is to convert it to the atom '1234' first, at least in this benchmark. Getting rid of all the atom construction sped up the program by 400%!

Hashes, Part II

The key to this benchmark turned out to be, surprisingly, the following expression:
lists:append("foo_", integer_to_list(I))
lists:append, if you peek into the lists module, is simply defined as:
append(L1, L2) -> L1 ++ L2.
This is because the "++" operator was a later addition to the language. The first expression above can be written directly as:
"foo_" ++ integer_to_list(I)
Though I wouldn't have guessed it, switching to the latter version significantly speeds up the program. In addition to avoiding an inter-module function call, I can only assume that the compiler generates special code to handle appending to a constant list.

Sieve of Erathostenes

This is a particularly interesting benchmark in a functional language, because it is written using lists rather than an array of booleans. Instead of marking all multiples of a number as not prime in an array, a new list is created which simply does not contain those multiples. Not surprisingly, all of the heavy list rewriting makes this benchmark very dependent on the speed of memory management and garbage collection. Other than removing a bit of unnecessary consing, I initially didn't make much progress on speeding this benchmark up. Even some cheating, like just counting primes instead of building a list of them and getting the length of that list at the end, didn't make a noticeable difference. Neither did caching the initial list of 8191 values between iterations. Run times varied widely, too.

A neat realization, though, was that the six line sieve function which removes all multiples of a number from a list, can be replaced with a concise list comprehension:

[X || X <- L, X rem N =/= 0]
with little speed penalty. This, plus a few other tweaks, reduced the number of lines of code from 30 to 17.

A few days later, I started wondering if the initial flurry of list creation, such as the call to lists:seq(2, 8192), was causing the heap size to be exceeded, and the heap reallocated, a number of times in quick succession. The default size for a new process is two hundred and some odd words, but just the initial list of 8191 elements requires 8191*2 words. Even if the heap size repeatedly doubled from 200 to 400 to 800 to 1600, that still wouldn't be enough space for the initial list.

As an experiment, I spawned a new process for the main sieve loop using the spawn_opt function, which includes an option for the new process's minimum heap size. Using a size of 50,000 words decreased the overall runtime by 20%, which is a bit shocking, considering that the sieve loop runs 900 times, and the heap expansion should be finished after the first iteration. Further increases in initial heap size, even large values like 2,000,000 words, resulted only in slight performance increases.

After making this change, I again tried creating the list of 8191 values once and passing the same list into the sieve function for each of the 900 iterations. Is this cheating? After all, the C version has to clear a boolean array of values for each iteration, so why shouldn't the Erlang implementation have to recreate the list of 8191 values? The reason is because the re-initialization is required in C. Maybe creating the array of values once and then using memcpy to put it in another array for each iteration would be faster, but the difference probably doesn't matter much. In Erlang, lists:seq(2, 8192) is much less trivial, and the key point is that there's no need to recreate the list for each iteration. The list isn't destructively updated, so there's no good reason to discard it and create a new one. Computing the list once instead of 900 times, when no longer masked by memory management quirks, shaved another 7-8% off of the original benchmark time.