Coming Soon to a Compiler Near You

Posted on 14 May 2011 by David Chisnall

I've talked a bit about some of the optimisations and potential optimisations for Smalltalk and Objective-C that are possible with LLVM. The GNUstep runtime has a directory of optimisations, but previously they've been somewhat cumbersome to run. You had to add -emit-llvm to your clang command line, then run opt on all of the emitted bitcode files, then use llc to convert them to native binaries. Persuading GNUstep Make to do this was basically impossible.

The problem is that clang, like most other LLVM front ends, requests a default set of optimisations to run when generating code. This set is hard-coded in an LLVM header. If you want clang to run any additional optimisations, you need to modify this header before compiling clang, which is not ideal.

Last week, I spent some time hacking on LLVM and rewriting that code to allow plugins (plugs-in?) to modify the default set of passes. This is still pending review before being added to LLVM, but the code using it is in the GNUstep runtime tree already, so once the patch is committed you can use it immediately.

Currently, you still need to specify the path to the plugin, which is not ideal: it should be loaded automatically. Worse, because it's an LLVM plugin, you actually need to pass it to clang's cc1 equivalent. This means that you need to add something like this to your CFLAGS: -Xclang -load -Xclang {llvm/install/path}/

Once you've done this, the plugin is loaded, and the various passes will be added depending on the optimisation level. At -O2, they'll all be run (except the profile-driven ones). This means:

  • Instance variable references will be made fragile, if doing so will not break the public ABI
  • Class lookups will be cached
  • Class message lookups will be cached
  • Class methods will be inlined, if possible
  • Message sends in loops will be cached

This kind of list is meaningless without benchmarks, so here's a simple one. This contains a couple of loops, one sending class messages and one sending instance messages. It uses clock() to record the amount of CPU time taken for the entire microbenchmark. Here you can see the results from compiling the program with GCC, with Clang, and with Clang and the plugin:

$ gcc -O3 -std=c99 loop.m -L /Local/Library/Libraries/ -lobjc && ./a.out 
16.648438 seconds.  
$ clang -O3 -fobjc-nonfragile-abi loop.m -L /Local/Library/Libraries/ -lobjc && ./a.out 
15.312500 seconds.  
$ clang -O3 -fobjc-nonfragile-abi -Xclang -load -Xclang `llvm-config --libdir`/ loop.m -L /Local/Library/Libraries/ -lobjc && ./a.out 
3.539062 seconds.  

Don't read too much into the difference between the first two. I just pasted in the results of running each command. Because this was done in a VM, the timing is not 100% accurate, and the jitter between results was about as big as the difference between the clang and the gcc results here.
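For readers who want to try something similar, here is a minimal sketch in plain C of timing a loop with clock(). It is not the original loop.m: busy_work() and time_busy_work() are hypothetical names, and the loop body is a stand-in for the message sends in the real benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the benchmark loops; NOT the original loop.m.
 * The volatile accumulator stops the compiler deleting the loop. */
static unsigned long busy_work(unsigned long iterations)
{
    volatile unsigned long sum = 0;
    for (unsigned long i = 0; i < iterations; i++) {
        sum += i;
    }
    return sum;
}

/* Runs busy_work() and returns the CPU time it consumed, in seconds.
 * The loop's result is stored through `result`, which must not be NULL. */
double time_busy_work(unsigned long iterations, unsigned long *result)
{
    assert(result != NULL);

    clock_t start = clock();
    *result = busy_work(iterations);
    clock_t end = clock();

    /* clock() counts processor time in units of 1/CLOCKS_PER_SEC seconds. */
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_busy_work() several times and comparing the returned values gives a rough idea of how much jitter to expect on a given machine before reading anything into small differences.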

The big difference, of course, is the final result: less than a quarter of the time taken for the gcc-compiled version to run. The benchmark sent five times as many class messages as instance messages, and with the first two results the amount of time spent sending each was the same. This was due to the large overhead of calling objc_lookup_class() for every class message. You can see evidence of this in the GNUstep code, which is littered with static variables that cache classes to avoid the lookup overhead.
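The hand-written caching idiom looks roughly like the following, shown here as a self-contained C sketch rather than real GNUstep code. Everything prefixed toy_ is hypothetical: toy_lookup_class() is a stand-in for objc_lookup_class(), and the counter exists only so the effect of the cache can be observed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature class table standing in for the runtime's. */
struct toy_class { const char *name; };

static struct toy_class toy_classes[] = {
    { "NSObject" }, { "NSString" }, { "NSArray" }
};

/* Counts slow lookups so the effect of caching is visible. */
unsigned long toy_lookup_count = 0;

/* Slow path: find a class by name (stand-in for objc_lookup_class()). */
struct toy_class *toy_lookup_class(const char *name)
{
    assert(name != NULL);
    toy_lookup_count++;
    for (size_t i = 0; i < sizeof toy_classes / sizeof toy_classes[0]; i++) {
        if (strcmp(toy_classes[i].name, name) == 0) {
            return &toy_classes[i];
        }
    }
    return NULL;
}

/* The manual idiom: a static variable that is filled in on first use,
 * so later calls skip the lookup entirely. */
struct toy_class *cached_string_class(void)
{
    static struct toy_class *cache = NULL;
    if (cache == NULL) {
        cache = toy_lookup_class("NSString");
    }
    return cache;
}
```

However many times cached_string_class() is called, toy_lookup_class() runs only once. The cost is that every call site has to carry this boilerplate by hand.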

One of the optimisations cached this lookup automatically, so that overhead was negligible. This drops the cost of class messages to approximately the same cost as instance messaging. Class messages are automatically cached, even if they're not in loops, because the mapping from class message to method rarely changes. We also cached instance method lookups in loops, so the overhead of the message sends was quite low as well. Comparing just the class messages, we have about 7.5 seconds for GCC and about 0.5 seconds with these extra optimisations.
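The loop case can be pictured as hoisting the method lookup out of the loop. The sketch below is plain C with a hypothetical toy dispatch table, not the code the plugin actually generates. It also ignores cache invalidation: a real runtime has to notice when a method is replaced while a cached pointer is still in use.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical method implementation type and two toy methods. */
typedef int (*toy_imp)(int);

static int toy_increment(int x) { return x + 1; }
static int toy_double(int x)    { return x * 2; }

struct toy_method { const char *selector; toy_imp imp; };

static const struct toy_method toy_methods[] = {
    { "increment", toy_increment },
    { "double",    toy_double    }
};

/* Counts lookups so the two loops below can be compared. */
unsigned long toy_method_lookups = 0;

/* Stand-in for the runtime's selector-to-implementation lookup. */
toy_imp toy_lookup_method(const char *selector)
{
    assert(selector != NULL);
    toy_method_lookups++;
    for (size_t i = 0; i < sizeof toy_methods / sizeof toy_methods[0]; i++) {
        if (strcmp(toy_methods[i].selector, selector) == 0) {
            return toy_methods[i].imp;
        }
    }
    return NULL;
}

/* Uncached: one lookup per iteration, as in an ordinary message send. */
int send_in_loop(const char *selector, int value, int times)
{
    for (int i = 0; i < times; i++) {
        toy_imp imp = toy_lookup_method(selector);
        if (imp == NULL) return value;
        value = imp(value);
    }
    return value;
}

/* Cached: look the method up once, then call through the saved pointer. */
int send_in_loop_cached(const char *selector, int value, int times)
{
    toy_imp imp = toy_lookup_method(selector);
    if (imp == NULL) return value;
    for (int i = 0; i < times; i++) {
        value = imp(value);
    }
    return value;
}
```

Both functions compute the same result, but for 100 iterations the first performs 100 lookups and the second performs one.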

Hopefully, by the time LLVM 3.0 is released, if you use clang and have the GNUstep runtime installed, then this should all happen automatically, and you'll get nice fast code without having to do anything.