The Road to CoreObject Part 1: EtoileSerialise
I've now submitted my PhD thesis and more or less wrapped up my Xen Book, so now I'm taking the summer off (apart from the odd article) to work on Étoilé before looking for some kind of job.
This month I've mainly been focussing on the development of CoreObject, specifically the low-level components that it requires.
Many of the ideas here come from a discussion between myself and Nicolas a few years back about how computers suck and how we can make them suck less (a recurring theme, that eventually led to the formation of Étoilé when it turned out that we weren't the only people having this conversation).
CoreObject is intended to be one of the foundation pieces of Étoilé. The current roadmap calls for an experimental version in 0.3, a stable interface in 0.4, and a completely stable version in 0.5.
What is CoreObject? Basically, it's a replacement for a filesystem as a programmer and user interface. Files (in the UNIX sense of the word) never were a good abstraction; an untyped series of bytes is no use to anyone. The operating system needs to deal with things like this, but programmers shouldn't have to.
We already have a much nicer abstraction than a file: the object. Unlike files, objects have all of the structure and introspection that we want in order to be able to interact with them programmatically. In Étoilé, we want to treat everything as an object, and objects as first-class citizens.
At the lowest level, CoreObject provides a persistence layer, and that's what the EtoileSerialise framework is for. It turns out that the Objective-C runtime stores the names and types of all of a class's instance variables, making it almost possible to serialise arbitrary objects without any extra code.
Why almost? Because Objective-C isn't really a language, it's two languages that you are allowed to mix together in the same source file. One of these languages is a close relative of Smalltalk. The other is C. Anything that just uses the Smalltalk components of the language is easy, since the runtime stores all of the type information. Anything that uses only the C component is impossible, because there is no runtime information at all. Most code that we are interested in lies on the boundary.
The EtoileSerialise framework doesn't serialise anything other than objects. Objects, however, have instance variables that can be C types. Some of these are easy. An int or a float, for example, has its type encoded in the object's description, and so we just need some special handlers for the various C types (of which there are not many). The same is true of static arrays. If you say something like 'int foo[10];' in an instance variable declaration, the runtime encodes the size of the array and we can retrieve that and transparently serialise the array. The same is true of structs, where the runtime stores the type for each field.
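To make the "easy" cases concrete, here is a minimal sketch in plain C (not the EtoileSerialise code) that walks an Objective-C style type encoding string, such as 'i', '[10i]' or '{Point=ff}', and counts how many scalar values a serialiser would have to write. The function names count_scalars and scalars_in are made up for this illustration, and anything the sketch does not understand (pointers, unions, objects) simply yields 0.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Count the scalar values described by one type encoding, advancing *p
 * past it. Returns 0 for anything this sketch cannot handle. */
static size_t count_scalars(const char **p)
{
    char c = **p;
    if (c == '\0') return 0;
    if (strchr("cislqCISLQfd", c) != NULL) {   /* intrinsic C type */
        (*p)++;
        return 1;
    }
    if (c == '[') {                            /* static array: [<count><type>] */
        size_t n = 0;
        (*p)++;
        while (isdigit((unsigned char)**p)) {
            n = n * 10 + (size_t)(**p - '0');
            (*p)++;
        }
        size_t per_element = count_scalars(p);
        if (**p != ']') return 0;
        (*p)++;
        return n * per_element;
    }
    if (c == '{') {                            /* struct: {<name>=<field types>} */
        size_t total = 0;
        while (**p != '=' && **p != '}' && **p != '\0') (*p)++;  /* skip name */
        if (**p == '=') (*p)++;
        while (**p != '}' && **p != '\0') {
            size_t field = count_scalars(p);
            if (field == 0) return 0;          /* unsupported field type */
            total += field;
        }
        if (**p != '}') return 0;
        (*p)++;
        return total;
    }
    return 0;                                  /* e.g. '^' (pointer): give up */
}

/* Convenience wrapper: the whole string must be one well-formed encoding. */
size_t scalars_in(const char *encoding)
{
    assert(encoding != NULL);
    const char *p = encoding;
    size_t n = count_scalars(&p);
    return *p == '\0' ? n : 0;
}
```

The point of the sketch is that sizes and field types are all recoverable from the string alone, which is why these cases need no help from the class author, while a pointer encoding gives the walker nothing to go on.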
Dynamic arrays start to get a bit harder. A dynamic array in C is just a pointer. If you say 'int * foo' then the only information available to the serialiser is that foo is a pointer to an integer. It could be an array, or a pointer to an aliased intrinsic. If it's an array, we have no way of knowing the size. Strictly speaking, that is not quite true: on Windows and OS X there are extensions to the malloc() family that let us find the size of a block of memory identified by a pointer, but they are non-portable. They don't help us at all when we have something like 'int **' though, since we don't know if it's a 2D array, an index array, or something else. At this point, the serialiser just gives up.
Fortunately, Objective-C is a nice dynamic language, so we can fudge this slightly. For objects that use low-level components of C, we introduce an informal protocol that asks the object to serialise a given instance variable manually. The method takes the ivar name and the serialiser back end as arguments, and so anything the serialiser can do, this method can too. On the deserialisation end, two other methods are available for fixups. One is the converse of this, requesting that an object manually deserialise an ivar. The other is invoked with no arguments once an entire object graph has been deserialised. Note that the serialise and deserialise methods don't do any type checking for manually serialised things, so you can serialise an int as a C string if you need to. One reason you might want to do this is for an object wrapping a file, where the ivar would contain a file descriptor (an int). If you just store this and reload it, you will get nonsense, so you might instead store the file name and re-open the file on deserialisation.
There are two other pieces of the puzzle. One is named structures. Some structures need special handling, and it is a bit rubbish to expect the developer of every object that needs to handle them to know about this. Fortunately, the runtime system knows the name of structures that are used. To make use of this, you can register a function that handles serialisation of a named structure. The serialiser will then call this whenever a structure of this form is encountered. The other part is versioning.
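The named-structure hook can be pictured as a small registry keyed by structure name. The following is only a sketch in plain C under my own assumptions: register_struct_handler, serialise_struct, the handler signature, and the 'range' and 'text_sink' types are all hypothetical names for illustration, not the EtoileSerialise interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical handler signature: serialise *value into the given back end. */
typedef int (*struct_handler)(const void *value, void *backend);

#define MAX_HANDLERS 16
static struct { const char *name; struct_handler fn; } handlers[MAX_HANDLERS];
static size_t handler_count;

/* Register a handler for every structure with this name. */
int register_struct_handler(const char *name, struct_handler fn)
{
    assert(name != NULL && fn != NULL);
    if (handler_count == MAX_HANDLERS) return -1;
    handlers[handler_count].name = name;
    handlers[handler_count].fn = fn;
    handler_count++;
    return 0;
}

/* Called by the serialiser when it meets a named structure. Returns -1 when
 * no handler is registered, so the caller can fall back to the generic path. */
int serialise_struct(const char *name, const void *value, void *backend)
{
    for (size_t i = 0; i < handler_count; i++)
        if (strcmp(handlers[i].name, name) == 0)
            return handlers[i].fn(value, backend);
    return -1;
}

/* Example: a made-up structure and a made-up text "back end". */
struct range { unsigned location; unsigned length; };
struct text_sink { char text[64]; };

static int write_range(const void *value, void *backend)
{
    const struct range *r = value;
    struct text_sink *sink = backend;
    snprintf(sink->text, sizeof sink->text, "range(%u,%u)",
             r->location, r->length);
    return 0;
}
```

Registering write_range once under the name "range" means that every object containing such a structure gets the special handling, without its author needing to know about it.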
OpenStep's NSObject already has +version and +setVersion: class methods. We make use of these in the serialiser by encoding the version with each serialised class (for subclasses, we encode the version of each class in the hierarchy with the instance variables inherited from that class). The manual deserialiser method takes the version as a third argument. If you change the instance variables of a class, it's easy to add support for deserialising the old version by implementing this method. It is even possible to do this in reverse, by supplying a category on the old object that loads the new one, or to deserialise an object as an instance of a completely different class using the poseAsClass: mechanism.
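As a rough picture of what a version-aware manual deserialiser does, here is a sketch in plain C. Everything in it is assumed for the example: a 'circle' type whose version 0 stored a 'diameter' and whose current version stores a 'radius', and a circle_load_ivar function that stands in for the manual deserialisation method with its three arguments (ivar name, stored data, version).

```c
#include <assert.h>
#include <string.h>

/* Current layout. The hypothetical version 0 stored a diameter instead. */
struct circle { double radius; };

/* Returns 1 if the ivar was handled here, 0 to let the automatic
 * deserialiser deal with it. */
int circle_load_ivar(struct circle *c, const char *ivar,
                     const void *data, int version)
{
    assert(c != NULL && ivar != NULL && data != NULL);
    if (version == 0 && strcmp(ivar, "diameter") == 0) {
        double diameter;
        memcpy(&diameter, data, sizeof diameter);  /* data is untyped bytes */
        c->radius = diameter / 2.0;                /* convert old to new */
        return 1;
    }
    return 0;  /* current version: nothing special to do */
}
```

Old archives keep loading because the class itself knows how to translate the instance variables it used to have.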
We can version classes, but what about objects? It would be nice to have the revision history preserved. We do this by turning a model-controller-view trio into a model-(CoreObject proxy)-controller-view system.
With Objective-C, everything is an object, including the messages you send to objects (the equivalent of method calls in C++/Java). The combination of the message name (selector) and arguments is known as an invocation, represented by the NSInvocation class. Our COProxy object, or a subclass, sits between the model and controller and serialises every message that is sent to the model. This stores the complete revision history of the object. To reload any version of the object, you can just reload an earlier copy and replay the invocations. So that this doesn't take too long, the COProxy object periodically serialises a full copy of the object. Currently this is done every 100 messages. In future it will be configurable.
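The log-plus-snapshot idea is easy to sketch in plain C. This is not COProxy: the 'model' is a stand-in with a single field, an 'invocation' is reduced to a selector and one argument, everything lives in memory rather than being serialised, and all the names are invented for the example. Only the interval of 100 messages comes from the description above.

```c
#include <assert.h>
#include <stddef.h>

enum { SNAPSHOT_INTERVAL = 100, MAX_LOG = 1024 };

struct model { long total; };                   /* stand-in model object */
enum selector { SEL_ADD, SEL_MULTIPLY };
struct invocation { enum selector sel; long arg; };

/* Deliver one recorded message to a model. */
static void invoke(struct model *m, struct invocation inv)
{
    switch (inv.sel) {
    case SEL_ADD:      m->total += inv.arg; break;
    case SEL_MULTIPLY: m->total *= inv.arg; break;
    }
}

struct proxy {
    struct model *target;
    struct invocation log[MAX_LOG];
    /* snapshots[k] holds the model state after k * SNAPSHOT_INTERVAL messages */
    struct model snapshots[MAX_LOG / SNAPSHOT_INTERVAL + 1];
    size_t count;
};

void proxy_init(struct proxy *p, struct model *target)
{
    p->target = target;
    p->count = 0;
    p->snapshots[0] = *target;                  /* version 0: initial state */
}

/* Record the message, forward it, and snapshot every SNAPSHOT_INTERVAL. */
void proxy_send(struct proxy *p, enum selector sel, long arg)
{
    assert(p->count < MAX_LOG);
    struct invocation inv = { sel, arg };
    p->log[p->count++] = inv;
    invoke(p->target, inv);
    if (p->count % SNAPSHOT_INTERVAL == 0)
        p->snapshots[p->count / SNAPSHOT_INTERVAL] = *p->target;
}

/* Rebuild the state after `version` messages: start from the nearest
 * earlier snapshot and replay at most SNAPSHOT_INTERVAL - 1 invocations. */
struct model proxy_restore(const struct proxy *p, size_t version)
{
    assert(version <= p->count);
    size_t k = version / SNAPSHOT_INTERVAL;
    struct model m = p->snapshots[k];
    for (size_t i = k * SNAPSHOT_INTERVAL; i < version; i++)
        invoke(&m, p->log[i]);
    return m;
}
```

The interval is a trade between storage and reload time: a smaller value means more full copies but fewer invocations to replay when restoring an old version.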
I mentioned COProxy subclasses. The reason for needing these is that Objective-C doesn't have the concept of 'const' methods, i.e. methods that are guaranteed not to affect the state of the object. We don't want to bother serialising these, so we will use a subclass for each class we might use as the principal class in our model to automatically pass these through.
We thus have saving, restoring, and versioning of arbitrary objects for free. Since we're greedy, we want more. Let's also have branching and merging. Branching is easy; we just define two objects with the same previous version. What about merging? Well, I think we can do this by re-playing the invocations from the two branches in an interleaved way. This will probably be done with a UI allowing the user to select which invocation should be run next.
Anything else? What about collaboration? Since we are serialising invocations, we can pass them over a network to another user, and they can keep their copy of an object in sync with ours. With a simple locking protocol, we can have bi-directional syncing. The serialiser and deserialiser are split into a front and back end, with the back end defining the storage format. At the moment there is a binary file format and a simple human-readable output-only back end for testing. An XML back end will be added too, allowing objects (including invocations) to be passed over XMPP (Jabber).
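One way to picture the front end / back end split is as a table of callbacks: the front end walks the object and the back end decides what the bytes look like. The sketch below is plain C with invented names throughout (struct backend, text_backend, serialise_point, and the output format itself); it shows a human-readable back end of the kind described above, not the real one.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical back end interface: the front end only ever calls these. */
struct backend {
    void *ctx;
    void (*begin_object)(void *ctx, const char *class_name);
    void (*store_int)(void *ctx, const char *ivar, int value);
    void (*end_object)(void *ctx);
};

/* Human-readable back end that appends to a fixed-size buffer. */
struct text_ctx { char out[128]; size_t used; };

static void text_emit(struct text_ctx *t, const char *piece)
{
    assert(t->used < sizeof t->out);
    size_t room = sizeof t->out - t->used - 1;  /* keep space for '\0' */
    size_t n = strlen(piece);
    if (n > room) n = room;                     /* truncate rather than overflow */
    memcpy(t->out + t->used, piece, n);
    t->used += n;
    t->out[t->used] = '\0';
}

static void text_begin(void *ctx, const char *class_name)
{
    char line[64];
    snprintf(line, sizeof line, "%s {", class_name);
    text_emit(ctx, line);
}

static void text_int(void *ctx, const char *ivar, int value)
{
    char line[64];
    snprintf(line, sizeof line, " %s = %d;", ivar, value);
    text_emit(ctx, line);
}

static void text_end(void *ctx) { text_emit(ctx, " }"); }

struct backend text_backend(struct text_ctx *t)
{
    t->used = 0;
    t->out[0] = '\0';
    struct backend b = { t, text_begin, text_int, text_end };
    return b;
}

/* Front end for a made-up model type: knows the fields, not the format. */
struct point { int x, y; };

void serialise_point(const struct point *p, const struct backend *b)
{
    b->begin_object(b->ctx, "Point");
    b->store_int(b->ctx, "x", p->x);
    b->store_int(b->ctx, "y", p->y);
    b->end_object(b->ctx);
}
```

Swapping in a binary or XML back end would then mean supplying a different set of callbacks, with serialise_point left untouched.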
Oh, one more thing. This also gives us non-destructive editing of any arbitrary object type, from text through images to video, as long as we have an object encapsulating it. Excited yet?
Does it work?
So, what's the current status? Actually, pretty good. The code contains a few 32-bit x86isms that need fixing. Serialisation works for a lot of objects with no modification, and more with a little tweaking. Deserialisation is a bit less finished. Deserialisation of named structures is not finished, and neither is the special code-path for serialising invocations. The COProxy object works, and serialises invocations properly. It does not yet include a mechanism for re-loading them, but the example back end allows you to see that all the required information is saved.
Currently, the build system creates a test app, rather than a framework. This will be changed towards the end of the month, when the first alpha will be ready. The interfaces are still likely to change, however, so don't start integrating it into your code yet. Étoilé 0.3 will have a fairly stable version for everyone to play with and, hopefully, some of the higher levels of CoreObject, which will deal with metadata, indexing, and type conversions too.
Idiot! Use Smalltalk!
Some people will be reading this and thinking 'this whole thing would be much easier in Smalltalk.' This is true. In Smalltalk it would be possible to write the completely generic version, and use the garbage collector to track any aliasing, etc. So, why don't we use Smalltalk? Smalltalk's a great language (ask Nicolas why, but only if you've got a long time to listen to the answer), and easy to learn; even small children can pick it up quickly. It doesn't, however, play nicely with other languages (or even GUIs), and there are not many people who know the language well.
Nicolas has described Étoilé as a 'pragmatic Smalltalk.' We sacrifice some of the nice features of Smalltalk, but gain the ability to make use of lots of legacy C code. Objective-C isn't quite as nice as Smalltalk, although it's close, but we gain a lot more from the Objective-C frameworks than we lose from the language.