News: Stay up to date

The Étoilé community is an active group of developers, designers, testers and users. New work is being done every day. Visit often to find out what we've been up to.

EtoileText, LaTeX, and HTML

Posted on 16 April 2010 by David Chisnall

In a dramatic break with tradition, I've recently been working on some things in Étoilé that are actually useful. When I'm not hacking on Étoilé, the thing I'm actually paid for is writing. I've had two books published, and the third one is currently undergoing technical review.

When I write, I like to use a custom form of semantic markup, using LaTeX syntax. LaTeX, for those who have not encountered it, is the abomination that caused me to lose all respect for Donald Knuth. There is simply no excuse for anyone to design a programming language that does not include concepts like scoping, or any support for structured programming. LaTeX is a programming language that looks like someone (Knuth) thought a Turing Machine was actually a sane programming model, rather than a useful theoretical tool.

So why do I use LaTeX? Two reasons. One is that TeX-style markup is easy to type. Entering XML markup is too distracting, but TeX is very simple and quick to enter, so it doesn't interrupt my flow. The other is that the output is really beautiful.

I don't really use LaTeX though; I use a set of custom macros built on top of it, just as LaTeX itself is built on top of TeX. For example, if I write \keyword{Smalltalk}, then the resulting file will have Smalltalk written in the keyword style and also added to the index. If I write \code{NSObject}, then NSObject is emitted as syntax-highlighted Objective-C code.

I have a few hundred lines of LaTeX code that does this translation. It's the only way to use LaTeX and remain (moderately) sane. Like Lisp, LaTeX is a language for writing languages, rather than a language for using directly.

For my next book, however, the rise of the ePub format (another horrible format, but that's another issue) means that the publisher wants an HTML version of the text.

There are some good tools for mapping LaTeX to XHTML. The best of these is tex4ht, which runs a full TeX virtual machine and then transforms the output into HTML. The problem with this is that it loses all of my nice semantic markup, and styling it is a problem.

I want to be able to define a mapping from my TeX-style semantic markup to XHTML semantic markup. For example, I want \code{NSObject} to become <span class="code">NSObject</span>.

This is where EtoileText fits in. EtoileText is the framework that I've been working on a bit over the last couple of months for editing structured, semantically tagged, rich text. It maintains something conceptually similar to a DOM tree. Text is only stored in leaf nodes of the tree. Each parent node may contain a semantic type and custom presentation attributes.
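To make the shape of that tree concrete, here is a minimal sketch in Python. It is not EtoileText's actual API (which is Objective-C); the class names TextLeaf and Element and the plain_text helper are invented for illustration. It only shows the invariant described above: characters live in leaves, while parents carry a semantic type and optional presentation attributes.

```python
from dataclasses import dataclass, field


@dataclass
class TextLeaf:
    """Leaf node: the only place where characters are stored (hypothetical name)."""
    text: str


@dataclass
class Element:
    """Parent node: a semantic type plus optional presentation attributes."""
    semantic_type: str
    children: list = field(default_factory=list)
    attributes: dict = field(default_factory=dict)


def plain_text(node) -> str:
    """Concatenate the text of every leaf below a node, in document order."""
    if isinstance(node, TextLeaf):
        return node.text
    return "".join(plain_text(child) for child in node.children)


# A paragraph containing one semantically tagged run of code.
paragraph = Element("paragraph", [
    TextLeaf("Every class inherits from "),
    Element("code", [TextLeaf("NSObject")]),
    TextLeaf("."),
])
```

Because the parents hold no characters of their own, retagging a run of text only means changing the semantic type of its parent; the leaves stay untouched.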

You can define translators that map the semantic types to AppKit presentation attributes, and plug it directly into NSTextView. Alternatively, you can use the visitor API to generate some other output. The latter is exactly what I've been doing.
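The visitor idea can be sketched roughly as follows, again in Python rather than Objective-C. The Node and XHTMLVisitor classes and the TAGS table are assumptions made up for this example, not the real EtoileText visitor API; the point is only that the output format lives entirely in the visitor, so the same tree could feed a different back end.

```python
from dataclasses import dataclass, field
from xml.sax.saxutils import escape


@dataclass
class Node:
    semantic_type: str                      # "text" marks a leaf
    text: str = ""                          # only meaningful for leaves
    children: list = field(default_factory=list)


class XHTMLVisitor:
    # Hypothetical mapping from semantic type to (element name, CSS class).
    TAGS = {
        "paragraph": ("p", None),
        "code": ("span", "code"),
        "keyword": ("span", "keyword"),
    }

    def __init__(self):
        self.parts = []

    def visit(self, node):
        if node.semantic_type == "text":
            # Leaves hold the characters; escape them for XML.
            self.parts.append(escape(node.text))
            return
        # Unknown types fall back to a span carrying the type as its class.
        tag, css_class = self.TAGS.get(
            node.semantic_type, ("span", node.semantic_type))
        attr = f' class="{css_class}"' if css_class else ""
        self.parts.append(f"<{tag}{attr}>")
        for child in node.children:
            self.visit(child)
        self.parts.append(f"</{tag}>")

    def result(self):
        return "".join(self.parts)


tree = Node("paragraph", children=[
    Node("text", "Send a message to "),
    Node("code", children=[Node("text", "NSObject")]),
    Node("text", " & friends."),
])
visitor = XHTMLVisitor()
visitor.visit(tree)
xhtml = visitor.result()
```

All of the styling is left to CSS rules on the emitted class names, which is why the semantic types need to survive into the output.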

I've added a simple (and incomplete) parser for TeX-style markup, which constructs an EtoileText tree from my LaTeX sources, preserving all of my semantic tagging. Then, a visitor walks this tree and emits semantic XHTML. Add to this about half a dozen lines of CSS, and you end up with an XHTML version of the LaTeX sources that looks correct: Not exactly the same as the PDF, but how the markup would look if it were typeset for a browser window instead of a printed page. This uses the ETXMLWriter class from

The TeX parsing classes in EtoileText are not meant to be a general TeX parser (this is actually a non-computable problem; reason #135 why I hate LaTeX). They are intended as a set of tools for building special-purpose parsers for semantic markup languages implemented in TeX. The ones that start with the ETTeX prefix correspond to standard LaTeX commands; the ones with the TRTeX prefix correspond to my own set of macros.
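As a rough illustration of what "special-purpose" means here, the Python sketch below recognises only plain text and \command{argument} with a fixed table of known commands, and refuses everything else. The TeXSubsetParser class and its command table are hypothetical and far simpler than the ETTeX and TRTeX classes; in particular, it does not handle bare braces, optional arguments, comments, or macro definitions.

```python
class TeXSubsetParser:
    """Parses only \\command{argument} and plain text; nothing else."""

    def __init__(self, commands):
        # Maps a macro name to the semantic type of the node it creates.
        self.commands = commands

    def parse(self, source):
        nodes, _ = self._parse_until(source, 0, closing=None)
        return nodes

    def _parse_until(self, src, pos, closing):
        nodes, text = [], []

        def flush():
            # Turn any pending characters into a text leaf (a plain str).
            if text:
                nodes.append("".join(text))
                text.clear()

        while pos < len(src):
            ch = src[pos]
            if ch == closing:
                flush()
                return nodes, pos + 1
            if ch == "\\":
                end = pos + 1
                while end < len(src) and src[end].isalpha():
                    end += 1
                name = src[pos + 1:end]
                if name not in self.commands:
                    raise ValueError(f"unsupported command: \\{name}")
                if end >= len(src) or src[end] != "{":
                    raise ValueError(f"\\{name} needs a braced argument")
                flush()
                # Arguments may contain further commands, so recurse.
                children, pos = self._parse_until(src, end + 1, closing="}")
                nodes.append((self.commands[name], children))
                continue
            text.append(ch)
            pos += 1
        if closing is not None:
            raise ValueError("unbalanced braces")
        flush()
        return nodes, pos


# Hypothetical command table: macro name -> semantic type.
parser = TeXSubsetParser({"keyword": "keyword", "code": "code"})
tree = parser.parse(r"Use \code{NSObject} from \keyword{Smalltalk}.")
```

Rejecting unknown commands outright, rather than guessing, keeps the parser honest about only covering the macros that carry semantic meaning in the source.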

The code is very rough around the edges at the moment. When it's a little more complete, I'll split the TR* classes out into a separate example program.

All of these classes are designed to be usable easily from Smalltalk, which means that you can use Smalltalk to extend it, just as you can extend LaTeX with macros.

Over the summer, I hope to spend some time working on a CoreText implementation and improvements to the GNUstep text system, so hopefully we'll be able to produce typeset PDFs from EtoileText trees that look as beautiful as anything that LaTeX produces by the end of the year. Nicolas keeps promising to work on a structured text editing component for Étoilé too, so hopefully I'll be able to work entirely inside Étoilé soon...