
Reimplementing TeX's Algorithms: Looking Back at Thirty Years of Programming - sdesimone
http://www.infoq.com/news/2015/01/implementing-tex-in-clojure
======
ThinkBeat
"According to Glenn, it is very hard to figure out what TeX code is doing,
mostly due to its terseness and extreme optimization"

This quote surprises me. The TeX code is extremely well documented compared to
just about any other piece of source code I have seen.

The literate programming style never caught on, but I think you see the beauty
of it in the TeX source code. If it's hard to read due to the mix of code and
comments, it can be turned into a book, or into a compiled program.

I am also surprised that optimized Pascal from decades ago, which compiles on a
wide variety of different hardware and software platforms, outperforms a modern
reimplementation.

The point being that if it was optimized for a specific CPU on a specific OS,
making it really fast makes sense. But when writing code for "some", or in
reality any (OK, that isn't quite true, but TeX runs on a lot of different
hardware), CPU and any operating system, optimizing code is harder.

It does get compiled by a C backend these days though.

~~~
xpe
For a better sense of what he means, check out the video to get more detail
about some of the complexities.

As a counter-point, a codebase can be well documented and yet be hard to
understand if the abstractions are not meaningful to the reader. I can say
this without attaching any normative judgment to the quality of the code.

If I recall correctly, Glenn found it difficult to reason about the code
because many parts were tangled in complex ways. For example, subsequent
processing steps often triggered previous steps to repeat in surprising ways.

To speculate a little bit, these parts of the code may be well-documented down
in the weeds, but they could still feel non-intuitive in the broader context
if they didn't feel consistent or predictable.

------
taeric
I find it a sad reflection on the "thirty years of programming" that we are a
long way from making anything as a) stable, b) bug free, c) fast, or d)
readable as TeX.

The entire pedagogical point of this exercise seems to have completely missed
its goal, but then a hand is waved and we all learned something valuable
anyway. But what? That for an industry that hates "technical debt," we are
deeply indebted to those who wrote our platforms for all of the points
listed above.

~~~
jahewson
I see this claim a lot, but it's simply not true. TeX is a relic of another
age, frozen in time.

a) TeX isn't stable, it's dead. It can't generate PDF, or PostScript, or HTML.
It can't handle Unicode, it can't use OpenType, TrueType, or even Type 1
fonts. Without the need to conform to external standards, it's easy to declare
a defect to be a feature. Everything that makes TeX usable in the modern
world is a third-party program; there's an entire "life support" ecosystem
written by others, with stability issues, bugs, and warts in abundance.
TeX is by definition "stable" because its maintainer does not wish to develop
it further - and that's not good for TeX, which has now fragmented into
various competing successors.

b) Much of the TeX ecosystem is buggy, CTAN packages frequently conflict with
each other, in ways which are very challenging to resolve. The minimal and
low-level nature of TeX's macro language is the root cause of such bugs,
because it's hard to write macros correctly, and even harder to debug them. By
keeping the core of TeX unchanged and minimal, the responsibility for dealing
with complexity is forced onto macro package authors.

c) As for the claim that it is fast, that is questionable. TeX is needlessly
slow at resolving cross-references, requiring documents to be re-typeset up to
three times in a row. It also spends a lot of time doing I/O on its many macro
files - the end result is waiting 10sec or more to typeset a file which should
have taken under half a second.
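
(For the curious, here is a toy Clojure model of why the reruns happen; it is
purely illustrative, not how any real TeX engine is written. Each run typesets
against whatever label table the previous run wrote out, and things settle
only once that table stops changing.)

    ;; Pretend typesetting: a reference to a not-yet-known label takes an
    ;; extra page, so page numbers shift between runs.
    (defn typeset [labels known]
      (first
        (reduce (fn [[table page] label]
                  (let [width (if (contains? known label) 1 2)]
                    [(assoc table label page) (+ page width)]))
                [{} 1]
                labels)))

    (defn passes-needed [labels]
      (loop [known {} n 1]
        (let [produced (typeset labels known)]
          (if (= known produced) n (recur produced (inc n))))))

    ;; (passes-needed [:intro :results :conclusion])  => 3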

d) TeX is written in Knuth's own private language, WEB, which makes it very
hard to read. WEB uses a complex system of macros which is highly
unconventional by modern standards. The code is highly non-linear, with macros
jumping all over the place. Everything is a global variable. No amount of
explaining or literate programming can make this mess "readable". It requires
the utmost effort and study to extract meaning from - as the OP explains so
well.

If you want to praise any open source project of the past 30 years, I'd
suggest looking at the Linux kernel. It's alive and well, for a start.

~~~
taeric
If we are just allowed to declare things true by adding the word "simply",
then I will counter that it is simply true. :) To the points, though.

a) I actually mostly agree with this concern, with the caveat that I have
fewer problems producing PDFs of documents from TeX than with any other tool I
have ever had the joy of using for the same purpose. I will also note that
even just treating TeX as a "core" for LaTeX, it is still a good example of
stability. Specifically in the sense of "I can still typeset _any_ LaTeX
document I have." This is far from true for opening pretty much any other file
I have. Period. Hell, even just compiling old C programs is less guaranteed.

b) Applications have bugs. Pretty much period. While the overall ecosystem has
unforeseen bugs, especially when you mix and match different contributions, I
still know of pretty much no examples of a core as large as TeX that is
essentially bug free. Care to provide any examples? And, to my point, I would
wager there are more bugs in this clojure rewrite than there are in the
original.

c) This is pretty much only claimed because of the failure of the Clojure
version to achieve similar speeds. I mean, if we are fine with setting
contrived expectations ("should have taken under half a second"), then I don't
understand why I have to wait at all for a fully typeset product to emerge. In
the meantime, I have yet to see a product that competes in speed. Tons of
claims for those that should be able to, but few actual examples.

d) Is it a language you are familiar with? Obviously not. Does it require
effort and study to extract meaning? I'd actually agree that it does. I fail
to see how that is a detriment, though. I have yet to see _any_ solution to
_any_ non-trivial problem that does not require the same.

Specifically, to the complaint about the non-linearity of the code: code is
pretty much always understood in a non-linear fashion. The same goes for many
other topics, actually. One of the things I actually admire about WEB is that
it is non-linear and allows the author to introduce a narrative into the
process. Seriously, this is hugely beneficial once it is understood.

~~~
qznc
> Applications have bugs. Pretty much period. [...] I still know of pretty
> much no examples of a core as large as TeX that is essentially bug free.
> Care to provide any examples?

Linux. You can still run a statically linked executable from 20 years ago.
Of course, Linux today has many bugs, but if we use TeX's definition of
stability we are only allowed to use the features we had 20 years ago.

qmail. Bernstein's bet is still open.

sh, ed, vi, grep, sed, and more of the default UNIX tools provided you
restrict them to their POSIX functionality (no Unicode etc).

~~~
taeric
sh, ed, vi, grep, sed and such are all good examples. But, much smaller than
TeX. Also, probably much less understood at the source level.

Linux is actually a good example. Of course, one of Torvalds' main drivers is
"never break user space." Which is essentially the stability I am talking
about here.

It is also a great example, in that it is also not written in a way that
academia approves of. Is famous for this, actually. (Among many other points,
of course.)

I'll have to look up Bernstein's bet.

And, you'd be surprised how many statically linked binaries from 20 years ago
won't run on modern setups.

~~~
qznc
Bernstein's bet:
[http://cr.yp.to/qmail/guarantee.html](http://cr.yp.to/qmail/guarantee.html)

Since the IBM z-Series just hit the front page, you can add lots of COBOL and
PL/I code which has run unchanged for the last 50 years. However, it is all
proprietary and impossible to estimate.

~~~
taeric
Thanks for the link. That is a rather fun read. It is interesting to read
someone who has "mostly given up on the standard C library." I think I
understand and agree with the reasoning, to be honest, but it is still far,
far from what is recommended in academia. Or industry, for that matter.

The COBOL and related code is an interesting data point. I would be interested
to know just how bug free it all is, versus just used in ways that don't
trigger bugs. Still, they definitely count as long running software, if not
highly ported.

------
mturmon
I read basically the whole of the TeXbook (the spiral-bound book documenting
the TeX language and implementation) many years ago, and found it fun and
quirky, and very educational regarding how typesetting is done, and how
typesetting practices can be transported into a computer implementation.

It was, for example, where I learned what an "em" is, the difference between a
hyphen and an em-dash, what a kern is, what italic correction is, what leading
is, and on and on.

But I found the language description not very clear. There are passages in the
TeXbook that refer to different stages in processing as the stomach, mouth,
gullet, etc. I later read an offhand comment that Knuth did not use a
conventional lexer/parser setup in implementing TeX, and that decision made
TeX more ad hoc from a language point of view. This, in turn, may have made
the macro expansion setup more complex, with its expandable vs. not expandable
tokens, restricted modes, etc.

I wonder if others who have more experience with language design had the same
reaction?

~~~
JadeNB
I think that there aren't many fans of TeX's design. (Well, _I_ rather like
it, but probably because I am not a language designer.) Seibel
([http://lambda-the-ultimate.org/node/3613#comment-51120](http://lambda-the-
ultimate.org/node/3613#comment-51120)) quotes Knuth as saying (in "Coders at
Work"—but the relevant page is not on Google books):

    Mostly I don't consider that I have great talent for language design.

~~~
mturmon
Yes, I guess to put a finer point on it, lex and yacc were available in 1975,
and the first version of TeX was started in 1977, with a 1982 rewrite.

(OT, I just learned that _that_ Eric Schmidt was a co-author of the 1975
version of lex.)

~~~
acqq
Some writers of real compilers (vs. school projects) even today would avoid
lex and yacc. Stroustrup, for example, wrote that in hindsight he shouldn't
have used yacc for his cfront; apparently it brought him more pain than
benefit.

Yacc and lex (and even more modern versions of the parser generators) have
good "educational" value, but are far from being the silver bullets in
practice.

~~~
wglb
Yes. David Conroy, who wrote the first couple of Mark Williams C compilers,
said of yacc: "It makes the hard part harder and the easy part easier".

~~~
acqq
To add my own example: for my first, simple compiler, used in a commercial
application, I used yacc. It looked simple at the start, then I hated it. It
seemed to me that I invested more energy in managing to do things "the yacc
way" than in actually getting anything done. For my more recent one, much more
complex, I avoided yacc in favor of a variant of recursive descent, and I
never missed anything; quite the opposite.
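
(For a toy picture of what that style looks like, here is a hypothetical
recursive-descent sketch in Clojure for a made-up grammar of numbers separated
by + and -; the names and grammar are invented for illustration, not taken
from any compiler mentioned here.)

    ;; Grammar:  expr := number (("+" | "-") number)*
    ;; Tokens are pre-split strings, e.g. ["1" "+" "2" "-" "3"].
    (defn- parse-number [tokens]
      [(first tokens) (rest tokens)])

    (defn parse-expr [tokens]
      (loop [[node remaining] (parse-number tokens)]
        (if (#{"+" "-"} (first remaining))
          (let [op (first remaining)
                [rhs remaining'] (parse-number (rest remaining))]
            (recur [[op node rhs] remaining']))
          [node remaining])))

    ;; (parse-expr ["1" "+" "2" "-" "3"])
    ;; => [["-" ["+" "1" "2"] "3"] ()]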

~~~
wglb
The one production compiler I worked on used YACC because we thought it was a
good idea, which was probably correct in that the PhD who did the parser part
was a student of Hopcroft's. She actually published a paper about error
correction based on the work there.

Any compiler I write these days would do what you are saying--recursive
descent. There is a nice technique called "chart" that helps with this.

~~~
acqq
Thanks. Do you mean this:
[http://en.wikipedia.org/wiki/Chart_parser](http://en.wikipedia.org/wiki/Chart_parser)
It seems it's something used more for natural language parsing and less for
programming languages?

~~~
wglb
That turns out to not be a very good writeup. The published work is
[http://dl.acm.org/citation.cfm?id=801348](http://dl.acm.org/citation.cfm?id=801348)
and Dick Vile did a lot of work on it years after that, using it to describe
an in-house production language.

------
acqq
"Clojure implementation was of course far slower than TeX."

I think that sentence is the most important one in the article. I'd like to
know how much; "far slower" sounds like an order of magnitude or so.

~~~
zokier
I found that bit slightly disappointing, and almost dismissive of the
PL/compiler/optimization research done in the past 30 years or so. I don't
believe that a modern clean rewrite would inherently need to be "far slower"
than the original.

~~~
DennisP
Compared to something extensively hand-optimized by Donald Knuth, it doesn't
seem that surprising, especially considering Vanderburg went with better
abstractions in preference to optimization.

~~~
taeric
I find it hard to consider the abstractions "better" when they are so far
behind the performance of the ones that were used. Higher level? Certainly.
Better?

~~~
dj-wonk
Glenn's goals are clearly tilted towards education and intelligibility. So
yes, the abstractions are better by his definition.

~~~
taeric
I still reject that they are better; moreover, education should strive for
better results. That is, if you define them as better with no argument
allowed, then they are of course better. However, if you were to objectively
compare them... I am unsure why they would be awarded the success. Again,
higher level and/or easier to understand? Sure. Better?? By what metric?

This would be akin to declaring that Newtonian physics is "better" than
quantum mechanics, because it is easier to understand.

Also consider that TeX is still one of the more widely read source programs in
existence. And there have been frighteningly few bugs. There are undoubtedly
more bugs in this rewrite with "better" abstractions than there are in what is
being rewritten. So it loses stability. It is slower. So it loses speed.

What, exactly, is it truly better at?

As indicated in another post, I am unsure that this really succeeded at the
pedagogical goals that were intended. If anything, it almost shines as an
example of how we have gone awry in teaching developers.

~~~
dj-wonk
I found his emphasis on history to be unusual and refreshing, but we certainly
don't need to agree on someone else's goals for their project.

That said, I personally don't see a lot of value in being too critical of
projects with different goals. (That feels a little bit like back-seat driving
where you have a different destination in mind than the person behind the
wheel.) If someone has the motivation and creativity to embark on a project,
good for them.

If you or someone else wants to port TeX with different goals, that's also
fine.

In any case, until Glenn's source code is available, neither one of us can say
very much about the code he has written.

~~~
taeric
Oh, I am not against the exercise. I would even go so far as to applaud the
work. What I do not care for is essentially sloppy science in how we declare
successes.

I mean, this is essentially back-seat driving Knuth, of all people.

And... TeX has been widely ported as is. I am in the crowd that is somewhat
unconvinced it needs a rewrite.

------
johan_lunds
Was there a link anywhere to the source code? Would be interesting to take a
look. I'm currently learning Clojure.

It's not on his GitHub account
([https://github.com/glv](https://github.com/glv)).

~~~
dj-wonk
Good point. I haven't found the Cló source code. To make it educational for
others, I hope Glenn V. shares it.

------
kpghost
The problem with TeX is that it gets most of its current power from the large
library of packages written on top of it (or on top of other packages, e.g.
LaTeX). People expect those to keep running on any new implementation of TeX.
Which means that you need to be almost 100% compatible; you cannot really mess
with the TeX language in any way. So the best you could hope for is a clean
implementation (or whatever we perceive as clean these days) of a processor of
an ugly (again by today's perception) typesetting language. It's not clear
that that is worth the quite substantial effort.

You can of course aim for a new typesetting language and just take the
typesetting algorithms from TeX, but then it's a completely different product,
incompatible with existing TeX packages.

~~~
hyperbovine
The packages are secondary; TeX stripped of all its accouterments already
solves a really really hard problem, and is extremely fast and free to boot.
The rate of bug discovery has slowed to about one a decade so there is really
no need to mess with the internals at this point. It "just works". I think TeX
will be with us for a very long time yet.

~~~
Retra
Hopefully someone will get so fed up with it that they just make a powerful
replacement from scratch. (Maybe with a TeX export utility.)

~~~
TTPrograms
Alternatively, you could rely exclusively on a TeX export utility and make a
clean, less verbose language that compiles to TeX. Then existing packages
could be either wrapped similarly or inserted as TeX blocks.

~~~
_delirium
I think one reason most people wanting a next-gen-TeX project don't aim to
compile to TeX is that some of the biggest pain points of TeX are baked into
the core. So if you want to improve them, you need to change or replace at
least some of the core layout algorithms, not just the front-end input
language. For example a big wishlist item for many years has been some kind of
improvement on TeX's quite frustrating figure placement, possibly with a more
pluggable layout algorithm.

------
phreeza
I think it would be great if there was a concerted effort to do such a
reimplementation (maybe in Clojure, but maybe in another language, e.g.
Python). This could be done in the style of NeoVim; I am sure a similar or
even greater amount of funds could be raised if a project leader with
sufficient credentials stepped up to do it.

The challenges are of course formidable, as detailed in TFA. Any such project
would have to maintain almost full backwards compatibility, unless the huge
amount of work that has gone into CTAN is to be lost.

~~~
marvy
What's TFA?

~~~
JoshTriplett
"The Fine Article". Originated from Slashdot, I believe.

~~~
tunesmith
That's a joke, right? :) I hope that's a joke... I remember RTFM and RTFA from
usenet before the web existed.

~~~
JoshTriplett
The phrase existed long before, but as far as I know with a somewhat different
meaning, due to the Usenet connotation of "article". I was referring
specifically to the usage in the comments of a news post, applied to people
who have only read the other comments but not the actual article.

------
gus_massa
There are many claims that I think are incorrect, misleading or unrelated.

> _According to Glenn, it is very hard to figure out what TeX code is doing,
> mostly due to its terseness and extreme optimization, as outlined above_

The TeX source code is written in the literate programming style, so it's not
"terse" in the standard sense.

> _there was no IEEE standard for floating point arithmetics;_

I always supposed that the integer arithmetic was there to ensure 100%
portability. Every system handles integers identically (if you avoid undefined
behavior), but there are lots of small incompatibilities with floating point
numbers (for example, single, double, or extended precision).
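
(As an aside, TeX's integer approach is fixed-point: dimensions are stored as
whole numbers of "scaled points", 65536 sp to the point, so every machine
computes bit-identical results. A rough Clojure sketch of the idea, with
hypothetical names, not TeX's actual code:)

    (def sp-per-pt 65536)                         ; 1 pt = 65536 scaled points

    (defn pt->sp [pt]                             ; convert a float once, at the edge
      (long (Math/round (* (double pt) sp-per-pt))))

    (defn scale-dim [dim num den]                 ; dim * num/den, rounded,
      (quot (+ (* dim num) (quot den 2)) den))    ; entirely in integer arithmetic

    ;; (pt->sp 10.0)           => 655360
    ;; (scale-dim 655360 1 3)  => 218453  ; a third of 10pt, reproducible everywhere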

> _portability meant supporting almost 40 different OSes, each of them with
> different file system structures, different path syntax, different I /O and
> allocation APIs, character sets;_

The original version of TeX was ASCII only, and using other character sets is
always a problem. (For example, in LaTeX you must use a package like
inputenc.)

~~~
taeric
I would be curious to know if the woven sources were used over the tangled
ones. As probably the only program I own in printed form, the source is much
more readable than I would have ever thought possible.

------
politician
After reading the article, it sounds like a terrible idea to port that sort of
program directly to Clojure (gotos, globals, and mutable state) without
performing an intermediate port to a C-like high-level language first.

TeX may have fewer dependencies (given that it predates many libraries), so
`Go` would probably be a good target for an intermediate port -
imperative/procedural, globals, goto support, static compilation.

Modernize this program in Go, then port it to Clojure.

~~~
dj-wonk
Perhaps Glenn has different goals than you might expect? The video goes into
this better than the InfoQ article.

In short, I think Glenn embraced the differences and challenges of adapting
TeX in Clojure from an educational perspective. I don't think he intended to
port it quickly or verbatim.

Glenn wanted a head-to-head comparison to promote education. To explain, in
Glenn's talk around 11:36
([https://www.youtube.com/watch?v=824yVKUPFjU](https://www.youtube.com/watch?v=824yVKUPFjU))
he says:

> While I'm glad that the source code of TeX is still available to study, I
> sure wouldn't point a new programmer toward it because we should not be
> doing programs today like we did then.

> But what might be really valuable is to have a modern reimplementation
> written in a modern functional style to study alongside [the original TeX]
> to see how much has changed and to appreciate what we have and the value of
> improved tools and techniques and the tradeoffs that have been made.

His slide on his educational goals can be seen at 12:06:

> Illustrate what's changed

> Demonstrate values of expressive code

> Provide real examples of expressing procedural algorithms in functional
> style

> Show how functional programs are easier to reason about

> Show different styles of optimization

------
dfan
It's written in an obsolete programming style, and I'd never write code like
that today, but boy did I learn a lot from reading through the entire source
code of TeX.

