
The Hardest Program I've Ever Written – How a code formatter works (2015) - pcr910303
http://journal.stuffwithstuff.com/2015/09/08/the-hardest-program-ive-ever-written/
======
QuinnWilton
Philip Wadler has a seminal paper [0] on implementing pretty printers by
modeling them using an algebra. There are implementations of the algorithm in
most programming languages nowadays, and some state-of-the-art advances that
help with performance [1].

It's a very elegant algorithm, and is very pleasant to work with. Elixir's
formatter uses a variant of it [2], and it appears in the standard library
[3]. I've personally made use of it to write a code formatter for GraphQL
queries, and it was a very small amount of code for some powerful results.

Definitely take a look if you ever need to do something like this.

[0]
[https://homepages.inf.ed.ac.uk/wadler/papers/prettier/pretti...](https://homepages.inf.ed.ac.uk/wadler/papers/prettier/prettier.pdf)

[1]
[https://jyp.github.io/pdf/Prettiest.pdf](https://jyp.github.io/pdf/Prettiest.pdf)

[2]
[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.2...](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.2200)

[3]
[https://hexdocs.pm/elixir/Inspect.Algebra.html](https://hexdocs.pm/elixir/Inspect.Algebra.html)
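
For readers who want a feel for the document-algebra idea before opening the papers, here is a minimal sketch in Python. It is not Wadler's code (his is a lazy functional program) and not Elixir's `Inspect.Algebra`; it is a strict, stack-based toy in the same spirit. The names `Text`, `Line`, `Nest`, `Concat`, `Group`, `fits`, `render` and the helper `call` are all made up for this illustration.

```python
from dataclasses import dataclass

# The "algebra": a handful of document constructors that compose freely.
@dataclass(frozen=True)
class Text:
    s: str

@dataclass(frozen=True)
class Line:
    flat: str = " "  # what this break collapses to when its group stays on one line

@dataclass(frozen=True)
class Nest:
    indent: int
    doc: object

@dataclass(frozen=True)
class Concat:
    parts: tuple

@dataclass(frozen=True)
class Group:
    doc: object  # laid out flat if it fits, otherwise its Lines become newlines

def fits(remaining, stack):
    """True if the pending items fit in `remaining` columns up to the next real newline."""
    stack = list(stack)
    while remaining >= 0 and stack:
        indent, flat, doc = stack.pop()
        if isinstance(doc, Text):
            remaining -= len(doc.s)
        elif isinstance(doc, Line):
            if not flat:
                return True  # a real newline ends the current line
            remaining -= len(doc.flat)
        elif isinstance(doc, Nest):
            stack.append((indent + doc.indent, flat, doc.doc))
        elif isinstance(doc, Concat):
            stack.extend((indent, flat, part) for part in reversed(doc.parts))
        elif isinstance(doc, Group):
            stack.append((indent, flat, doc.doc))
    return remaining >= 0

def render(doc, width):
    out, column = [], 0
    stack = [(0, False, doc)]  # (indentation, flat mode?, document); top is the end
    while stack:
        indent, flat, d = stack.pop()
        if isinstance(d, Text):
            out.append(d.s)
            column += len(d.s)
        elif isinstance(d, Line):
            if flat:
                out.append(d.flat)
                column += len(d.flat)
            else:
                out.append("\n" + " " * indent)
                column = indent
        elif isinstance(d, Nest):
            stack.append((indent + d.indent, flat, d.doc))
        elif isinstance(d, Concat):
            stack.extend((indent, flat, part) for part in reversed(d.parts))
        elif isinstance(d, Group):
            go_flat = flat or fits(width - column, stack + [(indent, True, d.doc)])
            stack.append((indent, go_flat, d.doc))
    return "".join(out)

def call(name, args):
    """Hypothetical helper: a function call with a dedented closing paren."""
    items = []
    for i, arg in enumerate(args):
        if i:
            items += [Text(","), Line()]
        items.append(arg)
    return Group(Concat((
        Text(name + "("),
        Nest(2, Concat((Line(""), *items))),
        Line(""),
        Text(")"),
    )))

doc = call("foo", [Text("alpha"), call("bar", [Text("beta"), Text("gamma")])])
print(render(doc, 80))
print(render(doc, 20))
```

At width 80 the whole call stays on one line; at width 20 the outer group breaks while the inner `bar(...)` group still fits and stays flat. The formatter for a language is then mostly a function from AST nodes to these documents.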

~~~
westoncb
I'm gonna take a look at Wadler's paper—but does anyone want to take a shot at
a high-level description of how an algebra would be applicable to the problem
of pretty printing?

Any more general comments on applying algebras to solving programming problems
would also be appreciated.

My rough understanding of the benefits/rationale atm:

1) the algebra is a nice/minimal factoring of the problem

2) being able to formally classify it means we know a bunch of its properties,
which tells us certain approaches are definitely (im)possible (and these
properties are drawn from a fairly small set: e.g. associativity,
distributivity, existence of inverses)

3) If the algebra is explicitly represented in code (maybe in a library), we
can use that general structure to verify (and generate?) certain aspects of
our particular problem solution.

Edit: The paper mentions "Over the years, Richard Bird and others have
developed the algebra of programming to a fine art" —and I found Bird's book
"Algebra of Programming". Seems like what I'm looking for maybe, but would be
grateful to hear of any alternative resources as well.

~~~
QuinnWilton
You'd likely be more interested in the original paper [0] by John Hughes that
Wadler built off of. It gets into a ton more detail about the algebraic
structure of the documents they're defining:

> The question addressed in this paper is: how should libraries of combinators
> be designed? Our goal is to show how we can use formal specification of the
> combinators, and a study of their algebraic properties, to guide both the
> design and implementation of a combinator library. Our case study is a
> library for pretty-printing. But the methods we present are of wider
> applicability, and we also show how they can be used to develop a number of
> different monads.

Wadler's paper touches upon some of the same ideas, but the primary goal of
his paper is to build a prettier printer, whereas Hughes aimed to demonstrate
the power of algebraic properties.

[0]
[http://www.cse.chalmers.se/~rjmh/Papers/pretty.html](http://www.cse.chalmers.se/~rjmh/Papers/pretty.html)

~~~
harpocrates
For anyone curious about the difference between the two: Hughes' pretty
printer works really well for pretty-printing Haskell code, but Wadler's is
more flexible for pretty-printing C-style languages where there is a dedented
closing brace at the end of scopes.

It is 100% true that Hughes' paper is a better introduction to the idea of
having algebraic document combinators. One can skim Hughes' and then feel at
ease using a library that implements Wadler's printer - the bigger differences
between the two libraries are tucked away in the layout algorithms and the
APIs are similar.

~~~
sjakobi
> Hughes' pretty printer works really well for pretty-printing Haskell code,
> but Wadler's is more flexible for pretty-printing C-style languages where
> there is a dedented closing brace at the end of scopes.

That's very interesting! So far I've only used Wadler-Leijen-style
prettyprinters.

Are Hughes-style pretty printers "better" in some way for Haskell-like
documents? If so, why?

------
svat
> Note that “best” is a property of the _entire statement_ being formatted. A
> line break changes the indentation of the remainder of the statement, which
> in turn affects which other line breaks are needed. Sorry, Knuth. No dynamic
> programming this time.

For what it's worth, the same is true of TeX's line-breaking algorithm too:
“best” is a property of the entire paragraph being formatted; a line break can
change the indentation of the remainder of the paragraph (in many ways, but
most obviously: discretionary hyphens with nontrivial pre- and post- texts),
which in turn affects which other line breaks are needed. And yet TeX uses
dynamic programming.

So it would be interesting to see how this specific problem differs. (Why
dynamic programming works for TeX but not for dartfmt.) Perhaps Knuth chose a
definition of "best" that is amenable to dynamic programming (after all he had
quite a bit of freedom; it doesn't take much to beat the naive
first-fit/best-fit line-breaking algorithms as used in Word etc), or perhaps TeX's algorithm
too can be expensive but in practice paragraphs just tend to have a different
distribution of feasible breakpoints than code statements.

Knuth and Plass's (very readable!) paper on _Breaking Paragraphs into Lines_
is here:
[http://www.eprg.org/G53DOC/pdfs/knuth-plass-breaking.pdf](http://www.eprg.org/G53DOC/pdfs/knuth-plass-breaking.pdf)

Edit: Ah, I think one relevant difference is probably that when formatting
code, one has to "match up" indentation with the previous corresponding
sibling of the same nesting level, while with paragraphs the amount of
required state is smaller: given a particular line break and indentation, the
remainder of a paragraph can always be formatted in the same way, while a code
formatter also has to "remember" all previous parent expressions' indentation.
(This reason is different from what is directly implied by the quoted
statement though…)

~~~
gowld
The answer is in between the extremes. There are multiple choices of
linebreaks that lead to the same indent level at char k, so dynamic
programming is still potentially useful.

~~~
munificent
Exactly right. The formatter does memoization so that if multiple different
solutions end up putting the same subtree at the same level of indentation,
that subtree itself is only calculated once.
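
A toy illustration of that kind of memoization, in Python. This is only a sketch under simplifying assumptions: the memo key here is just (subtree, indentation), whereas the real formatter's key carries more context, and the two candidate layouts (`hugging`, `expanded`) and the tiny s-expression tree are invented for the example.

```python
from functools import lru_cache

WIDTH = 16  # assumed column limit for this toy

def flat(node):
    """One-line rendering; a node is a str or a tuple of child nodes."""
    return node if isinstance(node, str) else "(" + " ".join(flat(c) for c in node) + ")"

def hugging(node, indent):
    """Candidate 1: keep the first child on the opening line."""
    head, *rest = node
    pad = " " * indent
    body = [line for child in rest for line in best(child, indent + 2)]
    return (pad + "(" + flat(head), *body, pad + ")")

def expanded(node, indent):
    """Candidate 2: every child on its own line."""
    pad = " " * indent
    body = [line for child in node for line in best(child, indent + 2)]
    return (pad + "(", *body, pad + ")")

@lru_cache(maxsize=None)  # the memo table, keyed by (node, indent)
def best(node, indent):
    one_line = " " * indent + flat(node)
    if isinstance(node, str) or len(one_line) <= WIDTH:
        return (one_line,)
    candidates = [hugging(node, indent), expanded(node, indent)]
    fitting = [c for c in candidates if all(len(l) <= WIDTH for l in c)] or candidates
    return min(fitting, key=len)  # fewest lines wins

tree = ("define", ("square", "x"), ("multiply", "x", "x"))
print("\n".join(best(tree, 0)))
print(best.cache_info())
```

Both candidates place `(square x)` and `(multiply x x)` at indentation 2, so the second candidate gets those two results from the cache instead of recomputing them.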

~~~
svat
Thanks, calculating subtrees only once makes sense. But can't you go further
(as in TeX and in the comment you're replying to) — that is, if there are two
"paths" of breaking that eventually end with the same break at the same
indentations (plural because of the matching-siblings issue), then you should
have to recalculate the rest only once?

Another question: I must confess not understanding everything clearly, but it
sounds like you pop off the lowest-cost breaks until finding a solution that
fits the column limit. Couldn't there be cases where avoiding the "best" split
in the first step could ultimately result in a lower overall cost? Optimizing
over such solutions is the main thing that makes TeX's paragraph-breaking
better than the usual greedy algorithms…

~~~
munificent
_> that is, if there are two "paths" of breaking that eventually end with the
same break at the same indentations (plural because of the matching-siblings
issue), then you should have to recalculate the rest only once?_

In theory, yes. But there are other constraints across a statement that the
formatter tries to preserve, so the memoization needs to take those into
account too. Basically, the key used in the memo table is more complex than
just "where in the code" and "indentation level". It has to roll in some extra
contextual information about other line breaking choices that were made if
those choices force this subtree to be formatted in certain ways.

 _> Couldn't there be cases where avoiding the "best" split in the first step
could ultimately result in a lower overall cost?_

Yes, there are exactly such cases, which is why it does indeed find the overall
lowest-cost solution.

------
vjeux
I wrote a super short description of the overall architecture of prettier if
people are interested:
[https://blog.vjeux.com/2017/javascript/anatomy-of-a-javascri...](https://blog.vjeux.com/2017/javascript/anatomy-of-a-javascript-pretty-printer.html)

~~~
nojvek
Prettier is amazing. Thanks for building this tool. I’m not sure where you
live but if you’re in Seattle or in the valley I would love to buy you a
beverage of your choice sometime. I love the tool.

------
greggman3
AFAIK no formatter can understand intent. For example, I want my canvas.drawImage
calls that take 9 arguments to be formatted with arguments 2-5 on one line,
and 6-9 on another line because 2-5 are sourceX, sourceY, sourceWidth,
sourceHeight, and 6-9 are destX, destY, destWidth, destHeight.

That's just one example of 100s where an auto formatter cannot figure out
intent.

Another example, I often align numbers and width and height

    
    
        // convert from screen coords to clip space
        clipX = mouseX / screenWidth  *  2 - 1
        clipY = mouseY / screenHeight * -2 + 1  
    

Another example would be how to format an if block. Depending on your style
guide

    
    
        if (cond) single-statement
    

might be ok. Or maybe you want them on different lines. Or maybe you require
brackets. Or maybe you don't. But there will almost always be some place in the
code where an exception to your style guide makes the code more readable. Auto
formatters don't understand those exceptions.

This is why AFAIK many major projects do NOT use an auto-formatter. Firefox,
Safari, Chrome for example. Because there are places where an auto-formatter
will make the code far less readable. They'll give the advice that you are
free to run an auto-formatter on your own changes but adding an automated one
is a no-go. Even the Flutter team does not use an auto-formatter

[https://github.com/flutter/flutter/wiki/Style-guide-for-Flut...](https://github.com/flutter/flutter/wiki/Style-guide-for-Flutter-repo#formatting)

~~~
biddlesby
It's a trade off. We've started using a code formatter recently: the
overwhelming majority of the code in our team is more readable because it's so
standardised. There are a few sub-optimal lines here and there but I wouldn't
trade that for going back to the wild old days.

~~~
greggman3
The projects I mentioned above all have style guides. There's no "wild old
days". They just don't use an auto formatter because they know there are cases
where the auto formatter makes the code less readable and that there are
exceptions to every style guide. It's called a style "guide" not a style "law"
because it offers guidance which can be ignored when appropriate.

------
mholt
I just rewrote our Caddy v2 Caddyfile formatter this week [1]. It's not as
"smart" as a true code formatter, but what I thought would be just a 1 hour
task ended up taking all day. And I'm lucky it wasn't longer!

Lessons learned:

- State is hard. There's something to be said for functional languages, or at
least functional patterns.

- The Caddyfile is a very theoretically-impure syntax.

- User input is always messier than you think it will be. (Spend some time
helping people on the forums, you'll see what I mean. I'm a neat freak though,
so I notice it...)

[1]:
[https://github.com/caddyserver/caddy/commit/7ee3ab7baa216599...](https://github.com/caddyserver/caddy/commit/7ee3ab7baa2165990d3fd358878d818154f7ee86)

~~~
pmarreck
> There's something to be said for functional languages, or at least
> functional patterns.

Here’s my obligatory (and excellent) blog post by John Carmack where he adopts
the functional style for a month:
[https://www.gamasutra.com/view/news/169296/Indepth_Functiona...](https://www.gamasutra.com/view/news/169296/Indepth_Functional_programming_in_C.php)

------
einpoklum
The author of that post has it _so easy_! Formatting well-formed programs, in
a young language with a short spec. He just starts with an AST!

No no my friend. The challenge is when the program is _not_ well-formed, i.e.
has errors, or worse yet - refers to code elsewhere, so you can't even
decide if it's well-formed or not. So no AST for you. You have to go into the
dark and dangerous world of speculative and partial parsing of language
grammars, representing ambiguity, replacing undefined, ambiguous or
erroneously-defined semantics with deductions from the author's existing
formatting and so on. That's where the real challenges lie.

And - it's not gonna be 3,853 lines, I can tell you this much.

~~~
eterm
I actually really like a "format on save; don't format without AST"
combination.

It's really nice knowing as soon as you hit ctrl+s whether you've just written
in a syntax error or not.

~~~
dmit
To see this concept taken to its logical conclusion, check out Casey Muratori
write C code with the editor configured to reformat on every keystroke:
[https://www.youtube.com/watch?v=S3JutszP9fg&t=11m37s](https://www.youtube.com/watch?v=S3JutszP9fg&t=11m37s)

I'm not sure I'd ever go this far, but that might just be years of
conditioning by editors that format code much more slowly than the fork of
4coder that he's using in the video.

~~~
lioeters
The idea of autoformatted code editing is intriguing.

I've been exploring that (plus instant preview/eval, but that's another story)
in a home-made language with a basic editor. For smaller files/modules it's
totally feasible to do on every key press, fast enough to be fairly seamless.

It took a bit of getting used to, but there's something catchy about the
experience, like molding clay - if the clay was made of a "smart material"
that re-formed itself to an optimal shape.

The immediacy of working with self-reshaping code (and live reload too)
reduces mental friction, so I can forget about formatting or compilation
altogether. I hope more languages make it a part of the developer experience,
IDE, etc.

~~~
dmit
I feel like it's a subset of a much larger, more general topic of how latency
impacts computer interactions. People have long tried to emulate real-world
activities in computer programs, but not much attention has been paid to
making those programs instantaneous.

When you mold clay, there is no lag - the tactile and visual feedback is
immediate. Same with handwriting, painting, or petting a dog. On the other
hand, even in this current world of multi-core 4GHz+ computers most software
fails to provide immediate feedback.

100ms, I believe, is the magic number. The maximum delay between cause and
effect that still registers as "instantaneous" by the human brain. Yet,
surprisingly, it's still a rarely reached target. FPS video game studios might
be the only major industry that consistently cares about and delivers on this
metric. (Edit: "major" was an intentionally vague word to use here. Obviously,
all kinds of embedded software projects have to deal with real-time or soft-
real-time requirements.)

If reformatting the whole current code file took 1ms, why wouldn't you enable
it? If compilation took 50ms, why not recompile on each new line? There's a
magic latency barrier below which actions feel "free", so why not try and move
as many of them as possible below that threshold?

~~~
lioeters
> When you mold clay, there is no lag

Wonderful phrasing.

On the importance of immediate feedback in the creative process, it reminds me
of audio/music production - in particular, MIDI instruments. The latency
between keypress and sound reaching the ear should be as close to zero as
possible.

Recently I read a discussion about remote collaborative music performance.
From what I understood, network latency is not nearly low enough to achieve
it. There's also the physical limit of the speed of light.

10ms was mentioned as acceptable for musicians - but then someone said, even
when musicians are _in the same room_ , there can already be too much latency
if it's a big room and they're distant from each other, making it difficult to
play together.

> why not recompile on each new line?

This is becoming the standard in web development, with incremental compilation
(and "hot reload" on the client) of only the changed code/module. I saw that
it's getting applied in building mobile apps as well, to get closer to the
ideal of instant feedback.

In thinking of the ideal developer experience, I often come back to Bret
Victor's work, Learnable Programming.
[http://worrydream.com/#!/LearnableProgramming](http://worrydream.com/#!/LearnableProgramming)

~~~
detaro
> _10ms was mentioned as acceptable for musicians - but then someone said,
> even when musicians are in the same room, there can already be too much
> latency if it's a big room and they're distant from each other, making it
> difficult to play together._

Sound only travels ~3.4 meters in 10 ms, so it matches up that a large venue
easily exceeds that quite a bit.

------
Groxx
Previously:

2018:
[https://news.ycombinator.com/item?id=17271963](https://news.ycombinator.com/item?id=17271963)

2017:
[https://news.ycombinator.com/item?id=15063193](https://news.ycombinator.com/item?id=15063193)

2015:
[https://news.ycombinator.com/item?id=10195091](https://news.ycombinator.com/item?id=10195091)
(shortly after it was written)

------
wstrange
The dart formatter is awesome - kudos to Bob.

All languages should come with a highly opinionated formatter. It puts an end
to useless style arguments.

~~~
nicoburns
Every time I use one of these it ends up doing ridiculous things to the
formatting, like splitting what should really be 1 line over 4 or 5. Which
completely destroys whole-file readability/scanability. Does nobody else find
it really hard to read codebases subject to these formatters?

~~~
lhorie
Don't just blindly run a formatter.

I've developed a habit of refactoring long lines early (e.g. breaking large
expressions and assigning to intermediary variables) to prevent the formatter
from adding hard-to-read mid-expression line breaks

~~~
thatswrong0
Interesting.. not having to worry about where to break long lines and letting
the formatter (in my case Prettier) figure it all out for me has been one of
the most wonderful things in recent memory that has happened to my programming
experience. I can just write awful looking code at the speed of thought and
know it will turn out fine when I hit save.

Having to switch back to a non-opinionated non-line-breaking formatter like
golang's makes me very sad.

~~~
lhorie
Prettier outputs fairly decent formatting for code that is already reasonable
to begin with. It's not so great when you have things like template strings
with large expressions in the interpolation slots.

------
nojvek
I use prettier religiously. Works on js, typescript, jsx, css, json, html,
sass, yaml and a whole bunch of others.

The vscode prettier plugin brings format on save and it makes coding a joy.

I think what prettier does is much simpler than dart. It tries to put things
on a single line if it’s within line length, if it can’t then it uses multi
line where each item is in its own line with ending bracket at same
indentation level. For arrays, functions, object definitions, jsx etc. I’ve
written some automated code generators that output pretty code and that’s how
I tend to approach it too. You either get compact or expanded version and that
happens recursively. There is some backtracking but there’s early exits when
compact strategy exceeds line length so it’s pretty fast.

After reading the blog article it put me off that starting and closing
brackets were not at same indent level. I know lispers do this but I’m not
used to it so it hurts my brain.

One of my interview questions is to pretty print json files to a certain line
length.

------
rattray
Shameless plug that if you're bored and want a really tough, fun challenge,
the prettier project has many open tickets:
[https://github.com/prettier/prettier/issues?q=is%3Aopen+is%3...](https://github.com/prettier/prettier/issues?q=is%3Aopen+is%3Aissue+label%3A%22help+wanted%22)

Only two are tagged "difficulty:easy", likely for good reason – when I
contributed a few fixes a while back, I found it much more challenging than
implementing a language syntax, an unrelated project which I was also doing at
the same time.

It's really fun, though, and watching colleagues code and use the features of
prettier that you built is a nice feeling – "I helped write that code for
you!"

------
bmc7505
If you enjoyed reading this, you might want to check out CodeBuff:

[https://github.com/antlr/codebuff](https://github.com/antlr/codebuff)

CodeBuff is based on a paper by Terrence Parr and Juergen Vinju, “Towards a
Universal Code Formatter through Machine Learning”:

[https://arxiv.org/abs/1606.08866](https://arxiv.org/abs/1606.08866)

------
eyegor
I love that the author's final "gotcha" isn't even domain specific. It's a
hard limit on recursion, just like all modern debuggers and most other
formatting tools. It's crazy to me that they even tried to support formatting
auto-generated code snippets, I know I personally would've created a much more
naive/conservative bail out.

> If the line splitter tries, like, 5,000 solutions and still hasn’t found a
> winner yet, it just picks the best it found so far and bails.
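
To make that bail-out concrete, here is a small Python sketch of a cheapest-first search over sets of line breaks with an attempt cap. It is not dartfmt's code: the `chunks`/`costs` representation, the overflow measure, and the function name `split_lines` are assumptions for the example, and indentation and cross-split constraints are left out entirely.

```python
import heapq
from itertools import count

def split_lines(chunks, costs, width, max_attempts=5000):
    """chunks: pieces of one statement; costs[i]: positive cost of breaking after chunks[i]."""
    def render(splits):
        lines, current = [], chunks[0]
        for i, chunk in enumerate(chunks[1:]):
            if i in splits:
                lines.append(current)
                current = chunk
            else:
                current += " " + chunk
        lines.append(current)
        return lines

    def overflow(lines):
        return sum(max(0, len(line) - width) for line in lines)

    tie = count()  # tie-breaker so the heap never compares frozensets
    queue = [(0, next(tie), frozenset())]
    seen = {frozenset()}
    best = None
    attempts = 0
    while queue and attempts < max_attempts:
        cost, _, splits = heapq.heappop(queue)
        attempts += 1
        lines = render(splits)
        over = overflow(lines)
        if over == 0:
            return lines  # states come out cheapest first, so this is the lowest-cost fit
        if best is None or over < best[0]:
            best = (over, lines)
        for i in range(len(costs)):
            if i not in splits:
                nxt = splits | {i}
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(queue, (cost + costs[i], next(tie), nxt))
    return best[1]  # budget exhausted (or nothing fits): least-overflowing layout seen

chunks = ["result", "=", "compute(alpha,", "beta,", "gamma)"]
costs = [10, 3, 1, 1]  # breaking between "result" and "=" is made very expensive
print(split_lines(chunks, costs, width=24))
print(split_lines(chunks, costs, width=24, max_attempts=1))
```

With the default budget the search returns the cheapest layout that fits; with `max_attempts=1` it gives up after the unsplit attempt and returns the overlong single line, which is the "picks the best it found so far and bails" behaviour in miniature.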

------
nojvek
I wrote a simple formatter for outputting json values. It respects maximum
line length.

[https://gist.github.com/nojvek/875d8cad1005da0b95da2840bbc89...](https://gist.github.com/nojvek/875d8cad1005da0b95da2840bbc894e6)

It implements some of the ideas from
[https://blog.vjeux.com/2017/javascript/anatomy-of-a-javascri...](https://blog.vjeux.com/2017/javascript/anatomy-of-a-javascript-pretty-printer.html)

Could be re-used for other languages very easily. The core is you have nested
group chunks with (indent breakpoint, dedent breakpoint, separator breakpoint,
more text chunks and child group chunks).

When formatting a single line, indent and dedent breakpoints do nothing, but
in multi line format, they specify how text will be laid out.

The algorithm is simple, try to format in single line if it fits within
maxLineLength, if not break into multiple lines, and do that recursively.
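
A rough Python sketch of that "single line if it fits, otherwise break and recurse" rule, for JSON values. This is not the code from the gist; the function name `format_json` and its parameters are invented for this example, and it leans on the standard library's `json.dumps` for the one-line form.

```python
import json

def format_json(value, width=80, indent=0, column=0, trailing=0, step=2):
    """column: where the value starts on its line; trailing: 1 if a comma follows it."""
    flat = json.dumps(value)
    is_container = isinstance(value, (dict, list)) and len(value) > 0
    if not is_container or column + len(flat) + trailing <= width:
        return flat  # compact form: scalars always, containers when they fit
    inner = indent + step
    if isinstance(value, dict):
        entries = [(json.dumps(key) + ": ", child) for key, child in value.items()]
        opener, closer = "{", "}"
    else:
        entries = [("", child) for child in value]
        opener, closer = "[", "]"
    lines = []
    for i, (prefix, child) in enumerate(entries):
        comma = "," if i < len(entries) - 1 else ""
        # Recurse: each child independently decides between compact and expanded.
        text = format_json(child, width, inner, inner + len(prefix), len(comma), step)
        lines.append(" " * inner + prefix + text + comma)
    return opener + "\n" + "\n".join(lines) + "\n" + " " * indent + closer

data = {"name": "dartfmt", "tags": ["formatter", "dart", "tooling"], "limit": 5000}
print(format_json(data, width=80))
print(format_json(data, width=50))
print(format_json(data, width=40))
```

At width 50 only the outer object expands; at width 40 the `tags` array no longer fits after its key and expands as well, with each closing bracket at the indentation of the line that opened it.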

------
blumomo
I can't help myself: the required semicolon at the end of each line in Dart
programs keeps hurting after writing for many years in Python, JS, Kotlin and
Swift. I grew up semicolon-less with QBasic and learned semicolons with Delphi
1.0 and Java 1.0, but today, wanting as little noise on the screen as possible,
they hurt my eyes.

~~~
munificent
I'm the author of the post and work on the Dart language. I spent a bunch of
time last year trying to figure out a good way to make semicolons optional in
Dart without breaking a lot of stuff.

Because of a handful of syntax quirks particular to Dart, it's really
difficult. Most other languages that have optional semicolons were designed
from the start to make them optional. Dart was not. On top of that it:

* Uses C-style variable declarations which means there's no keyword to indicate the start of a variable declaration.

* Likewise uses C syntax for function and even local function declarations. Again no keyword marking the beginning of a function.

* Has complex syntax for things like function types which means a type annotation can be arbitrarily long and may split over multiple lines.

* Has lots of "contextual keywords" that are not true reserved words but behave like them only in certain contexts.

* Has a rich syntax that overloads a lot of tokens to mean different things in different contexts. "{}" can be an empty map, empty set, empty block, empty function body, or empty lambda body.

All of that makes optional semicolons just really difficult. I haven't given
up entirely. We have a thing now called "language versioning" that lets
different Dart libraries target different specific versions of the language.
That lets us evolve the language in nominally "breaking" ways without actually
breaking existing code. (For example, this is how we will migrate the world to
non-nullable types.)

That may give us a way to make other grammar simplifications that would then
also let us make semicolons optional. But syntax changes like this are always
difficult. People have _very strong_ opinions about you changing their
existing syntax under them, and the value proposition in return isn't super
compelling.

There's an old saying, the best time to plant a tree is twenty years ago. The
second best time is today.

Language syntax isn't always like that because the migration cost once you
have a lot of extant code (and users) can just be too painful. The best time
to make semicolons optional was twenty years ago. The next best time may be
today, but it may just be never.

~~~
chubot
Could Dart use this solution, which Python uses and Oil shell now uses?

1. Newlines are significant tokens and they behave like semicolons (this is
how shell is)

2. Within () [] {}, newlines are suppressed. The lexer has to recognize the
matching pairs. (Oil added an expression language to shell which does this)

This means that

    
    
        var x = 1 +
                2
    

is not valid because there's a statement terminator after +, but

    
    
        var x = f(1 +
                  2)
    

is valid. I feel like this captures the vast majority of cases where you want
to break a statement over lines.
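
The two rules can be sketched as a tiny statement splitter in Python. This is only an illustration of the idea, not Python's or Oil's actual lexer: the regex tokenizer and the name `terminated_statements` are made up, and strings and comments are ignored.

```python
import re

# Newlines, single bracket characters, or runs of anything else that is not whitespace.
TOKEN = re.compile(r"\n|[()\[\]{}]|[^\s()\[\]{}]+")
OPENERS = {"(", "[", "{"}
CLOSERS = {")", "]", "}"}

def terminated_statements(source):
    """A newline ends a statement only when no bracket is open."""
    statements, current, depth = [], [], 0
    for tok in TOKEN.findall(source):
        if tok == "\n":
            if depth == 0 and current:
                statements.append(" ".join(current))
                current = []
            continue  # inside brackets the newline is simply dropped
        if tok in OPENERS:
            depth += 1
        elif tok in CLOSERS:
            depth -= 1
        current.append(tok)
    if current:
        statements.append(" ".join(current))
    return statements

source = "var x = 1 +\n        2\nvar y = f(1 +\n          2)\n"
print(terminated_statements(source))
```

The first declaration is cut into `var x = 1 +` and `2` (so a parser would presumably reject it), while the parenthesized one survives as a single statement.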

I don't know Dart, but since it seems to have JavaScript-like syntax, I don't
see why that wouldn't work? JavaScript has some weird rules; the common gotcha
seems to be:

    
    
        return 
        {
        }  // oops this does not return a dictionary like I thought
    
    

With those rules you would write:

    
    
        return {
        }
    

and hopefully the former is a syntax error, not a mis-parsed program.

----

edit: OK my guess is that Dart probably doesn't want to enforce a brace style
like Go (and Oil) do?

That is in Go only

    
    
        if (x) {
        }
    

is valid, rather than

    
    
        if (x)
        {
        }
    

which is pretty much like the 'return' example. I have been using that brace
style for so long that I forgot other people don't :)

~~~
munificent
_> Within () [] {}, newlines are suppressed._

Unfortunately, no. That doesn't work. Consider:

    
    
        var map = {
          key: long
            - value
        }
    
        // block
        {
          long
          - value
        }
    

The newlines should be ignored inside the map literal, but not inside the
block. A correct semicolon insertion for this should be:

    
    
        var map = {
          key: long // <-- no semicolon here
            - value
        };
    
        // block
        {
          long; // <-- but there is one here
          - value;
        }
    

But the lexer doesn't know when curlies are maps and when they are blocks. The
parser does, which means you could theoretically define the semicolon
insertion rules in the grammar, which is what we'd likely have to do, but that
makes it much more complex.

~~~
chubot
Ah OK, that makes sense. So Python doesn't have that issue because it doesn't
overload braces for hashes and blocks.

Oil does overload them, but it has a separate lexing mode for statements and
expressions. It switches when it sees say the 'var' keyword, so the right of
var x = {a: 3} is lexed as an expression rather than a statement. This sort of
fell out of compatibility constraints with shell, but ended up being pretty
convenient and powerful.

The lexing mode relies pretty strongly on having the "first word", e.g. 'var'
or 'func', Pascal-style. So yeah I can see how it would be more complex with
Java-style syntax.

~~~
munificent
_> So Python doesn't have that issue because it doesn't overload braces for
hashes and blocks._

Yes, and also because lambdas in Python can only have expression bodies, not
statements. That means you can never have a statement nested inside an
expression. This is important because Python's rule to ignore all newlines
between parentheses would fall apart if you could stuff a statement-bodied
function inside a parenthesized expression.

~~~
chubot
Yes the lambda issue is something I ran into for Oil. Although this part of
the language is deferred, the parser is implemented:

    
    
        # Ruby/Rust style lambda, with the Python constraint that the body can only be an expression
        var inc = |x| x + 1
    
        # NOT possible in Oil because | isn't a distinct enough token to change the lexer mode
        var inc = |x| {
          return x + 1
        }
    
        # This is possible but I didn't implement it.
        # "func" can change the lexer mode so { is known to start a block.
        # In other words, { and } each have two distinct token IDs.
    
        var inc = func(x) {
          return x + 1 
        }
    

I think this is a decent tradeoff, but it's indeed tricky and something I
probably spent too much time thinking about ... the perils of "familiar"
syntax :-/

------
333c
Footnote 4 [0] discusses the struggles of formatting comments. This was
interesting to me because back in December I wrote most of a code formatter
for Lox, the toy language from Crafting Interpreters [1], but I never finished
it because I got caught up on how exactly to handle and represent comments,
and then I got distracted by other projects. Does anyone else have thoughts on
how to handle comments in a code formatter?

Edit: Well, apparently this blog post (from 2015) is by the author of Crafting
Interpreters. That makes this more interesting to me.

[0]: [http://journal.stuffwithstuff.com/2015/09/08/the-hardest-pro...](http://journal.stuffwithstuff.com/2015/09/08/the-hardest-program-ive-ever-written/#4-note)

[1]: [https://craftinginterpreters.com](https://craftinginterpreters.com)

------
pansa2
I'm surprised it was necessary for the author to design and write Dart's
formatter essentially from scratch. Is there not prior art - formatters for
other languages that enforce a maximum line length?

Also, I'm surprised that enforcing a maximum line length was considered so
important that it was worth all this work. Even though I personally like to
limit line lengths (and I know plenty of people who don't), I'm not sure
enforcing such limits is worth this amount of complexity in the formatter,
nor the sheer amount of labor required to implement and maintain that
complexity. I think the sweet spot is probably Go's `gofmt` - automatic
formatting that allows unlimited line lengths (possibly emitting a warning if
formatted code exceeds a desired length) seems to provide almost all of the
benefits of a formatter while being a much simpler solution.

------
jleahy
To some degree he’s just rediscovered branch and bound, the classic way to
solve NP hard problems like this. It’s a neat little trick.

The other way of doing this would just be to formulate it as some kind of ILP
and plug it into somebody else’s solver, quicker but there’s no way it’d work
as well.

------
sfvisser
I never grasped why this is so hard, you don’t need an exponential blowup with
some common sense heuristics. Just look at the AST and put the breaks as high
up in the tree as possible.

For example: If I have a function application with several arguments one of
which is a slightly longer lambda expression, don’t even try to break the
lambda before putting the arguments (nicely aligned) on their own line.

I’m a big fan of the ergonomics of prettier, but some of its output is
objectively ugly and seems the result of over-complicating their search space.

------
dfox
10 years ago I thought that writing an S-expression pretty printer should be
significantly easier than what the complexity of typical CL/Scheme pretty
printer (i.e. Waters' XP as described in AIM-1102) implies. Well that is not
true ;)

And since trying to do that I almost involuntarily format any code I write by
rigorously applying the pretty printing algorithm of XP by combination of
manual line breaks and Emacs' newline-and-indent.

------
jansan
How do you indent block comments? Do they stay untouched? Do you indent them
the same as the previous line? If so, what to do if there are already indents
in the comment? Add the additional indent?

I wrote a simple but IMHO really nice XML prettifier a while ago and remember
that I finally gave up on block comments, because I could not find a way to
handle all cases in a satisfying manner. So I decided to leave the indenting
untouched.

~~~
munificent
_> How do you indent block comments? Do they stay untouched?_

Block comments are a little tricky and there are a handful of heuristics to
handle them. Generally they stick to the preceding token. Block comments are
rarely used in Dart and when they are, it's often just a small marker to
describe the previous element like:

    
    
        foo(null /* missing */, blah);
    

So adhering it to the previous token usually does the right thing.

 _> If so, what to do if there are already indents in the comment?_

It doesn't touch the body of the comment itself. There's too many ways for
that to go wrong given all of the various things a human might choose to stuff
in there: ASCII art, tables, prose, etc.

------
coldcode
The first program I ever wrote as a professional in the early 80's was a
prettyprinter for Jovial (what the Air Force used before Ada), written in Fortran.
Hard to imagine now how little I knew yet still managed to get it to work
(required for Air Force documentation) in about 6 months. I wish I still had
the source code as that combo is nuts.

------
nickysielicki
Somewhere on my todo list is to make a systemverilog autoformatter using the
parser from slang [1].

[1]:
[https://github.com/MikePopoloski/slang](https://github.com/MikePopoloski/slang)

------
nemoniac
XP: A Common Lisp Pretty Printing System

[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.4...](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.4464)

------
crimsonalucard
Why make a language and make a linter? Make the linter part of the language.

In my opinion, there's no point in encoding certain concepts into a linter as
best practice. If it's best practice, encode it as law into the syntax of the
language.

------
qorrect
I really don't like this guy's writing style.

