To break or not to break - code formatters
The job of a code formatter is simple: reprint the source code with pleasing white space¹. Put that way it sounds trivial, and yet, as with anything in the software world, there are two camps. The first camp says that having an automated formatter reduces the burden on the programmer, allowing one to focus on more important things. Additionally, having a tool that enforces a coding style means consistency. Consistency, the group argues, allows complete strangers to open a repository and feel at home because they already know the formatting. The other group snickers and accuses the first group of being “smooth brained”: why would you want a machine to make decisions for you? From their point of view, having to give way to a machine in something as simple as white space is giving away your freedom and slowly turning oneself into a smooth brain as a result.
Where do I stand? I don’t care. All I want is to explore the problem of source code formatting. If you’re also interested, feel invited to read on.
The problem statement
Given an input string that contains source code written in a given language, produce an output string that does not change anything about the semantic meaning of the code, only its appearance.
I understood this problem but failed to find anything that would help me understand the solution within 15 minutes of reading time. As if the knowledge of how such an algorithm works were so obvious it did not require writing down. I’m of course exaggerating: there are primary sources, mostly research papers, but nothing building on them. It seems that nobody has presented these ideas in a more approachable manner.
The job of the algorithm
Formatters vary in their fanciness. Some formatters will only commit to changing the indentation level of the code. Others will go out of their way to give you the works: variable indentation level, variable line length limit, string quote conversions, sorting of imports, simplifications of expressions. And probably more, but these are the things that come to mind off the bat. For simple formatters look to `go fmt` or the slightly more complex `zig fmt`. The more advanced ones are something like `prettier`.
I classify formatters into two buckets: static and dynamic. Static means that there is a predefined spot where a newline can go, regardless of the context. Dynamic, on the other hand, means there are places that can either become a space, if there is enough room, or a newline, if we’re not going to fit on a single line. To drive this point home, let’s show the divide with an example:
zig fmt
The simpler formatter `zig fmt` will not break a function definition that is too long; it does not even have a variable you can tweak that will say “how long is too long”.
```zig
pub fn sum(veryLongName1: i32, veryLongName2: i32, veryLongName3: i32) i32 {}
```
Given that code, it will never try to break the argument list so that there is a newline after each comma depending on the length. If the programmer decides that the line is too long, they have the option to put a trailing comma after the last argument, which indicates to the formatter that it should put a newline after each comma instead of a space. This
```zig
pub fn sum(veryLongName1: i32, veryLongName2: i32, veryLongName3: i32,) i32 {}
```
will turn into
```zig
pub fn sum(
    veryLongName1: i32,
    veryLongName2: i32,
    veryLongName3: i32,
) i32 {}
```
prettier
The more advanced formatter has a setting you can tweak that will try to keep lines shorter than a given limit. As an example of this kind of formatter I’ll use `prettier`.
Given very similar code

```javascript
function sum(veryLongName1, veryLongName2, veryLongName3) {}
```
`prettier` will not change it if we run it with the defaults, because the line limit is set to 80 columns. If we run it with `--print-width 50`, it will know by itself that the line is too long and will break it at predefined points, resulting in
```javascript
function sum(
  veryLongName1,
  veryLongName2,
  veryLongName3,
) {}
```
Why do certain formatters decide not to implement automatic line breaking? Because it’s harder, takes more time to develop, and has edge cases where the output is not always the one you would choose yourself. But let’s go back to the basics: how do these formatters even work? What are the building blocks you’d need if you wanted to build one yourself?
The data flow
Almost all compilers have universally settled on the same approach to understanding source code, the stages being: `tokenisation (lexing) -> parsing -> code gen`. I’d say that more than 98% of the development time is spent in the last part, because the first two steps are a solved problem.
The first step takes the source code and creates discrete units from it. Tokenizing the example function, we would see:
```
[KeywordFunction, Identifier, OpenParen, Identifier, Comma, Identifier, Comma...]
```
Each token would have some data associated with it to carry more information. An `Identifier` would have a reference to the actual string, be it `sum` or `veryLongName1`.
Using these tokens we can now construct an Abstract Syntax Tree (AST).
The AST contains all² information about the semantic meaning of the parsed source code. For the example function, the information we store is that it’s a function definition, that it has three parameters, what the order of those parameters is, what the name of the function is, and what’s inside the body of the function.
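To make this concrete, here is a sketch in TypeScript of what those two stages could produce; every type and field name is illustrative, made up for this post rather than taken from any real compiler.

```typescript
// Illustrative token shapes; an Identifier carries the actual string.
type Token =
  | { kind: "KeywordFunction" }
  | { kind: "Identifier"; name: string } // e.g. "sum" or "veryLongName1"
  | { kind: "OpenParen" }
  | { kind: "CloseParen" }
  | { kind: "Comma" }
  | { kind: "OpenBrace" }
  | { kind: "CloseBrace" };

// An illustrative AST node: a function definition remembering its name,
// the ordered list of parameters and whatever is inside the body.
interface FunctionDefinition {
  kind: "FunctionDefinition";
  name: string;
  parameters: string[]; // order is preserved
  body: unknown[];      // empty for our example
}

const example: FunctionDefinition = {
  kind: "FunctionDefinition",
  name: "sum",
  parameters: ["veryLongName1", "veryLongName2", "veryLongName3"],
  body: [],
};
```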
The simplest approach
The simplest thing you could do to ensure that source code never extends past some column is to read the stream in token by token and, when the token you’re about to print would break the line limit, insert a newline. While the implementation and the runtime aspects of this approach seem desirable, the actual output is not. Imagine running this approach on the example code from above:
```
function sum(veryLongName1, veryLongName2,
veryLongName3) {}
```
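For illustration, here is a sketch of that naive printer in TypeScript, under the simplifying assumption that tokens are already plain strings glued together with single spaces:

```typescript
// Naive line filling: emit tokens left to right and start a new line
// whenever the next token would cross the limit.
function naivePrint(tokens: string[], maxLineSize: number): string {
  let output = "";
  let lineSize = 0;
  for (const token of tokens) {
    if (lineSize > 0 && lineSize + 1 + token.length > maxLineSize) {
      // the token would not fit: break the line right here
      output += "\n";
      lineSize = 0;
    } else if (lineSize > 0) {
      output += " ";
      lineSize += 1;
    }
    output += token;
    lineSize += token.length;
  }
  return output;
}
```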
While it’s true that we have defeated long lines, we also see that what we want to break is not the token stream but something more high level. And just as with any problem solving, adding a new layer of indirection results in a more powerful solution.
A step up in abstraction
I’d say that the first paper related to the final technique we’re going to be discussing is Derek C. Oppen, “Pretty Printing” (1979). It describes using the next level of abstraction over tokens, which as we said is the AST. Instead of directly printing to the output, we first build an object that describes where boundaries start and where they end. The flow now is `tokens -> ast -> blocks -> print`.
There is already a lot we could achieve with this algorithm, and the output would be a lot better than just scanning tokens.
The standard now is implementing the algorithm from Philip Wadler, “A prettier printer” (1997). I had a really hard time reading this paper because it’s written for Haskell programmers. To get to the same conclusions I recommend Christian Lindig, “Strictly Pretty” (2000). I could be mistaken, but from my point of view the Wadler algorithm is an extension of the one described by Oppen. Either way, it’s the algorithm we’re going to be focusing on, and it follows the same data flow as Oppen’s: `tokens -> ast -> document -> print`.
Probably the most famous formatter, `prettier`, has the ability to format many languages. One might be under the impression that it’s helped by the fact that it implements the Wadler algorithm. But as you can see from the data flow above, the first three steps are unique to each language. Each language has its own definition of tokens and of how those tokens are combined into an AST. And since each language has its own unique AST, each language has to have its own tree walker that will build the document. Only the last step is generic: since every document has to use the provided document abstraction, the printing part is shared between all languages. The part that is actually shared is the easiest part of the entire formatter; all the complexity is funneled into building the document.
The Document
High level view
The document is an object describing the relationship between tokens. It describes things like “try to print all these tokens as one line” or “if you need to put a newline, you can put one here”. At its core is the basic building block, the `text` object. It works as you would expect:
```
doc = text("hello")
print(doc)
> hello
```
The second simplest object is `line`. It will be printed as a space if we can fit on a single line, or as a newline if not. It’s an “if you must, break here”.
```
# abbreviating text as t
doc = t("hello") + line + t("world")
print(doc)
> hello world

# not enough space
> hello
> world
```
And the last basic building block is the `group`. It’s the container that decides whether a `line` will be a space or a newline. If a document in the group can’t be printed on a single line, all the `line`s inside it become newlines.
```
doc = group(t("hello") + line + t("world")) + line + t("!")
print(doc)
> hello world !

# not enough space
> hello
> world !
```
There is one last building block: `nest`. Nests are sub-documents in groups that will add additional nesting (indentation) if the group they are contained in does not fit.
```
doc = group(t("hello") + line + nest(t("world")))
print(doc)
> hello world

# not enough space
> hello
>   world
```
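To connect these blocks back to the `sum` example, here is roughly how its document could be assembled. This is a sketch in TypeScript with hypothetical tree-shaped constructors named after the blocks above; note that with only `text`, `line`, `group` and `nest` the flat layout would keep a space after the opening paren and a trailing comma, which is why real implementations such as `prettier` also provide a softline (renders as nothing when the group fits) and a conditional piece like `ifBreak`.

```typescript
// Hypothetical building blocks, defined only for this example.
type Doc =
  | { kind: "text"; text: string }
  | { kind: "line" }
  | { kind: "group"; children: Doc[] }
  | { kind: "nest"; children: Doc[] };

const text = (s: string): Doc => ({ kind: "text", text: s });
const line: Doc = { kind: "line" };
const group = (...children: Doc[]): Doc => ({ kind: "group", children });
const nest = (...children: Doc[]): Doc => ({ kind: "nest", children });

// One group around the whole signature: either every line in it stays a
// space, or they all become newlines and the names are nested one level.
const doc = group(
  text("function sum("),
  nest(
    line, text("veryLongName1,"),
    line, text("veryLongName2,"),
    line, text("veryLongName3,"),
  ),
  line, text(") {}"),
);
```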
Something you can actually implement
In essence, the document is a small set of instructions that are executed by the printer. There is no need to model this as a tree or some generic object containing “something”. Since it’s a set of instructions fed to a machine executing them, it can be streamed, and therefore it can be an array.
In my personal project I’ve created a simple `word` object that, when combined into an array (`word[]`), becomes a document. The `word` itself is just a discriminated union, where the type is the command to execute. There are 8 commands:
- `text`: print the text attached to this word
- `line`: either a space or a newline
- `space`: always a space
- `hardbreak`: always a newline
- `groupPush`/`groupPop`: a pair to keep track of the grouping
- `nestPush`/`nestPop`: the same kind of pair, to keep track of the nesting
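As a sketch, here is the shape such a union could take in TypeScript (the field names simply mirror the commands above):

```typescript
// The word: a discriminated union where `type` is the command to execute.
type Word =
  | { type: "text"; text: string } // print the attached text
  | { type: "line" }               // space or newline, decided by the group
  | { type: "space" }              // always a space
  | { type: "hardbreak" }          // always a newline
  | { type: "groupPush" }          // start of a group
  | { type: "groupPop" }           // end of a group
  | { type: "nestPush" }           // one more level of nesting, if broken
  | { type: "nestPop" };           // one level less

// A document is nothing more than a flat array of words.
type Document = Word[];
```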
The algorithm is very simple. Iterate over all the words in the document. When you find a `text`, `space` or `hardbreak`, print it to the output. A `line` needs to know whether the current group fits in order to print the correct white space. `nestPush` and `nestPop` are very similar to `line`: check if the group fits; if it does, do nothing, if not, increase or decrease the nesting. Finally, if you see a `groupPush`, measure how many characters are needed to print the content of this group; if it’s more than we allow for, set a flag indicating the group does not fit. In pseudocode:
```
function renderDocument(writer, words, maxLineSize) {
  nest = 0;
  previousGroupFittings = [];
  groupFits = false;
  for (word, i in words) {
    if (word.type == GroupPush) {
      // remember the enclosing group's verdict and measure this group
      previousGroupFittings.push(groupFits);
      remainingWidth = maxLineSize - MAX(writer.lineSize, nest);
      groupFits = fits(words[i:], remainingWidth);
    } else if (word.type == GroupPop) {
      groupFits = previousGroupFittings.pop();
    } else if (word.type == NestPush && !groupFits) {
      nest += 1;
    } else if (word.type == NestPop && !groupFits) {
      nest -= 1;
    }

    if (
      word.type == HardBreak ||
      (!groupFits && word.type == Line)
    ) {
      writer.finishLine();
    } else {
      // text and space carry their own text, a fitting line is a space,
      // the push/pop words have nothing to print
      writer.writeString(word.text, nest);
    }
  }
}
```
The `fits` function can be as easy as summing the width of all words inside that group and any nested sub-groups. `hardbreak` words are treated as having infinite width, therefore they always cause the group to be broken. But that is not the correct implementation handling all possible edge cases. A group only controls the line breaking inside itself, yet there is a tail of words that have to stay on the same line as the group. An example would be `[group(width = 80), text(';')]`: the semicolon has to be attached to the group, so in a way it’s part of that group too. Otherwise it’s possible to produce lines longer than the permitted line length, because groups would not consider their tail as contributing to their width. A group’s tail is all the trailing words up to the first `line` or `hardbreak`. The full implementation sums the width of all words in the group and in the tail. Having the implementation use a simple array makes it trivial to handle such a case.
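Here is a sketch of such a `fits` in TypeScript, reusing the `Word` union from the earlier sketch: it scans forward from the `groupPush`, counts every word as if printed flat, and keeps counting past the group’s end until the first `line` or `hardbreak`, so the tail contributes to the width.

```typescript
function fits(words: Word[], remainingWidth: number): boolean {
  let width = 0;
  let depth = 0; // group nesting relative to the group being measured
  for (const word of words) {
    switch (word.type) {
      case "groupPush":
        depth += 1;
        break;
      case "groupPop":
        depth -= 1; // once depth hits 0 we are scanning the group's tail
        break;
      case "text":
        width += word.text.length;
        break;
      case "space":
        width += 1;
        break;
      case "line":
        if (depth <= 0) return width <= remainingWidth; // tail ends here
        width += 1; // printed flat, a line is just a space
        break;
      case "hardbreak":
        if (depth > 0) return false; // infinite width: the group must break
        return width <= remainingWidth; // a hardbreak also ends the tail
      // nestPush/nestPop add no width when printed flat
    }
    if (width > remainingWidth) return false;
  }
  return width <= remainingWidth;
}
```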
And that’s it: even a heavily optimized document renderer written in C is around 200 lines of code, or 300 lines with helpers for document building. The actual complexity is in building the document. It depends on the language, but in flos, a Solidity formatter, there is ten times more document-building code (around 2500 lines).
Bigger features
Things like import sorting go a step further than formatting. It’s no longer a simple pact of “only white space changes”. There is a real possibility that you format your code, the formatting changes the import order, and the program breaks. Programs like Biome split changes like these into two distinct categories: formatting and linting. They solve the problem by moving things with the potential to change the semantic meaning into the lint step. This leaves the formatter as a pure transformation.
Subjective or objective
There is a real problem in formatting: it’s really subjective. The step of building the document has many assumptions baked in. Where do you put potential line breaks, and where does there always have to be a space? Do you put a space or leave no white space between two tokens? You could expose all of these decisions as switches in the configuration. But I’d go as far as to say that it’s better to be subjective and build something you’d like.