Computers need to talk to one another, which means programmers need to agree on things. This is the worst, for various reasons, but mostly because it leaves us sitting around implementing parsers and printers for formats we barely care about. A good portion of my career thus far has been spent on parsers and printers!
Parsers and printers are a classic missed isomorphism: they're supposed to be dual, but in general we cannot derive both from the same specification.1 If you want some assurance that your implementation is correct, expect to be uncomfortably reliant on testing.
Unit tests are pretty much unavoidable here, and are especially handy when there's a simple specification or pre-existing test suite to compare against. However, the bugs encountered while writing a parser are generally not easy to pre-empt! Unit tests cannot bear the entire load. You'd end up shipping bugs or wasting time.
We know that parsers and printers are supposed to be dual. We can simply treat this as a law, and write a property test to enforce it.
Using Hedgehog primitives, it might look a little like this:
... although a wise Hedgehog programmer would probably use the supplied tripping function.
While the round-trip test is pretty well-known among functional programmers and property-based testing fans2, I really want to hammer home three things:
This is both the simplest useful property-test I can think of for a working engineer and the most likely to reliably identify bugs of consequence. It is property testing's trojan horse. Talk to your team lead about property testing today!
Let's mock up a parser and pretty-printer, break them subtly, and use a generator to identify incompatibilities. This is especially easy when using a property-testing library like Hedgehog. If you're stuck in a language without a decent framework, this is a simple enough property to implement by hand using random primitives4.
Our pal today is a tiny subset of a C-like expression language. I'll only be implementing a scanner, since it should be sufficient to get the point across. Lexers / scanners are the simplest form of parser, and also the easiest to test. I'll befriend two birds with the one loaf here by also demonstrating a little bit of Parser Combinator Hell5.
Here's a valid program (don't worry about how it executes):
Here are examples of our tokens, in lieu of a real specification:
... and here's a corresponding datatype:
We don't care about any kind of whitespace. It can be peeled off by the lexer as we go, and can be inserted freely by the pretty-printer.
From this point on, code may be intentionally incorrect. Please don't cargo-cult anything without reading the full post.
Personally, I prefer to write the printer first, so I can get the property test in place before I even get started on the parser. This is a preference I developed over time by doing the precise opposite. Your mileage may vary.
A (very bad) pretty-printer for a stream of tokens, given the above caveats, is pretty straightforward. We just write a printer for an individual token...
... and then add liberal amounts of whitespace. For this toy language, whitespace only really matters for disambiguation between ++ and +, x y and xy. Let's be lazy and apply it between each token. This is a hack job, after all!6
Here's a first attempt at a Hedgehog generator for tokens. Recall we had some very convenient restrictions on float precision and variable names. We'll take advantage of Hedgehog to enforce these.
To get the property in place, we'll need stubs for parse:
Now we can write our property:
... and now we can type tests into the repl to see how we're doing.
I'll write the parser in a naive-but-reasonable7 way and write about each bug the property identifies. I'm using Parsec for this, since it makes it relatively easy to make mistakes.
Let's start by peeling off whitespace...
... stubbing out a token parser...
... and writing our top-level parsing functions, ensuring irrelevant whitespace is handled and we parse all the way to the end of the stream:
Most of our tokens are pretty boring string literals, so let's get those out of the way.
Parsec doesn't give us numeric parsing primitives, so I'll write some slow ones of my own:
... and I'll need to recognise our alphanumeric variables, too:
... so, stitching it all together, it looks like something like this should work:
... but does it? My first unit test passes. Looks like it!
Looks like we're done! If the property passes, we can go home.
Weird failures on floats. That's weird, parseDouble worked fine when I tested it in isolation! Turns out the order inside a choice block really matters in Parsec. When rules are ambiguous, the most general one needs to be at the bottom, else it will always succeed in place of any more specific combinators. Flipping the order of the TInt and TFloat parsers is not enough, however! Parsec also allows failing generators to consume input, so we need to use try to stop parseDouble eating integers. Here's how we get away with it:
The next counterexample is [TPlusEq], which complains about an unexpected =. This looks like the same thing, the first + is picked up by TPlus and the = is meaningless. So, let's put Increment and PlusEq above Plus, and wrap them both in try:
The next counterexample is [TVar ""], which really shouldn't be happening. This is a generator bug, and not an intentional one. Whoops. Adjust the lower range parameter: TVar <$> Gen.text (Range.linear 1 20) Gen.alpha.
Here's our final counterexample:
Looks like we're generating some floats that require more precision than six decimal places to render. They're being truncated during serialisation. This is another generator invariant. We can fix this by ensuring our generated values are truncated up front:
That's it, the property now passes!
I'm now reasonably confident that this parser works. This is a really simple language, so I could probably have done that with unit tests this time around, but it probably would have taken a similar amount of effort.
This was the simplest, least interesting serialisation problem I could think of. It intentionally has no recursive productions or context-sensitivity. I added several restrictions to make it easy to parse. It still picked up a couple of places where I was holding Parsec wrong, and a couple of spots where my generator, parser, and printer were implementing different languages.
Round-tripping is useful for pretty much any kind of parser-printer combo, from JSON or binary serialisation up to ridiculous layout-sensitive Haskell parsers. It pays off pretty much immediately, even for inane cases like this. It is hard to live without when implementing complicated formats. If you're solving this kind of problem, I implore you to give it a try.
While you may still need other tests to ensure you haven't implemented some wildly incorrect language, round-trip testing can ensure your parser is complete and unambiguous. Care sometimes needs to be taken in writing a good printer and a good generator. This feels a lot like writing a specification! It's a much more productive process than thrashing your brain for interesting permutations.
Parser ambiguities are an extremely common class of bug that can wreak the worst kind of havoc. With a small amount of up-front due diligence, we can eliminate them in a semi-automated fashion. This is almost certainly in your interest.
This technique can be reused for any serialisation layer, as well as any internal functions that should be involutive, including isos.
© 2017-2019 Tim Humphries