---
title: How we test(ed) the Futhark compiler
author: Troels Henriksen
description: A description of how the testing tools for the Futhark compiler have improved over time.
---

In this post I will go through the evolution of the tools we use for
testing the Futhark compiler. Along the way, these also became
`futhark test`, [the user-facing tool for testing Futhark
programs](https://futhark.readthedocs.io/en/latest/man/futhark-test.html),
and although rather convenient, it is still pretty crude compared to
the testing tools you will find in other languages. This post is
perhaps most interesting to other language implementation hobbyists.

## Every program starts with a shell script

In the earliest days, when you ran a Futhark program it would read
things from standard input and write things on standard output.
(In the *extremely* early days, Futhark actually supported
impure functions for doing this, although this was quickly changed to
implicitly reading the arguments to the `main` function, driven by
their types.) At some point we realised that we weren't sure that the
compiler worked, and so we decided to add a way of running our test
programs.

The vision was simple: each program could be associated with an `.in`
file that would be fed to the program on stdin, and an `.out` file
that would be compared against whatever the program wrote on stdout. If no
`.in` file was present, it was assumed that the program contained a
type error and should be rejected. This early support for negative
tests was remarkable considering how crude the system as a whole was -
[here is the shell script that implements
it](https://github.com/diku-dk/futhark/blob/667370911e3563ee2e16f445d4e595cd2270c9da/data/runtests.sh).

You will notice, for example, that the test programs are explicitly
listed (sort of) in the test runner itself. Also, the script does not
actually *run* the programs - it merely compiles them. I vaguely
remember that this testing script is derived from one I wrote when I
was a TA on [DIKU](https://sigkilll.dk)'s compiler course (which is
[incidentally also the root origin of
Futhark](2021-12-19-past-and-present.html)), but I am pretty sure that
one actually ran the test programs.

It took about a week to actually reach the point [where the test
runner tested the
programs](https://github.com/diku-dk/futhark/blob/df66f8355ac8e072ee8d93aa6589f45b756e9766/data/runtests.sh) -
looking at the commit history, I spent most of that week making it use
more compiler passes, and presumably only implemented *running*
programs once the compiler stopped crashing. Note that the expected
and actual program output is compared *exactly*, with `cmp(1)`. We
will return to this.

A few weeks later [I added support for only type checking the test
programs](https://github.com/diku-dk/futhark/commit/f71c4cf980900cdf58df1649b2f86d7702cc8292),
which is very handy when hacking on the type checker. I suppose this
represents the point where we had added enough tests (or made the
compiler slow enough) that actually *running* the entire test suite
was too annoying. This option is still supported in `futhark test`.

## Moving to Haskell

This code didn't last very long, however, as soon after I [rewrote the
test runner in
Haskell](https://github.com/diku-dk/futhark/commit/dd40268e7c84172aac2449bc0ba1debdf9935150).
I actually thought the shell script lasted a lot longer than a few
weeks, but the Git log doesn't lie. This test runner was still not a
part of Futhark proper, however, but a completely separate Haskell
program you ran with `runhaskell`. The main advantage was that it
would perform multiple tests concurrently, as compared to the totally
sequential shell script. Looking at its architecture, it is actually
very similar to the modern `futhark test` tool. For unclear reasons,
it took almost a month before we actually [switched to using the
Haskell-runner by
default](https://github.com/diku-dk/futhark/commit/132e195613474d1befe7a1d436ecc03666b06201).
Goodbye, shell script.

The next major change [was adding a terrible CI
setup](https://github.com/diku-dk/futhark/commit/e6fa896d9e8c8ac44cd2fce4c1f7c36e0da9ece0).
Specifically, I piggybacked on an IRC bot I was running to manage a
lunch club with fellow students. This IRC bot was already set up to
handle IRC notifications from GitHub whenever someone updated its
code. It was a small amount of work to also make this lunch management
bot [receive notifications from the Futhark (well, L0 in those days)
repository and run the CI
suite](https://github.com/athas/eggsml/commit/03676c2a337d4b75c4ba7a98c446d0958b2cbc33).
CI failures were reported by sending me an email. It was *super*
flaky, and I don't remember why I didn't do something more sensible.
This was in December of 2013, so hardly the stone age when it comes to
CI. It wasn't until over a year later that we [finally started using
Travis-CI](https://github.com/diku-dk/futhark/commit/9ec064114873eee244f7984494bfb7694fb0044f)
([and eventually in 2020, GitHub
Actions](https://github.com/diku-dk/futhark/commit/d99e8ecf6cc8c3cc2bcfc95343a0f83ebc6996a3)).

But back to the test runner. Its main problem at this point was that
it did an *exact* comparison of expected and actual output, which
meant it was very sensitive to pretty-printing details (this all
predates the [binary data
format](https://futhark.readthedocs.io/en/latest/binary-data-format.html)).
It was also very annoying when we started having multiple backends
that did not guarantee bit-level consistency for things like floating
point arithmetic. The first (weak) attempt at solving this was
[ignoring whitespace when comparing
outputs](https://github.com/diku-dk/futhark/commit/b3fecfd8ddbba4a17207d27a2dd5699e165b16e9),
but it wasn't until [February
2015](https://github.com/diku-dk/futhark/commit/1d97f5533ef592e9690b98549b9b99f0c1d588a0)
that the test runner actually started considering the output to be
Futhark *values*, rather than just arbitrary strings, in order to
compare floating-point values within an allowable tolerance. It's
quite remarkable how long it took to get rid of some of the misdesigns
of that initial shell script.
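
For floating-point results, "within an allowable tolerance" means a
comparison roughly like the following sketch, written here as a Futhark
function purely for illustration (the actual threshold used by `futhark
test` is an implementation detail, and the value below is made up):

```Futhark
-- A sketch of a relative-tolerance comparison of floating-point
-- results. The 0.002 threshold is made up, not the one futhark test uses.
def close (expected: f64) (actual: f64): bool =
  f64.abs (expected - actual) <= 0.002 * f64.max 1.0 (f64.abs expected)
```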

By this point, the test runner ran both the interpreter and the compiler.
This was a very good design, as having two implementations of the
language forced us to think about what programs were actually
*supposed* to do, rather than what they happened to do by accident.
However, eventually we started writing test programs that were too
slow to run in the interpreter, so it became possible to [turn the
interpreter off](https://github.com/diku-dk/futhark/commit/c5d1648e6b1c2a96f0416521bcae046b500c2638).
This is the start of the test runner growing a confusing array of
different *modes* in which it can be run. A few weeks later, we added
the obvious counterpart option that [only runs the
interpreter](https://github.com/diku-dk/futhark/commit/964193c01e96600f800733032a043ece43c25dc7).
These options are still supported in `futhark test`.

The first large change to the *design* of how test programs were
specified [came in April
2015](https://github.com/diku-dk/futhark/commit/6eda2025dedc45f6a88ec8d0695805c2a34575d2),
and removed the part where a program `foo.fut` was associated with
`foo.in` and `foo.out`. The problem (apart from being annoying to
juggle multiple files) was that it only allowed a single input/output
pair for each program. Instead, the test input and expected output
would now be embedded in the program itself, in the form of a comment,
as shown here:

```
// --
// input {
// 10
// }
// output {
// [ 0 , 1 , 1 , 2 , 3 , 5 , 8 , 13 , 21 , 34 ]
// }
```

This is still essentially the design we use today, although `//`
became `--` and `--` became `==`. It did prove awkward for [test
programs with very large expected
outputs](https://github.com/diku-dk/futhark/blob/6eda2025dedc45f6a88ec8d0695805c2a34575d2/data/benchmarks/BlackScholes.fut), but was a marked improvement for the rest.
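
For comparison, the same test would look something like this in today's
syntax (a sketch; the exact values are illustrative):

```Futhark
-- ==
-- input {
--   10
-- }
-- output {
--   [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
-- }
```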

In the same month, the ad-hoc Haskell script tucked away in a
sub-directory [finally became
`futhark-test`](https://github.com/diku-dk/futhark/commit/668be565957dffe7687f530d49ffc99d7c9d2de3),
and was built using the same build system as the main compiler. This
was the start of providing testing tools that could be used by Futhark
programmers to test their own programs (as opposed to compiler hackers
testing the compiler), although it was still quite crude, and almost
completely undocumented. (But anyway, you wouldn't want to use
2015-era Futhark.)

The opening comment in `futhark-test.hs` was pretty clear about the
audience:

```Haskell
-- | This program is a convenience utility for running the Futhark
-- test suite, and its test programs.
```

## Beyond correctness testing

Until this point, the program then named `futhark-test` had been
focused on functional correctness - when run, did the programs produce
the right output? However, at this point we had spent a lot of time
adding optimisation passes, and (unwittingly) also on breaking
optimisation passes. Our testing tool could easily detect when an
optimisation produced incorrect code ([helped by the internal
consistency checks in the
compiler](2023-01-18-how-we-make-the-compiler-crash)), but it had no
way of detecting when we broke an optimisation pass to such an extent
that it simply *did not change the program*.

Inspired by some tests I found in the LLVM test suite, I became
interested in being able to express that the program resulting from
optimisation should have some loosely specified shape. I didn't want to
hardcode the entire expected representation, because the fine details
about variable naming and operation ordering are not (necessarily)
intended to be stable, but I wanted to say things like "this program
should consist of two loops at top level, the latter with a reduction
inside".

Instead of doing the work myself, I proposed the construction of such
a tool as a Bachelor's Project at the department ([the start of a
beautiful tradition](2024-02-03-quantifying-student-projects.md)), and
the skilled [Maya Saietz](https://tayacan.dk/) developed a tool where
you provided a high-level AST "pattern", and the tool checked
whether that pattern matched the program produced by the compiler.
Maya's work was quite good, but a bit overkill - and it was difficult
to express that a given construct was *not* allowed to appear in some
position (e.g., demand the absence of bounds checks). Instead, [I
added a "quick
hack"](https://github.com/diku-dk/futhark/commit/a6c0d0864bddad8f5eef2ce627c79e28ba9ef65a)
(still around in essentially unchanged form, naturally) where you can
simply specify how many of a specific kind of AST node are allowed in
the program. For example, this might specify the absence of bounds
checks:

```
structure { Assert 0 }
```
And this might specify that the program contains only a single
SOAC (which we might use to test fusion, another thing that broke often):
```
structure { SOAC 1 }
```
Nesting was handled by naming the nodes appropriately. For example, a
`SOAC` inside another `SOAC` would be tallied as a `SOAC/SOAC` node
(as well as two plain `SOAC`s - the outermost and the innermost one).
This "quick hack" has proven extremely useful, and perhaps merits a
more detailed post one day.
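
Putting it together, a fusion test might look something like the
following sketch (the program, data, and node count are purely
illustrative, and whether the fused node is reported as `SOAC` depends
on compiler internals):

```Futhark
-- ==
-- input { [1.0f32, 2.0f32, 3.0f32] }
-- output { [4.0f32, 8.0f32, 12.0f32] }
-- structure { SOAC 1 }
def main (xs: []f32) = map (\x -> x * 2) (map (\x -> x * 2) xs)
```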

We also [added a notion of
"tags"](https://github.com/diku-dk/futhark/commit/709f432622a417661b1dd407a9c98ff56aeb7741),
where programs (and workloads) could be associated with arbitrary
*tags*, which could then be excluded from a specific run with an `--exclude`
option. This was (and is) used for programs that do not work with a
specific backend, or for datasets that are too slow. (The sequential
Python backend, despite ostensibly being "compiled", is particularly
slow for nontrivial workloads.)
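
A tag is declared in the same test block. As a sketch (the tag name is
made up), a program to be skipped when running with
`--exclude=no_python` might look like this:

```Futhark
-- ==
-- tags { no_python }
-- input { 3 } output { 6 }
def main (x: i32) = x * 2
```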

## Benchmarking

In early 2016, Futhark was becoming good enough that we started
wondering how fast it was - so [we added
`futhark-bench`](https://github.com/diku-dk/futhark/commit/ac9a62cb5ff719787c2c8e398bf6c31fca2be4d3).
It accepted the same kinds of files as `futhark-test`, but instead of
merely checking for correctness, it ran each program a bunch of times
and reported the resulting performance. In later years, when I have
had to benchmark other systems, I have often appreciated that we
(relatively) early on decided on a single fully automated way in which
Futhark programs should be benchmarked.
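
A benchmark is just a test program in the same format, typically with
several datasets of increasing size; expected output can be omitted
when we only care about the runtime. A sketch (the program and sizes
are made up):

```Futhark
-- ==
-- input { 1000i64 }
-- input { 1000000i64 }
-- input { 10000000i64 }
def main (n: i64) = f64.sum (map f64.i64 (iota n))
```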

Around this time, since we wanted to benchmark programs with large
(ish) amounts of data, the MSc student Rasmus Wriedt Larsen [designed
and implemented a binary data
format](https://github.com/diku-dk/futhark/pull/295) that we still use
(almost) unchanged. Again, settling on a simple format (which is
incidentally also easy to read and write from other languages) has
proven enormously convenient. For example, the [`futhark
dataset`](https://futhark.readthedocs.io/en/latest/man/futhark-dataset.html)
and [`futhark
datacmp`](https://futhark.readthedocs.io/en/latest/man/futhark.html#futhark-datacmp-file-a-file-b)
tools can be used for generation, conversion, and comparison of data
files. At this point, test programs were also able to indicate that
the input (or output) data should be fetched from another file, rather
than being embedded in the program itself.
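
In the test block, such external data files are referenced with an `@`
instead of embedding the values, something like this sketch (the file
names are hypothetical):

```Futhark
-- ==
-- input @ data/large.in
-- output @ data/large.out
def main (xs: []i32) = map (+ 1) xs
```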

We have a bunch of auxiliary tools for benchmarking purposes, such as
[`futhark
benchcmp`](https://futhark.readthedocs.io/en/latest/man/futhark.html#futhark-benchcmp-file-a-file-b)
that can be used for comparing results, and
[`plot-bench.py`](https://github.com/diku-dk/futhark/blob/master/tools/plot-bench.py)
for visualising results, but I will leave a discussion of those for
another post.

## Server mode

At this point, `futhark-test` and `futhark-bench` were essentially
complete. Certainly, small things were added, such as testing multiple
entry points in the same program, or [the ability to test the presence
of specific compiler
warnings](https://github.com/diku-dk/futhark/commit/1dc03211d1322d3eae9f69a023211a9fd90c3ad1),
and they eventually became subcommands under the names `futhark test`
and `futhark bench`, but a modern Futhark programmer could probably
use them without noticing much difference.

Since the beginning, testing had been based on a simple operational
principle: a Futhark program read input on stdin and produced output
on stdout. The format had changed from text to binary, and options had
been added through which the program would report its runtime, but
running a Futhark function still implied starting the program from
scratch. It also meant testing was limited to the values that could be
represented in the binary data format, which was exclusively arrays of
primitive values - no records, tuples, or sum types.

Another problem with this approach was that it could be slow.
Initialising the GPU context could easily take several seconds
(potentially much more for large programs), which was detrimental for
programs that consisted of a large number of functions with
(relatively) small individual runtimes, [like this reduction
microbenchmark](https://github.com/diku-dk/futhark-benchmarks/blob/986b9e109f2a25a2e79ba949309706470e7d51f0/micro/reduce.fut).
It would be better if a Futhark program could be kept running (with
the GPU context intact) and made to run multiple entry points during
its lifetime. This was added in early 2021 in the form of [Futhark server
mode](2021-01-18-futharkscript.html). The most visible feature
provided by server mode was `futhark literate` ([now used for our
collection of examples](https://futhark-lang.org/examples.html)), but it
also had significant impact on testing and benchmarking.

In particular, it was now faster - a *lot* faster for benchmarks with
many datasets, like the microbenchmark linked above. But a more subtle
change was that input no longer needed to be expressed as arrays of
primitives. Instead, it could be computed with *another* Futhark
function, and passed directly to the entry point being tested, without
having to go through a deserialisation/serialisation step. (And even
when serialisation *is* necessary, the server protocol also came with
support for serialising *arbitrary* Futhark values to byte
streams.) This meant we
could now write test programs like this:
```Futhark
-- ==
-- entry: doeswork
-- script input { mkdata 100i64 } output { 5050.0f32 }
-- script input { mkdata 10000i64 }
-- script input { mkdata 1000000i64 }
entry mkdata n = (n,map f32.i64 (iota n))
entry doeswork n arr = f32.sum arr + f32.i64 n
```

Here the `doeswork` function is the actual function being tested, and
`mkdata` is some arbitrary function that generates data. The test
stanza will run the expression `mkdata 100i64` to generate the input
(the runtime is not counted when benchmarking), which is then passed
to `doeswork`. While the example above is trivial, this is very useful
for providing arbitrarily complicated input data (although the result
must still be of a simple type). It is also quite useful that we can
test or benchmark with large amounts of synthetic data, without having
to store it on disk, by generating it dynamically in Futhark.

## What about unit tests?

All of the discussion above concerns *integration testing*, where the
full functionality (or close to it) of the compiler is exercised. What
about unit testing, where the functionality of individual small
components is verified? I am hardly averse to unit testing in general,
but for compiler development, I must admit that I find integration
tests to provide more *bang for the buck*. In particular, compiler
passes often take quite complicated inputs (entire ASTs), for which
the representation is not stable. Keeping unit tests for such
functions updated is not impossible, but it is time consuming.

By writing small test programs, and perhaps augmenting with things
like the `structure` tests above, we can obtain many of the benefits
of unit testing, with less maintenance overhead. Futhark does have a
[small collection of unit
tests](https://github.com/diku-dk/futhark/tree/master/unittests),
mainly for verifying the functionality of particularly subtle core
components, like [graph
colouring](https://github.com/diku-dk/futhark/blob/master/unittests/Futhark/Optimise/MemoryBlockMerging/GreedyColoringTests.hs)
or [index function
operations](https://github.com/diku-dk/futhark/blob/master/unittests/Futhark/IR/Mem/IxFunTests.hs)
(for the [array
representation](2024-03-06-array-representation.html)). However,
compared to the 2293 test programs, the 2985 source lines of unit
tests do look pretty meagre.

We also have [a collection of manually written library
tests](https://github.com/diku-dk/futhark/tree/master/tests_lib) for
testing the C and Python APIs, as well as some [completely ad hoc
tests](https://github.com/diku-dk/futhark/tree/master/tests_adhoc)
that are just shell scripts that check various things in the auxiliary
tooling. These are not terribly interesting.

## Wrapping up

Futhark is a project that even in the best case [will never have many
resources](2018-06-18-designing-a-programming-language-for-the-desert.html).
Therefore, all our tooling has been built with a sense of minimalism
and simplicity, often sacrificing flexibility. Yet I think we have
ended up with tools that work pretty well for our purposes, and where
we can usually get our work done. The main missing feature is some
form of property based testing, which I hope we can add one day.

Although not originally developed for that purpose, the testing and
benchmarking tools are also [fairly easy to explain to
users](https://futhark-book.readthedocs.io/en/latest/practical-matters.html).
In most other languages, testing tools are libraries that are built in
the language itself. This is not possible in Futhark, because it is not a
general-purpose language, so we had to build tools that are outside
the language itself. This can easily result in tools that are just
plain bad, but I think we managed to do alright.
