Skip to content

Qi Meeting Apr 5 2024

Siddhartha Kasivajhula edited this page Apr 26, 2024 · 6 revisions

Who Needs Proofs?

Qi Meeting Apr 5 2024

Adjacent meetings: Previous | Up | Next

Summary

We reviewed how Qi uses contracts in the compiler to generate good, useful errors, and the challenges presented in doing the same with the new stream consumers that are being deforested in racket/list. We also looked at some improvements in error messages in recent releases of Python to see if there was anything there that either Qi or Racket could adopt, and discussed some foundational reasons why the considerations around error handling are quite different between Python and Scheme. We reviewed our unit testing infrastructure to see if it's sufficient for our purposes as we continue to deforest more list APIs in the compiler. Finally, we revisited recent discussions about "Schrodinger's probe" and discussed whether there was more we could do to work around this phenomenon.

Background

Dominik has been working on deforesting more APIs in racket/list and found that generating error messages would require a different approach from what we have been using. Testing the optimizations is also not as convenient as we'd like.

We've been discussing in recent meetings the broader question of good error messages both in Qi as well as in Racket.

Hendrik recently discovered the by-now-infamous "Schrodinger's probe" phenomenon, and since then, we've been trying to account for it in our theories and have considered whether we can come up with implementations or practices that would avoid it in some way.

Review: How do we use contracts in the Compiler?

Whenever there is an error encountered at runtime, there are few questions to be answered:

  • Who's complaining?
  • What went wrong?
  • Whose fault is it?

We could generate custom error messages that answer these questions, but Racket already provides a standard and formalized way of doing this -- contracts.

Although contracts can wrap any value, in practice, due to the performance cost of these checks, they are typically used only at module boundaries (i.e. (provide (contract-out ...))) rather than internally within modules (e.g. define/contract). In the Qi compiler, for the stream fusion (deforestation) optimization, we do neither of these, since the implementation of the optimization need not be provided outside the defining module, and as the fused operation forms a single anonymous composed function, it need not be defined and thereby bound to a name. This is why we use the low-level contract API directly, to wrap the anonymous fused sequence with a contract to validate its behavior as a whole, without needing to give it a name and without needing to provide that name outside the implementing module.

In general, we'd like to use contracts to mimic the behavior of the first function in the fused sequence that received an incorrect value, thus keeping the deforested implementation hidden from users.

The above approach allows us to report errors at the producer end (e.g. range) of the pipeline, but if there are any errors encountered in transformers or consumers, those would not be caught.

So far, we haven't had any transformers (i.e. just filter and map so far) that could raise a runtime error. On the consumer end, we've only really had car up to this point, which of course can raise a runtime error if it receives an empty list. We take a different approach there, of creating and then breaking a "virtual contract." That is, if we encounter an empty list, we manually construct a contract appropriate for car, and then intentionally break it by passing it an empty list, while providing the contract the source context forwarded up to this point (so that the correct source expression would be highlighted in the IDE). This emulates the error that would have occurred had car been invoked directly (instead of via our deforested implementation that mimics car). It incurs a performance penalty, but that's OK because it's an error anyway, and in practice, in fact, this mechanism isn't ever actually hit! TODO: why not?

Why Don't We Just Generate Errors Manually?

Instead of using contracts, we could just generate custom errors containing all of the relevant information -- source location, offending expression, the problem that happened, everything we know. This would avoid any performance cost associated with using the more general mechanism represented by contracts. But the problem is, such an error would be unstructured data that is understood (presumably) only by the user, and not by any tools in the Racket ecosystem.

By using contracts, we follow standard error pathways understandable to Racket, which means that it takes advantage of any features provided by IDEs and other ecosystem tools. For instance, DrRacket will highlight the offending expression in the source buffer.

Another option that represents a middle ground between custom errors and contracts is to raise an appropriate exception which includes source location information. Two built-in errors that include source location information are raise-syntax-error and raise-contract-blame-exception. These too are "proper" channels for generating errors, but it seems they don't provide everything we need (TODO: what exactly?), which is why it ends up being most convenient use contracts to avoid reinventing the wheel.

More Consumers

On to more racket/list APIs! As we discussed last time, there are many APIs in racket/list, including cadr, caadr, list-ref, and more, which could be deforested using a single consumer implementation. But unlike in the case of car, using a "virtual" contract to simulate the a priori error is tricky because we don't know what the original function is, at this stage. It could be any of the functions that map to the common fused implementation.

So we need to have some way to preserve and propagate the name of the original interface used so that we can generate an error implicating it. We discussed adding an extra 'name' field to our stream fusion implementation to propagate the source interface name (similar to how we already propagate source syntax context) and felt that it could resolve this issue.

Even More Testing

In the lead up to the Qi 4 release, we got some good advice from the Racket community on best practices for testing the compiler, and managed to write a fairly expansive and comprehensive suite of tests.

This includes:

  • regular unit tests for language semantics
  • smoke tests for interface macros wrapping the main flow macro
  • tests for macro-defining forms such as define-qi-syntax-rule
  • smoke tests for the expander
  • tests for traversal strategies used in compiler passes
  • "equivalence" tests for compiler normalization rules
  • tests to assert "whether" functional sequences are deforested
  • tests to validate that the meaning of the program is preserved through optimization for various combinations of deforested operations
  • benchmarks to validate performance improvements from optimization

... and with 99% coverage on the Qi codebase!

Why Not 100%?

There are a few expressions in the compiler that are not ever hit in practice today, such as the "virtual contract" in the car consumer. We felt this was OK since we expect these virtual contracts to be hit when more complex consumers are implemented.

Why 100%?

We are still aiming for 100% coverage, not just for technical reasons but especially for social and logistical reasons. If a PR brings coverage down from 99.5% to 99.2%, it may be OK, or it may not be OK. But it may not even be noticed during review, and would contribute some overhead to the effort of reviewing to determine whether more tests are needed or not. But with 100% coverage, every PR must either ensure that new code is covered by tests, or include the necessary configuration so that the uncovered code isn't considered in the coverage report. Either way, the developer must think through it and take the appropriate action, which factors this out of reviewing and lightens the review process, increasing "velocity" while also, of course, ensuring that the code is rock solid.

When 100% Isn't Enough

Despite the fact that the code is basically completely covered by tests, the purpose of tests is not just to catch regressions but to help us during development by giving us small feedback loops that allow us to identify issues and make rapid progress. From this perspective, we felt that there are gaps in the testing infrastructure, as iterating on deforestation still feels a bit awkward.

We went back to the basics to ask what we want from these tests.

What would we like the deforestation tests to accomplish?

We want to know:

  1. Whether the optimization was applied, and how many times it was applied.
  2. What the performance gain was.
  3. Whether the resulting program after optimization produces correct output.
  4. Whether the source expression was correctly rewritten during optimization.

For the first, we felt that Vincent St-Amour's suggestion of logging when compiler optimizations are applied, and then capturing these logs in tests, would suffice.

For the second, our benchmarks are adequate.

We'll leave the third aside for the moment.

For the fourth, we don't really have a way to do that, and probably our rudimentary string-based checks (which we currently use) or St-Amour's logging approach, would be the best we can do here.

Pushing the Boundaries

Now for the third -- testing that optimizations preserve source code semantics -- we felt that a new approach we could follow, now that the compiler is extensible through a simple require-based mechanism, would be to run the same tests with and without deforestation, and then compare the output. This allows us to compare the deforested version with the output from a reference (non-optimizing) compiler, without having to ourselves declare (as we would in conventional tests) what the expected output is!

For this purpose, we could enumerate all producers, transformers, and consumers, and then define simple uses of these (e.g. (filter odd?) and (map sqr)) as representative building blocks for a new, automated test suite. These building blocks would then be combinatorially combined to generate lots of tests, the output of which would be compared with and without (require qi/list) (i.e. with and without deforestation), to ensure that they have the same behavior -- whether that's returning true or false, or a specific value, or raising a certain exception. We counted 6 producers, 20 transformers, and 17 consumers, and came up with some wild numbers for how many tests these could generate once building blocks were defined over them, ranging from 1500 to 100 million tests!

Regardless, for now, we plan to just start with defining a way to run manually-written (e.g. our current) tests with and without deforestation, and then think about automatic code generation later. The approach is likely to have a lot of overlap with the way we're running benchmarks today, and in the same manner, can in the future be generalized to selectively turn on and off any number of compiler passes in running these tests and benchmarks.

Of course, an alternative to this approach would be to simply produce proofs that each compiler optimization preserves the meaning of the original expression in relation to a reference compiler (as we've talked about before). But it is perhaps easier to write tests for specific and representative (and corner) cases than to prove the general case, so that this approach of automatically comparing the output of two compilers is perhaps more practical than maintaining proofs that may diverge from the code they refer to.

The Competitive Error Landscape

The release notes for Python 3.11 and 3.12 show many recent improvements in error reporting for the language. We took the opportunity to discuss these and understand more about Qi's and Racket's error reporting infrastructure in comparison to other languages that are out there.

No Traceback

A language like python provides a traceback when a runtime error occurs. A traceback is a fairly intuitive and widely employed way to characterize the causal sequence leading to a problem, and can be a very helpful starting point for debugging. But a traceback is possible because python is a call stack-based language. That is, its control flow is implemented using a stack of function invocations, and control flow can proceed either by pushing a fresh invocation onto the stack, or popping one off to "return to the caller."

We don't have tracebacks in Racket because Racket is a continuation-based language, and that provides a lot more flexibility in control flow, but at the cost of losing the simple call stack that could be shown to users in the event of errors. In the absence of the need to maintain a call stack, Scheme compilers routinely eliminate continuation frames, such as in the standard Tail Call Optimization. Chez Scheme is very good at such optimizations, which is another reason we might lose information that could have been useful for debugging. The errortrace utility approximates a call stack by introducing prompts [correction: continuation marks. See the "Response" below] at every procedure evaluation, but it isn't something that is used very widely, and the reasons for this are unclear. Perhaps it does not reliably produce a stacktrace in every case, or cannot easily be used in some situations? Or maybe it's just a matter of limited awareness of this feature and of how and when to use it.

In any case, in the absence of a traceback, it's all the more important to generate good error messages with proper source location information. While a stack trace allows providing users a good starting point for debugging without any special effort on the part of the programmer, providing good error messages in Racket requires proper use of appropriate error reporting tools such as contracts and the appropriate exceptions. That's why it's important to get this right for Qi, and ideally, to develop any useful tools and best practices along the way that could improve error reporting in the Racket ecosystem more broadly.

A Response

On Discord, Sam T pointed out some errors in the above and provided many clarifications. The conversation is reproduced below:

samth — Today at 12:47 PM I think the discussion here (https://github.com/drym-org/qi/wiki/Qi-Meeting-Apr-5-2024#no-traceback) is confused in several ways.

  1. Racket does provide what are described as "tracebacks" here. Just as an example, you can look at this result file: https://pkg-build.racket-lang.org/server/built/test-fail/GLPK.txt. Everything after context: is exactly a traceback (or a stack trace, as it is usually called).
  2. Given (1), why do people think Racket doesn't have stack traces? One reason is that Racket stack traces sometimes contain fewer stack frames than you might get in Python. There are two big reasons for this. One is that the Racket compiler (including Chez here) performs inlining and other optimizations which can eliminate function calls. Python doesn't do that. Java and JavaScript, which do, keep track of all that information at runtime to give you a stack trace as if there was no inlining. A second reason is that Racket does not keep stack frames around for tail calls. A third reason is that sometimes macro-generated code does not have any source location information and so the frame doesn't have anything useful to say.
  3. What about errortrace? It rewrites the program to use continuation marks (prompts are not involved) to keep track of where in the computation you are. This does two things. First, it adds "stack frames" for every nested expression around the site of the error (eg let, if), not just function calls. So this is a lot more information than you would have anywhere else. Second, those continuation marks are part off the program, so they can't be affected by optimization. Thus the first reason can't apply any more. However, continuation marks respect tail calls, and so tail calls are still missing from the stack trace in the errortrace-generated stack trace.
  4. Why doesn't everyone use errortrace? First, it's not the default. Second, it causes 10x or more slowdowns.

Sid — Today at 1:00 PM Thank you, that definitely helps to understand it further! Do you mind if I include it, or excerpts from it, as a response in that section?

Does the package index use errortrace when reporting build failures? It could be useful if it did

Interesting, if I understand you correctly, errortrace should recover (or prevent the elimination of) all of the stack frames resulting from function invocation (same as what's available in other, callstack based languages, minus tail calls which seems like it would be noise anyway), but in addition also contains frames around all nested expressions. That sounds very useful indeed!

samth — Today at 1:10 PM First, sure, you're welcome to include that. Second, the pkg index does not use errortrace, and the stack trace for build failures is not usually useful, since it's the stack trace of the expander/compiler at that poing. It could be used during testing, but that would (a) use a lot more resources and (b) not necessarily do as much as you'd want (see below). But also I don't really recall a situation where it would have helped much.

Third, Racket is a "callstack based language" in the sense you mean. It works basically the same way, and reports errors in basically the same way, as Python or C. The difference is in how much optimization happens/how much the debug info system tracks those optimizations.

Chicken Scheme, or GHC Haskell, are not stack based in the relevant sense, and they have to work somewhat harder to produce stack traces.

Also, the lack of frames for tail calls really can be confusing sometimes, it's not just noise. What would be noise would be repeated frames from the same call, and there have been proposals to address that, but Racket doesn't do anything special there.

Fourth, one additional limitation of errortrace is that because it is implemented by rewriting code as it is evaluated, already-compiled code is not affected. So if you have something compiled already and that's an important part of the stack trace, those parts will be entirely missing in errortrace-based stack traces. If you want to get the most out of errortrace, you need to either recompile your program in errortrace-using mode, or remove compiled versions of the program so that Racket recompiles them on-the-fly.

Sid — Today at 1:20 PM That is very handy to know, and could account for why errortrace isn't used more (e.g. if the user didn't know this, they might assume errortrace isn't useful after using it the first time on compiled code... that might be me lol. I'll explore it more in depth)

Sid — Today at 1:20 PM ah so even the first frame is removed... that is surprising

samth — Today at 1:21 PM Note that all of this is explained in the errortrace quick start: https://docs.racket-lang.org/errortrace/quick-instructions.html

samth — Today at 1:22 PM This has to be true so that, for example, mutually-recursive functions or programming in continuation-passing-style do not grow the stack endlessly.

Sid — Today at 1:29 PM "keep track of all that information at runtime to give you a stack trace as if there was no inlining" - is this something that would be possible for Racket to do, or does the existence of errortrace make this moot?

Sid — Today at 1:33 PM Out of curiosity, why doesn't it detect such repeating sequences and preserve representative initial frames, like a, b, a, b, a, b, ... becoming a, b ?

samth — Today at 1:36 PM Those are both things that are possible to do (C, for example, does what Racket would need to do for the first, the second hasn't been done that I know of). Detecting repetition in general is hard of course, but easy in simple cases. They would both require significant implementation effort, and potentially have runtime or compile time (or both) costs.

Sid — Today at 1:39 PM Gotcha, thanks for explaining all this!!

Did You Mean ...?

We also looked at the new "Did you mean ...?" style hints in python error messages. We felt that the kind of analysis that would be required here -- for instance, determining identifiers bound in scope and comparing those against syntax errors in terms of Levenstein distance, to recognize misspelled identifiers -- would be more appropriate to do at the level of Racket than at the level of Qi.

Probing Without Effects

Recently, we realized that the probe debugger can read out values that aren't reified at any time in the actual computation that would have been performed in the absence of the readout, and may also affect the order of effects -- that is, the "Schrodinger's probe" phenomenon. So far, we have been content to say, that's just how it is, and a few warnings in the documentation to make users aware of the phenomenon is all we can do about it. But we discussed it further this time to see if that was in fact the case.

As a refresher, this is the essential problem. Consider this code:

(define-flow foo
  (~> (pass odd?) (>< sqr)))

This compiles to:

(>< (if odd? sqr ⏚))

If we add in some effects and observe it using the probe debugger, like so:

(define-flow foo
  (~> (pass (effect E₁ odd?))) readout (>< (effect E₂ sqr)))
(probe (foo 1 2 3))

Then something weird happens. With the readout, all the effects E₁ would occur first. Without the readout, the flow would be deforested (i.e. to the compiled program above) and the effects E₁ and E₂ would be interleaved.

But could we have a readout that was somehow "invisible" to the compiler, so that it behaved the same way with the readout as without it?

A debug core form?

One way could be if we introduce a debug as a core form of the language. Then, compiler passes could be aware of this form and handle it specially, making the necessary changes in the optimized algorithm to preserve the illusion of the readout while still executing the flow in the optimized form. It's unclear if this would be possible, but regardless, it would be a lot of work to essentially implement parallel semantics for every compiler optimization with and without readouts present.

An Undefined Translation

But the problem really is that the source program and the target program are different programs - different algorithms computing the same result – and they can be very different. Thus, a point in the source program doesn't always map to one in the target program. In some cases a translation of the readout to the target program is just not defined.

An aside: We've talked about how Qi programs can be expressed as graphs. If we formalize this further, we could represent the source program and target program as graphs and then study the nature of the translation between the two. There just might be an interesting topological way to think about it — for instance, can we define a useful notion of continuity? Would that tell us anything about whether a translation of readout is possible?

We compared it to refactoring code. If there is a comment in the code, then after refactoring, where should that comment be placed? If the comment was adjacent to a function that was split into two functions, which function should the comment be moved next to? What if the original function was done away with altogether?

Some languages in the past had richer syntax for comments – in these languages, comments could be syntactically identified as being headers, prefix comments, or trailing comments. In such languages, it would perhaps be more clear where the comment should go after refactoring, since the comments explicitly indicate the code that they refer to.

In Qi, we do have something analogous – the effect form. In developing our theory of effects, we have been saying that effects are not defined in isolation but only in reference to a flow. Additionally, the compiler guarantees that such effects will never be separated from the flows they are associated with.

So one way to achieve the kind of predictable readout we have been looking for could be to always place the readout within an effect. This ensures that if the readout is hit at all, that it will represent values in the target program, and its presence or absence would not affect the order of effects. But it achieves this by essentially changing the source program to one that is well-behaved in this way. In particular, the presence of an effect at all (which may happen to be a readout) may prevent some optimizations from being applied in order to honor the guarantees on effects. So this approach would be more deterministic, but it's unclear whether it would be more useful.

Next Steps

(Some of these are carried over from last time)

  • Improve unit testing infrastructure for deforestation.
  • Schedule a discussion for Qi's theory of effects and merge the corresponding PR.
  • Schedule a discussion for the language composition proposal and implement a proof of concept.
  • Decide on appropriate reference implementations to use for comparison in the new benchmarks report and add them.
  • Deforest other racket/list APIs via qi/list
  • Decide on whether there will be any deforestation in the Qi core, upon (require qi) (without (require qi/list))
  • Review and merge the fixes for premature termination of compiler passes.
  • Continue investigating options to preserve or synthesize the appropriate source syntax through expansion for blame purposes.

Attendees

Dominik, Hendrik, Sid

Clone this wiki locally