-
Firstly, I thought the actual algorithm presented was really clever, especially the method of incrementing a register along a minimal set of edges to produce unique sums. The first thought that came to my mind was that the extra information over edge profiling would be extremely helpful for reordering blocks and setting branch fallthroughs. Edge and block profiling can help, but having concrete knowledge of which paths are hot, so they can be kept contiguous in the generated code, seems much better. One motivation I wasn't totally convinced of, however, was using this for code coverage. While ensuring path coverage in tests would certainly be exhaustive, it also seems like overkill. I think ensuring line and branch coverage would yield a much higher value-to-time ratio than path coverage and would therefore see better adherence from developers in a professional setting.
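To make that concrete for myself, here's a rough sketch of what the inserted instrumentation amounts to on a tiny two-diamond CFG. This is my own reconstruction, not code from the paper, and it doesn't attempt the minimal (spanning-tree/chord) edge placement; it just shows the register-sum idea:

```python
# My own reconstruction of the idea, not code from the paper: a register r
# starts at 0 at ENTRY, selected edges add a constant to it, and at EXIT the
# final value of r is a unique path number used to bump a counter.

NUM_PATHS = 4                  # a CFG of two back-to-back diamonds has 4 paths
count = [0] * NUM_PATHS        # one counter per potential path

def run(first_then: bool, second_then: bool) -> None:
    r = 0                      # ENTRY: reset the path register
    if first_then:
        pass                   # edge to the first "then" block: increment 0
    else:
        r += 2                 # edge to the first "else" block: increment 2
    if second_then:
        pass                   # edge to the second "then" block: increment 0
    else:
        r += 1                 # edge to the second "else" block: increment 1
    count[r] += 1              # EXIT: r is a distinct value in 0..3 per path

for a in (True, False):
    for b in (True, False):
        run(a, b)
print(count)                   # each path taken once -> [1, 1, 1, 1]
```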
-
Another comment mentioned how clever the solution was. I was also kind of surprised that it is possible to produce a distinct integer in 0..n-1 for each path. However, the proof was simple and elegant, which is always a nice result for a surprising outcome. It is also nice to see a paper deviate from the status quo and show that profiling can be done a different way (moving from edges to paths).
-
I felt that the algorithm for deriving an acyclic CFG from a cyclic one on pages 52 and 53 was clever. The key intuition is to remove backedges while maintaining a bijective map between the paths in the original CFG and those in the resulting CFG. I was also surprised that this algorithm works for any CFG, including those with irreducible loops. I wonder how applicable this reduction technique was for analysis tools that came after this paper. One part I found unsatisfying is how they deferred the handling of early termination to another paper, Bal94. It wasn't clear on my first skim of the paper why the event counting algorithm handles early termination -- all I found was:
-
As others have mentioned, the algorithm discussed in the paper is certainly elegant - even after reading into it, it's surprising that such an efficient algorithm exists for CFGs with cycles. I was unaware of the "overlapping paths" inaccuracy of edge profiling and basic block profiling, and of the fact that these inaccuracies were tolerated because path profiling was assumed to be prohibitively expensive. One of the main themes of this paper is overhead vs. accuracy. On one end of the spectrum are basic block profiling and edge profiling, which have lower accuracy but also less overhead. At the other end is the path profiling presented in this paper, which has more overhead but is also more accurate. For the use case of program optimization and performance tuning, I believe the additional overhead of path profiling would be outweighed by the performance gains over relying on heuristics. I'm not so sure about the test coverage application, for reasons mentioned by others in this thread.
-
This paper had two positive points for me:
Following the first point above, I am wondering about self-organizing, dynamic instrumentation and whether it makes sense to think about it. It would have to happen at runtime and through many iterations of executing a piece of code. In a nutshell, some runtime agent would execute the program in a controlled environment, collect data, and suggest instrumentation for the next run. The process would continue until convergence, or until some threshold of complexity or error is reached. The result would be an instrumentation of the program, which would then be ready for the actual path profiling.

A problem I had with the paper was realizing that the algorithm itself provides the uniqueness guarantee for every path starting from every vertex in the CFG. This is what forces the authors to add extra steps for the case in which loops in the CFG are broken by removing their backedges. While I understand how the extra steps help, I do not know why we need to preserve the uniqueness guarantee for every path - even ones not starting at ENTRY or not ending at EXIT.
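To at least pin down the mechanics I'm referring to, here is my own sketch (not the paper's code; the function and example graph are just made up for illustration) of the backedge step as I understand it: each backedge w -> v is replaced by two dummy edges, ENTRY -> v and w -> EXIT, which is presumably why paths that begin at a loop head or end at a loop tail need their own unique numbers:

```python
# My own sketch of the backedge transformation, not the paper's code: each
# backedge w -> v is removed and replaced by two dummy edges ENTRY -> v and
# w -> EXIT, so every execution of the original CFG decomposes into acyclic
# paths of the transformed CFG. Dummy edges may parallel existing edges; in
# the real construction they are kept as separate edges.

def remove_backedges(edges, backedges, entry="ENTRY", exit_="EXIT"):
    dag_edges = [e for e in edges if e not in backedges]
    for (w, v) in backedges:
        dag_edges.append((entry, v))   # dummy edge into the loop head
        dag_edges.append((w, exit_))   # dummy edge out of the loop tail
    return dag_edges

# Example: ENTRY -> A -> B -> C, with backedge C -> B and exit edge C -> EXIT.
edges = [("ENTRY", "A"), ("A", "B"), ("B", "C"), ("C", "B"), ("C", "EXIT")]
print(remove_backedges(edges, backedges=[("C", "B")]))
```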
-
Reading this paper reminded me of the importance of having a variety of profiling tools. For instance, when I look at a flame graph in my IDE I'm willing to accept, say, 100% overhead (i.e., a 2x factor in wall-clock time) in running my code to generate that visualization, because I don't generate it very often. I'll use this rough summary to rewrite a large amount of code (e.g., rethink my choice of data structures for a function that is taking up a lot of the execution time). I'm curious about real-world systems that use profile-directed compilation. I'm also not sure what the standards are for overhead -- it's quite impressive that the overhead numbers they reported are small, but I feel like in most cases it should be possible to use path profiles to measure execution once on one version of the code and then use the results to guide optimizations for the next version.
-
I found this research paper really interesting, and it shifted the way that I think about compilers. This paper made me realize the importance of a profile-driven compiler and how crucial profiling is for optimization. A few sections in particular stood out to me. First, the extensions section was the most exciting part of this work, since it suggests various directions for future applications. Which of these extensions do people find most promising, and do you think any of these ideas have been explored further in later research? Another part that was interesting to me was the mention of path profiling being useful in areas beyond program optimization, like testing. One part of compilers that interests me is debugging, so I wonder if this algorithm could extend to adding debug support in compilers.
-
As a JIT enthusiast, I found this an amazing read! I did have a couple of open-ended questions/thoughts:
-
I was very intrigued while reading this paper; path profiling in and of itself seems like quite a difficult task, and being able to efficiently compute and represent what could be up to an exponential number of potential paths seems near-impossible at first. The algorithm the paper presents is quite clever in that it uses the structure of the path itself to store its number, and as such takes advantage of the many cases where the number of paths in a program is manageable, while the hash table optimization further takes advantage of the cases where the number of theoretical paths is high but the number of actually executed paths is lower.

This had me thinking about a more general idea. Throughout all of my time as an undergrad CS student, worst-case complexity was emphasized when analyzing algorithms, so my first instinct when asked to analyze an algorithm is to evaluate its worst-case complexity. However, as this paper implicitly takes advantage of in its design, the worst-case complexity of an algorithm is not always indicative of its performance in practice (for instance, there are almost certainly very few real-world programs for which this path profiling algorithm needs to truncate paths in order to run successfully). When I first read the paper, my first instinct was to dismiss the algorithm as requiring exponential space "in the worst case," when in reality not only do most cases avoid this issue, but the algorithm uses clever tricks to deal with the large cases rather than being designed solely around the worst case. This paradigm of designing algorithms for the average case and handling the worst case separately seems quite useful for practical algorithms; is it the case that most algorithms for program analysis tasks like these follow this paradigm?
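To illustrate the dense-array-versus-hash-table point to myself, here is a tiny sketch; the cutoff constant is made up and not the threshold the paper's tool actually uses:

```python
# My own sketch of the fallback the paper describes: when the number of
# potential paths is small we can afford a dense counter array, and when it
# is huge we fall back to a hash table whose size tracks only the paths that
# actually execute. ARRAY_CUTOFF is an arbitrary illustrative value.

ARRAY_CUTOFF = 1 << 16

def make_path_counter(num_potential_paths):
    if num_potential_paths <= ARRAY_CUTOFF:
        counts = [0] * num_potential_paths       # O(1) increment, dense storage
        def record(path_id): counts[path_id] += 1
    else:
        counts = {}                              # storage grows only with executed paths
        def record(path_id): counts[path_id] = counts.get(path_id, 0) + 1
    return record, counts

record, counts = make_path_counter(4)
for path_id in (0, 3, 3):
    record(path_id)
print(counts)   # [1, 0, 0, 2]
```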
-
The paper is interesting to put into the context of today, where we're using more complicated and heterogeneous architectures for a lot of HPC problems. I think all the small progressions in modern computing - SoC architectures, multi-core synchronization, parallel programming, etc. - add up and make this type of profiling less useful for the optimizing compilers of today. After all, the SPEC CINT95 programs on which the paper shows the biggest improvements (in terms of % correct) are the ones with explosions in the number of control-flow paths, like the go benchmark, versus something like compression. I just don't think problems like that are very representative of the cutting edge today. Let me know if you disagree; I might have too narrow a perspective on this. Nonetheless, I definitely see the JIT argument for a profiler like this. But for AOT, my first thought is still to just use more structured control flow. That sounds easier than converting the CFG to a DAG, then to a DAG with "chord" edges, and then having different instrumentation for each edge type.
-
The authors propose an elegant algorithm and also do a great job explaining it. When I first came to the algorithm in Figure 5, it seemed hard to understand. But when I looked at Figure 6 right beside it, which demonstrates Figure 5 on an example, I quickly got the point of the algorithm. I could also add up the values on the edges in my head to test it. Figure 6 also makes it clear why the algorithm works: if there are m paths from a node to the end, then the next edge out of the same branch point has to be offset by m to leave room for those paths. It's also an elegant idea to use count[r]++ to count the number of times a path is taken. The paper was published almost 30 years ago; do we still see a lot of elegant algorithms in compiler research today, or are we seeing fewer as problems get more complicated?
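Here is how I convinced myself of that - a small sketch of my reading of Figure 5 (not the paper's actual code), where each edge out of a branch point gets the running total of the earlier successors' path counts:

```python
# My reading of the Figure 5 idea, not the paper's code: working backwards
# from EXIT, num_paths[v] is the number of v -> EXIT paths, and each outgoing
# edge gets the running total of its earlier siblings' num_paths, which is
# exactly the "leave room for m paths" observation above.

def assign_edge_values(succ, entry, exit_node):
    # succ: dict mapping each node to a list of its successors (acyclic CFG)
    num_paths = {}   # num_paths[v] = number of paths from v to EXIT
    val = {}         # val[(v, w)]  = increment placed on edge v -> w

    def visit(v):
        if v in num_paths:
            return num_paths[v]
        if v == exit_node:
            num_paths[v] = 1
            return 1
        total = 0
        for w in succ.get(v, []):
            val[(v, w)] = total   # ids total .. total + num_paths[w] - 1 go through w
            total += visit(w)
        num_paths[v] = total
        return total

    visit(entry)   # recursion stands in for the reverse-topological-order loop
    return num_paths, val

# A double diamond: 4 ENTRY -> EXIT paths, with ids 0..3 falling out of the sums.
succ = {"ENTRY": ["B", "C"], "B": ["D"], "C": ["D"],
        "D": ["E", "F"], "E": ["EXIT"], "F": ["EXIT"]}
num_paths, val = assign_edge_values(succ, "ENTRY", "EXIT")
print(num_paths["ENTRY"])   # 4
print(val)                  # summing val along any ENTRY -> EXIT path gives a unique id
```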
-
As many other comments have mentioned, I thought the algorithm described in the paper was a very neat approach to a problem that sounds pretty complicated. The results presented in the paper are also quite impressive, but I'm curious how optimizations utilizing path profiling perform compared to alternatives. I'm also wondering how these optimizations would be performed alongside some of the analyses we've discussed in class.
-
One of the most thought-provoking aspects of the paper is the trade-off between edge profiling and path profiling in terms of overhead. It mentions that path profiling has negligible overhead, or is even more performant, for procedures with a small number of potential paths, while edge profiling could be more suitable for procedures with a large number of potential paths, since path profiling's overhead grows on larger programs. To me this seems a little counterintuitive -- if this is an analysis that can lead to more powerful program optimizations, it should be most advantageous in large programs, yet it is far slower on them. How might this affect the choice of profiling technique in real-world software development projects, and what factors would you consider when making this decision? Aside from this, I felt that the algorithm for placing instrumentation and assigning unique path values was a little hard to follow, so I would love to go over it in discussion!
-
This was a really interesting topic to read about, as using profiling information to optimize code at compile time is a very novel idea to me. Beyond the neatness and simplicity of the algorithm (which others have commented on), I think some of the applications of this beyond compiler optimizations are really interesting. One that the paper touched on is performance analysis: identifying performance bottlenecks on specific paths of the program. Another that would be really interesting is providing better debugging information, like determining what path the code took before a panic or segmentation fault.
-
I really enjoyed this paper! I guess I am a bit surprised that this did not exist before 1996 - the techniques used to come up with the algorithm and the proof are covered in introductory data structures and algorithms courses. I think it is pretty neat that a reasonably clever undergraduate today could conceivably come up with something like this. I found it particularly interesting how many areas this algorithm could be applied to. Optimizations are obviously a big one, but it also helped enable advances in debugging and testing, as well as code layout (for example, placing frequently executed paths of instructions in contiguous memory locations to better utilize caching). I am particularly interested in seeing what research has been done in this area since this paper was released - I'd imagine that limitations such as loops and instrumentation complexity made it difficult to immediately adopt these results in industry. But we clearly use profilers a lot now, so I wonder what other advancements led to the technology becoming more practical.
-
If I had known about this during my CS4120 project, I would have produced a better solution for the tree register allocation we implemented in our compiler backend. In the tree-formation portion (cutting off some edges of the CFG), there are several ways to construct the tree, including EBBs (extended basic blocks), the dominator tree, and a maximum spanning tree weighted by path profiles. The path-profiling variant reportedly achieves the best performance, but we didn't have much idea how to implement it.
-
This is a discussion thread for the Profiling reading.
Please read the selected paper
Efficient Path Profiling
Thomas Ball and James R. Larus. MICRO 1996.
We will have in-class discussion on Tuesday September 19.
Some high-level questions to consider include but are certainly not limited to -
I (@rcplane) am the discussion leader and will try to answer any of your questions or direct you to useful informative resources.