Replies: 18 comments 27 replies
-
Find my implementation here. I implemented LICM.

Implementation: Pretty much just the lecture notes, with some extra hints from the 4120 lecture notes, which actually present the analysis in an inverted way (start with everything loop-invariant, and mark things as non-invariant). I had to figure out how to merge loops with the same header; I tried to do this conservatively, but ended up just merging every loop with the same header into one. There is probably a better way to do this, but this worked.

Results: Running on the benchmarks, the average speedup was 3.9%, which seems nice. Most of this comes from the programs generated by the TypeScript compiler, as they redeclare a lot of constants. The better handwritten ones would often see a slowdown of an instruction or two, since the inserted preheader caused some overhead. I guess I could have taken it out, but it caused enough headaches that I just left it.

Testing: brench + benchmarks for final testing; again, most actual debugging happened in the REPL.

Challenges: Not to sound like a broken record, but I still have a poor CFG design. If I were to do this class over, I would build a better abstraction that more easily facilitates adding/removing nodes and edges, as I have a lot of hidden invariants right now.
HOWEVER, this checks with pointer equality; if I want structural equality, what I have to write is hard to remember late at night, and this is not limited to this one case.

Another challenge was loops with the same header, and how to insert preheaders. The 4120 notes say to merge them if they are disjoint, but I thought it would be better not to when they are nested. This broke, so I ended up merging them anyway, and still got a 4% speedup. I also had some issues inserting the preheader: I naively assumed it was the unique predecessor of the header, but it is really the unique predecessor of the header from outside the loop, which I learned the hard way (this was also hella broken because of my bad, inconsistent CFG).
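The merge-loops-with-the-same-header step could be sketched like this (a minimal Python sketch with invented data shapes, not the post's actual code):

```python
def merge_loops_by_header(loops):
    """Conservatively union all natural loops that share a header.
    `loops` is a list of (header_label, set_of_block_labels) pairs."""
    merged = {}
    for header, blocks in loops:
        merged.setdefault(header, set()).update(blocks)
    return merged

# two loops sharing header "L1" become one loop
loops = [("L1", {"L1", "a"}), ("L1", {"L1", "b"}), ("L2", {"L2", "c"})]
print(merge_loops_by_header(loops))
```

With every same-header loop collapsed into one, each header needs exactly one preheader, which sidesteps the label-rewriting ambiguity at the cost of some precision.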
-
Summary: Code is here, here, and benchmarks, where the benchmarks link contains the tests/benchmarks, the full CSV with data, and a plotted bar chart. For this lesson, I implemented loop-invariant code motion. I tested the results against a small test suite I wrote, and then also used the benchmarks for the Bril core language as tests; I also used the same benchmarks to measure the impact of the optimization. The optimization itself can be run via a command in the repository. I had some challenges along the way, such as mistakenly changing backedge jumps to target the loop preheader rather than the header. This caused the optimization to perform poorly, and once I realized the mistake, fixing it also showed a performance improvement on the benchmarks. Writing the tests revealed another challenge: I had problems trying to move instructions out of the loop to the preheader. I found that I needed to move instructions in a specific order, where the operands of an instruction had to already be moved before the instruction itself could be moved. Another issue is dealing with nested loops. At the moment, I do not distinguish between inner loops and outer loops. Optimizing the inner loop first might allow instructions moved to the inner loop's preheader to be moved once again to the preheader of the outer loop, allowing further optimization. As an aside, to learn more about loop optimization, which I found interesting, I also chose to implement induction variable elimination, which I tested on some trivial examples to see whether it could detect simple instances of induction variables. Because I did not implement the optimizations to target the memory extension of the Bril language, the effect of induction variable elimination is much more limited.
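The operands-first ordering described above can be sketched as a small worklist pass (a hypothetical Python sketch; the instruction shape and names are my assumptions, not the post's code):

```python
def hoist_in_order(invariant_insts):
    """invariant_insts: dest -> {"args": [...]} for each loop-invariant
    instruction. Returns dests in a valid hoisting order: every
    instruction appears after the instructions defining its operands."""
    hoisted, order = set(), []
    changed = True
    while changed:
        changed = False
        for dest, inst in invariant_insts.items():
            if dest in hoisted:
                continue
            # ready when every operand is defined outside the loop
            # (absent from invariant_insts) or already hoisted
            if all(a not in invariant_insts or a in hoisted
                   for a in inst.get("args", [])):
                order.append(dest)
                hoisted.add(dest)
                changed = True
    return order

insts = {"c": {"args": ["b", "a"]},   # depends on both of the others
         "b": {"args": ["a"]},
         "a": {"args": []}}
print(hoist_in_order(insts))  # ['a', 'b', 'c']
```

This is just a topological sort of the invariant instructions by their def-use dependencies; anything left unordered at convergence would indicate a cycle and should not be hoisted.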
For this optimization I also faced challenges, many of them pertaining to what kinds of basic induction variables I could detect and how I could move their definitions; this forced a more basic optimization. Because it was not the focus of this week's assignment, I did only basic tests on it and did not measure its impact.

Optimization Results: I found that loop-invariant code motion achieved an average 3.5% improvement in the number of dynamic Bril instructions executed by the interpreter (with a standard deviation of 5.6%), compared to running the benchmarks without it. I used the same arguments the benchmark writer specified. The results come from evaluating the optimization on the Bril benchmarks that only use the Bril core language (excluding the floating-point and memory extensions). Because I measure the number of dynamic instructions using the interpreter, I do not have to measure many runs (unlike measuring an LLVM optimization, which, when run on a machine, would have variability in timing). A sample of the table with the data is included here, to avoid making the post too large; note that this sample is not representative across all benchmarks, it is just the first 8 in my benchmark ordering. Here, benchmark refers to the benchmark name; run is either baseline, referring to the benchmark without loop-invariant code motion applied, or licm, the benchmark with it applied; and result is the number of dynamic instructions reported when the brili interpreter is used.
A plot and the full CSV are also in the repository. Overall, across all the benchmarks evaluated, loop-invariant code motion never increased the number of dynamic instructions. Indeed, for a majority of the benchmarks (15 of the 23), the number of dynamic instructions executed was exactly the same as the baseline. For 8 of the 23 benchmarks, a decrease in the number of dynamic instructions was achieved. Where a decrease was observed, it was generally a result of pulling out instructions that define constants; these were hoisted to the loop preheader, decreasing the dynamic instruction count. One limitation of these measurements is that dynamic instruction count fails to account for other factors in a program's execution time: for instance, branching can force the instruction pipeline to be flushed, and instructions do not all take the same amount of time to execute. In summary, the loop-invariant code motion optimization improves the performance of Bril core programs by sometimes decreasing the number of dynamic instructions executed. A more interesting analysis would be to see whether further gains can be achieved by pairing loop-invariant code motion with other optimizations that eliminate dead code in loops, or whether there are different loop scenarios that can be optimized even further.
-
OK, I tried implementing LICM in LLVM and got some weird results that I would like to discuss. My initial implementation is here.

Idea: I derived the following algorithm to implement LICM:
Limitation: The biggest limitation of this approach is that it cannot move whole sub-regions of the loop when they are invariant as a whole, e.g. a branch inside the loop that is invariant together with all of its successors.

Issues: After implementing it, I found that my optimization was not moving any code on the examples I tried. The reason is that clang at -O0 keeps every variable on the stack. Now the question is why clang generates literally all explicit assignments in my program as stack loads and stores: for a C program with plain assignments, clang generates loads and stores through allocas.

So the question is how to deal with this. I understand that using the stack is totally fine for assignment operations, but how do I detect invariants in this case?

Not workable workaround: I managed to bypass the …

Upd: I tried replacing …

Upd_1: Thanks @orkosinha for the suggestion; I was able to run the optimization after doing a -mem2reg pass. Here is my list of commands for future reference:
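For future reference, a typical pipeline of this shape looks roughly like the following (the file and pass names `test.c`, `LICMPass.so`, and `my-licm` are placeholders of mine, not the author's exact commands):

```shell
# emit LLVM IR at -O0, but without the `optnone` attribute that
# would block later passes such as mem2reg
clang -O0 -Xclang -disable-O0-optnone -emit-llvm -S test.c -o test.ll

# promote stack slots (allocas) to SSA registers first
opt -passes='mem2reg' -S test.ll -o test.ssa.ll

# then run the custom LICM pass plugin on the SSA-form IR
opt -load-pass-plugin ./LICMPass.so -passes='my-licm' -S test.ssa.ll -o test.opt.ll
```

The key point is the ordering: mem2reg must run before the custom pass so that assignments become register values rather than loads and stores.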
-
My implementation of LICM is here. I learned that LLVM has some really powerful utilities, and one of them, I guess, trivializes some aspects of an LICM implementation.

LLVM Round 2: In my lesson 7 implementation, I ran into the same issue that @barabanshek ran into,
with LLVM IR looking like
going to
Awesome. Then I spent a couple of hours going back and forth trying to figure out how to integrate it into the PassManager. So some success was had...

Some brief implementation: With the all-powerful utility, the code goes from
to
So it's not very intelligent; I'm working on that.

Benchmarking: I ran embench-iot and randomly got this fluke result against this baseline. I do not think I'm running the benchmarks correctly, but I will update once I get better data. I do have some pretty basic tests in my test directory, and the loop invariants are indeed moved out of their original loops. This resulted in an average 0.97% speedup.

Summary: Still a good amount of work to do if I really want to dive deep with LLVM. My implementation of LICM should be more rigorous, and the benchmarking (outside of the fluke result) should be... also more rigorous.
-
I implemented LICM in LLVM. The code can be found here.

Implementation: One unfortunate side effect of using LLVM is that -O1 already performs LICM along with many other passes, and the -O0 option leaves the code in a faux SSA form. I was able to significantly simplify the implementation of LICM by first running the code through the LLVM mem2reg pass. This pass converts the clang-generated code that puts all variables on the stack into a more traditional SSA style with virtual registers and phi nodes. This can be done by simply adding the pass to the pass manager when registering the LICM pass. To allow optimization under -O0, clang must also be provided with the -disable-O0-optnone flag. Once the LLVM code is in "real" SSA form, LICM becomes much simpler. The pass is implemented as an LLVM loop pass and makes use of the utilities for determining whether an instruction has side effects and whether a value is loop invariant. In this setup, the code effectively follows the pseudocode provided in lecture.

Evaluation: I evaluated on a small set of benchmarks that I found on the Benchmarks Game. These benchmarks were easier to work with because they are self-contained, as opposed to a more realistic benchmark suite like Embench. For each benchmark, I profiled the application's runtime.
We can see that LICM provides a small but reasonable speedup in most cases, with an average speedup of 1.4%. It is also likely that other benchmarks would see a larger speedup, especially combined with other optimizations. The one weird case is mandelbrot, which saw an average slowdown of 2%. I don't quite understand why this is happening, other than potential variance in the results due to unpredictable behavior.

Summary: Overall, I found working with LLVM pretty pleasant. Most issues I had were well documented, with simple fixes. The hardest part was figuring out how to get the LLVM code into "real" SSA form with mem2reg.
-
Introduction: I implemented LICM using LLVM.

Setting up: When compiling sample programs to LLVM IR, I realized that clang was generating code that kept every variable on the stack.
I used
Note the -disable-O0-optnone option: `clang` adds an optnone attribute to each function at -O0, which prevents further optimization afterwards, including the mem2reg pass. To prevent that, pass -Xclang -disable-O0-optnone to clang. Now we are ready to write our pass.

Setting up the pass: I created a new pass. The first interesting thing we needed to do was to run mem2reg before our own logic; at that point we can use the usual loop analyses in the pass body.

Testing: To test, I took benchmarks from the embench-iot suite. For each of these benchmarks, I calculated the time it took with no optimization and the time it took with the optimization. The results are summarized in the following table:
You can see that some programs saw a speedup and some a slight slowdown.

Summary: Overall, I enjoyed working in LLVM; the existing architecture made my life much easier, though I do wish the documentation for LLVM were better! A next step for this implementation would be to assert the same functionality between the Base and LICM benchmarks. I was not sure how to do this, so I wrote some sample programs and checked by hand.
-
In this task, I implemented a loop reordering/interchanging pass in LLVM. My code can be found here. I reused some facilities from my loop analysis pass in Lesson 7, and again compiled and ran the program using LLVM 14 with the new pass manager.

Implementation: At first, I thought it would be easy to interchange two loops, just like cutting and pasting the for-loop header in an IDE, but I later found it was very troublesome to do in LLVM IR. Since a loop is organized as several basic blocks, as shown in the following figure (source: the LLVM Tutorial), we need to carefully move those blocks and change their inter-relationships. The inputs of my reordering pass are two loops (outer loop and inner loop). I first extract all the loop preheaders, headers, latches, and exits from the loops, and then rewire them. There is one more tricky thing here: if the loop to be reordered is the top-level loop, its preheader may be the entry block of the function. In this case, we need to split the block and create a new preheader block for that loop. The exit block may also connect with the following blocks, so it needs to be split as well. For the programming interface, I allow users to specify which loops to interchange, and multiple interchanges can be done in one pass.

Evaluation: As a golden test case, I tested the three-nested-loop GEMM on my machine, and considered all 6 possible loop permutations to see the performance of different traversal orders. I set the matrix size to 1024x1024; my CPU is an Intel Xeon Silver 4214 with 16.5MB of L3 cache.

for (int i = 0; i < SIZE; ++i)
  for (int j = 0; j < SIZE; ++j)
    for (int k = 0; k < SIZE; ++k)
      C[i][j] += A[i][k] * B[k][j];

The results are shown below. The speedups are normalized by the time of the original loop order.
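To illustrate that the interchange preserves the result (my own toy example in Python, not part of the pass), the same product can be computed in i-k-j order, whose innermost loop strides along a row of B, the cache-friendly direction for row-major storage:

```python
SIZE = 4
A = [[i + j for j in range(SIZE)] for i in range(SIZE)]
B = [[i * j + 1 for j in range(SIZE)] for i in range(SIZE)]

def gemm_ijk(A, B):
    C = [[0] * SIZE for _ in range(SIZE)]
    for i in range(SIZE):
        for j in range(SIZE):
            for k in range(SIZE):
                C[i][j] += A[i][k] * B[k][j]
    return C

def gemm_ikj(A, B):
    C = [[0] * SIZE for _ in range(SIZE)]
    for i in range(SIZE):
        for k in range(SIZE):
            for j in range(SIZE):  # innermost loop now walks a row of B
                C[i][j] += A[i][k] * B[k][j]
    return C

# the reduction is a sum, so reordering the k and j loops is legal
print(gemm_ijk(A, B) == gemm_ikj(A, B))  # True
```

The legality argument is exactly why an interchange pass is allowed to permute these loops: each C[i][j] is an order-independent sum over k.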
Based on this table, we can obtain the following speedup comparison. The conclusion is the same as the GEMM example (Fig 6.46) in the famous introductory book CSAPP: if the traversal order matches how the data is organized in memory, performance is best.
-
Loop Optimization: I also went the route of implementing LICM in LLVM. It was surprisingly easy given LLVM's utilities.

Benchmark: I wrote a Makefile to compile my programs with and without my pass.
On these three programs I do not see any significant improvement in wall-clock execution times.

Loop Optimization, Optimized Again: I sat down to implement LICM for Bril and got pretty far. There are some corner cases that I currently ignore, and possibly a bug somewhere that makes one particular benchmark (cholesky.bril) incorrect, but I am giving up for now. My code is here. The implementation was fairly close to what the lecture describes. There are some caveats though:
Feelings and Evaluation: SSA is downright pleasant to work with. I did not even need to implement a reaching-definitions pass; I just checked whether a variable was defined outside the body of the loop to see if it was trivially invariant. I ran my pass over all Bril benchmarks except the one mentioned before. I would say, though, that seeing the optimized programs with all the right instructions hoisted to the right place (especially in programs with nested loops) was more rewarding than the 3.617% average speedup.
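The trivially-invariant check under SSA can be sketched like this (a minimal Python sketch with assumed Bril-style instruction dicts, not the author's code):

```python
def trivially_invariant(inst, loop_defs):
    """Under SSA, a pure instruction is loop-invariant if every argument
    is defined outside the loop. `loop_defs` is the set of variables
    defined by instructions inside the loop body."""
    if inst.get("op") in ("phi", "br", "jmp", "print", "ret"):
        return False  # control flow and side effects are never hoisted
    return all(arg not in loop_defs for arg in inst.get("args", []))

loop_defs = {"i", "t"}
print(trivially_invariant({"op": "add", "dest": "x", "args": ["a", "b"]}, loop_defs))  # True
print(trivially_invariant({"op": "add", "dest": "y", "args": ["i", "b"]}, loop_defs))  # False
```

Because SSA gives each variable exactly one definition, "defined outside the loop" already implies "same value on every iteration", which is why no reaching-definitions analysis is needed.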
-
My code is here. Phew, I found this assignment rather challenging. I didn't actually have to write very much code, largely because I was able to use built-in LLVM methods to do the heavy lifting. However, this also meant that I was playing with lots of code that I didn't understand. There was lots of monkey-see-monkey-do using all manner of internet sources, followed by silly experimentation of my own.

Note that I didn't use the … I tinkered with …, and the following resource was a real lifesaver, although it steered me towards …. I then got into trouble with ….

I grabbed the Embench test suite and ran my pass on all of it. Some functions were still segfaulting mysteriously (I suspect I am moving other delicate things, not just branches...), but I just nixed those programs. Some very handy cherrypicking!

Finally, I don't think I deserve any points for evaluation. I have tested my pass by eye, basically just by printing out which instructions I moved up. However, I struggled to get an actual evaluation tool fully linked up and deployed. None of the existing guides for … worked for me.
-
My code is here.

LICM: Loop-Invariant Code Motion. I used LLVM to implement LICM for this task. In the beginning, I tried to follow the pseudocode for identifying loop-invariant instructions by checking reaching definitions, but then I looked through the methods LLVM provides. Therefore, the implementation simply goes over all the instructions of the loop and calls the built-in check on each one.

"Rigorously evaluate its performance impact. If you choose LLVM, select an existing (small!) benchmark suite such as Embench." I'm still figuring out how to use it.

Commands:
Discussions: As LLVM already has a helper function to handle loop-invariant detection, the hardest part of this task for me was the tricky setup around it.
-
Bril - Python: My implementation is summarized in the following files. In fact, I tried to create something that mirrors the corresponding abstractions in LLVM. The main classes are the following:

Loop: Represents a loop in the input code. It consists of the blocks that belong to that specific loop, and contains functionality like the following:
LoopPass: Similar to LLVM's loop pass.

Example: An example of my pass is included in the repository.
Experimental Evaluation: I selected a couple of benchmarks from the Bril benchmark suite.
LLVM: I also did a quick implementation in LLVM as well; however, due to its simplicity, I doubt that it's correct. My code can be found here. I am following the algorithm presented in class, which iterates until convergence. In order to check whether an instruction is invariant, I check whether all of its operands are loop invariant in the following loop:

bool invariant = true;
for (unsigned op = 0; op < inst.getNumOperands(); ++op) {
if (!L->isLoopInvariant(inst.getOperand(op))) {
invariant = false;
break;  // one non-invariant operand is enough to decide
}
}

If so, I mark the instruction as invariant as follows:

if (invariant) {
L->makeLoopInvariant(&inst, changed);
}

I took some ideas from LLVM's own LICM implementation, which looks way more complex. I did a quick performance test using this loop, but I am not seeing any improvements in terms of execution time.
-
My implementation of loop optimization is here: link.

Loop Invariant Code Motion: I chose LICM as the loop optimization pass to implement. I wrote a loop pass that checks whether an instruction is loop-invariant, and then moves the loop-invariant instructions to the preheader.

Testing with a real-life benchmark: I was able to figure out how to set up the Embench benchmark on my machine. I also modified its compilation script and added an option to run my pass during compilation:
Results: Note that the Embench test suite by default uses ….
Discussion: For most test cases, we observe that LICM consistently brings a speedup ranging from 0.01ms to 3.44ms. A few cases show a slight slowdown, which could be caused by changes in instruction alignment. The other thing I learned is that clang doesn't generate SSA-form IR by default; instead, we get pointer loads/stores. I noticed that in the last assignment but just went on with it. This time, I learned that there's an LLVM pass called mem2reg that fixes this.
-
I implemented LICM in Bril. My implementation, evaluation script, and data from the evaluation can be found here.

Running LICM: My implementation of LICM can be run simply by running:
This will produce a JSON representation of the LICM version of the original program.

Challenges and Limitations: I had a lot of challenges trying to figure out and deal with the various specifics of the algorithm: finding the natural loops, finding loop invariants (especially tracking the reaching definitions), and filtering out loop invariants that couldn't be moved out of the loop. But grappling with each of these sub-parts really helped me understand the precise definition of natural loops and what a loop invariant consists of, so I consider this exercise to have furthered my understanding! (I anticipate there are still lots of bugs in the current version, which I hope to come back to and fix after I work on SSA, as noted below.) It's also astonishing how much one has to deal with to perform a seemingly simple optimization such as LICM. An unfortunate limitation, because I am catching up right now, is that I haven't implemented SSA yet! I bet there are a lot more instructions I would be able to optimize, with cleaner approaches, if I were working on the SSA form of Bril. One will notice that my LICM pass didn't have much of an effect on the Bril benchmarks' performance, and I'm guessing the lack of SSA is a major reason for that. I hope to run this LICM pass on SSA form once I get that implemented, and fix bugs as I find them.

Evaluation: I ran both the non-LICM and LICM versions of all Bril benchmarks, and compared the number of dynamic instructions executed. The experiment can be replicated by running the command:
where
-
My terrible LICM implementation is here.

Implementation: I did loop-invariant code motion. However, judging from the benchmarks, there doesn't seem to be much motion going on... Anyway, this assignment was much more challenging than I expected, and making my main data structures immutable definitely made the job 10 times harder. My workflow is roughly:
Testing and performance analysis: I did incremental testing. First of all, if I just call …, everything is fine. Currently, if I perform LICM, a majority of the benchmarks from the bril repo fail. I suspect this is because there are dependencies between some of the loop-invariant instructions in some loops, and I didn't hoist them in the right order. Unfortunately, despite spending many hours on this assignment, the result wasn't as ideal as I hoped.

Replication: To replicate my experiment, simply run ….
-
I implemented LICM! My code is here.

Implementation: My main challenge for this implementation was getting sick. After I recovered reasonably well, I was able to finish the assignment.

Finding loops: Before I got sick, I got most of the way to finding the loops. I split this into finding the back edges, then finding the blocks in the loop for each back edge.

Finding backedges: This was pretty straightforward, since a backedge is just whenever one node A dominated by another node B has that dominating node B as one of its successors. In other words, whenever one node dominates a node which can go back to it, there is a backedge. I just used my old code to get the dominators of each node, and whenever a node's dominator was also one of its successors, we had a backedge!

Finding loop contents: This was trickier because we had a definition but not an algorithm. Then I realized that the fact that we wanted a minimum set meant that I could just start with the two elements definitely in the set (the header and the backedge node), and then add everything necessary for the definition to fit, unless I ran into anything that broke the natural loop. So I did an iterate-until-convergence of adding the predecessors for every node except the header, unless a predecessor wasn't dominated by the header. If a predecessor wasn't dominated by the header, this broke the natural loop, since it meant there were multiple entry points, and the whole set couldn't be a natural loop. If every predecessor was dominated by the header, then it was part of the natural loop. Iterating to find all such predecessors then gives the minimal set such that, for each node, all of its predecessors are in the set or it is the header.

Implementing LICM: After recovering, I actually implemented the code that does loop-invariant code motion.
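The backedge-and-grow procedure from the loop-finding part above might look like this in outline (a Python sketch over made-up CFG maps, not this post's implementation):

```python
def find_backedges(succs, dom):
    """A backedge is an edge n -> h where h dominates n.
    succs: node -> list of successors; dom: node -> set of dominators."""
    return [(n, h) for n, ss in succs.items() for h in ss if h in dom[n]]

def natural_loop(backedge, preds):
    """Smallest set containing the header and the backedge tail that is
    closed under predecessors, stopping at the header."""
    tail, header = backedge
    loop = {header, tail}
    stack = [tail]
    while stack:
        for p in preds.get(stack.pop(), []):
            if p not in loop:
                loop.add(p)
                stack.append(p)
    return loop

# tiny CFG: entry -> h; h -> body | exit; body -> h
succs = {"entry": ["h"], "h": ["body", "exit"], "body": ["h"], "exit": []}
preds = {"h": ["entry", "body"], "body": ["h"], "exit": ["h"]}
dom = {"entry": {"entry"}, "h": {"entry", "h"},
       "body": {"entry", "h", "body"}, "exit": {"entry", "h", "exit"}}
print(find_backedges(succs, dom))                  # [('body', 'h')]
print(sorted(natural_loop(("body", "h"), preds)))  # ['body', 'h']
```

The multiple-entry check described above would be an extra guard inside the growth loop: bail out if a newly added predecessor is not dominated by the header.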
There are some comments on the analysis, and then I talk about the two main parts, where I marked instructions loop invariant and then moved them if they were safe to move.

Code analysis caching: Noticing that I was passing around the whole function contents and then frequently re-analyzing it to find predecessors, successors, dominators, etc., and inspired by LLVM, I started a tiny refactor to add the option of caching and reusing analyses. The basic design is that each function has access to an analysis dict which it can use to either find an analysis that already happened or add to it. By passing around the analysis object between functions that should share a context (i.e. those analyzing the same function), things only recompute an analysis when they need to. This code is mostly here. I didn't rewrite all of my code to use this pattern, but all of this assignment is written that way.

Reachability and other analyses: First, I realized that I needed analyses of variable use and definition, so I implemented those. Then I noticed that I didn't actually know where the loop exits were, so I said a loop node was an exit node if it had any successors outside the loop. Using Python's …, I managed to reuse my reachability analysis from before. My previous reachability analysis from the dataflow assignment was just ….

Preheaders: My basic block list is set up as an OrderedDict, so it always remembers the order it read the basic blocks in, which makes it pretty easy to concatenate all the instructions back together to rebuild the code. I just represented the preheader as a list of instructions to splice in after a label when reconstructing the code.

Marking instructions loop invariant: I did a pretty direct translation of the pseudocode, with maybe-wonky control flow based on Python's ….

Moving instructions when it's safe: This was pretty okay; I don't really have comments on this.
One more thing... One question I didn't really know the answer to was how to handle the fact that the same blocks can be in multiple loops. My implementation had a bug where I would rename all of the labels pointing to a loop header from outside the loop to point to the preheader instead. This makes sense, but if two loops shared a header (like in …), one loop's backedge looked like an outside edge to the other loop and got rewritten. To fix this, whenever loops had the same header and shared any nodes, I merged them into one big loop. This seemed to work, but I wouldn't be that surprised if it decreased the amount of code that could be moved, since it might care too much about what happens in other loops. Another way to deal with this would probably just have been a different rule for which labels get rewritten.

Analysis: I finally used brench! You can run my brench evaluation with the provided config. Besides a few programs that were missing in the base case (…),
Basically it looks like whenever loop-invariant code motion can be used, it helps at least a little. Quadratic somehow gets this huge 16% improvement, probably since 3/21 instructions in its main loop are loop-invariant.
-
My implementation of LICM is here.

Implementation: I'm sorry that this is so late. Hopefully the arsonist will chill out and my room won't flood in the near future 🥴.

Natural loops: To find natural loops, I followed the pseudocode in these notes, so this task was pretty straightforward.

Reaching definitions: I should have known reaching definitions would be useful later in the semester! I did not implement this for the dataflow analysis assignment, so I had to implement it for this one. This was not too difficult, but I didn't realize until later that I should be storing the reaching definitions by line rather than by block, so I had to make some changes after I realized this.

LICM: Most of the LICM implementation wasn't too difficult, but I did have a lot of trouble moving instructions to the preheader. At first I was appending safe-to-move instructions to the end of the preheader, but this didn't work if the last instruction was a branch or return: I was getting errors for undefined values, which makes sense, because the program was branching before executing the loop-invariant instruction. Then I decided to add terminators to each block before inserting instructions as the second-to-last element in the preheader. For some reason this approach didn't work either. I tried to figure out why for an hour or two before giving up (I didn't like this approach anyway, because for programs with no loop-invariant code to move, the number of instructions executed actually increases over the baseline). I ultimately went with the approach of checking whether the last instruction is a terminator: if it is, I insert the loop-invariant instruction right before the terminator, and if not, I append it to the end of the preheader. I don't think that the benchmarks cover all the cases for determining whether an instruction is safe to move; I could have come up with some test cases to check those cases.
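The final insertion rule (before the terminator if there is one, otherwise at the end) can be sketched as follows; this is a hypothetical helper with assumed Bril-style instruction dicts, not the post's code:

```python
TERMINATORS = {"br", "jmp", "ret"}

def insert_into_preheader(preheader_insts, new_inst):
    """Insert `new_inst` just before the preheader's terminator, if the
    block ends in one; otherwise append it to the end."""
    if preheader_insts and preheader_insts[-1].get("op") in TERMINATORS:
        preheader_insts.insert(len(preheader_insts) - 1, new_inst)
    else:
        preheader_insts.append(new_inst)
    return preheader_insts

block = [{"op": "const", "dest": "n", "value": 10},
         {"op": "jmp", "labels": ["loop"]}]
insert_into_preheader(block, {"op": "const", "dest": "c", "value": 42})
print([i["op"] for i in block])  # ['const', 'const', 'jmp']
```

This keeps the hoisted instruction on every path into the loop without adding any extra instructions to programs that have nothing to hoist.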
However, I am already very late on this assignment, and all the benchmarks produce the right output, so I am happy with that.

Evaluation: I used brench to test my implementation. The command is
A lot of benchmarks have 0% improvement due to a lack of loop-invariant code, but I'm really impressed by these results!
-
My implementation of LICM in LLVM is here.

Implementation: I mainly used the analysis tools provided by LLVM.
If all the operands are loop invariant, and the instruction is not in a subloop and does not create any side effects, we should be able to perform LICM on it safely. I have also seen some other definitions of the LICM safety restrictions; however, I was not sure whether they are more restrictive, less restrictive, or equivalent to what I implemented, since the loop-invariance check is done by LLVM and is hidden behind multiple layers of LLVM implementation. There is also one rule which I believe should be added, but I couldn't figure out how: basically, we would like to make sure that the instruction we are moving out of the loop would actually have run within the loop.

Evaluation:
The performance looks horrible; it's even slower after LICM. I think the reason might be that the benefit from LICM cannot justify the extra cost. Or maybe my implementation isn't right.
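For what it's worth, the usual way to enforce the "would it actually run?" rule is to require the hoisted instruction's block to dominate every loop exit. A minimal sketch of that check (my assumption about the missing rule, not the author's code):

```python
def safe_to_hoist(inst_block, loop_exits, dom):
    """Only hoist an instruction whose block dominates every exit of
    the loop: then the instruction executes on every trip through the
    loop, so running it once in the preheader cannot introduce a fault
    or side effect that the original program could have skipped.
    dom[n] is the set of blocks dominating n."""
    return all(inst_block in dom[e] for e in loop_exits)

dom = {"exit": {"entry", "header", "exit"}}
print(safe_to_hoist("header", ["exit"], dom))   # True
print(safe_to_hoist("guarded", ["exit"], dom))  # False
```

An instruction inside a conditional in the loop body fails this test, which is exactly the case where hoisting could execute something the original program never would.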
-
My implementation of LICM can be found here.

Implementation: To hoist loop-invariant expressions, we must first identify the loops and then identify all loop-invariant expressions. To do the latter, I used a kind of dataflow analysis that begins by assuming all expressions are loop invariant, then marks as non-invariant the variables that are defined more than once within the loop, and the expressions that rely on a non-loop-invariant variable or on a variable that is live into the loop.

Evaluation: I used brench to evaluate the performance of this optimization. On average, LICM resulted in a 1.7% speedup on the benchmarks. Here are the results for the benchmarks with the largest percent increase:
A lot of the benchmarks don't see any improvement, presumably because they lack loop-invariant code, but overall I am pleased with these results!

Drawbacks: As discussed above, this implementation forfeits several opportunities for optimization. Additionally, it doesn't handle nested loops with the same header: that case would create two blocks with the same label, which could cause trouble, though I did not run into this specific issue on the benchmarks.
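One simplified reading of the inverted analysis described above, as a Python sketch (the data shapes and the live-in handling are my assumptions, not the author's code):

```python
def invariant_vars(loop_insts, live_in):
    """Start by assuming every variable defined in the loop is invariant,
    then demote: variables defined more than once, variables that are
    live into the loop yet redefined inside it, and anything that
    depends on a demoted variable. Iterate to convergence."""
    defs = {}
    for inst in loop_insts:
        d = inst.get("dest")
        if d is not None:
            defs[d] = defs.get(d, 0) + 1
    invariant = set(defs)
    for v, count in defs.items():
        if count > 1 or v in live_in:
            invariant.discard(v)
    changed = True
    while changed:
        changed = False
        for inst in loop_insts:
            d = inst.get("dest")
            if d in invariant and any(a in defs and a not in invariant
                                      for a in inst.get("args", [])):
                invariant.discard(d)
                changed = True
    return invariant

insts = [{"dest": "c", "op": "const", "value": 1},
         {"dest": "i", "op": "add", "args": ["i", "c"]},   # loop counter
         {"dest": "x", "op": "mul", "args": ["c", "c"]},   # invariant
         {"dest": "y", "op": "add", "args": ["i", "c"]}]   # depends on i
print(sorted(invariant_vars(insts, live_in={"i"})))  # ['c', 'x']
```

Starting from "everything is invariant" and demoting makes the fixpoint monotone: the invariant set only shrinks, so the loop is guaranteed to terminate.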
Beta Was this translation helpful? Give feedback.
-
Which loop optimization did you pick?