
[SCFToCalyx] Lower SCF Parallel Op To Calyx #7409

Open · wants to merge 1 commit into main

Conversation

@jiahanxie353
Contributor

This patch lowers the scf.parallel op to Calyx by invoking multiple calyx.components in parallel. I'm not pushing test files for now because some required features are not upstream yet.

But I can give an example; say we have:

module {
  func.func @main() {
    %c2 = arith.constant 2 : index
    %c1 = arith.constant 1 : index
    %c3 = arith.constant 3 : index
    %c0 = arith.constant 0 : index
    %alloc = memref.alloc() : memref<6xi32>
    %alloc_1 = memref.alloc() : memref<6xi32>
    scf.parallel (%arg2, %arg3) = (%c0, %c0) to (%c3, %c2) step (%c2, %c1) {
      %4 = arith.shli %arg3, %c2 : index
      %5 = arith.addi %4, %arg2 : index
      %6 = memref.load %alloc_1[%5] : memref<6xi32>
      %7 = arith.shli %arg2, %c1 : index
      %8 = arith.addi %7, %arg3 : index
      memref.store %6, %alloc[%8] : memref<6xi32>
      scf.reduce
    }
    return
  }
}

The output is:

module attributes {calyx.entrypoint = "main"} {
  calyx.component @main(%clk: i1 {clk}, %reset: i1 {reset}, %go: i1 {go}) -> (%done: i1 {done}) {
    %c2_i32 = hw.constant 2 : i32
    %c1_i32 = hw.constant 1 : i32
    %c0_i32 = hw.constant 0 : i32
    %mem_1.addr0, %mem_1.write_data, %mem_1.write_en, %mem_1.clk, %mem_1.reset, %mem_1.read_data, %mem_1.done = calyx.memory @mem_1 <[6] x 32> [3] {external = true} : i3, i32, i1, i1, i1, i32, i1
    %mem_0.addr0, %mem_0.write_data, %mem_0.write_en, %mem_0.clk, %mem_0.reset, %mem_0.read_data, %mem_0.done = calyx.memory @mem_0 <[6] x 32> [3] {external = true} : i3, i32, i1, i1, i1, i32, i1
    %par_func_0_3_instance.in0, %par_func_0_3_instance.in1, %par_func_0_3_instance.in2, %par_func_0_3_instance.in4, %par_func_0_3_instance.clk, %par_func_0_3_instance.reset, %par_func_0_3_instance.go, %par_func_0_3_instance.done = calyx.instance @par_func_0_3_instance of @par_func_0_3 : i32, i32, i32, i32, i1, i1, i1, i1
    %par_func_0_2_instance.in0, %par_func_0_2_instance.in1, %par_func_0_2_instance.in2, %par_func_0_2_instance.in4, %par_func_0_2_instance.clk, %par_func_0_2_instance.reset, %par_func_0_2_instance.go, %par_func_0_2_instance.done = calyx.instance @par_func_0_2_instance of @par_func_0_2 : i32, i32, i32, i32, i1, i1, i1, i1
    %par_func_0_1_instance.in0, %par_func_0_1_instance.in1, %par_func_0_1_instance.in2, %par_func_0_1_instance.in4, %par_func_0_1_instance.clk, %par_func_0_1_instance.reset, %par_func_0_1_instance.go, %par_func_0_1_instance.done = calyx.instance @par_func_0_1_instance of @par_func_0_1 : i32, i32, i32, i32, i1, i1, i1, i1
    %par_func_0_0_instance.in0, %par_func_0_0_instance.in1, %par_func_0_0_instance.in2, %par_func_0_0_instance.in4, %par_func_0_0_instance.clk, %par_func_0_0_instance.reset, %par_func_0_0_instance.go, %par_func_0_0_instance.done = calyx.instance @par_func_0_0_instance of @par_func_0_0 : i32, i32, i32, i32, i1, i1, i1, i1
    calyx.wires {
    }
    calyx.control {
      calyx.seq {
        calyx.par {
          calyx.seq {
            calyx.invoke @par_func_0_0_instance[arg_mem_0 = mem_1, arg_mem_1 = mem_0](%par_func_0_0_instance.in0 = %c0_i32, %par_func_0_0_instance.in1 = %c2_i32, %par_func_0_0_instance.in2 = %c0_i32, %par_func_0_0_instance.in4 = %c1_i32) -> (i32, i32, i32, i32)
          }
          calyx.seq {
            calyx.invoke @par_func_0_1_instance[arg_mem_0 = mem_1, arg_mem_1 = mem_0](%par_func_0_1_instance.in0 = %c1_i32, %par_func_0_1_instance.in1 = %c2_i32, %par_func_0_1_instance.in2 = %c0_i32, %par_func_0_1_instance.in4 = %c1_i32) -> (i32, i32, i32, i32)
          }
          calyx.seq {
            calyx.invoke @par_func_0_2_instance[arg_mem_0 = mem_1, arg_mem_1 = mem_0](%par_func_0_2_instance.in0 = %c0_i32, %par_func_0_2_instance.in1 = %c2_i32, %par_func_0_2_instance.in2 = %c2_i32, %par_func_0_2_instance.in4 = %c1_i32) -> (i32, i32, i32, i32)
          }
          calyx.seq {
            calyx.invoke @par_func_0_3_instance[arg_mem_0 = mem_1, arg_mem_1 = mem_0](%par_func_0_3_instance.in0 = %c1_i32, %par_func_0_3_instance.in1 = %c2_i32, %par_func_0_3_instance.in2 = %c2_i32, %par_func_0_3_instance.in4 = %c1_i32) -> (i32, i32, i32, i32)
          }
        }
      }
    }
  } {toplevel}
  calyx.component @par_func_0_0(%in0: i32, %in1: i32, %in2: i32, %in4: i32, %clk: i1 {clk}, %reset: i1 {reset}, %go: i1 {go}) -> (%done: i1 {done}) {
    %true = hw.constant true
    %std_slice_1.in, %std_slice_1.out = calyx.std_slice @std_slice_1 : i32, i3
    %std_slice_0.in, %std_slice_0.out = calyx.std_slice @std_slice_0 : i32, i3
    %std_add_1.left, %std_add_1.right, %std_add_1.out = calyx.std_add @std_add_1 : i32, i32, i32
    %std_lsh_1.left, %std_lsh_1.right, %std_lsh_1.out = calyx.std_lsh @std_lsh_1 : i32, i32, i32
    %load_0_reg.in, %load_0_reg.write_en, %load_0_reg.clk, %load_0_reg.reset, %load_0_reg.out, %load_0_reg.done = calyx.register @load_0_reg : i32, i1, i1, i1, i32, i1
    %std_add_0.left, %std_add_0.right, %std_add_0.out = calyx.std_add @std_add_0 : i32, i32, i32
    %std_lsh_0.left, %std_lsh_0.right, %std_lsh_0.out = calyx.std_lsh @std_lsh_0 : i32, i32, i32
    %arg_mem_1.addr0, %arg_mem_1.write_data, %arg_mem_1.write_en, %arg_mem_1.clk, %arg_mem_1.reset, %arg_mem_1.read_data, %arg_mem_1.done = calyx.memory @arg_mem_1 <[6] x 32> [3] : i3, i32, i1, i1, i1, i32, i1
    %arg_mem_0.addr0, %arg_mem_0.write_data, %arg_mem_0.write_en, %arg_mem_0.clk, %arg_mem_0.reset, %arg_mem_0.read_data, %arg_mem_0.done = calyx.memory @arg_mem_0 <[6] x 32> [3] : i3, i32, i1, i1, i1, i32, i1
    calyx.wires {
      calyx.group @bb0_2 {
        calyx.assign %std_slice_1.in = %std_add_0.out : i32
        calyx.assign %arg_mem_0.addr0 = %std_slice_1.out : i3
        calyx.assign %load_0_reg.in = %arg_mem_0.read_data : i32
        calyx.assign %load_0_reg.write_en = %true : i1
        calyx.assign %std_add_0.left = %std_lsh_0.out : i32
        calyx.assign %std_lsh_0.left = %in0 : i32
        calyx.assign %std_lsh_0.right = %in1 : i32
        calyx.assign %std_add_0.right = %in2 : i32
        calyx.group_done %load_0_reg.done : i1
      }
      calyx.group @bb0_5 {
        calyx.assign %std_slice_0.in = %std_add_1.out : i32
        calyx.assign %arg_mem_1.addr0 = %std_slice_0.out : i3
        calyx.assign %arg_mem_1.write_data = %load_0_reg.out : i32
        calyx.assign %arg_mem_1.write_en = %true : i1
        calyx.assign %std_add_1.left = %std_lsh_1.out : i32
        calyx.assign %std_lsh_1.left = %in2 : i32
        calyx.assign %std_lsh_1.right = %in4 : i32
        calyx.assign %std_add_1.right = %in0 : i32
        calyx.group_done %arg_mem_1.done : i1
      }
    }
    calyx.control {
      calyx.seq {
        calyx.seq {
          calyx.enable @bb0_2
          calyx.enable @bb0_5
        }
      }
    }
  }
  calyx.component @par_func_0_1(%in0: i32, %in1: i32, %in2: i32, %in4: i32, %clk: i1 {clk}, %reset: i1 {reset}, %go: i1 {go}) -> (%done: i1 {done}) {
    // same idea
  }
  calyx.component @par_func_0_2(%in0: i32, %in1: i32, %in2: i32, %in4: i32, %clk: i1 {clk}, %reset: i1 {reset}, %go: i1 {go}) -> (%done: i1 {done}) {
   // same idea
  }
  calyx.component @par_func_0_3(%in0: i32, %in1: i32, %in2: i32, %in4: i32, %clk: i1 {clk}, %reset: i1 {reset}, %go: i1 {go}) -> (%done: i1 {done}) {
    // same idea
  }
}

One design choice is worth discussing together:
Re. why I created multiple FuncOps for the body of scf.parallel: there are two ways to invoke something in parallel in Calyx using par:

  • calyx.group
  • calyx.component

The former is not expressive enough to hold the body of scf.parallel; the latter can, and since we know a FuncOp will eventually be lowered to a calyx.component, we can simply invoke the components in parallel in the control section, changing only the operands passed to each component.

Question:
Can we invoke the same calyx.component together? If so, I'd only need to create one @par_func instead of multiple ones.

@jiahanxie353 added the Calyx label Jul 30, 2024
@rachitnigam
Contributor

@jiahanxie353 this patch is marked as a draft. Is that intentional? It would be better to review it once it is in a state where the design is mostly fixed and on the path to being merged.

@cgyurgyik
Member

I have a few C++/coding related comments, but I'll postpone these since you're in draft mode.

So currently, you've created N components that have the same M cells, and are subsequently invoked in parallel in some main component, i.e.,

component main() -> () {
  cells { 
    A0 = A0(); A1 = A1(); A2 = A2(); ...; An = An(); 
  }
  control {
    par { invoke A0; invoke A1; invoke A2; ...; invoke An; }
  }
}

component A0() -> () { ... }
...
component An() -> () { ... }

The alternative is what you've questioned,

Can we invoke the same calyx.component together? If so, I'd only need to create one @par_func instead of multiple ones.

i.e., create a component that takes as input M cells. This should be possible, but would require wiring the correct inputs and outputs.

component main() -> () {
  cells {
    A0 = A(); A1 = A(); ... ; An = A();
  }
  control {
    par { invoke A0(...); invoke A1(...); ...; invoke An(...); }
  }
}

component A() -> () { ... }

@jiahanxie353
Contributor Author

this patch is marked as a draft. Is that intentional? It would be better to review it once it is in a state where the design is mostly fixed and on the path to being merged.

I marked it as a draft for two reasons:

  1. The test cases can only be added once other PRs are merged, since some required features are still missing here;
  2. The high-level design is not fixed; it would be great to get feedback from reviewers, and once we've settled on a design choice (for instance, how many components to create, etc.), we can move it out of draft. That said, the implementation of the current design is solid.

This should be possible, but would require wiring the correct inputs and outputs.

Makes sense!


Discussion needed:
After translating to Calyx, I ran into an issue where a Calyx memory's write_en is driven by multiple sources, the root cause being memref.store %6, %alloc[%8] : memref<6xi32>.
This happens because I pass the same memory by reference to different components. Under scf.parallel's semantics, we know that different memref.stores won't write to the same memory address. However, when lowering to Calyx we lose this information and are left with

calyx.group @bb0_5 {
  ....
  calyx.assign %arg_mem_1.write_en = %true : i1
  ....
}

across different components. And since the signal has multiple drivers, we can't lower it to Verilog.

Any thoughts?

@rachitnigam
Contributor

Okay, the reasons for marking it a draft make sense @jiahanxie353!

In the scf.parallel's semantics, we are certain that we won't be storing to the same memory address by different memref.stores.

It is not sufficient to show that we are going to write to different memory locations. The problem is that a hardware memory can only support a single read or write per clock cycle (because it is a single-ported memory). For cases like this, you need to "bank" the memory into separate parts so that each parallel access targets an entirely different physical memory. First, I'd recommend reading the Dahlia paper to get a sense of what is going on. Next, we'd need to think more carefully about how this should be supported. @andrewb1999's work is adding this capability in a more general fashion, but we might need to bank memories somehow.
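For intuition, here is a rough software sketch of what cyclic (round-robin) banking does; this is not part of this PR and the names are made up. A logical address maps to a (bank, offset) pair, so two parallel accesses that land in different banks hit physically separate single-ported memories and can proceed in the same cycle, while accesses to the same bank must be serialized.

#include <cstdint>
#include <iostream>

// Cyclic (round-robin) banking: element `addr` of the logical memory lives in
// bank (addr % numBanks) at local offset (addr / numBanks).
struct BankedAddr {
  uint32_t bank;
  uint32_t offset;
};

BankedAddr mapToBank(uint32_t addr, uint32_t numBanks) {
  return {addr % numBanks, addr / numBanks};
}

int main() {
  const uint32_t numBanks = 2;
  BankedAddr a = mapToBank(3, numBanks); // bank 1, offset 1
  BankedAddr b = mapToBank(4, numBanks); // bank 0, offset 2
  // Different banks, so these two accesses could happen in the same cycle.
  std::cout << "same bank: " << (a.bank == b.bank) << "\n"; // prints 0
  return 0;
}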

Also tagging @ethanuppal since he's been thinking about similar problems in his language.

@jiahanxie353
Contributor Author

Is there a general algorithm to determine an "optimal" bank size for each memory?

Consider a somewhat made-up example with appropriate complexity:

module {
  func.func @main(%buffer : memref<16xi32>) {
    %idx_one = arith.constant 1 : index
    %idx_two = arith.constant 2 : index
    %idx_three = arith.constant 3 : index
    %idx_four = arith.constant 4 : index
    %idx_five = arith.constant 5 : index

    %one = arith.constant 1 : i32
    %six = arith.constant 6 : i32
    %val = arith.addi %one, %six : i32

    %step1 = arith.constant 1 : index
    %step2 = arith.constant 2 : index
    %lb1 = arith.constant 0 : index
    %lb2 = arith.constant 1 : index
    %ub1 = arith.constant 5 : index
    %ub2 = arith.constant 4 : index
    %mem1 = memref.alloc() : memref<25xi32>

    scf.parallel (%iv1, %iv2) = (%lb1, %lb2) to (%ub1, %ub2) step (%step1, %step2) {
      %mul1 = arith.muli %iv1, %idx_five : index
      %load_idx1 = arith.addi %mul1, %idx_four : index
      %load_val1 = memref.load %mem1[%load_idx1] : memref<25xi32>

      %mul2 = arith.muli %iv1, %idx_two : index
      %load_idx2 = arith.addi %mul2, %idx_three : index
      %load_val2 = memref.load %mem1[%load_idx2] : memref<25xi32>
      %sum1 = arith.addi %load_val1, %load_val2 : i32

      %mul3 = arith.muli %iv2, %idx_one : index
      %load_idx3 = arith.addi %mul3, %idx_one : index
      %load_val3 = memref.load %mem1[%load_idx3] : memref<25xi32>

      %mul4 = arith.muli %iv1, %idx_four : index
      %load_idx4 = arith.addi %mul4, %idx_two : index
      %load_val4 = memref.load %buffer[%load_idx4] : memref<16xi32>
      %sum2 = arith.addi %load_val3, %load_val4 : i32

      %store_idx = arith.muli %iv2, %idx_three : index
      %store_val = arith.addi %sum1, %sum2 : i32
      memref.store %store_val, %buffer[%store_idx] : memref<16xi32>
    }

    return
  }
}

And I have generated the memory access patterns:

| Memory access | (iv1=0, iv2=1) | (0, 3) | (1, 1) | (1, 3) | (2, 1) | (2, 3) | (3, 1) | (3, 3) | (4, 1) | (4, 3) |
|---|---|---|---|---|---|---|---|---|---|---|
| load mem1[5iv1+4] | 4 | 4 | 9 | 9 | 14 | 14 | 19 | 19 | 24 | 24 |
| load mem1[2iv1+3] | 3 | 3 | 5 | 5 | 7 | 7 | 9 | 9 | 11 | 11 |
| load mem1[3iv2+1] | 4 | 10 | 4 | 10 | 4 | 10 | 4 | 10 | 4 | 10 |
| load buffer[4iv1+2] | 2 | 2 | 6 | 6 | 10 | 10 | 14 | 14 | 18 | 18 |
| store buffer[3iv2] | 3 | 9 | 3 | 9 | 3 | 9 | 3 | 9 | 3 | 9 |

And I realize, for this example, according to the documentation:

Semantically we require that the iteration space can be iterated in any order, and the loop body can be executed in parallel. If there are data races, the behavior is undefined.

Since (iv1=0, iv2=1) and (iv1=0, iv2=3) both access mem1's elements 4 and 3, there is a potential data race and we cannot do any parallelism here. Granted, this example is completely artificial and maybe I'm over-complicating it. But if it were to happen, should we reject this kind of access pattern in the first place?


And if we take a step back to consider a somewhat simplified access pattern with only one induction variable:

| Memory access | iv1=0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| load mem1[5iv1+4] | 4 | 9 | 14 | 19 | 24 |
| load mem1[2iv1+3] | 3 | 5 | 7 | 9 | 11 |

What is the algorithm for determining the banking factor?
If we were to use a brute-force algorithm, we would search from bankingFactor = the size of the access pattern (5 in the example above) down to 1. Let's arbitrarily take bankingFactor = 4; we find that (a rough sketch of this check appears after the group analysis below):

  • Group 1 (iv1=0): loads mem1[4] (from mem1[5iv1+4]) and mem1[3] (from mem1[2iv1+3]), "consuming" two banks: the first bank (4 % 4 = 0) and the fourth bank (3 % 4 = 3).

  • Group 2 (iv1=1): loads mem1[9] and mem1[5], "consuming" one bank, i.e., the second bank (9 % 4 = 1, 5 % 4 = 1).
Groups 1 and 2 can be executed in parallel, since they do not share any bank:

par {
  component1 {
    iv1 = 0;
    load mem1[5iv1+4];
    load mem1[2iv1+3];
  }
  component2 {
    iv1 = 1;
    load mem1[5iv1+4];
    load mem1[2iv1+3];
  }
}
  • Group 3 (iv1=2): loads mem1[14] and mem1[7], "consuming" two banks: the third (14 % 4 = 2) and the fourth (7 % 4 = 3).
So Group 3 cannot be executed in parallel with Group 1, but it can with Group 2.

  • Group 4 (iv1=3): loads mem1[19] and mem1[9], "consuming" two banks: the fourth (19 % 4 = 3) and the second (9 % 4 = 1).
So Group 4 conflicts with Group 1 (both use the fourth bank), Group 2 (both use the second bank), and Group 3 (both use the fourth bank).

  • Group 5 (iv1=4): loads mem1[24] and mem1[11], "consuming" two banks: the first (24 % 4 = 0) and the fourth (11 % 4 = 3).
So Group 5 conflicts with Groups 1, 3, and 4 (they share the first and/or fourth bank); in particular, Group 4 conflicts with every other group and is forced to execute in sequence, isolated from the others; more analysis is needed.

Putting the different groups together and determining which ones can be executed in parallel starts to sound like a graph coloring problem. Is there any general way to do it?

  • In addition, we might move on to other banking factors and check whether we can achieve higher parallelism.
  • Worst case, if the work described above is too tedious, or if no banking factor is feasible, we fall back to a banking factor of 1, meaning there will be no parallelism at all: there will be only one component in the par group, and everything inside scf.parallel will be unrolled and executed in sequence inside that component.
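To make the brute-force check above concrete, here is a rough sketch in plain C++ (my own illustration, not the code in this PR). It only checks the all-groups-in-parallel case; partial parallelism would need the graph-coloring style analysis described above.

#include <cstdint>
#include <iostream>
#include <set>
#include <vector>

// Each "group" is the set of addresses one parallel iteration accesses.
using Group = std::vector<uint32_t>;

// Banks touched by a group under cyclic banking with the given factor.
std::set<uint32_t> banksUsed(const Group &g, uint32_t factor) {
  std::set<uint32_t> banks;
  for (uint32_t addr : g)
    banks.insert(addr % factor);
  return banks;
}

// Two groups conflict if they share at least one bank.
bool conflicts(const Group &a, const Group &b, uint32_t factor) {
  std::set<uint32_t> ba = banksUsed(a, factor), bb = banksUsed(b, factor);
  for (uint32_t bank : ba)
    if (bb.count(bank))
      return true;
  return false;
}

// Largest factor (searched from maxFactor down to 2) under which *all* groups
// can run in parallel, i.e. no pair of groups shares a bank; otherwise 1.
uint32_t pickBankingFactor(const std::vector<Group> &groups, uint32_t maxFactor) {
  for (uint32_t factor = maxFactor; factor >= 2; --factor) {
    bool ok = true;
    for (size_t i = 0; i < groups.size() && ok; ++i)
      for (size_t j = i + 1; j < groups.size() && ok; ++j)
        ok = !conflicts(groups[i], groups[j], factor);
    if (ok)
      return factor;
  }
  return 1; // fall back: no parallelism
}

int main() {
  // The single-induction-variable example above:
  // iteration i loads mem1[5*i + 4] and mem1[2*i + 3], for i = 0..4.
  std::vector<Group> groups;
  for (uint32_t i = 0; i < 5; ++i)
    groups.push_back({5 * i + 4, 2 * i + 3});
  // Prints 1: no single factor lets all five iterations run in parallel at once.
  std::cout << pickBankingFactor(groups, 5) << "\n";
  return 0;
}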

Any suggestion to tackle this?

@rachitnigam
Contributor

You can look at the TraceBank paper for details on how to automatically determine good banking factors. I think optimal solutions exist, but for large programs you might have to use heuristics.

@ethanuppal

I second TraceBank.

@jiahanxie353
Contributor Author

Whoo, finally finished implementing everything! It's a lot trickier than I thought.

Please let me know what you think!

@jiahanxie353 marked this pull request as ready for review September 22, 2024 23:18
class CliqueGraph {
public:
CliqueGraph(
const std::unordered_map<VertexType, std::set<VertexType>> &adjacencyList)
@jiahanxie353
Contributor Author

The only thing that needs to be fixed is the data types used. I had to use standard vectors, maps, etc. to get it working, such as here; but I think it would be better to use LLVM-native types such as DenseMap, DenseSet, etc. However, that required some extra work, like implementing a customized getTombstoneKey and more. Can I get some pointers on that?
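For context, what I believe LLVM expects is a specialization like the following: to use a custom type as a DenseMap/DenseSet key, you provide llvm::DenseMapInfo with getEmptyKey, getTombstoneKey, getHashValue, and isEqual. BankKey here is just a made-up stand-in for the real key type.

#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/DenseMapInfo.h"
#include "llvm/ADT/Hashing.h"

// Hypothetical key type, standing in for whatever the real map is keyed on.
struct BankKey {
  unsigned memId;
  unsigned bank;
  bool operator==(const BankKey &other) const {
    return memId == other.memId && bank == other.bank;
  }
};

namespace llvm {
// DenseMap/DenseSet need two reserved key values plus a hash and an equality
// check for the key type.
template <> struct DenseMapInfo<BankKey> {
  static inline BankKey getEmptyKey() { return {~0u, ~0u}; }
  static inline BankKey getTombstoneKey() { return {~0u - 1, ~0u}; }
  static unsigned getHashValue(const BankKey &key) {
    return static_cast<unsigned>(hash_combine(key.memId, key.bank));
  }
  static bool isEqual(const BankKey &lhs, const BankKey &rhs) {
    return lhs == rhs;
  }
};
} // namespace llvm

// Usage: llvm::DenseMap<BankKey, int> bankToComponent;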

redundantRemoved.push_back(simplifiedStep);
}
// 2. Combine steps with identical access into a single step
std::set<std::set<int>> seenStep;
@jiahanxie353
Contributor Author

Similarly here: I couldn't get a DenseSet of DenseSets working unless I implemented extra methods, so I'm not entirely sure what to do.

@cgyurgyik
Member

Hey @jiahanxie353, I'm going to focus on the implementation first. There are >1,300 new lines of code here; is it possible to break this up into multiple PRs? Second, can you give some insight behind the approach you've taken?

uint endIdx = bankSize * (bankCnt + 1);
auto allAttrs = cstAttr.getValues<Attribute>();
SmallVector<Attribute, 8> extractedElements;
for (auto idx = beginIdx; idx < endIdx; idx++)
@jiahanxie353
Contributor Author

Should be:

Suggested change
-for (auto idx = beginIdx; idx < endIdx; idx++)
+for (uint i = 0; i < bankSize; i++) {
+  uint idx = bankCnt + availableBanks * i;
+  extractedElements.push_back(allAttrs[idx]);
+}

for round robin. (For example, with availableBanks = 2 and bankSize = 3, bank 0 gets elements 0, 2, 4 and bank 1 gets elements 1, 3, 5 of a 6-element memory.)

@jiahanxie353
Contributor Author

jiahanxie353 commented Sep 23, 2024

Hey @jiahanxie353, I'm going to focus on the implementation first. There are >1,300 new lines of code here; is it possible to break this up into multiple PRs? Second, can you give some insight behind the approach you've taken?

Thanks, @cgyurgyik !

is it possible to break this up into multiple PRs

Sure, I'll try!

can you give some insight behind the approach you've taken

Sure! My strategy is to:

  1. Run partial evaluation on each scf.parallel op by
  • manually computing the static value of each induction variable (needed for running the TraceBank algorithm, since it needs the memory trace values, which must be known at compile time), and
  • copying the body block and inserting the new blocks into the scf.parallel op's body region, where each block corresponds to one induction-variable combination.
  • For example, if we have:
module {
 func.func @main() {
   %c2 = arith.constant 2 : index
   %c1 = arith.constant 1 : index
   %c3 = arith.constant 3 : index
   %c0 = arith.constant 0 : index
   %alloc = memref.alloc() : memref<6xi32>
   %alloc_1 = memref.alloc() : memref<6xi32>
   scf.parallel (%arg2, %arg3) = (%c0, %c0) to (%c3, %c2) step (%c2, %c1) {
     %4 = arith.shli %arg3, %c2 : index
     %5 = arith.addi %4, %arg2 : index
     %6 = memref.load %alloc_1[%5] : memref<6xi32>
     %7 = arith.shli %arg2, %c1 : index
     %8 = arith.addi %7, %arg3 : index
     memref.store %6, %alloc[%8] : memref<6xi32>
     scf.reduce
   }
   return
 }
}

it will have 4 blocks in total, corresponding to induction variable combinations (0, 0), (0, 1), (2, 0), and (2, 1), respectively. Then I copy the block and turn scf.parallel into:

    scf.parallel (%arg2, %arg3) = (%c0, %c0) to (%c3, %c2) step (%c2, %c1) {
     ^bb0:
      %induction_var1, %induction_var2 = 0, 0
      %4 = arith.shli %induction_var2, %c2 : index
      %5 = arith.addi %4, %induction_var1 : index
      %6 = memref.load %alloc_1[%5] : memref<6xi32>
      %7 = arith.shli %induction_var1, %c1 : index
      %8 = arith.addi %7, %induction_var2 : index
      memref.store %6, %alloc[%8] : memref<6xi32>
      scf.reduce
      ^bb1:
      %induction_var1, %induction_var2 = 0, 1
      %4 = arith.shli %induction_var2, %c2 : index
      %5 = arith.addi %4, %induction_var1 : index
      %6 = memref.load %alloc_1[%5] : memref<6xi32>
      %7 = arith.shli %induction_var1, %c1 : index
      %8 = arith.addi %7, %induction_var2 : index
      memref.store %6, %alloc[%8] : memref<6xi32>
     ^bb2:
     similar...
    ^bb3:
    similar...
    }
  2. Verify that each memory read/write can indeed run in parallel.
  • The algorithm I used is the TraceBank algorithm. Basically, we group all addresses that need to be accessed in parallel into different "steps", and for each "step" we need to make sure there are no memory access conflicts.
    • For example, if we have:
    scf.parallel (%arg2) = (%c0) to (%c2) step (%c1) {
     ^bb0:
      %induction_var = 0
      memref.store %1, %alloc[%induction_var] : memref<2xi32>
      scf.reduce
      ^bb1:
      %induction_var = 1
      memref.store %1, %alloc[%induction_var] : memref<2xi32>

it will raise an error because we can't access %alloc in parallel even though we are indexing using different addresses (0 in bb0 and 1 in bb1);

  • To this end, we need to separate the memories into banks, and we give users full control over how many banks each memory has by passing in a JSON file. If the number of available banks is sufficient, we split the original memory into the corresponding banks in a round-robin fashion; if not, we raise an error saying there are not enough banks for parallel access.
  • The details of the algorithm can be found here. To summarize, there are two main steps: (1) findMasksBits, which finds all mask bits (mask bits "mask" the memory trace/memory access values) that result in zero address conflicts; (2) mapMaskIDsToBanks, which maps memory trace values (the name is mapMaskIDsTo... because we use the mask bits produced in the previous step to mask the memory trace values, resulting in more condensed, shortened values) to their corresponding banks. The main data structures I used are ConflictGraph and CliqueGraph.
    • ConflictGraph detects potential memory access conflicts via graph coloring (using a maximal clique as a lower bound on the number of colors);
    • CliqueGraph finds maximal cliques in a graph. It's used in two places:
      • during the test of graph colorability, as mentioned above;
      • and in a very interesting case, where we need to find maximal independent sets to identify the "steps" in the TraceBank algorithm:
        • By definition, a "step" is a list of addresses that need to be accessed in parallel. It's intuitive to map the concept of a "step" to a block inside scf.parallel's region, since each block runs in parallel. Unfortunately, this didn't work out for me.
        • The caveat is when we have nested scf.parallels, such as @rachitnigam's Dahlia matmul example. My criterion for parallelizable blocks is that they are not in the same child block of a common ancestor scf.parallel; code here.
        • For example, if we have:
scf.parallel0{
^bb0_0:
  scf.parallel0_0 {
  ^bb0_0_0:
    op1;
    op2;
  ^bb0_0_1:
    op3;
    op4;
  }
^bb0_1:
  scf.parallel0_1 {
  ^bb0_1_0:
    op5;
    op6;
  ^bb0_1_1:
    op7;
    op8;
  }
  op9
}

op1 cannot run in parallel with op2 because they share the closest common ancestor scf.parallel0_0, which has two child blocks, bb0_0_0 and bb0_0_1, and op1 and op2 happen to be in the same child block. But op1 can run in parallel with op3 or op4. And op1 can run in parallel with op5 as well, since they share the closest common ancestor scf.parallel0, and op1 is in its child block bb0_0 while op5 is in bb0_1 (which makes sense if we look at the semantics). Similarly, op1 can run in parallel with op9, but op5 cannot. Thus, some blocks can be executed in parallel while others cannot. I tried to find all possible sets of blocks that can run in parallel by modeling every block as a node in a graph and connecting non-parallelizable blocks with edges; the problem is then reduced to finding maximal independent sets. Interestingly, finding maximal independent sets is the dual of finding maximal cliques, as long as we change the adjacency list/matrix into its complement form (a small sketch of this duality appears right after this list).

  • After identifying the parallelizable blocks, I pick operations out of those blocks to generate "step"s; code here. For example, {op1, op3, op5, op8} is a valid step; {op1, op4, op9} is also a valid step; etc.
  • Then, after verification, it's just a bunch of transformations to divide the original memories into banks based on the mapped mask IDs; the code starts here.
  3. Finally, we build the control for scf.parallel: each scf.parallel corresponds to a calyx.par, and each block inside scf.parallel becomes a calyx.seq nested inside that calyx.par.
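As an aside, here is a minimal sketch of the independent-set / maximal-clique duality mentioned above, using plain STL containers rather than the PR's CliqueGraph (the conflict graph below is made up, not derived from the matmul example): complement the adjacency relation, run any maximal-clique enumeration (Bron-Kerbosch without pivoting here) on the complement, and the results are exactly the maximal independent sets of the original graph.

#include <iostream>
#include <set>
#include <unordered_map>
#include <vector>

// Undirected conflict graph: an edge means the two blocks must NOT run in parallel.
using Graph = std::unordered_map<int, std::set<int>>;

// Complement graph: u-v is an edge iff it was not an edge in `g` (u != v).
// Maximal cliques of the complement are maximal independent sets of `g`.
Graph complement(const Graph &g) {
  Graph c;
  for (const auto &[u, uNbrs] : g)
    for (const auto &[v, vNbrs] : g)
      if (u != v && !uNbrs.count(v))
        c[u].insert(v);
  return c;
}

// Bron-Kerbosch without pivoting; collects every maximal clique of `g`.
void bronKerbosch(std::set<int> r, std::set<int> p, std::set<int> x,
                  const Graph &g, std::vector<std::set<int>> &out) {
  if (p.empty() && x.empty()) {
    out.push_back(r);
    return;
  }
  std::set<int> candidates = p;
  for (int v : candidates) {
    std::set<int> nbrs;
    if (auto it = g.find(v); it != g.end())
      nbrs = it->second;
    std::set<int> nr = r, np, nx;
    nr.insert(v);
    for (int u : p)
      if (nbrs.count(u))
        np.insert(u);
    for (int u : x)
      if (nbrs.count(u))
        nx.insert(u);
    bronKerbosch(nr, np, nx, g, out);
    p.erase(v);
    x.insert(v);
  }
}

int main() {
  // Four hypothetical blocks: 0 conflicts with 1, and 2 conflicts with 3.
  Graph conflictGraph = {{0, {1}}, {1, {0}}, {2, {3}}, {3, {2}}};
  std::vector<std::set<int>> independentSets;
  bronKerbosch({}, {0, 1, 2, 3}, {}, complement(conflictGraph), independentSets);
  // Prints the sets of mutually parallelizable blocks: { 0 2 } { 0 3 } { 1 2 } { 1 3 }
  for (const auto &s : independentSets) {
    std::cout << "{ ";
    for (int v : s)
      std::cout << v << " ";
    std::cout << "} ";
  }
  std::cout << "\n";
  return 0;
}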

Please let me know if you have any questions and/or suggestions, thanks!

@cgyurgyik
Member

Very cool! Thanks for the detailed explanation; it definitely clears things up. It would be great if you could split this up into several PRs, e.g., leave the nested par for a future PR. In general, I think the approach looks good, but I am concerned about introducing an exponential algorithm (Bron-Kerbosch maximal clique) into a compiler setting. At the very least, this should probably not be turned on by default.
