[Schedule] Support for intra-kernel data placement #436

Open
wants to merge 19 commits into
base: tvm
from

Conversation

hecmay
Collaborator

@hecmay hecmay commented Feb 4, 2022

This PR aims to enhance the .systolic() and .to() primitives to better support intra-kernel data placement for systolic array generation with the AutoSA backend.

The .systolic() primitive is a push-button API that automatically maps the compute kernel to a systolic array, leaving the choice of dataflow pattern to the compiler. The .to() primitive gives expert designers more flexibility to explore the trade-offs between different systolic dataflows.

I have resolved the dependency issues and installed AutoSA on our local server. In this PR, I will also add local CI/CD testing for systolic array programs with the AutoSA backend.
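
For readers unfamiliar with the term, the sketch below simulates one possible systolic dataflow (output-stationary) for GEMM in plain Python. It is only an illustration of what "mapping a kernel to a systolic array" means; it is not the design AutoSA generates, and the function name `systolic_gemm` is hypothetical.

```python
def systolic_gemm(A, B):
    """Simulate an output-stationary systolic array computing C = A @ B.

    PE (i, j) owns the accumulator for C[i][j]. Rows of A flow left to
    right and columns of B flow top to bottom; because the inputs are
    skewed, PE (i, j) sees operand pair k at cycle t = i + j + k.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    # The last PE (M-1, N-1) consumes its last operand pair (k = K-1)
    # at cycle M + N + K - 3, so M + N + K - 2 cycles are needed in total.
    total_cycles = M + N + K - 2
    for t in range(total_cycles):
        for i in range(M):          # all PEs work in parallel in hardware
            for j in range(N):
                k = t - i - j       # which operand pair reaches this PE now
                if 0 <= k < K:
                    C[i][j] += A[i][k] * B[k][j]
    return C, total_cycles


C, cycles = systolic_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C, cycles)
```

Other dataflows (for example, keeping weights stationary instead of outputs) change which operand stays in each PE, which is the kind of trade-off the description attributes to .to().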

@hecmay hecmay changed the title [Schedule] Improve support for intra-kernel data placement [Schedule] Support for intra-kernel data placement Feb 6, 2022
@hecmay
Collaborator Author

hecmay commented Mar 6, 2022

@zzzDavid @chhzh123 can you maybe take a quick pass on this PR? Thanks!

Member

@chhzh123 chhzh123 left a comment

Sorry for the late review. I’ve looked through the code and think maybe you could add more descriptions for this PR. Seems you have added several new features besides the AutoSA backend.

  1. I notice you introduced new APIs like transpose and pack, and new passes like transform_layout and explicit_unroll, could you also describe the changes in this PR?
  2. Just a small question: You are not writing a C++ codegen for AutoSA right? All the compilation happens at the Python level (except for some transformation passes).

@@ -26,5 +26,5 @@ jobs:
source $VITIS/settings64.sh
source /opt/xilinx/xrt/setup.sh
export LOCAL_CI_TEST=1
which vivado_hls
Member

The vivado_hls has been included in the previous paths?

Comment on lines +24 to +35
def indent(num):
    return " " * num

def get_function_code(name, code):
    pos = code.find(name)
    start_pos = pos - len("inline void")
    end_pos = code.find("/* Helper", pos)
    return code[start_pos:end_pos]


def get_ser_size(code):
    lines = code.split("\n")
Member

I am not sure what Python formatter HeteroCL uses, but mixing one-line space and two-line space seems weird.

Comment on lines +117 to +119
PART = "10,16"
LAT = "2,2"
SIMD = 4
Member

What are these magic numbers? Could you add comments or use more specific variable names here?
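
One way to address this, sketched under assumptions: give each tuning knob a descriptive name and build the AutoSA sizing string in one place. The class `SystolicTuning` and method `to_sa_sizes` are hypothetical names, the `space_time` value of 3 is a placeholder, and the string layout is assumed to follow AutoSA's `--sa-sizes` syntax; please verify it against the AutoSA documentation before relying on it.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SystolicTuning:
    """Named tuning factors (hypothetical wrapper, not part of this PR)."""
    space_time: int              # index of the chosen space-time mapping
    array_part: Tuple[int, ...]  # array-partitioning tile sizes (was PART)
    latency: Tuple[int, ...]     # latency-hiding tile sizes (was LAT)
    simd: Tuple[int, ...]        # SIMD vectorization factors (was SIMD)

    def to_sa_sizes(self) -> str:
        def fmt(values):
            return ",".join(str(v) for v in values)

        parts = [
            f"kernel[]->space_time[{self.space_time}]",
            f"kernel[]->array_part[{fmt(self.array_part)}]",
            f"kernel[]->latency[{fmt(self.latency)}]",
            f"kernel[]->simd[{fmt(self.simd)}]",
        ]
        # Assumed format of the value passed to AutoSA's --sa-sizes option.
        return "{" + ";".join(parts) + "}"


# Values taken from the snippet under review; space_time=3 is an assumption.
gemm_tuning = SystolicTuning(space_time=3, array_part=(10, 16),
                             latency=(2, 2), simd=(4,))
print(gemm_tuning.to_sa_sizes())
```

Keeping the factors as integer tuples rather than comma-separated strings also makes it harder to mix up a scalar `SIMD = 4` with a string such as `"1,1,2,4"`.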

Comment on lines +126 to +129
print(f"[ INFO ] input size OC({OC}), OH({OH}), OW({OW}), IC({IC}), R({R}), C({C})")
PART = "16,13,13,1"
LAT = "2,1,2"
SIMD = "1,1,2,4"
Member

Why "16,13,13,1"? I suppose this is not a test file but a general implementation.

@@ -1,6 +1,6 @@
/*!
* Copyright (c) 2019 by Contributors
* \file adjust_buffer_binding.cc
* \file loop_partition.cc
Member

I suppose you changed the header by mistake? The file name remains the same.

@@ -265,6 +265,61 @@ def join(self, srcs, dest=None):
"inconsistent tensor joining"
self.sch.join(target, dest, self[src])

def transpose(self, tensor=None):
Member

I think there is one in compute_api.py. What's the difference between these two transpose?

return Y

# Note that you have to make sure AutoSA binary
# in on the PATH by running which command, otherwise HCL runtime
Member

"in on" typo. Better add quotation marks for which.
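
Rather than asking users to run `which` by hand, the runtime could perform the check itself. A minimal sketch using the standard-library `shutil.which`; the binary name `autosa` and the helper name `require_on_path` are assumptions, not taken from this PR.

```python
import shutil


def require_on_path(binary="autosa"):
    """Return the resolved path of `binary`, or raise if it is not on PATH.

    The default name "autosa" is an assumption about how the AutoSA
    executable is installed.
    """
    path = shutil.which(binary)
    if path is None:
        raise RuntimeError(
            f"'{binary}' was not found on the PATH; "
            "add its install directory to PATH before compiling."
        )
    return path
```

Calling this once before code generation would turn a confusing downstream failure into an actionable error message.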

extra_flags = "--simd-info=./autosa_tests/cnn/simd_info.json "
return ST, PART, LAT, SIMD, extra_flags

def generate_systolic_array(keys, values, code, backend):
Member

Seems the codegen, copying files, and generating headers are done in this function? Maybe it would be better if this function can be separated into several subfunctions or several steps like what we did in runtime.py.
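
A rough sketch of what such a split might look like, with one helper per step named in the comment above. All helper names, file names, and the header contents are hypothetical, and the actual AutoSA invocation is deliberately left out.

```python
import shutil
from pathlib import Path


def write_kernel_source(code, work_dir):
    """Step 1: dump the generated kernel code to a file."""
    path = Path(work_dir) / "kernel.c"
    path.write_text(code)
    return path


def copy_generated_files(src_dir, dst_dir):
    """Step 2: copy every regular file from src_dir into dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for f in sorted(Path(src_dir).glob("*")):
        if f.is_file():
            copied.append(Path(shutil.copy(f, dst)))
    return copied


def write_header(dst_dir, kernel_name):
    """Step 3: emit a header declaring the kernel entry point."""
    header = Path(dst_dir) / f"{kernel_name}.h"
    header.write_text(f"#pragma once\nvoid {kernel_name}();\n")
    return header


def generate_systolic_array_steps(code, work_dir, out_dir, kernel_name="kernel"):
    """Orchestrate the steps; each one can now be tested on its own."""
    write_kernel_source(code, work_dir)
    # The real flow would invoke AutoSA on the kernel source here (omitted).
    copied = copy_generated_files(work_dir, out_dir)
    header = write_header(out_dir, kernel_name)
    return copied, header
```

With this shape, the top-level function reads as a list of steps, and a failure in copying or header generation can be reproduced without running the code generator.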

@hecmay
Collaborator Author

hecmay commented Mar 7, 2022

> Sorry for the late review. I’ve looked through the code and think maybe you could add more descriptions for this PR. Seems you have added several new features besides the AutoSA backend.
>
>   1. I notice you introduced new APIs like transpose and pack, and new passes like transform_layout and explicit_unroll, could you also describe the changes in this PR?
>   2. Just a small question: You are not writing a C++ codegen for AutoSA right? All the compilation happens at the Python level (except for some transformation passes).

Thanks for pointing that out.

  1. These new APIs (e.g., packing, layout transformation) are necessary to generate a high-throughput memory subsystem for the GEMM systolic array. I will add more explanations on these new APIs.
  2. The AutoSA codegen in HCL is currently a mix of C++ and Python: the HLS/OpenCL code generator (the C++ part) calls a utility function (the Python part) that infers the CLI arguments and then invokes AutoSA. I can probably implement that utility function in C++, which would make the flow a bit cleaner.


self.cascade_tensor = tensor
self.cascade_source_stage = None
self.sch.transpose(src, tensor, new_shape)
Collaborator

I have a question about this self.sch.transpose function: is data packing actually done by this function? It seems to me that this pack function only calculates the new shape.
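
To make the distinction in this question concrete, here is a small standalone sketch that separates the two concerns: computing the shape after packing versus actually regrouping the data. It is not HeteroCL's implementation; `packed_shape` and `pack_rows` are hypothetical names used only for illustration.

```python
def packed_shape(shape, axis, factor):
    """Shape bookkeeping only: pack `factor` elements along `axis` into one word."""
    if shape[axis] % factor != 0:
        raise ValueError("dimension must be divisible by the packing factor")
    new_shape = list(shape)
    new_shape[axis] //= factor
    return tuple(new_shape)


def pack_rows(rows, factor):
    """Actual data movement: group `factor` consecutive elements of each row."""
    return [
        [tuple(row[i:i + factor]) for i in range(0, len(row), factor)]
        for row in rows
    ]


print(packed_shape((4, 8), axis=1, factor=4))
print(pack_rows([[1, 2, 3, 4]], factor=2))
```

If the schedule primitive only records the new shape, the data regrouping would have to happen in a later pass or in generated code, which is presumably what the question is trying to pin down.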

if (top_arg_names_.find(var_name) != top_arg_names_.end()) {
placement_info += "[0]"; // located on off-chip memory
} else {
placement_info += "[1]"; // loacted on on-chip memory
Collaborator

tiny typo

Collaborator

@zzzDavid zzzDavid left a comment

I have a few general questions about the newly added passes:

  1. From the test case, we are wrapping a piece of imperative GEMM code into a stage and calling systolic() on the stage. Do we have any checks to see whether the imperative code can be mapped to AutoSA? Or would AutoSA complain if it can't map the algorithm?
  2. I want to check if my understanding is correct. For a piece of code that targets the AutoSA backend, we first generate C code from it, and then call AutoSA to generate systolic array HLS code plus serialization/de-serialization code, which is then wrapped into a stage. Is that correct?
  3. About the "explicit unroll" pass: does it unroll a loop and then outline the loop body so that the copies become PEs (function calls)?
  4. What does "transform layout" do?

@@ -380,6 +380,9 @@ def lower(sch,
stmt = ir_pass.AdjustBufferBinding(stmt, arg_list)
stmt = ir_pass.InferStream(stmt, arg_list)
stmt = ir_pass.AdjustBufferBinding(stmt, arg_list)
# perform layout transformation
stmt = ir_pass.TransformLayout(stmt, arg_list)
stmt = ir_pass.AdjustBufferBinding(stmt, arg_list)
Collaborator

What does AdjustBufferBinding do? Why is it called multiple times after each pass?

3 participants