trace generation (Xilinx#592)
* trace generation

* fix format and test

* revert change on Passes.td format

* redo Passes.td format

* fix format

* remove experimental bit

* add trace.md

* update trace.md

* add unit tests

* increase default buffer size

* fix test

* update CHECK

---------

Co-authored-by: erwei-xilinx <[email protected]>
Yu-Zhewen and erwei-xilinx authored Jun 7, 2024
1 parent 3026abd commit e8f5f8c
Showing 11 changed files with 545 additions and 80 deletions.
37 changes: 37 additions & 0 deletions docs/trace.md
@@ -0,0 +1,37 @@
# Auto Trace Generation in AIR

## Usage

To enable this feature,

* provide the `insert-trace-packet-flow=true` option to the `air-to-aie` pass, and
* specify the `trace-size` and `trace-offset` options to the `airrt-to-npu` pass (the trace configuration is only inserted when `trace-size` is greater than 0).

Traces are then generated for all compute tiles (cores) and memtiles. If the added trace flows cause routing congestion, the build might fail.

`trace-size` defines the size, in bytes, of the buffer allocated to hold the trace data. Currently, the user chooses this value empirically, depending on the number of cores traced and how frequently the traced events are triggered.

`trace-offset` defines the offset at which the trace data are appended to the output. It might be inferred from the code in the future. In addition, it is for now hard-coded that the trace data are dumped to `ddr_id = 2`.
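
As a rough illustration of how the two options interact: the `airrt-to-npu` pass in this commit divides `trace-size` evenly by the number of columns in the target device, and places each shim column's slice at `trace-offset` plus a per-column stride. The sketch below reproduces only that arithmetic in standalone C++. `TraceSlice` and `traceSliceForColumn` are hypothetical names used for illustration and are not part of AIR.

```cpp
#include <cstdint>

// Hypothetical helper (not part of AIR): mirrors the arithmetic used by
// insertNpuWrite32ForTrace to carve the trace buffer into per-column slices.
struct TraceSlice {
  int64_t size;   // bytes available to one shim column
  int64_t offset; // where that column's trace data start in the output
};

TraceSlice traceSliceForColumn(int64_t traceSize, int64_t traceOffset,
                               int numColumns, int shimColumn) {
  // Integer division: any remainder of traceSize is left unused.
  int64_t size = traceSize / numColumns;
  return {size, traceOffset + shimColumn * size};
}
```

For example, assuming a device with four columns, `trace-size=65536` and `trace-offset=1024` would give the shim tile in column 1 a 16384-byte slice starting at offset 17408.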

One such example is provided in `test/xrt/01_air_to_npu`, and the generated trace file can be further processed through [parse_trace.py](https://github.com/Xilinx/mlir-aie/blob/main/programming_examples/utils/parse_trace.py).


Currently, in this particular example, when trace is enabled the entire column of core tiles is shifted to the right by one, and all trace data come out via the second column's shim tile. This works around a routing congestion: the `South` ports run out, so the bottom row of core tiles (i.e. the 2nd row of the whole array) cannot be routed as `Trace->South->West/East` once the route hits the switchbox of the memtile.

## air-to-aie
Inside this pass, the packet flows are inserted when `insert-trace-packet-flow=true`. The source of the flow is `channel = 0` of the trace port and the destination is `channel = 1` of the shim tile in the same column.
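
A minimal sketch of this allocation, using plain structs instead of MLIR operations: every traced tile gets one packet flow from trace channel 0 to channel 1 of the shim tile (row 0) in its own column, with flow ids counted up from 0. The names `Tile`, `TraceFlow` and `allocateTraceFlows` are hypothetical, and treating every tile above row 0 as a core tile or memtile is an assumption made for illustration; the pass itself queries the `TargetModel`.

```cpp
#include <vector>

// Hypothetical stand-ins for AIE::TileOp and AIE::PacketFlowOp.
struct Tile {
  int col;
  int row;
};

struct TraceFlow {
  int id;         // packet flow id
  Tile src;       // traced core tile or memtile
  int srcChannel; // trace port channel
  Tile dst;       // shim tile in the same column
  int dstChannel; // shim DMA channel
};

// Assumption: row 0 holds shim tiles, every other row holds traced tiles.
std::vector<TraceFlow> allocateTraceFlows(const std::vector<Tile> &tiles) {
  std::vector<TraceFlow> flows;
  int flowID = 0; // no check against already existing packet flows
  for (const Tile &t : tiles) {
    if (t.row == 0)
      continue; // shim tiles are destinations only
    flows.push_back({flowID++, t, 0, Tile{t.col, 0}, 1});
  }
  return flows;
}
```

Because the ids simply count up from 0, this scheme only works while nothing else in the design uses packet flows with the same ids.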

One possible future improvement is to allow the user to specify which channel/shim tile to use, or to have an allocation algorithm in place. In addition, the pass currently assumes that everything apart from the trace uses circuit-switched connections, and it does not detect potential conflicts in the packet id.

## airrt-to-npu
This pass is responsible for inserting trace-related `NpuWrite32Op` operations into `func.func`. The details of these operations have already been documented in [MLIR-AIE](https://github.com/Xilinx/mlir-aie/blob/resnet/docs/Tracing.md), except for the extra support for timestamp synchronization across multiple traces.

To have the synchronization, the following steps are required:

* reset the internal timer of each tile when the event `BROADCAST_15` is detected. The addresses are `0x34000` and `0x94000` for the NPU compute tile and memtile, respectively. The event ids are `122` and `157`, respectively, according to this [header file](https://github.com/Xilinx/aie-rt/blob/main-aie/driver/src/events/xaie_events_aieml.h).
* trigger the start of the trace with `BROADCAST_15` as well, by writing to addresses `0x340D0` and `0x940D0`, respectively.
* for the bottom left tile (0, 0), reset the timer when `USER_EVENT_1` is detected. The address to write is `0x34000` and the event id is `127`.
* use `USER_EVENT_1` to trigger `BROADCAST_15`. This is done by writing `127` to address `0x3404C`.
* actually trigger `USER_EVENT_1` by writing `127` to address `0x34008`.
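
The five steps above can be summarized as a short list of 32-bit register writes. The sketch below records them in a plain C++ vector instead of emitting `NpuWrite32Op`. The `RegWrite` struct and both function names are hypothetical. The shift amounts (event id shifted left by 8 for the timer register, by 16 for the trace start event) are taken from the pass implementation in this commit, not from the register documentation.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical record of one 32-bit write to a tile register.
struct RegWrite {
  int col;
  int row;
  uint32_t address;
  uint32_t value;
};

constexpr uint32_t kBroadcast15Core = 122;    // BROADCAST_15 on a compute tile
constexpr uint32_t kBroadcast15MemTile = 157; // BROADCAST_15 on a memtile
constexpr uint32_t kUserEvent1 = 127;         // USER_EVENT_1 on tile (0, 0)

// Steps 1 and 2: applied to every traced compute tile or memtile.
std::vector<RegWrite> syncWritesForTile(int col, int row, bool isMemTile) {
  uint32_t base = isMemTile ? 0x94000 : 0x34000;
  uint32_t event = isMemTile ? kBroadcast15MemTile : kBroadcast15Core;
  return {
      {col, row, base, event << 8},         // reset timer on BROADCAST_15
      {col, row, base + 0xD0, event << 16}, // start trace on BROADCAST_15
  };
}

// Steps 3 to 5: applied once, to the bottom left tile (0, 0).
std::vector<RegWrite> syncTriggerWrites() {
  return {
      {0, 0, 0x34000, kUserEvent1 << 8}, // reset timer on USER_EVENT_1
      {0, 0, 0x3404C, kUserEvent1},      // USER_EVENT_1 triggers BROADCAST_15
      {0, 0, 0x34008, kUserEvent1},      // generate USER_EVENT_1
  };
}
```

In the pass, the per-tile writes are emitted first for every traced tile and the three trigger writes last, so that all timers are armed before `USER_EVENT_1` is generated.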

So far, the values of these operations (such as specifying which events or ports to monitor) and the addresses are all hard coded. In the future, they might also be exposed as user options and depend on the `TargetModel` as well.
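
For reference, the hard-coded values in the current implementation follow a simple packing scheme: four 8-bit event ids share one 32-bit register, and the trace packet register combines a packet type with the flow id. The following is a minimal C++ sketch of that packing. The helper names `packEvents` and `packTracePacket` are hypothetical, and the field positions are read off the pass code in this commit.

```cpp
#include <cstdint>

// Pack four 8-bit event ids into one 32-bit word, first argument in the
// most significant byte, as in (1 << 24) | (33 << 16) | (34 << 8) | 37.
constexpr uint32_t packEvents(uint32_t e3, uint32_t e2, uint32_t e1,
                              uint32_t e0) {
  return (e3 << 24) | (e2 << 16) | (e1 << 8) | e0;
}

// Combine the packet type and the packet flow id. The pass uses packet
// type 0 by default, 1 when the trace source channel is 1, and 3 for
// memtiles.
constexpr uint32_t packTracePacket(uint32_t pktType, uint32_t flowID) {
  return (pktType << 12) | flowID;
}

// The first event word written for compute tiles in this commit.
static_assert(packEvents(1, 33, 34, 37) == 0x01212225,
              "event ids occupy one byte each");
```

Exposing these as user options would mostly mean replacing the literal arguments with values taken from the pass options or the `TargetModel`.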
12 changes: 11 additions & 1 deletion mlir/include/air/Conversion/Passes.td
@@ -254,6 +254,9 @@ def AIRToAIE : Pass<"air-to-aie", "ModuleOp"> {
/*default=*/"false",
"Choose whether to schedule shim data movement via generating AIE "
" shim DMA program, or AIR runtime.">,
Option<"clInsertTracePacketFlow", "insert-trace-packet-flow", "bool",
/*default=*/"false",
"Create packet routed traces for cores and memtiles">,
];
let description = [{
This pass converts AIR dialect `herd` and `segment` operations into AIE
@@ -452,7 +455,14 @@ def AIRRtToNpu : Pass<"airrt-to-npu", "ModuleOp"> {
```

}];
let options = [];
let options = [
Option<"clTraceSize", "trace-size", "unsigned",
/*default=*/"0",
"Trace buffer size for cores and memtiles (in bytes)">,
Option<"clTraceOffset", "trace-offset", "unsigned",
/*default=*/"0",
"Trace buffer offset appended to ddr_id=2">
];
let dependentDialects = ["xilinx::AIEX::AIEXDialect"];
}

148 changes: 147 additions & 1 deletion mlir/lib/Conversion/AIRRtToNpuPass.cpp
@@ -36,6 +36,7 @@ using namespace xilinx;
using namespace xilinx::airrt;

namespace {
#define GEN_PASS_DECL_AIRRTTONPU
#define GEN_PASS_DEF_AIRRTTONPU
#include "air/Conversion/Passes.h.inc"

@@ -53,6 +54,7 @@ namespace {
// %1 = unrealized_conversion_cast %0
// %2 = memref.assume_alignment %1
//

struct RelocateAssumeAlignmentOp
: public mlir::OpRewritePattern<memref::AssumeAlignmentOp> {
using OpRewritePattern::OpRewritePattern;
@@ -452,7 +454,8 @@ void hoistTargetOpsToNewAffineFor(OpBuilder builder, affine::AffineForOp for_op,
}
}

template <typename T> void push_back_if_unique(SmallVector<T> &vec, T entry) {
template <typename T>
void push_back_if_unique(SmallVector<T> &vec, T entry) {
if (std::find(vec.begin(), vec.end(), entry) == vec.end()) {
vec.push_back(entry);
}
@@ -967,6 +970,10 @@ struct AIRRtToNpuPass : public impl::AIRRtToNpuBase<AIRRtToNpuPass> {

// Renumber npu dma ops
renumberNpuDmaOps(module.getBody());

// Configure the tile trace units and the shimDMA
if (clTraceSize > 0)
insertNpuWrite32ForTrace(module, clTraceSize, clTraceOffset);
}

void moveFuncOpToEndOfDeviceOp(ModuleOp module) {
@@ -1251,6 +1258,145 @@ struct AIRRtToNpuPass : public impl::AIRRtToNpuBase<AIRRtToNpuPass> {
}
}

void insertNpuWrite32ForTrace(ModuleOp module, int64_t trace_size,
int64_t trace_offset) {
SmallVector<mlir::func::FuncOp> funcOps;
module.walk([&](mlir::func::FuncOp f) { funcOps.push_back(f); });

for (auto f : funcOps) {
OpBuilder builder(f);
auto d = f->getParentOfType<AIE::DeviceOp>();
if (!d)
continue;

auto &target_model = d.getTargetModel();
std::map<int, int> chanToIdMap;
builder.setInsertionPointToStart(&f.front());
for (auto pktFlow : d.getOps<AIE::PacketFlowOp>()) {
Region &r = pktFlow.getPorts();
Block &b = r.front();
int flowID = pktFlow.IDInt();
AIE::Port sourcePort, destPort;
AIE::TileOp srcTile, destTile;

// find all packet flows with a trace port as source
for (Operation &Op : b.getOperations()) {
if (auto pktSrc = dyn_cast<AIE::PacketSourceOp>(Op)) {
srcTile = dyn_cast<AIE::TileOp>(pktSrc.getTile().getDefiningOp());
sourcePort = pktSrc.port();
} else if (auto pktDest = dyn_cast<AIE::PacketDestOp>(Op)) {
destTile = dyn_cast<AIE::TileOp>(pktDest.getTile().getDefiningOp());
destPort = pktDest.port();
}
}
if (sourcePort.bundle != AIE::WireBundle::Trace)
continue;

int srcColIndex = srcTile.colIndex();
int srcRowIndex = srcTile.rowIndex();
int dstColIndex = destTile.colIndex();
int dstRowIndex = destTile.rowIndex();
assert((target_model.isCoreTile(srcColIndex, srcRowIndex) ||
target_model.isMemTile(srcColIndex, srcRowIndex)) &&
"unsupported trace src");
assert(target_model.isShimNOCTile(dstColIndex, dstRowIndex) &&
"unsupported trace dest");
int pkt_type = 0;
if (target_model.isMemTile(srcColIndex, srcRowIndex))
pkt_type = 3;
else if (sourcePort.channel == 1)
pkt_type = 1;
int buff_size = trace_size / target_model.columns();
int buff_offset = trace_offset; // todo: get from func args?
buff_offset += dstColIndex * buff_size;

// configure tile trace
if (target_model.isCoreTile(srcColIndex, srcRowIndex)) {
// event broadcast to sync timer
builder.create<AIEX::NpuWrite32Op>(builder.getUnknownLoc(),
srcColIndex, srcRowIndex, 0x34000,
122 << 8);
builder.create<AIEX::NpuWrite32Op>(builder.getUnknownLoc(),
srcColIndex, srcRowIndex, 0x340D0,
122 << 16);
builder.create<AIEX::NpuWrite32Op>(builder.getUnknownLoc(),
srcColIndex, srcRowIndex, 0x340D4,
pkt_type << 12 | flowID);
// configure events to monitor
// todo: allow user to specify?
builder.create<AIEX::NpuWrite32Op>(
builder.getUnknownLoc(), srcColIndex, srcRowIndex, 0x340E0,
(1 << 24) | (33 << 16) | (34 << 8) | 37);
builder.create<AIEX::NpuWrite32Op>(
builder.getUnknownLoc(), srcColIndex, srcRowIndex, 0x340E4,
(44 << 24) | (45 << 16) | (75 << 8) | 79);
// configure ports to monitor
// todo: allow user to specify?
builder.create<AIEX::NpuWrite32Op>(builder.getUnknownLoc(),
srcColIndex, srcRowIndex, 0x3FF00,
(1 << 8) | ((1 << 5) | 1));
// builder.create<AIEX::NpuWrite32Op>(
// builder.getUnknownLoc(), srcColIndex, srcRowIndex, 0x3FF04, 0);
} else if (target_model.isMemTile(srcColIndex, srcRowIndex)) {
// event broadcast to sync timer
builder.create<AIEX::NpuWrite32Op>(builder.getUnknownLoc(),
srcColIndex, srcRowIndex, 0x94000,
157 << 8);
builder.create<AIEX::NpuWrite32Op>(builder.getUnknownLoc(),
srcColIndex, srcRowIndex, 0x940D0,
157 << 16);
builder.create<AIEX::NpuWrite32Op>(builder.getUnknownLoc(),
srcColIndex, srcRowIndex, 0x940D4,
pkt_type << 12 | flowID);
// configure events to monitor
// todo: allow user to specify?
builder.create<AIEX::NpuWrite32Op>(
builder.getUnknownLoc(), srcColIndex, srcRowIndex, 0x940E0,
(1 << 24) | (80 << 16) | (84 << 8) | 88);
builder.create<AIEX::NpuWrite32Op>(
builder.getUnknownLoc(), srcColIndex, srcRowIndex, 0x940E4,
(92 << 24) | (96 << 16) | (100 << 8) | 104);
// configure ports to monitor
// todo: allow user to specify?
builder.create<AIEX::NpuWrite32Op>(
builder.getUnknownLoc(), srcColIndex, srcRowIndex, 0xB0F00,
((1 << 21) | (2 << 16)) | ((1 << 13) | (1 << 8)) | (1 << 5));
builder.create<AIEX::NpuWrite32Op>(builder.getUnknownLoc(),
srcColIndex, srcRowIndex, 0xB0F04,
(3 << 16) | (2 << 8) | 1);
}

// configure shim tile
if (chanToIdMap.count(dstColIndex) == 0)
chanToIdMap[dstColIndex] = 15;
int bdID = chanToIdMap[dstColIndex];
int ddr_id = 2; // todo: let user specify
assert(bdID >= 4 && "run out of bd_id");
builder.create<AIEX::NpuWriteBdExShimTileOp>(
builder.getUnknownLoc(), dstColIndex, 1, ddr_id, bdID, buff_size,
buff_offset, 1, 0, flowID, pkt_type, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0);
int address;
if (destPort.channel == 0)
address = 0x1D204;
else if (destPort.channel == 1)
address = 0x1D20C;
else
assert(false && "unknown trace dest");
builder.create<AIEX::NpuWrite32Op>(builder.getUnknownLoc(), dstColIndex,
dstRowIndex, address, bdID--);
}

// broadcast event to sync timer
builder.create<AIEX::NpuWrite32Op>(builder.getUnknownLoc(), 0, 0, 0x34000,
127 << 8);
builder.create<AIEX::NpuWrite32Op>(builder.getUnknownLoc(), 0, 0, 0x3404C,
127);
builder.create<AIEX::NpuWrite32Op>(builder.getUnknownLoc(), 0, 0, 0x34008,
127);
}
}

// Renumber aiex.npu.dma_memcpy_nd ops per column of AIEs.
void renumberNpuDmaOps(Block *blk) {
std::map<int, int> chanToIdMap;
60 changes: 59 additions & 1 deletion mlir/lib/Conversion/AIRToAIEPass.cpp
@@ -55,6 +55,7 @@ struct AIRToAIEConversionOptions {
bool emit_while;
bool emit_herd_lock;
bool generate_shim_dma;
bool insert_trace_packet_flow;
AIE::AIEDevice device;
};

@@ -431,7 +432,8 @@ void outlineAIEMemtiles(OpBuilder &builder, AIE::DeviceOp aie_device,
}
}

template <typename T> void push_back_if_unique(std::vector<T> &vec, T entry) {
template <typename T>
void push_back_if_unique(std::vector<T> &vec, T entry) {
if (std::find(vec.begin(), vec.end(), entry) == vec.end())
vec.push_back(entry);
}
@@ -3013,6 +3015,55 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase<AIRToAIEPass> {
}
}

void createTracePacketFlow(AIE::DeviceOp device) {
OpBuilder builder(device);
const auto &target_model = device.getTargetModel();

// Collect existing TileOps
DenseMap<AIE::TileID, AIE::TileOp> tiles;
for (auto tile : device.getOps<AIE::TileOp>()) {
int colIndex = tile.colIndex();
int rowIndex = tile.rowIndex();
tiles[{colIndex, rowIndex}] = tile;
}

// Create packet flows
int flowID = 0; // todo: check any existing?
for (auto srcTile : device.getOps<AIE::TileOp>()) {
int srcColIndex = srcTile.colIndex();
int srcRowIndex = srcTile.rowIndex();
AIE::TileOp destTile;

if (target_model.isCoreTile(srcColIndex, srcRowIndex) ||
target_model.isMemTile(srcColIndex, srcRowIndex)) {
int destColIndex = srcColIndex; // todo: allocation?
int destRowIndex = 0;
if (!tiles[{destColIndex, destRowIndex}]) {
builder.setInsertionPointToStart(device.getBody());
destTile = builder.create<AIE::TileOp>(builder.getUnknownLoc(),
destColIndex, destRowIndex);
tiles[{destColIndex, destRowIndex}] = destTile;
} else {
destTile = tiles[{destColIndex, destRowIndex}];
}
int destChan = 1; // todo: allocation?

builder.setInsertionPointToEnd(device.getBody());
auto keep_pkt_header = builder.getBoolAttr(true);
AIE::PacketFlowOp pktFlow = builder.create<AIE::PacketFlowOp>(
builder.getUnknownLoc(), flowID++, keep_pkt_header);
Region &r_pktFlow = pktFlow.getPorts();
Block *b_pktFlow = builder.createBlock(&r_pktFlow);
builder.setInsertionPointToStart(b_pktFlow);
builder.create<AIE::PacketSourceOp>(builder.getUnknownLoc(), srcTile,
AIE::WireBundle::Trace, 0);
builder.create<AIE::PacketDestOp>(builder.getUnknownLoc(), destTile,
AIE::WireBundle::DMA, destChan);
builder.create<AIE::EndOp>(builder.getUnknownLoc());
}
}
}

void runTestPatterns() {

auto m = getOperation();
@@ -3038,6 +3089,7 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase<AIRToAIEPass> {
/*.emit_while = */ clEmitWhileLoop,
/*.emit_herd_lock = */ clEmitHerdLock,
/*.generate_shim_dma = */ clGenerateShimDMA,
/*.insert_trace_packet_flow = */ clInsertTracePacketFlow,
/*.device = */ *device};
createAIEModulesAndOutlineCores(m, aie_modules, tileToHerdMap, options);
std::set<ModuleOp> seen;
@@ -3064,6 +3116,8 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase<AIRToAIEPass> {
renumberChannelOps(&d.getBodyRegion().front(),
chan_renumber_reverse_map);
}
if (options.insert_trace_packet_flow)
createTracePacketFlow(d);
}
}

@@ -3135,6 +3189,7 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase<AIRToAIEPass> {
/* .emit_while = */ clEmitWhileLoop,
/* .emit_herd_lock = */ clEmitHerdLock,
/* .generate_shim_dma = */ clGenerateShimDMA,
/*.insert_trace_packet_flow = */ clInsertTracePacketFlow,
/* .device = */ *device};
createAIEModulesAndOutlineCores(module, aie_devices, tileToHerdMap,
options);
@@ -3200,6 +3255,8 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase<AIRToAIEPass> {
lowerAIRMemcpyOp<air::DmaMemcpyNdOp>(device, shimDmaAlloc, options);

// lowerPipelineGetPut(device, tileToHerdMap);
if (options.insert_trace_packet_flow)
createTracePacketFlow(device);

SmallVector<air::HerdOp, 4> herds;
SmallVector<air::SegmentOp, 4> segs;
@@ -3486,6 +3543,7 @@ FailureOr<ModuleOp> convertAIRToAIE(mlir::RewriterBase &rewriter,
/* .emit_while = */ false,
/* .emit_herd_lock = */ false,
/* .generate_shim_dma = */ false,
/* .insert_trace_packet_flow = */ false,
/* .device = */ *device};
std::vector<std::pair<ModuleOp, xilinx::air::HerdOp>> aie_modules;
p.walk([&](xilinx::air::HerdOp h) {