Remove func blocks unifier indirections #774

Open
wants to merge 7 commits into
base: master

Conversation

tilk
Member

@tilk tilk commented Dec 10, 2024

This PR simplifies the announcement mechanism using the dependency system. At the same time, two layers of Collectors for accepting results were flattened to a single collector. The func_blocks_unifier module became trivial, and it might make sense to remove it later.

Benchmark results dropped slightly for some reason, but device utilization also seems to be reduced.

@tilk added the refactor (Doesn't change functionality, but makes stuff nicer) and benchmark (Benchmarks should be run for this change) labels Dec 10, 2024

Benchmarks summary

Performance benchmarks

| aha-mont64 | crc32 | minver | nettle-sha256 | nsichneu | slre | statemate | ud |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ▼ 0.409 (-0.008) | ▼ 0.513 (-0.000) | ▼ 0.336 (-0.002) | ▼ 0.604 (-0.051) | ▼ 0.352 (-0.008) | ▼ 0.285 (-0.005) | ▼ 0.324 (-0.002) | ▼ 0.431 (-0.001) |

You can view all the metrics here.

Synthesis benchmarks (basic)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
| --- | --- | --- | --- | --- |
| ▼ 13517 (-726) | ▼ 4258 (-140) | 1456 (0) | 1164 (0) | ▼ 48 (-6) |

Synthesis benchmarks (full)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
| --- | --- | --- | --- | --- |
| ▼ 20438 (-3037) | ▼ 6873 (-140) | ▼ 1786 (-32) | 1216 (0) | ▼ 41 (-0) |

Benchmarks summary

Performance benchmarks

| aha-mont64 | crc32 | minver | nettle-sha256 | nsichneu | slre | statemate | ud |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ▼ 0.408 (-0.008) | ▼ 0.525 (-0.000) | ▼ 0.368 (-0.002) | ▼ 0.589 (-0.042) | ▼ 0.350 (-0.009) | ▼ 0.287 (-0.004) | ▼ 0.326 (-0.002) | ▼ 0.438 (-0.001) |

You can view all the metrics here.

Synthesis benchmarks (basic)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
| --- | --- | --- | --- | --- |
| ▲ 14379 (+55) | ▼ 4258 (-140) | 1456 (0) | 1164 (0) | ▼ 47 (-5) |

Synthesis benchmarks (full)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
| --- | --- | --- | --- | --- |
| ▲ 24033 (+1866) | ▼ 6873 (-140) | ▲ 1818 (+32) | 1216 (0) | ▼ 36 (-9) |

@@ -116,30 +113,24 @@ async def producer(sim: TestbenchContext):

async def consumer(self, sim: TestbenchContext):
# TODO: this test doesn't do anything, fix it!
Member

Is this comment up-to-date?

Member Author

I believe so. The condition in the `while` loop appears to be false at the start. Maybe a negation was intended there?

Member Author

After adding the negation, the test fails. Created issue #775 for this.
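
The pitfall discussed in this thread can be shown with a minimal, self-contained sketch. It is not the code of the test under review: `drain` and `is_empty` are hypothetical names, and a plain list stands in for whatever the real testbench consumes. It only shows how a loop whose condition is false on entry makes a test pass without checking anything.

```python
def drain(queue: list[int], keep_going) -> list[int]:
    """Pop items while keep_going(queue) holds and return what was consumed."""
    consumed = []
    while keep_going(queue):
        consumed.append(queue.pop(0))
    return consumed


def is_empty(queue: list[int]) -> bool:
    return len(queue) == 0


# Condition is false on entry: the body never runs, so nothing is checked.
vacuous = drain([1, 2, 3], is_empty)

# Negated condition: the loop actually consumes every item.
effective = drain([1, 2, 3], lambda q: not is_empty(q))

print(vacuous)    # []
print(effective)  # [1, 2, 3]
```

Asserting on the number of consumed items is one way to make such a test fail loudly when the loop body is skipped.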

self.insert.proxy(m, self.rs.insert)
self.select.proxy(m, self.rs.select)
self.update.proxy(m, self.rs.update)
self.get_result.proxy(m, collector.method)
Member

`self.get_result` should be removed (and from the docstring too).

Member Author

Done.

Benchmarks summary

Performance benchmarks

| aha-mont64 | crc32 | minver | nettle-sha256 | nsichneu | slre | statemate | ud |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ▼ 0.408 (-0.008) | ▼ 0.525 (-0.000) | ▼ 0.368 (-0.002) | ▼ 0.589 (-0.042) | ▼ 0.350 (-0.009) | ▼ 0.287 (-0.004) | ▼ 0.326 (-0.002) | ▼ 0.438 (-0.001) |

You can view all the metrics here.

Synthesis benchmarks (basic)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
| --- | --- | --- | --- | --- |
| ▼ 14015 (-309) | ▼ 4258 (-140) | ▼ 1424 (-32) | 1164 (0) | ▲ 53 (+1) |

Synthesis benchmarks (full)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
| --- | --- | --- | --- | --- |
| ▼ 21828 (-339) | ▼ 6873 (-140) | ▲ 1818 (+32) | 1216 (0) | ▼ 36 (-9) |

Benchmarks summary

Performance benchmarks

| aha-mont64 | crc32 | minver | nettle-sha256 | nsichneu | slre | statemate | ud |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ▼ 0.408 (-0.008) | ▼ 0.525 (-0.000) | ▼ 0.368 (-0.002) | ▼ 0.589 (-0.042) | ▼ 0.350 (-0.009) | ▼ 0.287 (-0.004) | ▼ 0.326 (-0.002) | ▼ 0.438 (-0.001) |

You can view all the metrics here.

Synthesis benchmarks (basic)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
| --- | --- | --- | --- | --- |
| ▼ 13297 (-2602) | ▼ 4258 (-140) | 1456 (0) | 1164 (0) | ▲ 53 (+0) |

Synthesis benchmarks (full)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
| --- | --- | --- | --- | --- |
| ▼ 24019 (-2116) | ▼ 6873 (-140) | ▲ 1818 (+32) | 1216 (0) | ▼ 38 (-3) |

@tilk force-pushed the remove_func_blocks_unifier_indirections branch from afc3583 to df115b7 December 16, 2024 19:05

Benchmarks summary

Performance benchmarks

| aha-mont64 | crc32 | minver | nettle-sha256 | nsichneu | slre | statemate | ud |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ▼ 0.408 (-0.008) | ▼ 0.525 (-0.000) | ▼ 0.368 (-0.002) | ▼ 0.589 (-0.042) | ▼ 0.350 (-0.009) | ▼ 0.287 (-0.004) | ▼ 0.326 (-0.002) | ▼ 0.438 (-0.001) |

You can view all the metrics here.

Synthesis benchmarks (basic)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
| --- | --- | --- | --- | --- |
| ▲ 17139 (+1240) | ▼ 4258 (-140) | ▼ 1424 (-32) | 1164 (0) | ▼ 47 (-7) |

Synthesis benchmarks (full)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
| --- | --- | --- | --- | --- |
| ▼ 21258 (-4877) | ▼ 6873 (-140) | 1786 (0) | 1216 (0) | ▼ 33 (-8) |

@lekcyjna123
Contributor

The Fmax drop is a little worrying, and it looks like the FuncBlockUnifier is on the critical path now. I checked the synthesis results (https://github.com/kuznia-rdzeni/coreblocks/actions/runs/12359494095/job/34492356220) and found the following:

  • In sync->sync, most of the 30 ns critical path is spent travelling between different parts of the unifier.
  • We should add registers on the Wishbone input from GPIO. It takes 13 ns to route data from GPIO, and this impacts LSU scheduling.
  • There is a path from the Wishbone GPIO through the LSU, FuncUnitResultKey_unifier, and RF to the CSR, all in one cycle.

@@ -18,19 +17,10 @@ def __init__(
):
self.rs_blocks = [(block.get_module(gen_params), block.get_optypes()) for block in blocks]

self.result_collector = Collector([block.get_result for block, _ in self.rs_blocks])
Contributor

If I see correctly, removing that Collector causes the get_result methods of all FUs to be joined with the announcement methods (and so with the RS and RF), which makes scheduling more complex and the critical path longer. The Collector hides a Forwarder, which cuts the critical path on data.

Contributor

@lekcyjna123 lekcyjna123 left a comment

LNGTM

The changes affect Fmax and make the Transactron network more complicated. Additionally, they remove buffers in the announcement path.

@@ -52,6 +52,16 @@ class FetchResumeKey(UnifierKey, unifier=Collector):
pass


@dataclass(frozen=True)
Contributor

Maybe we should start adding docstrings to our keys? In practice they are global variables, and we haven't documented them...
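
As a rough illustration of that suggestion, a documented key could look like the sketch below. To keep it self-contained, `UnifierKey` and `Collector` here are hypothetical stand-ins that only mimic the `class ...(UnifierKey, unifier=Collector)` shape visible in the diff, and `ExampleResultKey` is an invented name; none of this is the project's actual code.

```python
from dataclasses import dataclass


class Collector:
    """Stand-in for the real unifier class (assumption, illustration only)."""


class UnifierKey:
    """Stand-in base class that records the `unifier=` class keyword."""

    def __init_subclass__(cls, unifier=None, **kwargs):
        super().__init_subclass__(**kwargs)
        cls.unifier = unifier


@dataclass(frozen=True)
class ExampleResultKey(UnifierKey, unifier=Collector):
    """
    Hypothetical key under which functional units register their
    result methods.

    Providers: every functional unit that produces results.
    Consumer: the announcement logic, which reads the unified method.
    """


# The docstring is available to help() and to documentation generators.
print(ExampleResultKey.__doc__)
```

Because the dataclass is frozen and has no fields, all instances compare equal and hash identically, so a fresh instance can be used for lookup wherever the key is needed.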

@@ -87,12 +85,9 @@ def elaborate(self, platform):
m.submodules[f"func_unit_{n}"] = func_unit
m.submodules[f"wakeup_select_{n}"] = wakeup_select

m.submodules.collector = collector = Collector([func_unit.accept for func_unit, _ in self.func_units])
Contributor

This also complicates the Transactron network. It is probably connected with the observed IPC drop. Previously, when two results were ready in the same cycle, one was announced and the second was stored in the Forwarder for a cycle, which left the FU ready to process the next instruction. Now the FU has to stall until it can push out its result.
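
The effect described above can be sketched with a toy cycle-level model in plain Python. It is not Transactron code and does not reproduce the real Forwarder or Collector: the single announcement port, the lowest-index-first arbitration, the one-entry buffer per FU, and all names (`simulate`, `jobs`, `latency`) are assumptions made for illustration.

```python
def simulate(jobs: list[int], latency: int, buffered: bool) -> tuple[int, int]:
    """Return (cycles until all results are announced, FU stall cycles)."""
    n = len(jobs)
    remaining = list(jobs)  # instructions not yet started, per FU
    busy = [0] * n          # cycles left for the instruction in flight
    out = [False] * n       # finished result still held by the FU
    buf = [False] * n       # result parked in the one-entry buffer
    announced = stalls = cycle = 0
    total = sum(jobs)
    while announced < total:
        cycle += 1
        # Phase 1: each FU finishes, stalls, or starts an instruction.
        for i in range(n):
            if busy[i] > 0:
                busy[i] -= 1
                if busy[i] == 0:
                    out[i] = True
            elif out[i]:
                # Result not accepted yet: the FU cannot start new work.
                if remaining[i] > 0:
                    stalls += 1
            elif remaining[i] > 0:
                remaining[i] -= 1
                busy[i] = latency
        # Phase 2: the single port announces one result (lowest index first).
        for i in range(n):
            if buf[i] or out[i]:
                if buf[i]:
                    buf[i] = False
                else:
                    out[i] = False
                announced += 1
                break
        # Phase 3: with buffering, park unannounced results to free the FU.
        if buffered:
            for i in range(n):
                if out[i] and not buf[i]:
                    out[i], buf[i] = False, True
    return cycle, stalls


# Two FUs finish in the same cycle; the second one still has work queued.
print(simulate([1, 2], latency=3, buffered=False))  # (9, 1)
print(simulate([1, 2], latency=3, buffered=True))   # (8, 0)
```

In this model the unbuffered FU loses one cycle waiting for the port, while the buffered one starts its next instruction immediately. Whether this accounts for the measured IPC difference would have to be checked against the actual design.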

Labels
benchmark: Benchmarks should be run for this change
refactor: Doesn't change functionality, but makes stuff nicer

4 participants