Make RS feed FUs with garbage if flushing #740

Open
wants to merge 2 commits into
base: master
from


Arusekk
Contributor

@Arusekk Arusekk commented Oct 29, 2024

See #598. This change does not skip FUs, but it shows the concept.
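Going by the PR title, the idea appears to be that while the core is flushing, the RS does not need to wait for operands to become ready: it can hand entries to the FUs with whatever (garbage) operand values it holds, because the results will be discarded anyway. Below is a minimal Python sketch of such a selection rule. The `RSEntry` type and the `select_ready` helper are hypothetical names made up for illustration; they are not taken from the Coreblocks sources, and the actual change may be structured differently.

```python
from dataclasses import dataclass


@dataclass
class RSEntry:
    """Hypothetical reservation-station entry (not the real Coreblocks layout)."""
    rob_id: int
    rs1_ready: bool
    rs2_ready: bool


def select_ready(entries: list[RSEntry], flushing: bool) -> list[int]:
    """Return the ROB ids of entries that may be issued to an FU this cycle.

    Normally an entry needs both operands ready. While flushing, every
    entry is treated as ready, so the RS drains without waiting for
    operand values whose results would be thrown away anyway.
    """
    return [
        e.rob_id
        for e in entries
        if flushing or (e.rs1_ready and e.rs2_ready)
    ]


rs = [RSEntry(0, True, True), RSEntry(1, True, False), RSEntry(2, False, False)]
print(select_ready(rs, flushing=False))  # [0]
print(select_ready(rs, flushing=True))   # [0, 1, 2]
```

In this toy model, only the fully ready entry is eligible during normal operation, while all three entries become eligible once `flushing` is set.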

@Arusekk Arusekk added the optimization This is *just* an optimization! label Oct 29, 2024
@tilk tilk added the benchmark Benchmarks should be run for this change label Oct 29, 2024

Benchmarks summary

Performance benchmarks

| aha-mont64 | crc32 | minver | nettle-sha256 | nsichneu | slre | statemate | ud |
|---|---|---|---|---|---|---|---|
| 0.421 | 0.513 | 0.339 | 0.655 | 0.364 | 0.29 | 0.328 | 0.43 |

You can view all the metrics here.

Synthesis benchmarks (basic)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
|---|---|---|---|---|
| 15885 | 6043 | 834 | 1068 | 43 |

Synthesis benchmarks (full)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
|---|---|---|---|---|
| 28877 | 9298 | 1790 | 1248 | 40 |


Benchmarks summary

Performance benchmarks

| aha-mont64 | crc32 | minver | nettle-sha256 | nsichneu | slre | statemate | ud |
|---|---|---|---|---|---|---|---|
| ▲ 0.421 (+0.004) | 0.513 (0.000) | ▲ 0.339 (+0.002) | ▲ 0.655 (+0.000) | ▲ 0.364 (+0.003) | 0.290 (0.000) | ▲ 0.328 (+0.002) | ▼ 0.430 (-0.001) |

You can view all the metrics here.

Synthesis benchmarks (basic)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
|---|---|---|---|---|
| ▲ 14911 (+396) | 6043 (0) | 834 (0) | 1068 (0) | ▼ 41 (-14) |

Synthesis benchmarks (full)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
|---|---|---|---|---|
| ▼ 24301 (-546) | 9298 (0) | ▲ 1790 (+32) | 1248 (0) | ▼ 35 (-10) |

@Arusekk Arusekk force-pushed the rsflush-598 branch 2 times, most recently from e96d13d to c61a1ee, November 22, 2024 10:47

This comment was marked as outdated.


Benchmarks summary

Performance benchmarks

| aha-mont64 | crc32 | minver | nettle-sha256 | nsichneu | slre | statemate | ud |
|---|---|---|---|---|---|---|---|
| ▲ 0.421 (+0.004) | 0.513 (0.000) | ▲ 0.339 (+0.002) | ▲ 0.655 (+0.000) | ▲ 0.364 (+0.003) | 0.290 (0.000) | ▲ 0.328 (+0.002) | ▼ 0.430 (-0.001) |

You can view all the metrics here.

Synthesis benchmarks (basic)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
|---|---|---|---|---|
| ▼ 13982 (-282) | 6043 (0) | 834 (0) | 1068 (0) | ▼ 39 (-18) |

Synthesis benchmarks (full)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
|---|---|---|---|---|
| ▼ 23200 (-1676) | 9298 (0) | 1790 (0) | 1248 (0) | ▼ 33 (-9) |

@piotro888
Member

[additional comments to discussion from meeting]

I checked that a synchronous flushing signal would work in RSInsertion: because the FreeRF/RF valid bits are also updated in the sync domain, the effect would be visible in the next cycle (and the old RF entry would be inserted into the RS). The change in RSInsertion would also not cause any performance loss (and it is definitely safe in the RS too).
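The timing argument here rests on a general property of clocked logic: a value written in the sync domain during one cycle only becomes visible to readers in the following cycle. The sketch below models just that property in plain Python. `SyncReg` and `simulate` are hypothetical names for illustration, the single boolean stands in for any sync-domain state bit, and nothing here reflects the actual RSInsertion or RF code.

```python
class SyncReg:
    """Toy clocked register: a write becomes visible only after tick()."""

    def __init__(self, init: bool):
        self.value = init   # what readers see during the current cycle
        self._next = init   # value that will be latched at the clock edge

    def write(self, new: bool) -> None:
        self._next = new

    def tick(self) -> None:
        self.value = self._next


def simulate(flush_cycle: int, cycles: int) -> list[bool]:
    """Record the state bit as a reader observes it in each cycle."""
    state_bit = SyncReg(True)
    seen = []
    for cycle in range(cycles):
        seen.append(state_bit.value)   # the reader samples the current value
        if cycle == flush_cycle:
            state_bit.write(False)     # synchronous update requested now
        state_bit.tick()               # clock edge ends the cycle
    return seen


print(simulate(flush_cycle=1, cycles=4))  # [True, True, False, False]
```

The write issued in cycle 1 is still invisible to the reader in that same cycle and shows up from cycle 2 onward, which is the one-cycle delay the comment describes.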

The proposal to reset RF valid in Register Allocation would be problematic with checkpointing, which pushes a new instruction immediately.

The last place is the LSU:
LSU operations have a very high cost, and I don't see why we should de-optimize them if this part of the LSU is not on the critical path (unless it is).


Benchmarks summary

Performance benchmarks

| aha-mont64 | crc32 | minver | nettle-sha256 | nsichneu | slre | statemate | ud |
|---|---|---|---|---|---|---|---|
| ▲ 0.421 (+0.004) | 0.513 (0.000) | ▲ 0.339 (+0.002) | ▲ 0.655 (+0.000) | ▲ 0.364 (+0.003) | 0.290 (0.000) | ▲ 0.328 (+0.002) | ▼ 0.430 (-0.001) |

You can view all the metrics here.

Synthesis benchmarks (basic)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
|---|---|---|---|---|
| ▲ 14964 (+721) | 4398 (0) | 1456 (0) | 1164 (0) | ▼ 37 (-16) |

Synthesis benchmarks (full)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
|---|---|---|---|---|
| ▼ 20268 (-3207) | 7013 (0) | 1818 (0) | 1216 (0) | ▼ 33 (-9) |

3 participants