nextpnr sometimes produces invalid bitstream for LIFCL-17 #903

Closed
danc86 opened this issue Feb 2, 2022 · 6 comments

danc86 (Contributor) commented Feb 2, 2022

Sorry for the vague issue title. I think I have found a problem with nextpnr routing, but I'm not sure exactly where.

I'm working on a Litex-based design in CFU-Playground with a Vexriscv CPU and a large ML accelerator, targeting CrossLink-NX 17 (LIFCL-17). I arrived at a design that meets timing, but the bitstream doesn't work on my board. The CPU never produces any UART output and the Litex LED chaser pattern does not light up, which usually means that the FPGA configuration logic rejected the bitstream.

I'm testing with the current nextpnr master branch (commit c306ef1) and the current yosys master branch (commit bc027b2cae9a85b887684930705762fac720b529).

You can build the "bad" design from my nextpnr-bug-jan2022 branch, in proj/hps_accel.

I'm also attaching the generated Verilog sources, the build script, the intermediate files, and the resulting bitstream here:
nextpnr-bug-jan2022.zip

If I change anything in the design, whether making it smaller (by taking out a small Litex CSR) or larger (by putting back a small Litex CSR), it starts working.

If I use --router router1 on the same design, it works.

If I use a different seed with router2, it works.

If I pass the new --estimate-delay-mult 30 option, it works.

I think this means there is some rarely used configuration that doesn't work properly: this design got unlucky, and any change to the routing happens to avoid the problem. It feels similar to gatecat/prjoxide#10, which was fixed by #730.

gatecat (Member) commented Feb 2, 2022

Thanks, I'm looking into this now. It would be really useful to have a couple of known-good designs (with the bitstreams/FASM) to compare against, if that's easily doable.

gatecat (Member) commented Feb 2, 2022

> The CPU never produces any UART output and the Litex LED chaser pattern does not light up, which usually means that the FPGA configuration logic rejected the bitstream.

If you have an easy way of checking, can you see whether the outputs are high-impedance or actively driven but stuck? This would distinguish between a totally rejected bitstream and one that is accepted but whose clock isn't running, although either of those is easier to debug than subtly broken logic!

danc86 (Contributor, Author) commented Feb 2, 2022

Here's a tarball of the exact same design, built with seed 1038 instead of 38. This bitstream works fine:
nextpnr-bug-jan2022-good.zip

This board has a test point for the DONE pin, so I can check whether it stays low after loading the bad bitstream. That's probably the best way to know whether the configuration logic has rejected it. I can visit the lab tomorrow and try that.

gatecat (Member) commented Feb 3, 2022

Thanks for your help with this!

Would you be able to give #905 a try? (I need to look at the routethru situation more generally, but it's the most obviously wrong-looking thing I can see.)

danc86 (Contributor, Author) commented Feb 3, 2022

Thanks for the quick fix! I tried PR #905 on the bad design with the bad seed, and it does indeed fix it.

The only thing that worries me a little is that, before this change, almost any change to the routing would "fix" the issue. Disabling these elements would probably send the router down a totally different solution, in the same way that changing the seed does. So it might just be pure luck that it works now?

On the other hand, if the bad design was indeed using these unimplemented DCS elements, then it makes sense that they were the cause of the problem.

gatecat (Member) commented Feb 3, 2022

Yes, unfortunately it also seems plausible that the change just permutes the routing enough for it to work again. I'll keep the ticket open in case this recurs, but I'll merge that PR, as I don't think it makes anything else worse.
