Bugs in the provided examples #362
Hello @KC-Kevin, I'm sorry to hear you are having issues setting up TaPaSCo and SVM. Unfortunately, you ran into several different issues with our main branch. It is a bit outdated, so we plan to merge develop into the main branch by the end of June. Your primary issue is related to Vitis HLS, which changed the control register layout. In #345 we fixed this so that it runs with recent Vitis/Vivado versions as well. The second issue is that we currently do not officially support SVM on the U250. It is admittedly bad design that bitstream generation probably does not fail but still does not include the extension. We fixed that in the develop branch as well and now print a corresponding error message. Which Linux kernel version are you running? If you are running a kernel newer than 5.16, you must also use the develop branch until our release, since we needed a fix there due to changes in the Linux kernel as well. So my suggestion would be that you try again with our develop branch until our release on the master branch. I will leave this issue open so that I can support you if you have further issues or questions. Best regards
Hi, thank you for the support! I tried it out on the develop branch (with commit 1f3e6d1). I created the plugins folder in The OS kernel version is GNU/Linux 5.15.0-69-generic x86_64. I also checked that However, when I load the kernel module, the terminal hangs, and here is the output from dmesg:
One thing I noticed is that the following error appears in the terminal when the bitstream is programmed: The build process is attached at the end in case it is helpful. Is this issue related to the U250 board, or is it something else? If it is because of the U250 board, I may consider switching to U280 boards. Some other follow-up questions: my understanding is that the following order should work (with a minimal build process/less repetition):
Is step 3 a standalone/self-contained step? Our system setup is that we have a machine B dedicated to bitstream generation, which can program the bitstream onto the FPGA through JTAG. The FPGAs are inserted into machine A's PCIe slots, and machine A runs the actual program. The two machines share the same folder via a network file system. So, is it possible to run everything on machine A except the bitstream generation/programming part, which runs on machine B? (2) If I want to test SVM with multiple FPGAs, should I also copy the Thanks again! Here is the build command I use:
Let me first answer your follow-up questions: (1) TaPaSCo consists of two more or less independent parts: the toolflow for creating bitstreams and the runtime for writing and executing the corresponding software. You can even use different workspaces on distinct machines for this, so it is no problem to build bitstreams on machine B in your setup. On that machine you only need to do step 1 (run Steps 2, 4 and then need to be done on machine A in your setup. There you need to run
(2) No, svm.tcl is sufficient as it is completely self-contained. (3) See point 1. Now on to your issues: I have not encountered these two errors yet, and I am also not sure whether they are somehow connected. The complete stack trace would be interesting here to see where exactly this kernel bug is triggered, and whether it is inside our kernel module at all. What I see in your build commands is that you compile the runtime library with As a side note: After I would suggest you try once again with the optimal toolflow I summarize in the following. If this does not solve the issues, I would ask you to provide more dmesg output so I can try to debug it further. Here is the summarized toolflow. On machine B in your setup, build the bitstream:
On machine A build the toolflow, load the driver and run the software (distinct workspace possible):
Have a nice weekend!
Hi, thanks for the detailed reply! I followed the procedure you suggested. Now I find I do not need sudo and can use the binary name directly (e.g. However, I encountered the same issue, and I attached the full dmesg log at the end. Here is the output of the terminal when loading the driver/hot-resetting:
Here are the commands I use. On machine B:
On machine A:
Since you said the bitstream generation and the runtime are relatively independent, I run the commands on machines B and A concurrently, except that I first program the bitstream from Vivado on machine B and then run Here is the full dmesg output: Thanks again!
Hi, I just did another experiment, following the suggested order, but with an extra step of warm-rebooting machine A after programming the FPGA through machine B's Vivado JTAG. On machine B:
On machine A:
The output of hot-plug is:
There are two U250s in machine A. The FPGA devices in dmesg should be c4:00.0 and c1:00.0; the one that gets programmed in Vivado should be c4:00.0. Here is the full dmesg log after the program gets stuck at hot-plug:
Hi, I would like to provide more information on things I tried from my side to help identify the issue. Here is the screenshot of Vivado after we program the arraysum bitstream. It shows the memory controller is properly calibrated I also tried the Another question: can I use JTAG to read/write the CSRs (control/status registers) to check the status of the hardware? Thanks again!
Hi, thanks for the additional information. I suspect an issue in the Linux kernel itself. As you can see in the stack trace in the dmesg log, the error (or BUG, as printed in the error message) occurs in the Linux kernel. The function which is called in our driver is Maybe some additional questions to find out what is different in your setup compared to ours: Does your Ubuntu run in a virtual machine? Do you have other PCIe devices than the U250s plugged in? What is loaded on the other U250? What are the chances that you can install a newer Linux kernel? I do not know if this would solve the issue; however, we are currently running newer kernel versions on our machines.
Hi, thanks for the reply! Here are the answers to your questions: Another thing I tried is enabling the IOMMU. I get the same error message on the kernel bug. Beyond the same issue I already have, dmesg now gives more error information at the end. Attached is the full dmesg I have here. Thanks again!
Hi, you could also try to run a bitstream with SVM disabled and see if at least this works. (4) We are currently using RedHat Enterprise Linux 9 with kernel version 6.3.2.
Hi, I am considering setting up a new OS on our machine now. Will the latest Rocky Linux 9.2 with OS kernel version 5.14.0 work as an alternative to the OS you have now? The main concern here is that your kernel version is 6.3.2 on RedHat. Also, did you test the SVM feature on Ubuntu with kernel version 5.15.x before? Currently, if I use the main branch without SVM, I am able to get the driver loading to work and start the program, but there are still issues (detailed at the very beginning of this issue page). Another reason is that I would like to try out the SVM features (single-FPGA and multi-FPGA) in the system. Thanks again!
Hi, I would like to update with more information on the debugging process. We fixed the timing issue of the generated bitstream. For the bitstream without SVM, I am able to get the array sum example to work, and the example software runs through. Thanks again!
Hi, I would like to update with more information on the debugging process. I installed Rocky Linux 9 with OS kernel version 5.14.0-284.11.1.el9_2.x86_64. Now, driver loading with the SVM feature enabled works. I can program the bitstream (with correct timing), do a warm reboot, and load the driver with SVM without any issue. However, when I run the program, it just gets stuck. I attached the dmesg log here for your reference. Also, the 84:00.2 device is a NIC, and its error message may not matter so much. The bug happens with and without the IOMMU enabled; the attached dmesg log is with the IOMMU enabled. Another question is how to run the counter example? I am able to synthesize it successfully, but I do not see good instructions/host code to interact with it. I think getting it running may help the debugging process. Thanks again!
Hi, thank you very much for your additional debugging effort! The new kernel at least solved the issue which was unrelated to the actual TaPaSCo code. As background information: what I can see from your log is that there are CPU page faults on the same page again and again, so the migration from device memory to host memory seems not to succeed. I am currently trying to figure out what is different in this particular kernel version compared to other versions. There is a lot of development activity in this part of the kernel, with changes between versions. If I remember correctly, I started developing two years ago with version 5.13 and used various other versions until now, but of course I could not test every version. I could not figure out the exact issue yet. However, I remember I had a similar issue with even newer kernel versions and could fix it by introducing this version check: tapasco/runtime/kernel/pcie/pcie_svm.c Line 689 in 1f3e6d1
#if LINUX_VERSION_CODE < KERNEL_VERSION(5,14,0), but I cannot be entirely sure whether this solves your issue or will introduce other issues. Otherwise I will probably have to try setting up a system with the exact same kernel version in our lab.
On Rocky Linux you can also install newer kernels with kernel-ml (https://wiki.crowncloud.net/?How_to_Install_Kernel_6_x_on_RockyLinux_9). But of course this might not be possible if other users are using the same machine at your lab to reproduce your bugs. Regarding timing issues in your bitstream: on the U280 we create a pblock for the PCIe core and constrain it to the bottom SLR (see here
I hope we can fix your issue soon and get everything running on your system! Edit:
Hi, thank you so much for the continuous help. I have now upgraded the OS kernel to version 6.3.7-1.el9.elrepo.x86_64 on Rocky Linux 9. This kernel version successfully loads the driver and runs the single-FPGA SVM example with array sum. One issue with this kernel version: when I program the bitstream from machine B using JTAG, machine A (where the actual runtime environment is) auto-reboots upon programming. I speculate that this is the surprise link-down on PCIe, which causes the OS to reboot automatically. Are you aware of an option to not auto-reboot in the Does Rocky Linux or the OS kernel version matter here? Maybe there is a Linux parameter to set? Currently, I am trying out the multi-FPGA SVM feature and want to get some performance numbers/benchmarks for both single-FPGA SVM and multi-FPGA SVM. I have some questions as I try out the examples: (1) How do I run the bandwidth example properly in
Do you have any input on this bug? (2) I see (3) Is there any documentation on how to properly program an instance of multi-FPGA SVM? Naively, I guess programming two FPGAs with the same bitstream should work. A more complicated case is, for example: can I have different IP cores on the two FPGAs, and can both programs then talk through SVM (with PCIe endpoint-to-endpoint or Ethernet)? (4) The last question: for Ethernet SVM, in the compose command, do the MAC address and port refer to the source or the destination? If I have two FPGA boards, how should I compose the two programs to generate the two bitstreams? Is it the same bitstream or not? Thanks again!
Hi, I'm happy to hear that you finally got it running! I will consider updating the kernel requirement in the documentation. Regarding the auto-reboot, I will have to ask my colleagues. Do you have this issue with non-TaPaSCo bitstreams as well? And did it also occur with Ubuntu, or is it related to Rocky Linux? (1) The (2) The (3) Both are possible. You can have either the same PEs or different PEs on both FPGAs. They can talk through SVM in both cases. The only exception is if you want to use migrations over Ethernet (see 4). Otherwise, there is no issue with using the same bitstream on both FPGAs. (4) During compose you set the MAC address and the QSFP+ slot (port parameter) you want to use in this specific bitstream. So currently, the MAC address is hard-coded in the bitstream. This means you always need to generate distinct bitstreams for different FPGAs so that you do not have the same MAC address twice in the Ethernet network. I hope my remarks are helpful for you.
Hi, thank you so much! The auto-reboot issue only shows up after I switched from Ubuntu to Rocky Linux (with kernel version 6.3.7). When I was experimenting with TaPaSCo under Ubuntu, this issue did not exist. Only in the current Rocky Linux does this issue pop up, and it also exists for other non-TaPaSCo bitstreams under kernel 6.3.7. So I speculate that there are some OS/kernel parameters to set in Rocky Linux to disable the auto-reboot upon bitstream programming. Thanks for considering enhancing the bandwidth benchmark with SVM features; that will be helpful. For the SVM implementation, I am currently working on verifying that the bitstream and software run correctly. A higher-level question: if the bitstream has 1. PCIe endpoint-to-endpoint and 2. Ethernet (of course, bouncing through the host is the third approach), will the driver/TaPaSCo system automatically choose the suitable mechanism (1 or 2) during execution of the user program (e.g. array init), or is there a way for the user to specify the communication mechanism? The question is basically how the communication mechanism is determined given multiple devices. Do you have any support for substituting the user IP in a more flexible way? Based on my understanding, the user IP (e.g. array sum) is integrated with the system wrapper and generated as a whole. So, if the bitstream is generated with array sum and array update, and later I want to change the array update IP to the array init IP, I have to regenerate the whole bitstream. A more flexible way would be to substitute the user IP dynamically (i.e. change from array update to array init). Thanks again for all the replies/remarks/help!
Hi, a little bit more progress made. Based on the dmesg log below, do I run For example, do lines like the ones below indicate that multi-FPGA SVM successfully ran by bouncing through the host,
and do the lines below indicate that multi-FPGA SVM successfully ran with PCIe endpoint-to-endpoint?
I do not see a log line indicating that Ethernet runs successfully, probably because I set the QSFP port to 0 for both bitstreams, but the physical connection in the machine is zig-zag (FPGA A port 0 connects to FPGA B port 1 and vice versa). I will generate new bitstreams that reflect the real connection. Thanks!
Hi, I did a study of the dmesg log above with Similarly, for Ethernet, I added the Have a good weekend! Thanks again!
Hi, please see the answers to your different questions below. I hope I covered everything. Auto-reboot: Copy method: Dynamic exchange of IPs: SVM example:
with an aligned allocation (from cstdlib-header):
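The exact snippet from this comment is not preserved above; as a hedged sketch, the change is typically along these lines, assuming a page-aligned std::aligned_alloc from <cstdlib> and an illustrative element type and buffer size (not the actual example code):

```cpp
// Minimal sketch (not the original snippet): allocate the SVM buffer on a
// 4 KiB page boundary instead of using a plain new/malloc allocation.
#include <cstdlib>   // std::aligned_alloc, std::free

int main() {
  constexpr std::size_t page_size = 4096;        // HMM/SVM works on 4 KiB pages
  constexpr std::size_t num_elems = 1024;        // illustrative size
  // std::aligned_alloc requires the size to be a multiple of the alignment.
  constexpr std::size_t bytes =
      ((num_elems * sizeof(int) + page_size - 1) / page_size) * page_size;

  int *arr = static_cast<int *>(std::aligned_alloc(page_size, bytes));
  if (arr == nullptr) return 1;

  // ... fill `arr` and pass it to the PE launch as in the SVM example ...

  std::free(arr);
  return 0;
}
```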
You will then see Cheers!
Hi, thanks for the detailed response! With the aligned allocation, I am able to see the
I also see this in the output of
So there is something wrong with the PCIe BAR; even bouncing through the host is not working. I found that disabling the IOMMU may help resolve this issue, from this link. So, do you have the IOMMU enabled or not in your current system setup? Or do you have any debugging tips for running into this issue? For the auto-reboot issue, I tried unloading the driver, and it still auto-reboots when a new bitstream is programmed. So I guess there is no quick/easy fix for this. Thank you!
Hi, I think these are two distinct issues. The first error message is related to the Intel IOMMU, which seems to block DMA to host memory for some reason, if I understand the error message correctly. Similar to you, I found some bug reports where other devices (e.g. GPUs) have similar issues with the Intel IOMMU. We use AMD servers, so I have not encountered this myself yet. But on one server our AMD IOMMU is switched off, so maybe switching it off in your setup could resolve this. The second problem is that the second PCIe BAR, which is used for direct PCIe E2E memory access, is not assigned. Maybe you can check in your dmesg log whether you have any error message related to this? The BAR is quite large with 4 GB; however, this is required. Without memory assigned, data cannot be written through the PCIe bus from one device to another.
Hi, thanks for the response! The first issue is resolved by turning the Intel IOMMU off. Ethernet also works now. However, the second issue still exists, and it seems related to some OS parameter. I see the following message from
So, I tried with I do have some other questions as I play with the examples: (1) If I understand correctly, the HLS code of array update is in (2) How do you time the code properly? In
However, I found that changing the array size from 16 kB to 4 MB does not change the timing result much. The variation is around 20%. For example, changing from a 16 kB array I noticed that the HLS code also defines the array size via Changing either the array size in the HLS kernel or in the host code does not change the latency much. Does it mean the I sometimes observe up to 20% variation when I time exactly the same command; is that a common experience? (3) Since TaPaSCo has Ethernet between two FPGAs, do you support two FPGAs located in two different nodes, such that one FPGA in node A can talk to the other FPGA in node B through Ethernet? Based on the paper/GitHub README, I do not see support for cross-node FPGA SVM, but I just want to confirm this point. (4) Do you support huge pages, e.g. 2 MB? Since there is HMM integration, I guess the answer is no, but I would like to get a confirmation. Thanks again! Appendix:
Timing with a 16 kB array in the host code (with Ethernet as P2P), and SZ of the HLS array is 2050:
A similar timing result is observed with SZ=256 in the HLS code.
Hi, (1) these example kernels are not optimized in any way; they should only demonstrate how to use TaPaSCo in general. (2) The Enlarging the allocated host buffer alone does not affect latency at all, because it does not imply that the buffer is also completely migrated to device memory. This is set by the In order to see a change in latency, you have to explicitly enforce migration of a larger buffer and/or modify the However, the problem sizes of these example cores are very small, so the runtime of the HLS core does not really come into play, and it does not really matter whether you migrate one, two, or three pages, because the overhead of launching the PE, migrating the data, and handling the interrupt(s) is too large. You have to increase the problem size much further until you can neglect these effects (see the timing sketch below). (3) No, all FPGAs must be in the same node, as they all share the address space of the same host application and need to be managed by one driver. (4) No is the correct answer. As far as I know, HMM is still limited to 4 kB pages. I hope I could clarify your questions.
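As an illustration of that last point, here is a minimal, hypothetical timing sketch using std::chrono; run_pe() is only a placeholder for whatever host call launches the PE in your program and is not the actual TaPaSCo API:

```cpp
// Hypothetical sketch: measure wall-clock time per launch for growing problem
// sizes. For small buffers the constant overhead (PE launch, page migration,
// interrupt handling) dominates, so the measured time barely changes with n.
#include <chrono>
#include <cstdio>
#include <vector>

// Placeholder for the actual TaPaSCo launch/wait call used in the SVM examples.
static void run_pe(int *data, std::size_t n) { (void)data; (void)n; }

int main() {
  const std::size_t sizes[] = {4096, 1 << 20, 16 << 20};  // 16 KiB .. 64 MiB of ints
  for (std::size_t n : sizes) {
    std::vector<int> buf(n, 1);                  // allocate and touch on the host

    const auto t0 = std::chrono::steady_clock::now();
    run_pe(buf.data(), n);                       // would include migration overhead
    const auto t1 = std::chrono::steady_clock::now();

    const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("%zu elements: %.3f ms\n", n, ms);
  }
  return 0;
}
```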
Hi, thanks for the response! I will explore performance measurement following your suggestion. For on-demand latency measurement, does that mean that only bouncing through the host may not include the CPU page-fault time? On the other hand, for on-demand latency measurements that use PCIe endpoint-to-endpoint or Ethernet, the measurement should be accurate, because it does not go through the host. As I explore the examples/documentation, I have the following two questions: (1) All HLS examples I saw use static arrays. Is there any support for dynamic arrays? By dynamic array, I mean the example provided in the Vitis kernel bandwidth example. I tried to port this example into the TaPaSCo framework. The HLS kernel synthesis is fine, but when generating the bitstream, Vivado gives an error like
I am still debugging this issue, but it may be helpful if you have any insight on dynamic arrays in HLS. (2) In HLS Thanks! Have a good weekend!
Hi, another quick question I have is on the semantics of argument passing in kernel.json for HLS. I notice the passing field in the arguments of kernel.json is either value or reference. What is the meaning/semantics of this field? Also, what is its relationship/connection to the function argument passing in the function signature of the HLS code? In HLS, for example, an array is usually passed as a pointer; sometimes the output may be treated as a reference with a streaming interface. I am confused about the argument type used by the function signature itself versus the pass-by-reference/value used by the JSON file. Thanks!
Hi, (1) TaPaSCo only uses the Vitis/Vivado tools. Hence, you can use everything that is supported by Vitis; please consult the Vitis documentation for dynamic-sized arrays. (2)
AFAIK
Hi, thanks for the response! You mentioned that I can use everything that is supported by Vitis, so I tried specifying the input argument as a pointer in the HLS function definition. Using pointers is a common practice in HLS and should work. However, the test code adapted from the Vitis example that uses a pointer can generate a bitstream, and my host program runs, but it does not produce the desired output (the output value is untouched). To further eliminate other changing factors and verify whether a pointer in the function definition will work or not, I also changed the way the array is passed in the array update example from
My question is about the proper way to work with passing by pointer in the function definition, because passing by pointer lets me associate an internal buffer with a DDR array and enables me to work with dynamically sized arrays. Am I missing something in using HLS? Since you mentioned that the ports and configuration register layout may be a problem, I also checked the AXI interface generated by HLS. It is AXI_MM, which should work. The last question is on static arrays: do you have any limitation on their size? I tried to allocate a 32 MB array in the HLS code without changing anything else, but the program got stuck again. I am currently debugging this issue with a 1 MB static array in the HLS code, but your insight would be helpful. Thanks again!
Hi, the error message indicates that your PE tries to access a virtual memory address which is not backed by a physical memory page, so it probably has not been allocated on the host before. All memory you want to access must be properly allocated on the host first; otherwise the TaPaSCo driver cannot migrate the pages to device memory. Regarding your general issues with Vitis HLS, please consult the official AMD/Xilinx documentation.
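For reference, a pointer-based Vitis HLS kernel with a runtime element count typically looks roughly like the sketch below; the function name, argument names, and pragma choices are illustrative and not taken from the TaPaSCo example sources:

```cpp
// Illustrative Vitis HLS sketch (not the TaPaSCo arrayupdate code): arrays are
// passed as pointers and mapped to an AXI4 master port, while the element count
// and the start/return handshake live in the AXI4-Lite control interface.
#include <cstdint>

extern "C" void arraycopy(const int *in, int *out, uint32_t num_elems) {
#pragma HLS INTERFACE m_axi port = in offset = slave bundle = gmem
#pragma HLS INTERFACE m_axi port = out offset = slave bundle = gmem
#pragma HLS INTERFACE s_axilite port = in
#pragma HLS INTERFACE s_axilite port = out
#pragma HLS INTERFACE s_axilite port = num_elems
#pragma HLS INTERFACE s_axilite port = return

  for (uint32_t i = 0; i < num_elems; ++i) {
#pragma HLS PIPELINE II = 1
    out[i] = in[i];
  }
}
```

Whether the TaPaSCo import flow expects exactly this interface layout is a separate question; the sketch only shows the generic Vitis HLS idiom for pointer arguments.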
Hi, thanks! A simple array copy from an input array to an output array now works with pass-by-pointer in a single-FPGA host setup. As I am testing user-managed and on-demand page migration, I would like to know whether the system sets any limit on the size of the array to move between host and device, other than the U250 board constraints. For user-managed migration I am able to copy a 1 GB array from the host to a single device; for on-demand migration I am able to copy a 32 MB array from the host to a single device. When I further double the array size, the user-managed or on-demand migration fails. The program gets stuck and the dmesg log looks like:
Any insight/suggestion is appreciated! Thanks!
Hi, I tried to reproduce your issue with 64 MB buffers; however, it worked on my side. The given error message indicates that the Linux kernel cannot resolve the requested page. That is why I assumed you might not allocate the memory before the migration. It is a bit confusing that it seems to work with user-managed migrations; are you using the same host software?
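To make the "allocate before migrating" point concrete, here is a hypothetical host-side sketch: the buffer is written once on the host so every 4 KiB page is backed by physical memory before the PE triggers the migration. launch_pe() is a placeholder, not the actual TaPaSCo call, and the 64 MB size only mirrors the test case discussed above:

```cpp
// Hypothetical sketch: first-touch the whole 64 MB buffer on the host so the
// kernel can resolve every page when the driver migrates it to device memory.
#include <cstdlib>
#include <cstring>

// Placeholder for the actual TaPaSCo launch call used by the host software.
static void launch_pe(int *data, std::size_t n) { (void)data; (void)n; }

int main() {
  constexpr std::size_t bytes = 64ull << 20;              // 64 MB test buffer
  constexpr std::size_t n = bytes / sizeof(int);

  int *buf = static_cast<int *>(std::aligned_alloc(4096, bytes));
  if (buf == nullptr) return 1;
  std::memset(buf, 0, bytes);   // touches every page -> physical pages are allocated

  launch_pe(buf, n);            // migration should now find every requested page

  std::free(buf);
  return 0;
}
```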
Hi, thanks for the reply! I attached the host code and HLS code below for your reference. The host code is modified from the given SVM example to test the single-FPGA case and the multi-FPGA P2P case with multiple runs. The HLS code takes an input array and copies its values into the output array; the number of elements in the array is determined by the num_block argument. If possible, could you please run a test with your system setup to verify the correctness of the HLS and host code? It is ready to synthesize and run. I am not sure whether the OS kernel version (mine is 6.3.9-1.el9.elrepo.x86_64 and yours is 6.3.2) or some other minor difference in system setup could cause this strange issue. The current test is done on commit 0da497f of the develop branch. In the tests for user-managed/on-demand migration with a single FPGA or multiple FPGAs, I use the table below to record the array size at which I hit the error mentioned before.
I have two questions: Thanks again! Have a great weekend! Attached file: dmesg log: p2p_pcie_user_managed_4page.log
Hi, I also did a test with a U280 in a single-FPGA host setup. The OS/system setup is the same as before (only the FPGA board changed). The table below shows the array sizes at which we see the dmesg error (attached below) during execution. Each configuration is run 10 times, meaning that for each array size I repeat the execution 10 times to check stability; the error shows up after 3-7 repetitions. So, for example, in the single-FPGA host setup with the user-managed method, I can run the 32 MB array test for 5 runs, but then I run into the dmesg errors below.
The table below is similar to the result obtained in the previous reply using the U250.
Any insight/comment would be helpful! Thanks in advance!
Dear developers,
I have some bugs/issues when running the example designs/code with SVM support (array update, array sum, and array init) on the main branch. Currently, I am looking at SVM for a single FPGA on the main branch (but I do intend to try SVM for multiple FPGAs on the develop branch as well, so I am also wondering about the timeline for merging it into the main branch). Basically, the issue is that these three programs can run the first iteration, but are not able to run the second iteration of the example code and just get stuck somewhere.
For example, for the array sum, here is the output when I run with sudo:
The first iteration runs, but it gets stuck at the second iteration. Based on simple printf debugging of arraysum_example.c, it seems to be stuck here:
If I change the iteration-count macro of the arraysum program to 1, it works fine:
So, I am wondering what is wrong with the system or build process that makes the second run stall.
Here is the build process:
System setup:
Code base: latest main branch (d7768b3)
OS: Ubuntu 20.04.6
FPGA: U250
Vivado version: 2022.1
Thanks!