When running with F1.16xlarge on all FPGAs, PCIE access to one of them is stuck #656

Open
NoamDualBird opened this issue Nov 26, 2024 · 14 comments

Comments

@NoamDualBird

When we run our workload on 1 or 2 FPGAs we do not have any issues, but when we try to run on 4 or 8 FPGAs we usually get a shell PCI master timeout error in one of the FPGA slots during high-bandwidth DMA.

Our setup:

  1. F1.16xlarge (8 FPGAs running in parallel)
  2. Amazon Linux AMI
  3. Small shell version - 0x04182104
  4. Linux XDMA driver

From our internal debug this is what we see:
Our PCI AXI master (CL) issues AXI write transactions to the shell with a typical burst size of 4 KB.
At some point the shell reports a timeout error on the W channel (i.e. pcim-axi-protocol-wchannel-error).
Debugging shows that there is indeed a timeout violation between some WDATA transfers,
but the violation occurs because WREADY is de-asserted during this period (while WVALID is asserted).
Because of this WREADY backpressure, the CL cannot complete the transaction within the timeout period.
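
The distinction described above (a stall caused by WREADY backpressure rather than by the master withholding data) can be sketched as a small offline check over captured waveform samples. This is only an illustrative model, not the shell's actual timeout logic: the function name, the `StallReport` type, and the assumption that the trace covers a single in-flight burst are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StallReport:
    timed_out: bool
    cause: str          # "none", "slave_backpressure", or "master_underrun"
    longest_stall: int  # cycles without a completed data beat


def classify_w_stall(wvalid, wready, timeout_cycles):
    """Find the longest gap between W-channel data beats and who caused it.

    A beat transfers on a cycle where WVALID and WREADY are both high.
    Inside a gap, cycles with WVALID high and WREADY low are charged to the
    slave (backpressure); cycles with WVALID low are charged to the master.
    Assumes the samples cover one burst that is in flight the whole time.
    """
    stall = slave_cycles = master_cycles = 0
    worst = (0, 0, 0)  # (stall length, slave cycles, master cycles)
    for v, r in zip(wvalid, wready):
        if v and r:
            # Data beat accepted: the gap ends here.
            stall = slave_cycles = master_cycles = 0
            continue
        stall += 1
        if v:
            slave_cycles += 1
        else:
            master_cycles += 1
        if stall > worst[0]:
            worst = (stall, slave_cycles, master_cycles)
    longest, by_slave, by_master = worst
    if longest == 0:
        cause = "none"
    elif by_slave >= by_master:
        cause = "slave_backpressure"
    else:
        cause = "master_underrun"
    return StallReport(longest >= timeout_cycles, cause, longest)
```

Feeding it the WVALID/WREADY columns exported from a waveform capture around the failure would show whether the gap that trips the timeout is dominated by cycles in which the CL was holding WVALID high.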

Some time after the timeout occurs, all writes and reads from FPGA towards PCI are stuck, including interrupts.

@czfpga
Contributor

czfpga commented Nov 26, 2024

Hi @NoamDualBird,

Thank you for reaching out.

In order for us to better investigate this, can you please provide more info about your application?

  • Does your CL send traffic only to a memory space on the host, or does it send to both the host and neighbor cards (in the case of a multi-card instance)?
  • When multiple cards are used, how often does each card send traffic? Are they sending at roughly the same time with a similar frequency?
  • If you have a flow control mechanism enabled in the CL, can you run a test that lowers the frequency of requests generated on the AW/W channels, and see if that eliminates or mitigates this issue?

In addition, as you're using the small shell, it's unclear to me why the XDMA driver is used, since there is no DMA engine in the shell. (This is not directly related to the timeout issue, but I want to call it out.)

Thanks,

Chen

@NoamDualBird
Author

Hi Chen,
Thanks for the quick response. With regards to your questions:

  • Our custom logic sends traffic only to host memory space.
  • When multiple cards are used, they all work at the same time. We managed to reproduce the failure in a dedicated test where all CLs write many GBs of data to the host in parallel.
  • We tried to tune down the frequency by reducing the number of outstanding transactions, which gives us some control over the AW/W channels. With this change we do not see the timeout error. However, we want to better understand how we should work with the shell interface so we can guarantee it can't happen.

Regarding the XDMA driver: it is needed mainly to enable registers in the shell interrupt controller (otherwise MSI-X does not work). We do not use it for DMA. Since we had to use it to enable interrupts, we also use its devices (_user/_events) to map BAR0 and interrupts to userspace.

We were wondering whether there may be an issue in the shell when the CL AW/W channel works in burst sizes that are larger than the PCIe MTU. If the PCIe interface receives backpressure, it may be propagated to the AW/W channel multiple times during a single transaction and thus reach the timeout, although the protocol wasn't violated.

Thanks,

Noam

@czfpga
Contributor

czfpga commented Dec 3, 2024

Hi Noam,

Thank you for the details about your application and use of the XDMA driver.

A burst size exceeding the PCIe MTU isn't problematic, as packets will be automatically fragmented. However, multiple cards sending simultaneous bursts to the host may exceed the hardware interface's bandwidth capacity, causing back pressure to the CL and triggering AW/W channel timeouts.
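
As a rough illustration of the fragmentation mentioned above, the sketch below splits one AXI write burst into chunks no larger than the PCIe max payload size. It is a simplified model, not the shell's implementation: `split_burst` is a hypothetical helper, and the only PCIe rule it applies besides the payload limit is that a memory request must not cross a 4 KB address boundary.

```python
def split_burst(addr, length, max_payload):
    """Split a write of `length` bytes at `addr` into (addr, size) chunks.

    Each chunk is at most `max_payload` bytes and never crosses a 4 KB
    address boundary.
    """
    if length <= 0 or max_payload <= 0:
        raise ValueError("length and max_payload must be positive")
    chunks = []
    end = addr + length
    while addr < end:
        to_boundary = 4096 - (addr % 4096)  # bytes left before the next 4 KB boundary
        size = min(max_payload, to_boundary, end - addr)
        chunks.append((addr, size))
        addr += size
    return chunks
```

Under these assumptions an aligned 4 KB burst becomes 32 chunks with a 128-byte max payload and 8 chunks with a 512-byte max payload, so a single AXI burst can see backpressure at many points along the way.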

If reducing traffic request frequency isn't a preferred option, please consider implementing staggered transmissions from the cards to see if this helps minimize the peak BW spikes and potentially eliminate the timeout issues.
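
One way to reason about the staggering suggestion is to model each card as sending one burst per repeating period and count how many cards overlap. The sketch below is an assumption-laden toy model (fixed period, equal burst durations, integer microseconds); both function names are hypothetical.

```python
def stagger_offsets(num_cards, period_us):
    """Evenly spaced start offsets (integer microseconds) within one period."""
    return [i * period_us // num_cards for i in range(num_cards)]


def peak_concurrency(offsets, burst_us, period_us):
    """Largest number of cards transmitting at once in a repeating schedule.

    Card j is active during [offset_j, offset_j + burst_us) modulo the period.
    The overlap count can only rise at a burst start, so checking every
    start time is sufficient. Assumes burst_us <= period_us.
    """
    peak = 0
    for t in offsets:
        active = sum(1 for s in offsets if (t - s) % period_us < burst_us)
        peak = max(peak, active)
    return peak
```

For example, with 8 cards, an 800 us period, and 100 us bursts, evenly staggered offsets keep the peak at one active card, while identical offsets put all 8 cards on the link at once.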

Hope this helps.

Chen

@NoamDualBird
Author

Hi Chen,

Thank you for your response. We will implement all the mechanisms needed to avoid this timeout; however, I still think there's an issue here.

  1. First of all, AXI timeouts cause the system to crash, since the shell's response is to shut down all AXI interfaces. If AXI timeouts are expected in a fully functional system, I would expect the shell to recover from the timeout gracefully and not crash. In our understanding, an AXI timeout indicates a broken system, such as inaccessible DRAM or a PCIe link that is down. In those cases I would expect other indications of a critical event for the instance in the AWS monitor. Backpressure on the PCIe channel is common and can happen for any number of reasons that may or may not involve high bandwidth from the FPGA. It does not indicate a broken system.
  2. When we used 4KB bursts on the AW/W channel, we assumed that if the shell doesn't have the capacity to absorb the entire burst, it will apply backpressure by de-asserting the aw_ready signal, thus avoiding the timeout altogether. Can we somehow alter the shell logic to do this? If not, the only "bullet proof" way to avoid timeouts is to adjust the AXI MTU to be no larger than the PCIe MTU, which requires many adjustments in our CL in order to keep the required BW. Reducing the FPGA BW only reduces the probability of the issue; it is not a valid solution.
  3. In any case, if a timeout is detected by the shell, why is slverr not asserted?

Thanks,
Noam

@czfpga
Contributor

czfpga commented Dec 4, 2024

Hi Noam,

Timeouts protect the shell and facilitate debugging if, for example, a misbehaving AXI master continuously drives the AW/W channels without properly terminating the transactions.

Your application is experiencing timeouts for a totally different reason. However, I doubt the issue is related to the PCIe MTU: if that were the case, your tests on 1 or 2 FPGAs would have shown similar problems. It's more probable that your application is hitting a system bottleneck that occurs when the peak bandwidth from all FPGAs exceeds the system's capacity.

The crash needs more investigation. A timeout shouldn't cause a system crash, because the shell should return OKAY even in a timeout event.

Thanks,

Chen

@NoamDualBird
Author

Hi Chen,

Please advise how you propose to further investigate the system crash.

Thanks
Noam

@czfpga
Contributor

czfpga commented Dec 18, 2024

Hi Noam,

Please first check to see if there is any critical error reported by kernel or driver.

Thanks,
Chen

@NoamDualBird
Author

Hi Chen

There are no errors in the kernel or drivers at all. The only indication is in the shell status.

Thanks
Noam

@NoamDualBird
Author

Hi Chen,
I haven't heard back from you on this subject since my previous messages, and I want to let you know that the issue persists on F2. I also opened a new AWS case about it, since this time the issue has no workaround and is a showstopper for our system.

Same setup as the previous case, but with an f2.48xlarge instance (small shell, XDMA driver).

Error:
pcim-axi-protocol-error-addr=0x10fb5d48b00
pcim-axi-protocol-error-count=1

In addition to the above data we also noticed some other errors:
The FPGA shell indicates PCIE max payload is 512B, while the kernel indicates that it is 128B:
b4:00.0 Memory controller: Amazon.com, Inc. Device f001
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 128 bytes, MaxReadReq 2048 bytes
Which of them is correct? In any case, we tried tuning our CL to support a max payload of 128B, 256B, and 512B, and none of these options worked.

Our internal debug shows the same issue as the one in F1 - the timeout is caused by the shell logic in response to external backpressure even though our custom logic is AXI spec compliant.

If this problem is similar to the previous one on F1, then to speed up debugging I'll state that I believe there is a bug in the shell hardware logic. When backpressure is applied from the PCIe interface, the shell logic should buffer a whole AXI write transaction so it can close the transaction gracefully. The timeout counter should indicate an underrun on the master side, which is not the case here.

Thanks
Noam

@NoamDualBird
Author

Please note that the kernel changed the PCI MAX payload on all FPGA devices:
[ 2.099447] pci 0000:9f:00.0: Max Payload Size set to 128 (was 512, max 1024)
[ 2.100270] pci 0000:9f:00.1: Max Payload Size set to 128 (was 512, max 1024)
[ 2.103184] pci 0000:a1:00.0: Max Payload Size set to 128 (was 512, max 1024)
[ 2.103997] pci 0000:a1:00.1: Max Payload Size set to 128 (was 512, max 1024)
[ 2.106930] pci 0000:a3:00.0: Max Payload Size set to 128 (was 512, max 1024)
[ 2.107746] pci 0000:a3:00.1: Max Payload Size set to 128 (was 512, max 1024)
[ 2.110710] pci 0000:a5:00.0: Max Payload Size set to 128 (was 512, max 1024)
[ 2.111527] pci 0000:a5:00.1: Max Payload Size set to 128 (was 512, max 1024)
[ 2.123475] pci 0000:ae:00.0: Max Payload Size set to 128 (was 512, max 1024)
[ 2.124277] pci 0000:ae:00.1: Max Payload Size set to 128 (was 512, max 1024)
[ 2.127231] pci 0000:b0:00.0: Max Payload Size set to 128 (was 512, max 1024)
[ 2.128025] pci 0000:b0:00.1: Max Payload Size set to 128 (was 512, max 1024)
[ 2.131001] pci 0000:b2:00.0: Max Payload Size set to 128 (was 512, max 1024)
[ 2.131798] pci 0000:b2:00.1: Max Payload Size set to 128 (was 512, max 1024)
[ 2.134771] pci 0000:b4:00.0: Max Payload Size set to 128 (was 512, max 1024)
[ 2.135562] pci 0000:b4:00.1: Max Payload Size set to 128 (was 512, max 1024)

It looks like, since the PCIe switch connected to all the FPGAs exposes a maximum capability of 128B MTU, the kernel changed the MTU of the devices to match the whole hierarchy (but the shell is not aware of it)?
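
To check every device at once, the kernel messages quoted above can be parsed mechanically. The sketch below assumes the exact message format shown in this log; `mps_changes` is a hypothetical helper name, and the text would come from wherever the kernel log is saved.

```python
import re

# Matches lines such as:
#   pci 0000:9f:00.0: Max Payload Size set to 128 (was 512, max 1024)
MPS_LINE = re.compile(
    r"pci (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"Max Payload Size set to (?P<new>\d+) \(was (?P<old>\d+), max (?P<cap>\d+)\)"
)


def mps_changes(dmesg_text):
    """Map each PCI address to (old MPS, new MPS, device maximum) in bytes."""
    changes = {}
    for m in MPS_LINE.finditer(dmesg_text):
        changes[m["bdf"]] = (int(m["old"]), int(m["new"]), int(m["cap"]))
    return changes
```

Any entry whose new value is lower than its old value marks a function where the kernel reduced the payload size below what the device was initially configured with.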

In addition, we also tried working with 64B MTU and the system still failed with the same timeout:
pcim-axi-protocol-error=1
pcim-axi-protocol-wchannel-error=1

Thanks
Noam

@AWSNB
Contributor

AWSNB commented Jan 6, 2025 via email

@NoamDualBird
Author

That's interesting. Do you have any solution to the broken system?

@czfpga
Contributor

czfpga commented Jan 7, 2025

Hi Noam,

The shell should not buffer the transaction because, from the shell's perspective, there is no guarantee that the data source in the CL will terminate the traffic properly. If the shell interfaces with malfunctioning logic that keeps driving a data interface, the internal buffer would overflow and the shell would have to terminate the traffic anyway.

What traffic BW causes this timeout issue on the PCIM interface? Is it still the case that the problem disappears after you lower the aggregate BW, just like on F1?

If you reload the AFI, does that bring the slot back to its operational state?

Thanks,

Chen

@NoamDualBird
Author

Hi Chen,

Thank you for your reply. Regarding the buffering I suggested - I meant buffering of a single legal transaction, which doesn't exceed AXI MTU. In case of an illegal AXI transaction of course the shell must abort and proceed to error flow. Sorry if I wasn't clear on the subject.

After the previous issue with F1 we inserted rate limiters into our design to try to mitigate this issue. We are currently set to a limit of ~6GB/s. We observe an average BW of ~1GB/s with short peaks of ~6GB/s. On F1 this, in conjunction with setting the internal AXI MTU equal to the PCIe MTU, solved the issue. We don't see the same effect here. We will try to limit our system further to see if it helps.
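
The thread does not describe how these rate limiters are built. As a generic illustration only, a token-bucket limiter (a common technique, swapped in here as an assumption) can be modeled as below; the class name, the byte-based accounting, and the explicit `now` timestamps are all hypothetical.

```python
class TokenBucket:
    """Software model of a token-bucket rate limiter (illustrative only)."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s   # long-term rate limit
        self.capacity = burst_bytes    # largest burst allowed at once
        self.tokens = burst_bytes      # start with a full bucket
        self.last = 0.0                # time of the previous call, in seconds

    def try_send(self, now, nbytes):
        """Return True and consume tokens if nbytes may be sent at time `now`."""
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, never beyond the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False
```

With a 6e9 bytes/s rate and a 4096-byte bucket, one 4 KB burst is admitted immediately, a second one at the same instant is refused, and the bucket is full again roughly 0.7 us later. The bucket depth bounds the short-term peak that each card can present to the host, while the rate bounds the average.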

Regarding AFI reload - yes, it's operational again, however, it will receive timeout again in the next run.

I believe we have to address two problems here. One is technical: the timeout counter and whether it's counting correctly. The second is more concerning: why does the FPGA receive such backpressure from the PCIe interface? From my understanding, a healthy system shouldn't get 100us of backpressure from the PCIe interface on every run. This implies that even if the FPGA does not enter the timeout state, we're facing unexpected performance degradation.
Do you have any insight into this behavior?

Thanks
Noam
