Add example for loading weights from external RAM #903
I did some digging and it doesn't look like this scenario is currently supported (I might be wrong), although it is described in the current documentation.
It looks like load ops are automatically replaced with loadflash ops by the pass in ai_tools/xformer/Transforms/WriteWeights.cpp (lines 141 to 142 at commit 3c0a431).
We tried running a model with a much larger number of parameters and we are definitely seeing a significant I/O bottleneck.
Any help with this would be appreciated.
Hi @andresovela, I need to check this in more detail. Are the weights small enough to be kept in SRAM itself (not using flash or DDR)? That would be the fastest option.
Hi @panickal-xmos, the models we're testing right now have around 1.4 MB of weight data, so SRAM is not an option for us at the moment. We're trying out other architectures that require fewer parameters but bigger tensors in order to get around the I/O bound. However, it would be nice to be able to just move the weights to external RAM.
I found this option in xcore-opt: ai_tools/xformer/XCoreOptMain.cpp, lines 75 to 77 at commit 175ca98.
It's not currently documented, but it looks like we could potentially use it to load the weights from external RAM?
Your point is very valid. It would ideally be useful to use DDR to quickly fit a larger model. We had deprioritized DDR support as there is currently a hardware limitation which prevents us from using multiple threads when directly accessing DDR; you can see this noted in the generated model cpp file.
Because of this limitation, the model runs much slower, and the benefit of using DDR instead of flash goes down. The undocumented option you found is related to this DDR path.
I did see this limitation written down somewhere. Unfortunately, I have no choice but to use LPDDR. A bit off-topic, but now that you mention multiple threads: the documentation states that a maximum of 5 threads can be used. Why is this the case? Doesn't xcore.ai have 8 cores per tile? I have seen this 5-thread limitation mentioned elsewhere as well. Thank you for looking into the issue for me btw, I appreciate it :)
Btw, I had not yet tested that option. Considering the timings I measured, loading the weights dominates the execution time for this model.
Regarding the thread question: xcore.ai can support up to eight threads, but five threads are capable of using all of the compute available on xcore.ai. Eight threads using all compute will only be as fast as five threads using all compute; each of the eight threads simply runs slightly slower in that case.
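To put numbers on that, here is a small illustration. The architectural point is that a single xcore thread can issue at most one instruction every five core cycles, so five threads saturate the pipeline; the 600 MHz tile clock below is my own assumed figure, not something from this thread.

```c
#include <stdio.h>

// Sketch of the thread/compute trade-off on xcore: one thread can issue at
// most one instruction every 5 core cycles, so 5 threads saturate the core.
// Beyond 5 threads, the issue slots are shared equally among the threads.
int main(void) {
    const double core_mhz = 600.0; // assumed tile clock, for illustration
    for (int threads = 1; threads <= 8; ++threads) {
        double per_thread = core_mhz / (threads < 5 ? 5.0 : (double)threads);
        printf("%d thread(s): %6.1f MIPS each, %6.1f MIPS aggregate\n",
               threads, per_thread, per_thread * threads);
    }
    return 0;
}
```

Aggregate throughput plateaus once five threads are running, which is exactly the behaviour described above: eight threads are no faster in total, each one just runs slower.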
It's good that you checked the timings. The model seems very memory-bound. We will investigate DDR support and report back on whether we can do something about it.
@andresovela, how does the performance look with the weights in DDR, based on this example, #914?
I'll try it out in a bit and I'll report back :)
@panickal-xmos I made the modifications according to the DDR example and I see a 60% reduction in the execution time of the weight-loading op.
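For reference, a minimal sketch of how such a timing can be taken on xcore.ai with lib_xcore's hardware timer; `model_invoke()` is a placeholder for the generated inference entry point, not the actual generated API, and the 100 MHz reference clock is the standard xcore reference frequency.

```c
#include <stdint.h>
#include <stdio.h>
#include <xcore/hwtimer.h>

// Placeholder for the generated inference call (name is hypothetical).
extern void model_invoke(void);

void time_one_inference(void) {
    hwtimer_t t = hwtimer_alloc();         // claim a hardware timer
    uint32_t start = hwtimer_get_time(t);  // ticks of the 100 MHz ref clock
    model_invoke();
    uint32_t end = hwtimer_get_time(t);
    hwtimer_free(t);
    // 100 reference ticks per microsecond.
    printf("inference took %lu us\n", (unsigned long)((end - start) / 100u));
}
```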
I somehow expected that the performance gain would be much larger than 2.5x, considering that the bandwidth difference is so large (DDR 6,400 Mbit/s vs flash 200 Mbit/s according to Henk, a 32x difference).
Yes, it's slower due to using the tile RAM server interface. I'm looking into another, simpler option to directly copy weights from DDR.
Let me know if there's anything I can do to help :)
Hi @panickal-xmos, I wanted to do some tests with both the weights and the tensor arena in DDR, but since the current example uses tile[0] for reading weights and tile[1] for running the model, I get an error.
So while the DDR example in #914 is very informative, we won't be able to use it with this limitation.
Hi @andresovela, I have merged the updated example in #914. Along with compiler and runtime changes, DDR should be a lot faster. I have changed the DDR frequency to 166 MHz for the example. Copying from DDR is slower for smaller weights, hence we use a size-threshold option to keep those in internal RAM.
Hi @panickal-xmos, thanks for the example! I'll try it out in a bit. Can you explain what that size-threshold option does?
Are the weights somehow duplicated in internal RAM for performance? I tried to find out from the source code, but I couldn't figure it out.
Only if the constant tensor (weight) is larger than the amount set by that option: tensors above the threshold live in DDR and are copied into internal RAM when needed, while smaller tensors are simply kept in internal RAM, since for them the copy from DDR costs more than it saves.
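In other words, a sketch of the placement rule as described (not actual compiler code; the real option name is not shown in this thread):

```c
#include <stdbool.h>
#include <stddef.h>

// Illustrative paraphrase of the rule above: tensors larger than the
// configured threshold live in DDR and are copied into an internal scratch
// buffer on demand; tensors at or below it are placed in internal RAM,
// where no runtime copy is needed.
static bool lives_in_ddr(size_t tensor_bytes, size_t threshold_bytes) {
    return tensor_bytes > threshold_bytes;
}
```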
I modified my test application as shown in the DDR example and I see a further 53% reduction in the time spent loading weights, without modifying the LPDDR clock frequency.
With the system frequency set to 800 MHz and the LPDDR clock set to 166 MHz, I get an extra 40% reduction.
After that, I was able to keep some weights in internal RAM using the size-threshold option.
In total, with all the optimizations, I was able to go from 26.60 ms using flash to 2.59 ms using LPDDR. That's more than a 90% reduction in time spent loading weights! Thanks for all the help so far, I appreciate it a lot! With this, I'm closing this issue :)
Note that the model I'm testing is an int8 one we have for reference, not the int16x8 one we sent to you via email. Unfortunately, we can't test that one yet because its tensor arena is too large.
The existing models that showcase loading weights from flash are very useful, but I'd like to see an example of loading the weights from flash and then transferring them to LPDDR1 memory for faster I/O.
I took a look at the LPDDR1 docs, but I'm not sure how exactly you would tie together the current `xf.generate_flash()` + `xflash` approach with the `__attribute__ ((section(".ExtMem.data")))` approach from the LPDDR1 docs. Presumably you could embed the generated parameters as an array and annotate it with `__attribute__ ((section(".ExtMem.data")))`? Would that work?
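Something like this minimal sketch is what I mean (the array name and contents are placeholders; `.ExtMem.data` is the section from the LPDDR1 docs):

```c
#include <stdint.h>

// Placeholder for the parameter blob produced by xf.generate_flash().
// Placing the array in the .ExtMem.data section should make the tools
// locate it in external (LPDDR) memory rather than internal SRAM.
__attribute__ ((section(".ExtMem.data")))
const uint8_t model_weights[] = {
    0x00, 0x01, 0x02, 0x03, /* ... bytes of the generated blob ... */
};

// The runtime would then be pointed at model_weights instead of reading
// the parameters from flash (the exact hook-up depends on the generated API).
```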
More than 10% of the execution time is spent running `OP_XC_ld_flash` when running the example model with the `--xcore-conv-err-threshold=0.5` option. This is significantly worse on some of the models I'm trying to run.
Therefore, I'd like to see how performance is affected by moving the weights to RAM.