ld-decode: Possible performance optimizations #802
Comments
It may be worth checking whether pyFFTW is still faster than …
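A minimal way to re-check this, assuming the comparison is against scipy.fft (the transform length here is a placeholder, not ld-decode's actual block size):

```python
# Minimal benchmark sketch: pyFFTW (prebuilt plan) vs one-shot scipy.fft.
# The 2**16 length is a placeholder, not ld-decode's real block size.
import timeit
import numpy as np
import scipy.fft
import pyfftw

n = 2**16
data = np.random.randn(n).astype(np.float32)

# pyFFTW's main win is the prebuilt, reusable plan
plan = pyfftw.builders.rfft(data.copy())

print('pyfftw   :', timeit.timeit(plan, number=1000))
print('scipy.fft:', timeit.timeit(lambda: scipy.fft.rfft(data), number=1000))
```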
Re 32-bit vs. 64-bit float, the same is true in some of the tools since they only need ~16 bits of precision. ld-chroma-decoder uses …
Another thing that seems to take up a bit of time is concatenating the input/output arrays in demodcache. When reading from a file it reads into a list of blocks (arrays) and then concatenates them afterwards. Maybe it would be possible to read into the larger array in one go, and/or read a bit more at a time, to avoid spending as much time on that. The output side might be more complicated.
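A sketch of the idea; the function names and block sizes here are illustrative, not demodcache's actual API:

```python
# Read straight into one preallocated buffer instead of building a list
# of per-block arrays and concatenating them afterwards.
import numpy as np

def read_blocks_concat(f, nblocks, blocklen, dtype=np.int16):
    # current pattern: list of per-block arrays, joined at the end
    itemsize = np.dtype(dtype).itemsize
    blocks = [np.frombuffer(f.read(blocklen * itemsize), dtype=dtype)
              for _ in range(nblocks)]
    return np.concatenate(blocks)  # extra copy of the whole buffer

def read_blocks_prealloc(f, nblocks, blocklen, dtype=np.int16):
    # alternative: one allocation, one read call, no final copy
    out = np.empty(nblocks * blocklen, dtype=dtype)
    f.readinto(out.view(np.uint8))  # fills the array in place
    return out
```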
Just noticed I got a mention here. Another thing I had started to work on but didn't have the time to see through: I think it's not impossible, maybe even easy(?), to migrate the code to work with CuPy (i.e., run on GPU). There are some fundamental structures in use in ld-decode that need some tweaking in order to port the code over, but for the most part it is a drop-in replacement for scipy. This would also have the benefit of staying backward compatible.
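A hedged sketch of the drop-in idea: pick the array module once at import time and write the hot path against it only. `demod_block` and `analytic_signal` are made-up stand-ins, not existing ld-decode code:

```python
# Select numpy or cupy once; everything below uses only xp.* calls
# that exist in both libraries, so the same code runs on CPU and GPU.
try:
    import cupy as xp            # GPU path, if cupy is installed
except ImportError:
    import numpy as xp           # plain numpy fallback

def analytic_signal(x):
    # FFT-based analytic signal (the same construction scipy.signal.hilbert
    # uses), written backend-agnostically
    n = x.shape[-1]
    X = xp.fft.fft(x)
    h = xp.zeros(n)
    h[0] = 1.0
    h[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        h[n // 2] = 1.0
    return xp.fft.ifft(X * h)

def demod_block(block):
    data = xp.asarray(block, dtype=xp.float32)
    phase = xp.unwrap(xp.angle(analytic_signal(data)))
    return xp.diff(phase)        # instantaneous frequency, up to scaling
```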
I converted the filters and FFT processing to float32/complex64 (which should convert much of the TBC code further down the line), and performance is 15% higher on my AVX1 Sandy Bridge Xeon. (I haven't benchmarked Haswell yet.)
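A sketch of what that conversion can look like: design the filters in float64 (scipy's default) and cast the precomputed frequency response down. The sample rate, cutoff, and block length are placeholders, not ld-decode's real filter specs:

```python
import numpy as np
import scipy.fft
import scipy.signal as sps

fs = 40_000_000                                           # placeholder rate
blocklen = 2**16

b, a = sps.butter(4, 8_000_000, fs=fs)                    # float64 design
_, h = sps.freqz(b, a, worN=np.fft.rfftfreq(blocklen, 1 / fs), fs=fs)
filt_fft = h.astype(np.complex64)                         # 32-bit response

def apply_filter(block):
    # scipy.fft preserves single precision, so the chain stays 32-bit
    spec = scipy.fft.rfft(block.astype(np.float32))       # complex64
    return scipy.fft.irfft(spec * filt_fft, n=blocklen)   # float32
```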
Master issue for various performance bottlenecks that could be improved on.
Memory bandwidth/use between threads
As identified by several people, a fair bit of time is spent shuffling data to and from the demod threads, and on concatenating the data afterwards. Just removing the completely unused data in the shared recarray in #796 gave a notable improvement in performance, but there is more that could be improved:
- `demod_raw` is only used in one spot, in the dropout detect function, to check where the data exceeds a threshold. This could just as well be done in the demod threads themselves, storing a boolean array of where the threshold is exceeded instead, which should be much smaller (see the sketch after this list).
- `demod_burst` would likely be sufficient to store as 32-bit instead of 64-bit float, since the data will be around where the floating point precision is high anyhow.
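A sketch of both points; the names and the threshold handling are illustrative, not the actual ld-decode structures:

```python
# Slim down what the demod threads ship back to the main thread.
import numpy as np

def demod_thread_output(raw, burst, threshold):
    # instead of returning the full float64 demod_raw, keep only the
    # threshold comparison the dropout detector actually uses ...
    raw_exceeds = raw > threshold            # 1 byte/sample instead of 8
    # ... and store burst at the precision it needs
    burst32 = burst.astype(np.float32)       # 4 bytes/sample instead of 8
    return raw_exceeds, burst32
```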
FFT
The real-input rfft functions should be used rather than fft wherever we don't need the full complex spectrum (which afaik is only needed for the hilbert/demod function), as they are going to be faster, and we don't need to store as much data for the FFT filters either.
We're using pyFFTW rather than numpy's fft for speed as of now. It has a bunch of settings/caching that one could maybe play around with to improve things. It's currently not used on Windows, as it seems to conflict with using Thread instead of Process (and Process doesn't work on Windows with the current code).
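For the rfft point above, a quick demonstration of why it halves both work and filter storage on real input:

```python
# rfft vs fft on real input: half the output bins, roughly half the work,
# and half the storage for precomputed filter responses. Only the
# hilbert/demod step needs the full complex spectrum.
import numpy as np

x = np.random.randn(2**16)

full = np.fft.fft(x)        # 65536 bins, conjugate-symmetric for real x
half = np.fft.rfft(x)       # 32769 bins: the non-redundant half only

# the two agree on the non-negative frequencies
assert np.allclose(full[:len(half)], half)
```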
numba/native code optimization
Some of the tbc/sync stuff could benefit a ton from using numba (or alternatively cython or similar), as a lot of the logic is done in loops, which is slow in Python - `dropout_detect_demod`, `refine_linelocs_pilot` and `refine_linelocs_hsync` in particular, but probably more. (The last one I've implemented partially in cython in vhs-decode.) A hedged sketch follows below.

Any runs involving EFM will have a fair bit of extra startup time, as it uses numba classes whose compilation can't be cached, so they have to be re-compiled on every run. If we start using cython or similar in ld-decode, it might be worth using that for this purpose instead.
JSON
I don't know if this has a large performance hit in practice, but as of now we rewrite the whole JSON file rather than appending to it, and it can get pretty large on long runs. It might be worth looking into whether it's feasible to just append to the file and modify the needed stuff at the start instead.
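One possible shape for that, sketched here with a made-up layout (this is not ld-decode's actual .tbc.json format): a fixed-size header that gets patched in place via seek, with per-field records appended after it.

```python
import json

HEADER_SIZE = 256  # fixed so appended records never move

def init_log(path, info):
    # header padded with spaces so it can be rewritten in place later
    with open(path, 'wb') as f:
        f.write(json.dumps(info).encode().ljust(HEADER_SIZE))

def append_field(path, field_record):
    # per-field metadata is appended, never rewritten
    with open(path, 'ab') as f:
        f.write(json.dumps(field_record).encode() + b'\n')

def patch_header(path, info):
    # rewrite only the first HEADER_SIZE bytes, leaving appended data alone
    with open(path, 'r+b') as f:
        f.write(json.dumps(info).encode().ljust(HEADER_SIZE))
```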