Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dsp: p_fir: Fix FIR filter implementation #152

Merged
merged 1 commit into from
Jun 19, 2015
Merged

dsp: p_fir: Fix FIR filter implementation #152

merged 1 commit into from
Jun 19, 2015

Conversation

lfochamon
Copy link
Contributor

This is a fix to issue #151. It maintains the idea of evaluating 8 samples simultaneously to leverage vectorization (can get almost 4 times faster on my PC using -O3). I also removed the reference to dbuf from the function description as it does not exist anymore (it was removed from the API in commit #1fa28ced55ebddb522cd383ce41bf0a792aff1). A few details that are worth noticing:

  • I used malloc to dynamically allocate the delay line as p_malloc is not yet implemented. It should, however, be a simple drop replacement as soon as that function is available.
  • The function behaves exactly like MATLAB's filter, i.e., the delay line is initialized with 0s (the first nh samples are not operating on a "filled buffer").
  • As in MATLAB's filter, the filter coefficients are assumed to be in increasing delay order, i.e., this function implements the filter H(z) = h[0] + h[1] z^{-1} + h[2] z^{-2} + ... + h[nh] z^(-nh+1) (sorry, no math parser in GitHub). Although it is very easy to do it in reverse, this is by far the most common convention in DSP.

I have a test file for these functions, but I'm unsure how to post them (check_p_* files are in .gitignore). I'm somewhat new to git, so I didn't include them in this PR. In any case, I can easily push them.

The new implementation follows the same idea as the previous one: simultaneously processes blocks of 8 samples to leverage vectorization. On an i7-3770 PC, this function is almost 4 times faster than the evaluating the output sample-by-sample (compiled with -O3). Also removed the reference to dbuf from the function description as it does not exist anymore (it was removed from the API in commit #1fa28ced55ebddb522cd383ce41bf0a792aff177).

Signed-off-by: Luiz Chamon <[email protected]>
@lfochamon
Copy link
Contributor Author

Just to make the PR a little better documented, here is a small benchmark of this function compared with a straightforward circular buffer implementation. Since the proposed implementation (p_fir)does use an array of length 2*nh+1 compared to nh for the trivial implementation (p_fir_1), I thought it would be a good idea to justify that showing the speed increase. The tests were run on my PC (i7-3770, windows) and on a BeagleBone Black (AM3358, linux).

Function Time (nh = 16 and nx = 30) Code size (bytes) Delay line length
p_fir 0.250 us (i7) / 11.573 us (BBB) 52.452 (i7) / 9305 (BBB) 2*nh + 1
p_fir_1 0.535 us (i7) / 14.228 us (BBB) 51.428 (i7) / 7263 (BBB) nh

@aolofsson
Copy link
Member

Aweseome! The ~25x factor between Intel and BBB/Zynq based board seems to be repeatable. (not the first time I have seen this). Still not sure I can explain the full factor? Is the vectorization on ARM broken?

aolofsson added a commit that referenced this pull request Jun 19, 2015
dsp: p_fir: Fix FIR filter implementation
@aolofsson aolofsson merged commit 087bda2 into parallella:master Jun 19, 2015
@lfochamon
Copy link
Contributor Author

This is really way beyond my expertise (I'm a DSP guy), but could it have something to do with all the Intel's hyperthreading and multicore stuff that goes on in there? Although to be honest, I couldn't see any difference in the CPU monitors while running the tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants