How to add chunked samples to buffer via DataBuffer::addToBuffer? #556
Comments
I think this would work, but before you go down the route of modifying the
Thank you for your response. I have compared my modification with the Neuropixels plugin and found that my modification would affect the Neuropixels plugin, as it uses a default chunk size of 1 and arranges its input data so that each row "idx" holds the "numChans" channel values at time "idx". Unfortunately, I don't have a Neuropixels system to verify my claim.
However, based on my rough estimation, my modification can provide a significant performance improvement of up to 10x with a tuned optimal chunk size, mostly due to improved memory locality and a reduction in the number of calls to "buffer.copyFrom" (Source/Processors/DataThreads/DataBuffer.cpp#L89) by a factor of chunkSize. This improvement should apply to most modern cached CPUs across a range of chunk sizes. It is worth noting that while the Neuropixels plugin saves "numItems" calls to "DataBuffer::addToBuffer", it still makes the same number of "buffer.copyFrom" calls, because each call copies only one sample to a single channel. Therefore, I plan to deploy my plugin without changing the "DataBuffer::addToBuffer" function. Still, I think it would be better if a chunk size larger than 1 were an option for future improvement. Thank you again for your help!
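The call-count reduction described above can be sketched with a small counting function. This is a hypothetical illustration (not the actual GUI code): it only counts how many `buffer.copyFrom` calls the nested loops would issue for a given chunk size.

```cpp
#include <cassert>

// Hypothetical sketch: count the buffer.copyFrom calls issued by the
// chunked inner loop of addToBuffer. Per-sample copying (chunkSize == 1)
// issues numItems * numChans calls; chunking divides the sample loop by
// chunkSize, so the call count drops by roughly that factor.
long copyFromCalls (long numItems, long numChans, long chunkSize)
{
    long calls = 0;
    for (long j = 0; j < numItems; j += chunkSize)
        for (long chan = 0; chan < numChans; ++chan)
            ++calls; // one buffer.copyFrom per (chunk, channel) pair
    return calls;
}
```

For example, 1024 samples of 64 channels take 65536 calls at chunk size 1 but only 4096 at chunk size 16, a 16x reduction in per-call overhead.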
Thanks, this is helpful! We will test out this method with the Neuropixels plugin and measure the performance improvement. We are currently working on updating the plugin API for the next major release, and it's likely we can incorporate these changes then. What is the plugin that you're building?
I'm from KonteX, currently developing a plugin that supports our XDAQ system. The plugin is nearly finished, and we were wondering if you could help us publish it. Here's the link to our plugin: https://github.com/kontex-neuro/OE-XDAQ/tree/xdaq_dev. And here is the benchmark for DataBuffer::addToBuffer in case you want to know how we measured it: https://github.com/kontex-neuro/OE-XDAQ/tree/benchmark. The results with different chunk sizes:
Thanks!
Thanks for running these benchmarks! So it seems like the two methods are more or less equivalent in terms of copy speed? I will follow up via email about plugin distribution.
It's expected to be the same if the majority of the performance improvement comes from the chunked copy. Consider the original `DataBuffer::addToBuffer` taking 3 channels of data with 4 samples:

```cpp
for (int j = 0; j < bs[i]; j += chunkSize)
{ // for each chunk...
    cSize = chunkSize <= bs[i] - j ? chunkSize : bs[i] - j; // figure out how much you can write
    for (int chan = 0; chan < numChans; ++chan) // write that much, per channel
    {
        buffer.copyFrom (chan,                           // (int destChannel)
                         si[i] + j,                      // (int destStartSample)
                         data + (idx * numChans) + chan, // (const float* source)
                         cSize);                         // (int num samples)
    }
}
```

It's not possible to arrange the data correctly for passing into `buffer.copyFrom` when the chunk size > 1, since it will copy overlapping data. The following are examples assuming the buffer is empty and the buffer size is greater than 4 samples, i.e. i = 0, idx = 0.

Original (sample-major / interleaved) memory layout:

| data | Chan 0 | Chan 1 | Chan 2 |
|---|---|---|---|
| Sample 0 | 0 | 1 | 2 |
| Sample 1 | 3 | 4 | 5 |
| Sample 2 | 6 | 7 | 8 |
| Sample 3 | 9 | 10 | 11 |
Chunk Size 1 / Original Method: `data + (idx * numChans) + chan`

| Loop | buffer.copyFrom | Data Copied |
|---|---|---|
| j = 0, chan = 0 | (0, si[0], data + (0 * 3) + 0, 1) | data[0] ok |
| j = 0, chan = 1 | (1, si[0], data + (0 * 3) + 1, 1) | data[1] ok |
| j = 0, chan = 2 | (2, si[0], data + (0 * 3) + 2, 1) | data[2] ok |
Chunk Size 2 / Original Method: `data + (idx * numChans) + chan`

| Loop | buffer.copyFrom | Data Copied |
|---|---|---|
| j = 0, chan = 0 | (0, si[0], data + (0 * 3) + 0, 2) | data[0] data[1] ! Unexpected copy |
| j = 0, chan = 1 | (1, si[0], data + (0 * 3) + 1, 2) | data[1] data[2] ! Unexpected copy |
| j = 0, chan = 2 | (2, si[0], data + (0 * 3) + 2, 2) | data[2] data[3] ! Unexpected copy |

Chunk Size 4 / Original Method: `data + (idx * numChans) + chan`

| Loop | buffer.copyFrom | Data Copied |
|---|---|---|
| j = 0, chan = 0 | (0, si[0], data + (0 * 3) + 0, 4) | data[0] data[1] data[2] data[3] ! Unexpected copy |
| j = 0, chan = 1 | (1, si[0], data + (0 * 3) + 1, 4) | data[1] data[2] data[3] data[4] ! Unexpected copy |
| j = 0, chan = 2 | (2, si[0], data + (0 * 3) + 2, 4) | data[2] data[3] data[4] data[5] ! Unexpected copy |
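The overlap shown in these tables can be reproduced with a minimal sketch (a hypothetical helper, not the actual plugin code): with interleaved data, a contiguous copy of more than one float starting at one channel's sample necessarily picks up the neighbouring channels' samples.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of the overlap problem: with sample-major
// (interleaved) data, data + (idx * numChans) + chan points at one
// channel's sample, but the next floats in memory belong to the OTHER
// channels, so a contiguous copy of cSize > 1 floats grabs wrong data.
std::vector<float> copiedForChannel (const std::vector<float>& data,
                                     int numChans, int chan, int cSize)
{
    const float* src = data.data() + (0 * numChans) + chan; // idx = 0
    return std::vector<float> (src, src + cSize);           // contiguous copy
}
```

With 3 interleaved channels and data[i] = i, channel 1 at chunk size 2 receives {data[1], data[2]}, i.e. channel 2's sample instead of channel 1's next sample (which lives at data[4]).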
Channel-major memory layout:

| data | Sample 0 | Sample 1 | Sample 2 | Sample 3 |
|---|---|---|---|---|
| Chan 0 | 0 | 1 | 2 | 3 |
| Chan 1 | 4 | 5 | 6 | 7 |
| Chan 2 | 8 | 9 | 10 | 11 |
Chunk Size 2 / Original Method: `data + (idx * numChans) + chan`

| Loop | buffer.copyFrom | Data Copied |
|---|---|---|
| j = 0, chan = 0 | (0, si[0], data + (0 * 3) + 0, 2) | data[0] data[1] ! Unexpected copy |
| j = 0, chan = 1 | (1, si[0], data + (0 * 3) + 1, 2) | data[1] data[2] ! Unexpected copy |
| j = 0, chan = 2 | (2, si[0], data + (0 * 3) + 2, 2) | data[2] data[3] ! Unexpected copy |
Chunk Size 2 / Modified: `data + (chan * numItems) + idx`

| Loop | buffer.copyFrom | Data Copied |
|---|---|---|
| j = 0, chan = 0 | (0, si[0], data + (0 * 4) + 0, 2) | data[0] data[1] ok |
| j = 0, chan = 1 | (1, si[0], data + (1 * 4) + 0, 2) | data[4] data[5] ok |
| j = 0, chan = 2 | (2, si[0], data + (2 * 4) + 0, 2) | data[8] data[9] ok |

Chunk Size 4 / Modified: `data + (chan * numItems) + idx`

| Loop | buffer.copyFrom | Data Copied |
|---|---|---|
| j = 0, chan = 0 | (0, si[0], data + (0 * 4) + 0, 4) | data[0] data[1] data[2] data[3] ok |
| j = 0, chan = 1 | (1, si[0], data + (1 * 4) + 0, 4) | data[4] data[5] data[6] data[7] ok |
| j = 0, chan = 2 | (2, si[0], data + (2 * 4) + 0, 4) | data[8] data[9] data[10] data[11] ok |
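The modified indexing can be checked end-to-end with a small simulation. This is a sketch under simplifying assumptions (a single packet starting at sample 0, so idx advances with j; `chunkedCopy` and its destination layout are hypothetical, not the plugin's API):

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical simulation of the modified copy: with channel-major input
// (each row of numItems floats is one channel), copying cSize floats from
// data + (chan * numItems) + idx lands each channel's own samples in the
// destination, for any chunk size, including a non-dividing remainder.
std::vector<std::vector<float>> chunkedCopy (const std::vector<float>& data,
                                             int numItems, int numChans,
                                             int chunkSize)
{
    std::vector<std::vector<float>> dest (numChans,
                                          std::vector<float> (numItems));
    for (int j = 0; j < numItems; j += chunkSize)
    {
        int cSize = std::min (chunkSize, numItems - j); // partial last chunk
        for (int chan = 0; chan < numChans; ++chan)
            std::memcpy (dest[chan].data() + j,
                         data.data() + (chan * numItems) + j, // idx == j here
                         cSize * sizeof (float));
    }
    return dest;
}
```

With the 3-channel, 4-sample layout from the tables above, every chunk size reproduces dest[chan] = that channel's four samples.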
Let me know if I missed something or misunderstood the API; I hope this can actually improve performance!
Thanks for the detailed explanation, this definitely makes sense! Based on the benchmarking you did, it seemed like there wasn't any significant reduction in copy times after switching to the channel-major memory layout, but maybe I'm misunderstanding those numbers. In any case, we will run some tests on our end to measure the copy speeds prior to the API update.
I've run into the same issue: any chunkSize that is not 1 does not copy the data correctly. I didn't see this thread until after I had resolved the issue with a slightly different implementation. One case that the above solution does not handle: when a full chunkSize cannot be copied at the end of the circular buffer, the first part of the chunk is copied but the rest is not. There were also opportunities for reducing overhead (and hopefully adding clarity) by doing 'all at once' copies for:
I'm going to submit a PR for the changes and will include a snippet of code I used to test the algorithm. I'll update this thread with a link to the PR.
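For reference, a common way to handle a chunk that straddles the end of a circular buffer is to split it into two copies. This is a hedged sketch of that idea, not the code from the PR; `writeWrapped` is a hypothetical helper:

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical sketch: write cSize samples into a ring buffer starting at
// writeIndex, splitting the copy when the chunk crosses the end of the
// buffer (the failure mode described above dropped the wrapped remainder).
void writeWrapped (std::vector<float>& ring, int writeIndex,
                   const float* src, int cSize)
{
    int bufSize = (int) ring.size();
    int firstPart = std::min (cSize, bufSize - writeIndex); // fits before end
    std::memcpy (ring.data() + writeIndex, src, firstPart * sizeof (float));
    if (cSize > firstPart) // remainder wraps to the start of the buffer
        std::memcpy (ring.data(), src + firstPart,
                     (cSize - firstPart) * sizeof (float));
}
```

For example, writing a 4-sample chunk at index 6 of an 8-sample ring puts two samples at indices 6 and 7 and the remaining two at indices 0 and 1.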
PR Submitted: #639
As Anjal mentioned in the PR comment, based on this discussion and others we have decided to make samples from the same channel contiguous in memory in GUI version 1.0, which will be released very soon. This also fixes the chunked-copy issue discussed in this thread. I would strongly recommend developing any new plugins against the new version.
I'm trying to add more than one sample to the DataBuffer at once, but I could only get it to work by making some changes to the implementation of `DataBuffer::addToBuffer`. Specifically, change Source/Processors/DataThreads/DataBuffer.cpp#L91 from

```cpp
data + (idx * numChans) + chan,
```

into

```cpp
data + (chan * numItems) + idx,
```
My changes are intended to copy "cSize" samples into channel "chan" from contiguous memory.
To achieve this, the memory layout should be organized as a two-dimensional matrix where each row holds the next "numItems" samples of one channel. Here is an example table (3 channels, 4 samples) to illustrate this:

| data | Sample 0 | Sample 1 | Sample 2 | Sample 3 |
|---|---|---|---|---|
| Chan 0 | 0 | 1 | 2 | 3 |
| Chan 1 | 4 | 5 | 6 | 7 |
| Chan 2 | 8 | 9 | 10 | 11 |
I believe that my modified code will correctly copy the desired samples into the buffer using this memory layout.
However, I would appreciate it if someone could review my changes and verify that they are correct.
Thank you in advance for your help!
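If an acquisition source delivers interleaved (sample-major) data, the channel-major layout described above can be produced with a simple transpose. This is a hypothetical helper for illustration, not part of the plugin:

```cpp
#include <vector>

// Hypothetical helper: convert interleaved (sample-major) data, as many
// acquisition sources produce it, into the channel-major layout described
// above, where row `chan` holds that channel's numItems samples contiguously.
std::vector<float> deinterleave (const std::vector<float>& interleaved,
                                 int numItems, int numChans)
{
    std::vector<float> channelMajor (interleaved.size());
    for (int idx = 0; idx < numItems; ++idx)
        for (int chan = 0; chan < numChans; ++chan)
            channelMajor[chan * numItems + idx] =
                interleaved[idx * numChans + chan];
    return channelMajor;
}
```

Note that this adds an extra pass over the data; sources that can fill a channel-major buffer directly avoid that cost.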