
How to add chunked samples to buffer via DataBuffer::addToBuffer? #556

Open · StoneLin0708 opened this issue Feb 23, 2023 · 10 comments

@StoneLin0708 commented Feb 23, 2023
I'm trying to add more than one sample to the DataBuffer at once, and I could only get it to work by modifying the implementation of DataBuffer::addToBuffer.
Specifically, I changed Source/Processors/DataThreads/DataBuffer.cpp#L91 from `data + (idx * numChans) + chan` to `data + (chan * numItems) + idx`:

```cpp
for (int i = 0; bs[i] != 0; ++i)
{                                // for each of the dest blocks we can write to...
    blkIdx = 0;
    for (int j = 0; j < bs[i]; j += chunkSize)
    {                            // for each chunk...
        cSize = chunkSize <= bs[i] - j ? chunkSize : bs[i] - j; // figure out how much you can write
        for (int chan = 0; chan < numChans; ++chan)             // write that much, per channel
        {
            buffer.copyFrom (chan,                           // (int destChannel)
                             si[i] + j,                      // (int destStartSample)
                             data + (chan * numItems) + idx, // (const float* source) <- changed line (L91)
                             cSize);                         // (int numSamples)
        }
    }
}
```

My change copies cSize samples into channel chan from contiguous memory. For this to work, the source must be laid out channel-major: a two-dimensional matrix in which each row holds the numItems consecutive samples of one channel. The following table illustrates the layout (entries are offsets into data):

| chan \ idx | 0 | 1 | 2 | … | idx | … | numItems-1 |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | … | idx | … | numItems-1 |
| 1 | numItems | 1+numItems | 2+numItems | … | idx+numItems | … | 2*numItems-1 |
| 2 | | | | | | | |
| … | | | | | | | |
| chan | | | | | idx + chan*numItems | | |
| … | | | | | | | |
| numChans-1 | | | | | | | |

I believe that my modified code will correctly copy the desired samples into the buffer using this memory layout.
However, I would appreciate it if someone could review my changes and verify that they are correct.
Thank you in advance for your help!

@jsiegle (Member) commented Feb 23, 2023

I think this would work, but before you go down the route of modifying the DataBuffer class, can you compare what you're doing to the implementation in the Neuropixels plugin? This plugin calls addToBuffer with multiple samples on each loop. If you can organize your sample buffers in a way that's consistent with what's expected in the official GUI distribution, it will be much easier to deploy your plugin.

@StoneLin0708 (Author)

Thank you for your response. I compared my modification with the Neuropixels plugin and found that it would break that plugin: it uses the default chunk size of 1 and arranges its input data sample-major, with each row "idx" holding the next numChans values at time "idx". Unfortunately, I don't have a Neuropixels system to verify this.

| idx \ chan | 0 | 1 | 2 | … | chan | … | numChans-1 |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | … | chan | … | numChans-1 |
| 1 | numChans | 1+numChans | 2+numChans | … | chan+numChans | … | 2*numChans-1 |
| 2 | | | | | | | |
| … | | | | | | | |
| idx | | | | | chan + idx*numChans | | |
| … | | | | | | | |
| numItems-1 | | | | | | | |
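For reference, reading one sample from this interleaved layout can be sketched as (a hypothetical accessor, not GUI code):

```cpp
// Sample-major (interleaved) layout: one row per time index,
// numChans consecutive values per row, so sample idx of channel
// chan sits at offset idx * numChans + chan.
inline float sampleAt (const float* data, int numChans, int idx, int chan)
{
    return data[idx * numChans + chan];
}
```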

However, based on my rough estimate, my modification can give a performance improvement of up to 10x with a well-tuned chunk size, mostly due to better memory locality and to reducing the number of calls to "buffer.copyFrom" (Source/Processors/DataThreads/DataBuffer.cpp#L89) by a factor of chunkSize. This improvement should apply to most modern cached CPUs across a range of chunk sizes.

It is worth noting that while the Neuropixels plugin saves "numItems" calls to "DataBuffer::addToBuffer", it still makes the same number of "buffer.copyFrom" calls, because each call copies only one sample to a single channel.

Therefore, I plan to deploy my plugin without changing "DataBuffer::addToBuffer". Still, I think allowing a chunk size larger than 1 would be a good future improvement.
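The call-count reduction can be checked with simple arithmetic (a sketch of the claim above; the exact factor of chunkSize assumes numItems is a multiple of chunkSize):

```cpp
// Number of buffer.copyFrom calls made by the chunked loop:
// one call per (chunk, channel) pair.
int copyFromCalls (int numItems, int numChans, int chunkSize)
{
    int chunks = (numItems + chunkSize - 1) / chunkSize; // ceiling division
    return chunks * numChans;
}
```

For example, with numItems = 300 and a hypothetical numChans = 32: chunkSize 1 gives 9600 calls, while chunkSize 64 gives only 160.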

Thank you again for your help!

@jsiegle (Member) commented Mar 1, 2023

Thanks, this is helpful! We will test out this method with the Neuropixels plugin and measure the performance improvement. We are currently working on updating the plugin API for the next major release, and it's likely we can incorporate these changes then.

What is the plugin that you're building?

@StoneLin0708 (Author) commented Mar 2, 2023

I'm from KonteX, currently developing a plugin that supports our XDAQ system. The plugin is nearly finished, and we were wondering if you could help us publish it. Here's the link to our plugin: https://github.com/kontex-neuro/OE-XDAQ/tree/xdaq_dev.

And here is the benchmark for DataBuffer::addToBuffer, in case you want to see how we measured it: https://github.com/kontex-neuro/OE-XDAQ/tree/benchmark.

Results with different chunk sizes (average of 20 test samples, numItems = 300; Original: `data + (idx * numChans) + chan`, Modified: `data + (chan * numItems) + idx`):

| Chunk Size | Modified (µs) | Original (µs) |
|---|---|---|
| 1 | 2432 | 2316 |
| 4 | 469 | 618 |
| 16 | 259 | 227 |
| 64 | 180 | 184 |
| 256 | 284 | 251 |
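A measurement of this kind can be structured with std::chrono along these lines (a generic harness sketch, not the actual code from the benchmark repository):

```cpp
#include <chrono>

// Time an operation and report the average duration in microseconds
// over several runs, as in the 20-sample averages above.
template <typename F>
long long averageMicros (F&& op, int runs)
{
    using clock = std::chrono::steady_clock;
    long long total = 0;
    for (int r = 0; r < runs; ++r)
    {
        auto t0 = clock::now();
        op();
        auto t1 = clock::now();
        total += std::chrono::duration_cast<std::chrono::microseconds> (t1 - t0).count();
    }
    return total / runs;
}
```

It would be called with the copy under test wrapped in a lambda, e.g. `averageMicros ([&] { /* addToBuffer call */ }, 20)`.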

Thanks!

@jsiegle (Member) commented Mar 3, 2023

Thanks for running these benchmarks! So it seems like the two methods are more or less equivalent in terms of copy speed?

I will follow up via email about plugin distribution.

@StoneLin0708 (Author)

The copy speed is expected to be about the same, since most of the improvement comes from the chunked copy itself.
The point of the new method is that it copies the data correctly when the chunk size is greater than 1.

Consider the original DataBuffer::addToBuffer taking 3 channels of data with 4 samples each.

```cpp
for (int j = 0; j < bs[i]; j += chunkSize)
{                            // for each chunk...
    cSize = chunkSize <= bs[i] - j ? chunkSize : bs[i] - j; // figure out how much you can write
    for (int chan = 0; chan < numChans; ++chan)             // write that much, per channel
    {
        buffer.copyFrom (chan,                           // (int destChannel)
                         si[i] + j,                      // (int destStartSample)
                         data + (idx * numChans) + chan, // (const float* source)
                         cSize);                         // (int numSamples)
    }
}
```

When the chunk size is greater than 1, there is no way to arrange the data so that buffer.copyFrom reads the right samples, because consecutive channels read overlapping spans of memory.

The following examples assume the buffer is empty and larger than 4 samples, i.e. i = 0 and idx = 0.

Original (sample-major) memory layout:

| data | Chan 0 | Chan 1 | Chan 2 |
|---|---|---|---|
| Sample 0 | 0 | 1 | 2 |
| Sample 1 | 3 | 4 | 5 |
| Sample 2 | 6 | 7 | 8 |
| Sample 3 | 9 | 10 | 11 |

Chunk Size 1 / Original Method: `data + (idx * numChans) + chan`

| Loop | buffer.copyFrom | Data Copied |
|---|---|---|
| j = 0, chan = 0 | (0, si[0], data + (0 * 3) + 0, 1) | data[0] ok |
| j = 0, chan = 1 | (1, si[0], data + (0 * 3) + 1, 1) | data[1] ok |
| j = 0, chan = 2 | (2, si[0], data + (0 * 3) + 2, 1) | data[2] ok |

Chunk Size 2 / Original Method: `data + (idx * numChans) + chan`

| Loop | buffer.copyFrom | Data Copied |
|---|---|---|
| j = 0, chan = 0 | (0, si[0], data + (0 * 3) + 0, 2) | data[0] data[1] ! Unexpected copy |
| j = 0, chan = 1 | (1, si[0], data + (0 * 3) + 1, 2) | data[1] data[2] ! Unexpected copy |
| j = 0, chan = 2 | (2, si[0], data + (0 * 3) + 2, 2) | data[2] data[3] ! Unexpected copy |

Chunk Size 4 / Original Method: `data + (idx * numChans) + chan`

| Loop | buffer.copyFrom | Data Copied |
|---|---|---|
| j = 0, chan = 0 | (0, si[0], data + (0 * 3) + 0, 4) | data[0] data[1] data[2] data[3] ! Unexpected copy |
| j = 0, chan = 1 | (1, si[0], data + (0 * 3) + 1, 4) | data[1] data[2] data[3] data[4] ! Unexpected copy |
| j = 0, chan = 2 | (2, si[0], data + (0 * 3) + 2, 4) | data[2] data[3] data[4] data[5] ! Unexpected copy |

Channel-major memory layout:

| data | Sample 0 | Sample 1 | Sample 2 | Sample 3 |
|---|---|---|---|---|
| Chan 0 | 0 | 1 | 2 | 3 |
| Chan 1 | 4 | 5 | 6 | 7 |
| Chan 2 | 8 | 9 | 10 | 11 |

Chunk Size 2 / Original Method: `data + (idx * numChans) + chan`

| Loop | buffer.copyFrom | Data Copied |
|---|---|---|
| j = 0, chan = 0 | (0, si[0], data + (0 * 3) + 0, 2) | data[0] data[1] ! Unexpected copy |
| j = 0, chan = 1 | (1, si[0], data + (0 * 3) + 1, 2) | data[1] data[2] ! Unexpected copy |
| j = 0, chan = 2 | (2, si[0], data + (0 * 3) + 2, 2) | data[2] data[3] ! Unexpected copy |

Chunk Size 2 / Modified: `data + (chan * numItems) + idx`

| Loop | buffer.copyFrom | Data Copied |
|---|---|---|
| j = 0, chan = 0 | (0, si[0], data + (0 * 4) + 0, 2) | data[0] data[1] ok |
| j = 0, chan = 1 | (1, si[0], data + (1 * 4) + 0, 2) | data[4] data[5] ok |
| j = 0, chan = 2 | (2, si[0], data + (2 * 4) + 0, 2) | data[8] data[9] ok |

Chunk Size 4 / Modified: `data + (chan * numItems) + idx`

| Loop | buffer.copyFrom | Data Copied |
|---|---|---|
| j = 0, chan = 0 | (0, si[0], data + (0 * 4) + 0, 4) | data[0] data[1] data[2] data[3] ok |
| j = 0, chan = 1 | (1, si[0], data + (1 * 4) + 0, 4) | data[4] data[5] data[6] data[7] ok |
| j = 0, chan = 2 | (2, si[0], data + (2 * 4) + 0, 4) | data[8] data[9] data[10] data[11] ok |
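The traces above can be reproduced with a small standalone check. Plain arrays stand in for the JUCE AudioBuffer and std::memcpy for buffer.copyFrom; both helper functions are illustrative, not GUI code:

```cpp
#include <cstring>

// With the original indexing and chunk size 2, channel 0's second
// destination sample receives data[1] (channel 1, sample 0) instead
// of data[3] (channel 0, sample 1).
bool originalIndexingOverlaps()
{
    const float interleaved[12] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
    float dest[3][4] = {};
    for (int j = 0; j < 4; j += 2)
        for (int chan = 0; chan < 3; ++chan)
            std::memcpy (&dest[chan][j], interleaved + (j * 3) + chan,
                         2 * sizeof (float));
    return dest[0][1] != 3.0f; // true: the copy overlapped channels
}

// With the modified indexing and channel-major data, every chunk
// reads a contiguous, non-overlapping span of the source.
bool modifiedIndexingIsCorrect()
{
    const int numChans = 3, numItems = 4, chunk = 2;
    const float chanMajor[12] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
    float dest[3][4] = {};
    for (int j = 0; j < numItems; j += chunk)
        for (int chan = 0; chan < numChans; ++chan)
            std::memcpy (&dest[chan][j], chanMajor + (chan * numItems) + j,
                         chunk * sizeof (float));
    for (int chan = 0; chan < numChans; ++chan)
        for (int idx = 0; idx < numItems; ++idx)
            if (dest[chan][idx] != chanMajor[chan * numItems + idx])
                return false;
    return true;
}
```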

Let me know if I missed something or misunderstood the API; I hope it can actually improve the performance!

@jsiegle (Member) commented Mar 6, 2023

Thanks for the detailed explanation, this definitely makes sense! Based on the benchmarking you did, it seemed like there wasn't any significant reduction in copy times after switching to channel-major memory layout, but maybe I'm misunderstanding those numbers. In any case, we will run some tests on our end to measure the copy speeds prior to the API update.

@kevkdev commented Dec 6, 2024

I've run into the same issue: any chunkSize other than 1 does not copy the data correctly. I didn't see this thread until after I had resolved the issue with a slightly different implementation.

One case that the solution above does not handle: when a full chunkSize does not fit at the end of the circular buffer, the first part of the chunk is copied but the rest is not.

There were also opportunities to reduce overhead (and hopefully add clarity) by doing "all at once" copies for:

  • timestamps, sampleNumbers, eventCodes, and sample data when a channel's samples are contiguous in memory.
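The wrap-around case can be sketched like this (an illustrative helper, not the PR's actual code): a chunk that straddles the end of the circular buffer must be split into two copies.

```cpp
#include <cstring>

// Copy `count` floats into a circular buffer of `capacity` floats,
// starting at index `start`. A copy that would run past the end is
// split: the first part fills the tail, the remainder wraps to the
// front of the buffer.
void circularCopy (float* ring, int capacity, int start,
                   const float* src, int count)
{
    int firstPart = capacity - start;   // room before the wrap point
    if (firstPart > count)
        firstPart = count;
    std::memcpy (ring + start, src, firstPart * sizeof (float));
    if (count > firstPart)              // remainder wraps to the front
        std::memcpy (ring, src + firstPart,
                     (count - firstPart) * sizeof (float));
}
```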

I'm going to submit a PR for the changes and will include a snippet of code I used to test the algorithm. I'll update this thread with a link to the PR.

@kevkdev commented Dec 6, 2024

PR Submitted: #639

@jsiegle (Member) commented Dec 9, 2024

As Anjal mentioned in the PR comment, based on this discussion and others we have decided to make samples from the same channel contiguous in memory in GUI version 1.0, which will be released very soon. This also fixes the issue of the chunkSize variable not working as expected.

I would strongly recommend developing any new plugins against the development-juce8 branch, which will be merged to main in the coming weeks.
