How to add chunked samples to buffer via DataBuffer::addToBuffer? #556
Comments
I think this would work, but before you go down the route of modifying the
Thank you for your response. I have compared my modification with the Neuropixels plugin and found that my modification would affect the Neuropixels plugin, as it uses a default chunk size of 1 and arranges its input data so that each row "idx" holds the "numChans" channel values at time "idx". Unfortunately, I don't have a Neuropixels system to verify my claim.
However, based on my rough estimation, my modification can provide a significant performance improvement of up to 10x with a tuned optimal chunk size, mostly due to improved memory locality and a reduction in the number of calls to "buffer.copyFrom" (Source/Processors/DataThreads/DataBuffer.cpp#L89) by a factor of chunkSize. This improvement should apply to most modern cached CPUs across a range of chunk sizes. It is worth noting that while the Neuropixels plugin saves "numItems" calls to "DataBuffer::addToBuffer", it still makes the same number of "buffer.copyFrom" calls, because each call copies only one sample to a single channel. Therefore, I plan to deploy my plugin without changing the "DataBuffer::addToBuffer" function. Still, I think it would be better if a chunk size larger than 1 were an option for future improvement. Thank you again for your help!
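The call-count reduction described above can be sketched with a small counting function. This is a hypothetical illustration (not the actual GUI code): it only counts how many `buffer.copyFrom` calls the nested loops would issue for a given chunk size.

```cpp
#include <cassert>

// Hypothetical sketch: count the buffer.copyFrom calls issued by the
// chunked inner loop of addToBuffer. Per-sample copying (chunkSize == 1)
// issues numItems * numChans calls; chunking divides the sample loop by
// chunkSize, so the call count drops by roughly that factor.
long copyFromCalls (long numItems, long numChans, long chunkSize)
{
    long calls = 0;
    for (long j = 0; j < numItems; j += chunkSize)
        for (long chan = 0; chan < numChans; ++chan)
            ++calls; // one buffer.copyFrom per (chunk, channel) pair
    return calls;
}
```

For example, 1024 samples of 64 channels take 65536 calls at chunk size 1 but only 4096 at chunk size 16, a 16x reduction in per-call overhead.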
Thanks, this is helpful! We will test out this method with the Neuropixels plugin and measure the performance improvement. We are currently working on updating the plugin API for the next major release, and it's likely we can incorporate these changes then. What is the plugin that you're building?
I'm from KonteX, currently developing a plugin that supports our XDAQ system. The plugin is nearly finished, and we were wondering if you could help us publish it. Here's the link to our plugin: https://github.com/kontex-neuro/OE-XDAQ/tree/xdaq_dev. And here is the benchmark for DataBuffer::addToBuffer in case you want to know how we measured it: https://github.com/kontex-neuro/OE-XDAQ/tree/benchmark. The results with different chunk sizes:
Thanks!
Thanks for running these benchmarks! So it seems like the two methods are more or less equivalent in terms of copy speed? I will follow up via email about plugin distribution.
It's expected to be the same if the majority of the performance improvement comes from the chunked copy. Consider the original `DataBuffer::addToBuffer` taking 3 channels of data with 4 samples:

```cpp
for (int j = 0; j < bs[i]; j += chunkSize)
{ // for each chunk...
    cSize = chunkSize <= bs[i] - j ? chunkSize : bs[i] - j; // figure out how much you can write
    for (int chan = 0; chan < numChans; ++chan) // write that much, per channel
    {
        buffer.copyFrom (chan,                           // (int destChannel)
                         si[i] + j,                      // (int destStartSample)
                         data + (idx * numChans) + chan, // (const float* source)
                         cSize);                         // (int num samples)
    }
}
```

It's not possible to arrange the data correctly for passing into `buffer.copyFrom` when the chunk size > 1, since it will copy overlapping data. The following are examples assuming the buffer is empty and the buffer size is greater than 4 samples, i.e. i = 0, idx = 0.

Original (sample-major / interleaved) memory layout:

| data | Chan 0 | Chan 1 | Chan 2 |
|---|---|---|---|
| Sample 0 | 0 | 1 | 2 |
| Sample 1 | 3 | 4 | 5 |
| Sample 2 | 6 | 7 | 8 |
| Sample 3 | 9 | 10 | 11 |
Chunk Size 1 / Original Method: `data + (idx * numChans) + chan`

| Loop | buffer.copyFrom | Data Copied |
|---|---|---|
| j = 0, chan = 0 | (0, si[0], data + (0 * 3) + 0, 1) | data[0] ok |
| j = 0, chan = 1 | (1, si[0], data + (0 * 3) + 1, 1) | data[1] ok |
| j = 0, chan = 2 | (2, si[0], data + (0 * 3) + 2, 1) | data[2] ok |
Chunk Size 2 / Original Method: `data + (idx * numChans) + chan`

| Loop | buffer.copyFrom | Data Copied |
|---|---|---|
| j = 0, chan = 0 | (0, si[0], data + (0 * 3) + 0, 2) | data[0] data[1] ! Unexpected copy |
| j = 0, chan = 1 | (1, si[0], data + (0 * 3) + 1, 2) | data[1] data[2] ! Unexpected copy |
| j = 0, chan = 2 | (2, si[0], data + (0 * 3) + 2, 2) | data[2] data[3] ! Unexpected copy |

Chunk Size 4 / Original Method: `data + (idx * numChans) + chan`

| Loop | buffer.copyFrom | Data Copied |
|---|---|---|
| j = 0, chan = 0 | (0, si[0], data + (0 * 3) + 0, 4) | data[0] data[1] data[2] data[3] ! Unexpected copy |
| j = 0, chan = 1 | (1, si[0], data + (0 * 3) + 1, 4) | data[1] data[2] data[3] data[4] ! Unexpected copy |
| j = 0, chan = 2 | (2, si[0], data + (0 * 3) + 2, 4) | data[2] data[3] data[4] data[5] ! Unexpected copy |
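The overlap shown in these tables can be reproduced with a minimal sketch (a hypothetical helper, not the actual plugin code): with interleaved data, a contiguous copy of more than one float starting at one channel's sample necessarily picks up the neighbouring channels' samples.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of the overlap problem: with sample-major
// (interleaved) data, data + (idx * numChans) + chan points at one
// channel's sample, but the next floats in memory belong to the OTHER
// channels, so a contiguous copy of cSize > 1 floats grabs wrong data.
std::vector<float> copiedForChannel (const std::vector<float>& data,
                                     int numChans, int chan, int cSize)
{
    const float* src = data.data() + (0 * numChans) + chan; // idx = 0
    return std::vector<float> (src, src + cSize);           // contiguous copy
}
```

With 3 interleaved channels and data[i] = i, channel 1 at chunk size 2 receives {data[1], data[2]}, i.e. channel 2's sample instead of channel 1's next sample (which lives at data[4]).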
Channel-major memory layout:

| data | Sample 0 | Sample 1 | Sample 2 | Sample 3 |
|---|---|---|---|---|
| Chan 0 | 0 | 1 | 2 | 3 |
| Chan 1 | 4 | 5 | 6 | 7 |
| Chan 2 | 8 | 9 | 10 | 11 |
Chunk Size 2 / Original Method: `data + (idx * numChans) + chan`

| Loop | buffer.copyFrom | Data Copied |
|---|---|---|
| j = 0, chan = 0 | (0, si[0], data + (0 * 3) + 0, 2) | data[0] data[1] ! Unexpected copy |
| j = 0, chan = 1 | (1, si[0], data + (0 * 3) + 1, 2) | data[1] data[2] ! Unexpected copy |
| j = 0, chan = 2 | (2, si[0], data + (0 * 3) + 2, 2) | data[2] data[3] ! Unexpected copy |
Chunk Size 2 / Modified: `data + (chan * numItems) + idx`

| Loop | buffer.copyFrom | Data Copied |
|---|---|---|
| j = 0, chan = 0 | (0, si[0], data + (0 * 4) + 0, 2) | data[0] data[1] ok |
| j = 0, chan = 1 | (1, si[0], data + (1 * 4) + 0, 2) | data[4] data[5] ok |
| j = 0, chan = 2 | (2, si[0], data + (2 * 4) + 0, 2) | data[8] data[9] ok |

Chunk Size 4 / Modified: `data + (chan * numItems) + idx`

| Loop | buffer.copyFrom | Data Copied |
|---|---|---|
| j = 0, chan = 0 | (0, si[0], data + (0 * 4) + 0, 4) | data[0] data[1] data[2] data[3] ok |
| j = 0, chan = 1 | (1, si[0], data + (1 * 4) + 0, 4) | data[4] data[5] data[6] data[7] ok |
| j = 0, chan = 2 | (2, si[0], data + (2 * 4) + 0, 4) | data[8] data[9] data[10] data[11] ok |
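The modified indexing can be checked end-to-end with a small simulation. This is a sketch under simplifying assumptions (a single packet starting at sample 0, so idx advances with j; `chunkedCopy` and its destination layout are hypothetical, not the plugin's API):

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical simulation of the modified copy: with channel-major input
// (each row of numItems floats is one channel), copying cSize floats from
// data + (chan * numItems) + idx lands each channel's own samples in the
// destination, for any chunk size, including a non-dividing remainder.
std::vector<std::vector<float>> chunkedCopy (const std::vector<float>& data,
                                             int numItems, int numChans,
                                             int chunkSize)
{
    std::vector<std::vector<float>> dest (numChans,
                                          std::vector<float> (numItems));
    for (int j = 0; j < numItems; j += chunkSize)
    {
        int cSize = std::min (chunkSize, numItems - j); // partial last chunk
        for (int chan = 0; chan < numChans; ++chan)
            std::memcpy (dest[chan].data() + j,
                         data.data() + (chan * numItems) + j, // idx == j here
                         cSize * sizeof (float));
    }
    return dest;
}
```

With the 3-channel, 4-sample layout from the tables above, every chunk size reproduces dest[chan] = that channel's four samples.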
Let me know if I missed something or misunderstood the API; I hope this can actually improve performance!
Thanks for the detailed explanation, this definitely makes sense! Based on the benchmarking you did, it seemed like there wasn't any significant reduction in copy times after switching to the channel-major memory layout, but maybe I'm misunderstanding those numbers. In any case, we will run some tests on our end to measure the copy speeds prior to the API update.
I've run into the same issue: any chunkSize that is not 1 does not copy the data correctly. I didn't see this thread until after I had resolved the issue with a slightly different implementation. One case that the above solution does not handle: when a full chunkSize cannot be copied at the end of the circular buffer, the first part of the chunk is copied but the rest is not. There were also opportunities for reducing overhead (and hopefully adding clarity) by doing 'all at once' copies for:
I'm going to submit a PR for the changes and will include a snippet of code I used to test the algorithm. I'll update this thread with a link to the PR.
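For reference, a common way to handle a chunk that straddles the end of a circular buffer is to split it into two copies. This is a hedged sketch of that idea, not the code from the PR; `writeWrapped` is a hypothetical helper:

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical sketch: write cSize samples into a ring buffer starting at
// writeIndex, splitting the copy when the chunk crosses the end of the
// buffer (the failure mode described above dropped the wrapped remainder).
void writeWrapped (std::vector<float>& ring, int writeIndex,
                   const float* src, int cSize)
{
    int bufSize = (int) ring.size();
    int firstPart = std::min (cSize, bufSize - writeIndex); // fits before end
    std::memcpy (ring.data() + writeIndex, src, firstPart * sizeof (float));
    if (cSize > firstPart) // remainder wraps to the start of the buffer
        std::memcpy (ring.data(), src + firstPart,
                     (cSize - firstPart) * sizeof (float));
}
```

For example, writing a 4-sample chunk at index 6 of an 8-sample ring puts two samples at indices 6 and 7 and the remaining two at indices 0 and 1.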
PR Submitted: #639
As Anjal mentioned in the PR comment, based on this discussion and others we have decided to make samples from the same channel contiguous in memory in GUI version 1.0, which will be released very soon. This also fixes the chunked-copy issue discussed in this thread. I would strongly recommend developing any new plugins against the new version.
I'm trying to add more than one sample to the DataBuffer at once, but I could only get it to work by making some changes to the implementation of `DataBuffer::addToBuffer`. Specifically, change Source/Processors/DataThreads/DataBuffer.cpp#L91 from

```cpp
data + (idx * numChans) + chan,
```

into

```cpp
data + (chan * numItems) + idx,
```
My changes are intended to copy "cSize" samples into channel "chan" from contiguous memory.
To achieve this, the memory layout should be organized as a two-dimensional matrix where each row holds the next "numItems" samples of one channel. Here is an example table (3 channels, 4 samples) to illustrate this:

| data | Sample 0 | Sample 1 | Sample 2 | Sample 3 |
|---|---|---|---|---|
| Chan 0 | 0 | 1 | 2 | 3 |
| Chan 1 | 4 | 5 | 6 | 7 |
| Chan 2 | 8 | 9 | 10 | 11 |
I believe that my modified code will correctly copy the desired samples into the buffer using this memory layout.
However, I would appreciate it if someone could review my changes and verify that they are correct.
Thank you in advance for your help!
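If an acquisition source delivers interleaved (sample-major) data, the channel-major layout described above can be produced with a simple transpose. This is a hypothetical helper for illustration, not part of the plugin:

```cpp
#include <vector>

// Hypothetical helper: convert interleaved (sample-major) data, as many
// acquisition sources produce it, into the channel-major layout described
// above, where row `chan` holds that channel's numItems samples contiguously.
std::vector<float> deinterleave (const std::vector<float>& interleaved,
                                 int numItems, int numChans)
{
    std::vector<float> channelMajor (interleaved.size());
    for (int idx = 0; idx < numItems; ++idx)
        for (int chan = 0; chan < numChans; ++chan)
            channelMajor[chan * numItems + idx] =
                interleaved[idx * numChans + chan];
    return channelMajor;
}
```

Note that this adds an extra pass over the data; sources that can fill a channel-major buffer directly avoid that cost.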