
[New Check] Entire column of Timeseries is NaN #229

Draft
wants to merge 13 commits into base: dev

Conversation

CodyCBakerPhD
Contributor

Solves half of #155

Only applies to 1D and 2D TimeSeries data; any higher than that and the multi-axis structure could itself contain information. E.g., when writing an ImageSeries (frames x width x height x 3 color channels), if a single (x, y) pixel is NaN across all frames and color channels, you would still want to keep it in the dataset to maintain the video structure.

But for 1D data, if the entire data is NaN, I can't think of a reason why this data would even have been written to the NWBFile.

Likewise for 2D data, the suggestion would be, for example on an ElectricalSeries, to simply not write those channels that never contain any actual data.

Note that it is fairly common to find blocks of NaN data interspersed within a TimeSeries; the subframe_selection logic therefore attempts to sample an even spread of nelems values over the full range of frames.
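For concreteness, a minimal sketch of the sampling-plus-check idea described above (the function name and default nelems are illustrative, not necessarily this PR's final implementation):

import numpy as np

def check_timeseries_axis_all_nan(data, nelems=200):
    """Illustrative sketch: flag 1D/2D TimeSeries data whose entire column is NaN."""
    n_frames = data.shape[0]
    # Spread nelems sample indices evenly over the full range of frames,
    # so interspersed NaN blocks are less likely to fool the check.
    subframe_selection = np.unique(
        np.round(np.linspace(start=0, stop=n_frames - 1, num=nelems)).astype(int)
    )
    if data.ndim == 1:
        return bool(np.all(np.isnan(data[subframe_selection])))
    if data.ndim == 2:
        return [
            col for col in range(data.shape[1])
            if np.all(np.isnan(data[subframe_selection, col]))
        ]
    # Higher-dimensional data is skipped on purpose (see the ImageSeries example above).
    return None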

@CodyCBakerPhD CodyCBakerPhD added the category: new check a new best practices check to apply to all NWBFiles and their contents label Jul 12, 2022
@CodyCBakerPhD CodyCBakerPhD requested a review from bendichter July 12, 2022 15:52
@CodyCBakerPhD CodyCBakerPhD self-assigned this Jul 12, 2022
@CodyCBakerPhD CodyCBakerPhD marked this pull request as draft July 13, 2022 19:59
@CodyCBakerPhD
Contributor Author

Just going to wait until #231 is merged, then I can refactor this to use the util function and the nelems=None functionality.

@codecov-commenter

Codecov Report

Merging #229 (23d2e6a) into dev (2ee62e2) will increase coverage by 0.13%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##              dev     #229      +/-   ##
==========================================
+ Coverage   94.87%   95.01%   +0.13%     
==========================================
  Files          17       17              
  Lines         936      962      +26     
==========================================
+ Hits          888      914      +26     
  Misses         48       48              
Flag        Coverage Δ
unittests   95.01% <100.00%> (+0.13%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                       Coverage Δ
nwbinspector/checks/tables.py        96.59% <100.00%> (+0.38%) ⬆️
nwbinspector/checks/time_series.py   98.03% <100.00%> (+0.98%) ⬆️

Comment on lines 104 to 106
subframe_selection = np.unique(
    np.round(np.linspace(start=0, stop=time_series.data.shape[0] - 1, num=nelems)).astype(int)
)
Contributor

Looks like you could move this to line 110

Contributor Author

Now using the logic we devised in the tables PR (direct slicing with a 'by' amount calculated from the shape).
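A minimal sketch of what "direct slicing with a 'by' amount calculated from shape" could look like (the helper name and rounding here are assumptions, not the exact util function from the tables PR):

import numpy as np

def sample_rows_by_stride(data, nelems=200):
    # Compute a stride ('by') so that roughly nelems rows are read as one
    # regular strided slice, instead of fancy-indexing scattered row indices.
    by = max(1, data.shape[0] // nelems)
    return data[::by]

# Example usage: columns that are NaN in every sampled row
# all_nan_columns = np.all(np.isnan(sample_rows_by_stride(time_series.data)), axis=0)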

)
)
elif n_dims == 2:
    for col in range(time_series.data.shape[1]):
Contributor

what if there are 1000s of columns?

Contributor Author
@CodyCBakerPhD Aug 24, 2022

3 SpikeGLX probes would hit that amount... Not that I've seen anyone do more than 2 currently, but we can think towards the future.

Could be faster if we did the np.isnan calculation vectorized over the entire extracted array of [subselected_rows, all_cols]...

Same case could be made for the check_table_cols_nan for DynamicTables (could be thousands of columns there that we iterate over), but I get that the TimeSeries data is likely to be much larger.

If you're suggesting we subselect over columns as well here, I'd just point out that this could ultimately reduce accuracy when only a small number of channels/ROIs contain nothing but NaN at every time point - the likelihood of those offending columns being selected may not be very high. So I'd actually prefer limiting the nelems of the row selection, examining fewer data values per column instead...

Now, that's aside from the other idea I have for pulling data chunk-wise whenever possible and fully utilizing each chunk's data (described below).
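A minimal sketch of that vectorized idea, assuming a 2D (frames x channels) dataset and that the row-subselected block fits comfortably in memory (the names are illustrative):

import numpy as np

def find_all_nan_columns(data, subframe_selection):
    # Pull [subselected_rows, all_cols] in a single read, then run np.isnan
    # vectorized over the whole 2D block instead of looping per column.
    sampled = data[subframe_selection, :]
    all_nan_mask = np.all(np.isnan(sampled), axis=0)  # one boolean per column
    return np.where(all_nan_mask)[0]                  # indices of all-NaN columns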

if not all(np.isnan(time_series.data[:nelems, col]).flatten()):
    continue

if all(np.isnan(time_series.data[subframe_selection, col]).flatten()):
Contributor

Is there a way of modifying this code to minimize the number of h5py dataset reads?

Contributor Author

Yeah, we could take the approach of pulling out and checking data in packets of 'chunks' (would be much more efficient for streaming).

In the worst case, as it currently stands, the slicing might span nelems total chunks for each column; those might be cached for a bit on the HDF5 side, but that's unlikely to be much benefit at normal data scales. Either way it is definitely a bit inefficient.

One idea could be to check if the time_series being read is chunked, and if it is, alter our interpretation of nelems from spanning nelems individual data values to spanning some smaller number of chunks throughout the dataset, but utilize all of the data retrieved from those chunks.

This is also all separate from the first early exit condition of this entire check, which only examines the first nelems consecutive data values, which are fairly unlikely to span multiple chunks unless the chunk_shape is something extreme like by-frame for all channels.

I can think more about the total data usage that this function might require and report some approximate numbers. But yeah, we might need to be a bit more restrictive than usual with the nelems subsetting; the interaction with chunking could give some poor performance in practice.
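A minimal sketch of the chunk-aware idea, assuming a 2D h5py-backed dataset (the number of chunks sampled and all names here are illustrative, not a settled design):

import numpy as np

def all_nan_columns_chunkwise(dataset, max_chunks=5):
    # Read a few whole chunks spread across the dataset and use every value in
    # them, rather than scattering many single-row reads across chunk boundaries.
    chunk_rows = dataset.chunks[0] if dataset.chunks else dataset.shape[0]
    n_chunks = max(1, dataset.shape[0] // chunk_rows)
    chunk_ids = np.unique(
        np.linspace(0, n_chunks - 1, num=min(max_chunks, n_chunks)).astype(int)
    )

    all_nan = np.ones(dataset.shape[1], dtype=bool)
    for chunk_id in chunk_ids:
        start = chunk_id * chunk_rows
        block = dataset[start:start + chunk_rows, :]  # one contiguous, chunk-aligned read
        all_nan &= np.all(np.isnan(block), axis=0)
        if not all_nan.any():
            break  # early exit: no column can still be all-NaN
    return np.where(all_nan)[0]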

Labels
category: new check a new best practices check to apply to all NWBFiles and their contents
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Add Check]: entire col/row of table is not NaN, and entire axis of TimeSeries.data is not NaN
3 participants