[New Check] Entire column of a table is not NaN #231

CodyCBakerPhD · 2022-07-13T14:35:08Z

Along with #229, finishes #155

Whenever an entire column of a table is NaN, this check informs the user that the column probably does not need to be included. It samples nelems entries uniformly across the length of the column, rather than reading only the first nelems, since it's quite possible for a block of consecutive entries (even nelems of them) to be purposefully, and even informatively, NaN.

I'm only giving it a SUGGESTION level of importance, since the sampling can miss the non-NaN values in a large column where they are sparse (a false positive), and an all-NaN column could also be a deliberate design choice by the user.
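The sampling idea described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the function name `sampled_column_is_all_nan` and the default `nelems=200` are assumptions, and the column is treated as a plain in-memory float array.

```python
import numpy as np


def sampled_column_is_all_nan(column, nelems=200):
    """Return True if up to nelems uniformly spaced entries of column are all NaN.

    Hypothetical helper for illustration; the name and default are assumptions.
    """
    column = np.asarray(column, dtype=float)
    if column.shape[0] == 0:
        return False
    # Evenly spaced positions from the first to the last entry, de-duplicated
    # so that short columns are not indexed repeatedly.
    indices = np.unique(np.round(np.linspace(0, column.shape[0] - 1, num=nelems)).astype(int))
    return bool(np.all(np.isnan(column[indices])))


# A column whose first half is NaN but whose second half holds real values.
half_nan = np.full(1000, np.nan)
half_nan[500:] = 1.0

# A column with a single non-NaN value that uniform sampling does not land on.
sparse = np.full(1000, np.nan)
sparse[1] = 1.0
```

With `nelems=10`, checking only `half_nan[:10]` would see nothing but NaN, while the uniform sample reaches into the second half and reports that the column is not all NaN. The `sparse` column shows the limitation mentioned above: the sample skips index 1, so the column is flagged even though it holds one real value.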


bendichter · 2022-07-13T14:54:49Z

nwbinspector/checks/tables.py

+        subindex_selection = np.unique(np.round(np.linspace(start=0, stop=column.shape[0] - 1, num=nelems)).astype(int))
+        if np.any(~np.isnan(column[subindex_selection])):
+            continue
+        else:
+            yield InspectorMessage(
+                message=f"Column {column.name} has all NaN values. Consider removing it from the table."
+            )


Suggested change

-        subindex_selection = np.unique(np.round(np.linspace(start=0, stop=column.shape[0] - 1, num=nelems)).astype(int))
-        if np.any(~np.isnan(column[subindex_selection])):
-            continue
-        else:
-            yield InspectorMessage(
-                message=f"Column {column.name} has all NaN values. Consider removing it from the table."
-            )
+        if np.all(np.isnan(column[:nelems])):
+            yield InspectorMessage(
+                message=f"Column {column.name} has all NaN values. Consider removing it from the table."
+            )

It's written this way in case np.any has an early return condition once it detects a single non-NaN value

np.all can have the same early stopping

Well, interesting results...

Indeed, as you say, all does have early stopping just like any...

    from time import time

    nelems = 1000000
    array1 = [True for _ in range(nelems)]
    array1 += [False]
    array2 = [False]
    array2 += [True for _ in range(nelems)]

    start = time()
    all(array1)
    runtime = time() - start
    print(f"Took {runtime}s!")

    start = time()
    all(array2)
    runtime = time() - start
    print(f"Took {runtime}s!")
    ----------------------------------
    Took 0.006882190704345703s!
    Took 3.719329833984375e-05s!

... however, numpy does not appear to stop early (likely because it's vectorized). Maybe that gives better performance for very large in-memory arrays, but for sparse inputs the early return from the Python built-ins might be nicer.

    from time import time

    import numpy as np

    nelems = 1000000
    array1 = [True for _ in range(nelems)]
    array1 += [False]
    array2 = [False]
    array2 += [True for _ in range(nelems)]

    start = time()
    np.all(array1)
    runtime = time() - start
    print(f"Took {runtime}s!")

    start = time()
    np.all(array2)
    runtime = time() - start
    print(f"Took {runtime}s!")
    ----------------------------------
    Took 0.05324196815490723s!
    Took 0.05084705352783203s!
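One caveat when reading the numpy timings above: np.all is being handed Python lists, so each call also pays for converting the list to an array, which likely accounts for much of the roughly 0.05s in both cases. A sketch of the same comparison with the arrays built up front is given below; `time_call` is a hypothetical helper, and no timings are quoted because they depend on the machine and NumPy version.

```python
from timeit import default_timer as timer

import numpy as np

nelems = 1_000_000

# Build ndarrays up front so the timings exclude list-to-array conversion.
false_last = np.ones(nelems + 1, dtype=bool)
false_last[-1] = False
false_first = np.ones(nelems + 1, dtype=bool)
false_first[0] = False


def time_call(func, data, repeats=5):
    """Return (result, best wall-clock seconds) over a few repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = timer()
        result = func(data)
        best = min(best, timer() - start)
    return bool(result), best


for data_label, data in [("False last", false_last), ("False first", false_first)]:
    for func_label, func in [("np.all", np.all), ("all", all)]:
        result, seconds = time_call(func, data)
        print(f"{data_label:>11} | {func_label:>6}: {result} in {seconds:.6f}s")
```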

By sparse do you mean short?

Either approach is fine with me, either using all or np.all.

I guess 'sparsity' (the percentage of False items being passed into all) isn't really the relevant quantity. I just mean that in any case where at least one False element occurs somewhat early in the input to all (as opposed to later in the vector, as seen above), it will exit a few ms faster per check, which could add up to seconds when streaming over an entire dataset.
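The early exit described here can be made concrete without timing noise by counting how many elements the built-in all actually inspects. A minimal sketch; `count_inspected` is a hypothetical helper written for this illustration, not part of nwbinspector:

```python
import math


def count_inspected(values):
    """Return (all_nan, inspected): whether every value is NaN, and how many
    values the built-in all() consumed before it could decide."""
    inspected = 0

    def nan_flags():
        nonlocal inspected
        for value in values:
            inspected += 1
            yield math.isnan(value)

    # all() stops pulling from the generator at the first False it sees.
    result = all(nan_flags())
    return result, inspected


nan = float("nan")
early = [1.0] + [nan] * 1000  # non-NaN value at the very start
late = [nan] * 1000 + [1.0]   # non-NaN value at the very end

print(count_inspected(early))
print(count_inspected(late))
```

When the non-NaN value comes first, only one element is inspected; when it comes last, every element has to be read before the answer is known.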

Anyway, I implemented the all logic here.


bendichter · 2022-07-13T19:18:43Z

The idea is if nelems = None then it reads all the data. Could you make that true for this usage mode too?


CodyCBakerPhD · 2022-07-13T19:57:49Z

The idea is if nelems = None then it reads all the data. Could you make that true for this usage mode too?

Done
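As a small illustration of why nelems = None can mean "read everything": in Python, slicing with a stop of None returns the whole sequence, so a check written in the style of the `column[:nelems]` suggestion handles both modes with the same expression. The helper name `head_is_all_nan` is an assumption for this sketch, not the function added in this PR.

```python
import numpy as np


def head_is_all_nan(column, nelems=None):
    """Check whether the first nelems entries are all NaN.

    With nelems=None the slice column[:None] covers the entire column.
    """
    head = np.asarray(column[:nelems], dtype=float)
    # An empty selection is not treated as "all NaN".
    return head.size > 0 and bool(np.all(np.isnan(head)))


nan = float("nan")
column = [nan, nan, 1.0]
print(head_is_all_nan(column, nelems=2))  # only the first two entries are read
print(head_is_all_nan(column))            # nelems=None reads all three entries
```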


bendichter · 2022-07-19T12:54:45Z

how would you feel about moving this one line into the check instead of making it its own function?

codecov-commenter · 2022-07-19T13:00:56Z

Codecov Report

Merging #231 (ecc129d) into dev (2ee62e2) will increase coverage by 0.06%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              dev     #231      +/-   ##
==========================================
+ Coverage   94.87%   94.93%   +0.06%     
==========================================
  Files          17       17              
  Lines         936      948      +12     
==========================================
+ Hits          888      900      +12     
  Misses         48       48

| Flag | Coverage Δ | |
|---|---|---|
| unittests | `94.93% <100.00%> (+0.06%)` | ⬆️ |


| Impacted Files | Coverage Δ | |
|---|---|---|
| nwbinspector/checks/tables.py | `96.62% <100.00%> (+0.42%)` | ⬆️ |
| nwbinspector/utils.py | `89.36% <100.00%> (+0.23%)` | ⬆️ |


CodyCBakerPhD · 2022-07-19T13:57:22Z

how would you feel about moving this one line into the check instead of making it its own function?

Yeah, that's fine since it's so pared down now. The only complicated bit is the logic for determining the 'by' of the slice, but that should be readable enough as is. Thanks for helping make that so much simpler!
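For illustration only, one way the 'by' of such a slice could be determined is sketched below. The function name `strided_sample_slice` and the ceiling-division rule are assumptions made for this example; they are not taken from the merged nwbinspector code.

```python
import math

import numpy as np


def strided_sample_slice(length, nelems=None):
    """Return a slice selecting at most nelems evenly strided entries.

    nelems=None (or nelems >= length) selects everything. Hypothetical helper.
    """
    if nelems is None or nelems >= length:
        return slice(None)
    if nelems <= 0:
        raise ValueError("nelems must be a positive integer or None")
    # Rounding the stride up guarantees that no more than nelems entries are read.
    by = math.ceil(length / nelems)
    return slice(0, None, by)


column = np.arange(1000, dtype=float)
sample = column[strided_sample_slice(column.shape[0], nelems=10)]
print(sample)
```

Returning a plain slice rather than an index array keeps the selection expressible as ordinary `column[start:stop:step]` indexing.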

Cody Baker added 2 commits July 13, 2022 09:25

saving state

abbe8f1

added test and debug

2a6516b

CodyCBakerPhD added the "category: new check" label (a new best practices check to apply to all NWBFiles and their contents) Jul 13, 2022

CodyCBakerPhD requested a review from bendichter July 13, 2022 14:35

CodyCBakerPhD self-assigned this Jul 13, 2022

[pre-commit.ci] auto fixes from pre-commit.com hooks

a322c04

for more information, see https://pre-commit.ci

bendichter reviewed Jul 13, 2022

nwbinspector/checks/tables.py (outdated, resolved)

bendichter reviewed Jul 13, 2022


Cody Baker added 2 commits July 13, 2022 12:13

refactor logic call

db1b426

Merge branch 'check_table_col_not_nan' of https://github.com/neurodat…

28fc042

…awithoutborders/nwbinspector into check_table_col_not_nan

CodyCBakerPhD changed the title ~~[New Check] Entire column of a table is not nan~~ [New Check] Entire column of a table is not NaN Jul 13, 2022

Cody Baker and others added 4 commits July 13, 2022 13:46

add early data access skip

29a171d

[pre-commit.ci] auto fixes from pre-commit.com hooks

2ac54f8

for more information, see https://pre-commit.ci

add flatten for indexed cols

73d4978

Merge branch 'check_table_col_not_nan' of https://github.com/neurodat…

c72a095

…awithoutborders/nwbinspector into check_table_col_not_nan

Cody Baker and others added 2 commits July 13, 2022 15:57

generalized to util function; added None slicing

48c5e60

[pre-commit.ci] auto fixes from pre-commit.com hooks

b67f2c8

for more information, see https://pre-commit.ci

CodyCBakerPhD mentioned this pull request Jul 13, 2022

[New Check] Entire column of Timeseries is NaN #229

Draft

Merge branch 'dev' into check_table_col_not_nan

e5f2618

bendichter reviewed Jul 18, 2022

nwbinspector/utils.py (outdated, resolved)

CodyCBakerPhD and others added 4 commits July 18, 2022 15:09

swapped util to return only slice

75a687e

[pre-commit.ci] auto fixes from pre-commit.com hooks

62350db

for more information, see https://pre-commit.ci

Merge branch 'dev' into check_table_col_not_nan

fd35f34

Merge branch 'dev' into check_table_col_not_nan

50f81ac

bendichter reviewed Jul 18, 2022

nwbinspector/utils.py (outdated, resolved)

CodyCBakerPhD and others added 2 commits July 18, 2022 16:40

Merge branch 'dev' into check_table_col_not_nan

44cec05

Update nwbinspector/utils.py

1e83c6b

Co-authored-by: Ben Dichter

CodyCBakerPhD requested a review from bendichter July 18, 2022 21:07

Merge branch 'dev' into check_table_col_not_nan

ecc129d

bendichter reviewed Jul 19, 2022

nwbinspector/checks/tables.py (outdated, resolved)

bendichter reviewed Jul 19, 2022

nwbinspector/utils.py (outdated, resolved)

bendichter reviewed Jul 19, 2022

nwbinspector/checks/tables.py (outdated, resolved)

CodyCBakerPhD and others added 3 commits July 19, 2022 09:56

Update nwbinspector/checks/tables.py

09b0474

Co-authored-by: Ben Dichter

Update nwbinspector/utils.py

10c2073

Co-authored-by: Ben Dichter

Update nwbinspector/checks/tables.py

41056ea

Co-authored-by: Ben Dichter

debug

8def458

CodyCBakerPhD requested a review from bendichter July 19, 2022 14:36

bendichter approved these changes Jul 20, 2022


bendichter merged commit c85cb0a into dev Jul 20, 2022

CodyCBakerPhD deleted the check_table_col_not_nan branch July 20, 2022 17:19
