-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[New Check] Entire column of a table is not NaN #231
Conversation
for more information, see https://pre-commit.ci
nwbinspector/checks/tables.py
Outdated
subindex_selection = np.unique(np.round(np.linspace(start=0, stop=column.shape[0] - 1, num=nelems)).astype(int)) | ||
if np.any(~np.isnan(column[subindex_selection])): | ||
continue | ||
else: | ||
yield InspectorMessage( | ||
message=f"Column {column.name} has all NaN values. Consider removing it from the table." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
subindex_selection = np.unique(np.round(np.linspace(start=0, stop=column.shape[0] - 1, num=nelems)).astype(int)) | |
if np.any(~np.isnan(column[subindex_selection])): | |
continue | |
else: | |
yield InspectorMessage( | |
message=f"Column {column.name} has all NaN values. Consider removing it from the table." | |
) | |
if np.all(np.isnan(column[:nelems])): | |
yield InspectorMessage( | |
message=f"Column {column.name} has all NaN values. Consider removing it from the table." | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's written this way in case np.any
has an early return condition once it detects a single non-NaN
value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.all
can have the same early stopping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, interesting results...
Indeed as you say, all
does indeed have early stopping just like any
...
nelems = 1000000
array1 = [True for _ in range(nelems)]
array1 += [False]
array2 = [False]
array2 += [True for _ in range(nelems)]
start = time()
all(array1)
runtime = time() - start
print(f"Took {runtime}s!")
start = time()
all(array2)
runtime = time() - start
print(f"Took {runtime}s!")
----------------------------------
Took 0.006882190704345703s!
Took 3.719329833984375e-05s!
... however numpy
does not appear to (likely because it's vectorized). Maybe that gives better performance for very large in-memory arrays, but for sparse inputs the early return from Python built-ins might be nicer.
nelems = 1000000
array1 = [True for _ in range(nelems)]
array1 += [False]
array2 = [False]
array2 += [True for _ in range(nelems)]
start = time()
np.all(array1)
runtime = time() - start
print(f"Took {runtime}s!")
start = time()
np.all(array2)
runtime = time() - start
print(f"Took {runtime}s!")
----------------------------------
Took 0.05324196815490723s!
Took 0.05084705352783203s!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By sparse do you mean short?
Either approach is fine with me, either using all
or np.all
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess 'sparsity' (% of the number of False
items being passed into all
) isn't as relevant - I just mean for any case where there is at least one False
element occurring somewhat early on in the input to all
(as opposed to occurring later in the vector, as seen above), it will exit a few ~ms
faster per check (which could add up to seconds when streaming over an entire dataset).
Anyway, I implemented the all
logic here.
…awithoutborders/nwbinspector into check_table_col_not_nan
for more information, see https://pre-commit.ci
…awithoutborders/nwbinspector into check_table_col_not_nan
The idea is if nelems = None then it reads all the data. Could you make that true for this usage mode too? |
Done |
Co-authored-by: Ben Dichter <[email protected]>
how would you feel about moving this one line into the check instead of making it its own function? |
Codecov Report
@@ Coverage Diff @@
## dev #231 +/- ##
==========================================
+ Coverage 94.87% 94.93% +0.06%
==========================================
Files 17 17
Lines 936 948 +12
==========================================
+ Hits 888 900 +12
Misses 48 48
Flags with carried forward coverage won't be shown. Click here to find out more.
|
Co-authored-by: Ben Dichter <[email protected]>
Co-authored-by: Ben Dichter <[email protected]>
Co-authored-by: Ben Dichter <[email protected]>
Yeah that's fine since it's so parred down now, the only complicated bit is the logic for determining the 'by' of the slice but that should be readable enough as is - thanks for helping make that so much simpler! |
Along with #229, finishes #155
Whenever an entire column of a table is
NaN
, informs the user that there probably isn't a need to include it. Samples thenelems
uniformly across the length of the column since it's quite possible to have a block of consecutive entries (evennelems
many) that are purposefully and even informativelyNaN
.Only giving it a
SUGGESTION
level of importance since this will fail for large columns of sparse non-NaN
, and also could be a design choice of the user.