Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[New Check] Entire column of a table is not NaN #231

Merged
merged 23 commits into from
Jul 20, 2022

Conversation

CodyCBakerPhD
Copy link
Contributor

Along with #229, finishes #155

Whenever an entire column of a table is NaN, informs the user that there probably isn't a need to include it. Samples the nelems uniformly across the length of the column since it's quite possible to have a block of consecutive entries (even nelems many) that are purposefully and even informatively NaN.

Only giving it a SUGGESTION level of importance since this will fail for large columns of sparse non-NaN, and also could be a design choice of the user.

@CodyCBakerPhD CodyCBakerPhD added the category: new check a new best practices check to apply to all NWBFiles and their contents label Jul 13, 2022
@CodyCBakerPhD CodyCBakerPhD requested a review from bendichter July 13, 2022 14:35
@CodyCBakerPhD CodyCBakerPhD self-assigned this Jul 13, 2022
Comment on lines 194 to 200
subindex_selection = np.unique(np.round(np.linspace(start=0, stop=column.shape[0] - 1, num=nelems)).astype(int))
if np.any(~np.isnan(column[subindex_selection])):
continue
else:
yield InspectorMessage(
message=f"Column {column.name} has all NaN values. Consider removing it from the table."
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
subindex_selection = np.unique(np.round(np.linspace(start=0, stop=column.shape[0] - 1, num=nelems)).astype(int))
if np.any(~np.isnan(column[subindex_selection])):
continue
else:
yield InspectorMessage(
message=f"Column {column.name} has all NaN values. Consider removing it from the table."
)
if np.all(np.isnan(column[:nelems])):
yield InspectorMessage(
message=f"Column {column.name} has all NaN values. Consider removing it from the table."
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's written this way in case np.any has an early return condition once it detects a single non-NaN value

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.all can have the same early stopping

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, interesting results...

Indeed as you say, all does indeed have early stopping just like any...

nelems = 1000000
array1 = [True for _ in range(nelems)]
array1 += [False]
array2 = [False]
array2 += [True for _ in range(nelems)]
start = time()
all(array1)
runtime = time() - start
print(f"Took {runtime}s!")
start = time()
all(array2)
runtime = time() - start
print(f"Took {runtime}s!")

----------------------------------

Took 0.006882190704345703s!
Took 3.719329833984375e-05s!

... however numpy does not appear to (likely because it's vectorized). Maybe that gives better performance for very large in-memory arrays, but for sparse inputs the early return from Python built-ins might be nicer.

nelems = 1000000
array1 = [True for _ in range(nelems)]
array1 += [False]
array2 = [False]
array2 += [True for _ in range(nelems)]
start = time()
np.all(array1)
runtime = time() - start
print(f"Took {runtime}s!")
start = time()
np.all(array2)
runtime = time() - start
print(f"Took {runtime}s!") 


----------------------------------

Took 0.05324196815490723s!
Took 0.05084705352783203s!

Copy link
Contributor

@bendichter bendichter Jul 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By sparse do you mean short?

Either approach is fine with me, either using all or np.all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess 'sparsity' (% of the number of False items being passed into all) isn't as relevant - I just mean for any case where there is at least one False element occurring somewhat early on in the input to all (as opposed to occurring later in the vector, as seen above), it will exit a few ~ms faster per check (which could add up to seconds when streaming over an entire dataset).

Anyway, I implemented the all logic here.

@CodyCBakerPhD CodyCBakerPhD changed the title [New Check] Entire column of a table is not nan [New Check] Entire column of a table is not NaN Jul 13, 2022
@bendichter
Copy link
Contributor

The idea is if nelems = None then it reads all the data. Could you make that true for this usage mode too?

@CodyCBakerPhD
Copy link
Contributor Author

The idea is if nelems = None then it reads all the data. Could you make that true for this usage mode too?

Done

nwbinspector/utils.py Outdated Show resolved Hide resolved
nwbinspector/utils.py Outdated Show resolved Hide resolved
@CodyCBakerPhD CodyCBakerPhD requested a review from bendichter July 18, 2022 21:07
nwbinspector/utils.py Outdated Show resolved Hide resolved
@bendichter
Copy link
Contributor

how would you feel about moving this one line into the check instead of making it its own function?

@codecov-commenter
Copy link

Codecov Report

Merging #231 (ecc129d) into dev (2ee62e2) will increase coverage by 0.06%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##              dev     #231      +/-   ##
==========================================
+ Coverage   94.87%   94.93%   +0.06%     
==========================================
  Files          17       17              
  Lines         936      948      +12     
==========================================
+ Hits          888      900      +12     
  Misses         48       48              
Flag Coverage Δ
unittests 94.93% <100.00%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
nwbinspector/checks/tables.py 96.62% <100.00%> (+0.42%) ⬆️
nwbinspector/utils.py 89.36% <100.00%> (+0.23%) ⬆️

@CodyCBakerPhD
Copy link
Contributor Author

how would you feel about moving this one line into the check instead of making it its own function?

Yeah that's fine since it's so parred down now, the only complicated bit is the logic for determining the 'by' of the slice but that should be readable enough as is - thanks for helping make that so much simpler!

@CodyCBakerPhD CodyCBakerPhD requested a review from bendichter July 19, 2022 14:36
@bendichter bendichter merged commit c85cb0a into dev Jul 20, 2022
@CodyCBakerPhD CodyCBakerPhD deleted the check_table_col_not_nan branch July 20, 2022 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: new check a new best practices check to apply to all NWBFiles and their contents
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants