RemoveNaNColumns - compute nans in advance #6532

Merged
merged 1 commit into from
Aug 16, 2023

Conversation

noahnovsak

The current implementation works fine for numpy, but because dask is lazy, it iterates over the entire table multiple times (once for each attribute).
On a 3000 by 3000 example dataset this slowed logistic regression down by a factor of 10: without preprocessing it runs in ~500 ms, with preprocessing it took ~5 s.

Solution: compute the NaNs once in advance and store them in memory.
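
For illustration only, here is a minimal sketch of the idea, not the code from this PR. The helper names `nan_fraction_per_column` and `columns_to_keep` and the `threshold` parameter are hypothetical. The NaN fractions for all columns are computed in one pass and materialized once, so the later per-attribute checks read from an in-memory numpy array instead of triggering a new pass over a lazy dask array.

```python
import numpy as np


def nan_fraction_per_column(X):
    """Return the fraction of NaNs in each column as an in-memory numpy array.

    Hypothetical helper: X may be a numpy array or a lazy (dask-like) array.
    """
    # One vectorized reduction over the whole table instead of one per column.
    frac = np.isnan(X).mean(axis=0)
    # A lazy array only describes the computation; materialize it exactly once.
    if hasattr(frac, "compute"):
        frac = frac.compute()
    return np.asarray(frac)


def columns_to_keep(X, threshold=0.5):
    """Indices of columns whose NaN fraction is below `threshold` (assumed rule)."""
    frac = nan_fraction_per_column(X)
    # From here on, every per-column check is a cheap in-memory lookup.
    return [i for i, f in enumerate(frac) if f < threshold]


X = np.array([
    [1.0, np.nan, 3.0],
    [4.0, np.nan, np.nan],
    [7.0, 8.0, 9.0],
    [1.0, np.nan, 2.0],
])
kept = columns_to_keep(X, threshold=0.5)
```

With a plain numpy array the `compute` branch is skipped, so the same function covers both backends.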

@noahnovsak noahnovsak added the dask Related (discovered in or needed) to the Dask adaptation label Aug 16, 2023

codecov bot commented Aug 16, 2023

Codecov Report

Merging #6532 (e0a3988) into dask (c1aa47f) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##             dask    #6532      +/-   ##
==========================================
- Coverage   87.70%   87.70%   -0.01%     
==========================================
  Files         322      322              
  Lines       70045    70045              
==========================================
- Hits        61432    61430       -2     
- Misses       8613     8615       +2     

@markotoplak markotoplak merged commit 5ae6389 into biolab:dask Aug 16, 2023
14 of 22 checks passed
@noahnovsak noahnovsak deleted the dask-fix-removenancolumns branch August 16, 2023 12:45
markotoplak added commits that referenced this pull request, each titled "RemoveNaNColumns - compute nans in advance", on Aug 17, Aug 21, Sep 4, Sep 14, Sep 18, Sep 26, Oct 10, Oct 13, Oct 21, Oct 29, and Nov 6, 2023, and on Jan 23, 2024.
markotoplak added a commit to markotoplak/orange3 that referenced this pull request Sep 14, 2023
Labels
dask Related (discovered in or needed) to the Dask adaptation
2 participants