added TfidfTransformer and TfidfVectorizer to feature_extraction.text #869

Open · wants to merge 8 commits into main

Conversation

ParticularMiner

Hi,

Thanks to all dask-developers for your outstanding work!

In this PR, I have attempted to apply my rudimentary knowledge of Dask to add implementations of TfidfTransformer and TfidfVectorizer (found in the sklearn.feature_extraction.text module) to the dask_ml.feature_extraction.text module. For now, only minimal working code is available (no unit tests yet), though the examples I've hard-coded into the docstrings should run without incident.

I think it requires someone with proper Dask expertise to inspect it and give me some pointers. In the meantime, I'll draw up the tests.

Hopefully, this will prove to be a useful extension to dask-ml; at least it would be for me, if it is eventually merged upstream.

@@ -12,9 +12,13 @@
import scipy.sparse
import sklearn.base
import sklearn.feature_extraction.text
import sklearn.preprocessing
Author

This import is needed for its normalize() function.

Member

Does sklearn.preprocessing.normalize eagerly return a NumPy array? Or does it operate lazily on Dask arrays?

If it's eager, we would need to reimplement it.

Author

Since I use sklearn.preprocessing.normalize() only through dask.array.Array.map_blocks(), the operation is lazy even though sklearn.preprocessing.normalize() itself is not.

I hope this is acceptable.
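
For illustration, the pattern in question is roughly the following (a minimal sketch with made-up shapes, not code from the PR):

```python
import dask.array as da
from sklearn.preprocessing import normalize

# Row-wise L2 normalization applied lazily, one block at a time.
# This is valid here because each chunk spans whole rows, so no row
# is ever split across blocks.
X = da.random.random((10_000, 20), chunks=(1_000, 20))
X_normed = X.map_blocks(normalize, norm="l2")  # still lazy
X_normed.compute()  # computation happens only here
```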

params = self.get_params()
subclass_instance_params = self.get_params()
excluded_keys = getattr(self, '_non_CountVectorizer_params', [])
params = {key: subclass_instance_params[key]
Author

This is my "patch-up" solution to make params hold only the parameters of CountVectorizer and not those of its subclasses.

Member

I'm a bit worried about a parent class needing to know about the details of its subclasses.

Is it possible for each subclass to override get_params to do the right thing?

--------
sklearn.feature_extraction.text.TfidfTransformer

Examples
Author

These examples have worked for me.
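
The examples themselves aren't reproduced in this excerpt. A minimal sketch of the kind of usage meant here, assuming the API mirrors sklearn.feature_extraction.text.TfidfTransformer:

```python
import dask.bag as db
from dask_ml.feature_extraction.text import CountVectorizer

# TfidfTransformer is the class proposed in this PR; the call pattern
# below is an assumption based on the scikit-learn equivalent.
from dask_ml.feature_extraction.text import TfidfTransformer

corpus = db.from_sequence(
    ["this is the first document", "this document is the second document"],
    npartitions=2,
)
counts = CountVectorizer().fit_transform(corpus)  # lazy array of counts
tfidf = TfidfTransformer().fit_transform(counts)  # lazy tf-idf weights
```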

--------
sklearn.feature_extraction.text.TfidfVectorizer

Examples
Author

These examples have worked for me.
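
Again, the examples are not reproduced here; a plausible sketch, assuming the API mirrors sklearn.feature_extraction.text.TfidfVectorizer (which works on raw documents directly):

```python
import dask.bag as db

# TfidfVectorizer is the class proposed in this PR.
from dask_ml.feature_extraction.text import TfidfVectorizer

corpus = db.from_sequence(
    ["the first document", "the second document is longer"],
    npartitions=2,
)
tfidf = TfidfVectorizer().fit_transform(corpus)  # lazy dask array
tfidf.compute()  # materializes the tf-idf matrix
```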

@ParticularMiner
Author

Tests have now been added.

@ParticularMiner
Author

I have also just now added support for dask.dataframe.Series to CountVectorizer and TfidfVectorizer.
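
For example (a hypothetical illustration of the new input type, not the PR's test code):

```python
import dask.dataframe as dd
import pandas as pd
from dask_ml.feature_extraction.text import CountVectorizer

# Documents held in a dask.dataframe.Series rather than a dask Bag;
# support for this input type is what this commit adds.
docs = dd.from_pandas(
    pd.Series(["the first document", "the second document"]),
    npartitions=2,
)
X = CountVectorizer().fit_transform(docs)  # lazy, as with a Bag
```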

Member

@TomAugspurger left a comment

Started to review; it'll be a while before I can finish.

Can you share a bit about:

  1. How does this scale for large inputs?
  2. Where does computation occur during initialization, fitting, and transforming, and why can't it be done lazily?

.gitignore Outdated
@@ -122,3 +122,5 @@ docs/source/auto_examples/
docs/source/examples/mydask.png

dask-worker-space
/.project
Member

I'd recommend putting this in a global gitignore file: https://stackoverflow.com/questions/7335420/global-git-ignore

Author

Many thanks for the recommendation. I was unaware of this trick.

aggregate=np.sum,
axis=0,
concatenate=False,
dtype=dtype).compute().astype(dtype)
Member

Why is the astype needed? Shouldn't passing dtype to reduction ensure it's already the right type?

Also, do we need to compute in this function, or can it be done lazily? (I haven't looked at how this is used yet.)

Author

You're right, I should have removed those.
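
For reference, the reduction can stay lazy and already carry the requested dtype (a generic sketch, not the PR's code):

```python
import dask.array as da
import numpy as np

X = da.random.random((1_000, 10), chunks=(100, 10))
col_sums = da.reduction(
    X,
    chunk=np.sum,      # applied to each block
    aggregate=np.sum,  # combines the per-block results
    axis=0,
    dtype=np.float64,  # output dtype is fixed here, so a trailing
)                      # .astype() is redundant
col_sums.dtype  # dtype('float64'), and no .compute() was triggered
```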

*vocabularies.to_delayed()
)
vocabulary = vocabulary_for_transform = (
_merge_vocabulary( *vocabularies.to_delayed() ))
Member

This seems like it'll cause a linting error. The contributing docs should have some info about setting up pre-commit.

Author

I'll check the contributing docs. Does this repo run GitHub Actions workflows that lint PRs? If so, that would make it easier to standardize the coding style.

result = raw_documents.map_partitions(
_count_vectorizer_transform, vocabulary_for_transform, params)
result = build_array(result, n_features, meta)
result.compute_chunk_sizes()
Member

Why is this necessary? Ideally we avoid all unnecessary computation.
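
For context, compute_chunk_sizes() executes the graph just to discover unknown chunk sizes (a generic sketch):

```python
import dask.array as da

x = da.arange(10, chunks=5)
y = x[x % 2 == 0]  # boolean mask -> chunk sizes become unknown
y.chunks           # ((nan, nan),), with no computation so far
y.compute_chunk_sizes()  # runs the graph solely to learn the sizes
```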

@ParticularMiner
Author

Hi @TomAugspurger

Thanks for your review.

From your comments I realize that the dask programming paradigm is to delay all computations until such a time that the user executes compute() outside of the class, right? I guess that's the challenge for me right now. I'll see what I can do to achieve this.

@TomAugspurger
Member

> …is to delay all computations until such a time that the user executes compute() outside of the class, right? I guess that's the challenge for me right now. I'll see what I can do to achieve this.

If possible, yes. But sometimes intermediate computation is inevitable. .fit will often require a compute to learn the parameters (e.g. StandardScaler); we just want to make sure it's actually required.

@ParticularMiner
Author

Hi @TomAugspurger

> If possible, yes. But sometimes intermediate computation is inevitable. .fit will often require a compute to learn the parameters (e.g. StandardScaler); we just want to make sure it's actually required.

True.

I've cleaned things up a bit now: all unnecessary calls to compute() have been removed, and TfidfTransformer's fit() function is now lazy. That is, it does not learn the parameters until the first call to TfidfTransformer's transform() function; thereafter, the learned data remains in memory and does not need to be computed again.

Also, TfidfTransformer's transform() function, and TfidfVectorizer's fit(), fit_transform(), and transform() functions, are all lazy.
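
A minimal sketch of the lazy-fit pattern I mean (hypothetical names; not the PR's actual code):

```python
import dask.array as da


class LazyTfidf:
    """fit() only builds a graph; the idf values are materialized
    (and then kept in memory) on the first call to transform()."""

    def fit(self, X):
        n_samples = X.shape[0]  # assumed known here
        df = (X > 0).sum(axis=0)  # lazy document frequencies
        # Smoothed idf, as in scikit-learn's TfidfTransformer.
        self._idf_graph = da.log((1 + n_samples) / (1 + df)) + 1
        self._idf = None
        return self

    def transform(self, X):
        if self._idf is None:
            # First call: compute the learned values and cache them
            # in memory so later calls do not recompute the graph.
            self._idf = self._idf_graph.persist()
        return X * self._idf
```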

Currently all tests are passing.

The one outstanding issue concerns the "worrying" side effect of using self.get_params() in CountVectorizer. I'm not sure why the original developers chose to use this function, since sklearn.feature_extraction.text.CountVectorizer's own fit_transform() and transform() do not use it, so there might be a way to circumvent it. I'll take a closer look.

@ParticularMiner
Author

@TomAugspurger

By the way, I do not have access to a cluster, so I'm not sure how the code scales with cluster size. I merely presumed that if I wrote code similar to that of the existing dask-ml CountVectorizer, things would be fine.

If you know of any way I can test the code in a truly distributed environment, kindly let me know.

@@ -166,10 +215,35 @@ class CountVectorizer(sklearn.feature_extraction.text.CountVectorizer):
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
"""

def get_CountVectorizer_params(self, deep=True):
Author

@TomAugspurger

> I'm a bit worried about a parent class needing to know about the details of its subclasses.
> Is it possible for each subclass to override get_params to do the right thing?

How about this? I've instead added a new method to CountVectorizer called .get_CountVectorizer_params(), whose implementation is a slight modification of the original .get_params() of the sklearn.base.BaseEstimator class, but which does what is expected. Subclasses do not need to override it. Moreover, CountVectorizer does not get to "know" the parameters of its subclasses. I hope this is acceptable.
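
A rough sketch of the idea (my paraphrase of the approach described above, not the PR's exact code):

```python
import inspect

import sklearn.feature_extraction.text


class CountVectorizer(sklearn.feature_extraction.text.CountVectorizer):
    def get_CountVectorizer_params(self, deep=True):
        # deep is accepted for signature compatibility with
        # get_params(); it is ignored in this sketch.
        #
        # Like BaseEstimator.get_params(), but introspect the base
        # CountVectorizer's __init__ rather than type(self).__init__,
        # so subclass-only parameters are never picked up.
        base = sklearn.feature_extraction.text.CountVectorizer
        names = [
            p.name
            for p in inspect.signature(base.__init__).parameters.values()
            if p.name != "self" and p.kind != p.VAR_KEYWORD
        ]
        return {name: getattr(self, name) for name in sorted(names)}
```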
