CountVectorizer example #160
Check out this pull request on ReviewNB: https://app.reviewnb.com/dask/dask-examples/pull/160
Review Jupyter notebook visual diffs & provide feedback on notebooks. Powered by ReviewNB
There is already this:
https://examples.dask.org/machine-learning/text-vectorization.html
Maybe we can roll this into that somehow?
Merged them into a single "Working with text data" notebook that starts by comparing different vectorizers (HashingVectorizer, CountVectorizer) and ends with the full pipeline.
" toolz.sliding_window(2, lengths)]\n",
"# Notice the persist here! More details later.\n",
"documents = db.from_delayed([load_news(x) for x in slices]).persist()\n",
"documents"
We could also call db.from_sequence(..., npartitions=10).persist() and then call client.rebalance().
Given that people are going to blindly copy-paste whatever we do anyway, I'd personally rather that they see this. It's a bit more in line with ordinary behavior, I think.
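For reference, a minimal sketch of that alternative. It assumes the call meant here is `db.from_sequence` (the `dask.bag` constructor for an in-memory sequence); the `load_documents` helper and the placeholder texts are hypothetical and not part of the notebook. The `wait` call is an addition so that the data is in memory before rebalancing.

```python
import dask.bag as db
from dask.distributed import Client, wait


def load_documents(client, texts, npartitions=10):
    """Build a bag from an in-memory sequence, persist it, then rebalance."""
    # from_sequence splits the sequence into roughly `npartitions` partitions.
    # .persist() runs on `client` because a Client registers itself as the
    # default scheduler when it is created.
    documents = db.from_sequence(texts, npartitions=npartitions).persist()
    # Rebalancing only moves data that is already in memory, so block until
    # the persisted partitions have finished computing.
    wait(documents)
    # Ask the scheduler to even out the stored data across workers.
    client.rebalance()
    return documents


if __name__ == "__main__":
    # Hypothetical stand-in for the real corpus.
    texts = [f"document number {i}" for i in range(100)]
    with Client(processes=False, dashboard_address=None) as client:
        documents = load_documents(client, texts)
        print(documents.npartitions, documents.count().compute())
```

On a single-worker cluster the rebalance step has nothing to move; it matters once the persisted partitions land unevenly across several workers.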
"import dask_ml.feature_extraction.text"
]
},
{
"cell_type": "code",
I recommend merging adjacent code cells, if only to cut down on Ctrl-Enter pressing.
"remote_vocabulary, = client.scatter([vocabulary], broadcast=True)\n",
"\n",
"vectorizer2 = dask_ml.feature_extraction.text.CountVectorizer(\n",
" vocabulary=remote_vocabulary\n",
")"
Is there a well-defined vocabulary that we can use somewhere? Maybe in nltk? I'm concerned that people will see this and think that they should copy the vocabulary off of one CountVectorizer and then pass it to another.
Also, do we need the scatter? Can you verify that if vocabulary is included directly in the vocabulary= keyword argument, it will occupy only a single task and not be in many of them?
I'm not sure about nltk, but it's probably not worth adding it to the environment just for this example. I noted that in practice you'd probably get the vocabulary from an external source.
I think that as of dask/dask-ml#719, the answer to your question about user-provided vocabulary being in one task is "yes". But that change hasn't been released yet.