CountVectorizer example #160

TomAugspurger · 2020-07-27T14:51:22Z

No description provided.

review-notebook-app · 2020-07-27T14:51:27Z

Check out this pull request on ReviewNB: https://app.reviewnb.com/dask/dask-examples/pull/160

Review Jupyter notebook visual diffs & provide feedback on notebooks.

Powered by ReviewNB

mrocklin · 2020-07-27T17:37:34Z

There is already this: https://examples.dask.org/machine-learning/text-vectorization.html Maybe we can roll this into that somehow?


TomAugspurger · 2020-07-27T19:27:40Z

Merged them into a single "Working with text data" notebook that starts by comparing different vectorizers (HashingVectorizer, CountVectorizer) and ends with the full pipeline.

mrocklin · 2020-07-28T00:06:31Z

machine-learning/text-count-vectorizer.ipynb

+    "          toolz.sliding_window(2, lengths)]\n",
+    "# Notice the persist here! More details later.\n",
+    "documents = db.from_delayed([load_news(x) for x in slices]).persist()\n",
+    "documents"


We could also call db.from_sequence(..., npartitions=10).persist() and then call client.rebalance().

Given that people are going to blindly copy-paste whatever we do anyway I'd personally rather that they see this. It's a bit more in line with ordinary behavior I think.
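For readers following the diff above, the `slices` it iterates over are adjacent boundary pairs. Here is a minimal standard-library sketch of that idea, assuming we want 10 roughly equal partitions; `partition_slices` is a hypothetical helper, not something from the notebook or from Dask:

```python
from itertools import accumulate


def partition_slices(n_items, npartitions):
    """Return (start, stop) pairs that split range(n_items) into
    npartitions contiguous, nearly equal chunks."""
    base, extra = divmod(n_items, npartitions)
    # The first `extra` chunks each take one additional item.
    sizes = [base + (1 if i < extra else 0) for i in range(npartitions)]
    bounds = [0, *accumulate(sizes)]
    # Pair adjacent boundaries, the same idea as toolz.sliding_window(2, bounds).
    return list(zip(bounds[:-1], bounds[1:]))


slices = partition_slices(25, 10)
```

Each pair can then be handed to a loader that reads only that range of documents, which is what one partition per task amounts to.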

mrocklin · 2020-07-28T00:07:00Z

machine-learning/text-count-vectorizer.ipynb

+    "import dask_ml.feature_extraction.text"
+   ]
+  },
+  {
+   "cell_type": "code",


I recommend merging adjacent code cells, if only to cut down on Ctrl-Enter pressing.

mrocklin · 2020-07-28T00:09:18Z

machine-learning/text-count-vectorizer.ipynb

+    "remote_vocabulary, = client.scatter([vocabulary], broadcast=True)\n",
+    "\n",
+    "vectorizer2 = dask_ml.feature_extraction.text.CountVectorizer(\n",
+    "    vocabulary=remote_vocabulary\n",
+    ")"


Is there a well-defined vocabulary that we can use somewhere? Maybe in nltk? I'm concerned that people will see this and think that they should copy the vocabulary off of one CountVectorizer and then pass it to another.

Also, do we need the scatter? Can you verify that if the vocabulary is passed directly as the vocabulary= keyword argument, it will occupy only a single task rather than being embedded in many of them?

TomAugspurger · 2020-08-06

I'm not sure about nltk, but it's probably not worth adding it to the environment just for this example. I noted that you'd probably get this from an external source in practice.

I think that as of dask/dask-ml#719, the answer to your question about user-provided vocabulary being in one task is "yes". But that change hasn't been released yet.
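To illustrate why a user-provided vocabulary matters here, the following is a pure-Python sketch and not dask-ml's implementation: with a fixed vocabulary, each partition can be vectorized independently and the columns still line up. The `count_vectorize` helper, the whitespace tokenizer, and the sample documents are all assumptions made for illustration.

```python
from collections import Counter


def count_vectorize(documents, vocabulary):
    """Map each document to a dense row of term counts,
    with columns in the order given by `vocabulary`."""
    known = set(vocabulary)
    rows = []
    for doc in documents:
        # Naive lowercase/whitespace tokenization; real vectorizers do more.
        counts = Counter(tok for tok in doc.lower().split() if tok in known)
        rows.append([counts.get(term, 0) for term in vocabulary])
    return rows


vocabulary = ["dask", "graph", "task"]
partition_a = ["Dask builds a task graph", "one task per partition"]
partition_b = ["the graph holds every task task"]

# Each partition is processed on its own, yet column i always means vocabulary[i].
rows_a = count_vectorize(partition_a, vocabulary)
rows_b = count_vectorize(partition_b, vocabulary)
```

Without a shared vocabulary, each partition would have to discover its own terms first, and the resulting columns would not be comparable until the vocabularies were merged.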

TomAugspurger added 5 commits July 24, 2020 16:09

Add example using CountVectorizer

dccfa6e

update

68f348d

fixup

007ae47

strip

f257e40

Merge remote-tracking branch 'upstream/master' into count-vectorizer

715de82

merge

39cb3bd

mrocklin reviewed Jul 28, 2020


fixups

4a59343

Base automatically changed from master to main January 27, 2021 16:07

TomAugspurger Aug 6, 2020
