From 79356d13f813cb35f96af26de31d01963c7c00c7 Mon Sep 17 00:00:00 2001
From: Prianikq <71663981+Prianikq@users.noreply.github.com>
Date: Sat, 23 Jul 2022 11:56:21 +0300
Subject: [PATCH] Added PyTorch notebook

---
 .../15-LanguageModeling/CBoW-PyTorch.ipynb | 563 ++++++++++++++++++
 1 file changed, 563 insertions(+)
 create mode 100644 lessons/5-NLP/15-LanguageModeling/CBoW-PyTorch.ipynb

diff --git a/lessons/5-NLP/15-LanguageModeling/CBoW-PyTorch.ipynb b/lessons/5-NLP/15-LanguageModeling/CBoW-PyTorch.ipynb
new file mode 100644
index 00000000..ea24e4de
--- /dev/null
+++ b/lessons/5-NLP/15-LanguageModeling/CBoW-PyTorch.ipynb
@@ -0,0 +1,563 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "NXTSugt6ieXh"
+ },
+ "source": [
+ "## Training CBoW Model\n",
+ "\n",
+ "This notebook is part of the [AI for Beginners Curriculum](http://aka.ms/ai-beginners).\n",
+ "\n",
+ "In this example, we will look at training a CBoW language model to get our own Word2Vec embedding space. We will use the AG News dataset as the source of text."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import torch\n",
+ "import torchtext\n",
+ "import os\n",
+ "import collections\n",
+ "import random\n",
+ "import numpy as np"
+ ],
+ "metadata": {
+ "id": "q-UiiJUKaxHj"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")"
+ ],
+ "metadata": {
+ "id": "TFbR8CZaTZ1q"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "First, let's load our dataset and define the tokenizer and vocabulary. We will set `vocab_size` to 5000 to limit the amount of computation a bit."
+ ],
+ "metadata": {
+ "id": "HIwC7lI5T-ov"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "def load_dataset(ngrams=1, min_freq=1, vocab_size=5000, lines_cnt=500):\n",
+ "    tokenizer = torchtext.data.utils.get_tokenizer('basic_english')\n",
+ "    print(\"Loading dataset...\")\n",
+ "    # AG_NEWS returns the splits in (train, test) order\n",
+ "    train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')\n",
+ "    train_dataset = list(train_dataset)\n",
+ "    test_dataset = list(test_dataset)\n",
+ "    classes = ['World', 'Sports', 'Business', 'Sci/Tech']\n",
+ "    print('Building vocab...')\n",
+ "    counter = collections.Counter()\n",
+ "    for i, (_, line) in enumerate(train_dataset):\n",
+ "        counter.update(torchtext.data.utils.ngrams_iterator(tokenizer(line), ngrams=ngrams))\n",
+ "        if i == lines_cnt:\n",
+ "            break\n",
+ "    vocab = torchtext.vocab.Vocab(collections.Counter(dict(counter.most_common(vocab_size))), min_freq=min_freq)\n",
+ "    return train_dataset, test_dataset, classes, vocab, tokenizer"
+ ],
+ "metadata": {
+ "id": "wdZuygtgiuLG"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "train_dataset, test_dataset, _, vocab, tokenizer = load_dataset()"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "4d1nU1gsivGu",
+ "outputId": "949fe272-ae0e-49f5-c373-6703458b3a74"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Loading dataset...\n",
+ "Building vocab...\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "def encode(x, vocabulary, tokenizer=tokenizer):\n",
+ "    return [vocabulary[s] for s in tokenizer(x)]"
+ ],
+ "metadata": {
+ "id": "1XDYNhG8ToFV"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
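+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a quick sanity check, we can tokenize an arbitrary sample sentence and encode it into vocabulary ids. The exact ids you see will depend on the vocabulary built above:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Tokenize a sample sentence and encode it into vocabulary ids\n",
+ "# (the sentence is arbitrary; ids depend on the vocabulary built above)\n",
+ "sample = 'Microsoft releases a new version of Windows'\n",
+ "print(tokenizer(sample))\n",
+ "print(encode(sample, vocab))"
+ ],
+ "metadata": {},
+ "execution_count": null,
+ "outputs": []
+ },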
"LIlQk6_PaHVY" + }, + "source": [ + "## CBoW Model\n", + "\n", + "CBoW learns to predict a word based on the $2N$ neighboring words. For example, when $N=1$, we will get the following pairs from the sentence *I like to train networks*: (like,I), (I, like), (to, like), (like,to), (train,to), (to, train), (networks, train), (train,networks). Here, first word is the neighboring word used as an input, and second word is the one we are predicting.\n", + "\n", + "To build a network to predict next word, we will need to supply neighboring word as input, and get word number as output. The architecture of CBoW network is the following:\n", + "\n", + "* Input word is passed through the embedding layer. This very embedding layer would be our Word2Vec embedding, thus we will define it separately as `embedder` variable. We will use embedding size = 30 in this example, even though you might want to experiment with higher dimensions (real word2vec has 300)\n", + "* Embedding vector would then be passed to a linear layer that will predict output word. Thus it has the `vocab_size` neurons.\n", + "\n", + "For the output, if we use `CrossEntropyLoss` as loss function, we would also have to provide just word numbers as expected results, without one-hot encoding." + ] + }, + { + "cell_type": "code", + "source": [ + "vocab_size = len(vocab)\n", + "\n", + "embedder = torch.nn.Embedding(num_embeddings = vocab_size, embedding_dim = 30)\n", + "model = torch.nn.Sequential(\n", + " embedder,\n", + " torch.nn.Linear(in_features = 30, out_features = vocab_size),\n", + ")\n", + "\n", + "print(model)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "akKTcKQKkfl2", + "outputId": "da687e3e-a8ec-4c1a-e456-ab8cd6ac7dad" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Sequential(\n", + " (0): Embedding(5002, 30)\n", + " (1): Linear(in_features=30, out_features=5002, bias=True)\n", + ")\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Nud6jgGPaHVa" + }, + "source": [ + "## Preparing Training Data\n", + "\n", + "Now let's program the main function that will compute CBoW word pairs from text. This function will allow us to specify window size, and will return a set of pairs - input and output word. Note that this function can be used on words, as well as on vectors/tensors - which will allow us to encode the text, before passing it to `to_cbow` function." 
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Nud6jgGPaHVa"
+ },
+ "source": [
+ "## Preparing Training Data\n",
+ "\n",
+ "Now let's program the main function that computes CBoW word pairs from text. This function allows us to specify the window size, and returns a list of pairs, each consisting of an input and an output word. Note that this function works on words as well as on vectors/tensors, which allows us to encode the text before passing it to the `to_cbow` function."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "x-dsXygOieXn",
+ "outputId": "c2218280-e540-40ba-9546-efe48d0d714f"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "[['like', 'I'], ['to', 'I'], ['I', 'like'], ['to', 'like'], ['train', 'like'], ['I', 'to'], ['like', 'to'], ['train', 'to'], ['networks', 'to'], ['like', 'train'], ['to', 'train'], ['networks', 'train'], ['to', 'networks'], ['train', 'networks']]\n",
+ "[[232, 172], [5, 172], [172, 232], [5, 232], [0, 232], [172, 5], [232, 5], [0, 5], [1202, 5], [232, 0], [5, 0], [1202, 0], [5, 1202], [0, 1202]]\n"
+ ]
+ }
+ ],
+ "source": [
+ "def to_cbow(sent, window_size=2):\n",
+ "    res = []\n",
+ "    for i, x in enumerate(sent):\n",
+ "        # pair the central word x with every neighbor within the window\n",
+ "        for j in range(max(0, i - window_size), min(i + window_size + 1, len(sent))):\n",
+ "            if i != j:\n",
+ "                res.append([sent[j], x])\n",
+ "    return res\n",
+ "\n",
+ "print(to_cbow(['I', 'like', 'to', 'train', 'networks']))\n",
+ "print(to_cbow(encode('I like to train networks', vocab)))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "XVaaDLjaaHVb"
+ },
+ "source": [
+ "Let's prepare the training dataset. We will go through all the news items, call `to_cbow` to get the list of word pairs, and add those pairs to `X` and `Y`. For the sake of time, we will only consider the first 10k news items - you can easily remove this limitation if you have more time to wait and want to get better embeddings :)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "54b-Gd9TieXo"
+ },
+ "outputs": [],
+ "source": [
+ "X = []\n",
+ "Y = []\n",
+ "for i, x in zip(range(10000), train_dataset):\n",
+ "    for w1, w2 in to_cbow(encode(x[1], vocab), window_size=5):\n",
+ "        X.append(w1)\n",
+ "        Y.append(w2)\n",
+ "\n",
+ "X = torch.tensor(X)\n",
+ "Y = torch.tensor(Y)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We will wrap this data into an iterable dataset, shuffling the pairs once at construction time:"
+ ],
+ "metadata": {
+ "id": "cwWy0PzXWhN5"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "class SimpleIterableDataset(torch.utils.data.IterableDataset):\n",
+ "    def __init__(self, X, Y):\n",
+ "        super().__init__()\n",
+ "        self.data = []\n",
+ "        for i in range(len(X)):\n",
+ "            self.data.append((Y[i], X[i]))\n",
+ "        random.shuffle(self.data)\n",
+ "\n",
+ "    def __iter__(self):\n",
+ "        return iter(self.data)"
+ ],
+ "metadata": {
+ "id": "mfoAcGPFZU8p"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "e4NQ_-5waHVc"
+ },
+ "source": [
+ "Now let's create the dataset and the dataloader:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "AbLUcojlieXo"
+ },
+ "outputs": [],
+ "source": [
+ "ds = SimpleIterableDataset(X, Y)\n",
+ "dl = torch.utils.data.DataLoader(ds, batch_size=256)"
+ ]
+ },
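+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a quick check, let's pull a single batch from the dataloader and look at its shapes. With `batch_size=256`, we expect two tensors of 256 target words and 256 input words:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Fetch one batch to verify shapes: labels are target words, features are input words\n",
+ "labels, features = next(iter(dl))\n",
+ "print(labels.shape, features.shape)  # expected: torch.Size([256]) torch.Size([256])"
+ ],
+ "metadata": {},
+ "execution_count": null,
+ "outputs": []
+ },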
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "pKQr7sXeaHVc"
+ },
+ "source": [
+ "Now let's do the actual training. We will use the `SGD` optimizer with a pretty high learning rate. You can also try playing around with other optimizers, such as `Adam`. We will train for 10 epochs to begin with - you can re-run this cell if you want an even lower loss."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "def train_epoch(net, dataloader, lr=0.01, optimizer=None, loss_fn=torch.nn.CrossEntropyLoss(), epochs=1, report_freq=1):\n",
+ "    optimizer = optimizer or torch.optim.Adam(net.parameters(), lr=lr)\n",
+ "    loss_fn = loss_fn.to(device)\n",
+ "    net = net.to(device)\n",
+ "    net.train()\n",
+ "\n",
+ "    for i in range(epochs):\n",
+ "        total_loss, j = 0, 0\n",
+ "        for labels, features in dataloader:\n",
+ "            optimizer.zero_grad()\n",
+ "            features, labels = features.to(device), labels.to(device)\n",
+ "            out = net(features)\n",
+ "            loss = loss_fn(out, labels)\n",
+ "            loss.backward()\n",
+ "            optimizer.step()\n",
+ "            # accumulate a plain float so we don't keep computation graphs alive\n",
+ "            total_loss += loss.item()\n",
+ "            j += 1\n",
+ "        if i % report_freq == 0:\n",
+ "            print(f\"Epoch: {i+1}: loss={total_loss/j}\")\n",
+ "\n",
+ "    return total_loss/j"
+ ],
+ "metadata": {
+ "id": "HeeCYKr_KF1w"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "train_epoch(net=model, dataloader=dl, optimizer=torch.optim.SGD(model.parameters(), lr=0.1), loss_fn=torch.nn.CrossEntropyLoss(), epochs=10)"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "KVgwGtDHgDlT",
+ "outputId": "2447833f-f0e3-4566-c33d-addbfe2f451d"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Epoch: 1: loss=5.664632366860172\n",
+ "Epoch: 2: loss=5.632101973960962\n",
+ "Epoch: 3: loss=5.610399051405015\n",
+ "Epoch: 4: loss=5.594621561080262\n",
+ "Epoch: 5: loss=5.582538017415446\n",
+ "Epoch: 6: loss=5.572900234519603\n",
+ "Epoch: 7: loss=5.564951676341915\n",
+ "Epoch: 8: loss=5.558288112064614\n",
+ "Epoch: 9: loss=5.552576955031129\n",
+ "Epoch: 10: loss=5.547634165194347\n"
+ ]
+ },
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "5.547634165194347"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 16
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "W8u2qXZmaHVd"
+ },
+ "source": [
+ "## Trying out Word2Vec\n",
+ "\n",
+ "To use Word2Vec, let's extract the vectors corresponding to all the words in our vocabulary:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "r8TatcXjkU_t"
+ },
+ "outputs": [],
+ "source": [
+ "vectors = torch.stack([embedder(torch.tensor(vocab[s]).to(device)) for s in vocab.itos], 0)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3OcX21UOaHVd"
+ },
+ "source": [
+ "Let's see, for example, how the word **Paris** is encoded into a vector:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "bz6tAeLzieXp",
+ "outputId": "5b20850e-4342-45e9-f840-cfac2b4d61d8"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "tensor([-0.0915,  2.1224, -0.0281, -0.6819,  1.1219,  0.6458, -1.3704, -1.3314,\n",
+ "        -1.1437,  0.4496,  0.2301, -0.3515, -0.8485,  1.0481,  0.4386, -0.8949,\n",
+ "         0.5644,  1.0939, -2.5096,  3.2949, -0.2601, -0.8640,  0.1421, -0.0804,\n",
+ "        -0.5083, -1.0560,  0.9753, -0.5949, -1.6046,  0.5774],\n",
+ "       grad_fn=<EmbeddingBackward0>)\n"
+ ]
+ }
+ ],
+ "source": [
+ "paris_vec = embedder(torch.tensor(vocab['paris']).to(device))\n",
+ "print(paris_vec)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "pHTJlaeYaHVd"
+ },
+ "source": [
+ "It is interesting to use Word2Vec to look for synonyms. The following function returns the `n` closest words to a given input. To find them, we compute the distance $\\|w_i - v\\|$, where $v$ is the vector corresponding to our input word, and $w_i$ is the encoding of the $i$-th word in the vocabulary. We then sort the distances with `argsort` and take the first `n` elements of the resulting index list, which give the positions of the closest words in the vocabulary."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "NlZyi-_olFar",
+ "outputId": "b5dbb163-88c4-4d5a-eaf2-6751f700e98c"
+ },
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "['microsoft', 'quoted', 'lp', 'rate', 'top']"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 56
+ }
+ ],
+ "source": [
+ "def close_words(x, n=5):\n",
+ "    vec = embedder(torch.tensor(vocab[x]).to(device))\n",
+ "    # Euclidean distances from the query vector to all vocabulary vectors\n",
+ "    top_n = np.linalg.norm(vectors.detach().cpu().numpy() - vec.detach().cpu().numpy(), axis=1).argsort()[:n]\n",
+ "    return [vocab.itos[i] for i in top_n]\n",
+ "\n",
+ "close_words('microsoft')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "-dQq7xeAln0U",
+ "outputId": "66f768c3-c248-4bfd-ce4f-c8ffc6d0dd0d"
+ },
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "['basketball', 'lot', 'sinai', 'states', 'healthdaynews']"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 51
+ }
+ ],
+ "source": [
+ "close_words('basketball')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "fJXqK26b29sa",
+ "outputId": "78f0baba-ffd0-485a-dd87-0a12bedfd7fa"
+ },
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "['funds', 'travel', 'sydney', 'japan', 'business']"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 77
+ }
+ ],
+ "source": [
+ "close_words('funds')"
+ ]
+ },
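+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Euclidean distance is sensitive to vector magnitude. A common alternative for comparing word vectors is cosine similarity. Below is a minimal sketch of the same lookup based on it (the helper `close_words_cosine` is our own illustration, not part of the lesson):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "def close_words_cosine(x, n=5):\n",
+ "    # Illustrative alternative: cosine similarity instead of Euclidean distance.\n",
+ "    # Normalize all embedding vectors to unit length, then rank by dot product.\n",
+ "    v = vectors.detach().cpu().numpy()\n",
+ "    v = v / np.linalg.norm(v, axis=1, keepdims=True)\n",
+ "    query = v[vocab[x]]\n",
+ "    top_n = (-(v @ query)).argsort()[:n]  # minus sign: highest similarity first\n",
+ "    return [vocab.itos[i] for i in top_n]\n",
+ "\n",
+ "close_words_cosine('microsoft')"
+ ],
+ "metadata": {},
+ "execution_count": null,
+ "outputs": []
+ },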
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "My0VeTDd3Ji8"
+ },
+ "source": [
+ "## Takeaway\n",
+ "\n",
+ "Using clever techniques such as CBoW, we can train a Word2Vec model. You may also try to train a skip-gram model, which is trained to predict a neighboring word given the central one, and see how well it performs."
+ ]
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "collapsed_sections": [],
+ "name": "CBoW-PyTorch.ipynb",
+ "provenance": []
+ },
+ "interpreter": {
+ "hash": "16af2a8bbb083ea23e5e41c7f5787656b2ce26968575d8763f2c4b17f9cd711f"
+ },
+ "kernelspec": {
+ "display_name": "Python 3.8.12 ('py38')",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.12"
+ },
+ "orig_nbformat": 4,
+ "gpuClass": "standard"
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
\ No newline at end of file