Diffstat (limited to 'Fundamentals_of_Deep_Learning/06_nlp.ipynb')
-rw-r--r-- Fundamentals_of_Deep_Learning/06_nlp.ipynb 1184
1 files changed, 1184 insertions, 0 deletions
diff --git a/Fundamentals_of_Deep_Learning/06_nlp.ipynb b/Fundamentals_of_Deep_Learning/06_nlp.ipynb
new file mode 100644
index 0000000..6b42a33
--- /dev/null
+++ b/Fundamentals_of_Deep_Learning/06_nlp.ipynb
@@ -0,0 +1,1184 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<center><a href=\"https://www.nvidia.com/dli\"> <img src=\"images/DLI_Header.png\" alt=\"Header\" style=\"width: 400px;\"/> </a></center>"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# 6. Natural Language Processing"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In this tutorial, we'll take a detour away from stand-alone pieces of data such as still images, to data that is dependent on other data items in a sequence. For our example, we'll use text sentences. Language is naturally composed of sequence data, in the form of characters in words, and words in sentences. Other examples of sequence data include stock prices and weather data over time. Videos, while containing still images, are also sequences. Elements in the data have a relationship with what comes before and what comes after, and this fact requires a different approach."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 6.1 Objectives"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "* Use a tokenizer to prepare text for a neural network\n",
+ "* See how embeddings are used to identify numerical features for text data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 6.2 BERT "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "BERT, which stands for **B**idirectional **E**ncoder **R**epresentations from **T**ransformers, was a ground-breaking model introduced in 2018 by [Google](https://www.google.com/).\n",
+ "\n",
+ "BERT is simultaneously trained on two goals:\n",
+ "* Predict a missing word from a sequence of words\n",
+ "* Predict a new sentence after a sequence of sentences\n",
+ "\n",
+ "Let's see BERT in action with these two types of challenges."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 6.3 Tokenization"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Since neural networks are number crunching machines, let's turn text into numerical tokens. Let's load BERT's [tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer#tokenizer):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
+ " warnings.warn(\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "483f46dab2c64faf8c9af283a63c51db",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "HBox(children=(FloatProgress(value=0.0, description='tokenizer_config.json', max=49.0, style=ProgressStyle(des…"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "d6d4103eb55749e1afbbedf2a991d6d7",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "HBox(children=(FloatProgress(value=0.0, description='vocab.txt', max=213450.0, style=ProgressStyle(description…"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "35d0a3762d334a5996359c53f45d0b9b",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "HBox(children=(FloatProgress(value=0.0, description='tokenizer.json', max=435797.0, style=ProgressStyle(descri…"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "c18d4ee46ac54a97b3569ea83eda95cc",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "HBox(children=(FloatProgress(value=0.0, description='config.json', max=570.0, style=ProgressStyle(description_…"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "import torch\n",
+ "from transformers import BertTokenizer, BertModel, BertForMaskedLM, BertForQuestionAnswering\n",
+ "tokenizer = BertTokenizer.from_pretrained('bert-base-cased')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The BERT `tokenizer` can [encode](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.encode) multiple texts at once. We will later test BERT's memory, so let's give it information and a question about that information. Feel free to come back here later and try a different combination of sentences."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[101,\n",
+ " 146,\n",
+ " 2437,\n",
+ " 11838,\n",
+ " 117,\n",
+ " 1241,\n",
+ " 1103,\n",
+ " 3014,\n",
+ " 1105,\n",
+ " 186,\n",
+ " 18413,\n",
+ " 21961,\n",
+ " 1348,\n",
+ " 119,\n",
+ " 102,\n",
+ " 1327,\n",
+ " 1912,\n",
+ " 1104,\n",
+ " 11838,\n",
+ " 1202,\n",
+ " 146,\n",
+ " 2437,\n",
+ " 136,\n",
+ " 102]"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "text_1 = \"I understand equations, both the simple and quadratical.\"\n",
+ "text_2 = \"What kind of equations do I understand?\"\n",
+ "\n",
+ "# Tokenized input with special tokens around it (for BERT: [CLS] at the beginning and [SEP] at the end)\n",
+ "indexed_tokens = tokenizer.encode(text_1, text_2, add_special_tokens=True)\n",
+ "indexed_tokens"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If we count the number of tokens, there are more tokens than words in our sentences. Let's see why that is. We can use [convert_ids_to_tokens](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.convert_ids_to_tokens) to see what was used as tokens."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['[CLS]',\n",
+ " 'I',\n",
+ " 'understand',\n",
+ " 'equations',\n",
+ " ',',\n",
+ " 'both',\n",
+ " 'the',\n",
+ " 'simple',\n",
+ " 'and',\n",
+ " 'q',\n",
+ " '##uad',\n",
+ " '##ratic',\n",
+ " '##al',\n",
+ " '.',\n",
+ " '[SEP]',\n",
+ " 'What',\n",
+ " 'kind',\n",
+ " 'of',\n",
+ " 'equations',\n",
+ " 'do',\n",
+ " 'I',\n",
+ " 'understand',\n",
+ " '?',\n",
+ " '[SEP]']"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "tokenizer.convert_ids_to_tokens([str(token) for token in indexed_tokens])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "There are two reasons why the indexed list is longer than our origincal input:\n",
+ "1. The `tokenizer` adds `special_tokens` to represent the start (`[CLS]`) of a sequence and separation ('[SEP]`) between sentences.\n",
+ "2. The `tokenizer` can break a word down into multiple parts.\n",
+ "\n",
+ "From a linguistic perspective, the second one is interesting. Many languages have [word roots](https://en.wikipedia.org/wiki/List_of_Greek_and_Latin_roots_in_English), or components that make up a word. For instance, the word \"quadratic\" has the root \"quadr\" which means \"4\". Rather than use word roots as defined by a language, BERT uses a [WordPiece](https://paperswithcode.com/method/wordpiece) model to find patterns in how to break up a word. The BERT model we will be using today has `28996` token vocabulary."
+ ]
+ },
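+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As an optional aside, we can watch WordPiece at work by tokenizing the single word `quadratical` on its own with the tokenizer's `tokenize` method, and confirm the vocabulary size with its `vocab_size` attribute:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Optional aside: how WordPiece splits a single word into subword tokens\n",
+ "print(tokenizer.tokenize('quadratical'))  # expected: ['q', '##uad', '##ratic', '##al']\n",
+ "# Size of the WordPiece vocabulary for bert-base-cased\n",
+ "print(tokenizer.vocab_size)"
+ ]
+ },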
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If we want to [decode](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.decode) our encoded text directly, we can. Notice the `special_tokens` have been added in."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'[CLS] I understand equations, both the simple and quadratical. [SEP] What kind of equations do I understand? [SEP]'"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "tokenizer.decode(indexed_tokens)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 6.4 Segmenting Text"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In order to use the BERT model for predictions, it also needs a list of `segment_ids`. This is a vector the same length as our tokens and represents which segment belongs to each sentence.\n",
+ "\n",
+ "Since our `tokenizer` added in some `special_tokens`, we can use these special tokens to find the segments. First, let's define which index correspnds to which special token."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "cls_token = 101\n",
+ "sep_token = 102"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Next, we can create a `for` loop. We'll start with our `segment_id` set to `0`, and we'll increment the `segment_id` whenever we see the [SEP] token. For good measure, we will return both the `segment_ids` and `indexd_tokens` as tensors as we will be feeding these into the model later."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def get_segment_ids(indexed_tokens):\n",
+ " segment_ids = []\n",
+ " segment_id = 0\n",
+ " for token in indexed_tokens:\n",
+ " if token == sep_token:\n",
+ " segment_id += 1\n",
+ " segment_ids.append(segment_id)\n",
+ " segment_ids[-1] -= 1 # Last [SEP] is ignored\n",
+ " return torch.tensor([segment_ids]), torch.tensor([indexed_tokens])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's test it out. Does each number correctly correspond to the first and second sentence?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "segments_tensors, tokens_tensor = get_segment_ids(indexed_tokens)\n",
+ "segments_tensors"
+ ]
+ },
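+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As an aside, the Hugging Face tokenizer can also build segment IDs for us: calling the `tokenizer` on a sentence pair returns `token_type_ids` that serve the same purpose as our hand-built `segment_ids` (the two conventions may differ on which segment the first `[SEP]` token is assigned to). A minimal sketch:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Optional aside: the tokenizer can produce segment IDs (token_type_ids) directly\n",
+ "encoded = tokenizer(text_1, text_2, return_tensors='pt')\n",
+ "print(encoded['input_ids'])       # should match tokens_tensor above\n",
+ "print(encoded['token_type_ids'])  # plays the same role as segments_tensors"
+ ]
+ },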
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 6.4 Text Masking"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's start with the focus BERT has on words. To train for word embeddings, BERT masks out a word in a sequence of words. The mask is its own special token:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'[MASK]'"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "tokenizer.mask_token"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "103"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "tokenizer.mask_token_id"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's take our two sentences from before and mask out the position at index `5`. Feel free to return here to change the index to see how it changes the results!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "masked_index = 5"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Next, we'll apply the mask and verify it appears in our sequence of setences."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'[CLS] I understand equations, [MASK] the simple and quadratical. [SEP] What kind of equations do I understand? [SEP]'"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "indexed_tokens[masked_index] = tokenizer.mask_token_id\n",
+ "tokens_tensor = torch.tensor([indexed_tokens])\n",
+ "tokenizer.decode(indexed_tokens)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Then, we will load the model used to predict the missing word: `modelForMaskedLM`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "217b4bb209964205a302a6b65d7d735e",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "HBox(children=(FloatProgress(value=0.0, description='model.safetensors', max=435755784.0, style=ProgressStyle(…"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']\n",
+ "- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
+ "- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n"
+ ]
+ }
+ ],
+ "source": [
+ "masked_lm_model = BertForMaskedLM.from_pretrained(\"bert-base-cased\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Just like with other PyTorch modules, we can check the architecture."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "BertForMaskedLM(\n",
+ " (bert): BertModel(\n",
+ " (embeddings): BertEmbeddings(\n",
+ " (word_embeddings): Embedding(28996, 768, padding_idx=0)\n",
+ " (position_embeddings): Embedding(512, 768)\n",
+ " (token_type_embeddings): Embedding(2, 768)\n",
+ " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+ " (dropout): Dropout(p=0.1, inplace=False)\n",
+ " )\n",
+ " (encoder): BertEncoder(\n",
+ " (layer): ModuleList(\n",
+ " (0-11): 12 x BertLayer(\n",
+ " (attention): BertAttention(\n",
+ " (self): BertSelfAttention(\n",
+ " (query): Linear(in_features=768, out_features=768, bias=True)\n",
+ " (key): Linear(in_features=768, out_features=768, bias=True)\n",
+ " (value): Linear(in_features=768, out_features=768, bias=True)\n",
+ " (dropout): Dropout(p=0.1, inplace=False)\n",
+ " )\n",
+ " (output): BertSelfOutput(\n",
+ " (dense): Linear(in_features=768, out_features=768, bias=True)\n",
+ " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+ " (dropout): Dropout(p=0.1, inplace=False)\n",
+ " )\n",
+ " )\n",
+ " (intermediate): BertIntermediate(\n",
+ " (dense): Linear(in_features=768, out_features=3072, bias=True)\n",
+ " (intermediate_act_fn): GELUActivation()\n",
+ " )\n",
+ " (output): BertOutput(\n",
+ " (dense): Linear(in_features=3072, out_features=768, bias=True)\n",
+ " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+ " (dropout): Dropout(p=0.1, inplace=False)\n",
+ " )\n",
+ " )\n",
+ " )\n",
+ " )\n",
+ " )\n",
+ " (cls): BertOnlyMLMHead(\n",
+ " (predictions): BertLMPredictionHead(\n",
+ " (transform): BertPredictionHeadTransform(\n",
+ " (dense): Linear(in_features=768, out_features=768, bias=True)\n",
+ " (transform_act_fn): GELUActivation()\n",
+ " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+ " )\n",
+ " (decoder): Linear(in_features=768, out_features=28996, bias=True)\n",
+ " )\n",
+ " )\n",
+ ")"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "masked_lm_model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Can you spot the section labeled `word_embeddings`? These are the embeddings BERT learned for each token."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Parameter containing:\n",
+ "tensor([[-0.0005, -0.0416, 0.0131, ..., -0.0039, -0.0335, 0.0150],\n",
+ " [ 0.0169, -0.0311, 0.0042, ..., -0.0147, -0.0356, -0.0036],\n",
+ " [-0.0006, -0.0267, 0.0080, ..., -0.0100, -0.0331, -0.0165],\n",
+ " ...,\n",
+ " [-0.0064, 0.0166, -0.0204, ..., -0.0418, -0.0492, 0.0042],\n",
+ " [-0.0048, -0.0027, -0.0290, ..., -0.0512, 0.0045, -0.0118],\n",
+ " [ 0.0313, -0.0297, -0.0230, ..., -0.0145, -0.0525, 0.0284]],\n",
+ " requires_grad=True)"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "embedding_table = next(masked_lm_model.bert.embeddings.word_embeddings.parameters())\n",
+ "embedding_table"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can verify there is an embedding of size `768` for each of the `28996` tokens in BERT's vocabulary."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "torch.Size([28996, 768])"
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "embedding_table.shape"
+ ]
+ },
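+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To get a feel for what these embeddings capture, we can compare rows of the table with cosine similarity; tokens with related meanings tend to end up closer together. The sketch below is an optional aside: the word choices are arbitrary, and for words that split into multiple WordPieces we simply take the first piece as a rough stand-in."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Optional aside: compare token embeddings with cosine similarity\n",
+ "import torch.nn.functional as F\n",
+ "\n",
+ "def first_piece_embedding(word):\n",
+ "    # Use the first WordPiece of the word as a rough stand-in for the whole word\n",
+ "    token_id = tokenizer.encode(word, add_special_tokens=False)[0]\n",
+ "    return embedding_table[token_id]\n",
+ "\n",
+ "print(F.cosine_similarity(first_piece_embedding('equation'), first_piece_embedding('formula'), dim=0))\n",
+ "print(F.cosine_similarity(first_piece_embedding('equation'), first_piece_embedding('banana'), dim=0))"
+ ]
+ },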
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's test the model! Can it correctly predict the missing word in our provided sentences? We will use [torch.no_grad](https://pytorch.org/docs/stable/generated/torch.no_grad.html) to inform PyTorch not to calculate a gradient."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "MaskedLMOutput(loss=None, logits=tensor([[[ -7.3832, -7.2504, -7.4539, ..., -6.0597, -5.7928, -6.2133],\n",
+ " [ -6.7681, -6.7896, -6.8317, ..., -5.4655, -5.4048, -6.0683],\n",
+ " [ -7.7323, -7.9597, -7.7348, ..., -5.7611, -5.3566, -4.3361],\n",
+ " ...,\n",
+ " [ -6.1213, -6.3311, -6.4144, ..., -5.8884, -4.1157, -3.1189],\n",
+ " [-12.3216, -12.4479, -11.9787, ..., -10.6539, -8.7396, -11.0487],\n",
+ " [-13.4115, -13.7876, -13.5183, ..., -10.6359, -11.6582, -10.9009]]]), hidden_states=None, attentions=None)"
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "with torch.no_grad():\n",
+ " predictions = masked_lm_model(tokens_tensor, token_type_ids=segments_tensors)\n",
+ "predictions"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This is a little bit hard to read, let's look at the `shape` to get a better sense of what's going on."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "torch.Size([1, 24, 28996])"
+ ]
+ },
+ "execution_count": 17,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "predictions[0].shape"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `24` is our number of tokens, and the `28996` are the predictions for every token in BERT's vocabulary. We'd like to find the highest value accross all the token in the vocabulary, so we can use [torch.argmax](https://pytorch.org/docs/stable/generated/torch.argmax.html) to find it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "1241"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Get the predicted token\n",
+ "predicted_index = torch.argmax(predictions[0][0], dim=1)[masked_index].item()\n",
+ "predicted_index"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's see what token `1241` corresponds to:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'both'"
+ ]
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]\n",
+ "predicted_token"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 34,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'both'"
+ ]
+ },
+ "execution_count": 34,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "predicted_indices = torch.argmax(predictions[0], dim=-1) # [seq_len]\n",
+ "predicted_indices = predicted_indices.view(-1)\n",
+ "predicted_tokens = tokenizer.convert_ids_to_tokens(predicted_indices.tolist())\n",
+ "predicted_token"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "What do you think? Is it correct?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'[CLS] I understand equations, [MASK] the simple and quadratical. [SEP] What kind of equations do I understand? [SEP]'"
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "tokenizer.decode(indexed_tokens)"
+ ]
+ },
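+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `argmax` only shows the single most likely token. As an optional extension, [torch.topk](https://pytorch.org/docs/stable/generated/torch.topk.html) lets us peek at the model's next-best candidates for the masked position:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Optional aside: the model's top 5 candidate tokens for the masked position\n",
+ "top_k = torch.topk(predictions[0][0, masked_index], k=5)\n",
+ "for score, token_id in zip(top_k.values, top_k.indices):\n",
+ "    print(tokenizer.convert_ids_to_tokens([token_id.item()])[0], round(score.item(), 2))"
+ ]
+ },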
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 6.5 Question and Answering"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "While word masking is interesting, BERT was designed for more complex problems such as sentence prediction. It is able to accomplish this by building on the [Attention Transformer](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) architecture.\n",
+ "\n",
+ "We will be using a different version of BERT for this section, which has its own tokenizer. Let's find a new set of tokens for our sample sentences."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "983dd59be2bd453db20a6430f4a9df5e",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "HBox(children=(FloatProgress(value=0.0, description='tokenizer_config.json', max=48.0, style=ProgressStyle(des…"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "aace1f9426ba4a64a102731d9c8740c6",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "HBox(children=(FloatProgress(value=0.0, description='vocab.txt', max=231508.0, style=ProgressStyle(description…"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "a64de1a9b0ef40349a4d84f1e19758fa",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "HBox(children=(FloatProgress(value=0.0, description='tokenizer.json', max=466062.0, style=ProgressStyle(descri…"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "14b67d22839d4c868635e2264702f35f",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "HBox(children=(FloatProgress(value=0.0, description='config.json', max=443.0, style=ProgressStyle(description_…"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "text_1 = \"I understand equations, both the simple and quadratical.\"\n",
+ "text_2 = \"What kind of equations do I understand?\"\n",
+ "\n",
+ "question_answering_tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')\n",
+ "indexed_tokens = question_answering_tokenizer.encode(text_1, text_2, add_special_tokens=True)\n",
+ "segments_tensors, tokens_tensor = get_segment_ids(indexed_tokens)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Next, let's load the `question_answering_model`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "877cb5fb0ae44316967317812d9f5d83",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "HBox(children=(FloatProgress(value=0.0, description='model.safetensors', max=1340622760.0, style=ProgressStyle…"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']\n",
+ "- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
+ "- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n"
+ ]
+ }
+ ],
+ "source": [
+ "question_answering_model = BertForQuestionAnswering.from_pretrained(\"bert-large-uncased-whole-word-masking-finetuned-squad\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can feed in our tokens and segments, just like when we were masking out a word."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Predict the start and end positions logits\n",
+ "with torch.no_grad():\n",
+ " out = question_answering_model(tokens_tensor, token_type_ids=segments_tensors)\n",
+ "out"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `question_answering_model` and answering model is scanning through our input sequence to find the subsequence that best answers the question. The higher the value, the more likely the start of the answer is."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "out.start_logits"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Similarly, the higher the value in `end_logits`, the more likely the answer will end on that token."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "out.end_logits"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can then use [torch.argmax](https://pytorch.org/docs/stable/generated/torch.argmax.html) to find the `answer_sequence` from start to finish:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "answer_sequence = indexed_tokens[torch.argmax(out.start_logits):torch.argmax(out.end_logits)+1]\n",
+ "answer_sequence"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Finally, let's [decode](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.decode) these tokens to see if the answer is correct!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "question_answering_tokenizer.convert_ids_to_tokens(answer_sequence)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "question_answering_tokenizer.decode(answer_sequence)"
+ ]
+ },
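+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To tie these steps together, here is a minimal sketch of a small helper (we'll call it `answer_question`; the name is our own) that takes a context and a question, repeats the steps above, and returns the decoded answer. Feel free to try it with your own sentences."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# A small helper that bundles the question-answering steps above (illustrative sketch)\n",
+ "def answer_question(context, question):\n",
+ "    ids = question_answering_tokenizer.encode(context, question, add_special_tokens=True)\n",
+ "    segments, tokens = get_segment_ids(ids)\n",
+ "    with torch.no_grad():\n",
+ "        out = question_answering_model(tokens, token_type_ids=segments)\n",
+ "    start = torch.argmax(out.start_logits).item()\n",
+ "    end = torch.argmax(out.end_logits).item() + 1\n",
+ "    return question_answering_tokenizer.decode(ids[start:end])\n",
+ "\n",
+ "answer_question(text_1, text_2)"
+ ]
+ },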
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 6.7 Summary"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Great work! You successfully used a Large Language Model (LLM) to extract answers from a sequence of sentences. Even though BERT was state-of-the-art when it was first released, many other LLMs have since broke ground. [build.nvidia.com](https://build.nvidia.com/explore/discover) hosts many of these models to be interacted with in the browser. Go check it out and see where the state-of-the-art is today!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 6.7.1 Clear the Memory\n",
+ "Before moving on, please execute the following cell to clear up the GPU memory."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import IPython\n",
+ "app = IPython.Application.instance()\n",
+ "app.kernel.do_shutdown(True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 6.7.2 Next"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Congratulations, you have completed all the learning objectives of the course!\n",
+ "\n",
+ "As a final exercise, and to earn certification in the course, successfully complete an end-to-end image classification problem in the assessment."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<center><a href=\"https://www.nvidia.com/dli\"> <img src=\"images/DLI_Header.png\" alt=\"Header\" style=\"width: 400px;\"/> </a></center>"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.12"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}