Diffstat (limited to 'nvidia1/Effective+Memory+Use.ipynb')
| -rw-r--r-- | nvidia1/Effective+Memory+Use.ipynb | 2073 |
1 file changed, 2073 insertions, 0 deletions
diff --git a/nvidia1/Effective+Memory+Use.ipynb b/nvidia1/Effective+Memory+Use.ipynb new file mode 100644 index 0000000..2f038e9 --- /dev/null +++ b/nvidia1/Effective+Memory+Use.ipynb @@ -0,0 +1,2073 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "<a href=\"https://www.nvidia.com/dli\"> <img src=\"images/DLI Header.png\" alt=\"Header\" style=\"width: 400px;\"/> </a>" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Effective Use of the Memory Subsystem" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that you can write correct CUDA kernels, and understand the importance of launching grids that give the GPU sufficient opportunity to hide latency, you are going to learn techniques to effectively utilize GPU memory subsystems. These techniques are widely applicable to a variety of CUDA applications, and some of the most important when it comes time to make your CUDA code go fast.\n", + "\n", + "You are going to begin by learning about memory coalescing. To challenge your ability to reason about memory coalescing, and to expose important details relevent to many CUDA applications, you will then learn about 2-dimensional grids and thread blocks. Next you will learn about a very fast, user-controlled, on-demand memory space called shared memory, and will use shared memory to facilitate memory coalescing where it would not have otherwise been possible. Finally, you will learn about shared memory bank conflicts, which can spoil the performance possibilities of using shared memory, and a technique to address them." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Objectives" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "By the time you complete this section, you will be able to:\n", + "* Write CUDA kernels that benefit from coalesced memory access patterns.\n", + "* Work with multi-dimensional grids and thread blocks.\n", + "* Use shared memory to coordinate threads within a block.\n", + "* Use shared memory to facilitate coalesced memory access patterns.\n", + "* Resolve shared memory bank conflicts." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## The Problem: Uncoalesced Memory Access Hurts Performance" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before you learn the details about what **coalesced memory access** is, run the following cells to observe the performance implications for a seemingly trivial change to the data access pattern within a kernel." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Imports" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "from numba import cuda" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data Creation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this cell we define `n` and create a grid with threads equal to `n`. We also create an output vector with length `n`. 
For the inputs we create vectors of size `stride * n` for reasons that will be made clear below:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "n = 1024*1024 # 1M\n", + "\n", + "threads_per_block = 1024\n", + "blocks = int(n / threads_per_block)\n", + "\n", + "stride = 16\n", + "\n", + "# Input Vectors of length stride * n\n", + "a = np.ones(stride * n).astype(np.float32)\n", + "b = a.copy().astype(np.float32)\n", + "\n", + "# Output Vector\n", + "out = np.zeros(n).astype(np.float32)\n", + "\n", + "d_a = cuda.to_device(a)\n", + "d_b = cuda.to_device(b)\n", + "d_out = cuda.to_device(out)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Kernel Definition" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In `add_experiment`, every thread in the grid will add an item in `a`, and an item in `b` and write the result to `out`. The kernel has been written such that we can pass a `coalesced` value of either `True` or `False` to affect how it indexes into the `a` and `b` vectors. You will see the performance comparison of the two modes below." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "@cuda.jit\n", + "def add_experiment(a, b, out, stride, coalesced):\n", + " i = cuda.grid(1)\n", + " # The above line is equivalent to\n", + " # i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n", + " if coalesced == True:\n", + " out[i] = a[i] + b[i]\n", + " else:\n", + " out[i] = a[stride*i] + b[stride*i]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Launch Kernel Using Coalesced Access" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here we pass `True` as the `coalesced` value, and observe the performance of the kernel over several runs:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "227 µs ± 73.6 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" + ] + } + ], + "source": [ + "%timeit add_experiment[blocks, threads_per_block](d_a, d_b, d_out, stride, True); cuda.synchronize" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here we make sure the kernel ran as expected:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "result = d_out.copy_to_host()\n", + "truth = a[:n] + b[:n]" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.array_equal(result, truth)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Launch Kernel Using Uncoalesced Access" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this cell we pass `False`, to observe the perfomance of the uncoalesced data access pattern for `add_experiment`:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "540 µs ± 10.9 ns per loop (mean ± std. dev. 
of 7 runs, 10000 loops each)\n" + ] + } + ], + "source": [ + "%timeit add_experiment[blocks, threads_per_block](d_a, d_b, d_out, stride, False); cuda.synchronize" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here we make sure the kernel ran as expected:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "result = d_out.copy_to_host()\n", + "truth = a[::stride] + b[::stride]" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.array_equal(result, truth)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The performance of the uncoalesced data access pattern was far worse. Now you will learn why, and how to think about data access patterns in your kernels to obtain high performing kernels." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Presentation: Global Memory Coalescing" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Execute the following cell to load the slides, then click on \"Start Slide Show\" to make them full screen." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " <iframe\n", + " width=\"800\"\n", + " height=\"450\"\n", + " src=\"https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/coalescing-v3.pptx\"\n", + " frameborder=\"0\"\n", + " allowfullscreen\n", + " ></iframe>\n", + " " + ], + "text/plain": [ + "<IPython.lib.display.IFrame at 0x7f9086df57b8>" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from IPython.display import IFrame\n", + "IFrame('https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/coalescing-v3.pptx', 800, 450)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> _**Footnote**: for additional details about global memory segment size across a variety of devices, and with regards to caching, see [The CUDA Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#coalesced-access-to-global-memory)._" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise: Column and Row Sums" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For this exercise you will be asked to write a column sums kernel that uses fully coalesced memory access patterns. To begin you will observe the performance of a row sums kernel that makes uncoalesced memory accesses." 
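Before moving on to the exercise, here is a rough back-of-the-envelope illustration of why the strided access above costs so much more. It is not part of the original notebook and assumes 32-thread warps and 32-byte minimum global-memory transactions (both of which vary by GPU generation); it simply counts how many distinct memory segments a single warp touches for one load:

```python
import numpy as np

WARP_SIZE = 32      # threads per warp (assumed)
SEGMENT_BYTES = 32  # assumed minimum global-memory transaction size
ELEM_BYTES = 4      # float32

def segments_touched(stride):
    """Distinct 32-byte segments one warp reads when thread i loads a[stride * i]."""
    lane = np.arange(WARP_SIZE)
    byte_addresses = lane * stride * ELEM_BYTES
    return len(np.unique(byte_addresses // SEGMENT_BYTES))

print("stride 1  (coalesced):  ", segments_touched(1), "segments per warp load")
print("stride 16 (uncoalesced):", segments_touched(16), "segments per warp load")
```

In the strided case only one 4-byte value out of each segment is actually used, so most of the bandwidth the warp consumes is wasted.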
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Row Sums" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Imports**" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "from numba import cuda" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Data Creation**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this cell we create an input matrix, as well as a vector for storing the solution, and transfer each of them to the device. We also define the grid and block dimensions to be used when we launch the kernel below. We set an arbitrary row of data to some arbitrary value to facilitate checking for correctness below." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "n = 16384 # matrix side size\n", + "threads_per_block = 256\n", + "blocks = int(n / threads_per_block)\n", + "\n", + "# Input Matrix\n", + "a = np.ones(n*n).reshape(n, n).astype(np.float32)\n", + "# Here we set an arbitrary row to an arbitrary value to facilitate a check for correctness below.\n", + "a[3] = 9\n", + "\n", + "# Output vector\n", + "sums = np.zeros(n).astype(np.float32)\n", + "\n", + "d_a = cuda.to_device(a)\n", + "d_sums = cuda.to_device(sums)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "** Kernel Definition**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`row_sums` will use each thread to iterate over a row of data, summing it, and then store its row sum in `sums`." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "@cuda.jit\n", + "def row_sums(a, sums, n):\n", + " idx = cuda.grid(1)\n", + " sum = 0.0\n", + " \n", + " for i in range(n):\n", + " # Each thread will sum a row of `a`\n", + " sum += a[idx][i]\n", + " \n", + " sums[idx] = sum" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Row Sums Performance**" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "11.6 ms ± 215 µs per loop (mean ± std. dev. 
of 7 runs, 10 loops each)\n" + ] + } + ], + "source": [ + "%timeit row_sums[blocks, threads_per_block](d_a, d_sums, n); cuda.synchronize()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Check for Correctness**" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "result = d_sums.copy_to_host()\n", + "truth = a.sum(axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.array_equal(truth, result)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Column Sums" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Imports**" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "from numba import cuda" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Data Creation**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this cell we create an input matrix, as well as a vector for storing the solution, and transfer each of them to the device. We also define the grid and block dimensions to be used when we launch the kernel below. We set an arbitrary column of data to some arbitrary value to facilitate checking for correctness below." + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "n = 16384 # matrix side size\n", + "threads_per_block = 256\n", + "blocks = int(n / threads_per_block)\n", + "\n", + "a = np.ones(n*n).reshape(n, n).astype(np.float32)\n", + "# Here we set an arbitrary column to an arbitrary value to facilitate a check for correctness below.\n", + "a[:, 3] = 9\n", + "sums = np.zeros(n).astype(np.float32)\n", + "\n", + "d_a = cuda.to_device(a)\n", + "d_sums = cuda.to_device(sums)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "** Kernel Definition**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`col_sums` will use each thread to iterate over a column of data, summing it, and then store its column sum in `sums`. Complete the kernel definition to accomplish this. If you get stuck, feel free to refer to [the solution](../edit/solutions/col_sums_solution.py)." + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "@cuda.jit\n", + "def col_sums(a, sums, ds):\n", + " idx = cuda.grid(1)\n", + " stride = cuda.gridsize(1)\n", + " sum = 0\n", + " for i in range(stride):\n", + " sum += a[i][idx]\n", + " \n", + " sums[idx] = sum" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Check Performance**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Assuming you have written `col_sums` to use coalesced access patterns, you should see a significant (almost 2x) speed up compared to the uncoalesced `row_sums` you ran above:" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "7.88 ms ± 3.1 µs per loop (mean ± std. dev. 
of 7 runs, 100 loops each)\n" + ] + } + ], + "source": [ + "%timeit col_sums[blocks, threads_per_block](d_a, d_sums, n); cuda.synchronize()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Check Correctness**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Confirm your kernel is working as expected." + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": {}, + "outputs": [], + "source": [ + "col_sums[blocks, threads_per_block](d_a, d_sums, n)\n", + "cuda.synchronize()\n", + "result = d_sums.copy_to_host()\n", + "truth = a.sum(axis=0)" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 51, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.array_equal(truth, result)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2 and 3 Dimensional Blocks and Grids" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Both grids and blocks can be configured to contain a 2 or 3 dimensional collection of blocks or threads, respectively. This is done mostly as a matter of convenience for programmers who often work with 2 or 3 dimensional datasets. Here is a very trivial example to highlight the syntax. You may need to read *both* the kernel definition and its launch before the concept makes sense." + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "from numba import cuda" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "A = np.zeros((4,4)) # A 4x4 Matrix of 0's\n", + "d_A = cuda.to_device(A)\n", + "\n", + "# Here we create a 2D grid with 4 blocks in a 2x2 structure, each with 4 threads in a 2x2 structure\n", + "# by using a Python tuple to signify grid and block dimensions.\n", + "blocks = (2, 2)\n", + "threads_per_block = (2, 2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This kernel will take an input matrix of 0s and write to each of its elements, its (x,y) coordinates within the grid in the format of `X.Y`:" + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "@cuda.jit\n", + "def get_2D_indices(A):\n", + " # By passing `2`, we get the thread's unique x and y coordinates in the 2D grid\n", + " x, y = cuda.grid(2)\n", + " # The above is equivalent to the following 2 lines of code:\n", + " # x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n", + " # y = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y\n", + " \n", + " # Write the x index followed by a decimal and the y index.\n", + " A[x][y] = x + y / 10" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "get_2D_indices[blocks, threads_per_block](d_A)" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[0. , 0.1, 0.2, 0.3],\n", + " [1. , 1.1, 1.2, 1.3],\n", + " [2. , 2.1, 2.2, 2.3],\n", + " [3. 
, 3.1, 3.2, 3.3]])" + ] + }, + "execution_count": 60, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "result = d_A.copy_to_host()\n", + "result" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise: Coalesced 2-Dimensional Matrix Add" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Imports" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "from numba import cuda" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data Creation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this cell we define 2048x2048 elmement input matrices `a` and `b`, as well as a 2048x2048 0-initialized output matrix. We copy these matrices to the device.\n", + "\n", + "We also define the 2-dimensional block and grid dimensions to be used below. Note that we are creating a grid with the same number of total threads as there are input and output elements, such that each thread in the grid will calculate the sum for a single element in the output matrix." + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "n = 2048*2048 # 4M\n", + "\n", + "# 2D blocks\n", + "threads_per_block = (32, 32)\n", + "# 2D grid\n", + "blocks = (64, 64)\n", + "\n", + "# 2048x2048 input matrices\n", + "a = np.arange(n).reshape(2048,2048).astype(np.float32)\n", + "b = a.copy().astype(np.float32)\n", + "\n", + "# 2048x2048 0-initialized output matrix\n", + "out = np.zeros_like(a).astype(np.float32)\n", + "\n", + "d_a = cuda.to_device(a)\n", + "d_b = cuda.to_device(b)\n", + "d_out = cuda.to_device(out)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 2D Matrix Add" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Your job is to complete the TODOs in `matrix_add` to correctly sum `a` and `b` into `out`. As a challenge to your understanding of coalesced access patterns, `matrix_add` will accept a `coalesced` boolean indicating whether the access patterns should be coalesced or not. Both modes (coalesced and uncoalesced) should produce correct results, however, you should observe significant speedups below when running with `coalesced` set to `True`.\n", + "\n", + "If you get stuck, feel free to check out [the solution](../edit/solutions/matrix_add_solution.py)." + ] + }, + { + "cell_type": "code", + "execution_count": 66, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "@cuda.jit\n", + "def matrix_add(a, b, out, coalesced):\n", + " x, y = cuda.grid(2)\n", + " \n", + " if coalesced == True:\n", + " out[y][x] = a[y][x] + b[y][x]\n", + " else:\n", + " out[x][y] = a[x][y] + b[x][y]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Check Performance" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Run both cells below to launch `matrix_add` with both the coalesced and uncoalesced access patterns you wrote into it, and observe the performance difference. Additional cells have been provided to confirm the correctness of your kernel." 
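Before running the timing cells, it can help to see why mapping the `x` returned by `cuda.grid(2)` (which follows `threadIdx.x`, the fastest-varying thread index within a warp) onto the *last* array index is the coalesced choice. The short sketch below is an added illustration, not part of the original exercise; it inspects the row-major strides of the same 2048x2048 float32 layout on the host:

```python
import numpy as np

m = np.zeros((2048, 2048), dtype=np.float32)

# For a C-contiguous (row-major) array, element (y, x) lives at byte offset
# y * strides[0] + x * strides[1].
print(m.strides)  # (8192, 4)

# 32 neighbouring x values (what one warp of consecutive threadIdx.x sees)
# span 32 * 4 = 128 contiguous bytes -> one coalesced access.
# 32 neighbouring y values span 32 * 8192 bytes -> 32 scattered accesses.
print("x-neighbours span:", 32 * m.strides[1], "bytes")
print("y-neighbours span:", 32 * m.strides[0], "bytes")
```

Device arrays created with `cuda.to_device` keep this same layout, which is why indexing with `x` in the last position is the fast path in `matrix_add`.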
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Coalesced**" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "203 µs ± 13.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" + ] + } + ], + "source": [ + "%timeit matrix_add[blocks, threads_per_block](d_a, d_b, d_out, True); cuda.synchronize" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "result = d_out.copy_to_host()\n", + "truth = a+b" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 70, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.array_equal(result, truth)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Uncoalesced**" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "586 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" + ] + } + ], + "source": [ + "%timeit matrix_add[blocks, threads_per_block](d_a, d_b, d_out, False); cuda.synchronize" + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "result = d_out.copy_to_host()\n", + "truth = a+b" + ] + }, + { + "cell_type": "code", + "execution_count": 73, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 73, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.array_equal(result, truth)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Shared Memory" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "So far we have been differentiating between host and device memory, as if device memory were a single kind of memory. But in fact, CUDA has an even more fine-grained [memory hierarchy](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-hierarchy). The device memory we have been utilizing thus far is called **global memory** which is available to any thread or block on the device, can persist for the lifetime of the application, and is a relatively large memory space.\n", + "\n", + "We will now discuss how to utilize a region of on-chip device memory called **shared memory**. Shared memory is a programmer defined cache of limited size that [depends on the GPU](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities) being used and is **shared** between all threads in a block. It is a scarce resource, cannot be accessed by threads outside of the block where it was allocated, and does not persist after a kernel finishes executing. Shared memory however has a much higher bandwidth than global memory and can be used to great effect in many kernels, especially to optimize performance.\n", + "\n", + "Here are a few common use cases for shared memory:\n", + "\n", + " * Caching memory read from global memory that will need to be read multiple times within a block.\n", + " * Buffering output from threads so it can be coalesced before writing it back to global memory.\n", + " * Staging data for scatter/gather operations within a block." 
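Because shared memory is scarce and its size varies by GPU, it can be worth checking the per-block limit on the device you are actually running. The sketch below is an addition for reference and assumes Numba's device-attribute lookup (e.g. `MAX_SHARED_MEMORY_PER_BLOCK`) is available in your Numba version:

```python
from numba import cuda

device = cuda.get_current_device()

# Numba exposes CUDA device attributes as upper-case properties on the device object.
print(device.name)
print("Shared memory per block:", device.MAX_SHARED_MEMORY_PER_BLOCK, "bytes")

# A 32x32 tile of float32 values, like the tiles used later in this notebook,
# needs 32 * 32 * 4 = 4096 bytes, well under the typical 48 KB per-block limit.
print("One 32x32 float32 tile:", 32 * 32 * 4, "bytes")
```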
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Shared Memory Syntax" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Numba provides [functions](https://numba.pydata.org/numba-doc/dev/cuda/memory.html#shared-memory-and-thread-synchronization) for allocating shared memory as well as for synchronizing between threads in a block, which is often necessary after parallel threads read from or write to shared memory.\n", + "\n", + "When declaring shared memory, you provide the shape of the shared array, as well as its type, using a [Numba type](https://numba.pydata.org/numba-doc/dev/reference/types.html#numba-types). **The shape of the array must be a constant value**, and therefore, you cannot use arguments passed into the function, or, provided variables like `numba.cuda.blockDim.x`, or the calculated values of `cuda.griddim`. Here is a convoluted example to demonstrate the syntax with comments pointing out the movement from host memory to global device memory, to shared memory, back to global device memory, and finally back to host memory:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Imports**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will use `numba.types` to define the types of values in shared memory." + ] + }, + { + "cell_type": "code", + "execution_count": 74, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "from numba import types, cuda" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Swap Elements Using Shared Memory**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following kernel takes an input vector, where each thread will first write one element of the vector to shared memory, and then, after syncing such that all elements have been written to shared memory, will write one element out of shared memory into the swapped output vector.\n", + "\n", + "Worth noting is that each thread will be writing a swapped value from shared memory that was written into shared memory by another thread." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 75, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "@cuda.jit\n", + "def swap_with_shared(vector, swapped):\n", + " # Allocate a 4 element vector containing int32 values in shared memory.\n", + " temp = cuda.shared.array(4, dtype=types.int32)\n", + " \n", + " idx = cuda.grid(1)\n", + " \n", + " # Move an element from global memory into shared memory\n", + " temp[idx] = vector[idx]\n", + " \n", + " # cuda.syncthreads will force all threads in the block to synchronize here, which is necessary because...\n", + " cuda.syncthreads()\n", + " #...the following operation is reading an element written to shared memory by another thread.\n", + " \n", + " # Move an element from shared memory back into global memory\n", + " swapped[idx] = temp[3 - cuda.threadIdx.x] # swap elements" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Data Creation**" + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "vector = np.arange(4).astype(np.int32)\n", + "swapped = np.zeros_like(vector)\n", + "\n", + "# Move host memory to device (global) memory\n", + "d_vector = cuda.to_device(vector)\n", + "d_swapped = cuda.to_device(swapped)" + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0, 1, 2, 3], dtype=int32)" + ] + }, + "execution_count": 77, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "vector" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "** Run Kernel**" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "swap_with_shared[1, 4](d_vector, d_swapped)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Check Results**" + ] + }, + { + "cell_type": "code", + "execution_count": 79, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([3, 2, 1, 0], dtype=int32)" + ] + }, + "execution_count": 79, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Move device (global) memory back to the host\n", + "result = d_swapped.copy_to_host()\n", + "result" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Presentation: Shared Memory for Memory Coalescing" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Execute the following cell to load the slides, then click on \"Start Slide Show\" to make them full screen." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 80, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " <iframe\n", + " width=\"800\"\n", + " height=\"450\"\n", + " src=\"https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/shared_coalescing.pptx\"\n", + " frameborder=\"0\"\n", + " allowfullscreen\n", + " ></iframe>\n", + " " + ], + "text/plain": [ + "<IPython.lib.display.IFrame at 0x7f9086acc438>" + ] + }, + "execution_count": 80, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from IPython.display import IFrame\n", + "IFrame('https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/shared_coalescing.pptx', 800, 450)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Excercise: Used Shared Memory for Coalesced Reads and Writes With Matrix Transpose" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this exercise you will implement what was just demonstrated in the presentation by writing a matrix transpose kernel which, using shared memory, makes coalesced reads and writes to the output matrix in global memory." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Coalesced Reads, Uncoalesced Writes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As reference, and for performance comparison, here is a naive matrix transpose kernel that makes coalesced reads from input, but uncoalesced writes to output." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Imports**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "from numba import cuda\n", + "import numpy as np" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Data Creation**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here we create a 4096x4096 input matrix `a` as well as a 4096x4096 output matrix `transposed`, and copy them to the device.\n", + "\n", + "We also define a 2-dimensional grid with 2-dimensional blocks to be used below. Note that we have created a grid with a total number of threads equal to the number of elments in the input matrix." + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "n = 4096*4096 # 16M\n", + "\n", + "# 2D blocks\n", + "threads_per_block = (32, 32)\n", + "#2D grid\n", + "blocks = (128, 128)\n", + "\n", + "# 4096x4096 input and output matrices\n", + "a = np.arange(n).reshape((4096,4096)).astype(np.float32)\n", + "transposed = np.zeros_like(a).astype(np.float32)\n", + "\n", + "d_a = cuda.to_device(a)\n", + "d_transposed = cuda.to_device(transposed)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Naive Matrix Transpose Kernel**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This kernel correctly transposes `a`, writing the transposition to `transposed`. It makes reads from `a` in a coalesced fashion, however, its writes to `transposed` are uncoalesced." 
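To make "uncoalesced writes" concrete for the kernel below: a warp of 32 consecutive `threadIdx.x` values reads 32 neighbouring elements from a row of `a`, but writes 32 elements of a *column* of `transposed`, each a full row apart. This host-side sketch is an added illustration (not part of the original notebook) that computes the spread of byte offsets one warp touches on each side:

```python
import numpy as np

N = 4096          # matrix side used in this exercise
ELEM_BYTES = 4    # float32

lane = np.arange(32)  # consecutive threadIdx.x values within one warp
y = 0                 # an arbitrary fixed row index for this warp

# Read side:  a[y][x]          -> 32 consecutive columns of one row
read_offsets = (y * N + lane) * ELEM_BYTES
# Write side: transposed[x][y] -> 32 consecutive rows of one column
write_offsets = (lane * N + y) * ELEM_BYTES

print("read spread: ", read_offsets.max() - read_offsets.min(), "bytes")   # 124
print("write spread:", write_offsets.max() - write_offsets.min(), "bytes") # 507904
```

Whichever index is mapped to `threadIdx.x` will be the fast axis of one matrix and the slow axis of the other, so a direct transpose can never make both sides coalesced; staging a tile in shared memory, as in the next exercise, removes that constraint.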
+ ] + }, + { + "cell_type": "code", + "execution_count": 82, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "@cuda.jit\n", + "def transpose(a, transposed):\n", + " x, y = cuda.grid(2)\n", + "\n", + " transposed[x][y] = a[y][x]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Check Performance**" + ] + }, + { + "cell_type": "code", + "execution_count": 83, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1.59 ms ± 25.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" + ] + } + ], + "source": [ + "%timeit transpose[blocks, threads_per_block](d_a, d_transposed); cuda.synchronize()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Check Correctness**" + ] + }, + { + "cell_type": "code", + "execution_count": 84, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "result = d_transposed.copy_to_host()\n", + "expected = a.T" + ] + }, + { + "cell_type": "code", + "execution_count": 85, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 85, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.array_equal(result, expected)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Refactor for Coalesced Reads and Writes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Your job will be to refactor the `transpose` kernel to use shared memory and make both reads to and writes from global memory in a coalesced fashion." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Imports**" + ] + }, + { + "cell_type": "code", + "execution_count": 86, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "from numba import cuda, types as numba_types" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Data Creation**" + ] + }, + { + "cell_type": "code", + "execution_count": 87, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "n = 4096*4096 # 16M\n", + "\n", + "# 2D blocks\n", + "threads_per_block = (32, 32)\n", + "#2D grid\n", + "blocks = (128, 128)\n", + "\n", + "# 4096x4096 input and output matrices\n", + "a = np.arange(n).reshape((4096,4096)).astype(np.float32)\n", + "transposed = np.zeros_like(a).astype(np.float32)\n", + "\n", + "d_a = cuda.to_device(a)\n", + "d_transposed = cuda.to_device(transposed)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Write a Transpose Kernel that Uses Shared Memory**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Complete the TODOs inside the `tile_transpose` kernel definition.\n", + "\n", + "If you get stuck, feel free to check out [the solution](../edit/solutions/tile_transpose_solution.py)." + ] + }, + { + "cell_type": "code", + "execution_count": 91, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "@cuda.jit\n", + "def tile_transpose(a, transposed):\n", + " # `tile_transpose` assumes it is launched with a 32x32 block dimension,\n", + " # and that `a` is a multiple of these dimensions.\n", + " \n", + " # 1) Create 32x32 shared memory array.\n", + " \n", + " temp = cuda.shared.array((32,32), dtype=types.int32)\n", + "\n", + " # Compute offsets into global input array. 
Recall for coalesced access we want to map threadIdx.x increments to\n", + " # the fastest changing index in the data, i.e. the column in our array.\n", + " # Note: `a_col` and `a_row` are already correct.\n", + " a_col = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n", + " a_row = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y\n", + " \n", + " # 2) Make coalesced read from global memory (using grid indices)\n", + " # into shared memory array (using thread indices).\n", + " \n", + " temp[cuda.threadIdx.y, cuda.threadIdx.x] = a[a_row, a_col]\n", + "\n", + " # 3) Wait for all threads in the block to finish updating shared memory.\n", + " \n", + " cuda.syncthreads()\n", + " \n", + " # 4) Calculate transposed location for the shared memory array tile\n", + " # to be written back to global memory. Note that blockIdx.y*blockDim.y \n", + " # and blockIdx.x* blockDim.x are swapped (because we want to write to the\n", + " # transpose locations), but we want to keep access coalesced, so match up the\n", + " # threadIdx.x to the fastest changing index, i.e. the column./\n", + " # Note: `t_col` and `t_row` are already correct.\n", + " t_col = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.x\n", + " t_row = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.y\n", + "\n", + " # 5) Write from shared memory (using thread indices)\n", + " # back to global memory (using grid indices)\n", + " # transposing each element within the shared memory array.\n", + " \n", + " transposed[t_row, t_col] = temp[cuda.threadIdx.x, cuda.threadIdx.y]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Check Performance**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Check the performance of your refactored transpose kernel. You should see a speedup compared to the baseline transpose performance above." + ] + }, + { + "cell_type": "code", + "execution_count": 92, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1.09 ms ± 60.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" + ] + } + ], + "source": [ + "%timeit tile_transpose[blocks, threads_per_block](d_a, d_transposed); cuda.synchronize()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Check Correctness**" + ] + }, + { + "cell_type": "code", + "execution_count": 93, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "result = d_transposed.copy_to_host()\n", + "expected = a.T" + ] + }, + { + "cell_type": "code", + "execution_count": 94, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 94, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.array_equal(result, expected)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Why Such a Small Improvement?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "While this is a significant speedup for only a few lines of code, but you might think that the performance improvement is not as stark as you expected based on earlier performance improvements to use coalesced access patterns. There are 2 main reasons for this:\n", + "\n", + "1. The naive transpose kernel was making coalesced reads, so, your refactored version only optimized half of the global memory access throughout the execution of the kernel.\n", + "2. 
Your code as written suffers from something called shared memory bank conflicts, a topic to which we will now turn our attention." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Presentation: Memory Bank Conflicts" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Execute the following cell to load the slides, then click on \"Start Slide Show\" to make them full screen." + ] + }, + { + "cell_type": "code", + "execution_count": 95, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " <iframe\n", + " width=\"800\"\n", + " height=\"450\"\n", + " src=\"https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/bank_conflicts.pptx\"\n", + " frameborder=\"0\"\n", + " allowfullscreen\n", + " ></iframe>\n", + " " + ], + "text/plain": [ + "<IPython.lib.display.IFrame at 0x7f908e3479e8>" + ] + }, + "execution_count": 95, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from IPython.display import IFrame\n", + "IFrame('https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/bank_conflicts.pptx', 800, 450)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Assessment: Resolve Memory Bank Conflicts" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As a final exercise, and to get credit towards a certificate in the course for this final section of the workshop, you will refactor the transpose kernel utilizing shared memory to be shared memory bank conflict free." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Imports" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "from numba import cuda, types as numba_types" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data Creation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "n = 4096*4096 # 16M\n", + "threads_per_block = (32, 32)\n", + "blocks = (128, 128)\n", + "\n", + "a = np.arange(n).reshape((4096,4096)).astype(np.float32)\n", + "transposed = np.zeros_like(a).astype(np.float32)\n", + "\n", + "d_a = cuda.to_device(a)\n", + "d_transposed = cuda.to_device(transposed)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Make the Kernel Bank Conflict Free" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `tile_transpose_conflict_free` kernel is a working matrix transpose kernel which utilizes shared memory so that both reads from and writes to global memory are coalesced. Your job is to refactor the kernel so that it does not suffer from memory bank conflicts.\n", + "\n", + "**Note:** Because this final exercise counts towards certification in the course, a solution will not be provided." 
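As background before the kernel below, here is a small host-side sketch (an added illustration, assuming the conventional 32 shared-memory banks of 4-byte words) that counts how many threads of a warp contend for the same bank when one *column* of a float32 tile is read, comparing a 32-element row width with a row width padded by one extra element:

```python
import numpy as np

NUM_BANKS = 32   # assumed number of shared-memory banks
BANK_BYTES = 4   # assumed bank word size
ELEM_BYTES = 4   # float32

def max_conflicts(row_width):
    """Worst-case threads hitting one bank when a warp reads a tile column."""
    rows = np.arange(32)                                  # one row per thread, same column
    words = rows * row_width * ELEM_BYTES // BANK_BYTES   # word index of each element
    banks = words % NUM_BANKS
    return np.bincount(banks, minlength=NUM_BANKS).max()

print(f"row width 32 -> {max_conflicts(32)}-way worst-case bank conflict")
print(f"row width 33 -> {max_conflicts(33)}-way worst-case bank conflict")
```

A 32-way conflict serializes the column read into 32 separate shared-memory cycles, which is exactly the overhead this assessment asks you to remove.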
+ ] + }, + { + "cell_type": "code", + "execution_count": 96, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "@cuda.jit\n", + "def tile_transpose_conflict_free(a, transposed):\n", + " # `tile_transpose` assumes it is launched with a 32x32 block dimension,\n", + " # and that `a` is a multiple of these dimensions.\n", + " \n", + " # 1) Create 32x32 shared memory array.\n", + " tile = cuda.shared.array((32, 33), numba_types.float32)\n", + "\n", + " # Compute offsets into global input array.\n", + " x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n", + " y = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y\n", + " \n", + " # 2) Make coalesced read from global memory into shared memory array.\n", + " # Note the use of local thread indices for the shared memory write,\n", + " # and global offsets for global memory read.\n", + " tile[cuda.threadIdx.y, cuda.threadIdx.x] = a[y, x]\n", + "\n", + " # 3) Wait for all threads in the block to finish updating shared memory.\n", + " cuda.syncthreads()\n", + " \n", + " # 4) Calculate transposed location for the shared memory array tile\n", + " # to be written back to global memory.\n", + " t_x = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.x\n", + " t_y = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.y\n", + "\n", + " # 5) Write back to global memory,\n", + " # transposing each element within the shared memory array.\n", + " transposed[t_y, t_x] = tile[cuda.threadIdx.x, cuda.threadIdx.y]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Check Performance" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Assuming you have correctly resolved the bank conflicts, this kernel should run significantly faster than both the naive transpose kernel, and, the shared memory (with bank conflicts) transpose kernel. In order to pass the assessment, your kernel will need to run on average in less than 840 µs.\n", + "\n", + "The first value printed by running the following cell will give you the average run time of your kernel." + ] + }, + { + "cell_type": "code", + "execution_count": 97, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "804 µs ± 2.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n" + ] + } + ], + "source": [ + "%timeit tile_transpose_conflict_free[blocks, threads_per_block](d_a, d_transposed); cuda.synchronize()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Check Correctness" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In order to pass the assessment, your kernel also needs to work correctly. Run the following 2 cells to confirm this is true." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 98, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "result = d_transposed.copy_to_host()\n", + "expected = a.T" + ] + }, + { + "cell_type": "code", + "execution_count": 99, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 99, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.array_equal(result, expected)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Run the Assessment" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you have completed the refactor, observed it's run time to be less than 840 µs, and confirmed that it runs correctly, execute the following cells to run the assessment against your kernel definition." + ] + }, + { + "cell_type": "code", + "execution_count": 100, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "from assessment import assess" + ] + }, + { + "cell_type": "code", + "execution_count": 101, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Your function took 808.28 µs to run.\n", + "\n", + "Your function runs fast enough (less than 840 µs): True\n", + "\n", + "Your function returns the correct results: True\n", + "\n", + "Congratulations, you passed! See the instructions below for how to get credit for your work to count toward a certificate in the course.\n" + ] + } + ], + "source": [ + "assess(tile_transpose_conflict_free)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Get Credit for Your Work" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After successfully passing the assessment above, revisit the webpage where you launched this interactive environment and click on the **\"ASSESS TASK\"** button as shown in the screenshot below. Doing so will give you credit for this part of the workshop that counts towards earning a **certificate of competency** for the entire course." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Summary" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that you have completed this session you are able to:\n", + "\n", + "* Write CUDA kernels that benefit from coalesced memory access patterns.\n", + "* Work with multi-dimensional grids and thread blocks.\n", + "* Use shared memory to coordinate threads within a block.\n", + "* Use shared memory to facilitate coalesced memory access patterns.\n", + "* Resolve shared memory bank conflicts." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Download Content" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To download the contents of this notebook, execute the following cell and then click the download link below. Note: If you run this notebook on a local Jupyter server, you can expect some of the file path links in the notebook to be broken as they are shaped to our own platform. You can still navigate to the files through the Jupyter file navigator." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "!tar -zcvf section3.tar.gz ." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[Download files from this section.](files/section3.tar.gz)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.10" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} |