From 35b9040e4104b0e79bf243a2c9769c589f96e2c4 Mon Sep 17 00:00:00 2001
From: leshe4ka46 <alex9102naid1@ya.ru>
Date: Sat, 18 Oct 2025 12:25:32 +0300
Subject: mv nvidia1 to other folder

---
 .../Custom+CUDA+Kernels+in+Python+with+Numba.ipynb | 5434 --------------------
 nvidia1/Effective+Memory+Use.ipynb                 | 2073 --------
 .../Introduction+to+CUDA+Python+with+Numba.ipynb   | 2072 --------
 nvidia1/bumashka.png                               |  Bin 76870 -> 0 bytes
 4 files changed, 9579 deletions(-)
 delete mode 100644 nvidia1/Custom+CUDA+Kernels+in+Python+with+Numba.ipynb
 delete mode 100644 nvidia1/Effective+Memory+Use.ipynb
 delete mode 100644 nvidia1/Introduction+to+CUDA+Python+with+Numba.ipynb
 delete mode 100644 nvidia1/bumashka.png

(limited to 'nvidia1')
diff --git a/nvidia1/Custom+CUDA+Kernels+in+Python+with+Numba.ipynb b/nvidia1/Custom+CUDA+Kernels+in+Python+with+Numba.ipynb
deleted file mode 100644
index 9807623..0000000
--- a/nvidia1/Custom+CUDA+Kernels+in+Python+with+Numba.ipynb
+++ /dev/null
@@ -1,5434 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "<a href=\"https://www.nvidia.com/dli\"> <img src=\"images/DLI Header.png\" alt=\"Header\" style=\"width: 400px;\"/> </a>"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# Custom CUDA Kernels in Python with Numba\n",
-    "\n",
-    "In this section we will go further into our understanding of how the CUDA programming model organizes parallel work, and will leverage this understanding to write custom CUDA **kernels**, functions which run in parallel on CUDA GPUs. Custom CUDA kernels, in utilizing the CUDA programming model, require more work to implement than, for example, simply decorating a ufunc with `@vectorize`. However, they make possible parallel computing in places where ufuncs are just not able, and provide a flexibility that can lead to the highest level of performance.\n",
-    "\n",
-    "This section contains three appendices for those of you interested in further study: a variety of debugging techniques to assist your GPU programming, links to CUDA programming references, and coverage of Numba supported random number generation on the GPU."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Objectives\n",
-    "\n",
-    "By the time you complete this section you will be able to:\n",
-    "\n",
-    "* Write custom CUDA kernels in Python and launch them with an execution configuration.\n",
-    "* Utilize grid stride loops for working in parallel over large data sets and leveraging memory coalescing.\n",
-    "* Use atomic operations to avoid race conditions when working in parallel."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## The Need for Custom Kernels"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Ufuncs are fantastically elegant, and for any scalar operation that ought to be performed element wise on data, ufuncs are likely the right tool for the job.\n",
-    "\n",
-    "As you are well aware, there are many classes of problems that cannot be solved by applying the same function to each element of a data set. Consider, for example, any problem that requires access to more than one element of a data structure in order to calculate its output, like stencil algorithms, or any problem that cannot be expressed by a one input value to one output value mapping, such as a reduction. Many of these problems are still inherently parallelizable, but cannot be expressed by a ufunc.\n",
-    "\n",
-    "Writing custom CUDA kernels, while more challenging than writing GPU accelerated ufuncs, provides developers with tremendous flexibility for the types of functions they can send to run in parallel on the GPU. Furthermore, as you will begin learning in this and the next section, it also provides fine-grained control over *how* the parallelism is conducted by exposing CUDA's thread hierarchy to developers explicitly.\n",
-    "\n",
-    "While remaining purely in Python, the way we write CUDA kernels using Numba is very reminiscent of how developers write them in CUDA C/C++. If you are familiar with programming in CUDA C/C++, you will likely pick up custom kernels in Python with Numba very rapidly. If you are learning them for the first time, know that the work you do here will also serve you well should you ever need or wish to develop CUDA in C/C++, or to study the wealth of CUDA resources on the web, most of which show CUDA C/C++ code."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Introduction to CUDA Kernels\n",
-    "\n",
-    "When programming in CUDA, developers write functions for the GPU called **kernels**, which are executed, or in CUDA parlance, **launched**, on the GPU's many cores in parallel **threads**. When kernels are launched, programmers use a special syntax, called an **execution configuration** (also called a launch configuration) to describe the parallel execution's configuration.\n",
-    "\n",
-    "The following slides (which will appear after executing the cell below) give a high level introduction to how CUDA kernels can be created to work on large datasets in parallel on the GPU device. Work through the slides and then you will begin writing and executing your own custom CUDA kernels, using the ideas presented in the slides."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 3,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/html": [
-       "\n",
-       "        <iframe\n",
-       "            width=\"640\"\n",
-       "            height=\"390\"\n",
-       "            src=\"https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/AC_CUDA_Python_1.pptx\"\n",
-       "            frameborder=\"0\"\n",
-       "            allowfullscreen\n",
-       "        ></iframe>\n",
-       "        "
-      ],
-      "text/plain": [
-       "<IPython.lib.display.IFrame at 0x7f8d844ee668>"
-      ]
-     },
-     "execution_count": 3,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "from IPython.display import IFrame\n",
-    "IFrame('https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/AC_CUDA_Python_1.pptx', 640, 390)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## A First CUDA Kernel\n",
-    "\n",
-    "Let's start with a concrete, and very simple example by rewriting our addition function for 1D NumPy arrays. CUDA kernels are compiled using the `numba.cuda.jit` decorator. `numba.cuda.jit` is not to be confused with the `numba.jit` decorator you've already learned which optimizes functions **for the CPU**.\n",
-    "\n",
-    "We will begin with a very simple example to highlight some of the essential syntax. This particular function could in fact be written as a ufunc, but we choose it here to keep the focus on learning the syntax. Below, we will proceed to functions better suited to being written as custom kernels. Be sure to read the comments carefully, as they provide some important information about the code."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 4,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "from numba import cuda\n",
-    "\n",
-    "# Note the use of an `out` array. CUDA kernels written with `@cuda.jit` do not return values,\n",
-    "# just like their C counterparts. Also, no explicit type signature is required with @cuda.jit\n",
-    "@cuda.jit\n",
-    "def add_kernel(x, y, out):\n",
-    "    \n",
-    "    # The actual values of the following CUDA-provided variables for thread and block indices,\n",
-    "    # like function parameters, are not known until the kernel is launched.\n",
-    "    \n",
-    "    # This calculation gives a unique thread index within the entire grid (see the slides above for more)\n",
-    "    idx = cuda.grid(1)          # 1 = one dimensional thread grid, returns a single value.\n",
-    "                                # This Numba-provided convenience function is equivalent to\n",
-    "                                # `cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x`\n",
-    "\n",
-    "    # This thread will do the work on the data element with the same index as its own\n",
-    "    # unique index within the grid.\n",
-    "    out[idx] = x[idx] + y[idx]"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 5,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import numpy as np\n",
-    "\n",
-    "n = 4096\n",
-    "x = np.arange(n).astype(np.int32) # [0...4095] on the host\n",
-    "y = np.ones_like(x)               # [1...1] on the host\n",
-    "\n",
-    "d_x = cuda.to_device(x) # Copy of x on the device\n",
-    "d_y = cuda.to_device(y) # Copy of y on the device\n",
-    "d_out = cuda.device_array_like(d_x) # Like np.empty_like, but for device arrays\n",
-    "\n",
-    "# Because of how we wrote the kernel above, we need a one-thread-to-one-data-element mapping,\n",
-    "# therefore we define the number of threads in the grid (128*32) to equal n (4096).\n",
-    "threads_per_block = 128\n",
-    "blocks_per_grid = 32"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 6,
-   "metadata": {
-    "scrolled": true
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "[   1    2    3 ... 4094 4095 4096]\n"
-     ]
-    }
-   ],
-   "source": [
-    "add_kernel[blocks_per_grid, threads_per_block](d_x, d_y, d_out)\n",
-    "cuda.synchronize()\n",
-    "print(d_out.copy_to_host()) # Should be [1...4096]"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Exercise: Tweak the Code\n",
-    "\n",
-    "Make the following minor changes to the code above to see how it affects its execution. Make educated guesses about what will happen before running the code:\n",
-    "\n",
-    "* Decrease the `threads_per_block` variable\n",
-    "* Decrease the `blocks_per_grid` variable\n",
-    "* Increase the `threads_per_block` and/or `blocks_per_grid` variables\n",
-    "* Remove or comment out the `cuda.synchronize()` call\n",
-    "\n",
-    "### Results\n",
-    "\n",
-    "In the example above, because the kernel is written so that each thread works on exactly one data element, it is essential for the number of threads in the grid to equal the number of data elements.\n",
-    "\n",
-    "When you **reduce the number of threads in the grid**, whether by reducing the number of blocks or the number of threads per block, some elements are left unprocessed, and the output shows that the elements toward the end of the `d_out` array never received a result. Because `cuda.grid(1)` computes each index from the block size used at launch, the threads still cover a contiguous range starting at index 0, so the unprocessed elements are always those at the end of the array.\n",
-    "\n",
-    "**Increasing the size of the grid** in fact creates issues with out of bounds memory access. This error will not show in your code presently, but later in this section you will learn how to expose this error using `cuda-memcheck` and debug it.\n",
-    "\n",
-    "You might have expected that **removing the synchronization point** would have resulted in a print showing that little or no work had been done. This is a reasonable guess since without a synchronization point the CPU will work asynchronously while the GPU is processing. The detail to learn here is that memory copies carry implicit synchronization, making the call to `cuda.synchronize` above unnecessary."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Exercise: Accelerate a CPU Function as a Custom CUDA Kernel\n",
-    "\n",
-    "Below is a CPU scalar function `square_device` that could be used as a CPU ufunc. Your job is to refactor it to run as a CUDA kernel decorated with the `@cuda.jit` decorator.\n",
-    "\n",
-    "You might think that making this function run on the device could be much more easily done with `@vectorize`, and you would be correct. But this scenario will give you a chance to work with all the syntax we've introduced before moving on to more complicated and realistic examples.\n",
-    "\n",
-    "In this exercise you will need to:\n",
-    "* Refactor the `square_device` definition to be a CUDA kernel that will do one thread's worth of work on a single element.\n",
-    "* Refactor the `d_a` and `d_out` arrays below to be CUDA device arrays.\n",
-    "* Modify the `blocks` and `threads` variables to appropriate values for the provided `n`.\n",
-    "* Refactor the call to `square_device` to be a kernel launch that includes an execution configuration.\n",
-    "\n",
-    "The assertion test below will fail until you successfully implement the above. If you get stuck, feel free to check out a [solution](../edit/solutions/square_device_solution.py)."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 33,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "# Refactor to be a CUDA kernel doing one thread's work.\n",
-    "# Don't forget that when using `@cuda.jit`, you must provide an output array as no value will be returned.\n",
-    "def square_device(a):\n",
-    "    return a**2\n",
-    "\n",
-    "@cuda.jit\n",
-    "def square_kernel(a, out):\n",
-    "    idx = cuda.grid(1)\n",
-    "    out[idx] = a[idx]*a[idx]"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 34,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "# Leave the values in this cell fixed for this exercise\n",
-    "n = 4096\n",
-    "\n",
-    "a = np.arange(n)\n",
-    "out = a**2 # `out` will only be used for testing below"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 35,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "d_a = cuda.to_device(a)                  # TODO make `d_a` a device array\n",
-    "d_out = cuda.device_array_like(a) # TODO: make d_out a device array\n",
-    "\n",
-    "# TODO: Update the execution configuration for the amount of work needed\n",
-    "blocks = 128\n",
-    "threads = 32\n",
-    "\n",
-    "# TODO: Launch as a kernel with an appropriate execution configuration\n",
-    "# d_out = square_device(d_a)\n",
-    "square_kernel[blocks, threads](d_a, d_out)\n",
-    "cuda.synchronize()\n",
-    "d_out = d_out.copy_to_host()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 36,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from numpy import testing\n",
-    "testing.assert_almost_equal(d_out, out)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## An Aside on Hiding Latency and Execution Configuration Choices"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "CUDA enabled NVIDIA GPUs consist of several [**Streaming Multiprocessors**](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-implementation), or **SMs** on a die, with attached DRAM. SMs contain all required resources for the execution of kernel code including many CUDA cores. When a kernel is launched, each block is assigned to a single SM, with potentially many blocks assigned to a single SM. SMs partition blocks into further subdivisions of 32 threads called **warps** and it is these warps which are given parallel instructions to execute.\n",
-    "\n",
-    "When an instruction takes more than one clock cycle to complete (or in CUDA parlance, to **expire**) the SM can continue to do meaningful work *if it has additional warps that are ready to be issued new instructions.* Because of very large register files on the SMs, there is no time penalty for an SM to change context between issuing instructions to one warp or another. In short, the latency of operations can be hidden by SMs with other meaningful work so long as there is other work to be done.\n",
-    "\n",
-    "**Therefore, to utilize the full potential of the GPU, and thereby write performant accelerated applications, it is essential to give SMs the ability to hide latency by providing them with a sufficient number of warps, which can be accomplished most simply by executing kernels with sufficiently large grid and block dimensions.**\n",
-    "\n",
-    "Deciding the very best size for the CUDA thread grid is a complex problem, and depends on both the algorithm and the specific GPU's [compute capability](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities), but here are some very rough heuristics that we tend to follow and which can work well for getting started:\n",
-    "\n",
-    "  * The size of a block should be a multiple of 32 threads (the size of a warp), with typical block sizes between 128 and 512 threads per block.\n",
-    "  * The size of the grid should ensure the full GPU is utilized where possible. Launching a grid where the number of blocks is 2x-4x the number of SMs on the GPU is a good starting place. Something in the range of 20 - 100 blocks is usually a good starting point.\n",
-    "  * The CUDA kernel launch overhead does increase with the number of blocks, so when the input size is very large we find it best not to launch a grid where the number of threads equals the number of input elements, which would result in a tremendous number of blocks. Instead we use a pattern to which we will now turn our attention for dealing with large inputs."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Working on Large Datasets with Grid Stride Loops\n",
-    "\n",
-    "The following slides give a high level overview of a technique called a **grid stride loop** which will create flexible kernels where each thread is able to work on more than one data element, an essential technique for large datasets. Execute the cell to load the slides."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 37,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/html": [
-       "\n",
-       "        <iframe\n",
-       "            width=\"640\"\n",
-       "            height=\"390\"\n",
-       "            src=\"https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/AC_CUDA_Python_2.pptx\"\n",
-       "            frameborder=\"0\"\n",
-       "            allowfullscreen\n",
-       "        ></iframe>\n",
-       "        "
-      ],
-      "text/plain": [
-       "<IPython.lib.display.IFrame at 0x7f8d53069710>"
-      ]
-     },
-     "execution_count": 37,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "from IPython.display import IFrame\n",
-    "IFrame('https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/AC_CUDA_Python_2.pptx', 640, 390)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## A First Grid Stride Loop\n",
-    "\n",
-    "Let's refactor the `add_kernel` above to utilize a grid stride loop so that we can launch it to work on larger data sets flexibly while gaining the benefits of global **memory coalescing**, which allows parallel threads to access memory in contiguous chunks, a scenario which the GPU can leverage to reduce the total number of memory operations:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 39,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "from numba import cuda\n",
-    "\n",
-    "@cuda.jit\n",
-    "def add_kernel(x, y, out):\n",
-    "    \n",
-    "\n",
-    "    start = cuda.grid(1)\n",
-    "    \n",
-    "    # This calculation gives the total number of threads in the entire grid\n",
-    "    stride = cuda.gridsize(1)   # 1 = one dimensional thread grid, returns a single value.\n",
-    "                                # This Numba-provided convenience function is equivalent to\n",
-    "                                # `cuda.blockDim.x * cuda.gridDim.x`\n",
-    "\n",
-    "    # This thread will start work at the data element index equal to that of its own\n",
-    "    # unique index in the grid, and then, will stride the number of threads in the grid each\n",
-    "    # iteration so long as it has not stepped out of the data's bounds. In this way, each\n",
-    "    # thread may work on more than one data element, and together, all threads will work on\n",
-    "    # every data element.\n",
-    "    for i in range(start, x.shape[0], stride):\n",
-    "        # Assuming x and y inputs are same length\n",
-    "        out[i] = x[i] + y[i]"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 42,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "(100000,)\n"
-     ]
-    }
-   ],
-   "source": [
-    "import numpy as np\n",
-    "\n",
-    "n = 100000 # This is far more elements than threads in our grid\n",
-    "x = np.arange(n).astype(np.int32)\n",
-    "y = np.ones_like(x)\n",
-    "\n",
-    "d_x = cuda.to_device(x)\n",
-    "print(d_x.shape)\n",
-    "d_y = cuda.to_device(y)\n",
-    "d_out = cuda.device_array_like(d_x)\n",
-    "\n",
-    "threads_per_block = 128\n",
-    "blocks_per_grid = 30"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 43,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "[     1      2      3 ...  99998  99999 100000]\n"
-     ]
-    }
-   ],
-   "source": [
-    "add_kernel[blocks_per_grid, threads_per_block](d_x, d_y, d_out)\n",
-    "print(d_out.copy_to_host()) # Remember, memory copy carries implicit synchronization"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Exercise: Implement a Grid Stride Loop\n",
-    "\n",
-    "Refactor the following CPU scalar `hypot_stride` function to run as a CUDA Kernel utilizing a grid stride loop. Feel free to look at [the solution](../edit/solutions/hypot_stride_solution.py) if you get stuck."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 48,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "from math import hypot\n",
-    "from numba import cuda\n",
-    "\n",
-    "@cuda.jit\n",
-    "def hypot_stride(a, b, c):\n",
-    "    start = cuda.grid(1)\n",
-    "    stride = cuda.gridsize(1)\n",
-    "    for i in range(start, a.shape[0], stride):\n",
-    "        c[i] = hypot(a[i], b[i])"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 49,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# You do not need to modify the contents in this cell\n",
-    "n = 1000000\n",
-    "a = np.random.uniform(-12, 12, n).astype(np.float32)\n",
-    "b = np.random.uniform(-12, 12, n).astype(np.float32)\n",
-    "d_a = cuda.to_device(a)\n",
-    "d_b = cuda.to_device(b)\n",
-    "d_c = cuda.device_array_like(d_b)\n",
-    "\n",
-    "blocks = 128\n",
-    "threads_per_block = 64\n",
-    "\n",
-    "hypot_stride[blocks, threads_per_block](d_a, d_b, d_c)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 50,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "from numpy import testing\n",
-    "# This assertion will fail until you successfully implement the hypot_stride kernel above\n",
-    "testing.assert_almost_equal(np.hypot(a,b), d_c.copy_to_host(), decimal=5)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Timing the Kernel\n",
-    "\n",
-    "Let's take the time to do some performance timing for the `hypot_stride` kernel. If you weren't able to successfully implement it, copy and execute [the solution](../edit/solutions/hypot_stride_solution.py) before timing."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### CPU Baseline\n",
-    "\n",
-    "First let's get a baseline with `np.hypot`:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 51,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "6.07 ms ± 2.38 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "%timeit np.hypot(a, b)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Numba on the CPU\n",
-    "\n",
-    "Next let's see about a CPU optimized version:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 52,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "from numba import jit\n",
-    "\n",
-    "@jit\n",
-    "def numba_hypot(a, b):\n",
-    "    return np.hypot(a, b)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 53,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "5.68 ms ± 1.72 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "%timeit numba_hypot(a, b)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Single Threaded on the Device\n",
-    "\n",
-    "Just to see, let's launch our kernel in a grid with only a single thread. Here we will use `%time`, which only runs the statement once to ensure our measurement isn't affected by the finite depth of the CUDA kernel queue. We will also add a `cuda.synchronize` to be sure we don't get any inaccurate times on account of returning control to the CPU, where the timer is, before the kernel completes:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 54,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "CPU times: user 188 ms, sys: 128 ms, total: 316 ms\n",
-      "Wall time: 316 ms\n"
-     ]
-    }
-   ],
-   "source": [
-    "%time hypot_stride[1, 1](d_a, d_b, d_c); cuda.synchronize()"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "It is hopefully not too much of a surprise that this is far slower than even the baseline CPU execution."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Parallel on the Device"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 55,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n",
-      "Wall time: 696 µs\n"
-     ]
-    }
-   ],
-   "source": [
-    "%time hypot_stride[128, 64](d_a, d_b, d_c); cuda.synchronize()"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "That's much faster!"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Atomic Operations and Avoiding Race Conditions\n",
-    "\n",
-    "CUDA, like many general purpose parallel execution frameworks, makes it possible to have race conditions in your code. A race condition in CUDA arises when threads read from or write to a memory location that might be modified by another independent thread. Generally speaking, you need to worry about:\n",
-    "\n",
-    " * read-after-write hazards: One thread is reading a memory location at the same time another thread might be writing to it.\n",
-    " * write-after-write hazards: Two threads are writing to the same memory location, and only one write will be visible when the kernel is complete.\n",
-    " \n",
-    "A common strategy to avoid both of these hazards is to organize your CUDA kernel algorithm such that each thread has exclusive responsibility for unique subsets of output array elements, and/or to never use the same array for both input and output in a single kernel call. (Iterative algorithms can use a double-buffering strategy if needed, and switch input and output arrays on each iteration.)\n",
-    "\n",
-    "However, there are many cases where different threads need to combine results. Consider something very simple, like: \"every thread increments a global counter.\" Implementing this in your kernel requires each thread to:\n",
-    "\n",
-    "1. Read the current value of a global counter.\n",
-    "2. Compute `counter + 1`.\n",
-    "3. Write that value back to global memory.\n",
-    "\n",
-    "However, there is no guarantee that another thread has not changed the global counter between steps 1 and 3. To resolve this problem, CUDA provides **atomic operations** which will read, modify and update a memory location in one, indivisible step. Numba supports several of these functions, [described here](http://numba.pydata.org/numba-doc/dev/cuda/intrinsics.html#supported-atomic-operations).\n",
-    "\n",
-    "Let's make our thread counter kernel:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 56,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "@cuda.jit\n",
-    "def thread_counter_race_condition(global_counter):\n",
-    "    global_counter[0] += 1  # This is bad\n",
-    "    \n",
-    "@cuda.jit\n",
-    "def thread_counter_safe(global_counter):\n",
-    "    cuda.atomic.add(global_counter, 0, 1)  # Safely add 1 to offset 0 in global_counter array"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 57,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Should be 4096: [1]\n"
-     ]
-    }
-   ],
-   "source": [
-    "# This gets the wrong answer\n",
-    "global_counter = cuda.to_device(np.array([0], dtype=np.int32))\n",
-    "thread_counter_race_condition[64, 64](global_counter)\n",
-    "\n",
-    "print('Should be %d:' % (64*64), global_counter.copy_to_host())"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 58,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Should be 4096: [4096]\n"
-     ]
-    }
-   ],
-   "source": [
-    "# This works correctly\n",
-    "global_counter = cuda.to_device(np.array([0], dtype=np.int32))\n",
-    "thread_counter_safe[64, 64](global_counter)\n",
-    "\n",
-    "print('Should be %d:' % (64*64), global_counter.copy_to_host())"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Assessment\n",
-    "\n",
-    "The following exercise will require you to utilize everything you've learned so far. Unlike previous exercises, there will not be any solution code available to you, and, there are a couple additional steps you will need to take to \"run the assessment\" and get a score for your attempt(s). **Please read the directions carefully before beginning your work to ensure the best chance at successfully completing the assessment.**"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### How to Run the Assessment\n",
-    "\n",
-    "Take the following steps to complete this assessment:\n",
-    "\n",
-    "1. Using the instructions that follow, work on the cells below as you usually would for an exercise.\n",
-    "2. When you are satisfied with your work, follow the instructions below to copy and paste code into the linked source code files. Be sure to save the files after you paste your work.\n",
-    "3. Return to the browser tab you used to launch this notebook, and click on the **\"Assess\"** button. After a few seconds a score will be generated along with a helpful message.\n",
-    "\n",
-    "You are welcome to click the **Assess** button as many times as you like. If you don't pass the first time, feel free to modify your code and repeat steps 1 through 3. Good luck!"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "![Run the assessment](images/run_the_assessment.png)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "collapsed": true
-   },
-   "source": [
-    "### Write an Accelerated Histogramming Kernel\n",
-    "\n",
-    "For this assessment, you will create an accelerated histogramming kernel. This will take an array of input data, a range, and a number of bins, and count how many of the input data elements land in each bin. Below is a working CPU implementation of histogramming to serve as an example for your work:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 62,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "def cpu_histogram(x, xmin, xmax, histogram_out):\n",
-    "    '''Increment bin counts in histogram_out, given histogram range [xmin, xmax).'''\n",
-    "    # Note that we don't have to pass in nbins explicitly, because the size of histogram_out determines it\n",
-    "    nbins = histogram_out.shape[0]\n",
-    "    bin_width = (xmax - xmin) / nbins\n",
-    "    \n",
-    "    # This is a very slow way to do this with NumPy, but looks similar to what you will do on the GPU\n",
-    "    for element in x:\n",
-    "        bin_number = np.int32((element - xmin)/bin_width)\n",
-    "        if bin_number >= 0 and bin_number < histogram_out.shape[0]:\n",
-    "            # only increment if in range\n",
-    "            histogram_out[bin_number] += 1"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 63,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "array([   3,   88,  445, 1576, 2969, 2854, 1548,  442,   72,    3],\n",
-       "      dtype=int32)"
-      ]
-     },
-     "execution_count": 63,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "x = np.random.normal(size=10000, loc=0, scale=1).astype(np.float32)\n",
-    "xmin = np.float32(-4.0)\n",
-    "xmax = np.float32(4.0)\n",
-    "histogram_out = np.zeros(shape=10, dtype=np.int32)\n",
-    "\n",
-    "cpu_histogram(x, xmin, xmax, histogram_out)\n",
-    "\n",
-    "histogram_out"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Using a grid stride loop and atomic operations, implement your solution in the cell below. After making any modifications, and before running the assessment, paste this cell's content into [**`assessment/histogram.py`**](../edit/assessment/histogram.py) and save it."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 73,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "@cuda.jit\n",
-    "def cuda_histogram(x, xmin, xmax, histogram_out):\n",
-    "    '''Increment bin counts in histogram_out, given histogram range [xmin, xmax).'''\n",
-    "    nbins = histogram_out.shape[0]\n",
-    "    bin_width = (xmax - xmin) / nbins\n",
-    "    \n",
-    "    start = cuda.grid(1)\n",
-    "    stride = cuda.gridsize(1)\n",
-    "    for i in range(start, x.shape[0], stride):\n",
-    "        bin_number = np.int32((x[i] - xmin)/bin_width)\n",
-    "        if bin_number >= 0 and bin_number < histogram_out.shape[0]:\n",
-    "            cuda.atomic.add(histogram_out, bin_number, 1)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 74,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "d_x = cuda.to_device(x)\n",
-    "d_histogram_out = cuda.to_device(np.zeros(shape=10, dtype=np.int32))\n",
-    "\n",
-    "blocks = 128\n",
-    "threads_per_block = 64\n",
-    "\n",
-    "cuda_histogram[blocks, threads_per_block](d_x, xmin, xmax, d_histogram_out)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 75,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "# This assertion will fail until you correctly implement `cuda_histogram`\n",
-    "np.testing.assert_array_almost_equal(d_histogram_out.copy_to_host(), histogram_out, decimal=2)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Summary\n",
-    "\n",
-    "In this section you learned how to:\n",
-    "\n",
-    "* Write custom CUDA kernels in Python and launch them with an execution configuration.\n",
-    "* Utilize grid stride loops for working in parallel over large data sets and leveraging memory coalescing.\n",
-    "* Use atomic operations to avoid race conditions when working in parallel."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Download Content\n",
-    "\n",
-    "To download the contents of this notebook, execute the following cell and then click the download link below. Note: If you run this notebook on a local Jupyter server, you can expect some of the file path links in the notebook to be broken as they are shaped to our own platform. You can still navigate to the files through the Jupyter file navigator."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "!tar -zcvf section2.tar.gz ."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "[Download files from this section.](files/section2.tar.gz)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Appendix: Troubleshooting and Debugging\n",
-    "\n",
-    "### Note about the Terminal\n",
-    "\n",
-    "Debugging is an important part of programming.  Unfortunately, it is pretty difficult to debug CUDA kernels directly in the Jupyter notebook for a variety of reasons, so this notebook will show terminal commands by executing Jupyter notebook cells using the shell.  These shell commands will appear in notebook cells with the command line prefixed by `!`. When applying the debug methods described in this notebook, you will likely run the commands in the terminal directly.\n",
-    "\n",
-    "### Printing\n",
-    "\n",
-    "A common debugging strategy is printing to the console.  Numba supports printing from CUDA kernels, with some restrictions.  Note that output printed from a CUDA kernel will not be captured by Jupyter, so you will need to debug with a script you can run from the terminal.\n",
-    "\n",
-    "Let's look at a CUDA kernel with a bug:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 59,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "import numpy as np\r\n",
-      "\r\n",
-      "from numba import cuda\r\n",
-      "\r\n",
-      "@cuda.jit\r\n",
-      "def histogram(x, xmin, xmax, histogram_out):\r\n",
-      "    nbins = histogram_out.shape[0]\r\n",
-      "    bin_width = (xmax - xmin) / nbins\r\n",
-      "\r\n",
-      "    start = cuda.grid(1)\r\n",
-      "    stride = cuda.gridsize(1)\r\n",
-      "\r\n",
-      "    for i in range(start, x.shape[0], stride):\r\n",
-      "        bin_number = np.int32((x[i] - xmin)/bin_width)\r\n",
-      "        if bin_number >= 0 and bin_number < histogram_out.shape[0]:\r\n",
-      "            histogram_out[bin_number] += 1\r\n",
-      "\r\n",
-      "x = np.random.normal(size=50, loc=0, scale=1).astype(np.float32)\r\n",
-      "xmin = np.float32(-4.0)\r\n",
-      "xmax = np.float32(4.0)\r\n",
-      "histogram_out = np.zeros(shape=10, dtype=np.int32)\r\n",
-      "\r\n",
-      "histogram[64, 64](x, xmin, xmax, histogram_out)\r\n",
-      "\r\n",
-      "print('input count:', x.shape[0])\r\n",
-      "print('histogram:', histogram_out)\r\n",
-      "print('count:', histogram_out.sum())\r\n"
-     ]
-    }
-   ],
-   "source": [
-    "! cat debug/ex1.py"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "collapsed": true
-   },
-   "source": [
-    "When we run this code to histogram 50 values, we see the histogram is not getting 50 entries: "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 76,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "input count: 50\r\n",
-      "histogram: [0 1 1 1 1 1 1 0 0 0]\r\n",
-      "count: 6\r\n"
-     ]
-    }
-   ],
-   "source": [
-    "! python debug/ex1.py"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "*(You might have already spotted the mistake, but let's pretend we don't know the answer.)*\n",
-    "\n",
-    "We hypothesize that maybe a bin calculation error is causing many of the histogram entries to appear out of range.  Let's add some printing around the `if` statement to show us what is going on:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 61,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "import numpy as np\r\n",
-      "\r\n",
-      "from numba import cuda\r\n",
-      "\r\n",
-      "@cuda.jit\r\n",
-      "def histogram(x, xmin, xmax, histogram_out):\r\n",
-      "    nbins = histogram_out.shape[0]\r\n",
-      "    bin_width = (xmax - xmin) / nbins\r\n",
-      "\r\n",
-      "    start = cuda.grid(1)\r\n",
-      "    stride = cuda.gridsize(1)\r\n",
-      "\r\n",
-      "    for i in range(start, x.shape[0], stride):\r\n",
-      "        bin_number = np.int32((x[i] - xmin)/bin_width)\r\n",
-      "        if bin_number >= 0 and bin_number < histogram_out.shape[0]:\r\n",
-      "            histogram_out[bin_number] += 1\r\n",
-      "            print('in range', x[i], bin_number)\r\n",
-      "        else:\r\n",
-      "            print('out of range', x[i], bin_number)\r\n",
-      "\r\n",
-      "x = np.random.normal(size=50, loc=0, scale=1).astype(np.float32)\r\n",
-      "xmin = np.float32(-4.0)\r\n",
-      "xmax = np.float32(4.0)\r\n",
-      "histogram_out = np.zeros(shape=10, dtype=np.int32)\r\n",
-      "\r\n",
-      "histogram[64, 64](x, xmin, xmax, histogram_out)\r\n",
-      "\r\n",
-      "print('input count:', x.shape[0])\r\n",
-      "print('histogram:', histogram_out)\r\n",
-      "print('count:', histogram_out.sum())\r\n"
-     ]
-    }
-   ],
-   "source": [
-    "! cat debug/ex1a.py"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "This kernel will print every value and bin number it calculates.  Looking at one of the print statements, we see that `print` supports constant strings and scalar values:\n",
-    "\n",
-    "``` python\n",
-    "print('in range', x[i], bin_number)\n",
-    "```\n",
-    "\n",
-    "String substitution (using C printf syntax or the newer `format()` syntax) is not supported.  If we run this script we see:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 77,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "in range 0.261026 5\r\n",
-      "in range -2.102431 2\r\n",
-      "in range 0.799183 5\r\n",
-      "in range 1.051908 6\r\n",
-      "in range -0.201711 4\r\n",
-      "in range -1.698864 2\r\n",
-      "in range 0.248762 5\r\n",
-      "in range 1.782836 7\r\n",
-      "in range -0.594408 4\r\n",
-      "in range 1.867431 7\r\n",
-      "in range 0.418070 5\r\n",
-      "in range 0.365282 5\r\n",
-      "in range -0.655639 4\r\n",
-      "in range 0.817385 6\r\n",
-      "in range 0.646000 5\r\n",
-      "in range 0.776718 5\r\n",
-      "in range -0.665656 4\r\n",
-      "in range 0.431279 5\r\n",
-      "in range 0.480257 5\r\n",
-      "in range 0.769916 5\r\n",
-      "in range 0.386032 5\r\n",
-      "in range -0.824273 3\r\n",
-      "in range -0.310682 4\r\n",
-      "in range -1.554290 3\r\n",
-      "in range 1.897843 7\r\n",
-      "in range -0.788933 4\r\n",
-      "in range -0.509624 4\r\n",
-      "in range -0.854971 3\r\n",
-      "in range 0.470186 5\r\n",
-      "in range 1.196934 6\r\n",
-      "in range 0.821883 6\r\n",
-      "in range 1.011266 6\r\n",
-      "in range -3.438190 0\r\n",
-      "in range 0.612806 5\r\n",
-      "in range 0.789266 5\r\n",
-      "in range -2.211243 2\r\n",
-      "in range 1.039794 6\r\n",
-      "in range 2.000385 7\r\n",
-      "in range -1.390927 3\r\n",
-      "in range 1.432608 6\r\n",
-      "in range 0.208954 5\r\n",
-      "in range -1.194161 3\r\n",
-      "in range 0.558909 5\r\n",
-      "in range 0.494454 5\r\n",
-      "in range 0.149325 5\r\n",
-      "in range -0.593924 4\r\n",
-      "in range 0.702312 5\r\n",
-      "in range 0.765463 5\r\n",
-      "in range -1.847362 2\r\n",
-      "in range 2.459083 8\r\n",
-      "input count: 50\r\n",
-      "histogram: [1 0 1 1 1 1 1 1 1 0]\r\n",
-      "count: 8\r\n"
-     ]
-    }
-   ],
-   "source": [
-    "! python debug/ex1a.py"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Scanning down that output, we see that all 50 values are in range, yet the final histogram counts only 8 of them.  Clearly we have some kind of race condition updating the histogram.  In fact, the culprit line is:\n",
-    "\n",
-    "``` python\n",
-    "histogram_out[bin_number] += 1\n",
-    "```\n",
-    "\n",
-    "which should be (as you may have seen in a previous exercise)\n",
-    "\n",
-    "``` python\n",
-    "cuda.atomic.add(histogram_out, bin_number, 1)\n",
-    "```"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "collapsed": true
-   },
-   "source": [
-    "### CUDA Simulator\n",
-    "\n",
-    "Back in the early days of CUDA, `nvcc` had an \"emulator\" mode that would execute CUDA code on the CPU for debugging.  That functionality was dropped in later CUDA releases after `cuda-gdb` was created.  There isn't a debugger for CUDA+Python, so Numba includes a \"CUDA simulator\" that runs your CUDA code with the Python interpreter on the host CPU.  This allows you to debug the logic of your code using Python modules and functions that would otherwise not be allowed by the compiler.\n",
-    "\n",
-    "A very common use case is to start the Python debugger inside one thread of a CUDA kernel:\n",
-    "``` python\n",
-    "import numpy as np\n",
-    "\n",
-    "from numba import cuda\n",
-    "\n",
-    "@cuda.jit\n",
-    "def histogram(x, xmin, xmax, histogram_out):\n",
-    "    nbins = histogram_out.shape[0]\n",
-    "    bin_width = (xmax - xmin) / nbins\n",
-    "\n",
-    "    start = cuda.grid(1)\n",
-    "    stride = cuda.gridsize(1)\n",
-    "\n",
-    "    ### DEBUG FIRST THREAD\n",
-    "    if start == 0:\n",
-    "        from pdb import set_trace; set_trace()\n",
-    "    ###\n",
-    "\n",
-    "    for i in range(start, x.shape[0], stride):\n",
-    "        bin_number = np.int32((x[i] + xmin)/bin_width)\n",
-    "\n",
-    "        if bin_number >= 0 and bin_number < histogram_out.shape[0]:\n",
-    "            cuda.atomic.add(histogram_out, bin_number, 1)\n",
-    "\n",
-    "x = np.random.normal(size=50, loc=0, scale=1).astype(np.float32)\n",
-    "xmin = np.float32(-4.0)\n",
-    "xmax = np.float32(4.0)\n",
-    "histogram_out = np.zeros(shape=10, dtype=np.int32)\n",
-    "\n",
-    "histogram[64, 64](x, xmin, xmax, histogram_out)\n",
-    "\n",
-    "print('input count:', x.shape[0])\n",
-    "print('histogram:', histogram_out)\n",
-    "print('count:', histogram_out.sum())\n",
-    "```\n",
-    "\n",
-    "This code allows a debug session like the following to take place:\n",
-    "```\n",
-    "(gtc2017) 0179-sseibert:gtc2017-numba sseibert$ NUMBA_ENABLE_CUDASIM=1 python debug/ex2.py\n",
-    "> /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex2.py(18)histogram()\n",
-    "-> for i in range(start, x.shape[0], stride):\n",
-    "(Pdb) n\n",
-    "> /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex2.py(19)histogram()\n",
-    "-> bin_number = np.int32((x[i] + xmin)/bin_width)\n",
-    "(Pdb) n\n",
-    "> /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex2.py(21)histogram()\n",
-    "-> if bin_number >= 0 and bin_number < histogram_out.shape[0]:\n",
-    "(Pdb) p bin_number, x[i]\n",
-    "(-6, -1.4435024)\n",
-    "(Pdb) p x[i], xmin, bin_width\n",
-    "(-1.4435024, -4.0, 0.80000000000000004)\n",
-    "(Pdb) p (x[i] - xmin) / bin_width\n",
-    "3.1956219673156738\n",
-    "(Pdb) q\n",
-    "```"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "collapsed": true
-   },
-   "source": [
-    "### CUDA Memcheck\n",
-    "\n",
-    "Another common error occurs when a CUDA kernel has an invalid memory access, typically caused by running off the end of an array.  The full CUDA toolkit from NVIDIA (not the `cudatoolkit` conda package) contains a utility called `cuda-memcheck` that can check for a wide range of memory access mistakes in CUDA code.\n",
-    "\n",
-    "Let's debug the following code:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 78,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "import numpy as np\r\n",
-      "\r\n",
-      "from numba import cuda\r\n",
-      "\r\n",
-      "@cuda.jit\r\n",
-      "def histogram(x, xmin, xmax, histogram_out):\r\n",
-      "    nbins = histogram_out.shape[0]\r\n",
-      "    bin_width = (xmax - xmin) / nbins\r\n",
-      "\r\n",
-      "    start = cuda.grid(1)\r\n",
-      "    stride = cuda.gridsize(1)\r\n",
-      "\r\n",
-      "    for i in range(start, x.shape[0], stride):\r\n",
-      "        bin_number = np.int32((x[i] + xmin)/bin_width)\r\n",
-      "\r\n",
-      "        if bin_number >= 0 or bin_number < histogram_out.shape[0]:\r\n",
-      "            cuda.atomic.add(histogram_out, bin_number, 1)\r\n",
-      "\r\n",
-      "x = np.random.normal(size=50, loc=0, scale=1).astype(np.float32)\r\n",
-      "xmin = np.float32(-4.0)\r\n",
-      "xmax = np.float32(4.0)\r\n",
-      "histogram_out = np.zeros(shape=10, dtype=np.int32)\r\n",
-      "\r\n",
-      "histogram[64, 64](x, xmin, xmax, histogram_out)\r\n",
-      "\r\n",
-      "print('input count:', x.shape[0])\r\n",
-      "print('histogram:', histogram_out)\r\n",
-      "print('count:', histogram_out.sum())\r\n"
-     ]
-    }
-   ],
-   "source": [
-    "! cat debug/ex3.py"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 79,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "========= CUDA-MEMCHECK\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (31,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001e8 is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (30,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001ec is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (29,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001f8 is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (28,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001e8 is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (27,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001f0 is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (26,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001f8 is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (25,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001fc is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (24,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001e4 is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (23,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001ec is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (22,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001ec is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (21,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001ec is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (20,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001f8 is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (19,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001ec is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (18,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001f4 is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (17,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001e8 is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (16,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001f0 is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (15,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001f0 is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (14,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001f4 is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (13,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001f0 is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (12,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001ec is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (11,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001e8 is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (10,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001e8 is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (9,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001e4 is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (8,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001e8 is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (7,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001f0 is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (6,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001e8 is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (5,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001f4 is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (4,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001ec is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (3,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001ec is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (2,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001ec is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (1,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001ec is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (0,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f43844001f4 is out of bounds\n",
-      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to \"unspecified launch failure\" on CUDA API call to cuMemcpyDtoH_v2. \n",
-      "=========     Saved host backtrace up to driver entry point at error\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuMemcpyDtoH_v2 + 0x1c9) [0x291fe9]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x966) [0x193d06]\n",
-      "=========     Host Frame:python [0x1944d4]\n",
-      "Traceback (most recent call last):\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "  File \"debug/ex3.py\", line 24, in <module>\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python [0x1945e6]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191e46]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "    histogram[64, 64](x, xmin, xmax, histogram_out)\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/compiler.py\", line 755, in __call__\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "    cfg(*args)\n",
-      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/compiler.py\", line 494, in __call__\n",
-      "    sharedmem=self.sharedmem)\n",
-      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/compiler.py\", line 596, in _kernel_call\n",
-      "    wb()\n",
-      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/args.py\", line 65, in <lambda>\n",
-      "    retr.append(lambda: devary.copy_to_host(self.value, stream=stream))\n",
-      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/devices.py\", line 212, in _require_cuda_context\n",
-      "    return fn(*args, **kws)\n",
-      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/devicearray.py\", line 252, in copy_to_host\n",
-      "    _driver.device_to_host(hostary, self, self.alloc_size, stream=stream)\n",
-      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py\", line 1819, in device_to_host\n",
-      "    fn(host_pointer(dst), device_pointer(src), size, *varargs)\n",
-      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py\", line 290, in safe_cuda_api_call\n",
-      "    self._check_error(fname, retcode)\n",
-      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py\", line 325, in _check_error\n",
-      "    raise CudaAPIError(retcode, msg)\n",
-      "numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuMemcpyDtoH results in CUDA_ERROR_LAUNCH_FAILED\n",
-      "========= ERROR SUMMARY: 33 errors\n"
-     ]
-    }
-   ],
-   "source": [
-    "! cuda-memcheck python debug/ex3.py"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The output of `cuda-memcheck` clearly shows a problem with our histogram function:\n",
-    "```\n",
-    "========= Invalid __global__ write of size 4\n",
-    "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-    "```\n",
-    "However, the report does not say which source line performed the invalid write. To get better error information, we can turn on \"debug\" mode when compiling the kernel by changing its decorator to look like this:\n",
-    "``` python\n",
-    "@cuda.jit(debug=True)\n",
-    "def histogram(x, xmin, xmax, histogram_out):\n",
-    "    nbins = histogram_out.shape[0]\n",
-    "```"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 80,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "========= CUDA-MEMCHECK\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (31,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001f4 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (30,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001f0 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (29,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001f0 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (28,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001f0 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (27,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001f0 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (26,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001f0 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (25,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001e8 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (24,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001f0 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (23,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001e8 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (22,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001f4 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (21,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001f4 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (20,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001e4 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (19,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001ec is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (18,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001f4 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (17,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001ec is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (16,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001ec is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (15,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001f0 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (14,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001f0 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (13,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001e8 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (12,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001e8 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (11,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001f0 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (10,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001f4 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (9,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001f8 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (8,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001e8 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (7,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001e8 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (6,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001e8 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (5,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001f4 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (4,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001e8 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (3,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001e8 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (2,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001f4 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (1,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001e8 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n",
-      "========= Invalid __global__ write of size 4\n",
-      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
-      "=========     by thread (0,0,0) in block (0,0,0)\n",
-      "=========     Address 0x7f4f464001f0 is out of bounds\n",
-      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
-      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x19296b]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========\n"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to \"unspecified launch failure\" on CUDA API call to cuMemcpyDtoH_v2. \n",
-      "=========     Saved host backtrace up to driver entry point at error\n",
-      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuMemcpyDtoH_v2 + 0x1c9) [0x291fe9]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
-      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x966) [0x193d06]\n",
-      "=========     Host Frame:python [0x1944d4]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "Traceback (most recent call last):\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "  File \"debug/ex3a.py\", line 24, in <module>\n",
-      "=========     Host Frame:python [0x192b83]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python [0x191fae]\n",
-      "=========     Host Frame:python [0x192be6]\n",
-      "=========     Host Frame:python [0x198a65]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
-      "=========     Host Frame:python [0x191b76]\n",
-      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
-      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
-      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
-      "=========     Host Frame:python [0x16a101]\n",
-      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
-      "=========     Host Frame:python [0x198ade]\n",
-      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
-      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
-      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
-      "=========     Host Frame:python [0x214d54]\n",
-      "    histogram[64, 64](x, xmin, xmax, histogram_out)\n",
-      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
-      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/compiler.py\", line 755, in __call__\n",
-      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
-      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
-      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
-      "=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21f45]\n",
-      "=========     Host Frame:python [0x1c1fff]\n",
-      "=========\n",
-      "    cfg(*args)\n",
-      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/compiler.py\", line 494, in __call__\n",
-      "    sharedmem=self.sharedmem)\n",
-      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/compiler.py\", line 571, in _kernel_call\n",
-      "    driver.device_to_host(ctypes.addressof(excval), excmem, excsz)\n",
-      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py\", line 1819, in device_to_host\n",
-      "    fn(host_pointer(dst), device_pointer(src), size, *varargs)\n",
-      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py\", line 290, in safe_cuda_api_call\n",
-      "    self._check_error(fname, retcode)\n",
-      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py\", line 325, in _check_error\n",
-      "    raise CudaAPIError(retcode, msg)\n",
-      "numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuMemcpyDtoH results in CUDA_ERROR_LAUNCH_FAILED\n",
-      "========= ERROR SUMMARY: 33 errors\n"
-     ]
-    }
-   ],
-   "source": [
-    "! cuda-memcheck python debug/ex3a.py"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Now we get an error message that includes a source file and line number: `ex3a.py:17`."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 81,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "    15\t\r\n",
-      "    16\t        if bin_number >= 0 or bin_number < histogram_out.shape[0]:\r\n",
-      "    17\t            cuda.atomic.add(histogram_out, bin_number, 1)\r\n",
-      "    18\t\r\n",
-      "    19\tx = np.random.normal(size=50, loc=0, scale=1).astype(np.float32)\r\n"
-     ]
-    }
-   ],
-   "source": [
-    "! cat -n debug/ex3a.py | grep -C 2 \"17\""
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
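-    "At this point, we can see that the `if` statement on line 16 incorrectly uses `or` instead of `and`. With `or`, the condition is true for every `bin_number`, so out-of-range bins still reach the `cuda.atomic.add` on line 17 and write outside `histogram_out`.\n",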
-    "\n",
-    "`cuda-memcheck` has different modes for detecting different kinds of problems (similar to `valgrind` for debugging CPU memory access errors).  Take a look at the documentation for more information: http://docs.nvidia.com/cuda/cuda-memcheck/"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Appendix: CUDA References\n",
-    "\n",
-    "It's worth bookmarking Chapters 1 and 2 of the CUDA C Programming Guide for study after the completion of this course. They are written for CUDA C, but are still highly applicable to programming CUDA Python.\n",
-    "\n",
-    " * Introduction: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#introduction\n",
-    " * Programming Model: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programming-model"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Appendix: Random Number Generation on the GPU with Numba\n",
-    "\n",
-    "GPUs can be extremely useful for Monte Carlo applications where you need to use large amounts of random numbers.  CUDA ships with an excellent set of random number generation algorithms in the cuRAND library.  Unfortunately, cuRAND is defined in a set of C headers which Numba can't easily compile or link to.  (Numba's CUDA JIT does not ever create C code for CUDA kernels.)  It is on the Numba roadmap to find a solution to this problem, but it may take some time.\n",
-    "\n",
-    "In the meantime, Numba version 0.33 and later includes the `xoroshiro128+` generator, which is pretty high quality, though with a smaller period ($2^{128} - 1$) than the XORWOW generator in cuRAND.\n",
-    "\n",
-    "To use it, you will want to initialize the RNG state on the host for each thread in your kernel. This state creation function initializes each state to be in the same sequence designated by the seed, but separated by $2^{64}$ steps from each other. This ensures that different threads will not accidentally end up with overlapping sequences (unless a single thread draws $2^{64}$ random numbers, which you won't have patience for):"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 82,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "import numpy as np\n",
-    "from numba import cuda\n",
-    "from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32\n",
-    "\n",
-    "threads_per_block = 64\n",
-    "blocks = 24\n",
-    "rng_states = create_xoroshiro128p_states(threads_per_block * blocks, seed=1)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We can use these random number states in our kernel by passing them in as an argument:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 91,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "@cuda.jit\n",
-    "def monte_carlo_mean(rng_states, iterations, out):\n",
-    "    thread_id = cuda.grid(1)\n",
-    "    total = 0\n",
-    "    for i in range(iterations):\n",
-    "        sample = xoroshiro128p_uniform_float32(rng_states, thread_id) # Returns a float32 in range [0.0, 1.0)\n",
-    "        total += sample\n",
-    "        \n",
-    "    out[thread_id] = total/iterations"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 98,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "0.49990836\n"
-     ]
-    }
-   ],
-   "source": [
-    "out = cuda.device_array(threads_per_block * blocks, dtype=np.float32)\n",
-    "monte_carlo_mean[blocks, threads_per_block](rng_states, 10000, out)\n",
-    "print(out.copy_to_host().mean())"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Exercise: Monte Carlo Pi on the GPU\n",
-    "\n",
-    "Let's revisit the Monte Carlo Pi algorithm from the first section, where we compiled it with Numba on the CPU."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 99,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "from numba import njit\n",
-    "import random\n",
-    "\n",
-    "@njit\n",
-    "def monte_carlo_pi(nsamples):\n",
-    "    acc = 0\n",
-    "    for i in range(nsamples):\n",
-    "        x = random.random()\n",
-    "        y = random.random()\n",
-    "        if (x**2 + y**2) < 1.0:\n",
-    "            acc += 1\n",
-    "    return 4.0 * acc / nsamples"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 102,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "105 ms ± 30.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "nsamples = 10000000\n",
-    "%timeit monte_carlo_pi(nsamples)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Your task is to refactor `monte_carlo_pi_device` below, currently identical to `monte_carlo_pi` above, to run on the GPU. You can use `monte_carlo_mean` above for inspiration, but at the least you will need to:\n",
-    "\n",
-    "- Decorate to be a CUDA kernel\n",
-    "- Draw samples for the thread from the device RNG state (generated 2 cells below)\n",
-    "- Store each thread's results in an output array which will be averaged on the host (as `monte_carlo_mean` did above)\n",
-    "\n",
-    "If you look two cells below you will see that all the data has already been initialized, the execution configuration created, and the kernel launched. All you need to do is refactor the kernel definition in the cell immediately below. Check out [the solution](../edit/solutions/monte_carlo_pi_solution.py) if you get stuck."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 116,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "from numba import njit\n",
-    "import random\n",
-    "\n",
-    "# TODO: All your work will be in this cell. Refactor to run on the device successfully given the way the\n",
-    "# kernel is launched below.\n",
-    "@cuda.jit\n",
-    "def monte_carlo_pi_device(rng_states, nsamples, out):\n",
-    "    idx = cuda.grid(1)\n",
-    "\n",
-    "    if idx < out.size:\n",
-    "        acc = 0\n",
-    "        for i in range(nsamples):\n",
-    "            x = xoroshiro128p_uniform_float32(rng_states, idx)\n",
-    "            y = xoroshiro128p_uniform_float32(rng_states, idx)\n",
-    "            if x*x + y*y < 1.0:\n",
-    "                acc += 1\n",
-    "        out[idx] = 4.0 * acc / nsamples"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 117,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "# Do not change any of the values in this cell\n",
-    "nsamples = 10000000\n",
-    "threads_per_block = 128\n",
-    "blocks = 32\n",
-    "\n",
-    "grid_size = threads_per_block * blocks\n",
-    "samples_per_thread = int(nsamples / grid_size) # Each thread only needs to work on a fraction of the total number of samples.\n",
-    "                                               # This could also be calculated inside the kernel definition using `cuda.gridsize(1)`.\n",
-    "\n",
-    "rng_states = create_xoroshiro128p_states(grid_size, seed=1)\n",
-    "d_out = cuda.device_array(threads_per_block * blocks, dtype=np.float32)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 118,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "1.04 ms ± 62.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "%timeit monte_carlo_pi_device[blocks, threads_per_block](rng_states, samples_per_thread, d_out); cuda.synchronize()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 114,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "3.140668\n"
-     ]
-    }
-   ],
-   "source": [
-    "print(d_out.copy_to_host().mean())"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "<a href=\"https://www.nvidia.com/dli\"> <img src=\"images/DLI Header.png\" alt=\"Header\" style=\"width: 400px;\"/> </a>"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.6.10"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/nvidia1/Effective+Memory+Use.ipynb b/nvidia1/Effective+Memory+Use.ipynb
deleted file mode 100644
index 2f038e9..0000000
--- a/nvidia1/Effective+Memory+Use.ipynb
+++ /dev/null
@@ -1,2073 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "<a href=\"https://www.nvidia.com/dli\"> <img src=\"images/DLI Header.png\" alt=\"Header\" style=\"width: 400px;\"/> </a>"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# Effective Use of the Memory Subsystem"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Now that you can write correct CUDA kernels, and understand the importance of launching grids that give the GPU sufficient opportunity to hide latency, you are going to learn techniques to effectively utilize GPU memory subsystems. These techniques are widely applicable to a variety of CUDA applications, and are some of the most important when it comes time to make your CUDA code go fast.\n",
-    "\n",
-    "You are going to begin by learning about memory coalescing. To challenge your ability to reason about memory coalescing, and to expose important details relevant to many CUDA applications, you will then learn about 2-dimensional grids and thread blocks. Next you will learn about a very fast, user-controlled, on-demand memory space called shared memory, and will use shared memory to facilitate memory coalescing where it would not have otherwise been possible. Finally, you will learn about shared memory bank conflicts, which can spoil the performance possibilities of using shared memory, and a technique to address them."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Objectives"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "By the time you complete this section, you will be able to:\n",
-    "* Write CUDA kernels that benefit from coalesced memory access patterns.\n",
-    "* Work with multi-dimensional grids and thread blocks.\n",
-    "* Use shared memory to coordinate threads within a block.\n",
-    "* Use shared memory to facilitate coalesced memory access patterns.\n",
-    "* Resolve shared memory bank conflicts."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## The Problem: Uncoalesced Memory Access Hurts Performance"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Before you learn the details about what **coalesced memory access** is, run the following cells to observe the performance implications for a seemingly trivial change to the data access pattern within a kernel."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Imports"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "import numpy as np\n",
-    "from numba import cuda"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Data Creation"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "In this cell we define `n` and create a grid whose total number of threads equals `n`. We also create an output vector with length `n`. For the inputs we create vectors of size `stride * n` for reasons that will be made clear below:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 2,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "n = 1024*1024 # 1M\n",
-    "\n",
-    "threads_per_block = 1024\n",
-    "blocks = int(n / threads_per_block)\n",
-    "\n",
-    "stride = 16\n",
-    "\n",
-    "# Input Vectors of length stride * n\n",
-    "a = np.ones(stride * n).astype(np.float32)\n",
-    "b = a.copy().astype(np.float32)\n",
-    "\n",
-    "# Output Vector\n",
-    "out = np.zeros(n).astype(np.float32)\n",
-    "\n",
-    "d_a = cuda.to_device(a)\n",
-    "d_b = cuda.to_device(b)\n",
-    "d_out = cuda.to_device(out)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Kernel Definition"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "In `add_experiment`, every thread in the grid will add an item in `a` to an item in `b` and write the result to `out`. The kernel has been written such that we can pass a `coalesced` value of either `True` or `False` to affect how it indexes into the `a` and `b` vectors. You will see the performance comparison of the two modes below."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 3,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "@cuda.jit\n",
-    "def add_experiment(a, b, out, stride, coalesced):\n",
-    "    i = cuda.grid(1)\n",
-    "    # The above line is equivalent to\n",
-    "    # i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n",
-    "    if coalesced == True:\n",
-    "        out[i] = a[i] + b[i]\n",
-    "    else:\n",
-    "        out[i] = a[stride*i] + b[stride*i]"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Launch Kernel Using Coalesced Access"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Here we pass `True` as the `coalesced` value, and observe the performance of the kernel over several runs:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 4,
-   "metadata": {
-    "scrolled": true
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "227 µs ± 73.6 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "%timeit add_experiment[blocks, threads_per_block](d_a, d_b, d_out, stride, True); cuda.synchronize()"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Here we make sure the kernel ran as expected:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 5,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "result = d_out.copy_to_host()\n",
-    "truth = a[:n] + b[:n]"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 6,
-   "metadata": {
-    "scrolled": true
-   },
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "True"
-      ]
-     },
-     "execution_count": 6,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "np.array_equal(result, truth)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Launch Kernel Using Uncoalesced Access"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "In this cell we pass `False`, to observe the performance of the uncoalesced data access pattern for `add_experiment`:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 7,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "540 µs ± 10.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "%timeit add_experiment[blocks, threads_per_block](d_a, d_b, d_out, stride, False); cuda.synchronize()"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Here we make sure the kernel ran as expected:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 8,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "result = d_out.copy_to_host()\n",
-    "truth = a[::stride] + b[::stride]"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 9,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "True"
-      ]
-     },
-     "execution_count": 9,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "np.array_equal(result, truth)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Results"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The performance of the uncoalesced data access pattern was far worse. Now you will learn why, and how to think about data access patterns in your kernels to obtain high performing kernels."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Presentation: Global Memory Coalescing"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Execute the following cell to load the slides, then click on \"Start Slide Show\" to make them full screen."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 10,
-   "metadata": {
-    "scrolled": true
-   },
-   "outputs": [
-    {
-     "data": {
-      "text/html": [
-       "\n",
-       "        <iframe\n",
-       "            width=\"800\"\n",
-       "            height=\"450\"\n",
-       "            src=\"https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/coalescing-v3.pptx\"\n",
-       "            frameborder=\"0\"\n",
-       "            allowfullscreen\n",
-       "        ></iframe>\n",
-       "        "
-      ],
-      "text/plain": [
-       "<IPython.lib.display.IFrame at 0x7f9086df57b8>"
-      ]
-     },
-     "execution_count": 10,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "from IPython.display import IFrame\n",
-    "IFrame('https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/coalescing-v3.pptx', 800, 450)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "> _**Footnote**: for additional details about global memory segment size across a variety of devices, and with regards to caching, see [The CUDA Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#coalesced-access-to-global-memory)._"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Exercise: Column and Row Sums"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "For this exercise you will be asked to write a column sums kernel that uses fully coalesced memory access patterns. To begin you will observe the performance of a row sums kernel that makes uncoalesced memory accesses."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Row Sums"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Imports**"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 11,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "import numpy as np\n",
-    "from numba import cuda"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Data Creation**"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "In this cell we create an input matrix, as well as a vector for storing the solution, and transfer each of them to the device. We also define the grid and block dimensions to be used when we launch the kernel below. We set an arbitrary row of data to some arbitrary value to facilitate checking for correctness below."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 12,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "n = 16384 # matrix side size\n",
-    "threads_per_block = 256\n",
-    "blocks = int(n / threads_per_block)\n",
-    "\n",
-    "# Input Matrix\n",
-    "a = np.ones(n*n).reshape(n, n).astype(np.float32)\n",
-    "# Here we set an arbitrary row to an arbitrary value to facilitate a check for correctness below.\n",
-    "a[3] = 9\n",
-    "\n",
-    "# Output vector\n",
-    "sums = np.zeros(n).astype(np.float32)\n",
-    "\n",
-    "d_a = cuda.to_device(a)\n",
-    "d_sums = cuda.to_device(sums)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Kernel Definition**"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "`row_sums` will use each thread to iterate over a row of data, summing it, and then store its row sum in `sums`."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 13,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "@cuda.jit\n",
-    "def row_sums(a, sums, n):\n",
-    "    idx = cuda.grid(1)\n",
-    "    sum = 0.0\n",
-    "    \n",
-    "    for i in range(n):\n",
-    "        # Each thread will sum a row of `a`\n",
-    "        sum += a[idx][i]\n",
-    "        \n",
-    "    sums[idx] = sum"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Row Sums Performance**"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 14,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "11.6 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "%timeit row_sums[blocks, threads_per_block](d_a, d_sums, n); cuda.synchronize()"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Check for Correctness**"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 15,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "result = d_sums.copy_to_host()\n",
-    "truth = a.sum(axis=1)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 16,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "True"
-      ]
-     },
-     "execution_count": 16,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "np.array_equal(truth, result)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Column Sums"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Imports**"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 17,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "import numpy as np\n",
-    "from numba import cuda"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Data Creation**"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "In this cell we create an input matrix, as well as a vector for storing the solution, and transfer each of them to the device. We also define the grid and block dimensions to be used when we launch the kernel below. We set an arbitrary column of data to some arbitrary value to facilitate checking for correctness below."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 18,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "n = 16384 # matrix side size\n",
-    "threads_per_block = 256\n",
-    "blocks = int(n / threads_per_block)\n",
-    "\n",
-    "a = np.ones(n*n).reshape(n, n).astype(np.float32)\n",
-    "# Here we set an arbitrary column to an arbitrary value to facilitate a check for correctness below.\n",
-    "a[:, 3] = 9\n",
-    "sums = np.zeros(n).astype(np.float32)\n",
-    "\n",
-    "d_a = cuda.to_device(a)\n",
-    "d_sums = cuda.to_device(sums)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Kernel Definition**"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "`col_sums` will use each thread to iterate over a column of data, summing it, and then store its column sum in `sums`. Complete the kernel definition to accomplish this. If you get stuck, feel free to refer to [the solution](../edit/solutions/col_sums_solution.py)."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 52,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "@cuda.jit\n",
-    "def col_sums(a, sums, ds):\n",
-    "    idx = cuda.grid(1)\n",
-    "    # `ds` is the matrix side length; loop over every row of this thread's column.\n",
-    "    sum = 0.0\n",
-    "    for i in range(ds):\n",
-    "        sum += a[i][idx]\n",
-    "    \n",
-    "    sums[idx] = sum"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Check Performance**"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Assuming you have written `col_sums` to use coalesced access patterns, you should see a significant (almost 2x) speed up compared to the uncoalesced `row_sums` you ran above:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 53,
-   "metadata": {
-    "scrolled": true
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "7.88 ms ± 3.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "%timeit col_sums[blocks, threads_per_block](d_a, d_sums, n); cuda.synchronize()"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Check Correctness**"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Confirm your kernel is working as expected."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 50,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "col_sums[blocks, threads_per_block](d_a, d_sums, n)\n",
-    "cuda.synchronize()\n",
-    "result = d_sums.copy_to_host()\n",
-    "truth = a.sum(axis=0)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 51,
-   "metadata": {
-    "scrolled": true
-   },
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "True"
-      ]
-     },
-     "execution_count": 51,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "np.array_equal(truth, result)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## 2 and 3 Dimensional Blocks and Grids"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Both grids and blocks can be configured to contain a 2 or 3 dimensional collection of blocks or threads, respectively. This is done mostly as a matter of convenience for programmers who often work with 2 or 3 dimensional datasets. Here is a very trivial example to highlight the syntax. You may need to read *both* the kernel definition and its launch before the concept makes sense."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 54,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "import numpy as np\n",
-    "from numba import cuda"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 57,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "A = np.zeros((4,4)) # A 4x4 Matrix of 0's\n",
-    "d_A = cuda.to_device(A)\n",
-    "\n",
-    "# Here we create a 2D grid with 4 blocks in a 2x2 structure, each with 4 threads in a 2x2 structure\n",
-    "# by using a Python tuple to signify grid and block dimensions.\n",
-    "blocks = (2, 2)\n",
-    "threads_per_block = (2, 2)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "This kernel will take an input matrix of 0s and write to each element that element's (x, y) coordinates within the grid, in the format `X.Y`:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 58,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "@cuda.jit\n",
-    "def get_2D_indices(A):\n",
-    "    # By passing `2`, we get the thread's unique x and y coordinates in the 2D grid\n",
-    "    x, y = cuda.grid(2)\n",
-    "    # The above is equivalent to the following 2 lines of code:\n",
-    "    # x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n",
-    "    # y = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y\n",
-    "    \n",
-    "    # Write the x index followed by a decimal and the y index.\n",
-    "    A[x][y] = x + y / 10"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 59,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "get_2D_indices[blocks, threads_per_block](d_A)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 60,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "array([[0. , 0.1, 0.2, 0.3],\n",
-       "       [1. , 1.1, 1.2, 1.3],\n",
-       "       [2. , 2.1, 2.2, 2.3],\n",
-       "       [3. , 3.1, 3.2, 3.3]])"
-      ]
-     },
-     "execution_count": 60,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "result = d_A.copy_to_host()\n",
-    "result"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Exercise: Coalesced 2-Dimensional Matrix Add"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Imports"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 62,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "import numpy as np\n",
-    "from numba import cuda"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Data Creation"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "In this cell we define 2048x2048 element input matrices `a` and `b`, as well as a 2048x2048 0-initialized output matrix. We copy these matrices to the device.\n",
-    "\n",
-    "We also define the 2-dimensional block and grid dimensions to be used below. Note that we are creating a grid with the same number of total threads as there are input and output elements, such that each thread in the grid will calculate the sum for a single element in the output matrix."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 63,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "n = 2048*2048 # 4M\n",
-    "\n",
-    "# 2D blocks\n",
-    "threads_per_block = (32, 32)\n",
-    "# 2D grid\n",
-    "blocks = (64, 64)\n",
-    "\n",
-    "# 2048x2048 input matrices\n",
-    "a = np.arange(n).reshape(2048,2048).astype(np.float32)\n",
-    "b = a.copy().astype(np.float32)\n",
-    "\n",
-    "# 2048x2048 0-initialized output matrix\n",
-    "out = np.zeros_like(a).astype(np.float32)\n",
-    "\n",
-    "d_a = cuda.to_device(a)\n",
-    "d_b = cuda.to_device(b)\n",
-    "d_out = cuda.to_device(out)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### 2D Matrix Add"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Your job is to complete the TODOs in `matrix_add` to correctly sum `a` and `b` into `out`. As a challenge to your understanding of coalesced access patterns, `matrix_add` will accept a `coalesced` boolean indicating whether the access patterns should be coalesced or not. Both modes (coalesced and uncoalesced) should produce correct results; however, you should observe significant speedups below when running with `coalesced` set to `True`.\n",
-    "\n",
-    "If you get stuck, feel free to check out [the solution](../edit/solutions/matrix_add_solution.py)."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 66,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "@cuda.jit\n",
-    "def matrix_add(a, b, out, coalesced):\n",
-    "    x, y = cuda.grid(2)\n",
-    "    \n",
-    "    if coalesced == True:\n",
-    "        out[y][x] = a[y][x] + b[y][x]\n",
-    "    else:\n",
-    "        out[x][y] = a[x][y] + b[x][y]"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Check Performance"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Run both cells below to launch `matrix_add` with both the coalesced and uncoalesced access patterns you wrote into it, and observe the performance difference. Additional cells have been provided to confirm the correctness of your kernel."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Coalesced**"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 68,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "203 µs ± 13.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "%timeit matrix_add[blocks, threads_per_block](d_a, d_b, d_out, True); cuda.synchronize()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 69,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "result = d_out.copy_to_host()\n",
-    "truth = a+b"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 70,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "True"
-      ]
-     },
-     "execution_count": 70,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "np.array_equal(result, truth)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Uncoalesced**"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 71,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "586 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "%timeit matrix_add[blocks, threads_per_block](d_a, d_b, d_out, False); cuda.synchronize()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 72,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "result = d_out.copy_to_host()\n",
-    "truth = a+b"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 73,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "True"
-      ]
-     },
-     "execution_count": 73,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "np.array_equal(result, truth)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Shared Memory"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "So far we have been differentiating between host and device memory, as if device memory were a single kind of memory. But in fact, CUDA has an even more fine-grained [memory hierarchy](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-hierarchy). The device memory we have been utilizing thus far is called **global memory** which is available to any thread or block on the device, can persist for the lifetime of the application, and is a relatively large memory space.\n",
-    "\n",
-    "We will now discuss how to utilize a region of on-chip device memory called **shared memory**. Shared memory is a programmer-defined cache of limited size that [depends on the GPU](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities) being used and is **shared** between all threads in a block. It is a scarce resource, cannot be accessed by threads outside of the block where it was allocated, and does not persist after a kernel finishes executing. Shared memory, however, has a much higher bandwidth than global memory and can be used to great effect in many kernels, especially to optimize performance.\n",
-    "\n",
-    "Here are a few common use cases for shared memory:\n",
-    "\n",
-    " * Caching memory read from global memory that will need to be read multiple times within a block.\n",
-    " * Buffering output from threads so it can be coalesced before writing it back to global memory.\n",
-    " * Staging data for scatter/gather operations within a block."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Shared Memory Syntax"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Numba provides [functions](https://numba.pydata.org/numba-doc/dev/cuda/memory.html#shared-memory-and-thread-synchronization) for allocating shared memory as well as for synchronizing between threads in a block, which is often necessary after parallel threads read from or write to shared memory.\n",
-    "\n",
-    "When declaring shared memory, you provide the shape of the shared array, as well as its type, using a [Numba type](https://numba.pydata.org/numba-doc/dev/reference/types.html#numba-types). **The shape of the array must be a constant value**, so you cannot use arguments passed into the function, built-in variables like `numba.cuda.blockDim.x`, or values calculated from `cuda.gridDim`. Here is a contrived example that demonstrates the syntax, with comments pointing out the movement from host memory to global device memory, to shared memory, back to global device memory, and finally back to host memory:"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Imports**"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We will use `numba.types` to define the types of values in shared memory."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 74,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "import numpy as np\n",
-    "from numba import types, cuda"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Swap Elements Using Shared Memory**"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The following kernel takes an input vector, where each thread will first write one element of the vector to shared memory, and then, after syncing such that all elements have been written to shared memory, will write one element out of shared memory into the swapped output vector.\n",
-    "\n",
-    "Worth noting is that each thread will be writing a swapped value from shared memory that was written into shared memory by another thread."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 75,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "@cuda.jit\n",
-    "def swap_with_shared(vector, swapped):\n",
-    "    # Allocate a 4 element vector containing int32 values in shared memory.\n",
-    "    temp = cuda.shared.array(4, dtype=types.int32)\n",
-    "    \n",
-    "    idx = cuda.grid(1)\n",
-    "    \n",
-    "    # Move an element from global memory into shared memory\n",
-    "    temp[idx] = vector[idx]\n",
-    "    \n",
-    "    # cuda.syncthreads will force all threads in the block to synchronize here, which is necessary because...\n",
-    "    cuda.syncthreads()\n",
-    "    #...the following operation is reading an element written to shared memory by another thread.\n",
-    "    \n",
-    "    # Move an element from shared memory back into global memory\n",
-    "    swapped[idx] = temp[3 - cuda.threadIdx.x] # swap elements"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Data Creation**"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 76,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "vector = np.arange(4).astype(np.int32)\n",
-    "swapped = np.zeros_like(vector)\n",
-    "\n",
-    "# Move host memory to device (global) memory\n",
-    "d_vector = cuda.to_device(vector)\n",
-    "d_swapped = cuda.to_device(swapped)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 77,
-   "metadata": {
-    "scrolled": true
-   },
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "array([0, 1, 2, 3], dtype=int32)"
-      ]
-     },
-     "execution_count": 77,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "vector"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Run Kernel**"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 78,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "swap_with_shared[1, 4](d_vector, d_swapped)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Check Results**"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 79,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "array([3, 2, 1, 0], dtype=int32)"
-      ]
-     },
-     "execution_count": 79,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "# Move device (global) memory back to the host\n",
-    "result = d_swapped.copy_to_host()\n",
-    "result"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Presentation: Shared Memory for Memory Coalescing"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Execute the following cell to load the slides, then click on \"Start Slide Show\" to make them full screen."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 80,
-   "metadata": {
-    "scrolled": true
-   },
-   "outputs": [
-    {
-     "data": {
-      "text/html": [
-       "\n",
-       "        <iframe\n",
-       "            width=\"800\"\n",
-       "            height=\"450\"\n",
-       "            src=\"https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/shared_coalescing.pptx\"\n",
-       "            frameborder=\"0\"\n",
-       "            allowfullscreen\n",
-       "        ></iframe>\n",
-       "        "
-      ],
-      "text/plain": [
-       "<IPython.lib.display.IFrame at 0x7f9086acc438>"
-      ]
-     },
-     "execution_count": 80,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "from IPython.display import IFrame\n",
-    "IFrame('https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/shared_coalescing.pptx', 800, 450)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Exercise: Use Shared Memory for Coalesced Reads and Writes With Matrix Transpose"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "In this exercise you will implement what was just demonstrated in the presentation by writing a matrix transpose kernel which, using shared memory, makes coalesced reads from the input matrix and coalesced writes to the output matrix in global memory."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Coalesced Reads, Uncoalesced Writes"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "As a reference, and for performance comparison, here is a naive matrix transpose kernel that makes coalesced reads from input, but uncoalesced writes to output."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Imports**"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "from numba import cuda\n",
-    "import numpy as np"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Data Creation**"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Here we create a 4096x4096 input matrix `a` as well as a 4096x4096 output matrix `transposed`, and copy them to the device.\n",
-    "\n",
-    "We also define a 2-dimensional grid with 2-dimensional blocks to be used below. Note that we have created a grid with a total number of threads equal to the number of elements in the input matrix."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 81,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "n = 4096*4096 # 16M\n",
-    "\n",
-    "# 2D blocks\n",
-    "threads_per_block = (32, 32)\n",
-    "# 2D grid\n",
-    "blocks = (128, 128)\n",
-    "\n",
-    "# 4096x4096 input and output matrices\n",
-    "a = np.arange(n).reshape((4096,4096)).astype(np.float32)\n",
-    "transposed = np.zeros_like(a).astype(np.float32)\n",
-    "\n",
-    "d_a = cuda.to_device(a)\n",
-    "d_transposed = cuda.to_device(transposed)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Naive Matrix Transpose Kernel**"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "This kernel correctly transposes `a`, writing the transposition to `transposed`. It reads from `a` in a coalesced fashion; however, its writes to `transposed` are uncoalesced."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 82,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "@cuda.jit\n",
-    "def transpose(a, transposed):\n",
-    "    x, y = cuda.grid(2)\n",
-    "\n",
-    "    transposed[x][y] = a[y][x]"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Check Performance**"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 83,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "1.59 ms ± 25.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "%timeit transpose[blocks, threads_per_block](d_a, d_transposed); cuda.synchronize()"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Check Correctness**"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 84,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "result = d_transposed.copy_to_host()\n",
-    "expected = a.T"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 85,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "True"
-      ]
-     },
-     "execution_count": 85,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "np.array_equal(result, expected)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Refactor for Coalesced Reads and Writes"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Your job will be to refactor the `transpose` kernel to use shared memory and make both reads from and writes to global memory in a coalesced fashion."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Imports**"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 86,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "import numpy as np\n",
-    "from numba import cuda, types as numba_types"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Data Creation**"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 87,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "n = 4096*4096 # 16M\n",
-    "\n",
-    "# 2D blocks\n",
-    "threads_per_block = (32, 32)\n",
-    "# 2D grid\n",
-    "blocks = (128, 128)\n",
-    "\n",
-    "# 4096x4096 input and output matrices\n",
-    "a = np.arange(n).reshape((4096,4096)).astype(np.float32)\n",
-    "transposed = np.zeros_like(a).astype(np.float32)\n",
-    "\n",
-    "d_a = cuda.to_device(a)\n",
-    "d_transposed = cuda.to_device(transposed)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Write a Transpose Kernel that Uses Shared Memory**"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Complete the TODOs inside the `tile_transpose` kernel definition.\n",
-    "\n",
-    "If you get stuck, feel free to check out [the solution](../edit/solutions/tile_transpose_solution.py)."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 91,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "@cuda.jit\n",
-    "def tile_transpose(a, transposed):\n",
-    "    # `tile_transpose` assumes it is launched with a 32x32 block dimension,\n",
-    "    # and that the dimensions of `a` are multiples of the block dimensions.\n",
-    "    \n",
-    "    # 1) Create 32x32 shared memory array.\n",
-    "    \n",
-    "    temp = cuda.shared.array((32, 32), dtype=numba_types.float32)\n",
-    "\n",
-    "    # Compute offsets into global input array. Recall for coalesced access we want to map threadIdx.x increments to\n",
-    "    # the fastest changing index in the data, i.e. the column in our array.\n",
-    "    # Note: `a_col` and `a_row` are already correct.\n",
-    "    a_col = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n",
-    "    a_row = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y\n",
-    "    \n",
-    "    # 2) Make coalesced read from global memory (using grid indices)\n",
-    "    # into shared memory array (using thread indices).\n",
-    "    \n",
-    "    temp[cuda.threadIdx.y, cuda.threadIdx.x] = a[a_row, a_col]\n",
-    "\n",
-    "    # 3) Wait for all threads in the block to finish updating shared memory.\n",
-    "    \n",
-    "    cuda.syncthreads()\n",
-    "    \n",
-    "    # 4) Calculate transposed location for the shared memory array tile\n",
-    "    # to be written back to global memory. Note that blockIdx.y*blockDim.y \n",
-    "    # and blockIdx.x* blockDim.x are swapped (because we want to write to the\n",
-    "    # transpose locations), but we want to keep access coalesced, so match up the\n",
-    "    # threadIdx.x to the fastest changing index, i.e. the column.\n",
-    "    # Note: `t_col` and `t_row` are already correct.\n",
-    "    t_col = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.x\n",
-    "    t_row = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.y\n",
-    "\n",
-    "    # 5) Write from shared memory (using thread indices)\n",
-    "    # back to global memory (using grid indices)\n",
-    "    # transposing each element within the shared memory array.\n",
-    "    \n",
-    "    transposed[t_row, t_col] = temp[cuda.threadIdx.x, cuda.threadIdx.y]"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Check Performance**"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Check the performance of your refactored transpose kernel. You should see a speedup compared to the baseline transpose performance above."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 92,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "1.09 ms ± 60.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "%timeit tile_transpose[blocks, threads_per_block](d_a, d_transposed); cuda.synchronize()"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Check Correctness**"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 93,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "result = d_transposed.copy_to_host()\n",
-    "expected = a.T"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 94,
-   "metadata": {
-    "scrolled": true
-   },
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "True"
-      ]
-     },
-     "execution_count": 94,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "np.array_equal(result, expected)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Why Such a Small Improvement?"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "While this is a significant speedup for only a few lines of code, you might think that the performance improvement is not as stark as you expected, given the gains seen earlier from using coalesced access patterns. There are 2 main reasons for this:\n",
-    "\n",
-    "1. The naive transpose kernel was already making coalesced reads, so your refactored version only optimized half of the kernel's global memory accesses.\n",
-    "2. Your code as written suffers from something called shared memory bank conflicts, a topic to which we will now turn our attention."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Presentation: Memory Bank Conflicts"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Execute the following cell to load the slides, then click on \"Start Slide Show\" to make them full screen."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 95,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/html": [
-       "\n",
-       "        <iframe\n",
-       "            width=\"800\"\n",
-       "            height=\"450\"\n",
-       "            src=\"https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/bank_conflicts.pptx\"\n",
-       "            frameborder=\"0\"\n",
-       "            allowfullscreen\n",
-       "        ></iframe>\n",
-       "        "
-      ],
-      "text/plain": [
-       "<IPython.lib.display.IFrame at 0x7f908e3479e8>"
-      ]
-     },
-     "execution_count": 95,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "from IPython.display import IFrame\n",
-    "IFrame('https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/bank_conflicts.pptx', 800, 450)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Assessment: Resolve Memory Bank Conflicts"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "As a final exercise, and to get credit towards a certificate in the course for this final section of the workshop, you will refactor the shared memory transpose kernel so that it is free of shared memory bank conflicts."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Imports"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "import numpy as np\n",
-    "from numba import cuda, types as numba_types"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Data Creation"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "n = 4096*4096 # 16M\n",
-    "threads_per_block = (32, 32)\n",
-    "blocks = (128, 128)\n",
-    "\n",
-    "a = np.arange(n).reshape((4096,4096)).astype(np.float32)\n",
-    "transposed = np.zeros_like(a).astype(np.float32)\n",
-    "\n",
-    "d_a = cuda.to_device(a)\n",
-    "d_transposed = cuda.to_device(transposed)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Make the Kernel Bank Conflict Free"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The `tile_transpose_conflict_free` kernel is a working matrix transpose kernel which utilizes shared memory so that both reads from and writes to global memory are coalesced. Your job is to refactor the kernel so that it does not suffer from memory bank conflicts.\n",
-    "\n",
-    "**Note:** Because this final exercise counts towards certification in the course, a solution will not be provided."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 96,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "@cuda.jit\n",
-    "def tile_transpose_conflict_free(a, transposed):\n",
-    "    # `tile_transpose_conflict_free` assumes it is launched with a 32x32 block dimension,\n",
-    "    # and that the dimensions of `a` are multiples of the block dimensions.\n",
-    "    \n",
-    "    # 1) Create a 32x33 shared memory array; the extra padding column avoids bank conflicts.\n",
-    "    tile = cuda.shared.array((32, 33), numba_types.float32)\n",
-    "\n",
-    "    # Compute offsets into global input array.\n",
-    "    x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n",
-    "    y = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y\n",
-    "    \n",
-    "    # 2) Make coalesced read from global memory into shared memory array.\n",
-    "    # Note the use of local thread indices for the shared memory write,\n",
-    "    # and global offsets for global memory read.\n",
-    "    tile[cuda.threadIdx.y, cuda.threadIdx.x] = a[y, x]\n",
-    "\n",
-    "    # 3) Wait for all threads in the block to finish updating shared memory.\n",
-    "    cuda.syncthreads()\n",
-    "    \n",
-    "    # 4) Calculate transposed location for the shared memory array tile\n",
-    "    # to be written back to global memory.\n",
-    "    t_x = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.x\n",
-    "    t_y = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.y\n",
-    "\n",
-    "    # 5) Write back to global memory,\n",
-    "    # transposing each element within the shared memory array.\n",
-    "    transposed[t_y, t_x] = tile[cuda.threadIdx.x, cuda.threadIdx.y]"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Check Performance"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Assuming you have correctly resolved the bank conflicts, this kernel should run significantly faster than both the naive transpose kernel and the shared memory transpose kernel that suffers from bank conflicts. In order to pass the assessment, your kernel will need to run on average in less than 840 µs.\n",
-    "\n",
-    "The first value printed by running the following cell will give you the average run time of your kernel."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 97,
-   "metadata": {
-    "scrolled": true
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "804 µs ± 2.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "%timeit tile_transpose_conflict_free[blocks, threads_per_block](d_a, d_transposed); cuda.synchronize()"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Check Correctness"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "In order to pass the assessment, your kernel also needs to work correctly. Run the following 2 cells to confirm this is true."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 98,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "result = d_transposed.copy_to_host()\n",
-    "expected = a.T"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 99,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "True"
-      ]
-     },
-     "execution_count": 99,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "np.array_equal(result, expected)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Run the Assessment"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "If you have completed the refactor, observed its run time to be less than 840 µs, and confirmed that it runs correctly, execute the following cells to run the assessment against your kernel definition."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 100,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "from assessment import assess"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 101,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "\n",
-      "Your function took 808.28 µs to run.\n",
-      "\n",
-      "Your function runs fast enough (less than 840 µs): True\n",
-      "\n",
-      "Your function returns the correct results: True\n",
-      "\n",
-      "Congratulations, you passed! See the instructions below for how to get credit for your work to count toward a certificate in the course.\n"
-     ]
-    }
-   ],
-   "source": [
-    "assess(tile_transpose_conflict_free)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Get Credit for Your Work"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "After successfully passing the assessment above, revisit the webpage where you launched this interactive environment and click on the **\"ASSESS TASK\"** button as shown in the screenshot below. Doing so will give you credit for this part of the workshop that counts towards earning a **certificate of competency** for the entire course."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "![Run the assessment](images/run_the_assessment.png)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Summary"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Now that you have completed this session you are able to:\n",
-    "\n",
-    "* Write CUDA kernels that benefit from coalesced memory access patterns.\n",
-    "* Work with multi-dimensional grids and thread blocks.\n",
-    "* Use shared memory to coordinate threads within a block.\n",
-    "* Use shared memory to facilitate coalesced memory access patterns.\n",
-    "* Resolve shared memory bank conflicts."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Download Content"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "To download the contents of this notebook, execute the following cell and then click the download link below. Note: If you run this notebook on a local Jupyter server, you can expect some of the file path links in the notebook to be broken, as they are tailored to our own platform. You can still navigate to the files through the Jupyter file navigator."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "!tar -zcvf section3.tar.gz ."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "[Download files from this section.](files/section3.tar.gz)"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.6.10"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/nvidia1/Introduction+to+CUDA+Python+with+Numba.ipynb b/nvidia1/Introduction+to+CUDA+Python+with+Numba.ipynb
deleted file mode 100644
index c1e885e..0000000
--- a/nvidia1/Introduction+to+CUDA+Python+with+Numba.ipynb
+++ /dev/null
@@ -1,2072 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "<a href=\"https://www.nvidia.com/dli\"> <img src=\"images/DLI Header.png\" alt=\"Header\" style=\"width: 400px;\"/> </a>"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# Introduction to CUDA Python with Numba\n",
-    "\n",
-    "The **[CUDA](https://en.wikipedia.org/wiki/CUDA)** compute platform delivers remarkable application acceleration by enabling developers to execute code in a massively parallel fashion on NVIDIA GPUs.\n",
-    "\n",
-    "**[Numba](http://numba.pydata.org/)** is a just-in-time Python function compiler that exposes a simple interface for accelerating numerically-focused Python functions. Numba is a very attractive option for Python programmers wishing to GPU accelerate their applications without needing to write C/C++ code, especially for developers already performing computationally heavy operations on NumPy arrays. Numba can be used to accelerate Python functions for the CPU, as well as for NVIDIA GPUs. **The focus of this course is the fundamental techniques needed to GPU-accelerate Python applications using Numba.**"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Course Structure\n",
-    "\n",
-    "This course is divided into **three** main sections:\n",
-    "\n",
-    "- _Introduction to CUDA Python with Numba_\n",
-    "- _Custom CUDA Kernels in Python with Numba_\n",
-    "- _Multidimensional Grids and Shared Memory for CUDA Python with Numba_\n",
-    "\n",
-    "Each section contains a final assessment problem, the successful completion of which will enable you to earn a Certificate of Competency for the course. Each section also contains an appendix with advanced materials for those of you with interest."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Introduction to CUDA Python with Numba\n",
-    "\n",
-    "In this first section you will first learn how to use Numba to compile functions for the CPU, and will receive an introduction to the inner workings of the Numba compiler. You will then proceed to learn how to GPU accelerate element-wise NumPy array functions, along with some techniques for efficiently moving data between a CPU host and GPU device.\n",
-    "\n",
-    "By the end of the first session you will be able to GPU accelerate Python code that performs element-wise operations on NumPy arrays."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Custom CUDA Kernels in Python with Numba\n",
-    "\n",
-    "In the second section you will expand your abilities to be able to launch arbitrary, not just element-wise, numerically focused functions in parallel on the GPU by writing custom CUDA kernels. In service of this goal you will learn about how NVIDIA GPUs execute code in parallel. Additionally, you will be exposed to several fundamental parallel programming techniques including how to coordinate the work of parallel threads, and how to address race conditions. You will also learn techniques for debugging code that executes on the GPU.\n",
-    "\n",
-    "By the end of the second section you will be ready to GPU accelerate an incredible range of numerically focused functions on 1D data sets."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Multidimensional Grids and Shared Memory for CUDA Python with Numba\n",
-    "\n",
-    "In the third section you will begin working in parallel with 2D data, and will learn how to utilize an on-chip memory space on the GPU called shared memory.\n",
-    "\n",
-    "By the end of the third section, you will be able to write GPU accelerated code in Python using Numba on 1D and 2D datasets while utilizing several of the most important optimization strategies for writing consistently fast GPU accelerated code."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Course Prerequisites\n",
-    "\n",
-    "* Competency writing Python, specifically, writing and invoking functions, working with variables, loops, and conditionals, and imports.\n",
-    "* Familiarity with the NumPy Python library for numerically-focused Python. If you have never used NumPy, but are familiar with Python, you will likely find the use of NumPy in this session straightforward. Comments and links are provided where some clarification might be helpful.\n",
-    "* A high level understanding of some computer science terms like memory allocation, value types, latency, and processing cores.\n",
-    "* A basic understanding of what vectors and matrices are, and also matrix multiplication."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Objectives for this Section\n",
-    "\n",
-    "By the time you complete this section you will be able to:\n",
-    "\n",
-    "- Use Numba to compile Python functions for the CPU.\n",
-    "- Understand how Numba compiles Python functions.\n",
-    "- GPU accelerate NumPy ufuncs.\n",
-    "- GPU accelerate hand-written vectorized functions.\n",
-    "- Optimize data transfers between the CPU host and GPU device."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## What is Numba?\n",
-    "\n",
-    "Numba is a **just-in-time**, **type-specializing**, **function compiler** for accelerating **numerically-focused** Python for either a CPU or GPU. That's a long list, so let's break down those terms:\n",
-    "\n",
-    " * **function compiler**: Numba compiles Python functions, not entire applications, and not parts of functions.  Numba does not replace your Python interpreter, but is just another Python module that can turn a function into a (usually) faster function. \n",
-    " * **type-specializing**: Numba speeds up your function by generating a specialized implementation for the specific data types you are using.  Python functions are designed to operate on generic data types, which makes them very flexible, but also very slow.  In practice, you will only call a function with a small number of argument types, so Numba will generate a fast implementation for each set of types.\n",
-    " * **just-in-time**: Numba translates functions when they are first called.  This ensures the compiler knows what argument types you will be using.  This also allows Numba to be used interactively in a Jupyter notebook just as easily as a traditional application.\n",
-    " * **numerically-focused**: Currently, Numba is focused on numerical data types, like `int`, `float`, and `complex`.  There is very limited string processing support, and many string use cases are not going to work well on the GPU.  To get best results with Numba, you will likely be using NumPy arrays."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Requirements for Using Numba\n",
-    "\n",
-    "Numba supports a wide range of operating systems:\n",
-    "\n",
-    " * Windows 7 and later, 32 and 64-bit\n",
-    " * macOS 10.9 and later, 64-bit\n",
-    " * Linux (most anything >= RHEL 5), 32-bit and 64-bit\n",
-    "\n",
-    "and Python versions:\n",
-    "\n",
-    " * Python 2.7, >3.4\n",
-    " * NumPy 1.10 and later\n",
-    "\n",
-    "and a very wide range of hardware:\n",
-    "\n",
-    "* x86, x86_64/AMD64 CPUs\n",
-    "* NVIDIA CUDA GPUs (Compute capability 3.0 and later, CUDA 8.0 and later)\n",
-    "* AMD GPUs (experimental patches)\n",
-    "* ARM (experimental patches)\n",
-    "\n",
-    "For this course, we will be using Linux 64-bit and CUDA 9."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Aside: CUDA C/C++ vs. Numba vs. pyCUDA\n",
-    "\n",
-    "By no means is Numba the only way to program with CUDA. By far the most common way to program in CUDA is with the CUDA C/C++ language extensions. With regard to Python, [pyCUDA](https://documen.tician.de/pycuda/) is, in addition to Numba, an alternative for GPU accelerating Python code. We will remain focused on Numba throughout this course, but a quick comparison of the three options just named is worth a mention before we get started, just for a little context.\n",
-    "\n",
-    "**CUDA C/C++**:\n",
-    "- The most common, performant, and flexible way to utilize CUDA\n",
-    "- Accelerates C/C++ applications\n",
-    "\n",
-    "**pyCUDA**:\n",
-    "- Exposes the entire CUDA C/C++ API\n",
-    "- Is the most performant CUDA option available for Python\n",
-    "- Requires writing C code in your Python, and in general, a lot of code modifications\n",
-    "\n",
-    "**Numba**:\n",
-    "- Potentially less performant than pyCUDA\n",
-    "- Does not (yet?) expose the entire CUDA C/C++ API\n",
-    "- Still enables massive acceleration, often with very little code modification\n",
-    "- Allows developers the convenience of writing code directly in Python\n",
-    "- Also optimizes Python code for the CPU"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## First Steps: Compile for the CPU\n",
-    "\n",
-    "Recall that Numba can be used to optimize code for either a CPU or GPU. As an introduction, and before moving on to GPU acceleration, let's write our first Numba function and compile it for the **CPU**. In doing so we will get an easy entrance into Numba syntax, and will also have an opportunity a little later on to compare the performance of CPU optimized Numba code to GPU accelerated Numba code.\n",
-    "\n",
-    "The Numba compiler is typically enabled by applying a [**function decorator**](https://en.wikipedia.org/wiki/Python_syntax_and_semantics#Decorators) to a Python function. Decorators are function modifiers that transform the Python functions they decorate, using a very simple syntax. Here we will use Numba's CPU compilation decorator `@jit`:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 4,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "from numba import jit\n",
-    "import math\n",
-    "\n",
-    "# This is the function decorator syntax and is equivalent to `hypot = jit(hypot)`.\n",
-    "# The Numba compiler is just a function you can call whenever you want!\n",
-    "@jit\n",
-    "def hypot(x, y):\n",
-    "    # Implementation from https://en.wikipedia.org/wiki/Hypot\n",
-    "    x = abs(x);\n",
-    "    y = abs(y);\n",
-    "    t = min(x, y);\n",
-    "    x = max(x, y);\n",
-    "    t = t / x;\n",
-    "    return x * math.sqrt(1+t*t)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Let's try out our hypotenuse calculation:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 5,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "5.0"
-      ]
-     },
-     "execution_count": 5,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "hypot(3.0, 4.0)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We will go into more detail below about what happens when `hypot` is called, but for now know that the first time we call `hypot`, the compiler is triggered and compiles a machine code implementation of the function for float inputs. Numba also saves the original Python implementation of the function in the `.py_func` attribute, so we can call the original Python code to make sure we get the same answer:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 6,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "5.0"
-      ]
-     },
-     "execution_count": 6,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "hypot.py_func(3.0, 4.0)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Benchmarking\n",
-    "\n",
-    "An important part of using Numba is measuring the performance of your new code.  Let's see if we actually sped anything up.  The easiest way to do this in a Jupyter notebook, like the one this session is run in, is to use the [`%timeit` magic function](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-timeit).  Let's first measure the speed of the original Python:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 7,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "677 ns ± 0.576 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "%timeit hypot.py_func(3.0, 4.0)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The `%timeit` magic runs the statement many times to get an accurate estimate of the run time.  As the output above shows, it reports the mean and standard deviation over several runs (7 here), which reduces the chance that random background events distort the measurement.  Because `hypot` was already compiled when we first called it, compilation time does not skew the results:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 8,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "188 ns ± 0.0139 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "%timeit hypot(3.0, 4.0)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Numba did a pretty good job with this function.  It's certainly faster than the pure Python version. Of course, a `hypot` function is already present in Python's `math` module, so let's see how it compares:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 9,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "124 ns ± 0.0233 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "%timeit math.hypot(3.0, 4.0)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Python's built-in is even faster than Numba! This is because Numba introduces some overhead to each function call that is larger than the function call overhead of Python itself. Extremely fast functions (like the one above) will be hurt by this. (As an aside, if you call one Numba function from another one, there is very little function overhead, sometimes even zero if the compiler inlines the function into the other one.) In short, always benchmark your functions for evidence of speedup."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Exercise: Use Numba to Compile a Function for the CPU\n",
-    "\n",
-    "The following function uses [the Monte Carlo Method to determine Pi](https://academo.org/demos/estimating-pi-monte-carlo/) (source code from the [Numba homepage](http://numba.pydata.org/)). The function itself is already working so don't worry about the mathematical implementation details.\n",
-    "\n",
-    "Complete the two `TODO`s in order to compile `monte_carlo_pi` with Numba before executing the following 3 cells which will:\n",
-    "\n",
-    "  1. Confirm the compiled version is behaving the same as the uncompiled version.\n",
-    "  2. Benchmark the compiled version.\n",
-    "  3. Benchmark the uncompiled version.\n",
-    "\n",
-    "If you get stuck, check out [the solution](../edit/solutions/monte_carlo_pi_solution.py)."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 10,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "nsamples = 1000000"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 11,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "# TODO: Import Numba's just-in-time compiler function\n",
-    "from numba import jit\n",
-    "import random\n",
-    "\n",
-    "# TODO: Use the Numba compiler to compile this function\n",
-    "@jit\n",
-    "def monte_carlo_pi(nsamples):\n",
-    "    acc = 0\n",
-    "    for i in range(nsamples):\n",
-    "        x = random.random()\n",
-    "        y = random.random()\n",
-    "        if (x**2 + y**2) < 1.0:\n",
-    "            acc += 1\n",
-    "    return 4.0 * acc / nsamples"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 12,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "# We will use numpy's `testing` library to confirm compiled and uncompiled versions run the same\n",
-    "from numpy import testing\n",
-    "\n",
-    "# This assertion will fail until you successfully complete the exercise one cell above\n",
-    "testing.assert_almost_equal(monte_carlo_pi(nsamples), monte_carlo_pi.py_func(nsamples), decimal=2)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 13,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "10.5 ms ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "%timeit monte_carlo_pi(nsamples)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 14,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "349 ms ± 4.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "%timeit monte_carlo_pi.py_func(nsamples)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## How Numba Works\n",
-    "\n",
-    "Now that you've gotten your hands a little dirty using the Numba compiler, let's take a look at what is actually going on under the hood. The first time we called our Numba-wrapped `hypot` function, the following process was initiated:\n",
-    "\n",
-    "![Numba Flowchart](images/numba_flowchart.png \"The compilation process\")\n",
-    "\n",
-    "We can see the result of type inference by using the `.inspect_types()` method, which prints an annotated version of the source code:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 15,
-   "metadata": {
-    "scrolled": true
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "hypot (float64, float64)\n",
-      "--------------------------------------------------------------------------------\n",
-      "# File: <ipython-input-4-1ae6c289554e>\n",
-      "# --- LINE 6 --- \n",
-      "# label 0\n",
-      "\n",
-      "@jit\n",
-      "\n",
-      "# --- LINE 7 --- \n",
-      "\n",
-      "def hypot(x, y):\n",
-      "\n",
-      "    # --- LINE 8 --- \n",
-      "\n",
-      "    # Implementation from https://en.wikipedia.org/wiki/Hypot\n",
-      "\n",
-      "    # --- LINE 9 --- \n",
-      "    #   x = arg(0, name=x)  :: float64\n",
-      "    #   y = arg(1, name=y)  :: float64\n",
-      "    #   $0.1 = global(abs: <built-in function abs>)  :: Function(<built-in function abs>)\n",
-      "    #   $0.3 = call $0.1(x, func=$0.1, args=[Var(x, <ipython-input-4-1ae6c289554e> (9))], kws=(), vararg=None)  :: (float64,) -> float64\n",
-      "    #   del x\n",
-      "    #   del $0.1\n",
-      "    #   x.1 = $0.3  :: float64\n",
-      "    #   del $0.3\n",
-      "\n",
-      "    x = abs(x);\n",
-      "\n",
-      "    # --- LINE 10 --- \n",
-      "    #   $0.4 = global(abs: <built-in function abs>)  :: Function(<built-in function abs>)\n",
-      "    #   $0.6 = call $0.4(y, func=$0.4, args=[Var(y, <ipython-input-4-1ae6c289554e> (9))], kws=(), vararg=None)  :: (float64,) -> float64\n",
-      "    #   del y\n",
-      "    #   del $0.4\n",
-      "    #   y.1 = $0.6  :: float64\n",
-      "    #   del $0.6\n",
-      "\n",
-      "    y = abs(y);\n",
-      "\n",
-      "    # --- LINE 11 --- \n",
-      "    #   $0.7 = global(min: <built-in function min>)  :: Function(<built-in function min>)\n",
-      "    #   $0.10 = call $0.7(x.1, y.1, func=$0.7, args=[Var(x.1, <ipython-input-4-1ae6c289554e> (9)), Var(y.1, <ipython-input-4-1ae6c289554e> (10))], kws=(), vararg=None)  :: (float64, float64) -> float64\n",
-      "    #   del $0.7\n",
-      "    #   t = $0.10  :: float64\n",
-      "    #   del $0.10\n",
-      "\n",
-      "    t = min(x, y);\n",
-      "\n",
-      "    # --- LINE 12 --- \n",
-      "    #   $0.11 = global(max: <built-in function max>)  :: Function(<built-in function max>)\n",
-      "    #   $0.14 = call $0.11(x.1, y.1, func=$0.11, args=[Var(x.1, <ipython-input-4-1ae6c289554e> (9)), Var(y.1, <ipython-input-4-1ae6c289554e> (10))], kws=(), vararg=None)  :: (float64, float64) -> float64\n",
-      "    #   del y.1\n",
-      "    #   del x.1\n",
-      "    #   del $0.11\n",
-      "    #   x.2 = $0.14  :: float64\n",
-      "    #   del $0.14\n",
-      "\n",
-      "    x = max(x, y);\n",
-      "\n",
-      "    # --- LINE 13 --- \n",
-      "    #   $0.17 = t / x.2  :: float64\n",
-      "    #   del t\n",
-      "    #   t.1 = $0.17  :: float64\n",
-      "    #   del $0.17\n",
-      "\n",
-      "    t = t / x;\n",
-      "\n",
-      "    # --- LINE 14 --- \n",
-      "    #   $0.19 = global(math: <module 'math' from '/home/appuser/Miniconda3/lib/python3.6/lib-dynload/math.cpython-36m-x86_64-linux-gnu.so'>)  :: Module(<module 'math' from '/home/appuser/Miniconda3/lib/python3.6/lib-dynload/math.cpython-36m-x86_64-linux-gnu.so'>)\n",
-      "    #   $0.20 = getattr(value=$0.19, attr=sqrt)  :: Function(<built-in function sqrt>)\n",
-      "    #   del $0.19\n",
-      "    #   $const0.21 = const(int, 1)  :: int64\n",
-      "    #   $0.24 = t.1 * t.1  :: float64\n",
-      "    #   del t.1\n",
-      "    #   $0.25 = $const0.21 + $0.24  :: float64\n",
-      "    #   del $const0.21\n",
-      "    #   del $0.24\n",
-      "    #   $0.26 = call $0.20($0.25, func=$0.20, args=[Var($0.25, <ipython-input-4-1ae6c289554e> (14))], kws=(), vararg=None)  :: (float64,) -> float64\n",
-      "    #   del $0.25\n",
-      "    #   del $0.20\n",
-      "    #   $0.27 = x.2 * $0.26  :: float64\n",
-      "    #   del x.2\n",
-      "    #   del $0.26\n",
-      "    #   $0.28 = cast(value=$0.27)  :: float64\n",
-      "    #   del $0.27\n",
-      "    #   return $0.28\n",
-      "\n",
-      "    return x * math.sqrt(1+t*t)\n",
-      "\n",
-      "\n",
-      "================================================================================\n"
-     ]
-    }
-   ],
-   "source": [
-    "hypot.inspect_types()"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Note that Numba's type names tend to mirror [the NumPy type names](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.types.html), so a Python `float` is a `float64` (also called \"double precision\" in other languages).  Taking a look at the data types can sometimes be important in GPU code because the performance of `float32` and `float64` computations can (depending on the GPU) be very different on CUDA devices. If your algorithm can obtain correct results using `float32`, then you should probably use that data type, because casting to `float64` can, depending on the GPU type, dramatically slow down the function."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Object and nopython Modes\n",
-    "\n",
-    "Numba cannot compile all Python code.  Some functions don't have a Numba translation, and some kinds of Python types can't be efficiently compiled at all (yet).  For example, Numba does not support dictionaries (as of this writing). Let's try to compile some Python code that Numba does not yet know how to compile:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 16,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "'value'"
-      ]
-     },
-     "execution_count": 16,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "@jit\n",
-    "def cannot_compile(x):\n",
-    "    return x['key']\n",
-    "\n",
-    "cannot_compile(dict(key='value'))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Given what we just said, you might be surprised that the cell above executed without any problems. This is because, by default, Numba falls back to a mode called **object mode**, which does not do type-specialization. Object mode exists to enable other Numba functionality, but in many cases you want Numba to tell you if type inference fails. You can force **nopython mode** (the other compilation mode) by passing the `nopython` argument to the decorator:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 17,
-   "metadata": {},
-   "outputs": [
-    {
-     "ename": "TypingError",
-     "evalue": "Failed in nopython mode pipeline (step: nopython frontend)\nInternal error at <numba.typeinfer.ArgConstraint object at 0x7fe8ec48d898>:\n--%<----------------------------------------------------------------------------\nTraceback (most recent call last):\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/errors.py\", line 599, in new_error_context\n    yield\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/typeinfer.py\", line 199, in __call__\n    assert ty.is_precise()\nAssertionError\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/typeinfer.py\", line 142, in propagate\n    constraint(typeinfer)\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/typeinfer.py\", line 200, in __call__\n    typeinfer.add_type(self.dst, ty, loc=self.loc)\n  File \"/home/appuser/Miniconda3/lib/python3.6/contextlib.py\", line 99, in __exit__\n    self.gen.throw(type, value, traceback)\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/errors.py\", line 607, in new_error_context\n    six.reraise(type(newerr), newerr, tb)\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/six.py\", line 659, in reraise\n    raise value\nnumba.errors.InternalError: \n[1] During: typing of argument at <ipython-input-17-d3b98ca43e8a> (3)\n--%<----------------------------------------------------------------------------\n\n\nFile \"<ipython-input-17-d3b98ca43e8a>\", line 3:\ndef cannot_compile(x):\n    return x['key']\n    ^\n\nThis error may have been caused by the following argument(s):\n- argument 0: cannot determine Numba type of <class 'dict'>\n\nThis is not usually a problem with Numba itself but instead often caused by\nthe use of unsupported features or an issue in resolving types.\n\nTo see Python/NumPy features supported by the latest release of Numba 
visit:\nhttp://numba.pydata.org/numba-doc/dev/reference/pysupported.html\nand\nhttp://numba.pydata.org/numba-doc/dev/reference/numpysupported.html\n\nFor more information about typing errors and how to debug them visit:\nhttp://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#my-code-doesn-t-compile\n\nIf you think your code should work with Numba, please report the error message\nand traceback, along with a minimal reproducer at:\nhttps://github.com/numba/numba/issues/new\n",
-     "output_type": "error",
-     "traceback": [
-      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
-      "\u001b[0;31mTypingError\u001b[0m                               Traceback (most recent call last)",
-      "\u001b[0;32m<ipython-input-17-d3b98ca43e8a>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      3\u001b[0m     \u001b[0;32mreturn\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'key'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 5\u001b[0;31m \u001b[0mcannot_compile\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'value'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
-      "\u001b[0;32m/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/dispatcher.py\u001b[0m in \u001b[0;36m_compile_for_args\u001b[0;34m(self, *args, **kws)\u001b[0m\n\u001b[1;32m    346\u001b[0m                 \u001b[0me\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpatch_message\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmsg\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    347\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 348\u001b[0;31m             \u001b[0merror_rewrite\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'typing'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    349\u001b[0m         \u001b[0;32mexcept\u001b[0m \u001b[0merrors\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mUnsupportedError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    350\u001b[0m             \u001b[0;31m# Something unsupported is present in the user code, add help info\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
-      "\u001b[0;32m/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/dispatcher.py\u001b[0m in \u001b[0;36merror_rewrite\u001b[0;34m(e, issue_type)\u001b[0m\n\u001b[1;32m    313\u001b[0m                 \u001b[0;32mraise\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    314\u001b[0m             \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 315\u001b[0;31m                 \u001b[0mreraise\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    316\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    317\u001b[0m         \u001b[0margtypes\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
-      "\u001b[0;32m/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/six.py\u001b[0m in \u001b[0;36mreraise\u001b[0;34m(tp, value, tb)\u001b[0m\n\u001b[1;32m    656\u001b[0m             \u001b[0mvalue\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtp\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    657\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__traceback__\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mtb\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 658\u001b[0;31m             \u001b[0;32mraise\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwith_traceback\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtb\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    659\u001b[0m         \u001b[0;32mraise\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    660\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
-      "\u001b[0;31mTypingError\u001b[0m: Failed in nopython mode pipeline (step: nopython frontend)\nInternal error at <numba.typeinfer.ArgConstraint object at 0x7fe8ec48d898>:\n--%<----------------------------------------------------------------------------\nTraceback (most recent call last):\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/errors.py\", line 599, in new_error_context\n    yield\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/typeinfer.py\", line 199, in __call__\n    assert ty.is_precise()\nAssertionError\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/typeinfer.py\", line 142, in propagate\n    constraint(typeinfer)\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/typeinfer.py\", line 200, in __call__\n    typeinfer.add_type(self.dst, ty, loc=self.loc)\n  File \"/home/appuser/Miniconda3/lib/python3.6/contextlib.py\", line 99, in __exit__\n    self.gen.throw(type, value, traceback)\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/errors.py\", line 607, in new_error_context\n    six.reraise(type(newerr), newerr, tb)\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/six.py\", line 659, in reraise\n    raise value\nnumba.errors.InternalError: \n[1] During: typing of argument at <ipython-input-17-d3b98ca43e8a> (3)\n--%<----------------------------------------------------------------------------\n\n\nFile \"<ipython-input-17-d3b98ca43e8a>\", line 3:\ndef cannot_compile(x):\n    return x['key']\n    ^\n\nThis error may have been caused by the following argument(s):\n- argument 0: cannot determine Numba type of <class 'dict'>\n\nThis is not usually a problem with Numba itself but instead often caused by\nthe use of unsupported features or an issue in resolving types.\n\nTo see Python/NumPy features supported by the latest 
release of Numba visit:\nhttp://numba.pydata.org/numba-doc/dev/reference/pysupported.html\nand\nhttp://numba.pydata.org/numba-doc/dev/reference/numpysupported.html\n\nFor more information about typing errors and how to debug them visit:\nhttp://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#my-code-doesn-t-compile\n\nIf you think your code should work with Numba, please report the error message\nand traceback, along with a minimal reproducer at:\nhttps://github.com/numba/numba/issues/new\n"
-     ]
-    }
-   ],
-   "source": [
-    "@jit(nopython=True)\n",
-    "def cannot_compile(x):\n",
-    "    return x['key']\n",
-    "\n",
-    "cannot_compile(dict(key='value'))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Now we get an exception when Numba tries to compile the function, and if you scroll down to the end of the exception output you will see an error that describes the underlying problem:\n",
-    "```\n",
-    "- argument 0: cannot determine Numba type of <class 'dict'>\n",
-    "```\n",
-    "\n",
-    "**Using `nopython` mode is the recommended best practice for `jit`, as it leads to the best performance.**\n",
-    "\n",
-    "Numba provides another decorator `njit` which is an alias for `jit(nopython=True)`:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 18,
-   "metadata": {},
-   "outputs": [
-    {
-     "ename": "TypingError",
-     "evalue": "Failed in nopython mode pipeline (step: nopython frontend)\nInternal error at <numba.typeinfer.ArgConstraint object at 0x7fe8ec43ada0>:\n--%<----------------------------------------------------------------------------\nTraceback (most recent call last):\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/errors.py\", line 599, in new_error_context\n    yield\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/typeinfer.py\", line 199, in __call__\n    assert ty.is_precise()\nAssertionError\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/typeinfer.py\", line 142, in propagate\n    constraint(typeinfer)\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/typeinfer.py\", line 200, in __call__\n    typeinfer.add_type(self.dst, ty, loc=self.loc)\n  File \"/home/appuser/Miniconda3/lib/python3.6/contextlib.py\", line 99, in __exit__\n    self.gen.throw(type, value, traceback)\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/errors.py\", line 607, in new_error_context\n    six.reraise(type(newerr), newerr, tb)\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/six.py\", line 659, in reraise\n    raise value\nnumba.errors.InternalError: \n[1] During: typing of argument at <ipython-input-18-598d254e6e01> (5)\n--%<----------------------------------------------------------------------------\n\n\nFile \"<ipython-input-18-598d254e6e01>\", line 5:\ndef cannot_compile(x):\n    return x['key']\n    ^\n\nThis error may have been caused by the following argument(s):\n- argument 0: cannot determine Numba type of <class 'dict'>\n\nThis is not usually a problem with Numba itself but instead often caused by\nthe use of unsupported features or an issue in resolving types.\n\nTo see Python/NumPy features supported by the latest release of Numba 
visit:\nhttp://numba.pydata.org/numba-doc/dev/reference/pysupported.html\nand\nhttp://numba.pydata.org/numba-doc/dev/reference/numpysupported.html\n\nFor more information about typing errors and how to debug them visit:\nhttp://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#my-code-doesn-t-compile\n\nIf you think your code should work with Numba, please report the error message\nand traceback, along with a minimal reproducer at:\nhttps://github.com/numba/numba/issues/new\n",
-     "output_type": "error",
-     "traceback": [
-      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
-      "\u001b[0;31mTypingError\u001b[0m                               Traceback (most recent call last)",
-      "\u001b[0;32m<ipython-input-18-598d254e6e01>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      5\u001b[0m     \u001b[0;32mreturn\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'key'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      6\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m \u001b[0mcannot_compile\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'value'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
-      "\u001b[0;32m/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/dispatcher.py\u001b[0m in \u001b[0;36m_compile_for_args\u001b[0;34m(self, *args, **kws)\u001b[0m\n\u001b[1;32m    346\u001b[0m                 \u001b[0me\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpatch_message\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmsg\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    347\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 348\u001b[0;31m             \u001b[0merror_rewrite\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'typing'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    349\u001b[0m         \u001b[0;32mexcept\u001b[0m \u001b[0merrors\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mUnsupportedError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    350\u001b[0m             \u001b[0;31m# Something unsupported is present in the user code, add help info\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
-      "\u001b[0;32m/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/dispatcher.py\u001b[0m in \u001b[0;36merror_rewrite\u001b[0;34m(e, issue_type)\u001b[0m\n\u001b[1;32m    313\u001b[0m                 \u001b[0;32mraise\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    314\u001b[0m             \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 315\u001b[0;31m                 \u001b[0mreraise\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    316\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    317\u001b[0m         \u001b[0margtypes\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
-      "\u001b[0;32m/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/six.py\u001b[0m in \u001b[0;36mreraise\u001b[0;34m(tp, value, tb)\u001b[0m\n\u001b[1;32m    656\u001b[0m             \u001b[0mvalue\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtp\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    657\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__traceback__\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mtb\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 658\u001b[0;31m             \u001b[0;32mraise\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwith_traceback\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtb\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    659\u001b[0m         \u001b[0;32mraise\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    660\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
-      "\u001b[0;31mTypingError\u001b[0m: Failed in nopython mode pipeline (step: nopython frontend)\nInternal error at <numba.typeinfer.ArgConstraint object at 0x7fe8ec43ada0>:\n--%<----------------------------------------------------------------------------\nTraceback (most recent call last):\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/errors.py\", line 599, in new_error_context\n    yield\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/typeinfer.py\", line 199, in __call__\n    assert ty.is_precise()\nAssertionError\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/typeinfer.py\", line 142, in propagate\n    constraint(typeinfer)\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/typeinfer.py\", line 200, in __call__\n    typeinfer.add_type(self.dst, ty, loc=self.loc)\n  File \"/home/appuser/Miniconda3/lib/python3.6/contextlib.py\", line 99, in __exit__\n    self.gen.throw(type, value, traceback)\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/errors.py\", line 607, in new_error_context\n    six.reraise(type(newerr), newerr, tb)\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/six.py\", line 659, in reraise\n    raise value\nnumba.errors.InternalError: \n[1] During: typing of argument at <ipython-input-18-598d254e6e01> (5)\n--%<----------------------------------------------------------------------------\n\n\nFile \"<ipython-input-18-598d254e6e01>\", line 5:\ndef cannot_compile(x):\n    return x['key']\n    ^\n\nThis error may have been caused by the following argument(s):\n- argument 0: cannot determine Numba type of <class 'dict'>\n\nThis is not usually a problem with Numba itself but instead often caused by\nthe use of unsupported features or an issue in resolving types.\n\nTo see Python/NumPy features supported by the latest 
release of Numba visit:\nhttp://numba.pydata.org/numba-doc/dev/reference/pysupported.html\nand\nhttp://numba.pydata.org/numba-doc/dev/reference/numpysupported.html\n\nFor more information about typing errors and how to debug them visit:\nhttp://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#my-code-doesn-t-compile\n\nIf you think your code should work with Numba, please report the error message\nand traceback, along with a minimal reproducer at:\nhttps://github.com/numba/numba/issues/new\n"
-     ]
-    }
-   ],
-   "source": [
-    "from numba import njit\n",
-    "\n",
-    "@njit\n",
-    "def cannot_compile(x):\n",
-    "    return x['key']\n",
-    "\n",
-    "cannot_compile(dict(key='value'))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Please refer to [the Numba documentation](https://numba.pydata.org/numba-doc/dev/reference/pysupported.html) for an exhaustive account of Numba-supported Python."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Introduction to Numba for the GPU with NumPy Universal Functions (ufuncs)\n",
-    "\n",
-    "We will begin our coverage of GPU programming in Numba with how to compile [NumPy Universal functions \\(or ufuncs\\)](https://docs.scipy.org/doc/numpy-1.15.1/reference/ufuncs.html) for the GPU."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The most important thing to know about GPU programming as we get started is that GPU hardware is designed for *data parallelism*. Maximum throughput is achieved when the GPU is computing the same operations on many different elements at once.\n",
-    "\n",
-    "NumPy Universal functions, which perform the same operation on every element in a NumPy array, are inherently data parallel, which makes them a natural fit for GPU programming."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Review of NumPy Universal Functions (ufuncs)\n",
-    "\n",
-    "Familiarity with NumPy ufuncs is a prerequisite of this course, but in case you are unfamiliar with them, or in case it has been a while, here is a very brief introduction. If, at the end of this brief introduction, you don't feel comfortable with the basic NumPy mechanisms for array creation and ufuncs, consider the ~1 hour [NumPy Quickstart Tutorial](https://docs.scipy.org/doc/numpy/user/quickstart.html).\n",
-    "\n",
-    "NumPy has the concept of universal functions (\"ufuncs\"), which are functions that can take NumPy arrays of varying dimensions, or scalars, and operate on them element-by-element.\n",
-    "\n",
-    "As an example we'll use the NumPy `add` ufunc to demonstrate the basic ufunc mechanism:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 19,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "array([11, 22, 33, 44])"
-      ]
-     },
-     "execution_count": 19,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "import numpy as np\n",
-    "\n",
-    "a = np.array([1, 2, 3, 4])\n",
-    "b = np.array([10, 20, 30, 40])\n",
-    "\n",
-    "np.add(a, b) # Returns a new NumPy array resulting from adding each element in `a` to the corresponding element in `b`"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Ufuncs can also combine scalars with arrays:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 20,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "array([101, 102, 103, 104])"
-      ]
-     },
-     "execution_count": 20,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "np.add(a, 100) # Returns a new NumPy array resulting from adding 100 to every element in `a`"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Arrays of different but compatible dimensions can also be combined via a technique called [*broadcasting*](https://docs.scipy.org/doc/numpy-1.15.0/user/basics.broadcasting.html). The lower-dimensional array is replicated to match the dimensionality of the higher-dimensional array. If needed, check out the docs for [`numpy.arange`](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.arange.html) and [`numpy.ndarray.reshape`](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.ndarray.reshape.html), both of which will be used several times throughout this course:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 21,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "c: [[ 0  1  2  3]\n",
-      " [ 4  5  6  7]\n",
-      " [ 8  9 10 11]\n",
-      " [12 13 14 15]]\n"
-     ]
-    },
-    {
-     "data": {
-      "text/plain": [
-       "array([[10, 21, 32, 43],\n",
-       "       [14, 25, 36, 47],\n",
-       "       [18, 29, 40, 51],\n",
-       "       [22, 33, 44, 55]])"
-      ]
-     },
-     "execution_count": 21,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "c = np.arange(4*4).reshape((4,4))\n",
-    "print('c:', c)\n",
-    "\n",
-    "np.add(b, c)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Making ufuncs for the GPU\n",
-    "\n",
-    "Numba has the ability to create *compiled* ufuncs, typically a not-so-straightforward process involving C code. With Numba you simply implement a scalar function to be performed on all the inputs, decorate it with `@vectorize`, and Numba will figure out the broadcast rules for you. For those of you familiar with [NumPy's `vectorize`](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.vectorize.html), Numba's `vectorize` decorator will be very familiar."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "In this very first example we will use the `@vectorize` decorator to compile and optimize a ufunc for the **CPU**."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 26,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "from numba import vectorize\n",
-    "\n",
-    "@vectorize\n",
-    "def add_ten(num):\n",
-    "    return num + 10 # This scalar operation will be performed on each element"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 27,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])"
-      ]
-     },
-     "execution_count": 27,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "nums = np.arange(10)\n",
-    "add_ten(nums) # pass the whole array into the ufunc, it performs the operation on each element"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "To generate a ufunc that uses CUDA on the GPU, we additionally give an **explicit type signature** and set the `target` attribute. The type signature argument describes the types to use for both the ufunc's arguments and its return value:\n",
-    "```python\n",
-    "'return_value_type(argument1_value_type, argument2_value_type, ...)'\n",
-    "```\n",
-    "\n",
-    "Please see the Numba docs for more on [available types](https://numba.pydata.org/numba-doc/dev/reference/types.html), as well as for additional information on [writing ufuncs with more than one signature](https://numba.pydata.org/numba-doc/dev/user/vectorize.html).\n",
-    "\n",
-    "Here is a simple example of a ufunc that will be compiled for a CUDA-enabled GPU device. It expects two `int64` values and also returns an `int64` value:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 28,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "@vectorize(['int64(int64, int64)'], target='cuda') # Type signature and target are required for the GPU\n",
-    "def add_ufunc(x, y):\n",
-    "    return x + y"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 29,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "array([11, 22, 33, 44])"
-      ]
-     },
-     "execution_count": 29,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "add_ufunc(a, b)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "For such a simple function call, a lot of things just happened!  Numba just automatically:\n",
-    "\n",
-    " * Compiled a CUDA kernel to execute the ufunc operation in parallel over all the input elements.\n",
-    " * Allocated GPU memory for the inputs and the output.\n",
-    " * Copied the input data to the GPU.\n",
-    " * Executed the CUDA kernel (GPU function) with the correct kernel dimensions given the input sizes.\n",
-    " * Copied the result back from the GPU to the CPU.\n",
-    " * Returned the result as a NumPy array on the host.\n",
-    " \n",
-    "Compared to an implementation in C, the above is remarkably more concise.\n",
-    "\n",
-    "You might be wondering how fast our simple example is on the GPU.  Let's see:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 30,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "997 ns ± 0.647 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "%timeit np.add(b, c)   # NumPy on CPU"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 31,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "680 µs ± 1.71 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "%timeit add_ufunc(b, c) # Numba on GPU"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Wow, the GPU is *a lot slower* than the CPU? For the time being this is to be expected, because we have (deliberately) misused the GPU in several ways in this example. Looking at how we have misused the GPU will help clarify which kinds of problems are well-suited for GPU computing, and which are best left to the CPU:\n",
-    "\n",
-    "  * **Our inputs are too small**: the GPU achieves performance through parallelism, operating on thousands of values at once.  Our test inputs have only 4 and 16 integers, respectively.  We need a much larger array to even keep the GPU busy.\n",
-    "  * **Our calculation is too simple**: Sending a calculation to the GPU involves quite a bit of overhead compared to calling a function on the CPU.  If our calculation does not involve enough math operations (often called \"arithmetic intensity\"), then the GPU will spend most of its time waiting for data to move around.\n",
-    "  * **We copy the data to and from the GPU**: While in some scenarios paying the cost of copying data to and from the GPU can be worth it for a single function, it is often preferable to run several GPU operations in sequence. In those cases, it makes sense to send data to the GPU and keep it there until all of our processing is complete.\n",
-    "  * **Our data types are larger than necessary**: Our example uses `int64` when we probably don't need it.  Scalar code using 32-bit and 64-bit data types runs at basically the same speed on the CPU, and for integer types the difference may not be drastic, but 64-bit floating point data types may have a significant performance cost on the GPU, depending on the GPU type.  Basic arithmetic on 64-bit floats can be anywhere from 2x (Pascal-architecture Tesla) to 24x (Maxwell-architecture GeForce) slower than 32-bit floats. If you are using more modern GPUs (Volta, Turing, Ampere), then this could be far less of a concern.  NumPy defaults to 64-bit data types when creating arrays, so it is important to set the [`dtype`](https://docs.scipy.org/doc/numpy-1.14.0/reference/arrays.dtypes.html) attribute or use the [`ndarray.astype()`](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.ndarray.astype.html) method to pick 32-bit types when you need them.\n",
-    "  \n",
-    "  \n",
-    "Given the above, let's try an example that is faster on the GPU by performing an operation with much greater arithmetic intensity, on a much larger input, and using a 32-bit data type.\n",
-    "\n",
-    "**Please note:** Not all NumPy code will work on the GPU, and, as in the following example, we will need to use the `math` library's `pi` and `exp` instead of NumPy's. Please see [the Numba docs](https://numba.pydata.org/numba-doc/latest/reference/numpysupported.html) for extensive coverage of NumPy support on the GPU."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 33,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "import math # Note that for the CUDA target, we need to use the scalar functions from the math module, not NumPy\n",
-    "\n",
-    "SQRT_2PI = np.float32((2*math.pi)**0.5)  # Precompute this constant as a float32.  Numba will inline it at compile time.\n",
-    "\n",
-    "@vectorize(['float32(float32, float32, float32)'], target='cuda')\n",
-    "def gaussian_pdf(x, mean, sigma):\n",
-    "    '''Compute the value of a Gaussian probability density function at x with given mean and sigma.'''\n",
-    "    return math.exp(-0.5 * ((x - mean) / sigma)**2) / (sigma * SQRT_2PI)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 34,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "array([0.00926774], dtype=float32)"
-      ]
-     },
-     "execution_count": 34,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "import numpy as np\n",
-    "# Evaluate the Gaussian a million times!\n",
-    "x = np.random.uniform(-3, 3, size=1000000).astype(np.float32)\n",
-    "mean = np.float32(0.0)\n",
-    "sigma = np.float32(1.0)\n",
-    "\n",
-    "# Quick test on a single element just to make sure it works\n",
-    "gaussian_pdf(x[0], 0.0, 1.0)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 35,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "63.8 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "import scipy.stats # for definition of gaussian distribution, so we can compare CPU to GPU time\n",
-    "norm_pdf = scipy.stats.norm\n",
-    "%timeit norm_pdf.pdf(x, loc=mean, scale=sigma)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 36,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "4.15 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "%timeit gaussian_pdf(x, mean, sigma)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "That's a pretty large improvement, even including the overhead of copying all the data to and from the GPU.  Ufuncs that use special functions (`exp`, `sin`, `cos`, etc.) on large data sets run especially well on the GPU.\n",
-    "\n",
-    "To complete our comparison, let's define and time our `gaussian_pdf` function when optimized by Numba for the **CPU**:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 37,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "@vectorize\n",
-    "def cpu_gaussian_pdf(x, mean, sigma):\n",
-    "    '''Compute the value of a Gaussian probability density function at x with given mean and sigma.'''\n",
-    "    return math.exp(-0.5 * ((x - mean) / sigma)**2) / (sigma * SQRT_2PI)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 38,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "22.4 ms ± 51.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "%timeit cpu_gaussian_pdf(x, mean, sigma)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "That's much faster than the uncompiled CPU version, but much slower than the GPU accelerated one."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## CUDA Device Functions\n",
-    "\n",
-    "Ufuncs are really quite fantastic if and when you want to perform element-wise operations, which is a very common task. There are any number of functions, however, that do not fit this description. To compile functions for the GPU that are **not** element-wise, vectorized functions, we use `numba.cuda.jit`. In the next section of this course we work extensively with `numba.cuda.jit`, but for now, let us demonstrate how to use it to decorate a helper function, to be utilized by a GPU accelerated ufunc, so that you are not required to cram all your logic into a single ufunc definition.\n",
-    "\n",
-    "Notice that `polar_to_cartesian` below does not require a type signature, and also that it is passed two scalar values, unlike the vectorized ufuncs we have been using (such as `polar_distance` below), which expect NumPy arrays as arguments.\n",
-    "\n",
-    "The argument `device=True` indicates that the decorated function can **only** be called from a function running on the GPU, and not from CPU host code:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 40,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "from numba import cuda\n",
-    "\n",
-    "@cuda.jit(device=True)\n",
-    "def polar_to_cartesian(rho, theta):\n",
-    "    x = rho * math.cos(theta)\n",
-    "    y = rho * math.sin(theta)\n",
-    "    return x, y\n",
-    "\n",
-    "@vectorize(['float32(float32, float32, float32, float32)'], target='cuda')\n",
-    "def polar_distance(rho1, theta1, rho2, theta2):\n",
-    "    x1, y1 = polar_to_cartesian(rho1, theta1) # We can use device functions inside our GPU ufuncs\n",
-    "    x2, y2 = polar_to_cartesian(rho2, theta2)\n",
-    "    \n",
-    "    return ((x1 - x2)**2 + (y1 - y2)**2)**0.5"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 44,
-   "metadata": {},
-   "outputs": [
-    {
-     "ename": "TypeError",
-     "evalue": "'DeviceFunctionTemplate' object is not callable",
-     "output_type": "error",
-     "traceback": [
-      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
-      "\u001b[0;31mTypeError\u001b[0m                                 Traceback (most recent call last)",
-      "\u001b[0;32m<ipython-input-44-edcd96b4594f>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mpolar_to_cartesian\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1.0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m2.0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
-      "\u001b[0;31mTypeError\u001b[0m: 'DeviceFunctionTemplate' object is not callable"
-     ]
-    }
-   ],
-   "source": [
-    "polar_to_cartesian(1.0, 2.0)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 45,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "n = 1000000\n",
-    "rho1 = np.random.uniform(0.5, 1.5, size=n).astype(np.float32)\n",
-    "theta1 = np.random.uniform(-np.pi, np.pi, size=n).astype(np.float32)\n",
-    "rho2 = np.random.uniform(0.5, 1.5, size=n).astype(np.float32)\n",
-    "theta2 = np.random.uniform(-np.pi, np.pi, size=n).astype(np.float32)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 46,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "array([2.2186031, 1.3172561, 1.4999884, ..., 0.7888327, 2.3476734,\n",
-       "       1.5603681], dtype=float32)"
-      ]
-     },
-     "execution_count": 46,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "polar_distance(rho1, theta1, rho2, theta2)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Note that the CUDA compiler aggressively inlines device functions, so there is generally no overhead for function calls.  Similarly, the \"tuple\" returned by `polar_to_cartesian` is not actually created as a Python object, but represented temporarily as a struct, which is then optimized away by the compiler."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Allowed Python on the GPU\n",
-    "\n",
-    "Compared to Numba on the CPU (which is already limited), Numba on the GPU has more limitations.  Supported Python includes:\n",
-    "\n",
-    "* `if`/`elif`/`else`\n",
-    "* `while` and `for` loops\n",
-    "* Basic math operators\n",
-    "* Selected functions from the `math` and `cmath` modules\n",
-    "* Tuples\n",
-    "\n",
-    "See [the Numba manual](http://numba.pydata.org/numba-doc/latest/cuda/cudapysupported.html) for more details."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Exercise: GPU Accelerate a Function\n",
-    "\n",
-    "Let's GPU accelerate a \"zero suppression\" function. A common operation when working with waveforms is to force all sample values below a certain absolute magnitude to be zero, as a way to eliminate low amplitude noise. Let's make some sample data:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 47,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "[<matplotlib.lines.Line2D at 0x7fe8b81bd400>]"
-      ]
-     },
-     "execution_count": 47,
-     "metadata": {},
-     "output_type": "execute_result"
-    },
-    {
-     "data": {
-      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAD8CAYAAAB5Pm/hAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAIABJREFUeJzt3Xl8FEX6P/BPkZNAEggECCSQgMgh\nt+HGEwEB12td1xNcD9br67qXP1yP1d11dV2PXV2PxRNdxRMVAUFElNNAuM+QQAIEQk5IQiAn9fsj\nnZmaZO7p7pquft6vV17p6anprrme6a6ueopxzkEIIURd7WRXgBBCiLEo0BNCiOIo0BNCiOIo0BNC\niOIo0BNCiOIo0BNCiOIo0BNCiOIo0BNCiOIo0BNCiOIiZVcAALp27crT09NlV4MQQixl8+bNZZzz\nZF/lwiLQp6enIzs7W3Y1CCHEUhhjh/wpR003hBCiOAr0hBCiOAr0hBCiOAr0hBCiOAr0hBCiOAr0\nhBCiOAr0hBCiOAr0rZRU1+Lb3cdlV8PyGprO4q+L9+C55TkoPHFadnWUsDa3DAVlNbKrYXlVtQ24\n4uU1mPD0SqzJLZVdHVOExYCpcPHyylw8v2I/AODui/rhgcnnIC6aXqJgzFt9EG+tzQcA/GdVHgqe\nmSm5RtZVXduABz/ahpX7SgCAXssQlJ2qQ+bfvnPcvvWtjbZ4PemIXrO/uNoR5AHg9R8P4Lnl+708\ngnjz7Z5il9sPfrRVUk2s79llOY4gT0IjBnk7oUCvuf3dTW3Wvb0uX0JN1LD9yEmX219uOyapJtZW\nUVOP939yHeWePneJpNoQq6JAryk8cUZ2FZRxqq7R7fqDpadMron1bcwvd7v+0+wjJtfE+jjnsqsg\nDQV6orvp/17tdv2lz/+Is2ft+2ULhqeX64+f7TC3Igr4NLvQ7frWZ0wqokAPYOvhEx7vW7m32ON9\nxL0jFZ7Pjqg5LDBzP6eArpdVOe6vczz25S6Ta2I+CvQArnl1vcf77phP6ZMD0dB01uv96w+4b4og\n7lXVum8GA4DdxypNrIm1cc7xzS7P3abrGptMrI35KNATXX2+2f3pcYvvqfeI377f5/1scuZLa02q\nifXllni/PlRSVWdSTeSgQE909d1eCuR6uf1dOpvUy45C72c/Fzy7yqSayEGB3g8/HaTmBn9wzvEd\nXdMgYegPn26XXQWpKND74YZ5P8mugiV46tVAiBWUnVK3+cb2gX5/cbXsKihj3YEy2VWwnXKFg5PZ\nbB3oGWNpjLFVjLG9jLHdjLHfaOuTGGMrGGO52v/O2nrGGHuJMZbHGNvBGBtl9JMIxdQX3ff5bk31\nq/J6WLqzyK9ydh644i9Pg85a+yWdbfrkqydYi5OnGwyuiTz+HNE3Avg953wQgHEA7mOMDQYwF8BK\nznl/ACu12wAwHUB/7W8OgNd0r7UEFJt8a2jy70XydWGMADNfWuNXuTwfvUkI8NHGw36VU7mJ1meg\n55wXcc63aMvVAPYC6AXgKgDztWLzAVytLV8F4D3e7CcAnRhjKbrX3GT7jlMTj178PcKys0PllNpZ\nL19v9+9MU2UBtdEzxtIBjASQBaA757wIaP4xANBNK9YLgJiIo1BbZxl7/jKtzbqHPrP3VftgxUa1\n/Yg9+fUeCTWxvuszU2VXwZI2FlTIroJ0fgd6xlhHAJ8DeJBzXuWtqJt1bc7pGWNzGGPZjLHs0lI5\nyf83H2qb+mDR/RPd5qDfX0ynyN6s3u/+Pfz+9xe3WbfzaCXlvAnCs9cNd7u+tFrdi4hGGZaaiCev\nPE92NUzjV6BnjEWhOch/wDlfqK0ubmmS0f63jJQpBJAmPDwVQJsctZzzeZzzTM55ZnJycrD1D8my\nXW1P6YaldpJQE+tzF+gfmNwfPTu1x
zndOkqokVpev6W5T8MF/bu2uY8CfeA+vGscJp7Tpc36EzX1\nEmpjPH963TAAbwHYyzl/QbhrEYDZ2vJsAF8J62dpvW/GAahsaeIJN2+soQRbejleVdtm3dTB3QEA\nX9w7wezqKOfyIc2XuV6+cWSb+yoUDU56aHJz5pj31HR0jInEOd3i29y386iaHQX8OaKfCOBWAJcy\nxrZpfzMAPANgCmMsF8AU7TYALAVwEEAegDcA3Kt/tY1xzUjnpYTbJ2ZIrIn1LN7R9rd8SK9EAEB8\nbBReucm1l+1Z6sbk0ROLdnu8r1NcNP78s8Eu6x78mGbv8sRdV97ICGfY69Wpvct9s97eaHidZPCn\n181azjnjnA/jnI/Q/pZyzss555M55/21/xVaec45v49z3o9zPpRzbpmEHed2d/7Cd4qLanM/9RYJ\n3sxhrh2vtlMXS4/eXV/g9f6rR7j2bSg7RUf0nuwp8nY5EZh/+xiTaiKX7UfGim6flO5YnnNhX/Tp\nEudy//HKts0TBKhxM7jHXW8bEeX591/rdvl2zF1/B+LOlf9Z53L7qWuGuNy2y/UjCvSa8/t0Rkxk\nhON2bFQE5t2a6VJG9Qx3wSp20z7/2d3e2+Vf/eGAUdVRTuvPYaKbs03in5vH9pFdBSlsG+hP17se\nhd40pnebMud2t8evvRHSu3aQXQVltI+O8F2IBG1A97YXZVVj20B/batZpVq3IQMAo1PkoHWMaTsW\ngZBwNG/W+bKrYDjbBvrWKQ1io+ioKVitp2jrS0fzurnufPejYTc/epnJNVFX7yTXa3Fn6tVLYGjb\nQE/088/lOS633fVYAoAJ/doOUCGuSqpdr3cMT3M/gK9LxxiX25UKZ14MVm2Da8CeNd59+3zrM/e/\nLVEvRQcFeh/ev8Me3a/09OrN7k+Fu8W7BqdG6q7axpJW4xFuGJ3moaSrNXly0oiEs+daHYD8fuoA\nvx73QZZ/2S6thAI9gAvP9ZyCYWI/165tqg6R1lOPxFi361t/0eoaKdC39sqqPJfbURH+fUUb/UwR\nbSdvrnUd+Z7Y3r69lSjQA7hlbNseNy1aX48tor70QUtr1Rb6s5fXSqpJ+Ap28NODH2/TuSZEJRTo\nAUw9r4fH+1q33/k78w/x7WBZjewqEGILtgz0oaTIVfFCjZ5+fWFf2VUgJGCPzhwkuwqGsmWg95X/\nwhuaBs+7udMHer3/tgnp5lTEgk6eDqzZ5tO7xxtUE/V8NGec1/vvvMD1AKVesetHtgz09dTbwzC+\nBpnNpkDv0bd7XPP/eOqm2mJ0epKR1VHKCA/dVD1ZuKXQoJrIYctAH2gPhTUPXWJQTayvdb9vXzJo\nMJXfljxwgewqKCPQQe4fbTriu5CF2DLQ3/7upoDKp3Zu77uQTf3jmxzfhUhQWudK9yXQH107ERMW\n+mPbkZMG1UQOWwb6QHvOUM4bz9YfKJNdBaI5SaNjiQe2DPREP+K4gsu9dFP1hEbHOm04UB7S46vO\nUKBv4W5mKTuzfaD/29VDfBdqRcWkR3p45udDA36MuykI7eqLrUdDevx1r2/QqSbW99y31KQosn2g\nP69nQsCP+WpbaF9IVQUzxPyJrz3Pj0pIsF5ZFfjENoH2zLES2wf6kb07B/yYvy3Za0BNrM/faxkX\nCbmFqF3ZvUEp/h2A0LgE3zylem7tzdmZvgtZlO0CfaUO7ZiUBiE0//rlCNlVCHvj+/qX0vmxKwYb\nXBPru2NShl/lurZK/awS2wV6CtLyxcXQJC++3HWhf8Epoh31CPPF37Mjldku0NPVeP0Ee1E60D7N\ndtDUKv9SUodoSTWxvlByWYl2HVUn3YntAv2BUsqYqJd31uf7LuSH1kHOjhpadTOlH8PgZR86oct2\n8hXKrmq7QD/77Y2O5f7dOvr9uC/vm2hEdSytulafZrDymjpdtmNlRypO67Id+tHU76z9T1/s1GU7\n4
cB2gV60+IFJfpdVuetVsAp0OuKh1jRAjM+htLvvDSEzqyrahfD6jc1wJorT60AmHNg60IdyekyD\npoBvdh13LG99bErQ2zlLkR7Zhyocy4NS4oPejmo5WoIRShv9I4rmpbd1oA9F6xnm7a5zCBcPT9RQ\nX/pdR51H4tMGB55KooVqedRDNemcrr4LCdopmteKAj2R4ivhmsctb2VJrEl4WLDxsGN5SGpiQI/t\nLczF+zwN/XcxY2hKQOX7dInzXciCbBXoQ71I8ws/R9gR34b2cgaziprgJsRWVVJcYGdHlwxwjjSu\noSZFPL9iv2P5mpG9AnpsfGzgaTyswFaBvrYhtNPaZ68b5lheR+l5Q6LoGbIuxB9Bf/xuygCDamJN\nG/Od1zvaR1M3VcBmgT7U4CLmcrn/w60h1sbeKMe/Z4H2Gkn0MeUgIRToCSFEcfYK9KBITwixH5+B\nnjH2NmOshDG2S1j3BGPsKGNsm/Y3Q7jvYcZYHmMshzE2zaiKB+Pkabropxe98okQEs72HVdjAJo/\nR/TvArjczfoXOecjtL+lAMAYGwzgBgDnaY95lTEWNldDxMEkz1wb+GxIxGnBpsO+CxG/6DFkPybS\nVifnhvr9lHMdy6cV6cXk89PBOV8NoMJXOc1VAD7inNdxzvMB5AEYE0L9dDXn/c2O5fP7BD7hCHF6\n5AvHCR7+EcQUgsTpQOmpkLdBKTr00yEm0rFcUlXrpaR1hHIYcD9jbIfWtNMSNXsBOCKUKdTWhZ3+\n3YMfZt6C0iA0y+jqf3I4T/SYEMaqlu8uDnkbf5jm7GL55pqDIW/Pznp2au9YfvqbfRJrop9gA/1r\nAPoBGAGgCMDz2np3VzvdnpcyxuYwxrIZY9mlpaVBVkOu6lr7BifRGCERVCBuHNPbsVysyJFTMMpO\nObN3Tujn38xSrYl97+081eUSYbL5tKT2Xkp6dvkQZwqKQ+X6ZBWVLahAzzkv5pw3cc7PAngDzuaZ\nQgBpQtFUAMc8bGMe5zyTc56ZnJzsrkjYi4qgdtFQXDm8p2PZznnNxAvbH9w5Nqht2Pn1Ey3b7Uy0\nNyqI+aBVFVSkYoyJCSSuAdDSYLsIwA2MsRjGWAaA/gA2tn68lSXEOtvvqF9+aMYLR6/7i6sl1kSu\n+RsOOZaDHUjWRJEeALByr7MZrEdirMSahBd/ulcuALABwADGWCFj7A4AzzLGdjLGdgC4BMBvAYBz\nvhvAJwD2AFgG4D7OuVIN2aOEi7jiUGsSmoVbCmVXwdJowpFmYi+ZW8b2kViT8BLpqwDn/EY3q9/y\nUv4pAE+FUqlw9vKNIzH0iW8BAAXl6kw1JtuqHGtepwkX4pkmaZaWpGYmymBQI3OAxOx2L67IlVgT\nQpwodxDxxjaBXq95JEVnaPIRQpR3ut76UwraJtB/uJFGchJCAqfCeBnbBPpV+5xtwN3iYyTWhBAn\nI840ib5UuNBtm0BfecaZ0CyZAn1IaKCYfnYerZRdBeLD/A0FsqsQMtsE+k0FJxzL790eNul3LOms\nMFHXp3ePl1cRBdQJk3nreQBCZwqhuViYnvFULbXRW1KXjnREH4ojJ5zDwnt1Cm6YOWm2cm+JY/nR\nmYN0227VGesHJ5nemj3asZyvQBoEWwZ6PalwoSZQf1/qzKUSGRFat77BKQmhVsfSPshyjortrWO/\n70Xbj+q2LTuKEKZzXL3f+mM8KNCHqEaBrleBWn+g3LEc1S60j9CoPvZOr1stNAuMDDE3yze/ucCx\nTAPQiIgCfRD6dHEeedl9piUxd3cwRqY5g9uqnBIvJYkvg4Szo/wy+43aPiw0sfxMSJhHKNAH5dZx\nzhwaNo/ziA5xZqNenZ1t/L96Z1Oo1SEaOwb6137McyxHtqORwiIK9EG4dbwz0FPWwNBk0kxfRCfr\n8pxNinpe2FYBBfogxEQ6p8E9UWOvCcf1ngkqknL6E50crnA23VD
POle2+JYZ2TNG7DVhBwU2bBIg\nxOpsEeir64wbyblg4xHfhRRCDVWEWI8tAr0YnV6/ZZS8eijgLF2TIMRybBHoxdA0Io0u/oWChtYT\nu5h0TlfZVdCNLQL9ydPOppsQx/fYXpUwwCc2il5Moq7rR6c5louraiXWJHS2+Kb+IAzEiYmI8FKS\n+CIOBx+b0cVLSRKI685P1X2bdh/MF6oIYdauBRafz8IWgf648GucGBflpSTxRTyyuXRgN122ObBH\nvC7bsZrS6jrHsl4DfNKSnAPQaIxHaLolOLtoNjZZ+7W0RaB/Z12B7CooY+nO447lWcLAMRI48UdT\nr4vcz/58uGO5WoH0usHQa2Ih8QCk0eJnR7YI9MQYek1Ifc/F/XTZjtW0E14/vToJpCTGOpb/++MB\nXbZpBWInAb0GS8VGOZt5rT73OgX6II1JT5JdBWX8bJgzAVVNnX2OQpftKnIsXzE8RZdt9hACfeHJ\nM7ps0wrEvP56dRKIEkZtWz0dOQX6ID04pb/sKiijndA+/bqNjkLFa0cJsfpcOxKPQncW2meawgOl\npxzLRuRPend9ge7bNBMF+iB1jouWXQUl7S+ull0F03ySXWjo9sXcL6rbmF/hWL5jUl+JNQlPFOiD\nJLbZ2TElrFGazvouQ0hrK/c5m27E5ivSjAJ9kJI6OI/odx+zzymy0Spq6nwXIoQEhAJ9kLrFO48a\n3qXum7rZcvik7CoQohzlA33hCePbKbceoeBECAlfygf6D7OcQ5evGdnLkH00WXwwBSHEvZvG9pZd\nBV0oH+jFGHz7xAx5FVHAydP2mk2LEL1G2cqmfKDfcNA5j2RcDCU0C8UOoV92TIiTghPj0HujHzEx\nXH2jdbuEKf+J2C60n7ez+jjmMHLhucmyq6CMhfdO0HV7HWMiHctmXKNSWXehq2Zdo3VHxyof6EVd\nOtIgp1CIibdUmpRBNr0yV7YQRxpbPY+6bJcN6u5YtvKlOJ+BnjH2NmOshDG2S1iXxBhbwRjL1f53\n1tYzxthLjLE8xtgOxlhYzdun1zBzdxptMNJn0bZjjuVrR+l7YXtIrwRdt2clGV076Lo9MY96XYP6\nn0sjdU9wHtFXnTFu7mmj+XNE/y6Ay1utmwtgJee8P4CV2m0AmA6gv/Y3B8Br+lSThIMsYZh5vM4/\nmsNSO+m6vXBXIhxp6/1azprgTB/9xdajum47HIm93sQjcL1d8Owqw7ZtNJ+BnnO+GkBFq9VXAZiv\nLc8HcLWw/j3e7CcAnRhj+qTlC3MWPqvz21GTsiHaYWak74Uh+3q7Y5Kzd5kduv4eEz6X4sQrxCnY\nNvrunPMiAND+t0w11AvAEaFcobZOSdOH9JBdBWXECxcQ9ZqEI5wVlBt3kTQm0tm77Mtt6h/R55U4\nM1dOMfCI3sr0vhjr7qqS228tY2wOYyybMZZdWlrqrkjYu3Wc8xTZBrHJUKlJcY7lJTuLvJRUg1np\nmG1wQI+fhC7UE6iTgFvBBvriliYZ7X/LeWghgDShXCqAY3CDcz6Pc57JOc9MTrZmV72MZOdFNG6L\nxhvjjExzttH/uN+aP/xEjtMWnxTEDMEG+kUAZmvLswF8JayfpfW+GQegsqWJR0U9hCvyP+RQcArF\nkF6JjmVx0mxCfKmutW5vGLP4071yAYANAAYwxgoZY3cAeAbAFMZYLoAp2m0AWArgIIA8AG8AuNeQ\nWocJcc5UsZ2QhGZNbpnsKhAL+XKb20YDIoj0VYBzfqOHuya7KcsB3Bdqpazon8tzcN8l58iuBiGE\ntGGbkbFirw5CCLET2wT6MRlJsqugjGnnURc2QqxE6UBfdsp5Ue/igd28lCSBMDKVBCFEf0oH+uW7\njzuWf5mZ5qUk8YULAwW6KpKjmxB/PHbFYMdypUXz3Sgd6KtrGx3L0ZSjOyQ7jzpz0cdGUl7/UDQI\nCfDemJVpyD6Gpyb6LkT8cm7
3jo5lq06+o3T0o+zz+jlQ6uw+GhdNgT4U4hwJg3sak7VTzLpIQtMt\n3vlafr3dml051Q70FOl182l2oWN59oR0Q/aR2N4ebf/i+Au9c9G36NXZfsm9fjUx3ZDtDugR71iu\nb7LmCHilA30t5eLWzfoDznwiRjWD2SUn/Z6iKseyUWdHd13Q17FcUWPN5oZAjezd2fB9WHUUrtKB\n/oUV+w3fR7JwYbKgrMbw/ansd1MGyK6CKXKLqx3LeueibxEZ4TxTqKixR0qJfsn6TuDizjvrCgzf\nhxGUDvRmeOe20Y7lExa9UBMu9J5pyc7EZrD3NxySWBNj1TY4E5qldo7zUtLeKNCHiHrz6CdCaK+2\n6imyP8pNaEoRc9J/IlxfUc2zy3Icy3a5xhMMilIhah9FPVD0khDrTFPx18V7JNbEWEt2mJvQ9UyD\numl8316XL7sKlkCBPkRpwoQZOcervZQkvoi9USgnPSH6oUCvozV5lF5XL8VV9riASIgZKNDraGN+\n6znUCSFEPgr0OqKZkQgh4cgWgd6M/rWEEHWlJFo7pYQtAv0zPx8muwqEEAt751ejfRcKY8oGenEg\nxSgThkYTQtQVHWHtUGnt2ntRJeSNjjAocZQdPf+L4bKroIzxfbvIrgLxkzVTmTkpG+iJMWKi6COj\nl5dvGim7CsRPSXHRjuV9x6u8lAxPyn5rG89a/Tc4fIizSw3oHu+lJAkn53Tr6LsQ8UvnDs5Af7yy\nVmJNgqNsoH/y692m7cvq7Xe+7Ch0zi7Vw+K9D2Q7Veec9YwbfCzywvXUzGaE4ioK9GFj+e5i0/bV\ns5PawU/MlcIMns3lj9PUTlUsJmszemKc83raZzrB124eZdq+6hqtN8+FsoHeTPUWfOMDsaPQOfVd\nhMHRKbmjM7+/iq/r1sPO17JrR2MnWRf7IGwQJo5RkZmZK634uaRAr4NBKWrPjPT3pfscy+0Nni9W\nvNh79OQZQ/clgzgxuJnmry+Qsl8jnal3nmmaMbtUi5o662UDpUCvgzsuyHAs1yqcEtYMg4UfzfJT\n6qWUqDwjJ8/+hoPqHdHX1Duvd9Sb+AP64nfGz1ynNwr0OmgnNGdQb5/Q9Bd69aj4Uj7+lXmdBESy\nfmCM9PX2Y47lWOr26xW9OjoQA73Rbdh2QmdHoTH6wrls81YfdCyLM2qRtijQ62BU706OZSteqAlX\nX249KrsKJIwVWbA/uywU6HUQKfSj311U6aUkCUSdpAuXhKiGAr3O9hyz3vDocLV0p7lzqxKiKgr0\nOludS9MJ6sXokaOE2IXygf5nw3uaur/VNKk1ISTMRIbyYMZYAYBqAE0AGjnnmYyxJAAfA0gHUADg\nes75idCqGUIdZe1YQbPH95FdBUJIEPQ4or+Ecz6Cc56p3Z4LYCXnvD+AldptaTLTadKRUJwVOrNf\nOaKXxJoQItdX902UXYWgGdF0cxWA+dryfABXG7APr8TgdMPo3mbvXinVtc7RhzGRyrf0mWak0CXX\nSLeOo7MwvQxPM+c9M0Ko31wO4FvG2GbG2BxtXXfOeREAaP+7hbiPgO046uziGE3BKSRMePmG9LJP\nNkSjPTpzkCn7oZz0BAixjR7ARM75McZYNwArGGP7fD5Co/0wzAGA3r31Peq+74Mtum7PzmikrzF6\ndmpvyn7GZCSZsh8S3kI63OWcH9P+lwD4AsAYAMWMsRQA0P6XeHjsPM55Juc8Mzk5OZRqtKFi1kNZ\nZEybliTM5qOqznHmPMfkeGcqZFXHePy/yweavk9usb6/QQd6xlgHxlh8yzKAqQB2AVgEYLZWbDaA\nr0KtJJHn2WU5pu/zqauHmL5PM4i5e6JMmpUsPtZ50l6qYDZQALhxTJrp+1yw8Yjp+wxFKJ+27gDW\nMsa2A9gIYAnnfBmAZwBMYYzlApii3VbenAv7yq6CIbLyK0zfp3gUqpKPNzmDQ0Q7c5rExKa3B
oXy\nMIk/mgmx5k060iKjawfT9xmKoNvoOecHAbSZlJJzXg5gciiVsqLJA7u5ZNMjwVMxPTHgOiWjWcQ8\nTI1n1Qn0mwqcByBnOUc7k0fM5Byvwvh+XUzdZyioS4pOIiPooqVexPbPMoWaG9ZKTo+xXqHpBMUm\n8kiTmsFEdmq6IYJ21DtFN+IB/bvrCmRVQ3dr8+QG+vc2HJK6fz3JPukrr6mXXIPAUKDXidjmKs5l\nSQKX3sXZ/rnveLXEmpBwtWqf2858prHamSYFep2IF2f2FKnZjc0sPRJjHctZ+eo0NxD9HK44LbsK\nlkKBXifxwpX/okrqx6+X03R2RNzonhDruxBxoEBvgPcVaguVrUnVLjgkJLuP0UxugaBAbwAZfc8J\nsZMdhRToA6F0oO+m6MAbQggJhHKBXszncaXJs0sRQkg4Ui7Qi92e+iZTilZCEtubnyJAVQ9M7i+7\nCkFRLtCLl+6uGJ4irR4qECdw+dLCs+vY3ZuzM30XIn6JNClHkd7UC/TC2GgZyY5UcuK0c/TfUJp0\nRDfDUs19LbvHU1dEvcwc5jx4bGyyTu4g9QK97AooRBzta/aBzGiF5/o1u8sop2+FbvoJzcGn6hq9\nlAwv6gV6i00IEM7KTjmP6JnJuXxioyIcy6q9p2YH+o4xoU4kF76mnddd2r6t9LFULtA3Nlno1Q9z\nJdW10vYttoWa/SNjtCmDzQ1OXTqq2824Q7S8H7HsQyek7TtQygX6/cXykmD94vxUafs2woo9xdL2\nnaBYT5Hjlc4fzWnn9ZBYE7WkJsVJ2/fRE9bJt6NcoH/u2/3S9n31yF7S9m2EL7celbbv2MgI34Us\npPJMg2P5nG7yuv1W1zb4LhTmxKa8KyX2rCuptk4GS+UCvUyqTYF34rS8oPDri9SamnH57uOOZfH6\ng9lUSKFdUy93GsEWr/5wQNq+A0WBXkdqtSTLpdpgN5nNYCIV8vuLZ0dNVroiKhEFeh2J1wzPUtZF\nIth5NDyScN3zv82yqxCygrIax3J7iWdHVkKBXlfOSN9IgV43p+ut01853NUo0HSzcq9zdqlOcdES\na2IdFOh1lCLMjLQ2r1RiTdSyeHuR7CqQMHKovMZ3IeKCAr2OOggDU45U0CxTenno8x2yq0DCSJnF\nJuYOBxToDTJv9UHZVSBESduPnJRdBctRNtD/cdoAqfs/S70BCFHS8LROsqsQMGUDfT/J3fOKKs1J\nH1DX2IQcg7vMpXZub+j2CbGSESZnH9WDsoF+Uv+usqtgise/3I1p/1qNkirjflismoObuFdV24Dv\n95nTr59zjm92FqG+0TopfVWkbKC3Wsa+xTuOYVcQfa2zDzVPRC4OIvHlvQ0FKKr0/2JxZISyHxNb\nevCjbbj93WwcO2l8h4Ef95dcMayUAAAPDUlEQVTing+24IUV8lKTyMA5x1tr81EaJmkSLP0Nbmw6\ni0e+2IkjFeGfXKi2oQkPfbYd5cJUh0dPnsEjX+xEY9NZ3P/hVlzx8lq3j80vq8ETi3a7HYTVThul\n5W+3/ZLqWjz+1W7c9vYmr+XEfCJPXzvUv40rYM+xKjy9dK9yqZFFBVr3xOraRqTPXYKhf16OEzX1\n2HbkJJ7/Nsfn47/ZWYQPsw47bv/n+1xkHSx3W7Zl8prjbg4sNhVU4OWVuX7Xe8HGw1i6M/y62jY0\nncXDC3e6HDzllpzCXxfvweinvpNYMydrHfa2svnQCXyQdRgfZB1GwTMz0RBmM768vTYf6/LK8NZt\no7Fwy1F8kl2IiHYMT187DADw0GfbsS6vHDOGOhMzpc9dAgDI+dvliNESe13y3A8AmufDXbyjCK/d\nPArvrC/AxvwKx+POco7ffLQV/ZI7usxr+Un2EXy2uRCf/Ho86hvPYsoLq5u3X1yN9LlLsP3xqUiM\ni8KpukYM+fNyJHWIxsUDkrFwizOh2ZCe4dMmWVJVi6teW
YcP7hyLvskdsS6vDDe/mYVnrxuG6zPT\nXMpuP3ISV72yDs/9Yjiu0zKLvrnmILLyK/DGLPfT6/38tfU409CEByb3R4eYSPzh0+3okRCLrPxy\n3DC6N6Ij2+HNtfl4/44xmPHvNYiPjcKlA5Pxx2kDXbaTV3IKs97Kwlf3T8IDC7Y61v/6Qvk5fA6W\nNgf6af9q/ixU1zVixktrHNeVFm45itiodrhmZC/cf6nrHKkHSk/hng+2AAB+yCnBvFmZjkSCBc/M\nbFP2tx9vb7P/7/YU47lvcxzpGHYdq8Ty3cXY8tgU/JBTggUbD+PTuydg8Y5juP/DrYhox9zm8I+O\nlHOcev3oNMzfcAhAc5K4oU98C6D5x+xnw3viw6zD+MtVQ9o87tpX12HW+HRH8sNb3szCzGEpuHFM\nb8PrzMLhyCUzM5NnZ2cH/Lj1eWW46c0sAMBfrzoPj32123Ff6w+dWVoCtSeTB3bD1SN74f+EL787\n79w2Guf1TMCYv68MuA6DUxLw2i2jsLeoCnf/b4tj/Xu3j8Gstze6fUz/bh2RW3LK7X3ij46ZWr+W\n4hf+2pG9EBMVgQUbnUeWax66BIfKT+OWt7Lw6d3j8eTXu7HraBUA4N6L++HGMb1xwbOrAADf/e4i\nXPbCjx73nfWnyThd3+T4kfXHjiemIiE2yutn4PVbRuHyIeZnXHx44U6X1ypYV4/oiS+3HXPcfnTm\nIPxtyV4AwOzxffC/rMOO9yg+JhLVQc7CNLJ3J2w97L0bZceYSOx6clpQ2w/FofIaXPTPH/wu/8dp\nA/DehgIUVzWfzSfERqKq1vm65D01PejmUcbYZs65z0mBLR3oV+8v9Ri4wjXQW5Gs13LC0ytxzKTe\nS3r5/vcX4dLnPf+AfHnfRIyQ0D1vwcbDeHjhTtP3a6QZQ3vg1ZvPN32/FTX1GPXXFbpuM9jvmL+B\n3tJt9OLk1UQ9/7tzrOwqBMxbkAeAgT3iTaqJq7ho9ZJ/PfeL4VL2m9TBevl1LB3oa+qsn6CJeEZ5\n4fQzMk29ydbbKTbFpJEMC/SMscsZYzmMsTzG2Fwj9rE5DOdsnGryfKAqU3F0cbSkrqq9u8ibcs8o\nFOf9Z8injjEWAeAVANMBDAZwI2NssN77+XxLod6bDFnHWEt3ZAorKn6P29HgMyKBUYcXYwDkcc4P\ncs7rAXwE4CqD9hVW7rpAfvc5VcicW5WEPxk9wazKqEDfC8AR4Xahtk55XSx4oSZcMTo3J0QXRgV6\nd99QlwZXxtgcxlg2Yyy7tFSdSTq6JcT6LkQIISYyKtAXAhCHKaYCOCYW4JzP45xncs4zk5OTg9pJ\nDwqqhBDik1GBfhOA/oyxDMZYNIAbACzSeyfnSuqTTAghVmJIoOecNwK4H8ByAHsBfMI53+39UYG7\nZEBwZwKEEGInhvUF5JwvBbDUqO0DnptuNj1ymZG7tZW+yR1kV4HoJDaqHWobwivxX7AS20dJ3X96\nlzgUlId/1twWlh4Z62nkZHJ8jLkVUdhvLztX6v43/mmy1P3r6eoRPaXuf/1cdV7LjY/IfS7/1yqr\nZyiGmTBjlaUD/bDURPzj56650jO6yj8CHZySENTj/nfHWPxmcuAfoNsnZgS1P1+uHN4TVwwzP9Oi\nSEYvph//eDG2Pz416MffNNaZdvbze8Zj2nndceekDGm5WVr4m6Nl1R8uRv7TMwyujdOHd47FG7My\nMXlgN78fEyG56+0AN9cHNz96Gb797YV4ZMYg3DnJ/+/k5/dM0LNqblk6e2WLyjMNGP5kc05oWZkW\nRXWNTXj+2/2Yt/ogLujfFWtyyxz3dY6LwonTztmg9vxlGgY/vhyAs+6n6xtx7GQt0rvE4cGPt2Hx\nDudkC2Myklzy0IuPW72/FKXVdfj9p805wEf17oQtQqrX3Kemo/8j37g89sO7xqJbfAwu0/LUA8DY\njCS88MsR6NUpPOaKP
VJxGo1nuUvK4A0PX4rjlbUYlJKAgY8tc6zf9MhlqGtswqR/NKcjnnZedwxK\nScC/vmue4KKl+WLOhX1x1wV98fQ3e11y7+/5yzTERTe3aFbXNr9PDy/c6XgPsh+9DLFREThRU489\nRVV4e20+svIrsOoPF+NQeQ0G9khAj8RYzHxpDQ6Vn5aSRtebqS/+iP3Fp7D98alopx3mnalvQsNZ\njpvf+AkF5addPodHKs448tYDwK3j+uC+S84BB8f4p7932fYVw1Lw1DVDcaa+Cd0TYjD1xdXILTmF\nTnFR+O53F2HGv9egpLoO2Y9ehvjYSNw5PxtrcstcvrNFlWfQIyEWAx5dhnptfolVf7jY5b3/z00j\nccUwuWdHALC3qAopibEY8ZcV6J4Qg6w/uTYZH6+sxbinV2JYaiJ2FDpnjxuUkoAv7p2A/cXVGNIz\nMaTR0rZIU2wFe4uqMP3fa7DmoUuQluTMN9KSzrjgmZkuy/54ccV+LN99HMsevNCv8i+tzMWi7cfw\n3e8u8rivBxZsxdGTZ7D50Al8fs8EnN8nPJNg+fNa1TU2YcCjyxAd0Q77n5quy3bf21CAl7/Ps+X1\nH0+vTaCf20As21WEu/+3BTeN7Y2nrh6CjIeXYkD3eCz/rX+f+XBz13vZ6BgTiRd/OULX7VKgD3Ov\nrMpDXHQEfjUxw9AvTGtvrc1HXWMT7r34HMP3ZYT3NhSg8nQD/s9LExfnHL/5aBtuGJOGCf38myT+\nzTUHcZZzzLmwn041Vcffl+7FoJR4XDMy1WW9mZ9b4h4FegvZW1SFnw6W41cGtbUTYoSdhZXYVngS\nt47rI7sqtuVvoKdUi2FgUEoCBgV5AZcQWYamJmKoCT1GSOgs3euGEEKIbxToCSFEcRToCSFEcRTo\nCSFEcRToCSFEcRToCSFEcRToCSFEcRToCSFEcWExMpYxVgrgUJAP7wqgzGcptdBztgd6zvYQynPu\nwzn3OQNTWAT6UDDGsv0ZAqwSes72QM/ZHsx4ztR0QwghiqNATwghilMh0M+TXQEJ6DnbAz1nezD8\nOVu+jZ4QQoh3KhzRE0II8cLSgZ4xdjljLIcxlscYmyu7PoFgjKUxxlYxxvYyxnYzxn6jrU9ijK1g\njOVq/ztr6xlj7CXtue5gjI0StjVbK5/LGJstrD+fMbZTe8xLjEmeUVnDGItgjG1ljC3WbmcwxrK0\n+n/MGIvW1sdot/O0+9OFbTysrc9hjE0T1ofdZ4Ix1okx9hljbJ/2fo9X/X1mjP1W+1zvYowtYIzF\nqvY+M8beZoyVMMZ2CesMf1897cMrzrkl/wBEADgAoC+AaADbAQyWXa8A6p8CYJS2HA9gP4DBAJ4F\nMFdbPxfAP7TlGQC+AcAAjAOQpa1PAnBQ+99ZW+6s3bcRwHjtMd8AmC77eWv1+h2ADwEs1m5/AuAG\nbfl1APdoy/cCeF1bvgHAx9ryYO39jgGQoX0OIsL1MwFgPoA7teVoAJ1Ufp8B9AKQD6C98P7eptr7\nDOBCAKMA7BLWGf6+etqH17rK/hKE8CKPB7BcuP0wgIdl1yuE5/MVgCkAcgCkaOtSAORoy/8FcKNQ\nPke7/0YA/xXW/1dblwJgn7DepZzE55kKYCWASwEs1j7EZQAiW7+vAJYDGK8tR2rlWOv3uqVcOH4m\nACRoQY+1Wq/s+4zmQH9EC16R2vs8TcX3GUA6XAO94e+rp314+7Ny003Lh6lFobbOcrRT1ZEAsgB0\n55wXAYD2v5tWzNPz9ba+0M162f4F4CEAZ7XbXQCc5Jw3arfFejqem3Z/pVY+0NdCpr4ASgG8ozVX\nvckY6wCF32fO+VEAzwE4DKAIze/bZqj9Prcw4331tA+PrBzo3bVDWq4LEWOsI4DPATzIOa/yVtTN\nOh7EemkYY1cAKOGcbxZXuynKfdxnmeeM5iPUUQBe45yPBFCD5tNtTyz/nLU246vQ3Nz
SE0AHANPd\nFFXpffZF6nO0cqAvBJAm3E4FcExSXYLCGItCc5D/gHO+UFtdzBhL0e5PAVCirff0fL2tT3WzXqaJ\nAK5kjBUA+AjNzTf/AtCJMdYyUb1YT8dz0+5PBFCBwF8LmQoBFHLOs7Tbn6E58Kv8Pl8GIJ9zXso5\nbwCwEMAEqP0+tzDjffW0D4+sHOg3AeivXcmPRvNFnEWS6+Q37Qr6WwD2cs5fEO5aBKDlyvtsNLfd\nt6yfpV29HwegUjttWw5gKmOss3YkNRXN7ZdFAKoZY+O0fc0StiUF5/xhznkq5zwdze/X95zzmwGs\nAnCdVqz1c255La7TynNt/Q1ab40MAP3RfOEq7D4TnPPjAI4wxgZoqyYD2AOF32c0N9mMY4zFaXVq\nec7Kvs8CM95XT/vwTOZFGx0uhMxAc2+VAwAekV2fAOs+Cc2nYjsAbNP+ZqC5bXIlgFztf5JWngF4\nRXuuOwFkCtu6HUCe9vcrYX0mgF3aY/6DVhcEJT//i+HsddMXzV/gPACfAojR1sdqt/O0+/sKj39E\ne145EHqZhONnAsAIANnae/0lmntXKP0+A3gSwD6tXu+jueeMUu8zgAVovgbRgOYj8DvMeF897cPb\nH42MJYQQxVm56YYQQogfKNATQojiKNATQojiKNATQojiKNATQojiKNATQojiKNATQojiKNATQoji\n/j96nzZ+iupJRQAAAABJRU5ErkJggg==\n",
-      "text/plain": [
-       "<matplotlib.figure.Figure at 0x7fe8b92a1b00>"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    }
-   ],
-   "source": [
-    "# This allows us to plot right here in the notebook\n",
-    "%matplotlib inline\n",
-    "\n",
-    "# Hacking up a noisy pulse train\n",
-    "from matplotlib import pyplot as plt\n",
-    "\n",
-    "n = 100000\n",
-    "noise = np.random.normal(size=n) * 3\n",
-    "pulses = np.maximum(np.sin(np.arange(n) / (n / 23)) - 0.3, 0.0)\n",
-    "waveform = ((pulses * 300) + noise).astype(np.int16)\n",
-    "plt.plot(waveform)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Now decorate this `zero_suppress` function to run as a vectorized ufunc on the CUDA device. Check out [the solution](../edit/solutions/zero_suppress_solution.py) if you get stuck."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 53,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "@vectorize(['int16(int16, int16)'], target='cuda')\n",
-    "def zero_suppress(waveform_value, threshold):\n",
-    "    if waveform_value < threshold:\n",
-    "        result = 0\n",
-    "    else:\n",
-    "        result = waveform_value\n",
-    "    return result"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 54,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "[<matplotlib.lines.Line2D at 0x7fe8b80f5c50>]"
-      ]
-     },
-     "execution_count": 54,
-     "metadata": {},
-     "output_type": "execute_result"
-    },
-    {
-     "data": {
-      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAD8CAYAAAB5Pm/hAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAIABJREFUeJzt3Xl8VdW5N/DfQ0aSAEkgIPMYQVRk\niAgiToizRa/2VdqqrVjux+FabXu9ent9b6fXWqf2tbUqVutQr9qqVcQBEahWrGAYRIaEhDkkJIEQ\nhgQywLp/ZOfsleTkTHtYZ639fD+ffLLPPjtnP2d6svfaaz2LhBBgjDFmrh6qA2CMMeYtTvSMMWY4\nTvSMMWY4TvSMMWY4TvSMMWY4TvSMMWY4TvSMMWY4TvSMMWY4TvSMMWa4VNUBAEC/fv3EiBEjVIfB\nGGNaWb169T4hREG07ZIi0Y8YMQLFxcWqw2CMMa0Q0c5YtuOmG8YYMxwnesYYMxwnesYYMxwnesYY\nMxwnesYYMxwnesYYMxwnesYYMxwn+k5qDh3DRxv3qg5Dey3HT+AXizbh0cWlqDjQqDocI3xWtg87\n9jWoDkN7h4614Mrf/QNn/2op/lFWqzocXyTFgKlk8acV2/GzdzcBAH40+2R8/9xRyExLURyVnhZ8\nug3PfbYdAPD75eXY8dAViiPSV0NTKx54ZwPeWrMHAPi1dKC+sRkTf74kdPvG51YF4vXkI3rLnvqj\noSQPAI8t2YLfLStTGJHeinfUdbj96OJSRZHo79GPSkNJnjkjJ/kg4URvmf9S1xIMTy7fqiASMywv\n7XhK/Pvl5Yoi0duxluP404odHdbNeGiZmmCYtjjRWzZWHlIdgjGaW0+EXV/f2OxzJPr7e2lNl3V7\n6o92OWNiLBJO9Mx1jy0J30zzw7985XMk5rru6X+qDkE73f1zXLe73udI/MeJHsCBhu6PNI+1HPcx\nEjM888m2sOuXldTgky3B6OXglo83dz2iZ4n58xfhCz1e/eQKnyPxHyd6AHOf/aLb+8Y98KGPkZjv\n3a8qVYeglTdWV3R739FmPgiJx9vruv/stR4P39xoCk70AEr2HlYdgjG+2LY/4v2REhfrqKk1ciK/\n5g/mH4m6pfrQsYj3N3OiZyx2n2+NnOhZ7Mb+V+SzST5AiV3t4aaI99/25zU+RaIGJ/oYmH5a56Yn\nlvLYA5Z8fvzXyB0BTL92xIk+Bve+sV51CFo4dKxFdQiMhRXL2Y8QwodI1Ah8oo/lzX1rLY9KjMWD\n721WHULgmJyc/FZec0R1CJ4JfKJ/OMah+fyFiu4AD4jy3artPHDKLa0nzP2OR030RDSUiJYT0WYi\n2khEP7DW5xPREiIqs37nWeuJiJ4gonIiWk9Ek71+Ek489ffYyhw0dTPak9kWb6yOabuGplaPI9Hf\nVzEO4rl+Qfddg1mbyvqjMW23ZtcBjyNRJ5Yj+lYAPxJCnAJgGoA7iGg8gPsALBVCFAJYat0GgMsA\nFFo/8wE85XrUTGsle7ncRDRzAjCIxy+L1sc2duMnf9vgcSTqRE30QogqIcQaa/kwgM0ABgOYA+BF\na7MXAVxtLc8B8JJo8wWAXCIa6HrkPvtwA9eodwu3gjE//eqDEtUhKBdXGz0RjQAwCcBKAAOEEFVA\n2z8DAP2tzQYD2C39WYW1ThtbH7y8y7pfvrcpzJYsmlH9srus4zotiXn2piLVIWiJDyziSPRElAPg\nTQB3CyEinXtTmHVdXmoimk9ExURUXFurpg/rZ2X7uqz7/L4LkdKj61PYd4QvNEby3vqqLutSexDe\nvnNG2O1PGHzhyyuzxw8Iu77qYGxt0Mx2/tgC/Ob6M1SH4ZuYEj0RpaEtyb8ihHjLWl3d3iRj/W6v\nvlQBYKj050MAdGkkE0IsEEIUCSGKC
goKEo3fkeVhSsAOyu2pIBL9rQ1zIevfLixE78w0nDOmn4KI\nzPLW7WcDAC499aQu9x1o4PEL8UhLITxz4xRMGprX5b79RyKPoNVVLL1uCMBzADYLIR6X7loI4GZr\n+WYA70jrb7J630wDcLC9iSfZtE91x5zbF+YLMuuUtta8Z26c4nc4xpk8rC0p/fraCV3u4zr/3Tse\n5syx9BeXISM1BSPCNCtuqTazL30sR/QzANwI4EIiWmf9XA7gIQCziagMwGzrNgC8D2AbgHIAzwK4\n3f2wvXHlBPua8S0zRiqMRD/hKgOeNrgPACA7IxW/vvZ0v0PS1pMRZuPqk5WGH198cod1XOe/e+HG\nv/SQmmb7Zqd3uC9SJVudxdLr5jMhBAkhJgghJlo/7wsh9gshZgkhCq3fddb2QghxhxBitBDidCFE\n1zn6ktTogpzQcnpq15eGB00l7vozh3W4vX7PQUWRJL9Hogzi+8604R1u741SmTHI9keYawIAFgTk\nAnfgR8bK5s20j+JvO390l/srDvBFr3BaEij6VlLFfeljNSSv43WjcM0RLLyzHlza4fa8czqeqU8e\nlutnOMpworf0zU5H78y00O0+PdPw5m3TO2zz9Cc8WXg44f4BvnTL1Ih/c99bX3sVjnH+dnvHnkt9\nczIURaK/B64c3+F22yVI8wU20Xduhvl2p9NhAF2uyr+ycpenMZnk1EG9VYdgjIJenNiZM4FN9J2L\nmd0yY0SXbXqE6U/Pugp37YKPOpkuXv3+NNUheC6wib5zMbPcrPRutmTRlHUq78pHoO4Z3jcr7Pon\nv5XUtQK1MnVkvuoQPBfYRM/c07kYVOcua+0mDOnjRzha63x2JHf5lV3RzXrWvaz0lLDrO4+EDzeQ\nUnec6JljnQdLfbNoaNjtemWmdrjNZRC6WryxY/G8y0+PLaFHm5Q9iDZ06sL7wvcidxBod8cr5s0f\ny4k+inAFzlhknbuwtZtzRsfadi0nuMZ/Z5sqO3Y7PXVQbGdB+7kWUxfff6njEJ5JMXalbGw+7kU4\nSnGiB3BrN4kJ6FqhbWutmUOk/fCNiYM63N57kAf6dPbEsu5HxUayjT+XXVR1+nylpQQ33QX3mUvm\nnzeq2/s6d7M9auB/e79kpnVsIz3vkb+rCcRAjy3ZojoElsQCmeg7X/Dq3yuz2207D6gweQJhxpiZ\nApnojziYs/TeN9e7GIl53rkjfP35dmeO6FoaljHVPvn381WH4KlAJvrt+xoS/ttmniQ8ojOGRr7g\ndfZork3fnXg/l6t+MsujSMyz6j8jv1bD+3YsWXzwqFk1/gOZ6Fvj7NY3LD/8oBUWv2+dNSz6RgH1\n5Y66DrfHndQr4vaRmhxZR72kOlaxePHzHd4EokggE31LnEflC7uZDo8BFQca49p+QG9OTrF6/V+n\nR9+IxSTc9KCRLC0xa9BUIBP99Qvim1ygT8/4jgaC5KEPSlSHYKx4P3fVXJe+W+Hml4jkq931HkWi\nRiATfbyCUso0EWt3mfWF0Jlp7crMPZzomSN76u1a9BeO6x/333MZBNs6h0eRJo7oTBTPBtdR4BP9\nv18yNu6/4Rl+wnvwmvjnhV2yudqDSPT0Pw7nO5j3wpcuRaK/t9ftUR1CUgl8oo+1/oVsySZOTuH0\nT6A88YJPt3kQSTBFmx81SO55Pf4J0wf2MbejQOAT/Vkj+8b9N08sLfMgEv3FOlFLP2lSktU7D3gV\njtZGF2RH3wjAOWN4XEI0sTYpvnLrWR5Hok7gEv2xlo7tmPF2uwKATTyxtSPv/ht3V41mSF5sYzee\nvnGKx5Ho76bpXacJDWdUQY7HkagTuETPp7fq9Y5z8EoQ/fdV46NvBCAnIzX6RgF33skFqkNQLnCJ\nnq/Gq5fNyamLzr2P+mbzdIxuSbR7dNXBo9E30kQAE73qCMzBF6Xdc7zTB7NPFp/1JMpJLStZZb05\nA
9ACl+jvem1tQn93yakDXI5EfyUuXavggT5dp2NkiXNrhPDzK7a78jjJIHCJXh7J+cTcSTH/3R++\nzRe9OttVF1+dm+40tfBAn8PH7NLZCfQPCOGJcYAeLo1kf299lSuPkwwCl+hl3zhjUPSNLIn0zjHd\nTinR33tp/APP2vH4M6DluF1oz8nL8dGmvdE3MpyTr+r/vTK2i+C6CXSid6K+kXvvrNpul9W97bzR\nCT/OCb5wgnfWVYaW505NvJTzIW4GQ8vxxD9PZ43KdzGS5MGJPkGcmzqKt2fD0PyeoWWnNV5MsFWa\novIkB6Wcaw5zW7989j0rgfpLJuJEnyAuaOnMR3efF1q+/ZU1CiNJDnL98/PHxtfv+4rTB4aWf7es\n3LWYdHVAOtueN3NkXH87fmBvt8NJCoFK9E6Lkb08b6pLkbDMtEB99OKSkZoS1/aFA8wd0ZmIH76+\nLrQ8fVR8JU5MLUkeqG+b0/leZxbaR1rPfWZO1yuWXMZGmUKwszsuGONRJHpqkHoemZq44xWoRO/m\ne86nyM7wF9A9aSmB+hqzBATqE+JW/1rGGNNJoBI9Y4wFUdRET0TPE1ENEW2Q1v2UiPYQ0Trr53Lp\nvvuJqJyISonoEq8CT0T9Ue777hYuDseCYEv1YdUhuCKWI/oXAFwaZv1vhBATrZ/3AYCIxgO4AcCp\n1t/8gYji60Lgoe21drGj/7h0nMJI9PcRFzRjhrprVmFo+UhTa4Qt9RE10QshPgVQF207yxwArwkh\nmoQQ2wGUA0iaPonXL/gitHzuyTwzjxP/+vLq0PJ/XXGKwkj0V3PYeRGuUweZ2f9bhdyeduXQY4bU\nDnLSRn8nEa23mnbyrHWDAeyWtqmw1iWdU05y/sVoPe6su6YphveNbdq7SJx2fdXZ22udT2R9p9TF\nkkcaO5MhjfF46pOtCiNxT6KJ/ikAowFMBFAF4DFrfbhuLWEbc4loPhEVE1FxbW1tgmEkLtb5TSPh\n2araXHRKYsPMLzvtpNDyjv3u1BDXkRt1zy+Qhvpf/eQKx4+nKzfa1Oeeadca+kfZPsePlwwSSvRC\niGohxHEhxAkAz8JunqkAMFTadAiAys5/bz3GAiFEkRCiqKBAz6m+UrmiJYDE+8SfPdoetRjka7vy\nhe3NPw93OSyWx3ArGr09udwe35LoFIJuHAQmm4QSPRENlG5eA6C9R85CADcQUQYRjQRQCGCVsxCT\nV2oP7p3qxHem2ZM2B3nijepD9nPvmZ5Y34XOM1QFVUmVfUQ/Ls4RxiaLpXvlqwD+CWAsEVUQ0TwA\nDxPR10S0HsAFAO4BACHERgB/AbAJwIcA7hBCmHE1w3KudJSwsy64zQ1ukM8ENrs0W5WOPtzovIY8\nl3puUyo13YznC9QhsfS6mSuEGCiESBNCDBFCPCeEuFEIcboQYoIQ4htCiCpp+/8nhBgthBgrhPjA\n2/D9d9UE+2RmzwFzJg9W7ZfvbVYdgtbSuQxCF3MmJmU/ECX40xGnbxbZlyDeXFOhMBLGbJlpSTNc\nhSUhTvQOfLy5JvpGjDGmWGAS/eFjPMUaYyx+dQZ0ow5Mon/soy2qQ2CMMSUCk+jLpTk5ebi4Mycc\nztTFbEcNGWJvsr0HnQ9oUy0wib76kP1mDeyT+OTLDDgsFXrKSA3MR8gTGysPqg6BRWFCp4vAfEvL\npCP6p74zRWEkBpAO6N+7a6a6OAwgnxtNGZ7X7Xbx4rMuZ34gVbBsMKCCZWASvYynXnNGHpSSncHd\n+pz4a7FdA/CWGSNde9y6Rv0vIKp090V2ot9Tr/94Gc54DjU26//fPl4Pvm8PbkpxWBdkwpA+TsPR\n2gcb7FGxw/tmufa4f1vjvCJmkMmjtk0obMaJ3qHGAF5Mk8vgpjms9zNxaK7TcLR2+Jh9oHDaYGf/\n9D7+4bmh5c+36p+cmHs40SdAPvIKeltodkaqo78/Y4id6D8v5+T
kxJj+dhGvigCW56g40BhavuqM\nQQojST6c6BPwzSlDQssBz/NId9jrRu4B9a0/rnQaDrPInQ+C4tlPt4WWuYR4R5zoE3Dj9BGhZS4P\n68ykYe71NGHBtrHSroB6hzTjFuNEn5A+8pySLcFqo3d7yr9E668z1lnxzgOh5TH9cxRGknwCkei9\nnNv1g6+rom9kkJK9wa0bz5iuApHovexT/GjAauhwSxVj+glEopeHH84/d5S6OAzAMxkxpp9AJHo5\nNX337BGqwjACp3nG9BOMRC9lJ+525YzgI3oWEL+9fmJoWffPfSAS/aoddaHlHpzoHVm7yx4VO3VE\nvsJIzFLIvUSSTmqKnSveWVepMBLnApHo5RFzvTKdjeQMuq219kCcIXk9XXlMTnJAkQf/NFs87G0W\nBD2kejc79jcojMS5QCT6F1bsCC1npHK/bSfW7bbrp8+b6U61RfkLFSSle+0qoG41Dcw7x35Pgvmq\nuud0qfbQsRa9/2kGItHXHG5SHYIxNlfZ/ehPHeRO5cmA5nm0nrCTh1u9ma4/c2hoeWut3kehiXJr\nsFRuVlr0jTQRiETPktvL885SHYISDU32qGq3inBlSmesv1tW5spj6uDwsZbQsluznuVIBft0v7TH\niT5B10warDoEY/TLSQ8tb98XnKPQJZvsWvQzRvdz5TGH5tvXTTTvKBKXJZuqQ8tuXYeTa9LvqmuM\nsGXy40SfIDdnAwo6+Qv15y92KozEX3LTilu9weTX8r0AleeoPmQ3z557coHrj79ovd6vJSf6BMld\nr5h7dmt+5BSPZSU1qkMwxgcb7EQ898xhCiNJTpzoEyRfQJR7TzBnuMQCS8T6Crs3WF52eoQtg4kT\nfYL65WSElstqONG7pb6xJfpGjLG4cKJPkJzoX/o8OO3KXpNrijPG3GF8oq8+dMzzfWysPBh9I8YY\nU8T4RC/34vBqwuCG5mDNMsVYUFwxYaDqEFxhfKI/0tQaWp4rjRpk8fNypi7GktGYAjPqMBmf6CsO\nHA0t9++dqTAS/X2+db/qEBjzlVyD6MQJfXuEGZ/o5RFzug9jTiYXjuuvOgRjvPC9M119vDypRotc\nGoDFL10qp9DQ3Bphy+RmfKKXZaVziWK3DMrlsyO35GW52+87pYf9td5SfSTCliyaC6QDGo0P6IOV\n6E/q411y0vm0LlZf77F7F82fOVphJGZxu0piivSt1n1mJNXkCq06v5ZREz0RPU9ENUS0QVqXT0RL\niKjM+p1nrScieoKIyoloPRFN9jL4ZBKEEZ0rt9szdQ3rm+XqY8+dGqxh68da7J5aw/tmu/rYp0nJ\n6csdwRqX4GVp4d9+rG810FiO6F8AcGmndfcBWCqEKASw1LoNAJcBKLR+5gN4yp0wWTL4dEutZ48d\ntOsnxR4m4LtmFYaWy6rNH7Vd39gcWr7sNO+6Q77w+Q7PHttrURO9EOJTAHWdVs8B8KK1/CKAq6X1\nL4k2XwDIJSIzOqJGYf7xvLfk1y8I3Th31nlXjvmMobmh5aMt5o/xOHzMvkg6usDdsyNTJNpGP0AI\nUQUA1u/2KxaDAeyWtquw1nVBRPOJqJiIimtrvTtS9NJ/XzVedQjGGJZvNwUdD0Az2Ftr9viynw82\n7I2+keYWb7Sf4+WnB+K4Mm5uX4wNdwIe9lsrhFgghCgSQhQVFLhfP9oP407qHVoOQG7yVLY0m8+v\n3i9RGIk/VnNNH9dU1ttlTgblujNhvWkSTfTV7U0y1u/2wtoVAOThp0MAVCYeXnKbNEw6ReYyCI7M\nmWiXpzh0lPt+s9g1H+fvXjSJJvqFAG62lm8G8I60/iar9800AAfbm3hMlJlmz8/58ebqCFuyaHpn\n2r0l1u2uVxgJ000ZjxWIKpbula8C+CeAsURUQUTzADwEYDYRlQGYbd0GgPcBbANQDuBZALd7EnUS\n4uTknm0BmjeWOSd3+2XhRR0
qKoSY281ds8JsKwDc4TQoHb38xU784urTVIfBGGNdBGZkbFZ6SvSN\nGGPMQEYnenn04eRheQojMYtXdf0ZY94wOtHLPWFmncLVFt2SmWr0x4Yx4xj9jZXrpwetloqX8nPc\nrbbIWDL7j0vHhZZ1LftsdKI/INXAkLtCsvhtrjoUWs5K43LPTshVEB++doIn++BSAO45bbA9MLKh\nSc8++0Yn+h4UsEpZHtp/xP6nKbiyjyNyV9wZhf082cdoQ6bASwbyGI+FX/lTusJthid61RGY44+f\nbQstf/fsEZ7sw8sSs8lELmnt1Wc0LYDXUWZ5NOvZaYPtss9H+Ig++dQcblIdgjH+XmoXnst1eUak\ndr0yg9EkVNdgt/OmpXjzFbznopNDy8cCUMESAGZ6dHaUIv03bmrV87U0OtE/vmSLr/s72KjnhZpk\n8eOLx6oOwRdyXf9+ORme7KO39E9zT/1RT/aRbM4cme/5Pp75ZFv0jZKQ0YneD6/NnxZa3raPa244\nMWOMN0dkQZQhdT5YHZBZpnIygnFGmAhO9A4FpV3ZD6kBuajiRx+BPj3tz+XzK7Z7v0NFlkrFBN2e\nktEknOgd8qqNNYjkLrBvrq5QGIm3Xv9yd/SNXFRq8HSC97y+TnUIWuAs5VBBL7uNlTsdOiMn+n+U\n6TnrWCyaWv2dKtHkSXEOSdMIsu5xondI7mP79lo9+9gmo7fXGTtfDWO+40TvolruzskYS0Kc6F0U\nhImYGWP6CUSinzKcSxQzxoLL2ER/4oR9Bap/L28GpTDGgmHu1KGqQ3DE2ETffNzu2fCDiwoVRsIY\n092tM0epDsERYxO9PChlDFfyY4w5oPtQPmMTvdx3OJUHNbnm/98wUXUIxvCq2iJzn+5DEYzNgFUH\nj6kOwUgpASlT4IdfX+fNpCPMfflSxdaSvYcibJmcjE30X0mTOzD3jOnPzWC6GJzbU3UIxsjLthP9\nXg0PIo1N9D9ftEl1CMbYWHkwtMzJw5nGZnvIvtelCR7/P2d4u4OA0nFgpLGJvq6hOfpGLhl3Ui/f\n9qXCEameCHlcevHWc0Z6+viqNftY52bSsOCMH3nEx2awoxpO5GJsovfTYcMLK8n/NL1uoZebhoSB\n1bhWbq8LLednezNTV7u0FPvd0vEoNB5ezXoWTstx/T6XnOhdkJFm9st42ytrQstZ6SkRtnROLvts\n4lSQ8hG9nxe2f/ruRt/25ZdWaazM9NF9fduvn2dlbjE7Q/nkgSvHh5blD5+JvG66GdA7M7RccaDR\n032pcPyEmqPBXfvNey0PSFN3+nn2t6ykOvpGSYYTvQt6SnXUdWy/SyYzxthHZopyoqfuVjRRxtd7\nDkbfSDNbFE2o8qWGUzNyondBD+kol/uZOyOfMZwwMdP7yOuzL9X+WmzP1NUrk6f0jIQTvQsG9rGb\nG3oY/uXy0+pd+h05Mf/w5DSx40TvgqH5WaHlHfsbFEZilpIqc+c6ZcxPnOhd5mf/fdPJg4sYY4nj\nRO+yReurVIdgjI8316gOgTEjGJ/o5R4xflizk9uVGWPJxfhEL18o9UPJXm5XZowlF0eJnoh2ENHX\nRLSOiIqtdflEtISIyqzfSgtufGfacJW7N8qT35qsOgTGlMnL0rcLpxtH9BcIISYKIYqs2/cBWCqE\nKASw1LqtTD+eL9aRFmmk70k+nx0xlkwW332u6hAS5kXTzRwAL1rLLwK42oN9RCT3fLlqwkC/d2+U\nxiZ7pG+vzFSFkZhlZmE/X/ZjejVQP/Xvre+BjtNELwB8RESriWi+tW6AEKIKAKzfvs+X9uqqXaFl\n00cHeo2kT8jJA8wux+ynH1081pf9jCzI9mU/LLk5PUSbIYSoJKL+AJYQUUmsf2j9Y5gPAMOGDXMY\nRkePLC519fGCLIX/UXpiQG9/mhQnB6gmPeueoyN6IUSl9bsGwN8ATAVQTUQDAcD6HbYztBBig
RCi\nSAhRVFBQ4CQM5qHymiO+7zPb41LIySDPp/rpfXPs/ZQa2iPsR7NPVh1C0ks40RNRNhH1al8GcDGA\nDQAWArjZ2uxmAO84DZKp85uPt/i+zwf/5XTf9+kHuYR1qk/F7/r0tHuK7D2k31ynsbhhqrstArF4\ne+0e3/fphJMj+gEAPiOirwCsAvCeEOJDAA8BmE1EZQBmW7eNd/XEQapD8MQ/t+73fZ+mzkv7lpQc\nUlP8GcIiF9kzaa4EubJpX49n6gqnj2ZdLRNuoxdCbAPQZfZhIcR+ALOcBKWja6cMMbKaXpOC2XRM\nLU5c3+h/HST5zKHFoEQvT8mo4vOi2zSXxo+M9YvfpRZMJn+HmlrNmcilst7/phO511nVQXOabuR/\nWiqmgHj4Q706fHCid0kPnnDENSekTP/hhr0KI3HXC5/vULr/n727Sen+3SQfT6voQt2q2aQ4nOhd\nIndDNOkUWYWcDLtFcaGBzWHMuZ2K531Q0RvNCU70LsmVLs7sOXBUYST6O21wn9BypUHNDcw9n27Z\npzoErXCid8nwvvYIxD31nOjdomoCaJbcMtM4dcWDXy0PfLTRnHZl1XTr3cD8kRWAQXVu4kTvAZ7U\n2j2aXfNiPvlLcYXqELTCid4DG/YcUh0CY4yFGJfo5RFz54zxpxQsY4wlM+MSvTxirnBAjsJIGGOm\nKRquZzVQ4xK9kIZSnDkiX2EkZjljaK7qEFiCzj2Zq8O6ZYamrQTGJXp54NIlp56kMBL9yQO/Ftw4\nRWEkZkn3qaBZuweuOMXX/ZlMHsynE+MSffXhptByCpclcKS+sSW03C+H5951yykD/Z2pKyOVuyK6\n5bLT7YPHZgUF/xJlXKLfWHlQdQjGkJvB/P6fedZIc5vd/K6TIoytB+q/IXlZoeXG5laFkcTHuETf\ng6e+c03NIfvsyO/CUemp9kfTtEFTx31O9Lo2N8TirlmFyvat08fSwESvOgJz1B5pir6RR9J8bsf2\n03lj/b042tfgZrf0FHVf+LW79RkYady3qfqQuuR03ZQhyvbthRVl6gpHZUtHoTodOXWnWprG76oJ\nZs5GpoLKQrG76/SpaWVcon9jtbqh0RedMkDZvr3wxXb/pxFsl5Fq1kfz4FH7wvawvlkRtvSWTu3K\nsZiq8FrOkSZ9Xkuzvk2KjSrIjr6RRlSWcrh2sllnR4vWV4WWe2eqm2/UhLOjw8fsf5pD8tTNL/zI\nYn1mmeJE7yK+POAe+UjNgNyETZXJUf9opcKzNLe0HDfhE+EvTvQukjumnOCyi46YNgbi483VqkMA\nAPx0of7TCcq9lkz7nHiFE72r7A9dM08n6JoGw9qVVdpV16g6BMd2SNMIDspV13SjE070LsrOsEcg\nVvIsU65Zta0u+kYsMHRqG09FlsTEAAAHuklEQVQWnOhdNLCPfXSxpVqvyYOT2R8/26Y6BJZEVm3n\nf/zxMjbR+104qrMV5Tx5sVs28kQujDlibKK/deZIpfs3aVJr1f80D2vUX5mxZGRsoldd8sakmjt3\nXjhGdQiMhajsOw8A10warHT/iTA30Svu1W5Qnuf6QSypqK7do2OXTnMTvX7vRdLyu3IlY5Go/jSq\n3n8izE30qvevOgAX6XgEw8xl0nfLL8YmetWfBtVNR24y55kw5pyO/2jMTfSK6fhh6I5Jz4XpT/XH\nUceDOGMTvX5vRfLS8YPNzMXXjOJnbqJX/Fkw6cNo0FNhzDEdvw/mJno+CnWNSf+0GHNKx6+DuYle\nwzcjWXGnG8Zk+n0hPEv0RHQpEZUSUTkR3efVfpKVScnRoKfCmGM6HkR6kuiJKAXAkwAuAzAewFwi\nGu/FvrqNwc+dJeH+3cRNN4zpzasj+qkAyoUQ24QQzQBeAzDHo32FxbmJMcbapHr0uIMB7JZuVwA4\ny+2dfLKlFr9cFH5qtJQeai8/LC+txezHP1Eag1tSU9T/1
zTltUwGur+WZTVq53rITLUnGHLjtbz+\nzKG4deYox48TiVeJPlxm6DCJKhHNBzAfAIYNG5bQTnIyUlE4IKfDuvYPwfdmjEjoMZ0iAoQAzh9b\ngKz0lOh/kMT2NzSjrqEZ35wyVMn+//PycXjw/RLMGNMXfXqmKYnBLblZafhyxwG8d9c5Sva/4r4L\nMeOhZQDQ5Tujm9EFOfhw4178y2Q1VSTvvXQsnl+xHYA7r2U/H4q0kRDuT2JNRNMB/FQIcYl1+34A\nEEL8Ktz2RUVFori42PU4GGPMZES0WghRFG07r9o3vgRQSEQjiSgdwA0AFnq0L8YYYxF40nQjhGgl\nojsBLAaQAuB5IcRGL/bFGGMsMq/a6CGEeB/A+149PmOMsdgYOzKWMcZYG070jDFmOE70jDFmOE70\njDFmOE70jDFmOE8GTMUdBFEtgJ0J/nk/APtcDEcH/JyDgZ9zMDh5zsOFEAXRNkqKRO8EERXHMjLM\nJPycg4GfczD48Zy56YYxxgzHiZ4xxgxnQqJfoDoABfg5BwM/52Dw/Dlr30bPGGMsMhOO6BljjEWg\ndaLXeQJyIhpKRMuJaDMRbSSiH1jr84loCRGVWb/zrPVERE9Yz3U9EU2WHutma/syIrpZWj+FiL62\n/uYJSpLJX4kohYjWEtEi6/ZIIlppxf+6VdoaRJRh3S637h8hPcb91vpSIrpEWp90nwkiyiWiN4io\nxHq/p5v+PhPRPdbnegMRvUpEmaa9z0T0PBHVENEGaZ3n72t3+4hICKHlD9rKH28FMApAOoCvAIxX\nHVcc8Q8EMNla7gVgC9omUn8YwH3W+vsA/NpavhzAB2ibvWsagJXW+nwA26zfedZynnXfKgDTrb/5\nAMBlqp+3FdcPAfwPgEXW7b8AuMFafhrAbdby7QCetpZvAPC6tTzeer8zAIy0PgcpyfqZAPAigFut\n5XQAuSa/z2ibSnQ7gJ7S+/td095nAOcCmAxgg7TO8/e1u31EjFX1l8DBizwdwGLp9v0A7lcdl4Pn\n8w6A2QBKAQy01g0EUGotPwNgrrR9qXX/XADPSOufsdYNBFAire+wncLnOQTAUgAXAlhkfYj3AUjt\n/L6ibT6D6dZyqrUddX6v27dLxs8EgN5W0qNO6419n2HPGZ1vvW+LAFxi4vsMYAQ6JnrP39fu9hHp\nR+emm3ATkKuZRNIh61R1EoCVAAYIIaoAwPrd39qsu+cbaX1FmPWq/RbAvQBOWLf7AqgXQrRat+U4\nQ8/Nuv+gtX28r4VKowDUAviT1Vz1RyLKhsHvsxBiD4BHAewCUIW29201zH6f2/nxvna3j27pnOij\nTkCuAyLKAfAmgLuFEIcibRpmnUhgvTJEdCWAGiHEanl1mE1FlPu0ec5oO0KdDOApIcQkAA1oO93u\njvbP2WoznoO25pZBALIBXBZmU5Pe52iUPkedE30FgKHS7SEAKhXFkhAiSkNbkn9FCPGWtbqaiAZa\n9w8EUGOt7+75Rlo/JMx6lWYA+AYR7QDwGtqab34LIJeI2mc7k+MMPTfr/j4A6hD/a6FSBYAKIcRK\n6/YbaEv8Jr/PFwHYLoSoFUK0AHgLwNkw+31u58f72t0+uqVzotd6AnLrCvpzADYLIR6X7loIoP3K\n+81oa7tvX3+TdfV+GoCD1mnbYgAXE1GedSR1MdraL6sAHCaiada+bpIeSwkhxP1CiCFCiBFoe7+W\nCSG+DWA5gOuszTo/5/bX4jpre2Gtv8HqrTESQCHaLlwl3WdCCLEXwG4iGmutmgVgEwx+n9HWZDON\niLKsmNqfs7Hvs8SP97W7fXRP5UUbFy6EXI623ipbAfxEdTxxxn4O2k7F1gNYZ/1cjra2yaUAyqzf\n+db2BOBJ67l+DaBIeqxbAJRbP9+T1hcB2GD9ze/R6YKg4ud/PuxeN6PQ9gUuB/BXABnW+kzrdrl1\n/yjp739iPa9SSL1Mk
vEzAWAigGLrvX4bbb0rjH6fAfwMQIkV18to6zlj1PsM4FW0XYNoQdsR+Dw/\n3tfu9hHph0fGMsaY4XRuumGMMRYDTvSMMWY4TvSMMWY4TvSMMWY4TvSMMWY4TvSMMWY4TvSMMWY4\nTvSMMWa4/wVgxNtGhr5GaQAAAABJRU5ErkJggg==\n",
-      "text/plain": [
-       "<matplotlib.figure.Figure at 0x7fe8b80efb00>"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    }
-   ],
-   "source": [
-    "# This will throw an error until you successfully vectorize the `zero_suppress` function above.\n",
-    "# The noise on the baseline should disappear when zero_suppress is implemented\n",
-    "plt.plot(zero_suppress(waveform, 15))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Managing GPU Memory\n",
-    "\n",
-    "So far we have used NumPy arrays on the CPU as inputs and outputs to our GPU functions. As a convenience, Numba has been automatically transferring this data to the GPU for us so that it can be operated on there. With this implicit data transfer, Numba acts conservatively and automatically transfers the data back to the CPU after processing. As you can imagine, these transfers are very time-intensive.\n",
-    "\n",
-    "The [CUDA Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html) indicates:\n",
-    "\n",
-    "> **High Priority**: Minimize data transfer between the host and the device, even if it means running some kernels on the device that do not show performance gains when compared with running them on the host CPU.\n",
-    "\n",
-    "With this in mind, we ought to consider how to prevent this automatic data transfer back to the host so that we can perform additional work on the data, only paying the price of copying it back to the host when we are truly ready.\n",
-    "\n",
-    "The way to do this is to create **CUDA Device Arrays** and pass them to our GPU functions. Device arrays are not automatically transferred back to the host after processing, and can be reused on the device as we wish before we ultimately, and only if necessary, send them (or parts of them) back to the host.\n",
-    "\n",
-    "To demonstrate, let's create our example addition ufunc again:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 55,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "@vectorize(['float32(float32, float32)'], target='cuda')\n",
-    "def add_ufunc(x, y):\n",
-    "    return x + y"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 56,
-   "metadata": {
-    "collapsed": true,
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "n = 100000\n",
-    "x = np.arange(n).astype(np.float32)\n",
-    "y = 2 * x"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 57,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "1.12 ms ± 552 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "%timeit add_ufunc(x, y)  # Baseline performance with host arrays"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The `numba.cuda` module includes a function that will copy host data to the GPU and return a CUDA device array. Note that below when we try to print the content of the device array, we only get information about the array, and not its actual contents. This is because the data is on the device, and we would need to transfer it back to the host in order to print its values, which we will show how to do later:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 58,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "<numba.cuda.cudadrv.devicearray.DeviceNDArray object at 0x7fe8b80abe80>\n",
-      "(100000,)\n",
-      "float32\n"
-     ]
-    }
-   ],
-   "source": [
-    "from numba import cuda\n",
-    "\n",
-    "x_device = cuda.to_device(x)\n",
-    "y_device = cuda.to_device(y)\n",
-    "\n",
-    "print(x_device)\n",
-    "print(x_device.shape)\n",
-    "print(x_device.dtype)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Device arrays can be passed to CUDA functions just like NumPy arrays, but without the copy overhead:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 59,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "537 µs ± 339 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "%timeit add_ufunc(x_device, y_device)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Because `x_device` and `y_device` are already on the device, this benchmark is about twice as fast (537 µs versus 1.12 ms).\n",
-    "\n",
-    "We are, however, still allocating a device array for the output of the ufunc and copying it back to the host, even though in the cell above we are not actually assigning the array to a variable. To avoid this, we can create the output array with the [`numba.cuda.device_array()`](https://numba.pydata.org/numba-doc/dev/cuda-reference/memory.html#numba.cuda.device_array) function:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 60,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "out_device = cuda.device_array(shape=(n,), dtype=np.float32)  # does not initialize the contents, like np.empty()"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "collapsed": true
-   },
-   "source": [
-    "And then we can use a special `out` keyword argument to the ufunc to specify the output buffer:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 61,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "448 µs ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
-     ]
-    }
-   ],
-   "source": [
-    "%timeit add_ufunc(x_device, y_device, out=out_device)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "This call to `add_ufunc` does not involve any data transfers between the host and device and therefore runs the fastest. If and when we want to bring a device array back to the host memory, we can use the `copy_to_host()` method:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 62,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "[ 0.  3.  6.  9. 12. 15. 18. 21. 24. 27.]\n"
-     ]
-    }
-   ],
-   "source": [
-    "out_host = out_device.copy_to_host()\n",
-    "print(out_host[:10])"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "You may be thinking that we are not comparing apples to apples here: we have not benchmarked the `to_device` calls when using device arrays, whereas the implicit data transfers do count toward the benchmark when we use the host arrays `x` and `y`. You would be correct. Of course, our `add_ufunc` function is not particularly well suited for the GPU, as discussed earlier. The above was only intended to demonstrate how the transfers can be eliminated.\n",
-    "\n",
-    "Be sure to benchmark your data transfers when exploring whether or not a trip to the GPU is worth it.\n",
-    "\n",
-    "Numba also provides additional methods for managing device memory and data transfer. Check out [the docs](https://numba.pydata.org/numba-doc/dev/cuda/memory.html) for full details."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Exercise: Optimize Memory Movement\n",
-    "\n",
-    "Given these ufuncs:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 63,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "import math\n",
-    "\n",
-    "@vectorize(['float32(float32, float32, float32)'], target='cuda')\n",
-    "def make_pulses(i, period, amplitude):\n",
-    "    return max(math.sin(i / period) - 0.3, 0.0) * amplitude\n",
-    "\n",
-    "n = 100000\n",
-    "noise = (np.random.normal(size=n) * 3).astype(np.float32)\n",
-    "t = np.arange(n, dtype=np.float32)\n",
-    "period = n / 23"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "As the cell below currently stands, there is an unnecessary data round trip to the host and back to the device between the calls to `make_pulses` and `add_ufunc`.\n",
-    "\n",
-    "Update the cell below to use device allocations so that there is only one copy to device before the call to `make_pulses` and one copy back to host after the call to `add_ufunc`. Check out [the solution](../edit/solutions/make_pulses_solution.py) if you get stuck."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 71,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "noise_dev = cuda.to_device(noise)\n",
-    "t_dev = cuda.to_device(t)\n",
-    "pulses_dev = cuda.device_array(shape=(n,), dtype=np.float32)\n",
-    "\n",
-    "make_pulses(t_dev, period, 100.0, out=pulses_dev)\n",
-    "waveform = add_ufunc(pulses_dev, noise_dev).copy_to_host()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 72,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "[<matplotlib.lines.Line2D at 0x7fe8b80061d0>]"
-      ]
-     },
-     "execution_count": 72,
-     "metadata": {},
-     "output_type": "execute_result"
-    },
-    {
-     "data": {
-      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD8CAYAAABn919SAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAIABJREFUeJztnXd4FVX6x78nCSkEElqoAUIg9E6k\ngwhIVbEL6yquuP50V6wri20VdRXXVRdsyKorNhBBBSkiXUCKQXpNIZDQklBDST+/P+4kuWVun5lz\n58z7eZ48uXPu3Jl32ndOec/7Ms45CIIgCPMTJtoAgiAIQhtI0AmCICSBBJ0gCEISSNAJgiAkgQSd\nIAhCEkjQCYIgJIEEnSAIQhJI0AmCICSBBJ0gCEISIozcWYMGDXhSUpKRuyQIgjA927dvL+CcJ3hb\nz1BBT0pKQlpampG7JAiCMD2MsaO+rEddLgRBEJJAgk4QBCEJJOgEQRCSQIJOEAQhCSToBEEQkkCC\nThAEIQkk6ARBEJLgk6Azxp5gjO1jjO1ljM1ljEUzxloxxrYyxtIZY98wxiL1NpYITTjn+DYtB0Wl\n5aJNIQhL41XQGWPNADwKIJVz3hlAOIDxAN4A8A7nPAXAOQCT9DSUCF1WH8jD0wt2462fD4k2hSAs\nja9dLhEAYhhjEQBqAjgJYCiABcr3cwDcrL152nLhaimW7TmJd1YexoDpa1BRQQmyteBiUSkAoOBS\niWBLCMLaeJ36zzk/zhj7N4BjAK4C+BnAdgDnOedlymq5AJrpZqVGTJ67A78czq9aLimvQHRYuECL\nzE9m/iU8OX+XaDNMz9WSchwpuIyOTeNEm0KYGF+6XOoCGAegFYCmAGIBjFZZVbW6yxh7kDGWxhhL\ny8/PV1vFMI6euSx0/zLy4bpM0SZIwRPf7MSYmRuqWjsEEQi+dLkMB3CEc57POS8F8B2A/gDqKF0w\nAJAI4ITajznnsznnqZzz1IQEr8HCdIVzz8uE/zDRBkhC2tFzAICuL/2MRTuPC7ZGHrZknUHuuSui\nzTAMXwT9GIC+jLGajDEGYBiA/QDWArhdWWcigEX6mKgd3KkRMfjNtYIskRNOb8iACbN7Mz42byed\nSw2oqOAYP3sLBr6xFln5l0SbYwheBZ1zvhW2wc/fAexRfjMbwN8BPMkYywBQH8AnOtqpCRUVjsv5\nhcViDJEIZidEP+xUbaQRPsCcmjqzf8nCH/67BaXlFeo/ILwyZeHuqs9D31ov0BLj8CkeOuf8RQAv\nOhVnAeituUU6cvz8VZeyw6cL0bZRbQHWEEQ1YU6K/vrygwCAnLNXkJxQS4RJpoZzjgXbc0WbYTiW\nnyk64p1f8NPek6LNMB3HzlxB0tSlmJ9mvYdGa3bmnEdpuXoXy7cWFCUtuH3WZpeyx+btkL4ry/KC\nDgAPffm7aBNMx9YjZ1TLtx05a7Al5ib9dCFufn8TCi6pd/+RF5F/ZOZfwrxtx7BdGWS2Z9HOE8g9\n59pKlwlDU9AR8uCunnPnR5uRPX2sobaYmTOXaTKWloyZsQHFZe7HHZzHKmSDauiE3+RdLMKUBbvd\nfp9pEY8CLaiQvAvAaDyJuRUgQVd4f22GaBNMQ2a+5wlawyziUaAFmzIKvK4je78voR0k6ApvrqDA\nUr7iS7OVaum+UeZmMNSe8bO3GGCJNWCS97lIL+gfrc/ExvQC8ufVEF9EaPLXOwywRAJ80JetNNCs\nGQOmrxFtgq5IPyha6c87oXdzr+teLSlHTCQF6/JGeJh3FcqjSVu+4WNvypWSMtSMlP5xJYJE+hp6\nJXO35Xhd56lvdxpgibnJOXsFn/16xOt67tzwCEf2HL/g03o7c87rbAkhA5YRdF/Ynevbw2Vl7v10\nG1bsOy3aDGn4NVPdn9+ZTzd6f4lan
aSpS0WbIBwSdDtkn3SgBYVFZd5XIjRn1YE80SZIQ9LUpfh4\nQ5ZoM3SBBJ3wE99d6A6euqijHQQROO+ukdNNmQSd8At/0sx9/zvF9fbEj7soOqUofBjXNyUk6IRu\nfPSLnM1arZg8l1w7RSGrPzoJuhM0K48wgkvFNBahJTln/ctKRDV0i/DxBvImcEdGHs3+1Iqb3t0o\n2gSpGPQv/7KPUQ3dIvxz2QHRJoQsw9+mGC1akVXgf8LyqyXlOlhiTeSUcxJ0VY4E8LARhN7M2Zwt\n2gRpkHUmMwm6Cv+V1EdVBJQNSjuW7CavGMIzJOgqzNt2TLQJ0kBRLLVj73Hy6yc8Q4KuQgU5umgG\nnUtCbzame48pbxVI0AldofEIbckrLBJtQsgx7cd9ok0IGUjQCZ/w18+X0Ieb39sk2oSQIz1Ad9pH\n5+7AxaJSja0RCwk64ROrDlCERa0IZvLaiQtUQ9eKxbtO4LNN2aLN0BSpBb2olPx2tSKYCbQV1JHu\nAJ0OQi+kFvSBbwSebqpQsqZYsCwOIpAUxXRxhMJLhA6yXQqpBd2fyIDOdHnpZw0tMT/BZMxJP12o\noSXmRzINEcrdH1MCbXukFnQiNCiXrRoUJOSVoR2bMnzL+GQVpBX0svIK0SYQCpQP05Evt9DENUIf\npBX0z37NFm2CNJQHOYp39Ay5PGoJhd4l3CGtoF+k3JeakZ4XfB/4Zh+TIRPe+TYtR7QJ0sAlG9GQ\nVtBPk79uSDHhvzR4RRB6I62gf6NBLYbcy4hQJEzS5Awi+M+qdNEmaIpPgs4Yq8MYW8AYO8gYO8AY\n68cYq8cYW8kYS1f+19XbWKNZsptCvwLy+eqKJPec+/GEBrWiDLTE/NBYgiu+1tBnAPiJc94eQDcA\nBwBMBbCac54CYLWyLBWUxNfG6BkbRJsgDWXl7t+OMyd0N9AS83OlxLOgW/EF6VXQGWNxAAYD+AQA\nOOclnPPzAMYBmKOsNgfAzXoZSYQ2zerEiDbBNHhq7PRv3cCnbZy+SONDALDtyFmP3792S2eDLAkd\nfKmhJwPIB/A/xtgOxtjHjLFYAI045ycBQPnfUEc7dWFs1ya4pUcz0WaYnmWPDRJtgmmY6yZ5So1w\nW7/4kskD8dyYDh638c1v5OUCAI987bkFPaJTY3z2p2u8bkemsTJfBD0CQE8AH3LOewC4DD+6Vxhj\nDzLG0hhjafn5+QGaqQ/v/6En/mnBt7iWLH10IOJjamDBQ/2wnITdK7PdxLUZlJIAAOjcLB5/HpyM\n7Olj3W7jzOXAQ1pYjSHtGqJhbc9dL6sP5Blkjf74Iui5AHI551uV5QWwCfxpxlgTAFD+q54Vzvls\nznkq5zw1ISFBC5u9UlzmPsrioBTHZm3NyAi8cVsXvU2Slk5N4wEAqUn10KFJHP7v2mTBFpkTtVpi\nt+Z13K6fR90uHtn+/PCqz97q36ckOpdeBZ1zfgpADmOsnVI0DMB+AIsBTFTKJgJYpIuFAXDtv9a5\n/W7y0BSXsjtTm+tojbnxNvDkzDUt6+lkifUY2amR2+/WHQ6t1m6oUd+PAdGlEnmzRfi43mQAXzHG\nIgFkAfgTbC+D+YyxSQCOAbhDHxP9x9Mbt3ereujUNA77TlQn3GXk1+uW0jL39ZsVjw92KRve0b0I\nWR1P8fmnjnbtN39ocGv8duQs1h5yFW9P3jKEf2zOkmcWs0+CzjnfCSBV5ath2ppjDIv+OsDnCIDr\nD+fj2rbGdBWFIpfd1NBnTuiBdo1rG2yNudl34oLb71rUq+lSFhbG8MCgZHVBr6Dgc+5wHn8Y26WJ\nZWI7STtT1BMR4WGIigj3ad2Jn27T2ZrQpv909SQhvVr6P4+spMzaIuSpJRgTqX4/9kuur1p+tYSy\ncfnKCzd0xI4Xrkd7C1RALCXo9/RtKdoEaYiO8P/W2SJR0zYQ3LnHRXk4l2Fh6i+BeeS66DPhYQx1\nY
yMxtksT0abojqUE/ZWbyUVRKzwNOu2bNlK1/F6Lt3beXZOhWn7o1dF+b+tIweVgzTE1GQFEAG1q\ngQlw0gn68fNXA/rd8A6mmxcVssRG+TrWbi3ST18SbYI0TJqTplreuVmc29+Eu2ntAPLEhZFO0B9w\nc6G9TU9/d0JPPcyRkqHt6eUXCIFWNghX3CVNaZ1Qy+1vrmvn/r6VZXxHOkF35zftbUaou0EpwpUh\n7azr9UOIJ9DadHzNGi4TCyuRZfq/dILurlFVLzYy4G0WFpUG/FsZqQgyJR1Rza4XR3hdZ8Z4isJo\nT7EHf/4nr2/r8bfuIjB+uulIUDaFCtIJerabpliXZvEBb7PUopM43NVafGnNBHO+rUR8TA2v64zr\nTgHkfOGTialoWT/W4zoT+yeplr+/NlMHi4xHOkF3RzCzQT2MpViSuGjvIvTOXd0MsMTaUDwXRwa6\n6U6xR/Zn2TKCHgwnLZqftOBS4FH92jSUfxKHaDZlFog2IaTwZbKgPzFezAgJug/8lu05kL6s3PPJ\nVtXyPm5mL/rC4dP++w8T6jzxzS5k5VvPFTKY1nazOjFIaejeE8bsWELQP71PLQyNK8+Oaa9abtXg\nXQdPqYtvMAPMz3+/N+Dfmhl/o1b6Ss4567lCfrnlaFC/b1nfNW6OLFhC0Ie29y0C4IODW6NpfLRL\nuTXlPHgax7mey20Wbe2oDaxP7Od7KIpAYufIytsrD7uU+dM3Hh8TeIUk1LGEoPtDA5XsJmEWraGr\n8acBST6v+5frWutniMlQu4Uaqrzw3PHoMNc4/kQ1Syb7ni2rQxN5x3dI0J2IDHc9JVYPKmWPPy50\n9/ZL0s8Qk1FY5Nrl0ruV78lAZPfOCJYEL2nm7Lm1Z6KOloiFBN2J1291TUe3eNcJAZaEJt09pEUj\n1OGcY4BKGOJrkvwRdFJ0T0TX8F3K3I0ByTBgT4LuREojeZtjhBi0SBdHeu6Z2j7MjfDGPxaZf8Be\nKkE/f4WyoWsFTe/XjiINklGkUq5W3TkTxLyLUEEqQVebCFOLQrkGhFVSdpmFyIgw1Yw7FZIElTKa\nsV1dk12k55nfp18qQVfLs0hN1cD4hjLihBwPD3H1Giq3aJyhYJk5vodoE3RBKkGfsSrdpYz0PDAO\naTRA1I0GUTVDzcMoTKon2Dg8JbswM1LdDmpB75vV1WZWWCAprwhgqEpSgbJyOZIJBEPbRtpMP7fq\nLGZCHakEXa0/cc7912iy7eFv/6LJdqzGo8PauJTtP3lRgCWhRZ2a2sxWlGEgL1juH9BKs22Vm9wZ\nQCpBV/MjbVjb99l4hPao1SBPUCo2zboCn/lut0ZbIgBg2Z6Tok0ICqkE3eQv15AnpoY2afoe+vJ3\nTbZjFnblXnApmzJKPRCcv1gt+UpeoWso68cCDIuQnOCaDKPIQzYkMyCVoBP68tvzw0WbYEpmrXfN\nhtNOxQWR8M4NMze6lMXXDGxS0aSB2nXVhAok6CpkTx8r2oSQhHz6tSNWw6Tkaw/labatUCavsAh5\nhcWabY9J6ANHgk4QAgjUO6V/a9fkIpkSTIjxhZIybb2jZHQQkkbQc8+pJ4cmxKM2K8/KJNaNCfi3\nn96njdeWGdHaRXNgG9ccpGUmH4iTRtAHvrFWtAmEGzo1jRNtQkgxoXeLgH+rNiGmWOOaa6iidYW6\neT3XOSqlJp8jIY2gq+FPSE3CM08MbxvwbynQlyNREYHfl2qipnVXhJXJMHn3ldSKNzGIBAtaDlqZ\nDbWolbFRgZ+P/ipNWytTOzrwwWW1bgd6XWrH55uDy1cqGqkF/a9DXWcp+kqiRiEDzIi75NCB0rOF\naz5MvZImhxr5Kl4ZwQRIVKuhc4tEXFTrQq8boMuirPgs6IyxcMbYDsbYEmW5FWNsK2MsnTH2DWMs\n5DKvxgUR9P6LB3praIm5UNOH9o217Qf/10+HNN1eqHLLB5tcym7
q3jTg7amJmkX0XNXNsGYkudLa\n408N/TEAB+yW3wDwDuc8BcA5AJO0NEw0aiEDisvMPYvMV3bnnncpG5iibbeJVZKR5J5zDXMQjAip\ndbm8tzYj4O2ZCbWX2ZRR7Yw3JITxSdAZY4kAxgL4WFlmAIYCWKCsMgfAzXoYGEos33NKtAmG8Pry\ng7rvg8ZJCX/Zf8I1qNvQ9q7RPK2MrzX0/wCYAqByOL0+gPOc88qO0FwAvqeD1xijwrFa1ZtgkMa1\nc4Ay7RD+88x3e1zKgs0lOkzlhWDmMQmvgs4YuwFAHud8u32xyqqqZ4Ex9iBjLI0xlpafH3yyXDWu\nGBRQZ8pCa0a266iDH7mJnxnh/LFv4H7sZobr4M/zicpELTO3Hn2poQ8AcBNjLBvAPNi6Wv4DoA5j\nrLIzMBHACbUfc85nc85TOeepCQkJGphMGE2sDgNPVEMPnLFdAh9UJbyzM8d1DMkseBV0zvkznPNE\nznkSgPEA1nDO7wawFsDtymoTASzSzUovqDUX+rSiLOlakdrS1e0wWJbvtcZ4hDMzxncPehv9VOK5\nWAGjknlsTC8wZD96EIwf+t8BPMkYy4CtT/0TbUzShjn3W9ftUGuoLq0doztTXJtAMSrOyp7jrvHr\nzYJfgs45X8c5v0H5nMU57805b8M5v4Nzrl1cSz85fdF119EaJGO4u4/1+iovFbtO+NEihgZ5I9iI\nDGLaP2EM5RXmdX6Q4u567nvX0W8tsGISArW+7T7JwTfxW6tkhyGIUGTtIX2cN4xACkHXqyE2pov1\nmsdqtXG1CH/+kppEYxp6YvbUaYQ2SCHoekVIa1ArSmVf2sY5sQpqiRlkJ+escTH65207Zti+QoWP\n7ukl2oSQQwpBP3vZuGnkx8+7JqmVCa2TCFQS7AQQM7JPZWajVgxo4/iC1DI1m1kY2amxaBNCDikE\n3Rk9I7CR/zThO/rdK85i9sE610TUhPWQUtAfura1bts287RgX9AzzWJvp370TRnm9ff1BT0jUtSt\nGXLBTXVFzzGCX56+TrdtG42Ugq7nYOblYrkHny6ruC3qxd0fbzVsXyIouGS9bhC9OHVBv67OFvXl\nyX0gpaA3rRN4El5vPDV/l27bDgV6v7Zat233b2OtgVG16IBEYGQVmDs1nFFIKeh6dhuUmDyJrEge\nHZoi2gRD0Wl82ZIcd4orf3uvREGWhDZyCjo9SJrROM410UeghGngz24mDmicys+exvHaXRcz4Hzv\n/G0EJbZQQ0pB15Ibu1k7st2Ch/uJNsG07HKK2tezRR3Ntn2NxSZqbc8+57BcL1bfQWGzOj9IKeha\n+lJPGthKs22ZkfgY6/mPa8Hvx865lA1KofDRgfLdjuMOyzXC9W3tmTUmuukFvULnMx+jQZAvM0PB\npALjkEp3i9Z5Wa2MXhPgKrlcYpy3l5aY/mk16YvUNERFWPuFFigfrHNN3Gy1bhIzs96kAbpML+h6\nD7MlNZDHR5UwDqvmn5WF2b9kiTYhIEwv6IU6T4ShGipBWA+zJrkwvaCv2n/a8H1uyTpj+D4Jwht7\nTSpChHaYXtCf+tb4mZvbjpw1fJ8E4Y0PKUCX5TG9oIvApC6qXrlaYnycmsKiUsP3aQSl5cbfJEv3\nnDR8n7IQGS6HFMpxFHaQ33TgdPjHT7rvY9pNnRyWpy7UJ32gaIyM0S87RmRjeuvObrrvwwhMLehq\ns7l2vThCgCVycm1b7SfCjOvuOPP2/FVrCF9cdIRoE0zL+sP6uxDK0ug2taAv23NKyH6PGZhaTCQR\nOsReYU6OppsyrDHAPLxDI823+bYktUpvlBswbbNVfTmSmJta0PMKxaSDW/h7rpD9Gs1d1zQXbYI0\n6DGz8daerhEHtx+Vb8DeCJ/wTk3jdN+HEZha0C9cNWZAbUwXa+Yu1DsAkpUwKgLorhz5XBd3OgU5\ni9IhHIVaJFA9k2rohakF3bn
5rhdN4/VLmGE5rBVBtwrnpM56YYXQ0UYFzMvKN19SDVML+pVSYwLo\n1FCpERjRryea+rWiRJsgDSM6GtPKs4Ceo2tivCH7mbJwtyH70RJTC/rOY+e9r6QBCSrCJqr/Xi/U\nPIaa19W+ZaIWvfL4+asqa8qFFWrORjGqs345g+3JPWe++9LUgr7VoBmbvVrWNWQ/oUaEDpMt1MLx\nXjVpqFJ/CLdYtiZCDKYWdKPo1tw104xR/fdWQNaZt/YYFeTNCq0dveiu8pybDakEXU149UK2JvSv\nmeL8wS2g57rRrI5jt9juXLm8XIrLjAtHUSFBzUIqQTcSyfRcujEBkZSVGxcL/b0/9HBYNqob0iiK\nDYwr3zqhlmH70gupBN3IbsorAgJZyYoEFSMH3lvrmq1IL8Jkayo6YeS98dKNnbyvFOJIJehGvmGP\nnLls2L6MYNHOE8L2vTmzQNi+9eC4Cb0jQpXVB4zLdxBVw/xy6PUIGGPNGWNrGWMHGGP7GGOPKeX1\nGGMrGWPpyn/hriB6vs2fHtnOYXm3ZDPy1jnlUEzUwWXRHTmSCaBzpEU94rhUInkFHV9sOSraBFPh\nyyupDMBTnPMOAPoC+CtjrCOAqQBWc85TAKxWloWiZ4b6e/u1dFjeJFmt0hkjY1usOZhn2L6MYLXT\n8dzco6mbNQlvnDDQa0eGrj+vCsg5P8k5/135XAjgAIBmAMYBmKOsNgfAzXoZ6SsPDNJvSrDzxNCC\nS8W67SsUuLdfkmH7OlIgV/eVM3oKhezus84zsvu3NiaEglnxq0rLGEsC0APAVgCNOOcnAZvoA2io\ntXH+omsfutNDmZUvtwgNaNNAt23rEVwplKFJRYHjLOiPDkvRbV8ydF/5/GQxxmoBWAjgcc75RT9+\n9yBjLI0xlpafr3+ger2QwUc1VFj39BDRJhiKnp4obRqa39XOE86CXkPHVHHRKmEpzIZPZ4cxVgM2\nMf+Kc/6dUnyaMdZE+b4JANWOUM75bM55Kuc8NSFB+ww4RhETaf6LHSo0sVj0ytgo/e4d2e9L59ys\nanGViGp88XJhAD4BcIBz/rbdV4sBTFQ+TwSwSHvz3LPH4BlxMry9CTEM1LH7SnauOuUTbVG/piBL\nzIEvNfQBAO4BMJQxtlP5GwNgOoDrGWPpAK5Xlg3jxvc2Grk7gggYPbIVeSL9dKGh+5MZsyX79pq5\nlnO+Ee5nug/T1pzAeXBwsmgTCCIk2JJ1BimNaos2QwomzN6CFU8MFm2Gz0jjbtCiHjXFCPGcC4Ea\n3Y+7T4o2QRouFZsrtLM0gq6WoIHwjV05xiQKsQIvLt4n2gRskyxAl0jMFo5YGkG3Qko4vcgWEJcm\nuUGs4fs0gsW7xMXEIYLnHzd0FG1CUEgj6KTngSOicdNacv9pIzEyTIPsTOjdQrQJQSGNoJNbYeBc\nLCo1fJ/vTnCM471ZYIINs0MzUbXD7H790gh683rGT1bZceyc4fvUg5d/3O+w3L6x/h4Szi/gCf/d\novs+RdDAgIkwsg4flRqYKEQWpBH0cAGBGETGENeSMqf+qoUP9xdkiXxc31G/0LmV9Ghh/lyYarzw\nw17RJpgOaQS9bzJFYdOK2Civ0xMIHzFifsTzY809kOeOeb/lOCz3o2fcK9IIepgB/Yj39U9yWCbP\nGsIbrQzw5tEzD0Ao0btVPdEmhDzWuBM04qWbHHMOLt9LEzgIwihkz5+qBSToQWC2WWQEYWZIz71D\ngh4ERaU0Cm9mvk3LwU97T4k2g/ARs3hnXiwqxYGTPqeM0BQS9BCmpKwCZ+xS3Z2/UoKF23MFWiQX\nTy/YjYe+3C7aDMJHInRMbmHP7Ht6OSz7G1bkjx9vxegZG7Q0yWekEPSHh7QWbYIuPDF/J3q9uqpq\nefLcHXjq213Iyr8k0CrzUlpegZIya7eqjp65jDtm/YpCZTLZvhMXHPy9P96Q5VeugffXZiBp6
lJc\nLSn3vnKQODsl6IXzHAl/k5jvNjhXgz1SCHpECLfFMvIKcc0/VyGvsMjv3y5VoubN/iUTAJB30VZb\nLwlywsX+Exex/WjoTYrKOXtF1+33emUl2j6/HD95GMzefvQsFu08rqsdRpGWfRa/O01+e3vlYfyW\nfQ6rD+Qhu+Ayxs7ciH8uPVD1/atLD6jmGrhaUo62z7meu89+zQaAqheEnhg1G9y5Pn7ZgJeVVkgh\n6CJnyn28IQtJU5eiooJj1vpMJE1dWhWgaf+Jixg9YwPyC4vx877TLr+9cLUUY2duQEae5xr3a8sO\nupQVl5XjwtXAHqIxMzfgtg9/xUshEBnQHncvvcKiUr+j3lVUcGQqLZkT56+irLwCF4tsg9gPffl7\n1XfO3PbhZjw2byfOXCr2uzYfahE/b5+1Gbd+8KvqdxwcZ5RQv7ty3UfbLC2vwEfrM5FVcAkl5RX4\n98+HPe7zlg82Yehb69x+f7m4rOqFWVZegSkLdmHRzuN46+dDXo7GODo0cZwpvWq/67MbqphO0Msr\nOH7NKHAoE5nA+VWldrPucB6mL7cJ739W2m76MTM3VOVE5AD+8tV23Phude1n7cE87DtxEc9+v8dl\nu2rxVbhSd5i79Rju/u9WdJv2c1C2f/ZrNorLxNU+4qIdJzDZX8bColK8vzYDFRUcY2ZuwIDpa1x+\nf+O7G3Hf/7a5lG9ML0Dys8sw7K312JhegP7T1+DlJY7hDYa9td5h+a6PNuO1ZdU11V6vrsJfvnLt\nXy9TWkdXSsrw+rIDKCotx5US24siv7DYYd2J/VqqHbYu3NEr0e13S3ZXz2hWa8t6enzm/JqN15cf\nxKcbs6vKyit41XmwZ+bqdOw4dh5Z+bbonReulqKotBxl5RUor+AoLa/Ac9/vwWPzdmJP7gXsO3ER\n89Ny8di8nXh3TQZKyirww47jwl+MNSMd70v71uzh04V4c8VBXCouQ9LUpbjlg014ffkB500Iw3RT\nAl9duh//25TtUBYK9aL7P0ur+pxVcBlJU5c6rsA5lu2xeVS0fW45tj47rKpGqha/ushDM2/O5qN+\n2fZb9lnM25aDN27r4jKw9NH6LIflTyam+rXtYGgSH4OLRdXp0vYev4D7P/sNa/42BG/9fAhzt+Wg\ndUIscs7aaudbss6ga2J81QO357itr/JIwWU0rxtTdWx//GRr1TYrvQ0+93LOth45i61O12HVgTzc\n8sEm/HVIGwzv2Agr95/Gnz9Pw/LHBmHutmP4fPNRnLpYhEU7T2DWH3ti/WHHioYRk90q+UOfFvjW\nbsC8zbPLqj4/8vUODGjdAHVjI6vKOK92A/T0/Fwutt2Hl4ptFYwwBvR5bRWKSyuwZ9rIqvV25V7A\n2yura+8/7zuFB7+wvRDjY2pjN8qdAAATH0lEQVQgqX5N7Mq9gN5JtslBc387hq+3HnPY13trMzBz\ndToOny7ElFHt/Th6fTl+/irOXCpG/VpRuOujzTh3pRTDOthCOuw4dh47jp1H64RauDO1uWBLTVRD\nLy2vQM7ZK/hRJd60yBq6r9jHSykpr0CPV1Y6lL3x00F8m5aDnq+sRI+Xf8bmLMfogz/sOI7Dp127\nCbYfPYf8wuKqWk1ZeYXDABXnHHfM2oyFv+eizXPL8dVWR2GbtT7TYblronFxQRrHRzssv/Tjflws\nKkPqq6tw8aqt1vu8XTyP8bO3YPSMDXh83g78cji/qvy6f6/DuPc34ZUl+/HemnSHbQbru7zj2Hk8\n8Hkazl0uwZ8/t720Z6xKr3pBpGXbam8Pffk75m5zFCgjJ8I45y11js/z94W7Hdar4MAdszY7rGM/\ndvDZpiOosNvGCqXLkIGh4FIJCovLHLoKK89NJfPTqqftX7hail3KQGFlK9NZzAHb+AUAfLAuU/fx\nFH8Z9K+1AIBzV9S7OacssJ1f+1baxvQCfLLxCJKmLsV8p
zAGemGaGvoLP+x1ie1QRejrOaY5RTQE\ngAt2N8eH6xyF9bF5Ox2WH//GcbmS2z609ZG+fmsXTOjdAm2eWw4AePvObri1Z2JVq6CS5753DHh0\nReCAjye9q5y0VXDJMaXb0TNXcPTMFfzgFBht34mL2HfC1ff36BnvwjBzdbrXdXq8srLq80/7qs+p\np759I8fqve3q5/2nkTR1KTo0scVOL7hUXBW64tAp23mrfDkBtpdrvVpReGeVY5/5IbsE1MPfduy2\nsmfVAXXPkNMXi1XLAWBTRnUlplJAK5nQW2zt90pJOb7YnF217PwcAcB3v+fiyfm7qpbtW4pTFu7G\nndfofwymEXS3Yg5UNX/Mxke/ZHlfyUee+W4PXrHrJ35y/i5wDmx0Gm/wRk0D40G3bVQb6w7lq363\n/rB6ub98scV795R9V4GWONeaQ4HKLqgfdlTXxotKK/Dqkv0u58rbYH0gHAuxmrc/vLCo2olAbeKQ\nvZiLwhRdLse81LIoaI8N59r2U9/uwvc7/HPBMzLS4uPDUwzblwiMlHN/k1wcPFXosPzxxiMu6/jS\nciFCC1MI+uA313pfySBGdjJnayAUiY4wd3YYb4zu0sSwfXVsQmnoCJMIeihxY7emok2QBiO9QEQQ\nZWBYW9nPpbHtHfNCgu4nSfXlzFZPaE8IdqETPhCt04u4qFR/BwQSdD/p3CxetAmESWBUq9SMhNr6\n52atRK8gYP7Odg4EEnSCIEKeW3o0E22CKSBBJwidqBtbQ7QJ0mBEKj8ZIEEnCJ1oWDva+0pESPLt\nQ/1EmxAQJOgEQRBOmNUNlASdICShX3J90SYQgiFBJwhJ4GYIamQS9HA5vWxAUnkSdKKK1JZ1RZsg\nDSKyaBkZ3ZHwn1057hOJaEVQgs4YG8UYO8QYy2CMTdXKKH8Y3qGhiN1Kyad/uka0CdIwoE0Dw/f5\n1Ih2hu9TVvSYQxBfM9L7SkESsKAzxsIBvA9gNICOACYwxjpqZZivPDOmg9G7xNcP9DF8n0YQF228\nm123RDknaqU0rGX4PntRCyukqRfKgg6gN4AMznkW57wEwDwA47Qxy3daJxj/4PQXUPsizAX1fpib\niHDtL6ARiXiCEfRmAOyDlOcqZQThM50kDaUQirHQzcpwAfkOaugw/d+IIetgrFbNN+uyEmMPMsbS\nGGNp+fnaJC0g5KF2tGlyrPjFNUkUo18rXr25s2gTTEMwgp4LwD6nUiIAl4SfnPPZnPNUznlqQkJC\nELsjZGSsgTHDjaR/a/IJ1wrn3LNmxYg2WzCC/huAFMZYK8ZYJIDxABZrYxZhFYxMSm0kRmZ+IohK\nAr7rOOdljLFHAKwAEA7gU875Pi8/IwiCIHQiqGoE53wZgGUa2eI3Zo23QBAEoQemnin6jxsNd3sn\nCMIitG1kvEt0sJha0KNryJ1k2CpQrGsiFPnShBMITS3ohByQy7Z29E0md0mtMGM8exJ0AgAwuK04\nl9IRHRsL27ds0Lm0NqYWdBHxMmTlz4NaCdv37b0She1bNkTMqiRCB1MLusg+9IYGZiE3gkidMp0T\nxtKifk3RJhBuMKJr0dRPcbiAmNOViAiPqid6BCOyKrf2pJBGhCtGxKs3taCLhAbytESuTDtv39ld\ntAlECNKjhf6zoknQA6SNZP33BkT2JAhLQzV0BVkj8oUS9WL1D77vjvAwU9yGBBEU1IeusGnqUNEm\nuGD2Gu3eaSOrPi94qB+SBSQKqYQmFhHuePL6tqJNMBWmEHQRqdG8YWaXybaNaqFWVAQ2TLkO6/42\nBKmSx+7u3Mz/mD8LH+6HVU9ei+/+0l8Hi8QwZZT5co7WjBQ7G7xBLZs32+JHBgS9LT3ylDpjCkEH\ngNn39MKglGrPEtEZ6vskh268azXnnztTq329/31HNwBA83o1kRQiteNuzbUdMMqePrbq85LJgzC0\nfXUy8fgYxwrCtJs6I
eu1MU7Cz9CmYS30bOF4n8WYONzEsPba+6jbn1dnfg2gZX1z96YOy+N7t/B7\nG1rClaZ4k/gYrzlb353QwwiTPGIaQR/RqTG+mFQdW6F5PcH+tnZdLvf0benwVbKKSC58uD92vTjC\n4yYn9G6BtOeHY+uzw3BdO/WZmw8Pae3VtHaNq4XppRs7YtuzwzD91q5VZaEYA6eVG//p/7s2GUsf\nHVi1fOT1MQCAQSkNsPyxQdhn13XkjL1w39DVlkhjXPemLjX2if2TEBbG8NUDfVX98e9KbY4Z47vj\nlh7NMOf+3i7ff29Xi5//f/3c2iOado1r44tJrvZ74+4+jqLavnHtqs+z/tgLma+NwTgnIQaApnVi\nXMrUZiSP6lQ9u/Wdu7oje/rYqr9aguPKVz7mYcz2DNtXFADgp8cH4cDLo/DaLV1wQ9cmSLCbn7Lq\nycEO61IfugqVD891HmoGRsCVSx0XHYFXlBRZzZQbmAOYMb7adS2hdhR6tazrUjN0plfLumhQKwqN\n4qLxvz/1xh29EvHIdW2qZlKGhzE8PKS1Q/+3M3P/3Bcf3N2zavm+Aa3QMC4aYWEMjeKiEB7GVF84\nonnt1i749L5Uh7IfHxmIZ0Z3QKem8UhuEIvbeiaCMYbs6WPxxaQ+6NAkDrFRES6D5o8PTwEArHh8\nMOb+uS8A4NaeiTj4yijMGF9di7qnb0s8OrRN1XJ8TA10UumeeeP2rhjXvRneuas7ereqh7+Pao+H\nrrW9WGtHRaCHXS2+d6vQ6b7Knj4WE/s5VjYqk6r/sa/nmq/9/fvPW7o4fLfw4f5oUCsSix8ZgMiI\nMISHMYzxIfPU9ueH43O7F2ILpVI2dXR73NuvJSLCWMjlYq1b0/bM2s95udfunLZvHIeYyHD8oU8L\nMMYcXkBtGla/+ADyclGlR4u6OPDyKNzUzbVGYCQxSt/eA4OSAdgenvkP2WpnTetEY1z3Zvjl6esA\nAE8Mdx3YaRznPfDPm3d0w99GtqvqeXv9li6Ii67hcNNU1pbu7tMCSyYPRL/W9d3WarY+OxyZr41B\nRAjOCq0ZGYGh7Ruha2J10ugudp/X/G0I3rqzm+pvnR+Tx5Xz3Tg+Gv3sUsFVtkwqvWpGdW6MJ0c4\n9itPHdUezerEONRCnXl4SGtMHd0eq5+6FuueHuL12EQybZxjPs6mdWKw4vHB+McNnTz+zvn5sg/6\nFRsVgbTnr3fINjWyU2PseOF6l+3Y18hrO42FDWhjuza1oiPw8rjOyHhtjJejMZ4vJvXB9Fu7oE7N\nai+wl5VzWl/FM6zCjbfEb88NN2QipCn9AWMED5QAQFREuEvzq1mdGLz/h55VN2qL+jVd1mnfuDYO\nnirE6qeuRacXVwAAomuEoai0At3sBMyeR4el4OjZKxjZubppypjN06Z1w1o4eKoQfZPro3Mz2+9F\nN1ODYfEjA5E0dalfvxncNgFLdp/0ef03b++K2b9koa/KOEif5Po+e1W1FugZ5A9LJg90mAncTnlZ\n7Z02Ep2Ve3D3SyPwwdpMzFqfCQBgjOHWns1w8nwRANs53pJ11uN+6toJ3KSBtthAn9/fG8fPX8WW\nzDOIjLC9SGMjw3G5pBzTbuqMBwe3rhp4DEWa1olR7cf//P7eSFGJl163ZiSOnrniUp5gUKgQ8z75\nIcrYrp6bnksmD0RZBXfox74ztTmm3dTJbXOzeb2aLn2zO18YgdKKChwpuIylu0+ij11TPxReeMHw\n5PVtsTPnvM/rv3VnNzw9sh2ufXOdT+s3iovGCzdYJzlK5YveGfsXf1x0DTw+PAX/3ZCFd+6ydbcE\nOuM1sW6Mw/ltVicGt9kFYNv87DAUl1YgMiLMtC6r7qKTPj+2A26ftbmqhff1A31w/mqpYXaRoBtM\nRHgYIhS9fXdCD0yeuwPXJNXzu+8wXunba1AryqUVYHYeHZbi1/pREeFoWT8W9w9ohQ5
N3HeV6Enr\nhFhcb8LQtU3io3Hygq0WHl0jHJluuj36tPLNq+vQq6O89hXHRdcAzBdq3CcqK1OV3Sv9DY75RIIu\nkBu7NUVyQiw6NVWvQQXLLT2sFSRKZErC1U8NEbZvZx65ro2q66oaSyYPrBJ0T1QOYDao5XlGcVSE\nuVuHwdKhcRwmDWyFif2ShOyfBF0weol55mtjfH6oCbn420jfJxDVrxWF+iHch202wsKY0O48EnRJ\nERlamJCPyl4Uq9fAQx0SdIIgvNKgVhSeHtkOY33wNyfEQYJOEIRP/PW6Nt5XIoQSejNMCIIgiIAg\nQScIgpAEEnSCIAhJIEEnCIKQBBJ0giAISSBBJwiCkAQSdIIgCEkgQScIgpAExg1MX88YywdwNMCf\nNwBQoKE5ZoCO2RrQMctPsMfbknOuHrPXDkMFPRgYY2mc81Tva8oDHbM1oGOWH6OOl7pcCIIgJIEE\nnSAIQhLMJOizRRsgADpma0DHLD+GHK9p+tAJgiAIz5iphk4QBEF4wBSCzhgbxRg7xBjLYIxNFW2P\nPzDGmjPG1jLGDjDG9jHGHlPK6zHGVjLG0pX/dZVyxhibqRzrbsZYT7ttTVTWT2eMTbQr78UY26P8\nZibzN+O0TjDGwhljOxhjS5TlVoyxrYr93zDGIpXyKGU5Q/k+yW4bzyjlhxhjI+3KQ+6eYIzVYYwt\nYIwdVK53P9mvM2PsCeW+3ssYm8sYi5btOjPGPmWM5THG9tqV6X5d3e3DI5zzkP4DEA4gE0AygEgA\nuwB0FG2XH/Y3AdBT+VwbwGEAHQH8C8BUpXwqgDeUz2MALAfAAPQFsFUprwcgS/lfV/lcV/luG4B+\nym+WAxgt+rgVu54E8DWAJcryfADjlc+zADysfP4LgFnK5/EAvlE+d1SudxSAVsp9EB6q9wSAOQAe\nUD5HAqgj83UG0AzAEQAxdtf3PtmuM4DBAHoC2GtXpvt1dbcPj7aKfgh8OJn9AKywW34GwDOi7Qri\neBYBuB7AIQBNlLImAA4pnz8CMMFu/UPK9xMAfGRX/pFS1gTAQbtyh/UEHmcigNUAhgJYotysBQAi\nnK8rgBUA+imfI5T1mPO1rlwvFO8JAHGKuDGncmmvM2yCnqOIVIRynUfKeJ0BJMFR0HW/ru724enP\nDF0ulTdNJblKmelQmpg9AGwF0IhzfhIAlP8NldXcHa+n8lyVctH8B8AUABXKcn0A5znnZcqyvZ1V\nx6Z8f0FZ399zIZJkAPkA/qd0M33MGIuFxNeZc34cwL8BHANwErbrth1yX+dKjLiu7vbhFjMIulo/\noelccxhjtQAsBPA45/yip1VVyngA5cJgjN0AII9zvt2+WGVV7uU70xwzbDXOngA+5Jz3AHAZtmay\nO0x/zEqf7jjYukmaAogFMFplVZmuszeEHqMZBD0XQHO75UQAJwTZEhCMsRqwiflXnPPvlOLTjLEm\nyvdNAOQp5e6O11N5okq5SAYAuIkxlg1gHmzdLv8BUIcxVpmY3N7OqmNTvo8HcBb+nwuR5ALI5Zxv\nVZYXwCbwMl/n4QCOcM7zOeelAL4D0B9yX+dKjLiu7vbhFjMI+m8AUpSR80jYBlMWC7bJZ5QR608A\nHOCcv2331WIAlSPdE2HrW68sv1cZLe8L4ILS3FoBYARjrK5SMxoBW//iSQCFjLG+yr7utduWEDjn\nz3DOEznnSbBdrzWc87sBrAVwu7Ka8zFXnovblfW5Uj5e8Y5oBSAFtgGkkLsnOOenAOQwxtopRcMA\n7IfE1xm2rpa+jLGaik2VxyztdbbDiOvqbh/uETmo4seAxBjYvEMyATwn2h4/bR8IWxNqN4Cdyt8Y\n2PoOVwNIV/7XU9ZnAN5XjnUPgFS7bd0PIEP5+5NdeSqAvcpv3oPTwJzg4x+Cai+XZNge1AwA3wKI\nUsqjleUM5ftku98/pxzXIdh5dYTiPQGgO4A05Vr
/AJs3g9TXGcA0AAcVu76AzVNFqusMYC5sYwSl\nsNWoJxlxXd3tw9MfzRQlCIKQBDN0uRAEQRA+QIJOEAQhCSToBEEQkkCCThAEIQkk6ARBEJJAgk4Q\nBCEJJOgEQRCSQIJOEAQhCf8PpK5CeDRWBjAAAAAASUVORK5CYII=\n",
-      "text/plain": [
-       "<matplotlib.figure.Figure at 0x7fe8b9b27fd0>"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    }
-   ],
-   "source": [
-    "%matplotlib inline\n",
-    "from matplotlib import pyplot as plt\n",
-    "plt.plot(waveform)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Assessment"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The following exercise will require you to use everything you've learned so far to GPU-accelerate neural network calculations. Unlike previous exercises, there will not be any solution code available to you. Like this one, the other 2 notebooks in this course also have assessment problems. Those of you who successfully complete all 3 will receive a **certificate of competency** in the course.\n",
-    "\n",
-    "**Please read the directions carefully before beginning your work to ensure the best chance at successfully completing the assessment.**"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Accelerate Neural Network Calculations\n",
-    "\n",
-    "You will be refactoring a simple version of some code that performs work needed to create a hidden layer in a neural network. It normalizes grayscale values, weighs them, and applies an activation function.\n",
-    "\n",
-    "Your task is to move this work to the GPU using the techniques you've learned while retaining the correctness of the calculations."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Load Imports and Initialize Values"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Run this cell to import required libraries and initialize values before beginning your work below."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 73,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "# You should not modify this cell; it contains imports and initial values needed to do work on either\n",
-    "# the CPU or the GPU.\n",
-    "\n",
-    "import numpy as np\n",
-    "from numba import cuda, vectorize\n",
-    "\n",
-    "# Our hidden layer will contain 1M neurons.\n",
-    "# When you assess your work below, this value will be automatically set to 100M.\n",
-    "n = 1000000\n",
-    "\n",
-    "greyscales = np.floor(np.random.uniform(0, 255, n).astype(np.float32))\n",
-    "weights = np.random.normal(.5, .1, n).astype(np.float32)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### GPU Accelerate"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "You will need to make modifications to each of the 3 cells in this section before assessing your work below. Follow the instructions in the comments."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 74,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "# As you will recall, `numpy.exp` works on the CPU but cannot be used in GPU implementations.\n",
-    "# This import will work for the CPU-only boilerplate code provided below, but\n",
-    "# you will need to modify this import before your GPU implementation will work.\n",
-    "# from numpy import exp\n",
-    "from math import exp"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 75,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "# Modify these 3 functions to run on the GPU.\n",
-    "@vectorize(['float32(float32)'], target = 'cuda')\n",
-    "def normalize(grayscales):\n",
-    "    return grayscales / 255\n",
-    "\n",
-    "@vectorize(['float32(float32, float32)'], target = 'cuda')\n",
-    "def weigh(values, weights):\n",
-    "    return values * weights\n",
-    "        \n",
-    "@vectorize(['float32(float32)'], target = 'cuda')\n",
-    "def activate(values):\n",
-    "    return ( exp(values) - exp(-values) ) / ( exp(values) + exp(-values) )"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 84,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "# Modify the body of this function to optimize data transfers and therefore speed up performance.\n",
-    "# As a constraint, even after you move work to the GPU, make this function return a host array.\n",
-    "def create_hidden_layer(n, greyscales, weights, exp, normalize, weigh, activate):\n",
-    "    greyscales_dev = cuda.to_device(greyscales)\n",
-    "    weights_dev = cuda.to_device(weights)\n",
-    "    \n",
-    "    norm_dev = cuda.device_array(shape=(n,), dtype=np.float32)\n",
-    "    weighted_dev = cuda.device_array(shape=(n,), dtype=np.float32)\n",
-    "    activated_dev = cuda.device_array(shape=(n,), dtype=np.float32)\n",
-    "    \n",
-    "    normalize(greyscales_dev, out=norm_dev)\n",
-    "    weigh(norm_dev, weights_dev, out=weighted_dev)\n",
-    "    activate(weighted_dev, out=activated_dev)\n",
-    "    \n",
-    "    out_host = activated_dev.copy_to_host()\n",
-    "    return out_host"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Check Your Work"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Feel free in this section to check your work and debug as needed before running the assessment below."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 85,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "# You probably don't need to edit this cell, unless you change the name of any of the values being passed as\n",
-    "# arguments to `create_hidden_layer` below.\n",
-    "arguments = {\"n\":n,\n",
-    "            \"greyscales\": greyscales,\n",
-    "            \"weights\": weights,\n",
-    "            \"exp\": exp,\n",
-    "            \"normalize\": normalize,\n",
-    "            \"weigh\": weigh,\n",
-    "            \"activate\": activate}"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 86,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "[0.3338969  0.11722048 0.08675494 ... 0.1048661  0.18585935 0.05007837]\n"
-     ]
-    }
-   ],
-   "source": [
-    "# Use this cell (and feel free to create others) to self-assess your function\n",
-    "a = create_hidden_layer(**arguments)\n",
-    "print(a)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Run the Assessment"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Run the following 2 cells to assess your work."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 82,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "from assessment import assess"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 87,
-   "metadata": {
-    "scrolled": false
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Setting n to 100 million.\n",
-      "\n",
-      "Your function returns a host np.ndarray: True\n",
-      "\n",
-      "Your function took 0.56s to run.\n",
-      "Your function runs fast enough (less than 1 second): True\n",
-      "\n",
-      "Your function returns the correct results: True\n",
-      "Congratulations, you passed! See the instructions below for how to get credit for your work to count toward a certificate in the course.\n"
-     ]
-    }
-   ],
-   "source": [
-    "assess(create_hidden_layer, arguments)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Get Credit for Your Work"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "After successfully passing the assessment above, revisit the webpage where you launched this interactive environment and click on the **\"ASSESS TASK\"** button as shown in the screenshot below. Doing so will give you credit for this part of the workshop that counts towards earning a **certificate of competency** for the entire course."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "![Run the assessment](images/run_the_assessment.png)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Summary\n",
-    "\n",
-    "Now that you have completed this session, you are able to:\n",
-    "\n",
-    "- Use Numba to compile Python functions for the CPU\n",
-    "- Understand how Numba compiles functions\n",
-    "- GPU accelerate NumPy ufuncs\n",
-    "- GPU accelerate hand-written vectorized functions\n",
-    "- Optimize memory transfers between the CPU host and GPU device"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Download Content\n",
-    "\n",
-    "To download the contents of this notebook, execute the following cell and then click the download link below. Note: if you run this notebook on a local Jupyter server, some of the file path links in the notebook may be broken because they are specific to our platform. You can still navigate to the files through the Jupyter file navigator."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 88,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "./\r\n",
-      "./.ipynb_checkpoints/\r\n",
-      "./.ipynb_checkpoints/Introduction to CUDA Python with Numba-checkpoint.ipynb\r\n",
-      "./Introduction to CUDA Python with Numba.ipynb\r\n",
-      "./solutions/\r\n",
-      "./solutions/make_pulses_solution.py\r\n",
-      "./solutions/zero_suppress_solution.py\r\n",
-      "./solutions/monte_carlo_pi_solution.py\r\n",
-      "./images/\r\n",
-      "./images/DLI Header.png\r\n",
-      "./images/run_the_assessment.png\r\n",
-      "./images/numba_flowchart.png\r\n",
-      "tar: .: file changed as we read it\r\n"
-     ]
-    }
-   ],
-   "source": [
-    "!tar -zcvf section1.tar.gz ."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "[Download files from this section.](files/section1.tar.gz)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Appendix: Generalized Ufuncs\n",
-    "\n",
-    "Ufuncs broadcast a scalar function over array inputs, but what if you want to broadcast a lower-dimensional array function over a higher-dimensional array? This is called a *generalized ufunc* (\"gufunc\"), and it opens up a whole new frontier for applying ufuncs.\n",
-    "\n",
-    "Generalized ufuncs are a little trickier because they need a *signature* (not to be confused with the Numba type signature) that shows the index ordering when dealing with multiple inputs. Fully explaining \"gufunc\" signatures is beyond the scope of this tutorial, but you can learn more from:\n",
-    "\n",
-    "* The NumPy docs on gufuncs: https://docs.scipy.org/doc/numpy/reference/c-api.generalized-ufuncs.html\n",
-    "* The Numba docs on gufuncs: http://numba.pydata.org/numba-doc/latest/user/vectorize.html#the-guvectorize-decorator\n",
-    "* The Numba docs on CUDA gufuncs: http://numba.pydata.org/numba-doc/latest/cuda/ufunc.html#generalized-cuda-ufuncs\n",
-    "\n",
-    "Let's write our own normalization function. It will take an array input and compute the L2 norm along the last dimension. Generalized ufuncs take their output array as the last argument rather than returning a value. If the output is a scalar, we still receive an array with one fewer dimension than the input. For example, computing the row sums of an array returns a 1D array for a 2D input, or a 2D array for a 3D input."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 89,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "from numba import guvectorize\n",
-    "import math\n",
-    "\n",
-    "@guvectorize(['(float32[:], float32[:])'], # have to include the output array in the type signature\n",
-    "             '(i)->()',                 # map a 1D array to a scalar output\n",
-    "             target='cuda')\n",
-    "def l2_norm(vec, out):\n",
-    "    acc = 0.0\n",
-    "    for value in vec:\n",
-    "        acc += value**2\n",
-    "    out[0] = math.sqrt(acc)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "---"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "To test this, let's construct some points on the unit circle:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 90,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "[[-0.53340748 -0.84585842]\n",
-      " [-0.6538219   0.75664848]\n",
-      " [-0.33922312  0.94070594]\n",
-      " [ 0.66932384 -0.7429708 ]\n",
-      " [-0.80223363  0.59701022]\n",
-      " [-0.88033165  0.47435871]\n",
-      " [ 0.65265486 -0.75765535]\n",
-      " [ 0.83383129 -0.55201937]\n",
-      " [ 0.64985792 -0.76005571]\n",
-      " [-0.83615082 -0.54849959]]\n"
-     ]
-    }
-   ],
-   "source": [
-    "angles = np.random.uniform(-np.pi, np.pi, 10)\n",
-    "coords = np.stack([np.cos(angles), np.sin(angles)], axis=1)\n",
-    "print(coords)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "As expected, the L2 norm is 1.0, up to rounding errors:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 91,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "array([1.        , 1.        , 1.        , 1.        , 1.        ,\n",
-       "       0.99999994, 1.        , 1.        , 1.        , 1.        ],\n",
-       "      dtype=float32)"
-      ]
-     },
-     "execution_count": 91,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "l2_norm(coords)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "<a href=\"https://www.nvidia.com/dli\"> <img src=\"images/DLI Header.png\" alt=\"Header\" style=\"width: 400px;\"/> </a>"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.6.10"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/nvidia1/bumashka.png b/nvidia1/bumashka.png
deleted file mode 100644
index 09053ec..0000000
Binary files a/nvidia1/bumashka.png and /dev/null differ
-- 
cgit v1.2.3