nvidia1: complete

author: leshe4ka46 <alex9102naid1@ya.ru> 2025-09-21 17:29:45 +0300
committer: leshe4ka46 <alex9102naid1@ya.ru> 2025-09-21 17:29:45 +0300
commit: 1aa689c205e8b113445f6c1de78bacad1ebb1080 (patch)
tree: f2dd55bd14cb12c03886a1df70df274b3e39b662
parent: 1d989271b7dc40565d89d9ff6acc9bc9d9eb4fd1 (diff)
4 files changed, 8161 insertions, 206 deletions
diff --git a/nvidia1/Custom+CUDA+Kernels+in+Python+with+Numba.ipynb b/nvidia1/Custom+CUDA+Kernels+in+Python+with+Numba.ipynb
new file mode 100644
index 0000000..9807623
--- /dev/null
+++ b/nvidia1/Custom+CUDA+Kernels+in+Python+with+Numba.ipynb
@@ -0,0 +1,5434 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a href=\"https://www.nvidia.com/dli\"> <img src=\"images/DLI Header.png\" alt=\"Header\" style=\"width: 400px;\"/> </a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Custom CUDA Kernels in Python with Numba\n",
+    "\n",
+    "In this section we will deepen our understanding of how the CUDA programming model organizes parallel work, and use that understanding to write custom CUDA **kernels**: functions which run in parallel on CUDA GPUs. Because they use the CUDA programming model directly, custom CUDA kernels require more work to implement than, for example, simply decorating a ufunc with `@vectorize`. However, they make parallel computing possible in places where ufuncs cannot be used, and provide flexibility that can lead to the highest level of performance.\n",
+    "\n",
+    "This section contains three appendices for those of you interested in further study: a variety of debugging techniques to assist your GPU programming, links to CUDA programming references, and coverage of Numba-supported random number generation on the GPU."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Objectives\n",
+    "\n",
+    "By the time you complete this section you will be able to:\n",
+    "\n",
+    "* Write custom CUDA kernels in Python and launch them with an execution configuration.\n",
+    "* Utilize grid stride loops for working in parallel over large data sets and leveraging memory coalescing.\n",
+    "* Use atomic operations to avoid race conditions when working in parallel."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## The Need for Custom Kernels"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Ufuncs are fantastically elegant, and for any scalar operation that ought to be performed element-wise on data, ufuncs are likely the right tool for the job.\n",
+    "\n",
+    "As you are well aware, there are at least as many classes of problems that cannot be solved by applying the same function to each element of a data set. Consider, for example, any problem that requires access to more than one element of a data structure in order to calculate its output, like stencil algorithms, or any problem that cannot be expressed as a mapping from one input value to one output value, such as a reduction. Many of these problems are still inherently parallelizable, but cannot be expressed by a ufunc.\n",
+    "\n",
+    "Writing custom CUDA kernels, while more challenging than writing GPU accelerated ufuncs, provides developers with tremendous flexibility for the types of functions they can send to run in parallel on the GPU. Furthermore, as you will begin learning in this and the next section, it also provides fine-grained control over *how* the parallelism is conducted by exposing CUDA's thread hierarchy to developers explicitly.\n",
+    "\n",
+    "While remaining purely in Python, the way we write CUDA kernels using Numba is very reminiscent of how developers write them in CUDA C/C++. If you are already familiar with programming in CUDA C/C++, you will likely pick up custom kernels in Python with Numba very rapidly. If you are learning them for the first time, know that the work you do here will also serve you well should you ever need or wish to develop CUDA in C/C++, or to study the wealth of CUDA resources on the web, most of which present CUDA C/C++ code."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Introduction to CUDA Kernels\n",
+    "\n",
+    "When programming in CUDA, developers write functions for the GPU called **kernels**, which are executed, or in CUDA parlance, **launched**, on the GPU's many cores in parallel **threads**. When kernels are launched, programmers use a special syntax, called an **execution configuration** (also called a launch configuration) to describe the parallel execution's configuration.\n",
+    "\n",
+    "The following slides (which will appear after executing the cell below) give a high level introduction to how CUDA kernels can be created to work on large datasets in parallel on the GPU device. Work through the slides and then you will begin writing and executing your own custom CUDA kernels, using the ideas presented in the slides."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "\n",
+       "        <iframe\n",
+       "            width=\"640\"\n",
+       "            height=\"390\"\n",
+       "            src=\"https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/AC_CUDA_Python_1.pptx\"\n",
+       "            frameborder=\"0\"\n",
+       "            allowfullscreen\n",
+       "        ></iframe>\n",
+       "        "
+      ],
+      "text/plain": [
+       "<IPython.lib.display.IFrame at 0x7f8d844ee668>"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from IPython.display import IFrame\n",
+    "IFrame('https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/AC_CUDA_Python_1.pptx', 640, 390)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## A First CUDA Kernel\n",
+    "\n",
+    "Let's start with a concrete and very simple example by rewriting our addition function for 1D NumPy arrays. CUDA kernels are compiled using the `numba.cuda.jit` decorator, which is not to be confused with the `numba.jit` decorator you've already learned, which optimizes functions **for the CPU**.\n",
+    "\n",
+    "This first example highlights some of the essential syntax. This particular function could in fact be written as a ufunc, but we choose it here to keep the focus on learning the syntax; functions better suited to being written as custom kernels follow below. Be sure to read the comments carefully, as they provide some important information about the code."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "from numba import cuda\n",
+    "\n",
+    "# Note the use of an `out` array. CUDA kernels written with `@cuda.jit` do not return values,\n",
+    "# just like their C counterparts. Also, no explicit type signature is required with @cuda.jit\n",
+    "@cuda.jit\n",
+    "def add_kernel(x, y, out):\n",
+    "    \n",
+    "    # The actual values of the following CUDA-provided variables for thread and block indices,\n",
+    "    # like function parameters, are not known until the kernel is launched.\n",
+    "    \n",
+    "    # This calculation gives a unique thread index within the entire grid (see the slides above for more)\n",
+    "    idx = cuda.grid(1)          # 1 = one dimensional thread grid, returns a single value.\n",
+    "                                # This Numba-provided convenience function is equivalent to\n",
+    "                                # `cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x`\n",
+    "\n",
+    "    # This thread will do the work on the data element with the same index as its own\n",
+    "    # unique index within the grid.\n",
+    "    out[idx] = x[idx] + y[idx]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "\n",
+    "n = 4096\n",
+    "x = np.arange(n).astype(np.int32) # [0...4095] on the host\n",
+    "y = np.ones_like(x)               # [1...1] on the host\n",
+    "\n",
+    "d_x = cuda.to_device(x) # Copy of x on the device\n",
+    "d_y = cuda.to_device(y) # Copy of y on the device\n",
+    "d_out = cuda.device_array_like(d_x) # Like np.empty_like, but for device arrays (contents are uninitialized)\n",
+    "\n",
+    "# Because of how we wrote the kernel above, we need a one-thread-to-one-data-element mapping,\n",
+    "# so we set the number of threads in the grid (128*32) equal to n (4096).\n",
+    "threads_per_block = 128\n",
+    "blocks_per_grid = 32"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[   1    2    3 ... 4094 4095 4096]\n"
+     ]
+    }
+   ],
+   "source": [
+    "add_kernel[blocks_per_grid, threads_per_block](d_x, d_y, d_out)\n",
+    "cuda.synchronize()\n",
+    "print(d_out.copy_to_host()) # Should be [1...4096]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Exercise: Tweak the Code\n",
+    "\n",
+    "Make the following minor changes to the code above to see how it affects its execution. Make educated guesses about what will happen before running the code:\n",
+    "\n",
+    "* Decrease the `threads_per_block` variable\n",
+    "* Decrease the `blocks_per_grid` variable\n",
+    "* Increase the `threads_per_block` and/or `blocks_per_grid` variables\n",
+    "* Remove or comment out the `cuda.synchronize()` call\n",
+    "\n",
+    "### Results\n",
+    "\n",
+    "In the example above, because the kernel is written so that each thread works on exactly one data element, it is essential for the number of threads in the grid to equal the number of data elements.\n",
+    "\n",
+    "By **reducing the number of threads in the grid**, whether by reducing the number of blocks or the number of threads per block, some elements are left with no thread to process them, and we can see in the output that the elements toward the end of the `d_out` array were never written. Because `cuda.grid(1)` is computed from the block size actually used at launch, the launched threads always cover a contiguous range of indices starting at 0, so in both cases it is the tail of `d_out` that is left unprocessed.\n",
+    "\n",
+    "**Increasing the size of the grid** in fact creates issues with out of bounds memory access. This error will not show in your code presently, but later in this section you will learn how to expose this error using `cuda-memcheck` and debug it.\n",
+    "\n",
+    "You might have expected that **removing the synchronization point** would have resulted in a print showing that little or no work had been done. This is a reasonable guess, since without a synchronization point the CPU continues to work asynchronously while the GPU is processing. The detail to learn here is that memory copies carry implicit synchronization, making the call to `cuda.synchronize` above unnecessary."
+   ]
+  },
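+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A common way to make a one-thread-per-element kernel safe against an oversized grid is to guard the write with a bounds check. The cell below is a minimal sketch of that idea and is not part of the original exercise: `add_kernel_guarded`, `blocks_for`, and the `*_g` variables are hypothetical names introduced here for illustration, and running it assumes a CUDA-capable GPU."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "from numba import cuda\n",
+    "\n",
+    "# Hypothetical variant of add_kernel (illustration only): threads whose index\n",
+    "# falls past the end of the data do nothing instead of accessing memory out of bounds.\n",
+    "@cuda.jit\n",
+    "def add_kernel_guarded(x, y, out):\n",
+    "    idx = cuda.grid(1)\n",
+    "    if idx < out.shape[0]:\n",
+    "        out[idx] = x[idx] + y[idx]\n",
+    "\n",
+    "# Hypothetical helper: the smallest number of blocks whose threads cover n elements.\n",
+    "def blocks_for(n, threads_per_block):\n",
+    "    return (n + threads_per_block - 1) // threads_per_block\n",
+    "\n",
+    "n_g = 5000                                  # deliberately not a multiple of 128\n",
+    "x_g = np.arange(n_g).astype(np.int32)\n",
+    "y_g = np.ones_like(x_g)\n",
+    "d_x_g = cuda.to_device(x_g)\n",
+    "d_y_g = cuda.to_device(y_g)\n",
+    "d_out_g = cuda.device_array_like(d_x_g)\n",
+    "\n",
+    "tpb_g = 128\n",
+    "# 40 blocks * 128 threads = 5120 threads; the last 120 fail the bounds check and exit\n",
+    "add_kernel_guarded[blocks_for(n_g, tpb_g), tpb_g](d_x_g, d_y_g, d_out_g)\n",
+    "np.testing.assert_array_equal(d_out_g.copy_to_host(), x_g + y_g)"
+   ]
+  },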
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Exercise: Accelerate a CPU Function as a Custom CUDA Kernel\n",
+    "\n",
+    "Below is a CPU scalar function `square_device` that could be used as a CPU ufunc. Your job is to refactor it to run as a CUDA kernel decorated with the `@cuda.jit` decorator.\n",
+    "\n",
+    "You might think that making this function run on the device could be much more easily done with `@vectorize`, and you would be correct. But this scenario will give you a chance to work with all the syntax we've introduced before moving on to more complicated and realistic examples.\n",
+    "\n",
+    "In this exercise you will need to:\n",
+    "* Refactor the `square_device` definition to be a CUDA kernel that will do one thread's worth of work on a single element.\n",
+    "* Refactor the `d_a` and `d_out` arrays below to be CUDA device arrays.\n",
+    "* Modify the `blocks` and `threads` variables to appropriate values for the provided `n`.\n",
+    "* Refactor the call to `square_device` to be a kernel launch that includes an execution configuration.\n",
+    "\n",
+    "The assertion test below will fail until you successfully implement the above. If you get stuck, feel free to check out a [solution](../edit/solutions/square_device_solution.py)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 33,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "# Refactor to be a CUDA kernel doing one thread's work.\n",
+    "# Don't forget that when using `@cuda.jit`, you must provide an output array as no value will be returned.\n",
+    "def square_device(a):\n",
+    "    return a**2\n",
+    "\n",
+    "@cuda.jit\n",
+    "def square_kernel(a, out):\n",
+    "    idx = cuda.grid(1)\n",
+    "    out[idx] = a[idx]*a[idx]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 34,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "# Leave the values in this cell fixed for this exercise\n",
+    "n = 4096\n",
+    "\n",
+    "a = np.arange(n)\n",
+    "out = a**2 # `out` will only be used for testing below"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "d_a = cuda.to_device(a)                  # TODO make `d_a` a device array\n",
+    "d_out = cuda.device_array_like(a) # TODO: make d_out a device array\n",
+    "\n",
+    "# TODO: Update the execution configuration for the amount of work needed\n",
+    "blocks = 128\n",
+    "threads = 32\n",
+    "\n",
+    "# TODO: Launch as a kernel with an appropriate execution configuration\n",
+    "# d_out = square_device(d_a)\n",
+    "square_kernel[blocks, threads](d_a, d_out)\n",
+    "cuda.synchronize()\n",
+    "d_out = d_out.copy_to_host()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 36,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from numpy import testing\n",
+    "testing.assert_almost_equal(d_out, out)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## An Aside on Hiding Latency and Execution Configuration Choices"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "CUDA enabled NVIDIA GPUs consist of several [**Streaming Multiprocessors**](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-implementation), or **SMs** on a die, with attached DRAM. SMs contain all required resources for the execution of kernel code including many CUDA cores. When a kernel is launched, each block is assigned to a single SM, with potentially many blocks assigned to a single SM. SMs partition blocks into further subdivisions of 32 threads called **warps** and it is these warps which are given parallel instructions to execute.\n",
+    "\n",
+    "When an instruction takes more than one clock cycle to complete (or in CUDA parlance, to **expire**) the SM can continue to do meaningful work *if it has additional warps that are ready to be issued new instructions.* Because of very large register files on the SMs, there is no time penalty for an SM to change context between issuing instructions to one warp or another. In short, the latency of operations can be hidden by SMs with other meaningful work so long as there is other work to be done.\n",
+    "\n",
+    "**Therefore, to utilize the full potential of the GPU and write performant accelerated applications, it is essential to give SMs the ability to hide latency by providing them with a sufficient number of warps. The simplest way to do this is to launch kernels with sufficiently large grid and block dimensions.**\n",
+    "\n",
+    "Deciding the very best size for the CUDA thread grid is a complex problem, and depends on both the algorithm and the specific GPU's [compute capability](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities), but here are some very rough heuristics that we tend to follow and which can work well for getting started:\n",
+    "\n",
+    "  * The size of a block should be a multiple of 32 threads (the size of a warp), with typical block sizes between 128 and 512 threads per block.\n",
+    "  * The size of the grid should ensure the full GPU is utilized where possible. Launching a grid where the number of blocks is 2x-4x the number of SMs on the GPU is a good starting place. Something in the range of 20 - 100 blocks usually works well.\n",
+    "  * The CUDA kernel launch overhead does increase with the number of blocks, so when the input size is very large we find it best not to launch a grid where the number of threads equals the number of input elements, which would result in a tremendous number of blocks. Instead, for large inputs we use a pattern to which we will now turn our attention."
+   ]
+  },
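+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The heuristics above can be turned into a small calculation. The cell below is a rough sketch only: `suggest_config` is a hypothetical helper, and `sm_count=20` and the default of 4 blocks per SM are assumed values chosen for illustration. It runs on the CPU and needs no GPU. Note that whenever the suggested grid has fewer threads than data elements, the kernel must process several elements per thread, which is the pattern introduced next."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Hypothetical helper (illustration only) that applies the rough heuristics above.\n",
+    "def suggest_config(n, sm_count, threads_per_block=128, blocks_per_sm=4):\n",
+    "    # Keep the block size a multiple of the 32-thread warp size\n",
+    "    assert threads_per_block % 32 == 0\n",
+    "    # Blocks that a one-thread-per-element launch would need\n",
+    "    blocks_needed = (n + threads_per_block - 1) // threads_per_block\n",
+    "    # Cap the grid at a few blocks per SM rather than growing it with the input size\n",
+    "    blocks = max(1, min(blocks_needed, blocks_per_sm * sm_count))\n",
+    "    return blocks, threads_per_block\n",
+    "\n",
+    "# sm_count=20 is an assumed value. On a real device it can be read with\n",
+    "# cuda.get_current_device().MULTI_PROCESSOR_COUNT\n",
+    "print(suggest_config(4096, sm_count=20))     # (32, 128): small input, one thread per element\n",
+    "print(suggest_config(1000000, sm_count=20))  # (80, 128): large input, grid capped at 4 blocks per SM"
+   ]
+  },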
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Working on Large Datasets with Grid Stride Loops\n",
+    "\n",
+    "The following slides give a high level overview of a technique called a **grid stride loop** which will create flexible kernels where each thread is able to work on more than one data element, an essential technique for large datasets. Execute the cell to load the slides."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 37,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "\n",
+       "        <iframe\n",
+       "            width=\"640\"\n",
+       "            height=\"390\"\n",
+       "            src=\"https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/AC_CUDA_Python_2.pptx\"\n",
+       "            frameborder=\"0\"\n",
+       "            allowfullscreen\n",
+       "        ></iframe>\n",
+       "        "
+      ],
+      "text/plain": [
+       "<IPython.lib.display.IFrame at 0x7f8d53069710>"
+      ]
+     },
+     "execution_count": 37,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from IPython.display import IFrame\n",
+    "IFrame('https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/AC_CUDA_Python_2.pptx', 640, 390)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## A First Grid Stride Loop\n",
+    "\n",
+    "Let's refactor the `add_kernel` above to utilize a grid stride loop, so that we can launch it flexibly on larger data sets while gaining the benefits of global **memory coalescing**, which allows parallel threads to access memory in contiguous chunks, a scenario which the GPU can leverage to reduce the total number of memory operations:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 39,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "from numba import cuda\n",
+    "\n",
+    "@cuda.jit\n",
+    "def add_kernel(x, y, out):\n",
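+    "    # Unique index of this thread within the entire grid, used as the first\n",
+    "    # data element this thread will work on\n",
+    "    start = cuda.grid(1)\n",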
+    "    \n",
+    "    # This calculation gives the total number of threads in the entire grid\n",
+    "    stride = cuda.gridsize(1)   # 1 = one dimensional thread grid, returns a single value.\n",
+    "                                # This Numba-provided convenience function is equivalent to\n",
+    "                                # `cuda.blockDim.x * cuda.gridDim.x`\n",
+    "\n",
+    "    # This thread will start work at the data element index equal to that of its own\n",
+    "    # unique index in the grid, and then, will stride the number of threads in the grid each\n",
+    "    # iteration so long as it has not stepped out of the data's bounds. In this way, each\n",
+    "    # thread may work on more than one data element, and together, all threads will work on\n",
+    "    # every data element.\n",
+    "    for i in range(start, x.shape[0], stride):\n",
+    "        # Assuming x and y inputs are same length\n",
+    "        out[i] = x[i] + y[i]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 42,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "(100000,)\n"
+     ]
+    }
+   ],
+   "source": [
+    "import numpy as np\n",
+    "\n",
+    "n = 100000 # This is far more elements than threads in our grid\n",
+    "x = np.arange(n).astype(np.int32)\n",
+    "y = np.ones_like(x)\n",
+    "\n",
+    "d_x = cuda.to_device(x)\n",
+    "print(d_x.shape)\n",
+    "d_y = cuda.to_device(y)\n",
+    "d_out = cuda.device_array_like(d_x)\n",
+    "\n",
+    "threads_per_block = 128\n",
+    "blocks_per_grid = 30"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 43,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[     1      2      3 ...  99998  99999 100000]\n"
+     ]
+    }
+   ],
+   "source": [
+    "add_kernel[blocks_per_grid, threads_per_block](d_x, d_y, d_out)\n",
+    "print(d_out.copy_to_host()) # Remember, memory copy carries implicit synchronization"
+   ]
+  },
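+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To convince yourself that a grid stride loop visits every element exactly once, it can help to model the index arithmetic on the CPU. The cell below is a sketch for illustration only: `simulate_grid_stride` is a hypothetical helper that replays the loop bounds of `add_kernel` for each thread in turn, using the same sizes as the launch above. It needs no GPU and says nothing about performance."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "\n",
+    "# CPU-only model of the grid stride loop above (hypothetical helper, illustration only).\n",
+    "def simulate_grid_stride(n, blocks_per_grid, threads_per_block):\n",
+    "    stride = blocks_per_grid * threads_per_block  # the value cuda.gridsize(1) would return\n",
+    "    visits = np.zeros(n, dtype=np.int64)\n",
+    "    # One outer iteration per simulated thread; `start` plays the role of cuda.grid(1)\n",
+    "    for start in range(stride):\n",
+    "        for i in range(start, n, stride):\n",
+    "            visits[i] += 1\n",
+    "    return visits\n",
+    "\n",
+    "visits = simulate_grid_stride(100000, 30, 128)\n",
+    "assert (visits == 1).all()  # every element is handled by exactly one thread\n",
+    "\n",
+    "# Number of elements handled by the thread with index 0 (the busiest thread)\n",
+    "print(len(range(0, 100000, 30 * 128)))  # 27"
+   ]
+  },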
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Exercise: Implement a Grid Stride Loop\n",
+    "\n",
+    "Refactor the following CPU scalar `hypot_stride` function to run as a CUDA Kernel utilizing a grid stride loop. Feel free to look at [the solution](../edit/solutions/hypot_stride_solution.py) if you get stuck."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 48,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "from math import hypot\n",
+    "from numba import cuda\n",
+    "\n",
+    "@cuda.jit\n",
+    "def hypot_stride(a, b, c):\n",
+    "    start = cuda.grid(1)\n",
+    "    stride = cuda.gridsize(1)\n",
+    "    for i in range(start, a.shape[0], stride):\n",
+    "        c[i] = hypot(a[i], b[i])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 49,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# You do not need to modify the contents in this cell\n",
+    "n = 1000000\n",
+    "a = np.random.uniform(-12, 12, n).astype(np.float32)\n",
+    "b = np.random.uniform(-12, 12, n).astype(np.float32)\n",
+    "d_a = cuda.to_device(a)\n",
+    "d_b = cuda.to_device(b)\n",
+    "d_c = cuda.device_array_like(d_b)\n",
+    "\n",
+    "blocks = 128\n",
+    "threads_per_block = 64\n",
+    "\n",
+    "hypot_stride[blocks, threads_per_block](d_a, d_b, d_c)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 50,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "from numpy import testing\n",
+    "# This assertion will fail until you successfully implement the hypot_stride kernel above\n",
+    "testing.assert_almost_equal(np.hypot(a,b), d_c.copy_to_host(), decimal=5)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Timing the Kernel\n",
+    "\n",
+    "Let's take the time to do some performance timing for the `hypot_stride` kernel. If you weren't able to successfully implement it, copy and execute [the solution](../edit/solutions/hypot_stride_solution.py) before timing."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### CPU Baseline\n",
+    "\n",
+    "First let's get a baseline with `np.hypot`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 51,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "6.07 ms ± 2.38 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit np.hypot(a, b)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Numba on the CPU\n",
+    "\n",
+    "Next let's see about a CPU optimized version:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 52,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "from numba import jit\n",
+    "\n",
+    "@jit\n",
+    "def numba_hypot(a, b):\n",
+    "    return np.hypot(a, b)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 53,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "5.68 ms ± 1.72 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit numba_hypot(a, b)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Single Threaded on the Device\n",
+    "\n",
+    "Just to see, let's launch our kernel in a grid with only a single thread. Here we will use `%time`, which only runs the statement once to ensure our measurement isn't affected by the finite depth of the CUDA kernel queue. We will also add a `cuda.synchronize` to be sure we don't get any inaccurate times on account of returning control to the CPU, where the timer is, before the kernel completes:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 54,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 188 ms, sys: 128 ms, total: 316 ms\n",
+      "Wall time: 316 ms\n"
+     ]
+    }
+   ],
+   "source": [
+    "%time hypot_stride[1, 1](d_a, d_b, d_c); cuda.synchronize()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "It is hopefully not too much of a surprise that this is far slower than even the baseline CPU execution."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Parallel on the Device"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 55,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n",
+      "Wall time: 696 µs\n"
+     ]
+    }
+   ],
+   "source": [
+    "%time hypot_stride[128, 64](d_a, d_b, d_c); cuda.synchronize()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "That's much faster!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Atomic Operations and Avoiding Race Conditions\n",
+    "\n",
+    "CUDA, like many general purpose parallel execution frameworks, makes it possible to have race conditions in your code. A race condition in CUDA arises when threads read from or write to a memory location that might be modified by another independent thread. Generally speaking, you need to worry about:\n",
+    "\n",
+    " * read-after-write hazards: One thread is reading a memory location at the same time another thread might be writing to it.\n",
+    " * write-after-write hazards: Two threads are writing to the same memory location, and only one write will be visible when the kernel is complete.\n",
+    " \n",
+    "A common strategy to avoid both of these hazards is to organize your CUDA kernel algorithm such that each thread has exclusive responsibility for unique subsets of output array elements, and/or to never use the same array for both input and output in a single kernel call. (Iterative algorithms can use a double-buffering strategy if needed, and switch input and output arrays on each iteration.)\n",
+    "\n",
+    "However, there are many cases where different threads need to combine results. Consider something very simple, like: \"every thread increments a global counter.\" Implementing this in your kernel requires each thread to:\n",
+    "\n",
+    "1. Read the current value of a global counter.\n",
+    "2. Compute `counter + 1`.\n",
+    "3. Write that value back to global memory.\n",
+    "\n",
+    "However, there is no guarantee that another thread has not changed the global counter between steps 1 and 3. To resolve this problem, CUDA provides **atomic operations** which will read, modify and update a memory location in one, indivisible step. Numba supports several of these functions, [described here](http://numba.pydata.org/numba-doc/dev/cuda/intrinsics.html#supported-atomic-operations).\n",
+    "\n",
+    "Let's make our thread counter kernel:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 56,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "@cuda.jit\n",
+    "def thread_counter_race_condition(global_counter):\n",
+    "    global_counter[0] += 1  # This is bad\n",
+    "    \n",
+    "@cuda.jit\n",
+    "def thread_counter_safe(global_counter):\n",
+    "    cuda.atomic.add(global_counter, 0, 1)  # Safely add 1 to offset 0 in global_counter array"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 57,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Should be 4096: [1]\n"
+     ]
+    }
+   ],
+   "source": [
+    "# This gets the wrong answer\n",
+    "global_counter = cuda.to_device(np.array([0], dtype=np.int32))\n",
+    "thread_counter_race_condition[64, 64](global_counter)\n",
+    "\n",
+    "print('Should be %d:' % (64*64), global_counter.copy_to_host())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 58,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Should be 4096: [4096]\n"
+     ]
+    }
+   ],
+   "source": [
+    "# This works correctly\n",
+    "global_counter = cuda.to_device(np.array([0], dtype=np.int32))\n",
+    "thread_counter_safe[64, 64](global_counter)\n",
+    "\n",
+    "print('Should be %d:' % (64*64), global_counter.copy_to_host())"
+   ]
+  },
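+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The lost updates in the unsafe kernel can be modeled on the CPU by replaying the three steps listed earlier in one particular order. The cell below is a deliberately simplified sketch: it assumes every thread performs its read before any thread performs its write, which is only one of many possible interleavings and is not a description of how the hardware actually schedules threads. The names `n_threads`, `counter`, and `values_read` are introduced here for illustration."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# CPU-only model of a lost update (illustration only, no GPU needed)\n",
+    "n_threads = 64 * 64\n",
+    "counter = [0]\n",
+    "\n",
+    "# Step 1: every simulated thread reads the counter before anyone has written, so all read 0\n",
+    "values_read = [counter[0] for _ in range(n_threads)]\n",
+    "\n",
+    "# Steps 2 and 3: each thread computes its value + 1 and writes it back, so all write 1\n",
+    "for value in values_read:\n",
+    "    counter[0] = value + 1\n",
+    "\n",
+    "print(counter[0])  # 1, not 4096: all but one increment is lost in this interleaving"
+   ]
+  },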
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Assessment\n",
+    "\n",
+    "The following exercise will require you to utilize everything you've learned so far. Unlike previous exercises, there will not be any solution code available to you, and there are a couple of additional steps you will need to take to \"run the assessment\" and get a score for your attempt(s). **Please read the directions carefully before beginning your work to ensure the best chance at successfully completing the assessment.**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### How to Run the Assessment\n",
+    "\n",
+    "Take the following steps to complete this assessment:\n",
+    "\n",
+    "1. Using the instructions that follow, work on the cells below as you usually would for an exercise.\n",
+    "2. When you are satisfied with your work, follow the instructions below to copy and paste code into the linked source code files. Be sure to save the files after you paste your work.\n",
+    "3. Return to the browser tab you used to launch this notebook, and click on the **\"Assess\"** button. After a few seconds a score will be generated along with a helpful message.\n",
+    "\n",
+    "You are welcome to click on the **Assess** button as many times as you like, so if you don't pass the first time, feel free to make additional modifications to your code and repeat steps 1 through 3. Good luck!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "![Run the assessment](images/run_the_assessment.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "collapsed": true
+   },
+   "source": [
+    "### Write an Accelerated Histogramming Kernel\n",
+    "\n",
+    "For this assessment, you will create an accelerated histogramming kernel. This will take an array of input data, a range, and a number of bins, and count how many of the input data elements land in each bin. Below is a working CPU implementation of histogramming to serve as an example for your work:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 62,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "def cpu_histogram(x, xmin, xmax, histogram_out):\n",
+    "    '''Increment bin counts in histogram_out, given histogram range [xmin, xmax).'''\n",
+    "    # Note that we don't have to pass in nbins explicitly, because the size of histogram_out determines it\n",
+    "    nbins = histogram_out.shape[0]\n",
+    "    bin_width = (xmax - xmin) / nbins\n",
+    "    \n",
+    "    # This is a very slow way to do this with NumPy, but looks similar to what you will do on the GPU\n",
+    "    for element in x:\n",
+    "        bin_number = np.int32((element - xmin)/bin_width)\n",
+    "        if bin_number >= 0 and bin_number < histogram_out.shape[0]:\n",
+    "            # only increment if in range\n",
+    "            histogram_out[bin_number] += 1"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 63,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([   3,   88,  445, 1576, 2969, 2854, 1548,  442,   72,    3],\n",
+       "      dtype=int32)"
+      ]
+     },
+     "execution_count": 63,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "x = np.random.normal(size=10000, loc=0, scale=1).astype(np.float32)\n",
+    "xmin = np.float32(-4.0)\n",
+    "xmax = np.float32(4.0)\n",
+    "histogram_out = np.zeros(shape=10, dtype=np.int32)\n",
+    "\n",
+    "cpu_histogram(x, xmin, xmax, histogram_out)\n",
+    "\n",
+    "histogram_out"
+   ]
+  },
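+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a quick cross-check of the loop above, the same binning rule can be sketched in vectorized NumPy. This is only an illustrative sketch: `numpy_histogram` and `sample` are hypothetical names that are not part of the course material, and the sketch assumes the same truncating bin computation and the same in-range test that `cpu_histogram` uses.\n",
+    "\n",
+    "``` python\n",
+    "import numpy as np\n",
+    "\n",
+    "def numpy_histogram(x, xmin, xmax, nbins):\n",
+    "    # Same bin formula as cpu_histogram, applied to the whole array at once\n",
+    "    bin_width = (xmax - xmin) / nbins\n",
+    "    bin_numbers = ((x - xmin) / bin_width).astype(np.int32)\n",
+    "    # Keep only the elements whose bin index is valid\n",
+    "    in_range = (bin_numbers >= 0) & (bin_numbers < nbins)\n",
+    "    # Count how many elements landed in each bin\n",
+    "    return np.bincount(bin_numbers[in_range], minlength=nbins).astype(np.int32)\n",
+    "\n",
+    "# Small hand-checkable input: the last value (5.0) is out of range\n",
+    "sample = np.array([-3.9, -0.1, 0.1, 0.2, 3.9, 5.0], dtype=np.float32)\n",
+    "print(numpy_histogram(sample, -4.0, 4.0, 10))\n",
+    "```\n",
+    "\n",
+    "Calling `numpy_histogram(x, xmin, xmax, 10)` on the arrays defined above should agree with `histogram_out`, apart from possible floating-point edge cases for values that sit exactly on a bin boundary."
+   ]
+  },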
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Using a grid stride loop and atomic operations, implement your solution in the cell below. After making any modifications, and before running the assessment, paste this cell's content into [**`assessment/histogram.py`**](../edit/assessment/histogram.py) and save it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 73,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "@cuda.jit\n",
+    "def cuda_histogram(x, xmin, xmax, histogram_out):\n",
+    "    '''Increment bin counts in histogram_out, given histogram range [xmin, xmax).'''\n",
+    "    nbins = histogram_out.shape[0]\n",
+    "    bin_width = (xmax - xmin) / nbins\n",
+    "    \n",
+    "    start = cuda.grid(1)\n",
+    "    stride = cuda.gridsize(1)\n",
+    "    for i in range(start, x.shape[0], stride):\n",
+    "        bin_number = np.int32((x[i] - xmin)/bin_width)\n",
+    "        if bin_number >= 0 and bin_number < histogram_out.shape[0]:\n",
+    "            cuda.atomic.add(histogram_out, bin_number, 1)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 74,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "d_x = cuda.to_device(x)\n",
+    "d_histogram_out = cuda.to_device(np.zeros(shape=10, dtype=np.int32))\n",
+    "\n",
+    "blocks = 128\n",
+    "threads_per_block = 64\n",
+    "\n",
+    "cuda_histogram[blocks, threads_per_block](d_x, xmin, xmax, d_histogram_out)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 75,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "# This assertion will fail until you correctly implement `cuda_histogram`\n",
+    "np.testing.assert_array_almost_equal(d_histogram_out.copy_to_host(), histogram_out, decimal=2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Summary\n",
+    "\n",
+    "In this section you learned how to:\n",
+    "\n",
+    "* Write custom CUDA kernels in Python and launch them with an execution configuration.\n",
+    "* Utilize grid stride loops for working in parallel over large data sets and leveraging memory coalescing.\n",
+    "* Use atomic operations to avoid race conditions when working in parallel."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Download Content\n",
+    "\n",
+    "To download the contents of this notebook, execute the following cell and then click the download link below. Note: If you run this notebook on a local Jupyter server, some of the file path links in the notebook may be broken because they are tailored to our platform. You can still navigate to the files through the Jupyter file navigator."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "!tar -zcvf section2.tar.gz ."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "[Download files from this section.](files/section2.tar.gz)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Appendix: Troubleshooting and Debugging\n",
+    "\n",
+    "### Note about the Terminal\n",
+    "\n",
+    "Debugging is an important part of programming.  Unfortunately, it is difficult to debug CUDA kernels directly in a Jupyter notebook, so this notebook demonstrates terminal commands by running them from notebook cells through the shell.  These shell commands appear in notebook cells with the command line prefixed by `!`. When applying the debugging methods described in this notebook, you will likely run the commands directly in a terminal.\n",
+    "\n",
+    "### Printing\n",
+    "\n",
+    "A common debugging strategy is printing to the console.  Numba supports printing from CUDA kernels, with some restrictions.  Note that output printed from a CUDA kernel will not be captured by Jupyter, so you will need to debug with a script you can run from the terminal.\n",
+    "\n",
+    "Let's look at a CUDA kernel with a bug:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 59,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "import numpy as np\r\n",
+      "\r\n",
+      "from numba import cuda\r\n",
+      "\r\n",
+      "@cuda.jit\r\n",
+      "def histogram(x, xmin, xmax, histogram_out):\r\n",
+      "    nbins = histogram_out.shape[0]\r\n",
+      "    bin_width = (xmax - xmin) / nbins\r\n",
+      "\r\n",
+      "    start = cuda.grid(1)\r\n",
+      "    stride = cuda.gridsize(1)\r\n",
+      "\r\n",
+      "    for i in range(start, x.shape[0], stride):\r\n",
+      "        bin_number = np.int32((x[i] - xmin)/bin_width)\r\n",
+      "        if bin_number >= 0 and bin_number < histogram_out.shape[0]:\r\n",
+      "            histogram_out[bin_number] += 1\r\n",
+      "\r\n",
+      "x = np.random.normal(size=50, loc=0, scale=1).astype(np.float32)\r\n",
+      "xmin = np.float32(-4.0)\r\n",
+      "xmax = np.float32(4.0)\r\n",
+      "histogram_out = np.zeros(shape=10, dtype=np.int32)\r\n",
+      "\r\n",
+      "histogram[64, 64](x, xmin, xmax, histogram_out)\r\n",
+      "\r\n",
+      "print('input count:', x.shape[0])\r\n",
+      "print('histogram:', histogram_out)\r\n",
+      "print('count:', histogram_out.sum())\r\n"
+     ]
+    }
+   ],
+   "source": [
+    "! cat debug/ex1.py"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "collapsed": true
+   },
+   "source": [
+    "When we run this code to histogram 50 values, we see the histogram is not getting 50 entries: "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 76,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "input count: 50\r\n",
+      "histogram: [0 1 1 1 1 1 1 0 0 0]\r\n",
+      "count: 6\r\n"
+     ]
+    }
+   ],
+   "source": [
+    "! python debug/ex1.py"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "*(You might have already spotted the mistake, but let's pretend we don't know the answer.)*\n",
+    "\n",
+    "We hypothesize that a bin calculation error is causing many of the histogram entries to fall out of range.  Let's add some printing around the `if` statement to show us what is going on:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 61,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "import numpy as np\r\n",
+      "\r\n",
+      "from numba import cuda\r\n",
+      "\r\n",
+      "@cuda.jit\r\n",
+      "def histogram(x, xmin, xmax, histogram_out):\r\n",
+      "    nbins = histogram_out.shape[0]\r\n",
+      "    bin_width = (xmax - xmin) / nbins\r\n",
+      "\r\n",
+      "    start = cuda.grid(1)\r\n",
+      "    stride = cuda.gridsize(1)\r\n",
+      "\r\n",
+      "    for i in range(start, x.shape[0], stride):\r\n",
+      "        bin_number = np.int32((x[i] - xmin)/bin_width)\r\n",
+      "        if bin_number >= 0 and bin_number < histogram_out.shape[0]:\r\n",
+      "            histogram_out[bin_number] += 1\r\n",
+      "            print('in range', x[i], bin_number)\r\n",
+      "        else:\r\n",
+      "            print('out of range', x[i], bin_number)\r\n",
+      "\r\n",
+      "x = np.random.normal(size=50, loc=0, scale=1).astype(np.float32)\r\n",
+      "xmin = np.float32(-4.0)\r\n",
+      "xmax = np.float32(4.0)\r\n",
+      "histogram_out = np.zeros(shape=10, dtype=np.int32)\r\n",
+      "\r\n",
+      "histogram[64, 64](x, xmin, xmax, histogram_out)\r\n",
+      "\r\n",
+      "print('input count:', x.shape[0])\r\n",
+      "print('histogram:', histogram_out)\r\n",
+      "print('count:', histogram_out.sum())\r\n"
+     ]
+    }
+   ],
+   "source": [
+    "! cat debug/ex1a.py"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This kernel will print every value and bin number it calculates.  Looking at one of the print statements, we see that `print` supports constant strings and scalar values:\n",
+    "\n",
+    "``` python\n",
+    "print('in range', x[i], bin_number)\n",
+    "```\n",
+    "\n",
+    "String substitution (using C printf syntax or the newer `format()` syntax) is not supported.  If we run this script we see:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 77,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "in range 0.261026 5\r\n",
+      "in range -2.102431 2\r\n",
+      "in range 0.799183 5\r\n",
+      "in range 1.051908 6\r\n",
+      "in range -0.201711 4\r\n",
+      "in range -1.698864 2\r\n",
+      "in range 0.248762 5\r\n",
+      "in range 1.782836 7\r\n",
+      "in range -0.594408 4\r\n",
+      "in range 1.867431 7\r\n",
+      "in range 0.418070 5\r\n",
+      "in range 0.365282 5\r\n",
+      "in range -0.655639 4\r\n",
+      "in range 0.817385 6\r\n",
+      "in range 0.646000 5\r\n",
+      "in range 0.776718 5\r\n",
+      "in range -0.665656 4\r\n",
+      "in range 0.431279 5\r\n",
+      "in range 0.480257 5\r\n",
+      "in range 0.769916 5\r\n",
+      "in range 0.386032 5\r\n",
+      "in range -0.824273 3\r\n",
+      "in range -0.310682 4\r\n",
+      "in range -1.554290 3\r\n",
+      "in range 1.897843 7\r\n",
+      "in range -0.788933 4\r\n",
+      "in range -0.509624 4\r\n",
+      "in range -0.854971 3\r\n",
+      "in range 0.470186 5\r\n",
+      "in range 1.196934 6\r\n",
+      "in range 0.821883 6\r\n",
+      "in range 1.011266 6\r\n",
+      "in range -3.438190 0\r\n",
+      "in range 0.612806 5\r\n",
+      "in range 0.789266 5\r\n",
+      "in range -2.211243 2\r\n",
+      "in range 1.039794 6\r\n",
+      "in range 2.000385 7\r\n",
+      "in range -1.390927 3\r\n",
+      "in range 1.432608 6\r\n",
+      "in range 0.208954 5\r\n",
+      "in range -1.194161 3\r\n",
+      "in range 0.558909 5\r\n",
+      "in range 0.494454 5\r\n",
+      "in range 0.149325 5\r\n",
+      "in range -0.593924 4\r\n",
+      "in range 0.702312 5\r\n",
+      "in range 0.765463 5\r\n",
+      "in range -1.847362 2\r\n",
+      "in range 2.459083 8\r\n",
+      "input count: 50\r\n",
+      "histogram: [1 0 1 1 1 1 1 1 1 0]\r\n",
+      "count: 8\r\n"
+     ]
+    }
+   ],
+   "source": [
+    "! python debug/ex1a.py"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Scanning down that output, we see that all 50 values should be in range.  Clearly we have some kind of race condition updating the histogram.  In fact, the culprit line is:\n",
+    "\n",
+    "``` python\n",
+    "histogram_out[bin_number] += 1\n",
+    "```\n",
+    "\n",
+    "which should be (as you may have seen in a previous exercise)\n",
+    "\n",
+    "``` python\n",
+    "cuda.atomic.add(histogram_out, bin_number, 1)\n",
+    "```"
+   ]
+  },
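+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To see why the non-atomic `+= 1` loses counts, here is a simplified host-side model of the two update styles. It is a sketch under a strong assumption, namely that every thread reads its bin before any thread writes; real GPU scheduling is more involved. The names `lockstep_increment`, `atomic_increment`, and `bins` are hypothetical and are not part of Numba.\n",
+    "\n",
+    "``` python\n",
+    "import numpy as np\n",
+    "\n",
+    "def lockstep_increment(bin_numbers, nbins):\n",
+    "    # Model of `hist[b] += 1` executed by many threads at once (non-atomic)\n",
+    "    hist = np.zeros(nbins, dtype=np.int32)\n",
+    "    # Step 1: every thread reads the current value of its bin\n",
+    "    old_values = [hist[b] for b in bin_numbers]\n",
+    "    # Step 2: every thread writes back (the value it read) + 1,\n",
+    "    # overwriting what other threads wrote to the same bin\n",
+    "    for b, old in zip(bin_numbers, old_values):\n",
+    "        hist[b] = old + 1\n",
+    "    return hist\n",
+    "\n",
+    "def atomic_increment(bin_numbers, nbins):\n",
+    "    # Model of an atomic add: each read-modify-write completes\n",
+    "    # before the next one starts, so no update is lost\n",
+    "    hist = np.zeros(nbins, dtype=np.int32)\n",
+    "    for b in bin_numbers:\n",
+    "        hist[b] += 1\n",
+    "    return hist\n",
+    "\n",
+    "bins = [5, 2, 5, 6, 4, 2, 5, 7]  # bin index computed by each of 8 threads\n",
+    "print('lockstep:', lockstep_increment(bins, 10))\n",
+    "print('atomic:  ', atomic_increment(bins, 10))\n",
+    "```\n",
+    "\n",
+    "In the lockstep model every touched bin ends at 1 no matter how many threads hit it, which resembles the `[0 1 1 1 1 1 1 0 0 0]` histogram printed by `debug/ex1.py`. The atomic model keeps all 8 counts."
+   ]
+  },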
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "collapsed": true
+   },
+   "source": [
+    "### CUDA Simulator\n",
+    "\n",
+    "Back in the early days of CUDA, `nvcc` had an \"emulator\" mode that would execute CUDA code on the CPU for debugging.  That functionality was dropped in later CUDA releases after `cuda-gdb` was created.  There isn't a debugger for CUDA+Python, so Numba includes a \"CUDA simulator\" that runs your CUDA code with the Python interpreter on the host CPU.  This allows you to debug the logic of your code using Python modules and functions that would otherwise not be allowed by the compiler.\n",
+    "\n",
+    "A very common use case is to start the Python debugger inside one thread of a CUDA kernel:\n",
+    "``` python\n",
+    "import numpy as np\n",
+    "\n",
+    "from numba import cuda\n",
+    "\n",
+    "@cuda.jit\n",
+    "def histogram(x, xmin, xmax, histogram_out):\n",
+    "    nbins = histogram_out.shape[0]\n",
+    "    bin_width = (xmax - xmin) / nbins\n",
+    "\n",
+    "    start = cuda.grid(1)\n",
+    "    stride = cuda.gridsize(1)\n",
+    "\n",
+    "    ### DEBUG FIRST THREAD\n",
+    "    if start == 0:\n",
+    "        from pdb import set_trace; set_trace()\n",
+    "    ###\n",
+    "\n",
+    "    for i in range(start, x.shape[0], stride):\n",
+    "        bin_number = np.int32((x[i] + xmin)/bin_width)\n",
+    "\n",
+    "        if bin_number >= 0 and bin_number < histogram_out.shape[0]:\n",
+    "            cuda.atomic.add(histogram_out, bin_number, 1)\n",
+    "\n",
+    "x = np.random.normal(size=50, loc=0, scale=1).astype(np.float32)\n",
+    "xmin = np.float32(-4.0)\n",
+    "xmax = np.float32(4.0)\n",
+    "histogram_out = np.zeros(shape=10, dtype=np.int32)\n",
+    "\n",
+    "histogram[64, 64](x, xmin, xmax, histogram_out)\n",
+    "\n",
+    "print('input count:', x.shape[0])\n",
+    "print('histogram:', histogram_out)\n",
+    "print('count:', histogram_out.sum())\n",
+    "```\n",
+    "\n",
+    "This code allows a debug session like the following to take place:\n",
+    "```\n",
+    "(gtc2017) 0179-sseibert:gtc2017-numba sseibert$ NUMBA_ENABLE_CUDASIM=1 python debug/ex2.py\n",
+    "> /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex2.py(18)histogram()\n",
+    "-> for i in range(start, x.shape[0], stride):\n",
+    "(Pdb) n\n",
+    "> /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex2.py(19)histogram()\n",
+    "-> bin_number = np.int32((x[i] + xmin)/bin_width)\n",
+    "(Pdb) n\n",
+    "> /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex2.py(21)histogram()\n",
+    "-> if bin_number >= 0 and bin_number < histogram_out.shape[0]:\n",
+    "(Pdb) p bin_number, x[i]\n",
+    "(-6, -1.4435024)\n",
+    "(Pdb) p x[i], xmin, bin_width\n",
+    "(-1.4435024, -4.0, 0.80000000000000004)\n",
+    "(Pdb) p (x[i] - xmin) / bin_width\n",
+    "3.1956219673156738\n",
+    "(Pdb) q\n",
+    "```"
+   ]
+  },
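+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The numbers printed in the debug session above can be reproduced by hand. The sketch below reuses the values shown by `pdb` (`x[i]`, `xmin`) and takes `bin_width` as 0.8, that is (4 - (-4)) / 10; the variable names `x_i`, `buggy_bin`, and `fixed_bin` are chosen only for this illustration.\n",
+    "\n",
+    "``` python\n",
+    "import numpy as np\n",
+    "\n",
+    "x_i = np.float32(-1.4435024)   # value of x[i] shown in the session\n",
+    "xmin = np.float32(-4.0)\n",
+    "bin_width = 0.8                # (xmax - xmin) / nbins = 8 / 10\n",
+    "\n",
+    "# Expression as written in the kernel being debugged: adds xmin\n",
+    "buggy_bin = np.int32((x_i + xmin) / bin_width)\n",
+    "# Expression evaluated at the pdb prompt: subtracts xmin\n",
+    "fixed_bin = np.int32((x_i - xmin) / bin_width)\n",
+    "\n",
+    "print('with + xmin:', buggy_bin)\n",
+    "print('with - xmin:', fixed_bin)\n",
+    "```\n",
+    "\n",
+    "With `+ xmin` the bin index is negative, so the element fails the range check and is never counted. With `- xmin` it falls in bin 3, consistent with the 3.19... value printed in the session."
+   ]
+  },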
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "collapsed": true
+   },
+   "source": [
+    "### CUDA Memcheck\n",
+    "\n",
+    "Another common error occurs when a CUDA kernel has an invalid memory access, typically caused by running off the end of an array.  The full CUDA toolkit from NVIDIA (not the `cudatoolkit` conda package) contains a utility called `cuda-memcheck` that can check for a wide range of memory access mistakes in CUDA code.\n",
+    "\n",
+    "Let's debug the following code:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 78,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "import numpy as np\r\n",
+      "\r\n",
+      "from numba import cuda\r\n",
+      "\r\n",
+      "@cuda.jit\r\n",
+      "def histogram(x, xmin, xmax, histogram_out):\r\n",
+      "    nbins = histogram_out.shape[0]\r\n",
+      "    bin_width = (xmax - xmin) / nbins\r\n",
+      "\r\n",
+      "    start = cuda.grid(1)\r\n",
+      "    stride = cuda.gridsize(1)\r\n",
+      "\r\n",
+      "    for i in range(start, x.shape[0], stride):\r\n",
+      "        bin_number = np.int32((x[i] + xmin)/bin_width)\r\n",
+      "\r\n",
+      "        if bin_number >= 0 or bin_number < histogram_out.shape[0]:\r\n",
+      "            cuda.atomic.add(histogram_out, bin_number, 1)\r\n",
+      "\r\n",
+      "x = np.random.normal(size=50, loc=0, scale=1).astype(np.float32)\r\n",
+      "xmin = np.float32(-4.0)\r\n",
+      "xmax = np.float32(4.0)\r\n",
+      "histogram_out = np.zeros(shape=10, dtype=np.int32)\r\n",
+      "\r\n",
+      "histogram[64, 64](x, xmin, xmax, histogram_out)\r\n",
+      "\r\n",
+      "print('input count:', x.shape[0])\r\n",
+      "print('histogram:', histogram_out)\r\n",
+      "print('count:', histogram_out.sum())\r\n"
+     ]
+    }
+   ],
+   "source": [
+    "! cat debug/ex3.py"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 79,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "========= CUDA-MEMCHECK\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (31,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001e8 is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (30,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001ec is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (29,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001f8 is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (28,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001e8 is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (27,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001f0 is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (26,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001f8 is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (25,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001fc is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (24,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001e4 is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (23,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001ec is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (22,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001ec is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (21,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001ec is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (20,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001f8 is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (19,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001ec is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (18,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001f4 is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (17,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001e8 is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (16,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001f0 is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (15,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001f0 is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (14,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001f4 is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (13,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001f0 is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (12,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001ec is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (11,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001e8 is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (10,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001e8 is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (9,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001e4 is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (8,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001e8 is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (7,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001f0 is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (6,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001e8 is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (5,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001f4 is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (4,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001ec is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (3,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001ec is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (2,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001ec is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (1,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001ec is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (0,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f43844001f4 is out of bounds\n",
+      "=========     Device Frame:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x900)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to \"unspecified launch failure\" on CUDA API call to cuMemcpyDtoH_v2. \n",
+      "=========     Saved host backtrace up to driver entry point at error\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuMemcpyDtoH_v2 + 0x1c9) [0x291fe9]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x966) [0x193d06]\n",
+      "=========     Host Frame:python [0x1944d4]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python [0x1945e6]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191e46]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "Traceback (most recent call last):\n",
+      "  File \"debug/ex3.py\", line 24, in <module>\n",
+      "    histogram[64, 64](x, xmin, xmax, histogram_out)\n",
+      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/compiler.py\", line 755, in __call__\n",
+      "    cfg(*args)\n",
+      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/compiler.py\", line 494, in __call__\n",
+      "    sharedmem=self.sharedmem)\n",
+      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/compiler.py\", line 596, in _kernel_call\n",
+      "    wb()\n",
+      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/args.py\", line 65, in <lambda>\n",
+      "    retr.append(lambda: devary.copy_to_host(self.value, stream=stream))\n",
+      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/devices.py\", line 212, in _require_cuda_context\n",
+      "    return fn(*args, **kws)\n",
+      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/devicearray.py\", line 252, in copy_to_host\n",
+      "    _driver.device_to_host(hostary, self, self.alloc_size, stream=stream)\n",
+      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py\", line 1819, in device_to_host\n",
+      "    fn(host_pointer(dst), device_pointer(src), size, *varargs)\n",
+      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py\", line 290, in safe_cuda_api_call\n",
+      "    self._check_error(fname, retcode)\n",
+      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py\", line 325, in _check_error\n",
+      "    raise CudaAPIError(retcode, msg)\n",
+      "numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuMemcpyDtoH results in CUDA_ERROR_LAUNCH_FAILED\n",
+      "========= ERROR SUMMARY: 33 errors\n"
+     ]
+    }
+   ],
+   "source": [
+    "! cuda-memcheck python debug/ex3.py"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The output of `cuda-memcheck` clearly shows a problem in our `histogram` kernel, an out-of-bounds write of 4 bytes to global memory:\n",
+    "```\n",
+    "========= Invalid __global__ write of size 4\n",
+    "=========     at 0x00000900 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+    "```\n",
+    "However, the report does not tell us which line of the kernel's Python source performed the write. To get better error information, we can compile the kernel in debug mode by passing `debug=True` to `@cuda.jit`, so that the kernel begins like this:\n",
+    "``` python\n",
+    "@cuda.jit(debug=True)\n",
+    "def histogram(x, xmin, xmax, histogram_out):\n",
+    "    nbins = histogram_out.shape[0]\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 80,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "========= CUDA-MEMCHECK\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (31,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001f4 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (30,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001f0 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (29,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001f0 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (28,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001f0 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (27,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001f0 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (26,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001f0 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (25,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001e8 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (24,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001f0 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (23,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001e8 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (22,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001f4 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (21,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001f4 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (20,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001e4 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (19,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001ec is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (18,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001f4 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (17,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001ec is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (16,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001ec is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (15,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001f0 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (14,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001f0 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (13,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001e8 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (12,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001e8 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (11,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001f0 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (10,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001f4 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (9,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001f8 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (8,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001e8 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (7,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001e8 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (6,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001e8 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (5,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001f4 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (4,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001e8 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (3,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001e8 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (2,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001f4 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (1,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001e8 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n",
+      "========= Invalid __global__ write of size 4\n",
+      "=========     at 0x00001bb0 in /dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
+      "=========     by thread (0,0,0) in block (0,0,0)\n",
+      "=========     Address 0x7f4f464001f0 is out of bounds\n",
+      "=========     Device Frame:/dli/task/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) (cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>) : 0x1bb0)\n",
+      "=========     Saved host backtrace up to driver entry point at kernel launch time\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x346) [0x297db6]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x19296b]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0xd0) [0x1161e0]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to \"unspecified launch failure\" on CUDA API call to cuMemcpyDtoH_v2. \n",
+      "=========     Saved host backtrace up to driver entry point at error\n",
+      "=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuMemcpyDtoH_v2 + 0x1c9) [0x291fe9]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call_unix64 + 0x4c) [0x6adc]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/../../libffi.so.6 (ffi_call + 0x1f2) [0x6282]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so (_ctypes_callproc + 0x2ce) [0x12d6e]\n",
+      "=========     Host Frame:/home/appuser/Miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so [0x137a5]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x966) [0x193d06]\n",
+      "=========     Host Frame:python [0x1944d4]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "Traceback (most recent call last):\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "  File \"debug/ex3a.py\", line 24, in <module>\n",
+      "=========     Host Frame:python [0x192b83]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python [0x191fae]\n",
+      "=========     Host Frame:python [0x192be6]\n",
+      "=========     Host Frame:python [0x198a65]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x10cb) [0x1bc31b]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ab0) [0x1bcd00]\n",
+      "=========     Host Frame:python [0x191b76]\n",
+      "=========     Host Frame:python (_PyFunction_FastCallDict + 0x1be) [0x19308e]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x26f) [0x1116ff]\n",
+      "=========     Host Frame:python (_PyObject_Call_Prepend + 0x63) [0x116173]\n",
+      "=========     Host Frame:python (PyObject_Call + 0x3e) [0x11113e]\n",
+      "=========     Host Frame:python [0x16a101]\n",
+      "=========     Host Frame:python (_PyObject_FastCallDict + 0x8b) [0x11151b]\n",
+      "=========     Host Frame:python [0x198ade]\n",
+      "=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x30a) [0x1bb55a]\n",
+      "=========     Host Frame:python (PyEval_EvalCodeEx + 0x329) [0x1936c9]\n",
+      "=========     Host Frame:python (PyEval_EvalCode + 0x1c) [0x19445c]\n",
+      "=========     Host Frame:python [0x214d54]\n",
+      "    histogram[64, 64](x, xmin, xmax, histogram_out)\n",
+      "=========     Host Frame:python (PyRun_FileExFlags + 0xa1) [0x215151]\n",
+      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/compiler.py\", line 755, in __call__\n",
+      "=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1c3) [0x215353]\n",
+      "=========     Host Frame:python (Py_Main + 0x613) [0x218e43]\n",
+      "=========     Host Frame:python (main + 0xee) [0xe328e]\n",
+      "=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21f45]\n",
+      "=========     Host Frame:python [0x1c1fff]\n",
+      "=========\n",
+      "    cfg(*args)\n",
+      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/compiler.py\", line 494, in __call__\n",
+      "    sharedmem=self.sharedmem)\n",
+      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/compiler.py\", line 571, in _kernel_call\n",
+      "    driver.device_to_host(ctypes.addressof(excval), excmem, excsz)\n",
+      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py\", line 1819, in device_to_host\n",
+      "    fn(host_pointer(dst), device_pointer(src), size, *varargs)\n",
+      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py\", line 290, in safe_cuda_api_call\n",
+      "    self._check_error(fname, retcode)\n",
+      "  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py\", line 325, in _check_error\n",
+      "    raise CudaAPIError(retcode, msg)\n",
+      "numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuMemcpyDtoH results in CUDA_ERROR_LAUNCH_FAILED\n",
+      "========= ERROR SUMMARY: 33 errors\n"
+     ]
+    }
+   ],
+   "source": [
+    "! cuda-memcheck python debug/ex3a.py"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now we get an error message that includes a source file and line number: `ex3a.py:17`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 81,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "    15\t\r\n",
+      "    16\t        if bin_number >= 0 or bin_number < histogram_out.shape[0]:\r\n",
+      "    17\t            cuda.atomic.add(histogram_out, bin_number, 1)\r\n",
+      "    18\t\r\n",
+      "    19\tx = np.random.normal(size=50, loc=0, scale=1).astype(np.float32)\r\n"
+     ]
+    }
+   ],
+   "source": [
+    "! cat -n debug/ex3a.py | grep -C 2 \"17\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "At this point, we might realize that our `if` statement incorrectly uses `or` instead of `and`, so the bounds check passes for every value of `bin_number`.\n",
+    "\n",
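+    "As a minimal sketch of why the `or` lets every index through (plain Python on the host, not the course's kernel; `nbins` and the candidate values are made up for illustration):\n",
+    "\n",
+    "```python\n",
+    "def passes_buggy_check(bin_number, nbins):\n",
+    "    # Condition as written on line 16 of ex3a.py: true for every integer when nbins is positive.\n",
+    "    return bin_number >= 0 or bin_number < nbins\n",
+    "\n",
+    "def passes_fixed_check(bin_number, nbins):\n",
+    "    # Both bounds must hold for the index to be inside the histogram.\n",
+    "    return bin_number >= 0 and bin_number < nbins\n",
+    "\n",
+    "nbins = 10                        # hypothetical histogram size\n",
+    "candidates = [-3, 0, 9, 10, 125]  # hypothetical bin numbers\n",
+    "buggy = [b for b in candidates if passes_buggy_check(b, nbins)]\n",
+    "fixed = [b for b in candidates if passes_fixed_check(b, nbins)]\n",
+    "print(buggy)  # [-3, 0, 9, 10, 125]\n",
+    "print(fixed)  # [0, 9]\n",
+    "```\n",
+    "\n",
+    "With `and`, only in-range indices reach the atomic add; with `or`, out-of-range indices do too, which is consistent with the invalid writes reported above.\n",
+    "\n",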
+    "`cuda-memcheck` has different modes for detecting different kinds of problems (similar to `valgrind` for debugging CPU memory access errors).  Take a look at the documentation for more information: http://docs.nvidia.com/cuda/cuda-memcheck/"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Appendix: CUDA References\n",
+    "\n",
+    "It's worth bookmarking Chapters 1 and 2 of the CUDA C Programming Guide for study after the completion of this course. They are written for CUDA C, but are still highly applicable to programming CUDA Python.\n",
+    "\n",
+    " * Introduction: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#introduction\n",
+    " * Programming Model: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programming-model"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Appendix: Random Number Generation on the GPU with Numba\n",
+    "\n",
+    "GPUs can be extremely useful for Monte Carlo applications where you need to use large amounts of random numbers.  CUDA ships with an excellent set of random number generation algorithms in the cuRAND library.  Unfortunately, cuRAND is defined in a set of C headers which Numba can't easily compile or link to.  (Numba's CUDA JIT does not ever create C code for CUDA kernels.)  It is on the Numba roadmap to find a solution to this problem, but it may take some time.\n",
+    "\n",
+    "In the meantime, Numba version 0.33 and later includes the `xoroshiro128+` generator, which is pretty high quality, though with a smaller period ($2^{128} - 1$) than the XORWOW generator in cuRAND.\n",
+    "\n",
+    "To use it, you will want to initialize the RNG state on the host for each thread in your kernel. This state creation function initializes each state to be in the same sequence designated by the seed, but separated by $2^{64}$ steps from each other. This ensures that different threads will not accidentally end up with overlapping sequences (unless a single thread draws $2^{64}$ random numbers, which you won't have patience for):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 82,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "from numba import cuda\n",
+    "from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32\n",
+    "\n",
+    "threads_per_block = 64\n",
+    "blocks = 24\n",
+    "rng_states = create_xoroshiro128p_states(threads_per_block * blocks, seed=1)"
+   ]
+  },
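+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To put the $2^{64}$ spacing in perspective, here is a back-of-the-envelope check in plain Python. It uses only the figures quoted above (period $2^{128} - 1$, spacing $2^{64}$) and the launch configuration from the previous cell; the number of draws per thread is an assumed workload, chosen for illustration.\n",
+    "\n",
+    "```python\n",
+    "# Figures quoted in the text above.\n",
+    "period = 2**128 - 1   # period of xoroshiro128+\n",
+    "spacing = 2**64       # steps between consecutive per-thread states\n",
+    "\n",
+    "# Launch configuration from the previous cell.\n",
+    "threads = 64 * 24     # threads_per_block * blocks\n",
+    "\n",
+    "# Assumed workload, for illustration only.\n",
+    "draws_per_thread = 10000\n",
+    "\n",
+    "# Number of non-overlapping streams that fit into one period.\n",
+    "max_streams = period // spacing\n",
+    "\n",
+    "print(threads)                     # 1536\n",
+    "print(max_streams == 2**64 - 1)    # True\n",
+    "print(draws_per_thread < spacing)  # True: no thread reaches its neighbour's stream\n",
+    "```\n",
+    "\n",
+    "Both the thread count and the per-thread draw count are many orders of magnitude below the limits, so overlap is not a practical concern for a launch of this size."
+   ]
+  },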
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can use these random number states in our kernel by passing them in as an argument:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 91,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "@cuda.jit\n",
+    "def monte_carlo_mean(rng_states, iterations, out):\n",
+    "    thread_id = cuda.grid(1)\n",
+    "    total = 0\n",
+    "    for i in range(iterations):\n",
+    "        sample = xoroshiro128p_uniform_float32(rng_states, thread_id) # Returns a float32 in range [0.0, 1.0)\n",
+    "        total += sample\n",
+    "        \n",
+    "    out[thread_id] = total/iterations"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 98,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "0.49990836\n"
+     ]
+    }
+   ],
+   "source": [
+    "out = cuda.device_array(threads_per_block * blocks, dtype=np.float32)\n",
+    "monte_carlo_mean[blocks, threads_per_block](rng_states, 10000, out)\n",
+    "print(out.copy_to_host().mean())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Exercise: Monte Carlo Pi on the GPU\n",
+    "\n",
+    "Let's revisit the Monte Carlo Pi algorithm from the first section, where we compiled it with Numba on the CPU."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 99,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "from numba import njit\n",
+    "import random\n",
+    "\n",
+    "@njit\n",
+    "def monte_carlo_pi(nsamples):\n",
+    "    acc = 0\n",
+    "    for i in range(nsamples):\n",
+    "        x = random.random()\n",
+    "        y = random.random()\n",
+    "        if (x**2 + y**2) < 1.0:\n",
+    "            acc += 1\n",
+    "    return 4.0 * acc / nsamples"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 102,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "105 ms ± 30.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "nsamples = 10000000\n",
+    "%timeit monte_carlo_pi(nsamples)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Your task is to refactor `monte_carlo_pi_device` below, currently identical to `monte_carlo_pi` above, to run on the GPU. You can use `monte_carlo_mean` above for inspiration, but at the least you will need to:\n",
+    "\n",
+    "- Decorate to be a CUDA kernel\n",
+    "- Draw samples for the thread from the device RNG state (generated 2 cells below)\n",
+    "- Store each thread's results in an output array which will be averaged on the host (as `monte_carlo_mean` did above)\n",
+    "\n",
+    "If you look two cells below you will see that all the data has already been initialized, the execution configuration created, and the kernel launched. All you need to do is refactor the kernel definition in the cell immediately below. Check out [the solution](../edit/solutions/monte_carlo_pi_solution.py) if you get stuck."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 116,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "from numba import njit\n",
+    "import random\n",
+    "\n",
+    "# TODO: All your work will be in this cell. Refactor to run on the device successfully given the way the\n",
+    "# kernel is launched below.\n",
+    "@cuda.jit\n",
+    "def monte_carlo_pi_device(rng_states, nsamples, out):\n",
+    "    idx = cuda.grid(1)\n",
+    "\n",
+    "    if idx < out.size:\n",
+    "        acc = 0\n",
+    "        for i in range(nsamples):\n",
+    "            x = xoroshiro128p_uniform_float32(rng_states, idx)\n",
+    "            y = xoroshiro128p_uniform_float32(rng_states, idx)\n",
+    "            if x*x + y*y < 1.0:\n",
+    "                acc += 1\n",
+    "        out[idx] = 4.0 * acc / nsamples"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 117,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "# Do not change any of the values in this cell\n",
+    "nsamples = 10000000\n",
+    "threads_per_block = 128\n",
+    "blocks = 32\n",
+    "\n",
+    "grid_size = threads_per_block * blocks\n",
+    "samples_per_thread = int(nsamples / grid_size) # Each thread only needs to work on a fraction of total number of samples.\n",
+    "                                               # This could also be calculated inside the kernel definition using `cuda.gridsize(1)`.\n",
+    "\n",
+    "rng_states = create_xoroshiro128p_states(grid_size, seed=1)\n",
+    "d_out = cuda.device_array(threads_per_block * blocks, dtype=np.float32)"
+   ]
+  },
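+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The integer conversion in the previous cell discards a remainder, so the grid as a whole draws slightly fewer than `nsamples` points. A quick worked check in plain Python, using only the values from that cell (the names `total_drawn` and `dropped` are introduced here for illustration):\n",
+    "\n",
+    "```python\n",
+    "nsamples = 10000000\n",
+    "threads_per_block = 128\n",
+    "blocks = 32\n",
+    "\n",
+    "grid_size = threads_per_block * blocks        # total threads in the grid\n",
+    "samples_per_thread = nsamples // grid_size    # same value as int(nsamples / grid_size)\n",
+    "total_drawn = samples_per_thread * grid_size  # points actually sampled by the whole grid\n",
+    "dropped = nsamples - total_drawn              # remainder lost to integer division\n",
+    "\n",
+    "print(grid_size)           # 4096\n",
+    "print(samples_per_thread)  # 2441\n",
+    "print(total_drawn)         # 9998336\n",
+    "print(dropped)             # 1664\n",
+    "```\n",
+    "\n",
+    "Because every thread draws the same number of points, the plain mean of the per-thread estimates in `d_out` matches the estimate over all 9,998,336 points, up to `float32` rounding."
+   ]
+  },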
+  {
+   "cell_type": "code",
+   "execution_count": 118,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "1.04 ms ± 62.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit monte_carlo_pi_device[blocks, threads_per_block](rng_states, samples_per_thread, d_out); cuda.synchronize()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 114,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "3.140668\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(d_out.copy_to_host().mean())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a href=\"https://www.nvidia.com/dli\"> <img src=\"images/DLI Header.png\" alt=\"Header\" style=\"width: 400px;\"/> </a>"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/nvidia1/Effective+Memory+Use.ipynb b/nvidia1/Effective+Memory+Use.ipynb
new file mode 100644
index 0000000..2f038e9
--- /dev/null
+++ b/nvidia1/Effective+Memory+Use.ipynb
@@ -0,0 +1,2073 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a href=\"https://www.nvidia.com/dli\"> <img src=\"images/DLI Header.png\" alt=\"Header\" style=\"width: 400px;\"/> </a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Effective Use of the Memory Subsystem"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now that you can write correct CUDA kernels, and understand the importance of launching grids that give the GPU sufficient opportunity to hide latency, you are going to learn techniques to effectively utilize GPU memory subsystems. These techniques apply to a wide variety of CUDA applications, and are among the most important when it comes time to make your CUDA code go fast.\n",
+    "\n",
+    "You are going to begin by learning about memory coalescing. To challenge your ability to reason about memory coalescing, and to expose important details relevant to many CUDA applications, you will then learn about 2-dimensional grids and thread blocks. Next you will learn about a very fast, user-controlled, on-demand memory space called shared memory, and will use shared memory to facilitate memory coalescing where it would not have otherwise been possible. Finally, you will learn about shared memory bank conflicts, which can spoil the performance possibilities of using shared memory, and a technique to address them."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Objectives"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "By the time you complete this section, you will be able to:\n",
+    "* Write CUDA kernels that benefit from coalesced memory access patterns.\n",
+    "* Work with multi-dimensional grids and thread blocks.\n",
+    "* Use shared memory to coordinate threads within a block.\n",
+    "* Use shared memory to facilitate coalesced memory access patterns.\n",
+    "* Resolve shared memory bank conflicts."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## The Problem: Uncoalesced Memory Access Hurts Performance"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Before you learn the details about what **coalesced memory access** is, run the following cells to observe the performance implications for a seemingly trivial change to the data access pattern within a kernel."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Imports"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "from numba import cuda"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Data Creation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In this cell we define `n` and create a grid whose total number of threads equals `n`. We also create an output vector with length `n`. For the inputs we create vectors of size `stride * n` for reasons that will be made clear below:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "n = 1024*1024 # 1M\n",
+    "\n",
+    "threads_per_block = 1024\n",
+    "blocks = int(n / threads_per_block)\n",
+    "\n",
+    "stride = 16\n",
+    "\n",
+    "# Input Vectors of length stride * n\n",
+    "a = np.ones(stride * n).astype(np.float32)\n",
+    "b = a.copy().astype(np.float32)\n",
+    "\n",
+    "# Output Vector\n",
+    "out = np.zeros(n).astype(np.float32)\n",
+    "\n",
+    "d_a = cuda.to_device(a)\n",
+    "d_b = cuda.to_device(b)\n",
+    "d_out = cuda.to_device(out)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Kernel Definition"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In `add_experiment`, every thread in the grid will add an item in `a`, and an item in `b` and write the result to `out`. The kernel has been written such that we can pass a `coalesced` value of either `True` or `False` to affect how it indexes into the `a` and `b` vectors. You will see the performance comparison of the two modes below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "@cuda.jit\n",
+    "def add_experiment(a, b, out, stride, coalesced):\n",
+    "    i = cuda.grid(1)\n",
+    "    # The above line is equivalent to\n",
+    "    # i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n",
+    "    if coalesced == True:\n",
+    "        out[i] = a[i] + b[i]\n",
+    "    else:\n",
+    "        out[i] = a[stride*i] + b[stride*i]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Launch Kernel Using Coalesced Access"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Here we pass `True` as the `coalesced` value, and observe the performance of the kernel over several runs:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "227 µs ± 73.6 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit add_experiment[blocks, threads_per_block](d_a, d_b, d_out, stride, True); cuda.synchronize()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Here we make sure the kernel ran as expected:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "result = d_out.copy_to_host()\n",
+    "truth = a[:n] + b[:n]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "True"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "np.array_equal(result, truth)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Launch Kernel Using Uncoalesced Access"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In this cell we pass `False`, to observe the performance of the uncoalesced data access pattern for `add_experiment`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "540 µs ± 10.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit add_experiment[blocks, threads_per_block](d_a, d_b, d_out, stride, False); cuda.synchronize()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Here we make sure the kernel ran as expected:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "result = d_out.copy_to_host()\n",
+    "truth = a[::stride] + b[::stride]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "True"
+      ]
+     },
+     "execution_count": 9,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "np.array_equal(result, truth)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Results"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The performance of the uncoalesced data access pattern was far worse. Now you will learn why, and how to think about data access patterns in your kernels to obtain high performing kernels."
+   ]
+  },
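+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a rough illustration of why the strided version is slower, the following host-side sketch counts how many memory segments one warp touches for each access pattern. It is plain Python rather than GPU code, the helper name `segments_touched` is hypothetical, and the 32-byte segment size is an assumption (the actual granularity depends on the device and its caching behavior). The warp size of 32 threads and the 4-byte `float32` items match the experiment above.\n",
+    "\n",
+    "```python\n",
+    "def segments_touched(stride, warp_size=32, itemsize=4, segment_bytes=32):\n",
+    "    # Byte address read by each thread of one warp, relative to the array start.\n",
+    "    addresses = [t * stride * itemsize for t in range(warp_size)]\n",
+    "    # Each distinct segment is a separate chunk that must be fetched from memory.\n",
+    "    # segment_bytes=32 is an assumed value, not something this notebook specifies.\n",
+    "    return len({addr // segment_bytes for addr in addresses})\n",
+    "\n",
+    "coalesced = segments_touched(stride=1)     # adjacent threads read adjacent floats\n",
+    "uncoalesced = segments_touched(stride=16)  # adjacent threads read floats 64 bytes apart\n",
+    "print(coalesced, uncoalesced)\n",
+    "```\n",
+    "\n",
+    "Under these assumptions the strided warp touches 32 segments where the contiguous warp touches 4, so most of each fetched segment goes unused."
+   ]
+  },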
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Presentation: Global Memory Coalescing"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Execute the following cell to load the slides, then click on \"Start Slide Show\" to make them full screen."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "\n",
+       "        <iframe\n",
+       "            width=\"800\"\n",
+       "            height=\"450\"\n",
+       "            src=\"https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/coalescing-v3.pptx\"\n",
+       "            frameborder=\"0\"\n",
+       "            allowfullscreen\n",
+       "        ></iframe>\n",
+       "        "
+      ],
+      "text/plain": [
+       "<IPython.lib.display.IFrame at 0x7f9086df57b8>"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from IPython.display import IFrame\n",
+    "IFrame('https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/coalescing-v3.pptx', 800, 450)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "> _**Footnote**: for additional details about global memory segment size across a variety of devices, and with regards to caching, see [The CUDA Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#coalesced-access-to-global-memory)._"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Exercise: Column and Row Sums"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For this exercise you will be asked to write a column sums kernel that uses fully coalesced memory access patterns. To begin you will observe the performance of a row sums kernel that makes uncoalesced memory accesses."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Row Sums"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Imports**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "from numba import cuda"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Data Creation**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In this cell we create an input matrix, as well as a vector for storing the solution, and transfer each of them to the device. We also define the grid and block dimensions to be used when we launch the kernel below. We set an arbitrary row of data to some arbitrary value to facilitate checking for correctness below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "n = 16384 # matrix side size\n",
+    "threads_per_block = 256\n",
+    "blocks = int(n / threads_per_block)\n",
+    "\n",
+    "# Input Matrix\n",
+    "a = np.ones(n*n).reshape(n, n).astype(np.float32)\n",
+    "# Here we set an arbitrary row to an arbitrary value to facilitate a check for correctness below.\n",
+    "a[3] = 9\n",
+    "\n",
+    "# Output vector\n",
+    "sums = np.zeros(n).astype(np.float32)\n",
+    "\n",
+    "d_a = cuda.to_device(a)\n",
+    "d_sums = cuda.to_device(sums)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Kernel Definition**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "`row_sums` will use each thread to iterate over a row of data, summing it, and then store its row sum in `sums`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "@cuda.jit\n",
+    "def row_sums(a, sums, n):\n",
+    "    idx = cuda.grid(1)\n",
+    "    sum = 0.0\n",
+    "    \n",
+    "    for i in range(n):\n",
+    "        # Each thread will sum a row of `a`\n",
+    "        sum += a[idx][i]\n",
+    "        \n",
+    "    sums[idx] = sum"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Row Sums Performance**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "11.6 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit row_sums[blocks, threads_per_block](d_a, d_sums, n); cuda.synchronize()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Check for Correctness**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "result = d_sums.copy_to_host()\n",
+    "truth = a.sum(axis=1)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "True"
+      ]
+     },
+     "execution_count": 16,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "np.array_equal(truth, result)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Column Sums"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Imports**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "from numba import cuda"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Data Creation**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In this cell we create an input matrix, as well as a vector for storing the solution, and transfer each of them to the device. We also define the grid and block dimensions to be used when we launch the kernel below. We set an arbitrary column of data to some arbitrary value to facilitate checking for correctness below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "n = 16384 # matrix side size\n",
+    "threads_per_block = 256\n",
+    "blocks = int(n / threads_per_block)\n",
+    "\n",
+    "a = np.ones(n*n).reshape(n, n).astype(np.float32)\n",
+    "# Here we set an arbitrary column to an arbitrary value to facilitate a check for correctness below.\n",
+    "a[:, 3] = 9\n",
+    "sums = np.zeros(n).astype(np.float32)\n",
+    "\n",
+    "d_a = cuda.to_device(a)\n",
+    "d_sums = cuda.to_device(sums)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Kernel Definition**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "`col_sums` will use each thread to iterate over a column of data, summing it, and then store its column sum in `sums`. Complete the kernel definition to accomplish this. If you get stuck, feel free to refer to [the solution](../edit/solutions/col_sums_solution.py)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 52,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "@cuda.jit\n",
+    "def col_sums(a, sums, n):\n",
+    "    idx = cuda.grid(1)\n",
+    "    sum = 0.0\n",
+    "    \n",
+    "    for i in range(n):\n",
+    "        # Each thread will sum a column of `a`\n",
+    "        sum += a[i][idx]\n",
+    "    \n",
+    "    sums[idx] = sum"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Check Performance**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Assuming you have written `col_sums` to use coalesced access patterns, you should see a significant (almost 2x) speed up compared to the uncoalesced `row_sums` you ran above:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 53,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "7.88 ms ± 3.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit col_sums[blocks, threads_per_block](d_a, d_sums, n); cuda.synchronize()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Check Correctness**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Confirm your kernel is working as expected."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 50,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "col_sums[blocks, threads_per_block](d_a, d_sums, n)\n",
+    "cuda.synchronize()\n",
+    "result = d_sums.copy_to_host()\n",
+    "truth = a.sum(axis=0)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 51,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "True"
+      ]
+     },
+     "execution_count": 51,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "np.array_equal(truth, result)"
+   ]
+  },
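+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To see why `row_sums` is uncoalesced while a column-wise kernel is not, it can help to compute how far apart in memory the elements read by two neighboring threads are. The sketch below is host-side Python with hypothetical helpers (`byte_offset`, `neighbor_gap_bytes`), and it assumes the C-contiguous (row-major) `float32` layout that `np.ones(n*n).reshape(n, n)` produces.\n",
+    "\n",
+    "```python\n",
+    "def byte_offset(row, col, n, itemsize=4):\n",
+    "    # Offset of a[row][col] in a row-major n x n array (assumed layout).\n",
+    "    return (row * n + col) * itemsize\n",
+    "\n",
+    "def neighbor_gap_bytes(pattern, n, i=0):\n",
+    "    # Distance between the elements read by threads 0 and 1 on loop iteration i.\n",
+    "    if pattern == 'row_sums':    # thread t reads a[t][i]\n",
+    "        return byte_offset(1, i, n) - byte_offset(0, i, n)\n",
+    "    if pattern == 'col_sums':    # thread t reads a[i][t]\n",
+    "        return byte_offset(i, 1, n) - byte_offset(i, 0, n)\n",
+    "    raise ValueError(pattern)\n",
+    "\n",
+    "n = 16384\n",
+    "print(neighbor_gap_bytes('row_sums', n))  # 65536 bytes between neighboring threads\n",
+    "print(neighbor_gap_bytes('col_sums', n))  # 4 bytes: neighboring threads read adjacent floats\n",
+    "```\n",
+    "\n",
+    "With a gap of a full row between neighbors, each thread of a warp lands in a different region of memory, whereas the column-wise pattern lets a warp read one contiguous run of floats per iteration."
+   ]
+  },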
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2 and 3 Dimensional Blocks and Grids"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Both grids and blocks can be configured to contain a 2 or 3 dimensional collection of blocks or threads, respectively. This is done mostly as a matter of convenience for programmers who often work with 2 or 3 dimensional datasets. Here is a very trivial example to highlight the syntax. You may need to read *both* the kernel definition and its launch before the concept makes sense."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 54,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "from numba import cuda"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 57,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "A = np.zeros((4,4)) # A 4x4 Matrix of 0's\n",
+    "d_A = cuda.to_device(A)\n",
+    "\n",
+    "# Here we create a 2D grid with 4 blocks in a 2x2 structure, each with 4 threads in a 2x2 structure\n",
+    "# by using a Python tuple to signify grid and block dimensions.\n",
+    "blocks = (2, 2)\n",
+    "threads_per_block = (2, 2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This kernel takes an input matrix of 0s and writes to each element its (x, y) coordinates within the grid, in the format `X.Y`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 58,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "@cuda.jit\n",
+    "def get_2D_indices(A):\n",
+    "    # By passing `2`, we get the thread's unique x and y coordinates in the 2D grid\n",
+    "    x, y = cuda.grid(2)\n",
+    "    # The above is equivalent to the following 2 lines of code:\n",
+    "    # x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n",
+    "    # y = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y\n",
+    "    \n",
+    "    # Write the x index followed by a decimal and the y index.\n",
+    "    A[x][y] = x + y / 10"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 59,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "get_2D_indices[blocks, threads_per_block](d_A)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 60,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([[0. , 0.1, 0.2, 0.3],\n",
+       "       [1. , 1.1, 1.2, 1.3],\n",
+       "       [2. , 2.1, 2.2, 2.3],\n",
+       "       [3. , 3.1, 3.2, 3.3]])"
+      ]
+     },
+     "execution_count": 60,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "result = d_A.copy_to_host()\n",
+    "result"
+   ]
+  },
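+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The index arithmetic behind `cuda.grid(2)` can also be traced on the host. The following sketch is a sequential, pure-Python emulation (the function name `emulate_get_2D_indices` is hypothetical, and nothing here runs in parallel or on the GPU); it only walks every block and thread index and applies the same formula shown in the kernel's comments.\n",
+    "\n",
+    "```python\n",
+    "def emulate_get_2D_indices(blocks, threads_per_block):\n",
+    "    grid_x = blocks[0] * threads_per_block[0]\n",
+    "    grid_y = blocks[1] * threads_per_block[1]\n",
+    "    A = [[0.0] * grid_y for _ in range(grid_x)]\n",
+    "    for bx in range(blocks[0]):\n",
+    "        for by in range(blocks[1]):\n",
+    "            for tx in range(threads_per_block[0]):\n",
+    "                for ty in range(threads_per_block[1]):\n",
+    "                    # Same arithmetic that cuda.grid(2) performs for each thread.\n",
+    "                    x = bx * threads_per_block[0] + tx\n",
+    "                    y = by * threads_per_block[1] + ty\n",
+    "                    A[x][y] = x + y / 10\n",
+    "    return A\n",
+    "\n",
+    "for row in emulate_get_2D_indices((2, 2), (2, 2)):\n",
+    "    print(row)\n",
+    "```\n",
+    "\n",
+    "Each of the 4 blocks contributes a 2x2 tile, and together the tiles cover the 4x4 matrix exactly once."
+   ]
+  },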
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Exercise: Coalesced 2-Dimensional Matrix Add"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Imports"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 62,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "from numba import cuda"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Data Creation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In this cell we define 2048x2048 element input matrices `a` and `b`, as well as a 2048x2048 0-initialized output matrix. We copy these matrices to the device.\n",
+    "\n",
+    "We also define the 2-dimensional block and grid dimensions to be used below. Note that we are creating a grid with the same number of total threads as there are input and output elements, such that each thread in the grid will calculate the sum for a single element in the output matrix."
+   ]
+  },
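+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The grid dimensions in this exercise are hard-coded because 2048 divides evenly by 32. As a sketch of how they could be derived for other shapes, the hypothetical helper below (`blocks_for` is not part of Numba) uses ceiling division per dimension. Note the assumption it carries: when the shape is not an exact multiple of the block size, the grid contains surplus threads, and the kernel would need a bounds check that the exercise kernel does not include.\n",
+    "\n",
+    "```python\n",
+    "def blocks_for(shape, threads_per_block):\n",
+    "    # Ceiling division per dimension so the grid covers every element.\n",
+    "    return tuple((size + tpb - 1) // tpb for size, tpb in zip(shape, threads_per_block))\n",
+    "\n",
+    "print(blocks_for((2048, 2048), (32, 32)))  # (64, 64), matching the values used here\n",
+    "print(blocks_for((2050, 2048), (32, 32)))  # (65, 64): the extra block has surplus threads\n",
+    "```"
+   ]
+  },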
+  {
+   "cell_type": "code",
+   "execution_count": 63,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "n = 2048*2048 # 4M\n",
+    "\n",
+    "# 2D blocks\n",
+    "threads_per_block = (32, 32)\n",
+    "# 2D grid\n",
+    "blocks = (64, 64)\n",
+    "\n",
+    "# 2048x2048 input matrices\n",
+    "a = np.arange(n).reshape(2048,2048).astype(np.float32)\n",
+    "b = a.copy().astype(np.float32)\n",
+    "\n",
+    "# 2048x2048 0-initialized output matrix\n",
+    "out = np.zeros_like(a).astype(np.float32)\n",
+    "\n",
+    "d_a = cuda.to_device(a)\n",
+    "d_b = cuda.to_device(b)\n",
+    "d_out = cuda.to_device(out)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 2D Matrix Add"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Your job is to complete the TODOs in `matrix_add` to correctly sum `a` and `b` into `out`. As a challenge to your understanding of coalesced access patterns, `matrix_add` will accept a `coalesced` boolean indicating whether the access patterns should be coalesced or not. Both modes (coalesced and uncoalesced) should produce correct results; however, you should observe significant speedups below when running with `coalesced` set to `True`.\n",
+    "\n",
+    "If you get stuck, feel free to check out [the solution](../edit/solutions/matrix_add_solution.py)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 66,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "@cuda.jit\n",
+    "def matrix_add(a, b, out, coalesced):\n",
+    "    x, y = cuda.grid(2)\n",
+    "    \n",
+    "    if coalesced == True:\n",
+    "        out[y][x] = a[y][x] + b[y][x]\n",
+    "    else:\n",
+    "        out[x][y] = a[x][y] + b[x][y]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Check Performance"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Run both cells below to launch `matrix_add` with both the coalesced and uncoalesced access patterns you wrote into it, and observe the performance difference. Additional cells have been provided to confirm the correctness of your kernel."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Coalesced**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 68,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "203 µs ± 13.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit matrix_add[blocks, threads_per_block](d_a, d_b, d_out, True); cuda.synchronize()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 69,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "result = d_out.copy_to_host()\n",
+    "truth = a+b"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 70,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "True"
+      ]
+     },
+     "execution_count": 70,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "np.array_equal(result, truth)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Uncoalesced**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 71,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "586 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit matrix_add[blocks, threads_per_block](d_a, d_b, d_out, False); cuda.synchronize()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 72,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "result = d_out.copy_to_host()\n",
+    "truth = a+b"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 73,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "True"
+      ]
+     },
+     "execution_count": 73,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "np.array_equal(result, truth)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Shared Memory"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "So far we have been differentiating between host and device memory, as if device memory were a single kind of memory. But in fact, CUDA has an even more fine-grained [memory hierarchy](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-hierarchy). The device memory we have been utilizing thus far is called **global memory** which is available to any thread or block on the device, can persist for the lifetime of the application, and is a relatively large memory space.\n",
+    "\n",
+    "We will now discuss how to utilize a region of on-chip device memory called **shared memory**. Shared memory is a programmer defined cache of limited size that [depends on the GPU](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities) being used and is **shared** between all threads in a block. It is a scarce resource, cannot be accessed by threads outside of the block where it was allocated, and does not persist after a kernel finishes executing. Shared memory however has a much higher bandwidth than global memory and can be used to great effect in many kernels, especially to optimize performance.\n",
+    "\n",
+    "Here are a few common use cases for shared memory:\n",
+    "\n",
+    " * Caching memory read from global memory that will need to be read multiple times within a block.\n",
+    " * Buffering output from threads so it can be coalesced before writing it back to global memory.\n",
+    " * Staging data for scatter/gather operations within a block."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Shared Memory Syntax"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Numba provides [functions](https://numba.pydata.org/numba-doc/dev/cuda/memory.html#shared-memory-and-thread-synchronization) for allocating shared memory as well as for synchronizing between threads in a block, which is often necessary after parallel threads read from or write to shared memory.\n",
+    "\n",
+    "When declaring shared memory, you provide the shape of the shared array, as well as its type, using a [Numba type](https://numba.pydata.org/numba-doc/dev/reference/types.html#numba-types). **The shape of the array must be a constant value**, and therefore you cannot use arguments passed into the function, built-in variables like `numba.cuda.blockDim.x`, or values calculated from `cuda.gridDim`. Here is a contrived example to demonstrate the syntax, with comments pointing out the movement from host memory to global device memory, to shared memory, back to global device memory, and finally back to host memory:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Imports**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We will use `numba.types` to define the types of values in shared memory."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 74,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "from numba import types, cuda"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Swap Elements Using Shared Memory**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The following kernel takes an input vector, where each thread will first write one element of the vector to shared memory, and then, after syncing such that all elements have been written to shared memory, will write one element out of shared memory into the swapped output vector.\n",
+    "\n",
+    "Worth noting is that each thread will be writing a swapped value from shared memory that was written into shared memory by another thread."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 75,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "@cuda.jit\n",
+    "def swap_with_shared(vector, swapped):\n",
+    "    # Allocate a 4 element vector containing int32 values in shared memory.\n",
+    "    temp = cuda.shared.array(4, dtype=types.int32)\n",
+    "    \n",
+    "    idx = cuda.grid(1)\n",
+    "    \n",
+    "    # Move an element from global memory into shared memory\n",
+    "    temp[idx] = vector[idx]\n",
+    "    \n",
+    "    # cuda.syncthreads will force all threads in the block to synchronize here, which is necessary because...\n",
+    "    cuda.syncthreads()\n",
+    "    #...the following operation is reading an element written to shared memory by another thread.\n",
+    "    \n",
+    "    # Move an element from shared memory back into global memory\n",
+    "    swapped[idx] = temp[3 - cuda.threadIdx.x] # swap elements"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Data Creation**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 76,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "vector = np.arange(4).astype(np.int32)\n",
+    "swapped = np.zeros_like(vector)\n",
+    "\n",
+    "# Move host memory to device (global) memory\n",
+    "d_vector = cuda.to_device(vector)\n",
+    "d_swapped = cuda.to_device(swapped)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 77,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([0, 1, 2, 3], dtype=int32)"
+      ]
+     },
+     "execution_count": 77,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "vector"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Run Kernel**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 78,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "swap_with_shared[1, 4](d_vector, d_swapped)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Check Results**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 79,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([3, 2, 1, 0], dtype=int32)"
+      ]
+     },
+     "execution_count": 79,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# Move device (global) memory back to the host\n",
+    "result = d_swapped.copy_to_host()\n",
+    "result"
+   ]
+  },
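+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To make the role of `cuda.syncthreads()` more concrete, here is a sequential, pure-Python emulation of the swap for a single block. It is only a sketch: the function name `emulate_swap_with_shared` is hypothetical, a Python list stands in for the shared array, and the two separate loops stand in for the two phases that the barrier separates on the GPU.\n",
+    "\n",
+    "```python\n",
+    "def emulate_swap_with_shared(vector):\n",
+    "    n = len(vector)\n",
+    "    temp = [None] * n      # stands in for the block's shared array\n",
+    "    swapped = [None] * n\n",
+    "\n",
+    "    # Phase 1: every thread copies its own element into shared memory.\n",
+    "    for tid in range(n):\n",
+    "        temp[tid] = vector[tid]\n",
+    "\n",
+    "    # The boundary between the two loops plays the role of cuda.syncthreads():\n",
+    "    # no thread starts phase 2 until all threads have finished phase 1.\n",
+    "\n",
+    "    # Phase 2: every thread reads the element written by its mirror thread.\n",
+    "    for tid in range(n):\n",
+    "        swapped[tid] = temp[n - 1 - tid]\n",
+    "    return swapped\n",
+    "\n",
+    "print(emulate_swap_with_shared([0, 1, 2, 3]))  # [3, 2, 1, 0]\n",
+    "```\n",
+    "\n",
+    "If the two phases were merged into a single loop, thread 0 would read `temp[3]` before thread 3 had written it, which is the hazard the barrier prevents on the GPU."
+   ]
+  },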
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Presentation: Shared Memory for Memory Coalescing"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Execute the following cell to load the slides, then click on \"Start Slide Show\" to make them full screen."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 80,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "\n",
+       "        <iframe\n",
+       "            width=\"800\"\n",
+       "            height=\"450\"\n",
+       "            src=\"https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/shared_coalescing.pptx\"\n",
+       "            frameborder=\"0\"\n",
+       "            allowfullscreen\n",
+       "        ></iframe>\n",
+       "        "
+      ],
+      "text/plain": [
+       "<IPython.lib.display.IFrame at 0x7f9086acc438>"
+      ]
+     },
+     "execution_count": 80,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from IPython.display import IFrame\n",
+    "IFrame('https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/shared_coalescing.pptx', 800, 450)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Exercise: Use Shared Memory for Coalesced Reads and Writes With Matrix Transpose"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In this exercise you will implement what was just demonstrated in the presentation by writing a matrix transpose kernel which, using shared memory, makes both its reads from the input matrix and its writes to the output matrix in global memory coalesced."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Coalesced Reads, Uncoalesced Writes"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a reference, and for performance comparison, here is a naive matrix transpose kernel that makes coalesced reads from the input, but uncoalesced writes to the output."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Imports**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "from numba import cuda\n",
+    "import numpy as np"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Data Creation**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Here we create a 4096x4096 input matrix `a` as well as a 4096x4096 output matrix `transposed`, and copy them to the device.\n",
+    "\n",
+    "We also define a 2-dimensional grid with 2-dimensional blocks to be used below. Note that we have created a grid with a total number of threads equal to the number of elements in the input matrix."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 81,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "n = 4096*4096 # 16M\n",
+    "\n",
+    "# 2D blocks\n",
+    "threads_per_block = (32, 32)\n",
+    "# 2D grid\n",
+    "blocks = (128, 128)\n",
+    "\n",
+    "# 4096x4096 input and output matrices\n",
+    "a = np.arange(n).reshape((4096,4096)).astype(np.float32)\n",
+    "transposed = np.zeros_like(a).astype(np.float32)\n",
+    "\n",
+    "d_a = cuda.to_device(a)\n",
+    "d_transposed = cuda.to_device(transposed)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Naive Matrix Transpose Kernel**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This kernel correctly transposes `a`, writing the transposition to `transposed`. It reads from `a` in a coalesced fashion; however, its writes to `transposed` are uncoalesced."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 82,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "@cuda.jit\n",
+    "def transpose(a, transposed):\n",
+    "    x, y = cuda.grid(2)\n",
+    "\n",
+    "    transposed[x][y] = a[y][x]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Check Performance**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 83,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "1.59 ms ± 25.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit transpose[blocks, threads_per_block](d_a, d_transposed); cuda.synchronize()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Check Correctness**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 84,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "result = d_transposed.copy_to_host()\n",
+    "expected = a.T"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 85,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "True"
+      ]
+     },
+     "execution_count": 85,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "np.array_equal(result, expected)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Refactor for Coalesced Reads and Writes"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Your job will be to refactor the `transpose` kernel to use shared memory and make both reads from and writes to global memory in a coalesced fashion."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Imports**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 86,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "from numba import cuda, types as numba_types"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Data Creation**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 87,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "n = 4096*4096 # 16M\n",
+    "\n",
+    "# 2D blocks\n",
+    "threads_per_block = (32, 32)\n",
+    "# 2D grid\n",
+    "blocks = (128, 128)\n",
+    "\n",
+    "# 4096x4096 input and output matrices\n",
+    "a = np.arange(n).reshape((4096,4096)).astype(np.float32)\n",
+    "transposed = np.zeros_like(a).astype(np.float32)\n",
+    "\n",
+    "d_a = cuda.to_device(a)\n",
+    "d_transposed = cuda.to_device(transposed)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Write a Transpose Kernel that Uses Shared Memory**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Complete the TODOs inside the `tile_transpose` kernel definition.\n",
+    "\n",
+    "If you get stuck, feel free to check out [the solution](../edit/solutions/tile_transpose_solution.py)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 91,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "@cuda.jit\n",
+    "def tile_transpose(a, transposed):\n",
+    "    # `tile_transpose` assumes it is launched with a 32x32 block dimension,\n",
+    "    # and that the dimensions of `a` are multiples of the block dimensions.\n",
+    "    \n",
+    "    # 1) Create 32x32 shared memory array.\n",
+    "    \n",
+    "    temp = cuda.shared.array((32,32), dtype=numba_types.float32)\n",
+    "\n",
+    "    # Compute offsets into global input array. Recall for coalesced access we want to map threadIdx.x increments to\n",
+    "    # the fastest changing index in the data, i.e. the column in our array.\n",
+    "    # Note: `a_col` and `a_row` are already correct.\n",
+    "    a_col = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n",
+    "    a_row = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y\n",
+    "    \n",
+    "    # 2) Make coalesced read from global memory (using grid indices)\n",
+    "    # into shared memory array (using thread indices).\n",
+    "    \n",
+    "    temp[cuda.threadIdx.y, cuda.threadIdx.x] = a[a_row, a_col]\n",
+    "\n",
+    "    # 3) Wait for all threads in the block to finish updating shared memory.\n",
+    "    \n",
+    "    cuda.syncthreads()\n",
+    "    \n",
+    "    # 4) Calculate transposed location for the shared memory array tile\n",
+    "    # to be written back to global memory. Note that blockIdx.y * blockDim.y\n",
+    "    # and blockIdx.x * blockDim.x are swapped (because we want to write to the\n",
+    "    # transposed locations), but we want to keep access coalesced, so match up\n",
+    "    # threadIdx.x to the fastest changing index, i.e. the column.\n",
+    "    # Note: `t_col` and `t_row` are already correct.\n",
+    "    t_col = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.x\n",
+    "    t_row = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.y\n",
+    "\n",
+    "    # 5) Write from shared memory (using thread indices)\n",
+    "    # back to global memory (using grid indices)\n",
+    "    # transposing each element within the shared memory array.\n",
+    "    \n",
+    "    transposed[t_row, t_col] = temp[cuda.threadIdx.x, cuda.threadIdx.y]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Check Performance**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Check the performance of your refactored transpose kernel. You should see a speedup compared to the baseline transpose performance above."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 92,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "1.09 ms ± 60.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit tile_transpose[blocks, threads_per_block](d_a, d_transposed); cuda.synchronize()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Check Correctness**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 93,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "result = d_transposed.copy_to_host()\n",
+    "expected = a.T"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 94,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "True"
+      ]
+     },
+     "execution_count": 94,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "np.array_equal(result, expected)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Why Such a Small Improvement?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Although this is a significant speedup for only a few lines of code, you might find the improvement less stark than you expected, given the earlier gains from switching to coalesced access patterns. There are 2 main reasons for this:\n",
+    "\n",
+    "1. The naive transpose kernel was already making coalesced reads, so your refactored version only optimized half of the kernel's global memory accesses.\n",
+    "2. Your code as written suffers from something called shared memory bank conflicts, a topic to which we will now turn our attention."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Presentation: Memory Bank Conflicts"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Execute the following cell to load the slides, then click on \"Start Slide Show\" to make them full screen."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 95,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "\n",
+       "        <iframe\n",
+       "            width=\"800\"\n",
+       "            height=\"450\"\n",
+       "            src=\"https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/bank_conflicts.pptx\"\n",
+       "            frameborder=\"0\"\n",
+       "            allowfullscreen\n",
+       "        ></iframe>\n",
+       "        "
+      ],
+      "text/plain": [
+       "<IPython.lib.display.IFrame at 0x7f908e3479e8>"
+      ]
+     },
+     "execution_count": 95,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from IPython.display import IFrame\n",
+    "IFrame('https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/bank_conflicts.pptx', 800, 450)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Assessment: Resolve Memory Bank Conflicts"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a final exercise, and to get credit towards a certificate in the course for this final section of the workshop, you will refactor the shared memory transpose kernel so that it is free of shared memory bank conflicts."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Imports"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "from numba import cuda, types as numba_types"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Data Creation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "n = 4096*4096 # 16M\n",
+    "threads_per_block = (32, 32)\n",
+    "blocks = (128, 128)\n",
+    "\n",
+    "a = np.arange(n).reshape((4096,4096)).astype(np.float32)\n",
+    "transposed = np.zeros_like(a).astype(np.float32)\n",
+    "\n",
+    "d_a = cuda.to_device(a)\n",
+    "d_transposed = cuda.to_device(transposed)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Make the Kernel Bank Conflict Free"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The `tile_transpose_conflict_free` kernel is a working matrix transpose kernel which utilizes shared memory so that both reads from and writes to global memory are coalesced. Your job is to refactor the kernel so that it does not suffer from memory bank conflicts.\n",
+    "\n",
+    "**Note:** Because this final exercise counts towards certification in the course, a solution will not be provided."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 96,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "@cuda.jit\n",
+    "def tile_transpose_conflict_free(a, transposed):\n",
+    "    # `tile_transpose_conflict_free` assumes it is launched with a 32x32 block dimension,\n",
+    "    # and that the dimensions of `a` are multiples of the block dimensions.\n",
+    "    \n",
+    "    # 1) Create shared memory array: a 32x32 tile padded to 33 columns to avoid bank conflicts.\n",
+    "    tile = cuda.shared.array((32, 33), numba_types.float32)\n",
+    "\n",
+    "    # Compute offsets into global input array.\n",
+    "    x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n",
+    "    y = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y\n",
+    "    \n",
+    "    # 2) Make coalesced read from global memory into shared memory array.\n",
+    "    # Note the use of local thread indices for the shared memory write,\n",
+    "    # and global offsets for global memory read.\n",
+    "    tile[cuda.threadIdx.y, cuda.threadIdx.x] = a[y, x]\n",
+    "\n",
+    "    # 3) Wait for all threads in the block to finish updating shared memory.\n",
+    "    cuda.syncthreads()\n",
+    "    \n",
+    "    # 4) Calculate transposed location for the shared memory array tile\n",
+    "    # to be written back to global memory.\n",
+    "    t_x = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.x\n",
+    "    t_y = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.y\n",
+    "\n",
+    "    # 5) Write back to global memory,\n",
+    "    # transposing each element within the shared memory array.\n",
+    "    transposed[t_y, t_x] = tile[cuda.threadIdx.x, cuda.threadIdx.y]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Check Performance"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Assuming you have correctly resolved the bank conflicts, this kernel should run significantly faster than both the naive transpose kernel and the shared memory (with bank conflicts) transpose kernel. In order to pass the assessment, your kernel will need to run on average in less than 840 µs.\n",
+    "\n",
+    "The first value printed by running the following cell will give you the average run time of your kernel."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 97,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "804 µs ± 2.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit tile_transpose_conflict_free[blocks, threads_per_block](d_a, d_transposed); cuda.synchronize()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Check Correctness"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In order to pass the assessment, your kernel also needs to work correctly. Run the following 2 cells to confirm this is true."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 98,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "result = d_transposed.copy_to_host()\n",
+    "expected = a.T"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 99,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "True"
+      ]
+     },
+     "execution_count": 99,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "np.array_equal(result, expected)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Run the Assessment"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you have completed the refactor, observed its run time to be less than 840 µs, and confirmed that it runs correctly, execute the following cells to run the assessment against your kernel definition."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 100,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "from assessment import assess"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 101,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "Your function took 808.28 µs to run.\n",
+      "\n",
+      "Your function runs fast enough (less than 840 µs): True\n",
+      "\n",
+      "Your function returns the correct results: True\n",
+      "\n",
+      "Congratulations, you passed! See the instructions below for how to get credit for your work to count toward a certificate in the course.\n"
+     ]
+    }
+   ],
+   "source": [
+    "assess(tile_transpose_conflict_free)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Get Credit for Your Work"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "After successfully passing the assessment above, revisit the webpage where you launched this interactive environment and click on the **\"ASSESS TASK\"** button as shown in the screenshot below. Doing so will give you credit for this part of the workshop that counts towards earning a **certificate of competency** for the entire course."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "![Run the assessment](images/run_the_assessment.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Summary"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now that you have completed this section, you are able to:\n",
+    "\n",
+    "* Write CUDA kernels that benefit from coalesced memory access patterns.\n",
+    "* Work with multi-dimensional grids and thread blocks.\n",
+    "* Use shared memory to coordinate threads within a block.\n",
+    "* Use shared memory to facilitate coalesced memory access patterns.\n",
+    "* Resolve shared memory bank conflicts."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Download Content"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To download the contents of this notebook, execute the following cell and then click the download link below. Note: If you run this notebook on a local Jupyter server, you can expect some of the file path links in the notebook to be broken as they are shaped to our own platform. You can still navigate to the files through the Jupyter file navigator."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "!tar -zcvf section3.tar.gz ."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "[Download files from this section.](files/section3.tar.gz)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/nvidia1/Introduction+to+CUDA+Python+with+Numba.ipynb b/nvidia1/Introduction+to+CUDA+Python+with+Numba.ipynb
index ba98a5c..c1e885e 100644
--- a/nvidia1/Introduction+to+CUDA+Python+with+Numba.ipynb
+++ b/nvidia1/Introduction+to+CUDA+Python+with+Numba.ipynb
@@ -172,7 +172,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": 4,
    "metadata": {
     "collapsed": true
    },
@@ -203,7 +203,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 5,
    "metadata": {},
    "outputs": [
     {
@@ -212,7 +212,7 @@
        "5.0"
       ]
      },
-     "execution_count": 3,
+     "execution_count": 5,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -230,7 +230,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 6,
    "metadata": {},
    "outputs": [
     {
@@ -239,7 +239,7 @@
        "5.0"
       ]
      },
-     "execution_count": 5,
+     "execution_count": 6,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -259,14 +259,14 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 7,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "669 ns ± 0.53 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)\n"
+      "677 ns ± 0.576 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)\n"
      ]
     }
    ],
@@ -283,11 +283,17 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "188 ns ± 0.0139 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)\n"
+     ]
+    }
+   ],
    "source": [
     "%timeit hypot(3.0, 4.0)"
    ]
@@ -301,11 +307,17 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "124 ns ± 0.0233 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)\n"
+     ]
+    }
+   ],
    "source": [
     "%timeit math.hypot(3.0, 4.0)"
    ]
@@ -336,7 +348,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 10,
    "metadata": {
     "collapsed": true
    },
@@ -347,16 +359,18 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 11,
    "metadata": {
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "# TODO: Import Numba's just-in-time compiler function\n",
+    "from numba import jit\n",
     "import random\n",
     "\n",
     "# TODO: Use the Numba compiler to compile this function\n",
+    "@jit\n",
     "def monte_carlo_pi(nsamples):\n",
     "    acc = 0\n",
     "    for i in range(nsamples):\n",
@@ -369,7 +383,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 12,
    "metadata": {
     "collapsed": true
    },
@@ -384,22 +398,34 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "10.5 ms ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
+     ]
+    }
+   ],
    "source": [
     "%timeit monte_carlo_pi(nsamples)"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "349 ms ± 4.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
+     ]
+    }
+   ],
    "source": [
     "%timeit monte_carlo_pi.py_func(nsamples)"
    ]
@@ -419,12 +445,108 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 15,
    "metadata": {
-    "collapsed": true,
     "scrolled": true
    },
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "hypot (float64, float64)\n",
+      "--------------------------------------------------------------------------------\n",
+      "# File: <ipython-input-4-1ae6c289554e>\n",
+      "# --- LINE 6 --- \n",
+      "# label 0\n",
+      "\n",
+      "@jit\n",
+      "\n",
+      "# --- LINE 7 --- \n",
+      "\n",
+      "def hypot(x, y):\n",
+      "\n",
+      "    # --- LINE 8 --- \n",
+      "\n",
+      "    # Implementation from https://en.wikipedia.org/wiki/Hypot\n",
+      "\n",
+      "    # --- LINE 9 --- \n",
+      "    #   x = arg(0, name=x)  :: float64\n",
+      "    #   y = arg(1, name=y)  :: float64\n",
+      "    #   $0.1 = global(abs: <built-in function abs>)  :: Function(<built-in function abs>)\n",
+      "    #   $0.3 = call $0.1(x, func=$0.1, args=[Var(x, <ipython-input-4-1ae6c289554e> (9))], kws=(), vararg=None)  :: (float64,) -> float64\n",
+      "    #   del x\n",
+      "    #   del $0.1\n",
+      "    #   x.1 = $0.3  :: float64\n",
+      "    #   del $0.3\n",
+      "\n",
+      "    x = abs(x);\n",
+      "\n",
+      "    # --- LINE 10 --- \n",
+      "    #   $0.4 = global(abs: <built-in function abs>)  :: Function(<built-in function abs>)\n",
+      "    #   $0.6 = call $0.4(y, func=$0.4, args=[Var(y, <ipython-input-4-1ae6c289554e> (9))], kws=(), vararg=None)  :: (float64,) -> float64\n",
+      "    #   del y\n",
+      "    #   del $0.4\n",
+      "    #   y.1 = $0.6  :: float64\n",
+      "    #   del $0.6\n",
+      "\n",
+      "    y = abs(y);\n",
+      "\n",
+      "    # --- LINE 11 --- \n",
+      "    #   $0.7 = global(min: <built-in function min>)  :: Function(<built-in function min>)\n",
+      "    #   $0.10 = call $0.7(x.1, y.1, func=$0.7, args=[Var(x.1, <ipython-input-4-1ae6c289554e> (9)), Var(y.1, <ipython-input-4-1ae6c289554e> (10))], kws=(), vararg=None)  :: (float64, float64) -> float64\n",
+      "    #   del $0.7\n",
+      "    #   t = $0.10  :: float64\n",
+      "    #   del $0.10\n",
+      "\n",
+      "    t = min(x, y);\n",
+      "\n",
+      "    # --- LINE 12 --- \n",
+      "    #   $0.11 = global(max: <built-in function max>)  :: Function(<built-in function max>)\n",
+      "    #   $0.14 = call $0.11(x.1, y.1, func=$0.11, args=[Var(x.1, <ipython-input-4-1ae6c289554e> (9)), Var(y.1, <ipython-input-4-1ae6c289554e> (10))], kws=(), vararg=None)  :: (float64, float64) -> float64\n",
+      "    #   del y.1\n",
+      "    #   del x.1\n",
+      "    #   del $0.11\n",
+      "    #   x.2 = $0.14  :: float64\n",
+      "    #   del $0.14\n",
+      "\n",
+      "    x = max(x, y);\n",
+      "\n",
+      "    # --- LINE 13 --- \n",
+      "    #   $0.17 = t / x.2  :: float64\n",
+      "    #   del t\n",
+      "    #   t.1 = $0.17  :: float64\n",
+      "    #   del $0.17\n",
+      "\n",
+      "    t = t / x;\n",
+      "\n",
+      "    # --- LINE 14 --- \n",
+      "    #   $0.19 = global(math: <module 'math' from '/home/appuser/Miniconda3/lib/python3.6/lib-dynload/math.cpython-36m-x86_64-linux-gnu.so'>)  :: Module(<module 'math' from '/home/appuser/Miniconda3/lib/python3.6/lib-dynload/math.cpython-36m-x86_64-linux-gnu.so'>)\n",
+      "    #   $0.20 = getattr(value=$0.19, attr=sqrt)  :: Function(<built-in function sqrt>)\n",
+      "    #   del $0.19\n",
+      "    #   $const0.21 = const(int, 1)  :: int64\n",
+      "    #   $0.24 = t.1 * t.1  :: float64\n",
+      "    #   del t.1\n",
+      "    #   $0.25 = $const0.21 + $0.24  :: float64\n",
+      "    #   del $const0.21\n",
+      "    #   del $0.24\n",
+      "    #   $0.26 = call $0.20($0.25, func=$0.20, args=[Var($0.25, <ipython-input-4-1ae6c289554e> (14))], kws=(), vararg=None)  :: (float64,) -> float64\n",
+      "    #   del $0.25\n",
+      "    #   del $0.20\n",
+      "    #   $0.27 = x.2 * $0.26  :: float64\n",
+      "    #   del x.2\n",
+      "    #   del $0.26\n",
+      "    #   $0.28 = cast(value=$0.27)  :: float64\n",
+      "    #   del $0.27\n",
+      "    #   return $0.28\n",
+      "\n",
+      "    return x * math.sqrt(1+t*t)\n",
+      "\n",
+      "\n",
+      "================================================================================\n"
+     ]
+    }
+   ],
    "source": [
     "hypot.inspect_types()"
    ]
@@ -447,11 +569,20 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'value'"
+      ]
+     },
+     "execution_count": 16,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
    "source": [
     "@jit\n",
     "def cannot_compile(x):\n",
@@ -469,11 +600,24 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 17,
+   "metadata": {},
+   "outputs": [
+    {
+     "ename": "TypingError",
+     "evalue": "Failed in nopython mode pipeline (step: nopython frontend)\nInternal error at <numba.typeinfer.ArgConstraint object at 0x7fe8ec48d898>:\n--%<----------------------------------------------------------------------------\nTraceback (most recent call last):\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/errors.py\", line 599, in new_error_context\n    yield\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/typeinfer.py\", line 199, in __call__\n    assert ty.is_precise()\nAssertionError\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/typeinfer.py\", line 142, in propagate\n    constraint(typeinfer)\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/typeinfer.py\", line 200, in __call__\n    typeinfer.add_type(self.dst, ty, loc=self.loc)\n  File \"/home/appuser/Miniconda3/lib/python3.6/contextlib.py\", line 99, in __exit__\n    self.gen.throw(type, value, traceback)\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/errors.py\", line 607, in new_error_context\n    six.reraise(type(newerr), newerr, tb)\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/six.py\", line 659, in reraise\n    raise value\nnumba.errors.InternalError: \n[1] During: typing of argument at <ipython-input-17-d3b98ca43e8a> (3)\n--%<----------------------------------------------------------------------------\n\n\nFile \"<ipython-input-17-d3b98ca43e8a>\", line 3:\ndef cannot_compile(x):\n    return x['key']\n    ^\n\nThis error may have been caused by the following argument(s):\n- argument 0: cannot determine Numba type of <class 'dict'>\n\nThis is not usually a problem with Numba itself but instead often caused by\nthe use of unsupported features or an issue in resolving types.\n\nTo see Python/NumPy features supported by the latest release of Numba visit:\nhttp://numba.pydata.org/numba-doc/dev/reference/pysupported.html\nand\nhttp://numba.pydata.org/numba-doc/dev/reference/numpysupported.html\n\nFor more information about typing errors and how to debug them visit:\nhttp://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#my-code-doesn-t-compile\n\nIf you think your code should work with Numba, please report the error message\nand traceback, along with a minimal reproducer at:\nhttps://github.com/numba/numba/issues/new\n",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mTypingError\u001b[0m                               Traceback (most recent call last)",
+      "\u001b[0;32m<ipython-input-17-d3b98ca43e8a>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      3\u001b[0m     \u001b[0;32mreturn\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'key'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 5\u001b[0;31m \u001b[0mcannot_compile\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'value'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+      "\u001b[0;32m/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/dispatcher.py\u001b[0m in \u001b[0;36m_compile_for_args\u001b[0;34m(self, *args, **kws)\u001b[0m\n\u001b[1;32m    346\u001b[0m                 \u001b[0me\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpatch_message\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmsg\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    347\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 348\u001b[0;31m             \u001b[0merror_rewrite\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'typing'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    349\u001b[0m         \u001b[0;32mexcept\u001b[0m \u001b[0merrors\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mUnsupportedError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    350\u001b[0m             \u001b[0;31m# Something unsupported is present in the user code, add help info\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/dispatcher.py\u001b[0m in \u001b[0;36merror_rewrite\u001b[0;34m(e, issue_type)\u001b[0m\n\u001b[1;32m    313\u001b[0m                 \u001b[0;32mraise\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    314\u001b[0m             \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 315\u001b[0;31m                 \u001b[0mreraise\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    316\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    317\u001b[0m         \u001b[0margtypes\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/six.py\u001b[0m in \u001b[0;36mreraise\u001b[0;34m(tp, value, tb)\u001b[0m\n\u001b[1;32m    656\u001b[0m             \u001b[0mvalue\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtp\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    657\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__traceback__\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mtb\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 658\u001b[0;31m             \u001b[0;32mraise\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwith_traceback\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtb\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    659\u001b[0m         \u001b[0;32mraise\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    660\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;31mTypingError\u001b[0m: Failed in nopython mode pipeline (step: nopython frontend)\nInternal error at <numba.typeinfer.ArgConstraint object at 0x7fe8ec48d898>:\n--%<----------------------------------------------------------------------------\nTraceback (most recent call last):\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/errors.py\", line 599, in new_error_context\n    yield\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/typeinfer.py\", line 199, in __call__\n    assert ty.is_precise()\nAssertionError\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/typeinfer.py\", line 142, in propagate\n    constraint(typeinfer)\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/typeinfer.py\", line 200, in __call__\n    typeinfer.add_type(self.dst, ty, loc=self.loc)\n  File \"/home/appuser/Miniconda3/lib/python3.6/contextlib.py\", line 99, in __exit__\n    self.gen.throw(type, value, traceback)\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/errors.py\", line 607, in new_error_context\n    six.reraise(type(newerr), newerr, tb)\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/six.py\", line 659, in reraise\n    raise value\nnumba.errors.InternalError: \n[1] During: typing of argument at <ipython-input-17-d3b98ca43e8a> (3)\n--%<----------------------------------------------------------------------------\n\n\nFile \"<ipython-input-17-d3b98ca43e8a>\", line 3:\ndef cannot_compile(x):\n    return x['key']\n    ^\n\nThis error may have been caused by the following argument(s):\n- argument 0: cannot determine Numba type of <class 'dict'>\n\nThis is not usually a problem with Numba itself but instead often caused by\nthe use of unsupported features or an issue in resolving types.\n\nTo see Python/NumPy features supported by the latest release of Numba visit:\nhttp://numba.pydata.org/numba-doc/dev/reference/pysupported.html\nand\nhttp://numba.pydata.org/numba-doc/dev/reference/numpysupported.html\n\nFor more information about typing errors and how to debug them visit:\nhttp://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#my-code-doesn-t-compile\n\nIf you think your code should work with Numba, please report the error message\nand traceback, along with a minimal reproducer at:\nhttps://github.com/numba/numba/issues/new\n"
+     ]
+    }
+   ],
    "source": [
     "@jit(nopython=True)\n",
     "def cannot_compile(x):\n",
@@ -498,11 +642,24 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 18,
+   "metadata": {},
+   "outputs": [
+    {
+     "ename": "TypingError",
+     "evalue": "Failed in nopython mode pipeline (step: nopython frontend)\nInternal error at <numba.typeinfer.ArgConstraint object at 0x7fe8ec43ada0>:\n--%<----------------------------------------------------------------------------\nTraceback (most recent call last):\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/errors.py\", line 599, in new_error_context\n    yield\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/typeinfer.py\", line 199, in __call__\n    assert ty.is_precise()\nAssertionError\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/typeinfer.py\", line 142, in propagate\n    constraint(typeinfer)\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/typeinfer.py\", line 200, in __call__\n    typeinfer.add_type(self.dst, ty, loc=self.loc)\n  File \"/home/appuser/Miniconda3/lib/python3.6/contextlib.py\", line 99, in __exit__\n    self.gen.throw(type, value, traceback)\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/errors.py\", line 607, in new_error_context\n    six.reraise(type(newerr), newerr, tb)\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/six.py\", line 659, in reraise\n    raise value\nnumba.errors.InternalError: \n[1] During: typing of argument at <ipython-input-18-598d254e6e01> (5)\n--%<----------------------------------------------------------------------------\n\n\nFile \"<ipython-input-18-598d254e6e01>\", line 5:\ndef cannot_compile(x):\n    return x['key']\n    ^\n\nThis error may have been caused by the following argument(s):\n- argument 0: cannot determine Numba type of <class 'dict'>\n\nThis is not usually a problem with Numba itself but instead often caused by\nthe use of unsupported features or an issue in resolving types.\n\nTo see Python/NumPy features supported by the latest release of Numba visit:\nhttp://numba.pydata.org/numba-doc/dev/reference/pysupported.html\nand\nhttp://numba.pydata.org/numba-doc/dev/reference/numpysupported.html\n\nFor more information about typing errors and how to debug them visit:\nhttp://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#my-code-doesn-t-compile\n\nIf you think your code should work with Numba, please report the error message\nand traceback, along with a minimal reproducer at:\nhttps://github.com/numba/numba/issues/new\n",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mTypingError\u001b[0m                               Traceback (most recent call last)",
+      "\u001b[0;32m<ipython-input-18-598d254e6e01>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      5\u001b[0m     \u001b[0;32mreturn\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'key'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      6\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m \u001b[0mcannot_compile\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'value'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+      "\u001b[0;32m/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/dispatcher.py\u001b[0m in \u001b[0;36m_compile_for_args\u001b[0;34m(self, *args, **kws)\u001b[0m\n\u001b[1;32m    346\u001b[0m                 \u001b[0me\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpatch_message\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmsg\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    347\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 348\u001b[0;31m             \u001b[0merror_rewrite\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'typing'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    349\u001b[0m         \u001b[0;32mexcept\u001b[0m \u001b[0merrors\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mUnsupportedError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    350\u001b[0m             \u001b[0;31m# Something unsupported is present in the user code, add help info\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/dispatcher.py\u001b[0m in \u001b[0;36merror_rewrite\u001b[0;34m(e, issue_type)\u001b[0m\n\u001b[1;32m    313\u001b[0m                 \u001b[0;32mraise\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    314\u001b[0m             \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 315\u001b[0;31m                 \u001b[0mreraise\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    316\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    317\u001b[0m         \u001b[0margtypes\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/six.py\u001b[0m in \u001b[0;36mreraise\u001b[0;34m(tp, value, tb)\u001b[0m\n\u001b[1;32m    656\u001b[0m             \u001b[0mvalue\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtp\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    657\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__traceback__\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mtb\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 658\u001b[0;31m             \u001b[0;32mraise\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwith_traceback\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtb\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    659\u001b[0m         \u001b[0;32mraise\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    660\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;31mTypingError\u001b[0m: Failed in nopython mode pipeline (step: nopython frontend)\nInternal error at <numba.typeinfer.ArgConstraint object at 0x7fe8ec43ada0>:\n--%<----------------------------------------------------------------------------\nTraceback (most recent call last):\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/errors.py\", line 599, in new_error_context\n    yield\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/typeinfer.py\", line 199, in __call__\n    assert ty.is_precise()\nAssertionError\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/typeinfer.py\", line 142, in propagate\n    constraint(typeinfer)\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/typeinfer.py\", line 200, in __call__\n    typeinfer.add_type(self.dst, ty, loc=self.loc)\n  File \"/home/appuser/Miniconda3/lib/python3.6/contextlib.py\", line 99, in __exit__\n    self.gen.throw(type, value, traceback)\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/errors.py\", line 607, in new_error_context\n    six.reraise(type(newerr), newerr, tb)\n  File \"/home/appuser/Miniconda3/lib/python3.6/site-packages/numba/six.py\", line 659, in reraise\n    raise value\nnumba.errors.InternalError: \n[1] During: typing of argument at <ipython-input-18-598d254e6e01> (5)\n--%<----------------------------------------------------------------------------\n\n\nFile \"<ipython-input-18-598d254e6e01>\", line 5:\ndef cannot_compile(x):\n    return x['key']\n    ^\n\nThis error may have been caused by the following argument(s):\n- argument 0: cannot determine Numba type of <class 'dict'>\n\nThis is not usually a problem with Numba itself but instead often caused by\nthe use of unsupported features or an issue in resolving types.\n\nTo see Python/NumPy features supported by the latest release of Numba visit:\nhttp://numba.pydata.org/numba-doc/dev/reference/pysupported.html\nand\nhttp://numba.pydata.org/numba-doc/dev/reference/numpysupported.html\n\nFor more information about typing errors and how to debug them visit:\nhttp://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#my-code-doesn-t-compile\n\nIf you think your code should work with Numba, please report the error message\nand traceback, along with a minimal reproducer at:\nhttps://github.com/numba/numba/issues/new\n"
+     ]
+    }
+   ],
    "source": [
     "from numba import njit\n",
     "\n",
@@ -553,11 +710,20 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 19,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([11, 22, 33, 44])"
+      ]
+     },
+     "execution_count": 19,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
    "source": [
     "import numpy as np\n",
     "\n",
@@ -576,11 +742,20 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 20,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([101, 102, 103, 104])"
+      ]
+     },
+     "execution_count": 20,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
    "source": [
     "np.add(a, 100) # Returns a new NumPy array resulting from adding 100 to every element in `a`"
    ]
@@ -594,11 +769,33 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 21,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "c: [[ 0  1  2  3]\n",
+      " [ 4  5  6  7]\n",
+      " [ 8  9 10 11]\n",
+      " [12 13 14 15]]\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "array([[10, 21, 32, 43],\n",
+       "       [14, 25, 36, 47],\n",
+       "       [18, 29, 40, 51],\n",
+       "       [22, 33, 44, 55]])"
+      ]
+     },
+     "execution_count": 21,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
    "source": [
     "c = np.arange(4*4).reshape((4,4))\n",
     "print('c:', c)\n",
@@ -624,7 +821,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 26,
    "metadata": {
     "collapsed": true
    },
@@ -639,11 +836,20 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 27,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])"
+      ]
+     },
+     "execution_count": 27,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
    "source": [
     "nums = np.arange(10)\n",
     "add_ten(nums) # pass the whole array into the ufunc, it performs the operation on each element"
@@ -665,7 +871,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 28,
    "metadata": {
     "collapsed": true
    },
@@ -678,11 +884,20 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 29,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([11, 22, 33, 44])"
+      ]
+     },
+     "execution_count": 29,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
    "source": [
     "add_ufunc(a, b)"
    ]
@@ -707,22 +922,34 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 30,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "997 ns ± 0.647 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)\n"
+     ]
+    }
+   ],
    "source": [
     "%timeit np.add(b, c)   # NumPy on CPU"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 31,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "680 µs ± 1.71 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
+     ]
+    }
+   ],
    "source": [
     "%timeit add_ufunc(b, c) # Numba on GPU"
    ]
@@ -746,7 +973,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 33,
    "metadata": {
     "collapsed": true
    },
@@ -764,11 +991,20 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 34,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([0.00926774], dtype=float32)"
+      ]
+     },
+     "execution_count": 34,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
    "source": [
     "import numpy as np\n",
     "# Evaluate the Gaussian a million times!\n",
@@ -782,11 +1018,17 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 35,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "63.8 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
+     ]
+    }
+   ],
    "source": [
     "import scipy.stats # for definition of gaussian distribution, so we can compare CPU to GPU time\n",
     "norm_pdf = scipy.stats.norm\n",
@@ -795,11 +1037,17 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 36,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "4.15 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
+     ]
+    }
+   ],
    "source": [
     "%timeit gaussian_pdf(x, mean, sigma)"
    ]
@@ -815,7 +1063,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 37,
    "metadata": {
     "collapsed": true
    },
@@ -829,11 +1077,17 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 38,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "22.4 ms ± 51.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
+     ]
+    }
+   ],
    "source": [
     "%timeit cpu_gaussian_pdf(x, mean, sigma)"
    ]
@@ -860,7 +1114,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 40,
    "metadata": {
     "collapsed": true
    },
@@ -884,7 +1138,28 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 44,
+   "metadata": {},
+   "outputs": [
+    {
+     "ename": "TypeError",
+     "evalue": "'DeviceFunctionTemplate' object is not callable",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mTypeError\u001b[0m                                 Traceback (most recent call last)",
+      "\u001b[0;32m<ipython-input-44-edcd96b4594f>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mpolar_to_cartesian\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1.0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m2.0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+      "\u001b[0;31mTypeError\u001b[0m: 'DeviceFunctionTemplate' object is not callable"
+     ]
+    }
+   ],
+   "source": [
+    "polar_to_cartesian(1.0, 2.0)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 45,
    "metadata": {
     "collapsed": true
    },
@@ -899,11 +1174,21 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 46,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([2.2186031, 1.3172561, 1.4999884, ..., 0.7888327, 2.3476734,\n",
+       "       1.5603681], dtype=float32)"
+      ]
+     },
+     "execution_count": 46,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
    "source": [
     "polar_distance(rho1, theta1, rho2, theta2)"
    ]
@@ -943,11 +1228,30 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 47,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[<matplotlib.lines.Line2D at 0x7fe8b81bd400>]"
+      ]
+     },
+     "execution_count": 47,
+     "metadata": {},
+     "output_type": "execute_result"
+    },
+    {
+     "data": {
+      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAD8CAYAAAB5Pm/hAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAIABJREFUeJzt3Xl8FEX6P/BPkZNAEggECCSQgMgh\nt+HGEwEB12td1xNcD9br67qXP1yP1d11dV2PXV2PxRNdxRMVAUFElNNAuM+QQAIEQk5IQiAn9fsj\nnZmaZO7p7pquft6vV17p6anprrme6a6ueopxzkEIIURd7WRXgBBCiLEo0BNCiOIo0BNCiOIo0BNC\niOIo0BNCiOIo0BNCiOIo0BNCiOIo0BNCiOIo0BNCiOIiZVcAALp27crT09NlV4MQQixl8+bNZZzz\nZF/lwiLQp6enIzs7W3Y1CCHEUhhjh/wpR003hBCiOAr0hBCiOAr0hBCiOAr0hBCiOAr0hBCiOAr0\nhBCiOAr0hBCiOAr0rZRU1+Lb3cdlV8PyGprO4q+L9+C55TkoPHFadnWUsDa3DAVlNbKrYXlVtQ24\n4uU1mPD0SqzJLZVdHVOExYCpcPHyylw8v2I/AODui/rhgcnnIC6aXqJgzFt9EG+tzQcA/GdVHgqe\nmSm5RtZVXduABz/ahpX7SgCAXssQlJ2qQ+bfvnPcvvWtjbZ4PemIXrO/uNoR5AHg9R8P4Lnl+708\ngnjz7Z5il9sPfrRVUk2s79llOY4gT0IjBnk7oUCvuf3dTW3Wvb0uX0JN1LD9yEmX219uOyapJtZW\nUVOP939yHeWePneJpNoQq6JAryk8cUZ2FZRxqq7R7fqDpadMron1bcwvd7v+0+wjJtfE+jjnsqsg\nDQV6orvp/17tdv2lz/+Is2ft+2ULhqeX64+f7TC3Igr4NLvQ7frWZ0wqokAPYOvhEx7vW7m32ON9\nxL0jFZ7Pjqg5LDBzP6eArpdVOe6vczz25S6Ta2I+CvQArnl1vcf77phP6ZMD0dB01uv96w+4b4og\n7lXVum8GA4DdxypNrIm1cc7xzS7P3abrGptMrI35KNATXX2+2f3pcYvvqfeI377f5/1scuZLa02q\nifXllni/PlRSVWdSTeSgQE909d1eCuR6uf1dOpvUy45C72c/Fzy7yqSayEGB3g8/HaTmBn9wzvEd\nXdMgYegPn26XXQWpKND74YZ5P8mugiV46tVAiBWUnVK3+cb2gX5/cbXsKihj3YEy2VWwnXKFg5PZ\nbB3oGWNpjLFVjLG9jLHdjLHfaOuTGGMrGGO52v/O2nrGGHuJMZbHGNvBGBtl9JMIxdQX3ff5bk31\nq/J6WLqzyK9ydh644i9Pg85a+yWdbfrkqydYi5OnGwyuiTz+HNE3Avg953wQgHEA7mOMDQYwF8BK\nznl/ACu12wAwHUB/7W8OgNd0r7UEFJt8a2jy70XydWGMADNfWuNXuTwfvUkI8NHGw36VU7mJ1meg\n55wXcc63aMvVAPYC6AXgKgDztWLzAVytLV8F4D3e7CcAnRhjKbrX3GT7jlMTj178PcKys0PllNpZ\nL19v9+9MU2UBtdEzxtIBjASQBaA757wIaP4xANBNK9YLgJiIo1BbZxl7/jKtzbqHPrP3VftgxUa1\n/Yg9+fUeCTWxvuszU2VXwZI2FlTIroJ0fgd6xlhHAJ8DeJBzXuWtqJt1bc7pGWNzGGPZjLHs0lI5\nyf83H2qb+mDR/RPd5qDfX0ynyN6s3u/+Pfz+9xe3WbfzaCXlvAnCs9cNd7u+tFrdi4hGGZaaiCev\nPE92NUzjV6BnjEWhOch/wDlfqK0ubmmS0f63jJQpBJAmPDwVQJsctZzzeZzzTM55ZnJycrD1D8my\nXW1P6YaldpJQE+tzF+gfmNwfPTu1x
zndOkqokVpev6W5T8MF/bu2uY8CfeA+vGscJp7Tpc36EzX1\nEmpjPH963TAAbwHYyzl/QbhrEYDZ2vJsAF8J62dpvW/GAahsaeIJN2+soQRbejleVdtm3dTB3QEA\nX9w7wezqKOfyIc2XuV6+cWSb+yoUDU56aHJz5pj31HR0jInEOd3i29y386iaHQX8OaKfCOBWAJcy\nxrZpfzMAPANgCmMsF8AU7TYALAVwEEAegDcA3Kt/tY1xzUjnpYTbJ2ZIrIn1LN7R9rd8SK9EAEB8\nbBReucm1l+1Z6sbk0ROLdnu8r1NcNP78s8Eu6x78mGbv8sRdV97ICGfY69Wpvct9s97eaHidZPCn\n181azjnjnA/jnI/Q/pZyzss555M55/21/xVaec45v49z3o9zPpRzbpmEHed2d/7Cd4qLanM/9RYJ\n3sxhrh2vtlMXS4/eXV/g9f6rR7j2bSg7RUf0nuwp8nY5EZh/+xiTaiKX7UfGim6flO5YnnNhX/Tp\nEudy//HKts0TBKhxM7jHXW8bEeX591/rdvl2zF1/B+LOlf9Z53L7qWuGuNy2y/UjCvSa8/t0Rkxk\nhON2bFQE5t2a6VJG9Qx3wSp20z7/2d3e2+Vf/eGAUdVRTuvPYaKbs03in5vH9pFdBSlsG+hP17se\nhd40pnebMud2t8evvRHSu3aQXQVltI+O8F2IBG1A97YXZVVj20B/batZpVq3IQMAo1PkoHWMaTsW\ngZBwNG/W+bKrYDjbBvrWKQ1io+ioKVitp2jrS0fzurnufPejYTc/epnJNVFX7yTXa3Fn6tVLYGjb\nQE/088/lOS633fVYAoAJ/doOUCGuSqpdr3cMT3M/gK9LxxiX25UKZ14MVm2Da8CeNd59+3zrM/e/\nLVEvRQcFeh/ev8Me3a/09OrN7k+Fu8W7BqdG6q7axpJW4xFuGJ3moaSrNXly0oiEs+daHYD8fuoA\nvx73QZZ/2S6thAI9gAvP9ZyCYWI/165tqg6R1lOPxFi361t/0eoaKdC39sqqPJfbURH+fUUb/UwR\nbSdvrnUd+Z7Y3r69lSjQA7hlbNseNy1aX48tor70QUtr1Rb6s5fXSqpJ+Ap28NODH2/TuSZEJRTo\nAUw9r4fH+1q33/k78w/x7WBZjewqEGILtgz0oaTIVfFCjZ5+fWFf2VUgJGCPzhwkuwqGsmWg95X/\nwhuaBs+7udMHer3/tgnp5lTEgk6eDqzZ5tO7xxtUE/V8NGec1/vvvMD1AKVesetHtgz09dTbwzC+\nBpnNpkDv0bd7XPP/eOqm2mJ0epKR1VHKCA/dVD1ZuKXQoJrIYctAH2gPhTUPXWJQTayvdb9vXzJo\nMJXfljxwgewqKCPQQe4fbTriu5CF2DLQ3/7upoDKp3Zu77uQTf3jmxzfhUhQWudK9yXQH107ERMW\n+mPbkZMG1UQOWwb6QHvOUM4bz9YfKJNdBaI5SaNjiQe2DPREP+K4gsu9dFP1hEbHOm04UB7S46vO\nUKBv4W5mKTuzfaD/29VDfBdqRcWkR3p45udDA36MuykI7eqLrUdDevx1r2/QqSbW99y31KQosn2g\nP69nQsCP+WpbaF9IVQUzxPyJrz3Pj0pIsF5ZFfjENoH2zLES2wf6kb07B/yYvy3Za0BNrM/faxkX\nCbmFqF3ZvUEp/h2A0LgE3zylem7tzdmZvgtZlO0CfaUO7ZiUBiE0//rlCNlVCHvj+/qX0vmxKwYb\nXBPru2NShl/lurZK/awS2wV6CtLyxcXQJC++3HWhf8Epoh31CPPF37Mjldku0NPVeP0Ee1E60D7N\ndtDUKv9SUodoSTWxvlByWYl2HVUn3YntAv2BUsqYqJd31uf7LuSH1kHOjhpadTOlH8PgZR86oct2\n8hXKrmq7QD/77Y2O5f7dOvr9uC/vm2hEdSytulafZrDymjpdtmNlRypO67Id+tHU76z9T1/s1GU7\n4
cB2gV60+IFJfpdVuetVsAp0OuKh1jRAjM+htLvvDSEzqyrahfD6jc1wJorT60AmHNg60IdyekyD\npoBvdh13LG99bErQ2zlLkR7Zhyocy4NS4oPejmo5WoIRShv9I4rmpbd1oA9F6xnm7a5zCBcPT9RQ\nX/pdR51H4tMGB55KooVqedRDNemcrr4LCdopmteKAj2R4ivhmsctb2VJrEl4WLDxsGN5SGpiQI/t\nLczF+zwN/XcxY2hKQOX7dInzXciCbBXoQ71I8ws/R9gR34b2cgaziprgJsRWVVJcYGdHlwxwjjSu\noSZFPL9iv2P5mpG9AnpsfGzgaTyswFaBvrYhtNPaZ68b5lheR+l5Q6LoGbIuxB9Bf/xuygCDamJN\nG/Od1zvaR1M3VcBmgT7U4CLmcrn/w60h1sbeKMe/Z4H2Gkn0MeUgIRToCSFEcfYK9KBITwixH5+B\nnjH2NmOshDG2S1j3BGPsKGNsm/Y3Q7jvYcZYHmMshzE2zaiKB+Pkabropxe98okQEs72HVdjAJo/\nR/TvArjczfoXOecjtL+lAMAYGwzgBgDnaY95lTEWNldDxMEkz1wb+GxIxGnBpsO+CxG/6DFkPybS\nVifnhvr9lHMdy6cV6cXk89PBOV8NoMJXOc1VAD7inNdxzvMB5AEYE0L9dDXn/c2O5fP7BD7hCHF6\n5AvHCR7+EcQUgsTpQOmpkLdBKTr00yEm0rFcUlXrpaR1hHIYcD9jbIfWtNMSNXsBOCKUKdTWhZ3+\n3YMfZt6C0iA0y+jqf3I4T/SYEMaqlu8uDnkbf5jm7GL55pqDIW/Pznp2au9YfvqbfRJrop9gA/1r\nAPoBGAGgCMDz2np3VzvdnpcyxuYwxrIZY9mlpaVBVkOu6lr7BifRGCERVCBuHNPbsVysyJFTMMpO\nObN3Tujn38xSrYl97+081eUSYbL5tKT2Xkp6dvkQZwqKQ+X6ZBWVLahAzzkv5pw3cc7PAngDzuaZ\nQgBpQtFUAMc8bGMe5zyTc56ZnJzsrkjYi4qgdtFQXDm8p2PZznnNxAvbH9w5Nqht2Pn1Ey3b7Uy0\nNyqI+aBVFVSkYoyJCSSuAdDSYLsIwA2MsRjGWAaA/gA2tn68lSXEOtvvqF9+aMYLR6/7i6sl1kSu\n+RsOOZaDHUjWRJEeALByr7MZrEdirMSahBd/ulcuALABwADGWCFj7A4AzzLGdjLGdgC4BMBvAYBz\nvhvAJwD2AFgG4D7OuVIN2aOEi7jiUGsSmoVbCmVXwdJowpFmYi+ZW8b2kViT8BLpqwDn/EY3q9/y\nUv4pAE+FUqlw9vKNIzH0iW8BAAXl6kw1JtuqHGtepwkX4pkmaZaWpGYmymBQI3OAxOx2L67IlVgT\nQpwodxDxxjaBXq95JEVnaPIRQpR3ut76UwraJtB/uJFGchJCAqfCeBnbBPpV+5xtwN3iYyTWhBAn\nI840ib5UuNBtm0BfecaZ0CyZAn1IaKCYfnYerZRdBeLD/A0FsqsQMtsE+k0FJxzL790eNul3LOms\nMFHXp3ePl1cRBdQJk3nreQBCZwqhuViYnvFULbXRW1KXjnREH4ojJ5zDwnt1Cm6YOWm2cm+JY/nR\nmYN0227VGesHJ5nemj3asZyvQBoEWwZ6PalwoSZQf1/qzKUSGRFat77BKQmhVsfSPshyjortrWO/\n70Xbj+q2LTuKEKZzXL3f+mM8KNCHqEaBrleBWn+g3LEc1S60j9CoPvZOr1stNAuMDDE3yze/ucCx\nTAPQiIgCfRD6dHEeedl9piUxd3cwRqY5g9uqnBIvJYkvg4Szo/wy+43aPiw0sfxMSJhHKNAH5dZx\nzhwaNo/ziA5xZqNenZ1t/L96Z1Oo1SEaOwb6137McyxHtqORwiIK9EG4dbwz0FPWwNBk0kxfRCfr\n8pxNinpe2FYBBfogxEQ6p8E9UWOvCcf1ngkqknL6E50crnA23VD
POle2+JYZ2TNG7DVhBwU2bBIg\nxOpsEeir64wbyblg4xHfhRRCDVWEWI8tAr0YnV6/ZZS8eijgLF2TIMRybBHoxdA0Io0u/oWChtYT\nu5h0TlfZVdCNLQL9ydPOppsQx/fYXpUwwCc2il5Moq7rR6c5louraiXWJHS2+Kb+IAzEiYmI8FKS\n+CIOBx+b0cVLSRKI685P1X2bdh/MF6oIYdauBRafz8IWgf648GucGBflpSTxRTyyuXRgN122ObBH\nvC7bsZrS6jrHsl4DfNKSnAPQaIxHaLolOLtoNjZZ+7W0RaB/Z12B7CooY+nO447lWcLAMRI48UdT\nr4vcz/58uGO5WoH0usHQa2Ih8QCk0eJnR7YI9MQYek1Ifc/F/XTZjtW0E14/vToJpCTGOpb/++MB\nXbZpBWInAb0GS8VGOZt5rT73OgX6II1JT5JdBWX8bJgzAVVNnX2OQpftKnIsXzE8RZdt9hACfeHJ\nM7ps0wrEvP56dRKIEkZtWz0dOQX6ID04pb/sKiijndA+/bqNjkLFa0cJsfpcOxKPQncW2meawgOl\npxzLRuRPend9ge7bNBMF+iB1jouWXQUl7S+ull0F03ySXWjo9sXcL6rbmF/hWL5jUl+JNQlPFOiD\nJLbZ2TElrFGazvouQ0hrK/c5m27E5ivSjAJ9kJI6OI/odx+zzymy0Spq6nwXIoQEhAJ9kLrFO48a\n3qXum7rZcvik7CoQohzlA33hCePbKbceoeBECAlfygf6D7OcQ5evGdnLkH00WXwwBSHEvZvG9pZd\nBV0oH+jFGHz7xAx5FVHAydP2mk2LEL1G2cqmfKDfcNA5j2RcDCU0C8UOoV92TIiTghPj0HujHzEx\nXH2jdbuEKf+J2C60n7ez+jjmMHLhucmyq6CMhfdO0HV7HWMiHctmXKNSWXehq2Zdo3VHxyof6EVd\nOtIgp1CIibdUmpRBNr0yV7YQRxpbPY+6bJcN6u5YtvKlOJ+BnjH2NmOshDG2S1iXxBhbwRjL1f53\n1tYzxthLjLE8xtgOxlhYzdun1zBzdxptMNJn0bZjjuVrR+l7YXtIrwRdt2clGV076Lo9MY96XYP6\nn0sjdU9wHtFXnTFu7mmj+XNE/y6Ay1utmwtgJee8P4CV2m0AmA6gv/Y3B8Br+lSThIMsYZh5vM4/\nmsNSO+m6vXBXIhxp6/1azprgTB/9xdajum47HIm93sQjcL1d8Owqw7ZtNJ+BnnO+GkBFq9VXAZiv\nLc8HcLWw/j3e7CcAnRhj+qTlC3MWPqvz21GTsiHaYWak74Uh+3q7Y5Kzd5kduv4eEz6X4sQrxCnY\nNvrunPMiAND+t0w11AvAEaFcobZOSdOH9JBdBWXECxcQ9ZqEI5wVlBt3kTQm0tm77Mtt6h/R55U4\nM1dOMfCI3sr0vhjr7qqS228tY2wOYyybMZZdWlrqrkjYu3Wc8xTZBrHJUKlJcY7lJTuLvJRUg1np\nmG1wQI+fhC7UE6iTgFvBBvriliYZ7X/LeWghgDShXCqAY3CDcz6Pc57JOc9MTrZmV72MZOdFNG6L\nxhvjjExzttH/uN+aP/xEjtMWnxTEDMEG+kUAZmvLswF8JayfpfW+GQegsqWJR0U9hCvyP+RQcArF\nkF6JjmVx0mxCfKmutW5vGLP4071yAYANAAYwxgoZY3cAeAbAFMZYLoAp2m0AWArgIIA8AG8AuNeQ\nWocJcc5UsZ2QhGZNbpnsKhAL+XKb20YDIoj0VYBzfqOHuya7KcsB3Bdqpazon8tzcN8l58iuBiGE\ntGGbkbFirw5CCLET2wT6MRlJsqugjGnnURc2QqxE6UBfdsp5Ue/igd28lCSBMDKVBCFEf0oH+uW7\njzuWf5mZ5qUk8YULAwW6KpKjmxB/PHbFYMdypUXz3Sgd6KtrGx3L0ZSjOyQ7jzpz0cdGUl7/UDQI\nCfDemJVpyD6Gpyb6LkT8cm7
3jo5lq06+o3T0o+zz+jlQ6uw+GhdNgT4U4hwJg3sak7VTzLpIQtMt\n3vlafr3dml051Q70FOl182l2oWN59oR0Q/aR2N4ebf/i+Au9c9G36NXZfsm9fjUx3ZDtDugR71iu\nb7LmCHilA30t5eLWzfoDznwiRjWD2SUn/Z6iKseyUWdHd13Q17FcUWPN5oZAjezd2fB9WHUUrtKB\n/oUV+w3fR7JwYbKgrMbw/ansd1MGyK6CKXKLqx3LeueibxEZ4TxTqKixR0qJfsn6TuDizjvrCgzf\nhxGUDvRmeOe20Y7lExa9UBMu9J5pyc7EZrD3NxySWBNj1TY4E5qldo7zUtLeKNCHiHrz6CdCaK+2\n6imyP8pNaEoRc9J/IlxfUc2zy3Icy3a5xhMMilIhah9FPVD0khDrTFPx18V7JNbEWEt2mJvQ9UyD\numl8316XL7sKlkCBPkRpwoQZOcervZQkvoi9USgnPSH6oUCvozV5lF5XL8VV9riASIgZKNDraGN+\n6znUCSFEPgr0OqKZkQgh4cgWgd6M/rWEEHWlJFo7pYQtAv0zPx8muwqEEAt751ejfRcKY8oGenEg\nxSgThkYTQtQVHWHtUGnt2ntRJeSNjjAocZQdPf+L4bKroIzxfbvIrgLxkzVTmTkpG+iJMWKi6COj\nl5dvGim7CsRPSXHRjuV9x6u8lAxPyn5rG89a/Tc4fIizSw3oHu+lJAkn53Tr6LsQ8UvnDs5Af7yy\nVmJNgqNsoH/y692m7cvq7Xe+7Ch0zi7Vw+K9D2Q7Veec9YwbfCzywvXUzGaE4ioK9GFj+e5i0/bV\ns5PawU/MlcIMns3lj9PUTlUsJmszemKc83raZzrB124eZdq+6hqtN8+FsoHeTPUWfOMDsaPQOfVd\nhMHRKbmjM7+/iq/r1sPO17JrR2MnWRf7IGwQJo5RkZmZK634uaRAr4NBKWrPjPT3pfscy+0Nni9W\nvNh79OQZQ/clgzgxuJnmry+Qsl8jnal3nmmaMbtUi5o662UDpUCvgzsuyHAs1yqcEtYMg4UfzfJT\n6qWUqDwjJ8/+hoPqHdHX1Duvd9Sb+AP64nfGz1ynNwr0OmgnNGdQb5/Q9Bd69aj4Uj7+lXmdBESy\nfmCM9PX2Y47lWOr26xW9OjoQA73Rbdh2QmdHoTH6wrls81YfdCyLM2qRtijQ62BU706OZSteqAlX\nX249KrsKJIwVWbA/uywU6HUQKfSj311U6aUkCUSdpAuXhKiGAr3O9hyz3vDocLV0p7lzqxKiKgr0\nOludS9MJ6sXokaOE2IXygf5nw3uaur/VNKk1ISTMRIbyYMZYAYBqAE0AGjnnmYyxJAAfA0gHUADg\nes75idCqGUIdZe1YQbPH95FdBUJIEPQ4or+Ecz6Cc56p3Z4LYCXnvD+AldptaTLTadKRUJwVOrNf\nOaKXxJoQItdX902UXYWgGdF0cxWA+dryfABXG7APr8TgdMPo3mbvXinVtc7RhzGRyrf0mWak0CXX\nSLeOo7MwvQxPM+c9M0Ko31wO4FvG2GbG2BxtXXfOeREAaP+7hbiPgO046uziGE3BKSRMePmG9LJP\nNkSjPTpzkCn7oZz0BAixjR7ARM75McZYNwArGGP7fD5Co/0wzAGA3r31Peq+74Mtum7PzmikrzF6\ndmpvyn7GZCSZsh8S3kI63OWcH9P+lwD4AsAYAMWMsRQA0P6XeHjsPM55Juc8Mzk5OZRqtKFi1kNZ\nZEybliTM5qOqznHmPMfkeGcqZFXHePy/yweavk9usb6/QQd6xlgHxlh8yzKAqQB2AVgEYLZWbDaA\nr0KtJJHn2WU5pu/zqauHmL5PM4i5e6JMmpUsPtZ50l6qYDZQALhxTJrp+1yw8Yjp+wxFKJ+27gDW\nMsa2A9gIYAnnfBmAZwBMYYzlApii3VbenAv7yq6CIbLyK0zfp3gUqpKPNzmDQ0Q7c5rExKa3B
oXy\nMIk/mgmx5k060iKjawfT9xmKoNvoOecHAbSZlJJzXg5gciiVsqLJA7u5ZNMjwVMxPTHgOiWjWcQ8\nTI1n1Qn0mwqcByBnOUc7k0fM5Byvwvh+XUzdZyioS4pOIiPooqVexPbPMoWaG9ZKTo+xXqHpBMUm\n8kiTmsFEdmq6IYJ21DtFN+IB/bvrCmRVQ3dr8+QG+vc2HJK6fz3JPukrr6mXXIPAUKDXidjmKs5l\nSQKX3sXZ/rnveLXEmpBwtWqf2858prHamSYFep2IF2f2FKnZjc0sPRJjHctZ+eo0NxD9HK44LbsK\nlkKBXifxwpX/okrqx6+X03R2RNzonhDruxBxoEBvgPcVaguVrUnVLjgkJLuP0UxugaBAbwAZfc8J\nsZMdhRToA6F0oO+m6MAbQggJhHKBXszncaXJs0sRQkg4Ui7Qi92e+iZTilZCEtubnyJAVQ9M7i+7\nCkFRLtCLl+6uGJ4irR4qECdw+dLCs+vY3ZuzM30XIn6JNClHkd7UC/TC2GgZyY5UcuK0c/TfUJp0\nRDfDUs19LbvHU1dEvcwc5jx4bGyyTu4g9QK97AooRBzta/aBzGiF5/o1u8sop2+FbvoJzcGn6hq9\nlAwv6gV6i00IEM7KTjmP6JnJuXxioyIcy6q9p2YH+o4xoU4kF76mnddd2r6t9LFULtA3Nlno1Q9z\nJdW10vYttoWa/SNjtCmDzQ1OXTqq2824Q7S8H7HsQyek7TtQygX6/cXykmD94vxUafs2woo9xdL2\nnaBYT5Hjlc4fzWnn9ZBYE7WkJsVJ2/fRE9bJt6NcoH/u2/3S9n31yF7S9m2EL7celbbv2MgI34Us\npPJMg2P5nG7yuv1W1zb4LhTmxKa8KyX2rCuptk4GS+UCvUyqTYF34rS8oPDri9SamnH57uOOZfH6\ng9lUSKFdUy93GsEWr/5wQNq+A0WBXkdqtSTLpdpgN5nNYCIV8vuLZ0dNVroiKhEFeh2J1wzPUtZF\nIth5NDyScN3zv82yqxCygrIax3J7iWdHVkKBXlfOSN9IgV43p+ut01853NUo0HSzcq9zdqlOcdES\na2IdFOh1lCLMjLQ2r1RiTdSyeHuR7CqQMHKovMZ3IeKCAr2OOggDU45U0CxTenno8x2yq0DCSJnF\nJuYOBxToDTJv9UHZVSBESduPnJRdBctRNtD/cdoAqfs/S70BCFHS8LROsqsQMGUDfT/J3fOKKs1J\nH1DX2IQcg7vMpXZub+j2CbGSESZnH9WDsoF+Uv+usqtgise/3I1p/1qNkirjflismoObuFdV24Dv\n95nTr59zjm92FqG+0TopfVWkbKC3Wsa+xTuOYVcQfa2zDzVPRC4OIvHlvQ0FKKr0/2JxZISyHxNb\nevCjbbj93WwcO2l8h4Ef95dcMayUAAAPDUlEQVTing+24IUV8lKTyMA5x1tr81EaJmkSLP0Nbmw6\ni0e+2IkjFeGfXKi2oQkPfbYd5cJUh0dPnsEjX+xEY9NZ3P/hVlzx8lq3j80vq8ETi3a7HYTVThul\n5W+3/ZLqWjz+1W7c9vYmr+XEfCJPXzvUv40rYM+xKjy9dK9yqZFFBVr3xOraRqTPXYKhf16OEzX1\n2HbkJJ7/Nsfn47/ZWYQPsw47bv/n+1xkHSx3W7Zl8prjbg4sNhVU4OWVuX7Xe8HGw1i6M/y62jY0\nncXDC3e6HDzllpzCXxfvweinvpNYMydrHfa2svnQCXyQdRgfZB1GwTMz0RBmM768vTYf6/LK8NZt\no7Fwy1F8kl2IiHYMT187DADw0GfbsS6vHDOGOhMzpc9dAgDI+dvliNESe13y3A8AmufDXbyjCK/d\nPArvrC/AxvwKx+POco7ffLQV/ZI7usxr+Un2EXy2uRCf/Ho86hvPYsoLq5u3X1yN9LlLsP3xqUiM\ni8KpukYM+fNyJHWIxsUDkrFwizOh2ZCe4dMmWVJVi6teW
YcP7hyLvskdsS6vDDe/mYVnrxuG6zPT\nXMpuP3ISV72yDs/9Yjiu0zKLvrnmILLyK/DGLPfT6/38tfU409CEByb3R4eYSPzh0+3okRCLrPxy\n3DC6N6Ij2+HNtfl4/44xmPHvNYiPjcKlA5Pxx2kDXbaTV3IKs97Kwlf3T8IDC7Y61v/6Qvk5fA6W\nNgf6af9q/ixU1zVixktrHNeVFm45itiodrhmZC/cf6nrHKkHSk/hng+2AAB+yCnBvFmZjkSCBc/M\nbFP2tx9vb7P/7/YU47lvcxzpGHYdq8Ty3cXY8tgU/JBTggUbD+PTuydg8Y5juP/DrYhox9zm8I+O\nlHOcev3oNMzfcAhAc5K4oU98C6D5x+xnw3viw6zD+MtVQ9o87tpX12HW+HRH8sNb3szCzGEpuHFM\nb8PrzMLhyCUzM5NnZ2cH/Lj1eWW46c0sAMBfrzoPj32123Ff6w+dWVoCtSeTB3bD1SN74f+EL787\n79w2Guf1TMCYv68MuA6DUxLw2i2jsLeoCnf/b4tj/Xu3j8Gstze6fUz/bh2RW3LK7X3ij46ZWr+W\n4hf+2pG9EBMVgQUbnUeWax66BIfKT+OWt7Lw6d3j8eTXu7HraBUA4N6L++HGMb1xwbOrAADf/e4i\nXPbCjx73nfWnyThd3+T4kfXHjiemIiE2yutn4PVbRuHyIeZnXHx44U6X1ypYV4/oiS+3HXPcfnTm\nIPxtyV4AwOzxffC/rMOO9yg+JhLVQc7CNLJ3J2w97L0bZceYSOx6clpQ2w/FofIaXPTPH/wu/8dp\nA/DehgIUVzWfzSfERqKq1vm65D01PejmUcbYZs65z0mBLR3oV+8v9Ri4wjXQW5Gs13LC0ytxzKTe\nS3r5/vcX4dLnPf+AfHnfRIyQ0D1vwcbDeHjhTtP3a6QZQ3vg1ZvPN32/FTX1GPXXFbpuM9jvmL+B\n3tJt9OLk1UQ9/7tzrOwqBMxbkAeAgT3iTaqJq7ho9ZJ/PfeL4VL2m9TBevl1LB3oa+qsn6CJeEZ5\n4fQzMk29ydbbKTbFpJEMC/SMscsZYzmMsTzG2Fwj9rE5DOdsnGryfKAqU3F0cbSkrqq9u8ibcs8o\nFOf9Z8injjEWAeAVANMBDAZwI2NssN77+XxLod6bDFnHWEt3ZAorKn6P29HgMyKBUYcXYwDkcc4P\ncs7rAXwE4CqD9hVW7rpAfvc5VcicW5WEPxk9wazKqEDfC8AR4Xahtk55XSx4oSZcMTo3J0QXRgV6\nd99QlwZXxtgcxlg2Yyy7tFSdSTq6JcT6LkQIISYyKtAXAhCHKaYCOCYW4JzP45xncs4zk5OTg9pJ\nDwqqhBDik1GBfhOA/oyxDMZYNIAbACzSeyfnSuqTTAghVmJIoOecNwK4H8ByAHsBfMI53+39UYG7\nZEBwZwKEEGInhvUF5JwvBbDUqO0DnptuNj1ymZG7tZW+yR1kV4HoJDaqHWobwivxX7AS20dJ3X96\nlzgUlId/1twWlh4Z62nkZHJ8jLkVUdhvLztX6v43/mmy1P3r6eoRPaXuf/1cdV7LjY/IfS7/1yqr\nZyiGmTBjlaUD/bDURPzj56650jO6yj8CHZySENTj/nfHWPxmcuAfoNsnZgS1P1+uHN4TVwwzP9Oi\nSEYvph//eDG2Pz416MffNNaZdvbze8Zj2nndceekDGm5WVr4m6Nl1R8uRv7TMwyujdOHd47FG7My\nMXlgN78fEyG56+0AN9cHNz96Gb797YV4ZMYg3DnJ/+/k5/dM0LNqblk6e2WLyjMNGP5kc05oWZkW\nRXWNTXj+2/2Yt/ogLujfFWtyyxz3dY6LwonTztmg9vxlGgY/vhyAs+6n6xtx7GQt0rvE4cGPt2Hx\nDudkC2Myklzy0IuPW72/FKXVdfj9p805wEf17oQtQqrX3Kemo/8j37g89sO7xqJbfAwu0/LUA8DY\njCS88MsR6NUpPOaKP
VJxGo1nuUvK4A0PX4rjlbUYlJKAgY8tc6zf9MhlqGtswqR/NKcjnnZedwxK\nScC/vmue4KKl+WLOhX1x1wV98fQ3e11y7+/5yzTERTe3aFbXNr9PDy/c6XgPsh+9DLFREThRU489\nRVV4e20+svIrsOoPF+NQeQ0G9khAj8RYzHxpDQ6Vn5aSRtebqS/+iP3Fp7D98alopx3mnalvQsNZ\njpvf+AkF5addPodHKs448tYDwK3j+uC+S84BB8f4p7932fYVw1Lw1DVDcaa+Cd0TYjD1xdXILTmF\nTnFR+O53F2HGv9egpLoO2Y9ehvjYSNw5PxtrcstcvrNFlWfQIyEWAx5dhnptfolVf7jY5b3/z00j\nccUwuWdHALC3qAopibEY8ZcV6J4Qg6w/uTYZH6+sxbinV2JYaiJ2FDpnjxuUkoAv7p2A/cXVGNIz\nMaTR0rZIU2wFe4uqMP3fa7DmoUuQluTMN9KSzrjgmZkuy/54ccV+LN99HMsevNCv8i+tzMWi7cfw\n3e8u8rivBxZsxdGTZ7D50Al8fs8EnN8nPJNg+fNa1TU2YcCjyxAd0Q77n5quy3bf21CAl7/Ps+X1\nH0+vTaCf20As21WEu/+3BTeN7Y2nrh6CjIeXYkD3eCz/rX+f+XBz13vZ6BgTiRd/OULX7VKgD3Ov\nrMpDXHQEfjUxw9AvTGtvrc1HXWMT7r34HMP3ZYT3NhSg8nQD/s9LExfnHL/5aBtuGJOGCf38myT+\nzTUHcZZzzLmwn041Vcffl+7FoJR4XDMy1WW9mZ9b4h4FegvZW1SFnw6W41cGtbUTYoSdhZXYVngS\nt47rI7sqtuVvoKdUi2FgUEoCBgV5AZcQWYamJmKoCT1GSOgs3euGEEKIbxToCSFEcRToCSFEcRTo\nCSFEcRToCSFEcRToCSFEcRToCSFEcRToCSFEcWExMpYxVgrgUJAP7wqgzGcptdBztgd6zvYQynPu\nwzn3OQNTWAT6UDDGsv0ZAqwSes72QM/ZHsx4ztR0QwghiqNATwghilMh0M+TXQEJ6DnbAz1nezD8\nOVu+jZ4QQoh3KhzRE0II8cLSgZ4xdjljLIcxlscYmyu7PoFgjKUxxlYxxvYyxnYzxn6jrU9ijK1g\njOVq/ztr6xlj7CXtue5gjI0StjVbK5/LGJstrD+fMbZTe8xLjEmeUVnDGItgjG1ljC3WbmcwxrK0\n+n/MGIvW1sdot/O0+9OFbTysrc9hjE0T1ofdZ4Ix1okx9hljbJ/2fo9X/X1mjP1W+1zvYowtYIzF\nqvY+M8beZoyVMMZ2CesMf1897cMrzrkl/wBEADgAoC+AaADbAQyWXa8A6p8CYJS2HA9gP4DBAJ4F\nMFdbPxfAP7TlGQC+AcAAjAOQpa1PAnBQ+99ZW+6s3bcRwHjtMd8AmC77eWv1+h2ADwEs1m5/AuAG\nbfl1APdoy/cCeF1bvgHAx9ryYO39jgGQoX0OIsL1MwFgPoA7teVoAJ1Ufp8B9AKQD6C98P7eptr7\nDOBCAKMA7BLWGf6+etqH17rK/hKE8CKPB7BcuP0wgIdl1yuE5/MVgCkAcgCkaOtSAORoy/8FcKNQ\nPke7/0YA/xXW/1dblwJgn7DepZzE55kKYCWASwEs1j7EZQAiW7+vAJYDGK8tR2rlWOv3uqVcOH4m\nACRoQY+1Wq/s+4zmQH9EC16R2vs8TcX3GUA6XAO94e+rp314+7Ny003Lh6lFobbOcrRT1ZEAsgB0\n55wXAYD2v5tWzNPz9ba+0M162f4F4CEAZ7XbXQCc5Jw3arfFejqem3Z/pVY+0NdCpr4ASgG8ozVX\nvckY6wCF32fO+VEAzwE4DKAIze/bZqj9Prcw4331tA+PrBzo3bVDWq4LEWOsI4DPATzIOa/yVtTN\nOh7EemkYY1cAKOGcbxZXuynKfdxnmeeM5iPUUQBe45yPBFCD5tNtTyz/nLU246vQ3Nz
SE0AHANPd\nFFXpffZF6nO0cqAvBJAm3E4FcExSXYLCGItCc5D/gHO+UFtdzBhL0e5PAVCirff0fL2tT3WzXqaJ\nAK5kjBUA+AjNzTf/AtCJMdYyUb1YT8dz0+5PBFCBwF8LmQoBFHLOs7Tbn6E58Kv8Pl8GIJ9zXso5\nbwCwEMAEqP0+tzDjffW0D4+sHOg3AeivXcmPRvNFnEWS6+Q37Qr6WwD2cs5fEO5aBKDlyvtsNLfd\nt6yfpV29HwegUjttWw5gKmOss3YkNRXN7ZdFAKoZY+O0fc0StiUF5/xhznkq5zwdze/X95zzmwGs\nAnCdVqz1c255La7TynNt/Q1ab40MAP3RfOEq7D4TnPPjAI4wxgZoqyYD2AOF32c0N9mMY4zFaXVq\nec7Kvs8CM95XT/vwTOZFGx0uhMxAc2+VAwAekV2fAOs+Cc2nYjsAbNP+ZqC5bXIlgFztf5JWngF4\nRXuuOwFkCtu6HUCe9vcrYX0mgF3aY/6DVhcEJT//i+HsddMXzV/gPACfAojR1sdqt/O0+/sKj39E\ne145EHqZhONnAsAIANnae/0lmntXKP0+A3gSwD6tXu+jueeMUu8zgAVovgbRgOYj8DvMeF897cPb\nH42MJYQQxVm56YYQQogfKNATQojiKNATQojiKNATQojiKNATQojiKNATQojiKNATQojiKNATQoji\n/j96nzZ+iupJRQAAAABJRU5ErkJggg==\n",
+      "text/plain": [
+       "<matplotlib.figure.Figure at 0x7fe8b92a1b00>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
    "source": [
     "# This allows us to plot right here in the notebook\n",
     "%matplotlib inline\n",
@@ -971,12 +1275,11 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
+   "execution_count": 53,
+   "metadata": {},
    "outputs": [],
    "source": [
+    "@vectorize(['int16(int16, int16)'], target='cuda')\n",
     "def zero_suppress(waveform_value, threshold):\n",
     "    if waveform_value < threshold:\n",
     "        result = 0\n",
@@ -987,11 +1290,30 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 54,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[<matplotlib.lines.Line2D at 0x7fe8b80f5c50>]"
+      ]
+     },
+     "execution_count": 54,
+     "metadata": {},
+     "output_type": "execute_result"
+    },
+    {
+     "data": {
+      "image/png": "<base64-encoded PNG of the matplotlib figure omitted>",
+      "text/plain": [
+       "<matplotlib.figure.Figure at 0x7fe8b80efb00>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
    "source": [
     "# This will throw an error until you successfully vectorize the `zero_suppress` function above.\n",
     "# The noise on the baseline should disappear when zero_suppress is implemented\n",
@@ -1019,7 +1341,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 55,
    "metadata": {
     "collapsed": true
    },
@@ -1032,7 +1354,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 56,
    "metadata": {
     "collapsed": true,
     "scrolled": true
@@ -1046,11 +1368,17 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 57,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "1.12 ms ± 552 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
+     ]
+    }
+   ],
    "source": [
     "%timeit add_ufunc(x, y)  # Baseline performance with host arrays"
    ]
@@ -1064,11 +1392,19 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 58,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<numba.cuda.cudadrv.devicearray.DeviceNDArray object at 0x7fe8b80abe80>\n",
+      "(100000,)\n",
+      "float32\n"
+     ]
+    }
+   ],
    "source": [
     "from numba import cuda\n",
     "\n",
@@ -1089,11 +1425,17 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 59,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "537 µs ± 339 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
+     ]
+    }
+   ],
    "source": [
     "%timeit add_ufunc(x_device, y_device)"
    ]
@@ -1109,7 +1451,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 60,
    "metadata": {
     "collapsed": true
    },
@@ -1129,11 +1471,17 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 61,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "448 µs ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
+     ]
+    }
+   ],
    "source": [
     "%timeit add_ufunc(x_device, y_device, out=out_device)"
    ]
@@ -1147,11 +1495,17 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 62,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[ 0.  3.  6.  9. 12. 15. 18. 21. 24. 27.]\n"
+     ]
+    }
+   ],
    "source": [
     "out_host = out_device.copy_to_host()\n",
     "print(out_host[:10])"
@@ -1179,7 +1533,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 63,
    "metadata": {
     "collapsed": true
    },
@@ -1208,23 +1562,44 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
+   "execution_count": 71,
+   "metadata": {},
    "outputs": [],
    "source": [
-    "pulses = make_pulses(t, period, 100.0)\n",
-    "waveform = add_ufunc(pulses, noise)"
+    "noise_dev = cuda.to_device(noise)\n",
+    "t_dev = cuda.to_device(t)\n",
+    "pulses_dev = cuda.device_array(shape=(n,), dtype=np.float32)\n",
+    "\n",
+    "make_pulses(t_dev, period, 100.0, out=pulses_dev)\n",
+    "waveform = add_ufunc(pulses_dev, noise_dev)"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 72,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[<matplotlib.lines.Line2D at 0x7fe8b80061d0>]"
+      ]
+     },
+     "execution_count": 72,
+     "metadata": {},
+     "output_type": "execute_result"
+    },
+    {
+     "data": {
+      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD8CAYAAABn919SAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAIABJREFUeJztnXd4FVX6x78nCSkEElqoAUIg9E6k\ngwhIVbEL6yquuP50V6wri20VdRXXVRdsyKorNhBBBSkiXUCKQXpNIZDQklBDST+/P+4kuWVun5lz\n58z7eZ48uXPu3Jl32ndOec/7Ms45CIIgCPMTJtoAgiAIQhtI0AmCICSBBJ0gCEISSNAJgiAkgQSd\nIAhCEkjQCYIgJIEEnSAIQhJI0AmCICSBBJ0gCEISIozcWYMGDXhSUpKRuyQIgjA927dvL+CcJ3hb\nz1BBT0pKQlpampG7JAiCMD2MsaO+rEddLgRBEJJAgk4QBCEJJOgEQRCSQIJOEAQhCSToBEEQkkCC\nThAEIQkk6ARBEJLgk6Azxp5gjO1jjO1ljM1ljEUzxloxxrYyxtIZY98wxiL1NpYITTjn+DYtB0Wl\n5aJNIQhL41XQGWPNADwKIJVz3hlAOIDxAN4A8A7nPAXAOQCT9DSUCF1WH8jD0wt2462fD4k2hSAs\nja9dLhEAYhhjEQBqAjgJYCiABcr3cwDcrL152nLhaimW7TmJd1YexoDpa1BRQQmyteBiUSkAoOBS\niWBLCMLaeJ36zzk/zhj7N4BjAK4C+BnAdgDnOedlymq5AJrpZqVGTJ67A78czq9aLimvQHRYuECL\nzE9m/iU8OX+XaDNMz9WSchwpuIyOTeNEm0KYGF+6XOoCGAegFYCmAGIBjFZZVbW6yxh7kDGWxhhL\ny8/PV1vFMI6euSx0/zLy4bpM0SZIwRPf7MSYmRuqWjsEEQi+dLkMB3CEc57POS8F8B2A/gDqKF0w\nAJAI4ITajznnsznnqZzz1IQEr8HCdIVzz8uE/zDRBkhC2tFzAICuL/2MRTuPC7ZGHrZknUHuuSui\nzTAMXwT9GIC+jLGajDEGYBiA/QDWArhdWWcigEX6mKgd3KkRMfjNtYIskRNOb8iACbN7Mz42byed\nSw2oqOAYP3sLBr6xFln5l0SbYwheBZ1zvhW2wc/fAexRfjMbwN8BPMkYywBQH8AnOtqpCRUVjsv5\nhcViDJEIZidEP+xUbaQRPsCcmjqzf8nCH/67BaXlFeo/ILwyZeHuqs9D31ov0BLj8CkeOuf8RQAv\nOhVnAeituUU6cvz8VZeyw6cL0bZRbQHWEEQ1YU6K/vrygwCAnLNXkJxQS4RJpoZzjgXbc0WbYTiW\nnyk64p1f8NPek6LNMB3HzlxB0tSlmJ9mvYdGa3bmnEdpuXoXy7cWFCUtuH3WZpeyx+btkL4ry/KC\nDgAPffm7aBNMx9YjZ1TLtx05a7Al5ib9dCFufn8TCi6pd/+RF5F/ZOZfwrxtx7BdGWS2Z9HOE8g9\n59pKlwlDU9AR8uCunnPnR5uRPX2sobaYmTOXaTKWloyZsQHFZe7HHZzHKmSDauiE3+RdLMKUBbvd\nfp9pEY8CLaiQvAvAaDyJuRUgQVd4f22GaBNMQ2a+5wlawyziUaAFmzIKvK4je78voR0k6ApvrqDA\nUr7iS7OVaum+UeZmMNSe8bO3GGCJNWCS97lIL+gfrc/ExvQC8ufVEF9EaPLXOwywRAJ80JetNNCs\nGQOmrxFtgq5IPyha6c87oXdzr+teLSlHTCQF6/JGeJh3FcqjSVu+4WNvypWSMtSMlP5xJYJE+hp6\nJXO35Xhd56lvdxpgibnJOXsFn/16xOt67tzwCEf2HL/g03o7c87rbAkhA5YRdF/Ynevbw2Vl7v10\nG1bsOy3aDGn4NVPdn9+ZTzd6f4lan
aSpS0WbIBwSdDtkn3SgBYVFZd5XIjRn1YE80SZIQ9LUpfh4\nQ5ZoM3SBBJ3wE99d6A6euqijHQQROO+ukdNNmQSd8At/0sx9/zvF9fbEj7soOqUofBjXNyUk6IRu\nfPSLnM1arZg8l1w7RSGrPzoJuhM0K48wgkvFNBahJTln/ctKRDV0i/DxBvImcEdGHs3+1Iqb3t0o\n2gSpGPQv/7KPUQ3dIvxz2QHRJoQsw9+mGC1akVXgf8LyqyXlOlhiTeSUcxJ0VY4E8LARhN7M2Zwt\n2gRpkHUmMwm6Cv+V1EdVBJQNSjuW7CavGMIzJOgqzNt2TLQJ0kBRLLVj73Hy6yc8Q4KuQgU5umgG\nnUtCbzame48pbxVI0AldofEIbckrLBJtQsgx7cd9ok0IGUjQCZ/w18+X0Ieb39sk2oSQIz1Ad9pH\n5+7AxaJSja0RCwk64ROrDlCERa0IZvLaiQtUQ9eKxbtO4LNN2aLN0BSpBb2olPx2tSKYCbQV1JHu\nAJ0OQi+kFvSBbwSebqpQsqZYsCwOIpAUxXRxhMJLhA6yXQqpBd2fyIDOdHnpZw0tMT/BZMxJP12o\noSXmRzINEcrdH1MCbXukFnQiNCiXrRoUJOSVoR2bMnzL+GQVpBX0svIK0SYQCpQP05Evt9DENUIf\npBX0z37NFm2CNJQHOYp39Ay5PGoJhd4l3CGtoF+k3JeakZ4XfB/4Zh+TIRPe+TYtR7QJ0sAlG9GQ\nVtBPk79uSDHhvzR4RRB6I62gf6NBLYbcy4hQJEzS5Awi+M+qdNEmaIpPgs4Yq8MYW8AYO8gYO8AY\n68cYq8cYW8kYS1f+19XbWKNZsptCvwLy+eqKJPec+/GEBrWiDLTE/NBYgiu+1tBnAPiJc94eQDcA\nBwBMBbCac54CYLWyLBWUxNfG6BkbRJsgDWXl7t+OMyd0N9AS83OlxLOgW/EF6VXQGWNxAAYD+AQA\nOOclnPPzAMYBmKOsNgfAzXoZSYQ2zerEiDbBNHhq7PRv3cCnbZy+SONDALDtyFmP3792S2eDLAkd\nfKmhJwPIB/A/xtgOxtjHjLFYAI045ycBQPnfUEc7dWFs1ya4pUcz0WaYnmWPDRJtgmmY6yZ5So1w\nW7/4kskD8dyYDh638c1v5OUCAI987bkFPaJTY3z2p2u8bkemsTJfBD0CQE8AH3LOewC4DD+6Vxhj\nDzLG0hhjafn5+QGaqQ/v/6En/mnBt7iWLH10IOJjamDBQ/2wnITdK7PdxLUZlJIAAOjcLB5/HpyM\n7Olj3W7jzOXAQ1pYjSHtGqJhbc9dL6sP5Blkjf74Iui5AHI551uV5QWwCfxpxlgTAFD+q54Vzvls\nznkq5zw1ISFBC5u9UlzmPsrioBTHZm3NyAi8cVsXvU2Slk5N4wEAqUn10KFJHP7v2mTBFpkTtVpi\nt+Z13K6fR90uHtn+/PCqz97q36ckOpdeBZ1zfgpADmOsnVI0DMB+AIsBTFTKJgJYpIuFAXDtv9a5\n/W7y0BSXsjtTm+tojbnxNvDkzDUt6+lkifUY2amR2+/WHQ6t1m6oUd+PAdGlEnmzRfi43mQAXzHG\nIgFkAfgTbC+D+YyxSQCOAbhDHxP9x9Mbt3ereujUNA77TlQn3GXk1+uW0jL39ZsVjw92KRve0b0I\nWR1P8fmnjnbtN39ocGv8duQs1h5yFW9P3jKEf2zOkmcWs0+CzjnfCSBV5ath2ppjDIv+OsDnCIDr\nD+fj2rbGdBWFIpfd1NBnTuiBdo1rG2yNudl34oLb71rUq+lSFhbG8MCgZHVBr6Dgc+5wHn8Y26WJ\nZWI7STtT1BMR4WGIigj3ad2Jn27T2ZrQpv909SQhvVr6P4+spMzaIuSpJRgTqX4/9kuur1p+tYSy\ncfnKCzd0xI4Xrkd7C1RALCXo9/RtKdoEaYiO8P/W2SJR0zYQ3LnHRXk4l2Fh6i+BeeS66DPhYQx1\nY
yMxtksT0abojqUE/ZWbyUVRKzwNOu2bNlK1/F6Lt3beXZOhWn7o1dF+b+tIweVgzTE1GQFEAG1q\ngQlw0gn68fNXA/rd8A6mmxcVssRG+TrWbi3ST18SbYI0TJqTplreuVmc29+Eu2ntAPLEhZFO0B9w\nc6G9TU9/d0JPPcyRkqHt6eUXCIFWNghX3CVNaZ1Qy+1vrmvn/r6VZXxHOkF35zftbUaou0EpwpUh\n7azr9UOIJ9DadHzNGi4TCyuRZfq/dILurlFVLzYy4G0WFpUG/FsZqQgyJR1Rza4XR3hdZ8Z4isJo\nT7EHf/4nr2/r8bfuIjB+uulIUDaFCtIJerabpliXZvEBb7PUopM43NVafGnNBHO+rUR8TA2v64zr\nTgHkfOGTialoWT/W4zoT+yeplr+/NlMHi4xHOkF3RzCzQT2MpViSuGjvIvTOXd0MsMTaUDwXRwa6\n6U6xR/Zn2TKCHgwnLZqftOBS4FH92jSUfxKHaDZlFog2IaTwZbKgPzFezAgJug/8lu05kL6s3PPJ\nVtXyPm5mL/rC4dP++w8T6jzxzS5k5VvPFTKY1nazOjFIaejeE8bsWELQP71PLQyNK8+Oaa9abtXg\nXQdPqYtvMAPMz3+/N+Dfmhl/o1b6Ss4567lCfrnlaFC/b1nfNW6OLFhC0Ie29y0C4IODW6NpfLRL\nuTXlPHgax7mey20Wbe2oDaxP7Od7KIpAYufIytsrD7uU+dM3Hh8TeIUk1LGEoPtDA5XsJmEWraGr\n8acBST6v+5frWutniMlQu4Uaqrzw3PHoMNc4/kQ1Syb7ni2rQxN5x3dI0J2IDHc9JVYPKmWPPy50\n9/ZL0s8Qk1FY5Nrl0ruV78lAZPfOCJYEL2nm7Lm1Z6KOloiFBN2J1291TUe3eNcJAZaEJt09pEUj\n1OGcY4BKGOJrkvwRdFJ0T0TX8F3K3I0ByTBgT4LuREojeZtjhBi0SBdHeu6Z2j7MjfDGPxaZf8Be\nKkE/f4WyoWsFTe/XjiINklGkUq5W3TkTxLyLUEEqQVebCFOLQrkGhFVSdpmFyIgw1Yw7FZIElTKa\nsV1dk12k55nfp18qQVfLs0hN1cD4hjLihBwPD3H1Giq3aJyhYJk5vodoE3RBKkGfsSrdpYz0PDAO\naTRA1I0GUTVDzcMoTKon2Dg8JbswM1LdDmpB75vV1WZWWCAprwhgqEpSgbJyOZIJBEPbRtpMP7fq\nLGZCHakEXa0/cc7912iy7eFv/6LJdqzGo8PauJTtP3lRgCWhRZ2a2sxWlGEgL1juH9BKs22Vm9wZ\nQCpBV/MjbVjb99l4hPao1SBPUCo2zboCn/lut0ZbIgBg2Z6Tok0ICqkE3eQv15AnpoY2afoe+vJ3\nTbZjFnblXnApmzJKPRCcv1gt+UpeoWso68cCDIuQnOCaDKPIQzYkMyCVoBP68tvzw0WbYEpmrXfN\nhtNOxQWR8M4NMze6lMXXDGxS0aSB2nXVhAok6CpkTx8r2oSQhHz6tSNWw6Tkaw/labatUCavsAh5\nhcWabY9J6ANHgk4QAgjUO6V/a9fkIpkSTIjxhZIybb2jZHQQkkbQc8+pJ4cmxKM2K8/KJNaNCfi3\nn96njdeWGdHaRXNgG9ccpGUmH4iTRtAHvrFWtAmEGzo1jRNtQkgxoXeLgH+rNiGmWOOaa6iidYW6\neT3XOSqlJp8jIY2gq+FPSE3CM08MbxvwbynQlyNREYHfl2qipnVXhJXJMHn3ldSKNzGIBAtaDlqZ\nDbWolbFRgZ+P/ipNWytTOzrwwWW1bgd6XWrH55uDy1cqGqkF/a9DXWcp+kqiRiEDzIi75NCB0rOF\naz5MvZImhxr5Kl4ZwQRIVKuhc4tEXFTrQq8boMuirPgs6IyxcMbYDsbYEmW5FWNsK2MsnTH2DWMs\n5DKvxgUR9P6LB3praIm5UNOH9o217Qf/10+HNN1eqHLLB5tcym7
q3jTg7amJmkX0XNXNsGYkudLa\n408N/TEAB+yW3wDwDuc8BcA5AJO0NEw0aiEDisvMPYvMV3bnnncpG5iibbeJVZKR5J5zDXMQjAip\ndbm8tzYj4O2ZCbWX2ZRR7Yw3JITxSdAZY4kAxgL4WFlmAIYCWKCsMgfAzXoYGEos33NKtAmG8Pry\ng7rvg8ZJCX/Zf8I1qNvQ9q7RPK2MrzX0/wCYAqByOL0+gPOc88qO0FwAvqeD1xijwrFa1ZtgkMa1\nc4Ay7RD+88x3e1zKgs0lOkzlhWDmMQmvgs4YuwFAHud8u32xyqqqZ4Ex9iBjLI0xlpafH3yyXDWu\nGBRQZ8pCa0a266iDH7mJnxnh/LFv4H7sZobr4M/zicpELTO3Hn2poQ8AcBNjLBvAPNi6Wv4DoA5j\nrLIzMBHACbUfc85nc85TOeepCQkJGphMGE2sDgNPVEMPnLFdAh9UJbyzM8d1DMkseBV0zvkznPNE\nznkSgPEA1nDO7wawFsDtymoTASzSzUovqDUX+rSiLOlakdrS1e0wWJbvtcZ4hDMzxncPehv9VOK5\nWAGjknlsTC8wZD96EIwf+t8BPMkYy4CtT/0TbUzShjn3W9ftUGuoLq0doztTXJtAMSrOyp7jrvHr\nzYJfgs45X8c5v0H5nMU57805b8M5v4Nzrl1cSz85fdF119EaJGO4u4/1+iovFbtO+NEihgZ5I9iI\nDGLaP2EM5RXmdX6Q4u567nvX0W8tsGISArW+7T7JwTfxW6tkhyGIUGTtIX2cN4xACkHXqyE2pov1\nmsdqtXG1CH/+kppEYxp6YvbUaYQ2SCHoekVIa1ArSmVf2sY5sQpqiRlkJ+escTH65207Zti+QoWP\n7ukl2oSQQwpBP3vZuGnkx8+7JqmVCa2TCFQS7AQQM7JPZWajVgxo4/iC1DI1m1kY2amxaBNCDikE\n3Rk9I7CR/zThO/rdK85i9sE610TUhPWQUtAfura1bts287RgX9AzzWJvp370TRnm9ff1BT0jUtSt\nGXLBTXVFzzGCX56+TrdtG42Ugq7nYOblYrkHny6ruC3qxd0fbzVsXyIouGS9bhC9OHVBv67OFvXl\nyX0gpaA3rRN4El5vPDV/l27bDgV6v7Zat233b2OtgVG16IBEYGQVmDs1nFFIKeh6dhuUmDyJrEge\nHZoi2gRD0Wl82ZIcd4orf3uvREGWhDZyCjo9SJrROM410UeghGngz24mDmicys+exvHaXRcz4Hzv\n/G0EJbZQQ0pB15Ibu1k7st2Ch/uJNsG07HKK2tezRR3Ntn2NxSZqbc8+57BcL1bfQWGzOj9IKeha\n+lJPGthKs22ZkfgY6/mPa8Hvx865lA1KofDRgfLdjuMOyzXC9W3tmTUmuukFvULnMx+jQZAvM0PB\npALjkEp3i9Z5Wa2MXhPgKrlcYpy3l5aY/mk16YvUNERFWPuFFigfrHNN3Gy1bhIzs96kAbpML+h6\nD7MlNZDHR5UwDqvmn5WF2b9kiTYhIEwv6IU6T4ShGipBWA+zJrkwvaCv2n/a8H1uyTpj+D4Jwht7\nTSpChHaYXtCf+tb4mZvbjpw1fJ8E4Y0PKUCX5TG9oIvApC6qXrlaYnycmsKiUsP3aQSl5cbfJEv3\nnDR8n7IQGS6HFMpxFHaQ33TgdPjHT7rvY9pNnRyWpy7UJ32gaIyM0S87RmRjeuvObrrvwwhMLehq\ns7l2vThCgCVycm1b7SfCjOvuOPP2/FVrCF9cdIRoE0zL+sP6uxDK0ug2taAv23NKyH6PGZhaTCQR\nOsReYU6OppsyrDHAPLxDI823+bYktUpvlBswbbNVfTmSmJta0PMKxaSDW/h7rpD9Gs1d1zQXbYI0\n6DGz8daerhEHtx+Vb8DeCJ/wTk3jdN+HEZha0C9cNWZAbUwXa+Yu1DsAkpUwKgLorhz5XBd3OgU5\ni9IhHIVaJFA9k2rohakF3bn
5rhdN4/VLmGE5rBVBtwrnpM56YYXQ0UYFzMvKN19SDVML+pVSYwLo\n1FCpERjRryea+rWiRJsgDSM6GtPKs4Ceo2tivCH7mbJwtyH70RJTC/rOY+e9r6QBCSrCJqr/Xi/U\nPIaa19W+ZaIWvfL4+asqa8qFFWrORjGqs345g+3JPWe++9LUgr7VoBmbvVrWNWQ/oUaEDpMt1MLx\nXjVpqFJ/CLdYtiZCDKYWdKPo1tw104xR/fdWQNaZt/YYFeTNCq0dveiu8pybDakEXU149UK2JvSv\nmeL8wS2g57rRrI5jt9juXLm8XIrLjAtHUSFBzUIqQTcSyfRcujEBkZSVGxcL/b0/9HBYNqob0iiK\nDYwr3zqhlmH70gupBN3IbsorAgJZyYoEFSMH3lvrmq1IL8Jkayo6YeS98dKNnbyvFOJIJehGvmGP\nnLls2L6MYNHOE8L2vTmzQNi+9eC4Cb0jQpXVB4zLdxBVw/xy6PUIGGPNGWNrGWMHGGP7GGOPKeX1\nGGMrGWPpyn/hriB6vs2fHtnOYXm3ZDPy1jnlUEzUwWXRHTmSCaBzpEU94rhUInkFHV9sOSraBFPh\nyyupDMBTnPMOAPoC+CtjrCOAqQBWc85TAKxWloWiZ4b6e/u1dFjeJFmt0hkjY1usOZhn2L6MYLXT\n8dzco6mbNQlvnDDQa0eGrj+vCsg5P8k5/135XAjgAIBmAMYBmKOsNgfAzXoZ6SsPDNJvSrDzxNCC\nS8W67SsUuLdfkmH7OlIgV/eVM3oKhezus84zsvu3NiaEglnxq0rLGEsC0APAVgCNOOcnAZvoA2io\ntXH+omsfutNDmZUvtwgNaNNAt23rEVwplKFJRYHjLOiPDkvRbV8ydF/5/GQxxmoBWAjgcc75RT9+\n9yBjLI0xlpafr3+ger2QwUc1VFj39BDRJhiKnp4obRqa39XOE86CXkPHVHHRKmEpzIZPZ4cxVgM2\nMf+Kc/6dUnyaMdZE+b4JANWOUM75bM55Kuc8NSFB+ww4RhETaf6LHSo0sVj0ytgo/e4d2e9L59ys\nanGViGp88XJhAD4BcIBz/rbdV4sBTFQ+TwSwSHvz3LPH4BlxMry9CTEM1LH7SnauOuUTbVG/piBL\nzIEvNfQBAO4BMJQxtlP5GwNgOoDrGWPpAK5Xlg3jxvc2Grk7gggYPbIVeSL9dKGh+5MZsyX79pq5\nlnO+Ee5nug/T1pzAeXBwsmgTCCIk2JJ1BimNaos2QwomzN6CFU8MFm2Gz0jjbtCiHjXFCPGcC4Ea\n3Y+7T4o2QRouFZsrtLM0gq6WoIHwjV05xiQKsQIvLt4n2gRskyxAl0jMFo5YGkG3Qko4vcgWEJcm\nuUGs4fs0gsW7xMXEIYLnHzd0FG1CUEgj6KTngSOicdNacv9pIzEyTIPsTOjdQrQJQSGNoJNbYeBc\nLCo1fJ/vTnCM471ZYIINs0MzUbXD7H790gh683rGT1bZceyc4fvUg5d/3O+w3L6x/h4Szi/gCf/d\novs+RdDAgIkwsg4flRqYKEQWpBH0cAGBGETGENeSMqf+qoUP9xdkiXxc31G/0LmV9Ghh/lyYarzw\nw17RJpgOaQS9bzJFYdOK2Civ0xMIHzFifsTzY809kOeOeb/lOCz3o2fcK9IIepgB/Yj39U9yWCbP\nGsIbrQzw5tEzD0Ao0btVPdEmhDzWuBM04qWbHHMOLt9LEzgIwihkz5+qBSToQWC2WWQEYWZIz71D\ngh4ERaU0Cm9mvk3LwU97T4k2g/ARs3hnXiwqxYGTPqeM0BQS9BCmpKwCZ+xS3Z2/UoKF23MFWiQX\nTy/YjYe+3C7aDMJHInRMbmHP7Ht6OSz7G1bkjx9vxegZG7Q0yWekEPSHh7QWbYIuPDF/J3q9uqpq\nefLcHXjq213Iyr8k0CrzUlpegZIya7eqjp65jDtm/YpCZTLZvhMXHPy9P96Q5VeugffXZiBp6
lJc\nLSn3vnKQODsl6IXzHAl/k5jvNjhXgz1SCHpECLfFMvIKcc0/VyGvsMjv3y5VoubN/iUTAJB30VZb\nLwlywsX+Exex/WjoTYrKOXtF1+33emUl2j6/HD95GMzefvQsFu08rqsdRpGWfRa/O01+e3vlYfyW\nfQ6rD+Qhu+Ayxs7ciH8uPVD1/atLD6jmGrhaUo62z7meu89+zQaAqheEnhg1G9y5Pn7ZgJeVVkgh\n6CJnyn28IQtJU5eiooJj1vpMJE1dWhWgaf+Jixg9YwPyC4vx877TLr+9cLUUY2duQEae5xr3a8sO\nupQVl5XjwtXAHqIxMzfgtg9/xUshEBnQHncvvcKiUr+j3lVUcGQqLZkT56+irLwCF4tsg9gPffl7\n1XfO3PbhZjw2byfOXCr2uzYfahE/b5+1Gbd+8KvqdxwcZ5RQv7ty3UfbLC2vwEfrM5FVcAkl5RX4\n98+HPe7zlg82Yehb69x+f7m4rOqFWVZegSkLdmHRzuN46+dDXo7GODo0cZwpvWq/67MbqphO0Msr\nOH7NKHAoE5nA+VWldrPucB6mL7cJ739W2m76MTM3VOVE5AD+8tV23Phude1n7cE87DtxEc9+v8dl\nu2rxVbhSd5i79Rju/u9WdJv2c1C2f/ZrNorLxNU+4qIdJzDZX8bColK8vzYDFRUcY2ZuwIDpa1x+\nf+O7G3Hf/7a5lG9ML0Dys8sw7K312JhegP7T1+DlJY7hDYa9td5h+a6PNuO1ZdU11V6vrsJfvnLt\nXy9TWkdXSsrw+rIDKCotx5US24siv7DYYd2J/VqqHbYu3NEr0e13S3ZXz2hWa8t6enzm/JqN15cf\nxKcbs6vKyit41XmwZ+bqdOw4dh5Z+bbonReulqKotBxl5RUor+AoLa/Ac9/vwWPzdmJP7gXsO3ER\n89Ny8di8nXh3TQZKyirww47jwl+MNSMd70v71uzh04V4c8VBXCouQ9LUpbjlg014ffkB500Iw3RT\nAl9duh//25TtUBYK9aL7P0ur+pxVcBlJU5c6rsA5lu2xeVS0fW45tj47rKpGqha/ushDM2/O5qN+\n2fZb9lnM25aDN27r4jKw9NH6LIflTyam+rXtYGgSH4OLRdXp0vYev4D7P/sNa/42BG/9fAhzt+Wg\ndUIscs7aaudbss6ga2J81QO357itr/JIwWU0rxtTdWx//GRr1TYrvQ0+93LOth45i61O12HVgTzc\n8sEm/HVIGwzv2Agr95/Gnz9Pw/LHBmHutmP4fPNRnLpYhEU7T2DWH3ti/WHHioYRk90q+UOfFvjW\nbsC8zbPLqj4/8vUODGjdAHVjI6vKOK92A/T0/Fwutt2Hl4ptFYwwBvR5bRWKSyuwZ9rIqvV25V7A\n2yura+8/7zuFB7+wvRDjY2pjN8qdAAATH0lEQVQgqX5N7Mq9gN5JtslBc387hq+3HnPY13trMzBz\ndToOny7ElFHt/Th6fTl+/irOXCpG/VpRuOujzTh3pRTDOthCOuw4dh47jp1H64RauDO1uWBLTVRD\nLy2vQM7ZK/hRJd60yBq6r9jHSykpr0CPV1Y6lL3x00F8m5aDnq+sRI+Xf8bmLMfogz/sOI7Dp127\nCbYfPYf8wuKqWk1ZeYXDABXnHHfM2oyFv+eizXPL8dVWR2GbtT7TYblronFxQRrHRzssv/Tjflws\nKkPqq6tw8aqt1vu8XTyP8bO3YPSMDXh83g78cji/qvy6f6/DuPc34ZUl+/HemnSHbQbru7zj2Hk8\n8Hkazl0uwZ8/t720Z6xKr3pBpGXbam8Pffk75m5zFCgjJ8I45y11js/z94W7Hdar4MAdszY7rGM/\ndvDZpiOosNvGCqXLkIGh4FIJCovLHLoKK89NJfPTqqftX7hail3KQGFlK9NZzAHb+AUAfLAuU/fx\nFH8Z9K+1AIBzV9S7OacssJ1f+1baxvQCfLLxCJKmLsV8p
zAGemGaGvoLP+x1ie1QRejrOaY5RTQE\ngAt2N8eH6xyF9bF5Ox2WH//GcbmS2z609ZG+fmsXTOjdAm2eWw4AePvObri1Z2JVq6CS5753DHh0\nReCAjye9q5y0VXDJMaXb0TNXcPTMFfzgFBht34mL2HfC1ff36BnvwjBzdbrXdXq8srLq80/7qs+p\np759I8fqve3q5/2nkTR1KTo0scVOL7hUXBW64tAp23mrfDkBtpdrvVpReGeVY5/5IbsE1MPfduy2\nsmfVAXXPkNMXi1XLAWBTRnUlplJAK5nQW2zt90pJOb7YnF217PwcAcB3v+fiyfm7qpbtW4pTFu7G\nndfofwymEXS3Yg5UNX/Mxke/ZHlfyUee+W4PXrHrJ35y/i5wDmx0Gm/wRk0D40G3bVQb6w7lq363\n/rB6ub98scV795R9V4GWONeaQ4HKLqgfdlTXxotKK/Dqkv0u58rbYH0gHAuxmrc/vLCo2olAbeKQ\nvZiLwhRdLse81LIoaI8N59r2U9/uwvc7/HPBMzLS4uPDUwzblwiMlHN/k1wcPFXosPzxxiMu6/jS\nciFCC1MI+uA313pfySBGdjJnayAUiY4wd3YYb4zu0sSwfXVsQmnoCJMIeihxY7emok2QBiO9QEQQ\nZWBYW9nPpbHtHfNCgu4nSfXlzFZPaE8IdqETPhCt04u4qFR/BwQSdD/p3CxetAmESWBUq9SMhNr6\n52atRK8gYP7Odg4EEnSCIEKeW3o0E22CKSBBJwidqBtbQ7QJ0mBEKj8ZIEEnCJ1oWDva+0pESPLt\nQ/1EmxAQJOgEQRBOmNUNlASdICShX3J90SYQgiFBJwhJ4GYIamQS9HA5vWxAUnkSdKKK1JZ1RZsg\nDSKyaBkZ3ZHwn1057hOJaEVQgs4YG8UYO8QYy2CMTdXKKH8Y3qGhiN1Kyad/uka0CdIwoE0Dw/f5\n1Ih2hu9TVvSYQxBfM9L7SkESsKAzxsIBvA9gNICOACYwxjpqZZivPDOmg9G7xNcP9DF8n0YQF228\nm123RDknaqU0rGX4PntRCyukqRfKgg6gN4AMznkW57wEwDwA47Qxy3daJxj/4PQXUPsizAX1fpib\niHDtL6ARiXiCEfRmAOyDlOcqZQThM50kDaUQirHQzcpwAfkOaugw/d+IIetgrFbNN+uyEmMPMsbS\nGGNp+fnaJC0g5KF2tGlyrPjFNUkUo18rXr25s2gTTEMwgp4LwD6nUiIAl4SfnPPZnPNUznlqQkJC\nELsjZGSsgTHDjaR/a/IJ1wrn3LNmxYg2WzCC/huAFMZYK8ZYJIDxABZrYxZhFYxMSm0kRmZ+IohK\nAr7rOOdljLFHAKwAEA7gU875Pi8/IwiCIHQiqGoE53wZgGUa2eI3Zo23QBAEoQemnin6jxsNd3sn\nCMIitG1kvEt0sJha0KNryJ1k2CpQrGsiFPnShBMITS3ohByQy7Z29E0md0mtMGM8exJ0AgAwuK04\nl9IRHRsL27ds0Lm0NqYWdBHxMmTlz4NaCdv37b0She1bNkTMqiRCB1MLusg+9IYGZiE3gkidMp0T\nxtKifk3RJhBuMKJr0dRPcbiAmNOViAiPqid6BCOyKrf2pJBGhCtGxKs3taCLhAbytESuTDtv39ld\ntAlECNKjhf6zoknQA6SNZP33BkT2JAhLQzV0BVkj8oUS9WL1D77vjvAwU9yGBBEU1IeusGnqUNEm\nuGD2Gu3eaSOrPi94qB+SBSQKqYQmFhHuePL6tqJNMBWmEHQRqdG8YWaXybaNaqFWVAQ2TLkO6/42\nBKmSx+7u3Mz/mD8LH+6HVU9ei+/+0l8Hi8QwZZT5co7WjBQ7G7xBLZs32+JHBgS9LT3ylDpjCkEH\ngNn39MKglGrPEtEZ6vskh268azXnnztTq329/31HNwBA83o1kRQiteNuzbUdMMqePrbq85LJgzC0\nfXUy8fgYxwrCtJs6I
eu1MU7Cz9CmYS30bOF4n8WYONzEsPba+6jbn1dnfg2gZX1z96YOy+N7t/B7\nG1rClaZ4k/gYrzlb353QwwiTPGIaQR/RqTG+mFQdW6F5PcH+tnZdLvf0benwVbKKSC58uD92vTjC\n4yYn9G6BtOeHY+uzw3BdO/WZmw8Pae3VtHaNq4XppRs7YtuzwzD91q5VZaEYA6eVG//p/7s2GUsf\nHVi1fOT1MQCAQSkNsPyxQdhn13XkjL1w39DVlkhjXPemLjX2if2TEBbG8NUDfVX98e9KbY4Z47vj\nlh7NMOf+3i7ff29Xi5//f/3c2iOado1r44tJrvZ74+4+jqLavnHtqs+z/tgLma+NwTgnIQaApnVi\nXMrUZiSP6lQ9u/Wdu7oje/rYqr9aguPKVz7mYcz2DNtXFADgp8cH4cDLo/DaLV1wQ9cmSLCbn7Lq\nycEO61IfugqVD891HmoGRsCVSx0XHYFXlBRZzZQbmAOYMb7adS2hdhR6tazrUjN0plfLumhQKwqN\n4qLxvz/1xh29EvHIdW2qZlKGhzE8PKS1Q/+3M3P/3Bcf3N2zavm+Aa3QMC4aYWEMjeKiEB7GVF84\nonnt1i749L5Uh7IfHxmIZ0Z3QKem8UhuEIvbeiaCMYbs6WPxxaQ+6NAkDrFRES6D5o8PTwEArHh8\nMOb+uS8A4NaeiTj4yijMGF9di7qnb0s8OrRN1XJ8TA10UumeeeP2rhjXvRneuas7ereqh7+Pao+H\nrrW9WGtHRaCHXS2+d6vQ6b7Knj4WE/s5VjYqk6r/sa/nmq/9/fvPW7o4fLfw4f5oUCsSix8ZgMiI\nMISHMYzxIfPU9ueH43O7F2ILpVI2dXR73NuvJSLCWMjlYq1b0/bM2s95udfunLZvHIeYyHD8oU8L\nMMYcXkBtGla/+ADyclGlR4u6OPDyKNzUzbVGYCQxSt/eA4OSAdgenvkP2WpnTetEY1z3Zvjl6esA\nAE8Mdx3YaRznPfDPm3d0w99GtqvqeXv9li6Ii67hcNNU1pbu7tMCSyYPRL/W9d3WarY+OxyZr41B\nRAjOCq0ZGYGh7Ruha2J10ugudp/X/G0I3rqzm+pvnR+Tx5Xz3Tg+Gv3sUsFVtkwqvWpGdW6MJ0c4\n9itPHdUezerEONRCnXl4SGtMHd0eq5+6FuueHuL12EQybZxjPs6mdWKw4vHB+McNnTz+zvn5sg/6\nFRsVgbTnr3fINjWyU2PseOF6l+3Y18hrO42FDWhjuza1oiPw8rjOyHhtjJejMZ4vJvXB9Fu7oE7N\nai+wl5VzWl/FM6zCjbfEb88NN2QipCn9AWMED5QAQFREuEvzq1mdGLz/h55VN2qL+jVd1mnfuDYO\nnirE6qeuRacXVwAAomuEoai0At3sBMyeR4el4OjZKxjZubppypjN06Z1w1o4eKoQfZPro3Mz2+9F\nN1ODYfEjA5E0dalfvxncNgFLdp/0ef03b++K2b9koa/KOEif5Po+e1W1FugZ5A9LJg90mAncTnlZ\n7Z02Ep2Ve3D3SyPwwdpMzFqfCQBgjOHWns1w8nwRANs53pJ11uN+6toJ3KSBtthAn9/fG8fPX8WW\nzDOIjLC9SGMjw3G5pBzTbuqMBwe3rhp4DEWa1olR7cf//P7eSFGJl163ZiSOnrniUp5gUKgQ8z75\nIcrYrp6bnksmD0RZBXfox74ztTmm3dTJbXOzeb2aLn2zO18YgdKKChwpuIylu0+ij11TPxReeMHw\n5PVtsTPnvM/rv3VnNzw9sh2ufXOdT+s3iovGCzdYJzlK5YveGfsXf1x0DTw+PAX/3ZCFd+6ydbcE\nOuM1sW6Mw/ltVicGt9kFYNv87DAUl1YgMiLMtC6r7qKTPj+2A26ftbmqhff1A31w/mqpYXaRoBtM\nRHgYIhS9fXdCD0yeuwPXJNXzu+8wXunba1AryqUVYHYeHZbi1/pREeFoWT8W9w9ohQ5
N3HeV6Enr\nhFhcb8LQtU3io3Hygq0WHl0jHJluuj36tPLNq+vQq6O89hXHRdcAzBdq3CcqK1OV3Sv9DY75RIIu\nkBu7NUVyQiw6NVWvQQXLLT2sFSRKZErC1U8NEbZvZx65ro2q66oaSyYPrBJ0T1QOYDao5XlGcVSE\nuVuHwdKhcRwmDWyFif2ShOyfBF0weol55mtjfH6oCbn420jfJxDVrxWF+iHch202wsKY0O48EnRJ\nERlamJCPyl4Uq9fAQx0SdIIgvNKgVhSeHtkOY33wNyfEQYJOEIRP/PW6Nt5XIoQSejNMCIIgiIAg\nQScIgpAEEnSCIAhJIEEnCIKQBBJ0giAISSBBJwiCkAQSdIIgCEkgQScIgpAExg1MX88YywdwNMCf\nNwBQoKE5ZoCO2RrQMctPsMfbknOuHrPXDkMFPRgYY2mc81Tva8oDHbM1oGOWH6OOl7pcCIIgJIEE\nnSAIQhLMJOizRRsgADpma0DHLD+GHK9p+tAJgiAIz5iphk4QBEF4wBSCzhgbxRg7xBjLYIxNFW2P\nPzDGmjPG1jLGDjDG9jHGHlPK6zHGVjLG0pX/dZVyxhibqRzrbsZYT7ttTVTWT2eMTbQr78UY26P8\nZibzN+O0TjDGwhljOxhjS5TlVoyxrYr93zDGIpXyKGU5Q/k+yW4bzyjlhxhjI+3KQ+6eYIzVYYwt\nYIwdVK53P9mvM2PsCeW+3ssYm8sYi5btOjPGPmWM5THG9tqV6X5d3e3DI5zzkP4DEA4gE0AygEgA\nuwB0FG2XH/Y3AdBT+VwbwGEAHQH8C8BUpXwqgDeUz2MALAfAAPQFsFUprwcgS/lfV/lcV/luG4B+\nym+WAxgt+rgVu54E8DWAJcryfADjlc+zADysfP4LgFnK5/EAvlE+d1SudxSAVsp9EB6q9wSAOQAe\nUD5HAqgj83UG0AzAEQAxdtf3PtmuM4DBAHoC2GtXpvt1dbcPj7aKfgh8OJn9AKywW34GwDOi7Qri\neBYBuB7AIQBNlLImAA4pnz8CMMFu/UPK9xMAfGRX/pFS1gTAQbtyh/UEHmcigNUAhgJYotysBQAi\nnK8rgBUA+imfI5T1mPO1rlwvFO8JAHGKuDGncmmvM2yCnqOIVIRynUfKeJ0BJMFR0HW/ru724enP\nDF0ulTdNJblKmelQmpg9AGwF0IhzfhIAlP8NldXcHa+n8lyVctH8B8AUABXKcn0A5znnZcqyvZ1V\nx6Z8f0FZ399zIZJkAPkA/qd0M33MGIuFxNeZc34cwL8BHANwErbrth1yX+dKjLiu7vbhFjMIulo/\noelccxhjtQAsBPA45/yip1VVyngA5cJgjN0AII9zvt2+WGVV7uU70xwzbDXOngA+5Jz3AHAZtmay\nO0x/zEqf7jjYukmaAogFMFplVZmuszeEHqMZBD0XQHO75UQAJwTZEhCMsRqwiflXnPPvlOLTjLEm\nyvdNAOQp5e6O11N5okq5SAYAuIkxlg1gHmzdLv8BUIcxVpmY3N7OqmNTvo8HcBb+nwuR5ALI5Zxv\nVZYXwCbwMl/n4QCOcM7zOeelAL4D0B9yX+dKjLiu7vbhFjMI+m8AUpSR80jYBlMWC7bJZ5QR608A\nHOCcv2331WIAlSPdE2HrW68sv1cZLe8L4ILS3FoBYARjrK5SMxoBW//iSQCFjLG+yr7utduWEDjn\nz3DOEznnSbBdrzWc87sBrAVwu7Ka8zFXnovblfW5Uj5e8Y5oBSAFtgGkkLsnOOenAOQwxtopRcMA\n7IfE1xm2rpa+jLGaik2VxyztdbbDiOvqbh/uETmo4seAxBjYvEMyATwn2h4/bR8IWxNqN4Cdyt8Y\n2PoOVwNIV/7XU9ZnAN5XjnUPgFS7bd0PIEP5+5NdeSqAvcpv3oPTwJzg4x+Cai+XZNge1AwA3wKI\nUsqjleUM5ftku98/pxzXIdh5dYTiPQGgO4A05Vr
/AJs3g9TXGcA0AAcVu76AzVNFqusMYC5sYwSl\nsNWoJxlxXd3tw9MfzRQlCIKQBDN0uRAEQRA+QIJOEAQhCSToBEEQkkCCThAEIQkk6ARBEJJAgk4Q\nBCEJJOgEQRCSQIJOEAQhCf8PpK5CeDRWBjAAAAAASUVORK5CYII=\n",
+      "text/plain": [
+       "<matplotlib.figure.Figure at 0x7fe8b9b27fd0>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
    "source": [
     "%matplotlib inline\n",
     "from matplotlib import pyplot as plt\n",
@@ -1274,7 +1649,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 73,
    "metadata": {
     "collapsed": true
    },
@@ -1310,7 +1685,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 74,
    "metadata": {
     "collapsed": true
    },
@@ -1319,31 +1694,35 @@
     "# As you will recall, `numpy.exp` works on the CPU but cannot be used in GPU implementations.\n",
     "# This import will work for the CPU-only boilerplate code provided below, but\n",
     "# you will need to modify this import before your GPU implementation will work.\n",
-    "from numpy import exp"
+    "# from numpy import exp\n",
+    "from math import exp"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 75,
    "metadata": {
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "# Modify these 3 function calls to run on the GPU.\n",
+    "@vectorize(['float32(float32)'], target = 'cuda')\n",
     "def normalize(grayscales):\n",
     "    return grayscales / 255\n",
     "\n",
+    "@vectorize(['float32(float32, float32)'], target = 'cuda')\n",
     "def weigh(values, weights):\n",
     "    return values * weights\n",
     "        \n",
+    "@vectorize(['float32(float32)'], target = 'cuda')\n",
     "def activate(values):\n",
     "    return ( exp(values) - exp(-values) ) / ( exp(values) + exp(-values) )"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 84,
    "metadata": {
     "collapsed": true
    },
@@ -1352,15 +1731,19 @@
     "# Modify the body of this function to optimize data transfers and therefore speed up performance.\n",
     "# As a constraint, even after you move work to the GPU, make this function return a host array.\n",
     "def create_hidden_layer(n, greyscales, weights, exp, normalize, weigh, activate):\n",
+    "    greyscales_dev = cuda.to_device(greyscales)\n",
+    "    weights_dev = cuda.to_device(weights)\n",
+    "    \n",
+    "    norm_dev = cuda.device_array(shape=(n,), dtype=np.float32)\n",
+    "    weighted_dev = cuda.device_array(shape=(n,), dtype=np.float32)\n",
+    "    activated_dev = cuda.device_array(shape=(n,), dtype=np.float32)\n",
     "    \n",
-    "    normalized = normalize(greyscales)\n",
-    "    weighted = weigh(normalized, weights)\n",
-    "    activated = activate(weighted)\n",
+    "    normalize(greyscales_dev, out=norm_dev)\n",
+    "    weigh(norm_dev, weights_dev, out=weighted_dev)\n",
+    "    activate(weighted_dev, out=activated_dev)\n",
     "    \n",
-    "    # The assessment mechanism will expect `activated` to be a host array, so,\n",
-    "    # even after you refactor this code to run on the GPU, make sure to explicitly copy\n",
-    "    # `activated` back to the host.\n",
-    "    return activated"
+    "    out_host = activated_dev.copy_to_host()\n",
+    "    return out_host"
    ]
   },
   {
@@ -1379,7 +1762,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 85,
    "metadata": {
     "collapsed": true
    },
@@ -1398,11 +1781,17 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 86,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[0.3338969  0.11722048 0.08675494 ... 0.1048661  0.18585935 0.05007837]\n"
+     ]
+    }
+   ],
    "source": [
     "# Use this cell (and feel free to create others) to self-assess your function\n",
     "a = create_hidden_layer(**arguments)\n",
@@ -1425,7 +1814,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 82,
    "metadata": {
     "collapsed": true
    },
@@ -1436,12 +1825,27 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 87,
    "metadata": {
-    "collapsed": true,
     "scrolled": false
    },
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Setting n to 100 million.\n",
+      "\n",
+      "Your function returns a host np.ndarray: True\n",
+      "\n",
+      "Your function took 0.56s to run.\n",
+      "Your function runs fast enough (less than 1 second): True\n",
+      "\n",
+      "Your function returns the correct results: True\n",
+      "Congratulations, you passed! See the instructions below for how to get credit for your work to count toward a certificate in the course.\n"
+     ]
+    }
+   ],
    "source": [
     "assess(create_hidden_layer, arguments)"
    ]
@@ -1493,11 +1897,29 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 88,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "./\r\n",
+      "./.ipynb_checkpoints/\r\n",
+      "./.ipynb_checkpoints/Introduction to CUDA Python with Numba-checkpoint.ipynb\r\n",
+      "./Introduction to CUDA Python with Numba.ipynb\r\n",
+      "./solutions/\r\n",
+      "./solutions/make_pulses_solution.py\r\n",
+      "./solutions/zero_suppress_solution.py\r\n",
+      "./solutions/monte_carlo_pi_solution.py\r\n",
+      "./images/\r\n",
+      "./images/DLI Header.png\r\n",
+      "./images/run_the_assessment.png\r\n",
+      "./images/numba_flowchart.png\r\n",
+      "tar: .: file changed as we read it\r\n"
+     ]
+    }
+   ],
    "source": [
     "!tar -zcvf section1.tar.gz ."
    ]
@@ -1528,7 +1950,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 89,
    "metadata": {
     "collapsed": true
    },
@@ -1563,11 +1985,26 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 90,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[[-0.53340748 -0.84585842]\n",
+      " [-0.6538219   0.75664848]\n",
+      " [-0.33922312  0.94070594]\n",
+      " [ 0.66932384 -0.7429708 ]\n",
+      " [-0.80223363  0.59701022]\n",
+      " [-0.88033165  0.47435871]\n",
+      " [ 0.65265486 -0.75765535]\n",
+      " [ 0.83383129 -0.55201937]\n",
+      " [ 0.64985792 -0.76005571]\n",
+      " [-0.83615082 -0.54849959]]\n"
+     ]
+    }
+   ],
    "source": [
     "angles = np.random.uniform(-np.pi, np.pi, 10)\n",
     "coords = np.stack([np.cos(angles), np.sin(angles)], axis=1)\n",
@@ -1583,11 +2020,22 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
+   "execution_count": 91,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([1.        , 1.        , 1.        , 1.        , 1.        ,\n",
+       "       0.99999994, 1.        , 1.        , 1.        , 1.        ],\n",
+       "      dtype=float32)"
+      ]
+     },
+     "execution_count": 91,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
    "source": [
     "l2_norm(coords)"
    ]
diff --git a/nvidia1/bumashka.png b/nvidia1/bumashka.png
new file mode 100644
index 0000000..09053ec
--- /dev/null
+++ b/nvidia1/bumashka.png