{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Fundamentals of Accelerated Data Science # " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 08 - Multi-GPU K-Means with Dask ##\n", "\n", "**Table of Contents**\n", "
\n", "This notebook uses GPU-accelerated K-means to identify population clusters in a multi-node, multi-GPU scalable way with Dask. This notebook covers the following sections: \n", "1. [Environment](#Environment)\n", "2. [Load and Persist Data](#Load-and-Persist-Data)\n", "3. [Training the Model](#Training-the-Model)\n", "    * [Exercise #1 - Count Members of the Southernmost Cluster](#Exercise-#1---Count-Members-of-the-Southernmost-Cluster)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Environment ##\n", "First, we import the modules needed to create a Dask CUDA cluster. As we did before, we import the CUDA context creators after setting up the cluster so they don't lock to a single device. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import subprocess\n", "import logging\n", "\n", "from dask.distributed import Client, wait, progress\n", "from dask_cuda import LocalCUDACluster" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import cudf\n", "import dask_cudf\n", "\n", "import cuml\n", "from cuml.dask.cluster import KMeans" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# create cluster\n", "cmd = \"hostname --all-ip-addresses\"\n", "process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)\n", "output, error = process.communicate()\n", "IPADDR = str(output.decode()).split()[0]\n", "\n", "cluster = LocalCUDACluster(ip=IPADDR, silence_logs=logging.ERROR)\n", "client = Client(cluster)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load and Persist Data ##\n", "We will begin by loading the data. The data set has two grid-coordinate columns, `easting` and `northing`, derived from the main population data set we prepared earlier."
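The load that follows pins both coordinate columns to `float32` via the `dtype` list. As a rough CPU analogue of that idea (the file contents here are invented for illustration), pandas expresses the same request through a column-to-dtype mapping rather than a positional list:

```python
import io
import pandas as pd

# Illustrative stand-in for uk_pop5x_coords.csv; these values are made up.
csv = io.StringIO("easting,northing\n530124.5,180332.1\n321870.0,654221.9\n")

# Request 32-bit floats up front: half the memory of the default float64,
# which matters when the real data set holds hundreds of millions of rows.
df = pd.read_csv(csv, dtype={"easting": "float32", "northing": "float32"})

print(df.dtypes)  # both columns come back as float32
```

The same memory argument is why the GPU load below passes `dtype=['float32', 'float32']` instead of letting types be inferred.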
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "ddf = dask_cudf.read_csv('./data/uk_pop5x_coords.csv', dtype=['float32', 'float32'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training the Model ##\n", "Training the K-means model is very similar to both the scikit-learn version and the single-GPU cuML version: because we have set up a Dask client and imported `KMeans` from the `cuml.dask.cluster` module, the algorithm will automatically run on the local Dask cluster we created above.\n", "\n", "Note that calling `.fit` triggers Dask computation.\n", "\n", "Once we have the fitted model, we extract the cluster centers and rename the columns from their generic `0` and `1` to reflect the data on which they were trained." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5.17 s, sys: 2.95 s, total: 8.12 s\n", "Wall time: 1min 57s\n" ] }, { "data": { "text/html": [ "
KMeansMG()
" ], "text/plain": [ "KMeansMG()" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "dkm = KMeans(n_clusters=20)\n", "dkm.fit(ddf)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "northing float32\n", "easting float32\n", "dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cluster_centers = dkm.cluster_centers_\n", "cluster_centers.columns = ddf.columns\n", "cluster_centers.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise #1 - Count Members of the Southernmost Cluster ###\n", "Using the `cluster_centers`, identify which cluster is the southernmost (has the lowest `northing` value) with the `nsmallest` method, then use `dkm.predict` to get labels for the data, and finally filter the labels to determine how many individuals the model estimated were in that cluster. \n", "\n", "**Instructions**:
\n", "* Modify the `` only and execute the cell below to estimate the number of individuals in the southernmost cluster. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "31435157" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "south_idx = cluster_centers.nsmallest(1, 'northing').index[0]\n", "labels_predicted = dkm.predict(ddf)\n", "labels_predicted[labels_predicted==south_idx].compute().shape[0]" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "\n", "south_idx = cluster_centers.nsmallest(1, 'northing').index[0]\n", "labels_predicted = dkm.predict(ddf)\n", "labels_predicted[labels_predicted==south_idx].compute().shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Click ... for solution. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import IPython\n", "app = IPython.Application.instance()\n", "app.kernel.do_shutdown(True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Well Done!**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.15" } }, "nbformat": 4, "nbformat_minor": 4 }
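The exercise solution hinges on two small ideas: `nsmallest(1, 'northing')` returns the single row with the lowest `northing`, and that row's index is the cluster label; comparing the predicted labels against that label then yields a boolean mask whose row count is the cluster's estimated population. A minimal single-machine sketch of the same logic with pandas (the centers and labels here are invented for illustration, standing in for `dkm.cluster_centers_` and `dkm.predict(ddf)`):

```python
import pandas as pd

# Hypothetical cluster centers; the row index doubles as the cluster label,
# mirroring the structure of dkm.cluster_centers_ after renaming its columns.
centers = pd.DataFrame({
    "easting":  [530_000.0, 320_000.0, 410_000.0],
    "northing": [180_000.0, 650_000.0,  95_000.0],
})

# Lowest northing -> southernmost cluster (here, label 2).
south_idx = centers.nsmallest(1, "northing").index[0]

# Hypothetical per-individual labels, as dkm.predict(ddf) would return.
labels = pd.Series([2, 0, 2, 1, 2])

# Boolean filter + row count = estimated membership of that cluster.
count = labels[labels == south_idx].shape[0]
print(south_idx, count)  # 2 3
```

On the Dask version, the only extra step is the `.compute()` call, which materializes the lazily filtered labels before counting them.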