<img src="./images/DLI_Header.png" width=400/>

# Fundamentals of Accelerated Data Science # 

## 08 - Multi-GPU K-Means with Dask ##

**Table of Contents**
<br>
This notebook uses GPU-accelerated K-means with Dask to identify population clusters in a way that scales across multiple GPUs and multiple nodes. It covers the following sections:
1. [Environment](#Environment)
2. [Load and Persist Data](#Load-and-Persist-Data)
3. [Training the Model](#Training-the-Model)
    * [Exercise #1 - Count Members of the Southernmost Cluster](#Exercise-#1---Count-Members-of-the-Southernmost-Cluster)

## Environment ##
First, we import the modules needed to create a local Dask CUDA cluster. As before, libraries that create a CUDA context should be imported only after the cluster is set up, so that they don't lock to a single device.

In [1]:
import subprocess
import logging

from dask.distributed import Client, wait, progress
from dask_cuda import LocalCUDACluster

In [2]:
import cudf
import dask_cudf

import cuml
from cuml.dask.cluster import KMeans

In [3]:
# find this machine's first IP address so the cluster can bind to it
cmd = "hostname --all-ip-addresses"
process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)
output, error = process.communicate()
IPADDR = output.decode().split()[0]

# create the cluster (one Dask worker per visible GPU) and connect a client to it
cluster = LocalCUDACluster(ip=IPADDR, silence_logs=logging.ERROR)
client = Client(cluster)

## Load and Persist Data ##
We begin by loading the data. The data set contains the two grid coordinate columns, `easting` and `northing`, derived from the main population data set we prepared earlier.

In [4]:
ddf = dask_cudf.read_csv('./data/uk_pop5x_coords.csv', dtype=['float32', 'float32'])

## Training the Model ##
Training the K-means model is very similar to both the scikit-learn version and the cuML single-GPU version. Because we have set up the client and imported `KMeans` from the `cuml.dask.cluster` module, the algorithm automatically uses the local Dask cluster we created.

Note that calling `.fit` triggers Dask computation.

Once we have the fit model, we extract the cluster centers and rename the columns from their generic `0` and `1` to reflect the data on which they were trained.

In [5]:
%%time
dkm = KMeans(n_clusters=20)
dkm.fit(ddf)

CPU times: user 5.17 s, sys: 2.95 s, total: 8.12 s
Wall time: 1min 57s
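
For intuition about what a K-means fit computes, here is a minimal plain-Python sketch of Lloyd's algorithm, the textbook K-means iteration. It is an illustration only and does not reflect how cuML implements `KMeans` on the GPU; the function name `kmeans`, its parameters, and the toy points are all hypothetical.

```python
import random

def kmeans(points, n_clusters, n_iter=20, seed=0):
    """Plain-Python Lloyd's algorithm on 2-D points (illustrative only)."""
    rng = random.Random(seed)
    # Start from n_clusters distinct points chosen at random.
    centers = rng.sample(points, n_clusters)
    for _ in range(n_iter):
        # Assignment step: group each point with its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2,
            )
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        new_centers = []
        for center, group in zip(centers, groups):
            if group:
                new_centers.append((
                    sum(p[0] for p in group) / len(group),
                    sum(p[1] for p in group) / len(group),
                ))
            else:
                new_centers.append(center)  # leave an empty cluster's center in place
        if new_centers == centers:
            break  # assignments no longer change the centers
        centers = new_centers
    return centers

# Two well-separated toy blobs of (northing, easting) pairs (made-up values).
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_centers = sorted(kmeans(toy_points, n_clusters=2))
```

On this toy input the two centers settle at the means of the two blobs. The distributed version in this notebook exposes the same idea through `fit` and `cluster_centers_`, with the work spread across the Dask workers.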


In [6]:
cluster_centers = dkm.cluster_centers_
cluster_centers.columns = ddf.columns
cluster_centers.dtypes

northing    float32
easting     float32
dtype: object

### Exercise #1 - Count Members of the Southernmost Cluster ###
Using `cluster_centers`, identify the southernmost cluster (the one with the lowest `northing` value) with the `nsmallest` method. Then use `dkm.predict` to get labels for the data, and finally filter the labels to count how many individuals the model assigned to that cluster.

**Instructions**: <br>
* Modify the `<FIXME>` only and execute the cell below to estimate the number of individuals in the southernmost cluster.

In [7]:
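# index of the cluster center with the lowest northing value
south_idx = cluster_centers.nsmallest(1, 'northing').index[0]
# assign every row of the data to its nearest cluster center
labels_predicted = dkm.predict(ddf)
# keep only the labels for the southernmost cluster and count them
labels_predicted[labels_predicted==south_idx].compute().shape[0]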

31435157
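
The three steps of this exercise can also be traced by hand on a tiny data set. The sketch below uses plain Python lists instead of cuDF and Dask objects, so it runs without a GPU; the centers, points, and the helper `nearest_center` are hypothetical stand-ins, not part of the course data or the cuML API.

```python
# Hypothetical toy stand-ins: every pair is (northing, easting).
centers = [(500.0, 300.0), (100.0, 400.0), (900.0, 200.0)]
points = [(110.0, 390.0), (95.0, 410.0), (480.0, 310.0),
          (880.0, 190.0), (120.0, 405.0)]

# Step 1: index of the center with the lowest northing
# (the role played by nsmallest(1, 'northing').index[0]).
south_idx = min(range(len(centers)), key=lambda i: centers[i][0])

# Step 2: label each point with the index of its nearest center
# (the role played by dkm.predict).
def nearest_center(point):
    return min(
        range(len(centers)),
        key=lambda i: (point[0] - centers[i][0]) ** 2 + (point[1] - centers[i][1]) ** 2,
    )

labels = [nearest_center(p) for p in points]

# Step 3: count the labels equal to south_idx
# (the role played by the boolean filter followed by .shape[0]).
south_count = sum(1 for label in labels if label == south_idx)
```

Here the second center has the lowest northing, so `south_idx` is 1, and three of the five toy points are assigned to it.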


In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

**Well Done!**
