<img src="./images/DLI_Header.png" width=400/>

# Fundamentals of Accelerated Data Science # 

## 02 - K-Means ##

**Table of Contents**
<br>
This notebook uses GPU-accelerated K-means to find the best locations for a fixed number of humanitarian supply airdrop depots. It covers the following sections:
1. [Environment](#Environment)
2. [Load Data](#Load-Data)
3. [K-Means Clustering](#K-Means-Clustering)
    * [Exercise #1 - Make Another `KMeans` Instance](#Exercise-#1---Make-Another-KMeans-Instance)
4. [Visualize the Clusters](#Visualize-the-Clusters)

## Environment ##
For the first time we import `cuml`, the RAPIDS GPU-accelerated library containing many common machine learning algorithms. We will visualize the results in this notebook, so we also import `cuxfilter`.

In [1]:
# DO NOT CHANGE THIS CELL
import cudf
import cuml
import cupy as cp
import cuxfilter as cxf

## Load Data ##
For this notebook we again load the cleaned UK population data. We are not looking at counties specifically here, so we omit that column and keep only the grid coordinate columns.

In [2]:
# DO NOT CHANGE THIS CELL
gdf = cudf.read_csv('./data/clean_uk_pop.csv', usecols=['easting', 'northing'])
print(gdf.dtypes)
gdf.shape

northing    float64
easting     float64
dtype: object


(58479894, 2)
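The shape and dtypes printed above allow a rough estimate of how much GPU memory the coordinate columns occupy. The following is a back-of-the-envelope sketch, assuming 8 bytes per `float64` value and ignoring any overhead cuDF adds on top of the raw column data:

```python
# Row and column counts taken from the gdf.shape output above
rows, cols = 58_479_894, 2
bytes_per_value = 8  # assumption: float64 values stored as 8 bytes each, no overhead

total_bytes = rows * cols * bytes_per_value
print(f"{total_bytes:,} bytes, about {total_bytes / 2**30:.2f} GiB")
```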

In [30]:
gdf.head()

|   | northing | easting | cluster |
|---|----------|---------|---------|
| 0 | 515491.5313 | 430772.1875 | 1 |
| 1 | 503572.4688 | 434685.875 | 1 |
| 2 | 517903.6563 | 432565.5313 | 1 |
| 3 | 517059.9063 | 427660.625 | 1 |
| 4 | 509228.6875 | 425527.7813 | 1 |


## K-Means Clustering ##

In [4]:
# DO NOT CHANGE THIS CELL
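# instantiate a K-means model with 5 clusters
km = cuml.KMeans(n_clusters=5)

# fit the model to the easting/northing coordinates
km.fit(gdf)

# the fitted cluster centers (candidate depot locations)
km.cluster_centers_

# store each point's cluster label; labels 0-4 fit in uint8
gdf['cluster'] = km.labels_.astype('uint8')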
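To make concrete what `km.fit(gdf)` computes, the sketch below runs a plain-Python version of the standard K-means loop (Lloyd's algorithm) on six toy points. It is an illustration only, not cuML's implementation: the helper names `assign_labels`, `update_centers`, and `kmeans` are hypothetical, the initialization is simple random sampling, and the toy coordinates are made up.

```python
import random

def assign_labels(points, centers):
    """Assign each point the index of its nearest center (squared Euclidean distance)."""
    labels = []
    for x, y in points:
        dists = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centers]
        labels.append(dists.index(min(dists)))
    return labels

def update_centers(points, labels, centers):
    """Move each center to the mean of its points; an empty cluster keeps its old center."""
    new_centers = []
    for k, old_center in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            mean_x = sum(x for x, _ in members) / len(members)
            mean_y = sum(y for _, y in members) / len(members)
            new_centers.append((mean_x, mean_y))
        else:
            new_centers.append(old_center)
    return new_centers

def kmeans(points, n_clusters, max_iter=100, seed=0):
    """Alternate assignment and update steps until the centers stop moving."""
    rng = random.Random(seed)
    centers = rng.sample(points, n_clusters)  # assumption: random points as initial centers
    for _ in range(max_iter):
        labels = assign_labels(points, centers)
        new_centers = update_centers(points, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign_labels(points, centers)

# Toy data (made up): two well-separated groups of three points each
toy_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
              (10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]
toy_centers, toy_labels = kmeans(toy_points, n_clusters=2)
print(toy_centers)
print(toy_labels)
```

In this sketch the returned centers play the role of `km.cluster_centers_` and the returned labels the role of `km.labels_`.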

## Visualize the Clusters ##

In [44]:
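import colorsys

# sorted list of the distinct cluster labels
labels = sorted([int(i) for i in gdf['cluster'].unique().values])

def col(i, a):
    """Return a hex color for label i out of a labels, with hues evenly spaced."""
    r, g, b = colorsys.hsv_to_rgb(i/a, 1, 1)
    return '#%02x%02x%02x' % (int(r*255), int(g*255), int(b*255))

# map each cluster label to its color
color_key = {i: col(i, len(labels)) for i in labels}
color_key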

{0: '#ff0000', 1: '#cbff00', 2: '#00ff66', 3: '#0066ff', 4: '#cc00ff'}

In [53]:
cxf_data = cxf.DataFrame.from_dataframe(gdf)

# scatter() has no `color_column` argument, so only `aggregate_col` is passed
scatter_chart = cxf.charts.datashader.scatter(
    x='easting',
    y='northing',
    point_size=1,
    pixel_shade_type='linear',
    aggregate_col='cluster',
)

cluster_widget = cxf.charts.panel_widgets.multi_select('cluster')

In [54]:
dash = cxf_data.dashboard(
    charts=[scatter_chart],
    sidebar=[cluster_widget],
    theme=cxf.themes.dark,
    data_size_widget=True
)

dash.app()

In [None]:
# shut down and restart the kernel
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)