<img src="./images/DLI_Header.png" width=400/>

# Fundamentals of Accelerated Data Science # 

## 05 - KNN ##

**Table of Contents**
<br>
This notebook uses GPU-accelerated k-nearest neighbors to identify the road nodes nearest to each hospital. It covers the following sections:
1. [Environment](#Environment)
2. [Load Data](#Load-Data)
    * [Road Nodes](#Road-Nodes)
    * [Hospitals](#Hospitals)
3. [K-Nearest Neighbors](#K-Nearest-Neighbors)
    * [Road Nodes Closest to Each Hospital](#Road-Nodes-Closest-to-Each-Hospital)
    * [Viewing a Specific Hospital](#Viewing-a-Specific-Hospital)

## Environment ##

In [None]:
import cudf
import cuml

## Load Data ##

### Road Nodes ###
We begin by reading our road nodes data.

In [None]:
# Read the road nodes, giving an explicit dtype for each of the four columns in file order.
road_nodes = cudf.read_csv('./data/road_nodes.csv', dtype=['str', 'float32', 'float32', 'str'])

In [None]:
road_nodes.dtypes

In [None]:
road_nodes.shape

In [None]:
road_nodes.head()

### Hospitals ###
Next we load the hospital data.

In [None]:
hospitals = cudf.read_csv('./data/clean_hospitals_full.csv')

In [None]:
hospitals.dtypes

In [None]:
hospitals.shape

In [None]:
hospitals.head()

## K-Nearest Neighbors ##
We are going to use the [k-nearest neighbors](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) algorithm to find the *k* nearest road nodes for every hospital. We first fit a KNN model on the road node locations, then query the fitted model with the hospital locations so that it returns the nearest road nodes.
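
To make the fit-then-query idea concrete, here is a minimal CPU-only sketch in plain Python of what a nearest-neighbor query computes. It is a brute-force illustration under stated assumptions, not a description of how cuML performs the search; `k_nearest` and the toy coordinates are hypothetical and made up for this example.

```python
import math

def k_nearest(points, query, k):
    """Return (distances, indices) of the k points closest to query."""
    # Pair each point's Euclidean distance to the query with its row position.
    scored = [(math.dist(p, query), i) for i, p in enumerate(points)]
    scored.sort()  # ascending distance; ties are broken by the lower row position
    nearest = scored[:k]
    return [d for d, _ in nearest], [i for _, i in nearest]

# Hypothetical (east, north) coordinates standing in for road nodes and one hospital.
toy_road_locs = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (1.0, 1.0), (10.0, 0.0)]
toy_hospital = (0.0, 2.0)

toy_distances, toy_indices = k_nearest(toy_road_locs, toy_hospital, k=3)
print(toy_indices)    # row positions into toy_road_locs, nearest first
print(toy_distances)  # matching distances, in ascending order
```

Note that the result identifies neighbors by their row position in the searched data, so those positions have to be mapped back to the original rows to recover coordinates.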

Create a k-nearest neighbors model `knn` by using the `cuml.NearestNeighbors` constructor, passing it the named argument `n_neighbors` set to 3.

Create a new dataframe `road_locs` using the `road_nodes` columns `east` and `north`. The order of the columns doesn't matter, except that we will need them to remain consistent over multiple operations, so please use the ordering `['east', 'north']`.

Fit the `knn` model with `road_locs` using the `knn.fit` method.

In [None]:
knn = cuml.NearestNeighbors(n_neighbors=3)

In [None]:
road_locs = road_nodes[['east', 'north']]
knn.fit(road_locs)


### Road Nodes Closest to Each Hospital ###
Use the `knn.kneighbors` method to find the 3 closest road nodes to each hospital. `knn.kneighbors` expects 2 arguments: `X`, for which you should use the `easting` and `northing` columns of `hospitals` (remember to retain the same column order as when you fit the `knn` model above), and `n_neighbors`, the number of neighbors to search for--in this case, 3. 

`knn.kneighbors` returns two cuDF dataframes, which you should name `distances` and `indices` respectively.
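
The reminder about column order matters because a swapped query raises no error; it is simply interpreted as a different location. The following plain-Python sketch illustrates this with a one-nearest-neighbor lookup. The helper `nearest_index` and all coordinates are hypothetical, chosen only to make the effect visible.

```python
import math

# Hypothetical road nodes stored as (east, north) pairs.
toy_roads = [(100.0, 900.0), (900.0, 100.0)]

def nearest_index(points, query):
    """Row position of the point closest to query (Euclidean distance)."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], query))

# Hypothetical hospital location.
hospital_east, hospital_north = 120.0, 880.0

# Same column order as the searched data: finds the node that is actually close.
right_order = nearest_index(toy_roads, (hospital_east, hospital_north))

# Columns swapped: the query silently lands elsewhere and picks a different node.
swapped_order = nearest_index(toy_roads, (hospital_north, hospital_east))

print(right_order, swapped_order)
```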

In [None]:
distances, indices = knn.kneighbors(hospitals[['easting', 'northing']], 3) # order has to match the knn fit order (east, north)


### Viewing a Specific Hospital ###
We can now use `indices`, `hospitals`, and `road_nodes` to derive information specific to a given hospital. Here we will examine the hospital at index `10`. First we view the hospital's grid coordinates:

In [None]:
SELECTED_RESULT = 10
print('hospital coordinates:\n', hospitals.loc[SELECTED_RESULT, ['easting', 'northing']], sep='')

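Now we view the indices of the 3 closest road nodes. These are row positions in `road_nodes` (whose rows match `road_locs`, the data `knn` was fit on), not node identifiers stored in the data:

In [None]:
nearest_road_nodes = indices.iloc[SELECTED_RESULT, 0:3]
print('road_nodes row indices:\n', nearest_road_nodes, sep='')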

And finally the grid coordinates for the 3 nearest road nodes, which we can confirm are located in order of increasing distance from the hospital:

In [None]:
print('road_node coordinates:\n', road_nodes.loc[nearest_road_nodes, ['east', 'north']], sep='')
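
One way to confirm the ordering by hand is to compute the straight-line distance from the hospital to each returned node and check that the values never decrease. The sketch below does this in plain Python with hypothetical coordinates; with the real data, the printed coordinates above could be substituted for the toy values.

```python
import math

# Hypothetical grid coordinates (east, north); not taken from the dataset.
toy_hospital = (1000.0, 2000.0)
toy_nearest_nodes = [(1003.0, 2004.0), (1006.0, 2008.0), (1000.0, 2013.0)]

# Euclidean distance from the hospital to each node, in the order returned.
toy_dists = [math.dist(toy_hospital, node) for node in toy_nearest_nodes]

# Nearest-first ordering holds if no distance is smaller than the one before it.
is_nearest_first = all(a <= b for a, b in zip(toy_dists, toy_dists[1:]))
print(toy_dists, is_nearest_first)
```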

In [None]:
# Restart the kernel to release GPU memory before moving on to the next notebook.
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

**Well Done!** Let's move to the [next notebook](3-06_xgboost.ipynb). 
