<img src="./images/DLI_Header.png" width=400/>

# Fundamentals of Accelerated Data Science # 

## 05 - KNN ##

**Table of Contents**
<br>
This notebook uses GPU-accelerated k-nearest neighbors to identify the road nodes nearest to each hospital. It covers the following sections:
1. [Environment](#Environment)
2. [Load Data](#Load-Data)
    * [Road Nodes](#Road-Nodes)
    * [Hospitals](#Hospitals)
3. [K-Nearest Neighbors](#K-Nearest-Neighbors)
    * [Road Nodes Closest to Each Hospital](#Road-Nodes-Closest-to-Each-Hospital)
    * [Viewing a Specific Hospital](#Viewing-a-Specific-Hospital)

## Environment ##

In [1]:
import cudf
import cuml

## Load Data ##

### Road Nodes ###
We begin by reading our road nodes data.

In [2]:
# road_nodes = cudf.read_csv('./data/road_nodes_2-06.csv', dtype=['str', 'float32', 'float32', 'str'])
road_nodes = cudf.read_csv('./data/road_nodes.csv', dtype=['str', 'float32', 'float32', 'str'])

In [3]:
road_nodes.dtypes

node_id     object
east       float32
north      float32
type        object
dtype: object

In [4]:
road_nodes.shape

(3121148, 4)

In [5]:
road_nodes.head()

| | node_id | east | north | type |
|---|---|---|---|---|
| 0 | id02FE73D4-E88D-4119-8DC2-6E80DE6F6594 | 320608.09375 | 870994.0 | junction |
| 1 | id634D65C1-C38B-4868-9080-2E1E47F0935C | 320628.5 | 871103.8125 | road end |
| 2 | idDC14D4D1-774E-487D-8EDE-60B129E5482C | 320635.46875 | 870983.875 | junction |
| 3 | id51555819-1A39-4B41-B0C9-C6D2086D9921 | 320648.6875 | 871083.5625 | junction |
| 4 | id9E362428-79D7-4EE3-B015-0CE3F6A78A69 | 320658.1875 | 871162.375 | junction |


### Hospitals ###
Next we load the hospital data.

In [6]:
hospitals = cudf.read_csv('./data/clean_hospitals_full.csv')

In [7]:
hospitals.dtypes

OrganisationID          int64
OrganisationCode       object
OrganisationType       object
SubType                object
Sector                 object
OrganisationStatus     object
IsPimsManaged          object
OrganisationName       object
Address1               object
Address2               object
Address3               object
City                   object
County                 object
Postcode               object
Latitude              float64
Longitude             float64
ParentODSCode          object
ParentName             object
Phone                  object
Email                  object
Website                object
Fax                    object
northing              float64
easting               float64
dtype: object

In [8]:
hospitals.shape

(1226, 24)

In [9]:
hospitals.head()

| | OrganisationID | OrganisationCode | OrganisationType | SubType | Sector | OrganisationStatus | IsPimsManaged | OrganisationName | Address1 | Address2 | ... | Latitude | Longitude | ParentODSCode | ParentName | Phone | Email | Website | Fax | northing | easting |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17970 | NDA07 | Hospital | Hospital | Independent Sector | Visible | True | Walton Community Hospital - Virgin Care Servic... | | Rodney Road | ... | 51.379997 | -0.406042 | NDA | Virgin Care Services Ltd | 01932 414205 | | | 01932 253674 | 165810.4688 | 510917.5313 |
| 1 | 17981 | NDA18 | Hospital | Hospital | Independent Sector | Visible | True | Woking Community Hospital (Virgin Care) | | Heathside Road | ... | 51.315132 | -0.556289 | NDA | Virgin Care Services Ltd | 01483 715911 | | | | 158381.3438 | 500604.8438 |
| 2 | 18102 | NLT02 | Hospital | Hospital | NHS Sector | Visible | True | North Somerset Community Hospital | North Somerset Community Hospital | Old Street | ... | 51.437195 | -2.847193 | NLT | North Somerset Community Partnership Community... | 01275 872212 | | http://www.nscphealth.co.uk | | 171305.7813 | 341119.375 |
| 3 | 18138 | NMP01 | Hospital | Hospital | Independent Sector | Visible | False | Bridgewater Hospital | 120 Princess Road | | ... | 53.459743 | -2.245469 | NMP | Bridgewater Hospital (Manchester) Ltd | 0161 2270000 | | www.bridgewaterhospital.com | | 395944.5625 | 383703.5938 |
| 4 | 18142 | NMV01 | Hospital | Hospital | Independent Sector | Visible | True | Kneesworth House | Old North Road | Bassingbourn | ... | 52.078121 | -0.030604 | NMV | Partnerships In Care Ltd | 01763 255 700 | reception_kneesworthhouse@partnershipsincare.c... | www.partnershipsincare.co.uk | | 244071.7031 | 534945.1875 |


## K-Nearest Neighbors ##
We are going to use the [k-nearest neighbors](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) algorithm to find the *k* nearest road nodes for every hospital. We first fit a KNN model on the road node locations, and then query the fitted model with the hospital locations so that it returns the nearest road nodes.

Create a k-nearest neighbors model `knn` by using the `cuml.NearestNeighbors` constructor, passing it the named argument `n_neighbors` set to 3.

Create a new dataframe `road_locs` using the `road_nodes` columns `east` and `north`. The column order itself is arbitrary, but it must stay the same between fitting the model and querying it, so please use the ordering `['east', 'north']`.

Fit the `knn` model with `road_locs` using the `knn.fit` method.

In [10]:

knn = cuml.NearestNeighbors(n_neighbors=3)


In [11]:

road_locs = road_nodes[['east', 'north']]
knn.fit(road_locs)
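
To make concrete what a nearest-neighbors query computes, here is a minimal CPU sketch in plain Python. It is a brute-force illustration only and makes no claim about how cuML implements the search on the GPU. The function `k_nearest` and the query point are hypothetical and introduced just for this example; the five road locations are the ones shown in the `road_nodes.head()` output above.

```python
import heapq
import math

def k_nearest(points, queries, k):
    """Brute-force k-nearest neighbors (illustrative helper, not part of cuML).

    Returns (distances, indices): one row per query, each row holding the k
    smallest Euclidean distances and the matching positions in `points`,
    nearest first.
    """
    all_distances, all_indices = [], []
    for qx, qy in queries:
        # Pair every candidate's distance with its position in `points`.
        scored = (
            (math.hypot(px - qx, py - qy), i)
            for i, (px, py) in enumerate(points)
        )
        nearest = heapq.nsmallest(k, scored)  # sorted ascending by distance
        all_distances.append([d for d, _ in nearest])
        all_indices.append([i for _, i in nearest])
    return all_distances, all_indices

# Toy data: the five (east, north) pairs from the road_nodes.head() output.
toy_road_locs = [
    (320608.09375, 870994.0),
    (320628.5, 871103.8125),
    (320635.46875, 870983.875),
    (320648.6875, 871083.5625),
    (320658.1875, 871162.375),
]
# Hypothetical query point, chosen only for illustration.
toy_queries = [(320630.0, 871000.0)]

toy_distances, toy_indices = k_nearest(toy_road_locs, toy_queries, k=3)
print(toy_indices)  # [[2, 0, 3]]
```

The result has one row per query and one entry per neighbor, and the indices are positions in the fitted data rather than identifiers, which is also how the results are used later in this notebook.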


### Road Nodes Closest to Each Hospital ###
Use the `knn.kneighbors` method to find the 3 closest road nodes to each hospital. `knn.kneighbors` takes 2 arguments here: `X`, for which you should use the `easting` and `northing` columns of `hospitals` (in that order, to match the `['east', 'north']` order used when fitting the `knn` model above), and `n_neighbors`, the number of neighbors to search for, in this case 3.

`knn.kneighbors` will return 2 cuDF DataFrames, which you should name `distances` and `indices` respectively. Each has one row per hospital and one column per neighbor, ordered from nearest to farthest.

In [12]:
distances, indices = knn.kneighbors(hospitals[['easting', 'northing']], 3) # order has to match the knn fit order (east, north)
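
The column-order requirement is easy to overlook because `hospitals` stores `northing` before `easting`. The following plain-Python sketch shows what happens if the two columns are swapped. It uses the coordinates of hospital 10 and of its nearest road node as they appear in the outputs later in this notebook; the variable names are illustrative only.

```python
import math

# (east, north) of the road node nearest to hospital 10, from this notebook's output.
road_node = (260697.859375, 56322.710938)

# Hospital 10's coordinates in the correct order and in the swapped order.
hospital_east_north = (260713.1719, 56303.21875)
hospital_north_east = (56303.21875, 260713.1719)

correct = math.dist(hospital_east_north, road_node)
swapped = math.dist(hospital_north_east, road_node)

print(f"correct order: {correct:.1f}")
print(f"swapped order: {swapped:.1f}")
```

With the correct order the node is a few tens of grid units away; with the columns swapped the same pair appears to be hundreds of thousands of units apart, so a query made in the wrong order would silently return unrelated road nodes.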


### Viewing a Specific Hospital ###
We can now use `indices`, `hospitals`, and `road_nodes` to derive information specific to a given hospital. Here we will examine the hospital at index `10`. First we view the hospital's grid coordinates:

In [13]:
SELECTED_RESULT = 10
print('hospital coordinates:\n', hospitals.loc[SELECTED_RESULT, ['easting', 'northing']], sep='')

hospital coordinates:
easting     260713.17190
northing     56303.21875
Name: 10, dtype: float64


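Now we view the row indices of the 3 closest road nodes. Note that these are positions in `road_nodes`, not values from its `node_id` column, despite the label printed below: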

In [14]:
nearest_road_nodes = indices.iloc[SELECTED_RESULT, 0:3]
print('node_id:\n', nearest_road_nodes, sep='')

node_id:
0    118559
1    118560
2    118678
Name: 10, dtype: int64


And finally the grid coordinates for the 3 nearest road nodes, which we can confirm are listed in order of increasing distance from the hospital:

In [15]:
print('road_node coordinates:\n', road_nodes.loc[nearest_road_nodes, ['east', 'north']], sep='')

road_node coordinates:
                 east         north
118559  260697.859375  56322.710938
118560  260722.812500  56207.925781
118678  260540.000000  56105.000000
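
As a quick check of that ordering, the distances can be recomputed by hand from the coordinates printed above. This is a plain-Python sketch that is independent of the `distances` result returned by the model; the variable names are illustrative.

```python
import math

# (easting, northing) of hospital 10, from the output above.
hospital = (260713.1719, 56303.21875)

# (east, north) of the three nearest road nodes, keyed by row index.
nearest_nodes = {
    118559: (260697.859375, 56322.710938),
    118560: (260722.8125, 56207.925781),
    118678: (260540.0, 56105.0),
}

# Euclidean distance from the hospital to each node, in the listed order.
recomputed = [math.dist(hospital, loc) for loc in nearest_nodes.values()]
print([round(d, 1) for d in recomputed])  # [24.8, 95.8, 263.2]
```

The recomputed distances increase from the first node to the third, consistent with the order in which the indices were returned.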


In [16]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

**Well Done!** Let's move to the [next notebook](3-06_xgboost.ipynb). 
