Fundamentals_of_Accelerated_Data_Science/3-03_dbscan.ipynb


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"./images/DLI_Header.png\" width=400/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Fundamentals of Accelerated Data Science # "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 03 - DBSCAN ##\n",
    "\n",
    "**Table of Contents**\n",
    "<br>\n",
    "This notebook uses GPU-accelerated DBSCAN to identify clusters of infected people. This notebook covers the below sections: \n",
    "1. [Environment](#Environment)\n",
    "2. [Load Data](#Load-Data)\n",
    "3. [DBSCAN Clustering](#DBSCAN-Clustering)\n",
    "    * [Exercise #1 - Make Another DBSCAN Instance](#Exercise-#1---Make-Another-DBSCAN-Instance)\n",
    "4. [Visualize the Clusters](#Visualize-the-Clusters)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Environment ##"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cudf\n",
    "import cuml\n",
    "\n",
    "import cuxfilter as cxf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Data ##\n",
    "For this notebook, we again load a subset of our population data with only the columns we need. An `infected` column has been added to the data to indicate whether or not a person is known to be infected with our simulated virus."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gdf = cudf.read_csv('./data/pop_sample.csv', dtype=['float32', 'float32', 'float32'])\n",
    "print(gdf.dtypes)\n",
    "gdf.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gdf.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gdf['infected'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## DBSCAN Clustering ##\n",
    "DBSCAN is another unsupervised clustering algorithm that is particularly effective when the number of clusters is not known up front and the clusters may have concave or other unusual shapes--a situation that often applies in geospatial analytics.\n",
    "\n",
    "In this series of exercises you will use DBSCAN to identify clusters of infected people by location, which may help us identify groups becoming infected from common patient zeroes and assist in response planning.\n",
    "\n",
    "Create a DBSCAN instance by using `cuml.DBSCAN`. Pass in the named argument `eps` (the maximum distance a point can be from the nearest point in a cluster to be considered possibly in that cluster) to be `5000`. Since the `northing` and `easting` values we created are measured in meters, this will allow us to identify clusters of infected people where individuals may be separated from the rest of the cluster by up to 5 kilometers.\n",
    "\n",
    "Below we train a DBSCAN algorithm. We start by creating a new dataframe from rows of the original dataframe where `infected` is `1` (true), and call it `infected_df`--be sure to reset the dataframe's index afterward. Use `dbscan.fit_predict` to perform clustering on the `northing` and `easting` columns of `infected_df`, and turn the resulting series into a new column in `infected_gdf` called \"cluster\". Finally, compute the number of clusters identified by DBSCAN."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dbscan = cuml.DBSCAN(eps=5000)\n",
    "# dbscan = cuml.DBSCAN(eps=10000)\n",
    "\n",
    "infected_df = gdf[gdf['infected'] == 1].reset_index()\n",
    "infected_df['cluster'] = dbscan.fit_predict(infected_df[['northing', 'easting']])\n",
    "infected_df['cluster'].nunique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise #1 - Make Another DBSCAN Instance ###\n",
    "\n",
    "**Instructions**: <br>\n",
    "* Modify the `<FIXME>` only and execute the below cell to instantiate a DBSCAN instance with `10000` for `eps`.\n",
    "* Modify the `<FIXME>` only and execute the cell below to fit the data and identify infected clusters. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dbscan = cuml.DBSCAN(eps=10000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "infected_df = gdf[gdf['infected'] == 1].reset_index()\n",
    "infected_df['cluster'] = dbscan.fit_predict(infected_df[['northing', 'easting']])\n",
    "infected_df['cluster'].nunique()"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "jupyter": {
     "source_hidden": true
    }
   },
   "source": [
    "\n",
    "dbscan = cuml.DBSCAN(eps=10000)\n",
    "\n",
    "infected_df = gdf[gdf['infected'] == 1].reset_index()\n",
    "infected_df['cluster'] = dbscan.fit_predict(infected_df[['northing', 'easting']])\n",
    "infected_df['cluster'].nunique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Click ... for solution. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualize the Clusters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because we have the same column names as in the K-means example--`easting`, `northing`, and `cluster`--we can use the same code to visualize the clusters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "infected_df.to_pandas().plot(kind='scatter', x='easting', y='northing', c='cluster')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import IPython\n",
    "app = IPython.Application.instance()\n",
    "app.kernel.do_shutdown(True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Well Done!** Let's move to the [next notebook](3-04_logistic_regression.ipynb). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"./images/DLI_Header.png\" width=400/>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}