path: root/Fundamentals_of_Accelerated_Data_Science/4-02_find_infected.ipynb
author leshe4ka46 <alex9102naid1@ya.ru> 2025-10-18 12:25:53 +0300
committer leshe4ka46 <alex9102naid1@ya.ru> 2025-10-18 12:25:53 +0300
commit 910a222fa60ce6ea0831f2956470b8a0b9f62670 (patch)
tree 1d6bbccafb667731ad127f93390761100fc11b53 /Fundamentals_of_Accelerated_Data_Science/4-02_find_infected.ipynb
parent 35b9040e4104b0e79bf243a2c9769c589f96e2c4 (diff)
nvidia2
Diffstat (limited to 'Fundamentals_of_Accelerated_Data_Science/4-02_find_infected.ipynb')
-rw-r--r-- Fundamentals_of_Accelerated_Data_Science/4-02_find_infected.ipynb 1007
1 files changed, 1007 insertions, 0 deletions
diff --git a/Fundamentals_of_Accelerated_Data_Science/4-02_find_infected.ipynb b/Fundamentals_of_Accelerated_Data_Science/4-02_find_infected.ipynb
new file mode 100644
index 0000000..9b857e2
--- /dev/null
+++ b/Fundamentals_of_Accelerated_Data_Science/4-02_find_infected.ipynb
@@ -0,0 +1,1007 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<a href=\"https://www.nvidia.com/dli\"><img src=\"images/DLI_Header.png\" alt=\"Header\" style=\"width: 400px;\"/></a>"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Week 1: Find Clusters of Infected People"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<span style=\"color:red\">\n",
+ "**URGENT WARNING**\n",
+ "\n",
+ "We have been receiving reports from health facilities that a new, fast-spreading virus has been discovered in the population. To prepare our response, we need to understand the geospatial distribution of those who have been infected. Find out whether there are identifiable clusters of infected individuals and where they are. \n",
+ "</span>\n",
+ "\n",
+ "Your goal for this notebook will be to estimate the location of dense geographic clusters of infected people using incoming data from week 1 of the simulated epidemic."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Imports"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%load_ext cudf.pandas\n",
+ "import pandas as pd\n",
+ "import cuml\n",
+ "\n",
+ "import cupy as cp"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Load Data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Begin by loading the data you've received about week 1 of the outbreak into a cuDF-accelerated pandas DataFrame. The data is located at `'./data/week1.csv'`. For this notebook you will only need the `'lat'`, `'long'`, and `'infected'` columns. Either drop the columns after loading, or use the `pd.read_csv` named argument `usecols` to provide a list of only the columns you need."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>lat</th>\n",
+ " <th>long</th>\n",
+ " <th>infected</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>54.522510</td>\n",
+ " <td>-1.571896</td>\n",
+ " <td>False</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>54.554030</td>\n",
+ " <td>-1.524968</td>\n",
+ " <td>False</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>54.552486</td>\n",
+ " <td>-1.435203</td>\n",
+ " <td>False</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>54.537189</td>\n",
+ " <td>-1.566215</td>\n",
+ " <td>False</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>54.528212</td>\n",
+ " <td>-1.588462</td>\n",
+ " <td>False</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>...</th>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>58479889</th>\n",
+ " <td>51.634416</td>\n",
+ " <td>-2.925863</td>\n",
+ " <td>False</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>58479890</th>\n",
+ " <td>51.556972</td>\n",
+ " <td>-3.036290</td>\n",
+ " <td>False</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>58479891</th>\n",
+ " <td>51.588992</td>\n",
+ " <td>-2.921915</td>\n",
+ " <td>False</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>58479892</th>\n",
+ " <td>51.590974</td>\n",
+ " <td>-2.954539</td>\n",
+ " <td>False</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>58479893</th>\n",
+ " <td>51.576716</td>\n",
+ " <td>-2.952142</td>\n",
+ " <td>False</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>58479894 rows × 3 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " lat long infected\n",
+ "0 54.522510 -1.571896 False\n",
+ "1 54.554030 -1.524968 False\n",
+ "2 54.552486 -1.435203 False\n",
+ "3 54.537189 -1.566215 False\n",
+ "4 54.528212 -1.588462 False\n",
+ "... ... ... ...\n",
+ "58479889 51.634416 -2.925863 False\n",
+ "58479890 51.556972 -3.036290 False\n",
+ "58479891 51.588992 -2.921915 False\n",
+ "58479892 51.590974 -2.954539 False\n",
+ "58479893 51.576716 -2.952142 False\n",
+ "\n",
+ "[58479894 rows x 3 columns]"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df = pd.read_csv('./data/week1.csv', usecols=[\"lat\", \"long\", \"infected\"])\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Make Data Frame of the Infected"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Make a new DataFrame `infected_df` that contains only the infected members of the population.\n",
+ "\n",
+ "**Tip**: Reset the index of `infected_df` with `.reset_index(drop=True)`. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "infected_df = df[df[\"infected\"]]\n",
+ "infected_df = infected_df.reset_index(drop=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Make Grid Coordinates for Infected Locations"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The next cell provides the lat/long to OSGB36 grid coordinate converter you used earlier in the workshop. (Expand the cell by clicking on the \"...\"; after executing it, contract it again by clicking on the blue left border of the cell.) Use this converter to compute grid coordinates and store them in `northing` and `easting` columns of the `infected_df` you created in the last step."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# https://www.ordnancesurvey.co.uk/docs/support/guide-coordinate-systems-great-britain.pdf\n",
+ "\n",
+ "def latlong2osgbgrid_cupy(lat, long, input_degrees=True):\n",
+ "    '''\n",
+ "    Converts latitude and longitude (ellipsoidal) coordinates into northing and easting (grid) coordinates, using a Transverse Mercator projection.\n",
+ "\n",
+ "    Inputs:\n",
+ "    lat: latitude coordinate (N)\n",
+ "    long: longitude coordinate (E)\n",
+ "    input_degrees: if True (default), interprets the coordinates as degrees; otherwise, interprets coordinates as radians\n",
+ "\n",
+ "    Output:\n",
+ "    (northing, easting)\n",
+ "    '''\n",
+ "\n",
+ "    if input_degrees:\n",
+ "        lat = lat * cp.pi/180\n",
+ "        long = long * cp.pi/180\n",
+ "\n",
+ "    a = 6377563.396\n",
+ "    b = 6356256.909\n",
+ "    e2 = (a**2 - b**2) / a**2\n",
+ "\n",
+ "    N0 = -100000  # northing of true origin\n",
+ "    E0 = 400000  # easting of true origin\n",
+ "    F0 = .9996012717  # scale factor on central meridian\n",
+ "    phi0 = 49 * cp.pi / 180  # latitude of true origin\n",
+ "    lambda0 = -2 * cp.pi / 180  # longitude of true origin and central meridian\n",
+ "\n",
+ "    sinlat = cp.sin(lat)\n",
+ "    coslat = cp.cos(lat)\n",
+ "    tanlat = cp.tan(lat)\n",
+ "\n",
+ "    latdiff = lat-phi0\n",
+ "    longdiff = long-lambda0\n",
+ "\n",
+ "    n = (a-b) / (a+b)\n",
+ "    nu = a * F0 * (1 - e2 * sinlat ** 2) ** -.5\n",
+ "    rho = a * F0 * (1 - e2) * (1 - e2 * sinlat ** 2) ** -1.5\n",
+ "    eta2 = nu / rho - 1\n",
+ "    M = b * F0 * ((1 + n + 5/4 * (n**2 + n**3)) * latdiff -\n",
+ "                  (3*(n+n**2) + 21/8 * n**3) * cp.sin(latdiff) * cp.cos(lat+phi0) +\n",
+ "                  15/8 * (n**2 + n**3) * cp.sin(2*(latdiff)) * cp.cos(2*(lat+phi0)) -\n",
+ "                  35/24 * n**3 * cp.sin(3*(latdiff)) * cp.cos(3*(lat+phi0)))\n",
+ "    I = M + N0\n",
+ "    II = nu/2 * sinlat * coslat\n",
+ "    III = nu/24 * sinlat * coslat ** 3 * (5 - tanlat ** 2 + 9 * eta2)\n",
+ "    IIIA = nu/720 * sinlat * coslat ** 5 * (61-58 * tanlat**2 + tanlat**4)\n",
+ "    IV = nu * coslat\n",
+ "    V = nu / 6 * coslat**3 * (nu/rho - cp.tan(lat)**2)\n",
+ "    VI = nu / 120 * coslat ** 5 * (5 - 18 * tanlat**2 + tanlat**4 + 14 * eta2 - 58 * tanlat**2 * eta2)\n",
+ "\n",
+ "    northing = I + II * longdiff**2 + III * longdiff**4 + IIIA * longdiff**6\n",
+ "    easting = E0 + IV * longdiff + V * longdiff**3 + VI * longdiff**5\n",
+ "\n",
+ "    return northing, easting"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>lat</th>\n",
+ " <th>long</th>\n",
+ " <th>infected</th>\n",
+ " <th>northing</th>\n",
+ " <th>easting</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>54.472766</td>\n",
+ " <td>-1.654932</td>\n",
+ " <td>True</td>\n",
+ " <td>508670.060234</td>\n",
+ " <td>422359.759523</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>54.529717</td>\n",
+ " <td>-1.667143</td>\n",
+ " <td>True</td>\n",
+ " <td>515002.666798</td>\n",
+ " <td>421538.547038</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>54.512986</td>\n",
+ " <td>-1.589866</td>\n",
+ " <td>True</td>\n",
+ " <td>513167.535850</td>\n",
+ " <td>426549.874086</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>54.522322</td>\n",
+ " <td>-1.380694</td>\n",
+ " <td>True</td>\n",
+ " <td>514305.280055</td>\n",
+ " <td>440081.234798</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>54.541660</td>\n",
+ " <td>-1.613490</td>\n",
+ " <td>True</td>\n",
+ " <td>516349.132042</td>\n",
+ " <td>425003.005560</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>...</th>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>18143</th>\n",
+ " <td>52.428347</td>\n",
+ " <td>-3.322932</td>\n",
+ " <td>True</td>\n",
+ " <td>282016.338253</td>\n",
+ " <td>310060.098268</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>18144</th>\n",
+ " <td>52.415895</td>\n",
+ " <td>-3.263942</td>\n",
+ " <td>True</td>\n",
+ " <td>280559.681381</td>\n",
+ " <td>314046.146547</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>18145</th>\n",
+ " <td>52.539934</td>\n",
+ " <td>-3.617128</td>\n",
+ " <td>True</td>\n",
+ " <td>294832.815870</td>\n",
+ " <td>290338.202721</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>18146</th>\n",
+ " <td>52.435490</td>\n",
+ " <td>-3.597263</td>\n",
+ " <td>True</td>\n",
+ " <td>283187.465568</td>\n",
+ " <td>291428.293249</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>18147</th>\n",
+ " <td>52.700105</td>\n",
+ " <td>-3.375221</td>\n",
+ " <td>True</td>\n",
+ " <td>312306.356272</td>\n",
+ " <td>307081.485707</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>18148 rows × 5 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " lat long infected northing easting\n",
+ "0 54.472766 -1.654932 True 508670.060234 422359.759523\n",
+ "1 54.529717 -1.667143 True 515002.666798 421538.547038\n",
+ "2 54.512986 -1.589866 True 513167.535850 426549.874086\n",
+ "3 54.522322 -1.380694 True 514305.280055 440081.234798\n",
+ "4 54.541660 -1.613490 True 516349.132042 425003.005560\n",
+ "... ... ... ... ... ...\n",
+ "18143 52.428347 -3.322932 True 282016.338253 310060.098268\n",
+ "18144 52.415895 -3.263942 True 280559.681381 314046.146547\n",
+ "18145 52.539934 -3.617128 True 294832.815870 290338.202721\n",
+ "18146 52.435490 -3.597263 True 283187.465568 291428.293249\n",
+ "18147 52.700105 -3.375221 True 312306.356272 307081.485707\n",
+ "\n",
+ "[18148 rows x 5 columns]"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cupy_lat = cp.asarray(infected_df['lat'])\n",
+ "cupy_long = cp.asarray(infected_df['long'])\n",
+ "\n",
+ "infected_df['northing'], infected_df['easting'] = latlong2osgbgrid_cupy(cupy_lat, cupy_long)\n",
+ "infected_df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Find Clusters of Infected People"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Use DBSCAN to find clusters of at least 25 infected people where no member is more than 2000m from at least one other cluster member. Create a new column in `infected_df` which contains the cluster to which each infected person belongs."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>lat</th>\n",
+ " <th>long</th>\n",
+ " <th>infected</th>\n",
+ " <th>northing</th>\n",
+ " <th>easting</th>\n",
+ " <th>cluster</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>54.472766</td>\n",
+ " <td>-1.654932</td>\n",
+ " <td>True</td>\n",
+ " <td>508670.060234</td>\n",
+ " <td>422359.759523</td>\n",
+ " <td>-1</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>54.529717</td>\n",
+ " <td>-1.667143</td>\n",
+ " <td>True</td>\n",
+ " <td>515002.666798</td>\n",
+ " <td>421538.547038</td>\n",
+ " <td>-1</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>54.512986</td>\n",
+ " <td>-1.589866</td>\n",
+ " <td>True</td>\n",
+ " <td>513167.535850</td>\n",
+ " <td>426549.874086</td>\n",
+ " <td>-1</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>54.522322</td>\n",
+ " <td>-1.380694</td>\n",
+ " <td>True</td>\n",
+ " <td>514305.280055</td>\n",
+ " <td>440081.234798</td>\n",
+ " <td>-1</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>54.541660</td>\n",
+ " <td>-1.613490</td>\n",
+ " <td>True</td>\n",
+ " <td>516349.132042</td>\n",
+ " <td>425003.005560</td>\n",
+ " <td>-1</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>...</th>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>18143</th>\n",
+ " <td>52.428347</td>\n",
+ " <td>-3.322932</td>\n",
+ " <td>True</td>\n",
+ " <td>282016.338253</td>\n",
+ " <td>310060.098268</td>\n",
+ " <td>-1</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>18144</th>\n",
+ " <td>52.415895</td>\n",
+ " <td>-3.263942</td>\n",
+ " <td>True</td>\n",
+ " <td>280559.681381</td>\n",
+ " <td>314046.146547</td>\n",
+ " <td>-1</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>18145</th>\n",
+ " <td>52.539934</td>\n",
+ " <td>-3.617128</td>\n",
+ " <td>True</td>\n",
+ " <td>294832.815870</td>\n",
+ " <td>290338.202721</td>\n",
+ " <td>-1</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>18146</th>\n",
+ " <td>52.435490</td>\n",
+ " <td>-3.597263</td>\n",
+ " <td>True</td>\n",
+ " <td>283187.465568</td>\n",
+ " <td>291428.293249</td>\n",
+ " <td>-1</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>18147</th>\n",
+ " <td>52.700105</td>\n",
+ " <td>-3.375221</td>\n",
+ " <td>True</td>\n",
+ " <td>312306.356272</td>\n",
+ " <td>307081.485707</td>\n",
+ " <td>-1</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>18148 rows × 6 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " lat long infected northing easting cluster\n",
+ "0 54.472766 -1.654932 True 508670.060234 422359.759523 -1\n",
+ "1 54.529717 -1.667143 True 515002.666798 421538.547038 -1\n",
+ "2 54.512986 -1.589866 True 513167.535850 426549.874086 -1\n",
+ "3 54.522322 -1.380694 True 514305.280055 440081.234798 -1\n",
+ "4 54.541660 -1.613490 True 516349.132042 425003.005560 -1\n",
+ "... ... ... ... ... ... ...\n",
+ "18143 52.428347 -3.322932 True 282016.338253 310060.098268 -1\n",
+ "18144 52.415895 -3.263942 True 280559.681381 314046.146547 -1\n",
+ "18145 52.539934 -3.617128 True 294832.815870 290338.202721 -1\n",
+ "18146 52.435490 -3.597263 True 283187.465568 291428.293249 -1\n",
+ "18147 52.700105 -3.375221 True 312306.356272 307081.485707 -1\n",
+ "\n",
+ "[18148 rows x 6 columns]"
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "dbscan = cuml.DBSCAN(eps=2000, min_samples=25, metric='euclidean')\n",
+ "infected_df['cluster'] = dbscan.fit_predict(infected_df[['northing','easting']])\n",
+ "infected_df"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "15"
+ ]
+ },
+ "execution_count": 17,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "infected_df['cluster'].nunique()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Find the Centroid of Each Cluster"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Use grouping to find the mean `northing` and `easting` values for each cluster identified above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>northing</th>\n",
+ " <th>easting</th>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>cluster</th>\n",
+ " <th></th>\n",
+ " <th></th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>-1</th>\n",
+ " <td>378085.504251</td>\n",
+ " <td>401877.070477</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>397661.052147</td>\n",
+ " <td>371410.022807</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>436475.467158</td>\n",
+ " <td>332980.455514</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>347062.237166</td>\n",
+ " <td>389386.821165</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>359668.638420</td>\n",
+ " <td>379638.020073</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>391630.079963</td>\n",
+ " <td>431158.142881</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>386471.292123</td>\n",
+ " <td>426559.091880</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>6</th>\n",
+ " <td>434970.334950</td>\n",
+ " <td>406985.282976</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>7</th>\n",
+ " <td>412772.647531</td>\n",
+ " <td>410069.665645</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>8</th>\n",
+ " <td>415807.314112</td>\n",
+ " <td>414765.634582</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>9</th>\n",
+ " <td>417322.517251</td>\n",
+ " <td>409583.740733</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>10</th>\n",
+ " <td>334208.230907</td>\n",
+ " <td>435937.780795</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>11</th>\n",
+ " <td>300567.933051</td>\n",
+ " <td>391901.512758</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>12</th>\n",
+ " <td>291539.411185</td>\n",
+ " <td>401640.667572</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>13</th>\n",
+ " <td>289854.874937</td>\n",
+ " <td>394518.294994</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " northing easting\n",
+ "cluster \n",
+ "-1 378085.504251 401877.070477\n",
+ " 0 397661.052147 371410.022807\n",
+ " 1 436475.467158 332980.455514\n",
+ " 2 347062.237166 389386.821165\n",
+ " 3 359668.638420 379638.020073\n",
+ " 4 391630.079963 431158.142881\n",
+ " 5 386471.292123 426559.091880\n",
+ " 6 434970.334950 406985.282976\n",
+ " 7 412772.647531 410069.665645\n",
+ " 8 415807.314112 414765.634582\n",
+ " 9 417322.517251 409583.740733\n",
+ " 10 334208.230907 435937.780795\n",
+ " 11 300567.933051 391901.512758\n",
+ " 12 291539.411185 401640.667572\n",
+ " 13 289854.874937 394518.294994"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "centroids_df = infected_df[['northing', 'easting', 'cluster']].groupby('cluster').mean()\n",
+ "centroids_df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Find the number of people in each cluster by counting the number of appearances of each cluster's label in the column produced by DBSCAN."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "cluster\n",
+ " 0 8638\n",
+ "-1 8449\n",
+ " 2 403\n",
+ " 8 94\n",
+ " 12 72\n",
+ " 13 71\n",
+ " 1 68\n",
+ " 11 68\n",
+ " 4 66\n",
+ " 10 64\n",
+ " 5 43\n",
+ " 7 39\n",
+ " 6 27\n",
+ " 3 25\n",
+ " 9 21\n",
+ "Name: count, dtype: int64"
+ ]
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "infected_df['cluster'].value_counts()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "northing 397661.052147\n",
+ "easting 371410.022807\n",
+ "Name: 0, dtype: float64"
+ ]
+ },
+ "execution_count": 29,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "largest_cluster = infected_df['cluster'].value_counts().idxmax()\n",
+ "largest_centroid = centroids_df.loc[largest_cluster]\n",
+ "largest_centroid"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Find the Centroid of the Cluster with the Most Members ##"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Use the label of the cluster with the most people to filter `centroids_df`, and write the answer to `my_assessment/question_1.json`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/opt/conda/lib/python3.10/site-packages/cudf/io/json.py:194: UserWarning: Using CPU via Pandas to write JSON dataset\n",
+ " warnings.warn(\"Using CPU via Pandas to write JSON dataset\")\n"
+ ]
+ }
+ ],
+ "source": [
+ "centroids_df.loc[infected_df['cluster'].value_counts().idxmax()].to_json('my_assessment/question_1.json')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Check Submission ##"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{\"northing\":397661.0521473523,\"easting\":371410.0228066591}"
+ ]
+ }
+ ],
+ "source": [
+ "!cat my_assessment/question_1.json"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Tip**: Your submission file should contain one line of text, similar to: \n",
+ "\n",
+ "```\n",
+ "{\"northing\":XXX.XX,\"easting\":XXX.XX}\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<div align=\"center\"><h2>Please Restart the Kernel</h2></div>"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'status': 'ok', 'restart': True}"
+ ]
+ },
+ "execution_count": 31,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import IPython\n",
+ "app = IPython.Application.instance()\n",
+ "app.kernel.do_shutdown(True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<a href=\"https://www.nvidia.com/dli\"><img src=\"images/DLI_Header.png\" alt=\"Header\" style=\"width: 400px;\"/></a>"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.15"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}