nvidia2

author: leshe4ka46 <alex9102naid1@ya.ru> 2025-10-18 12:25:53 +0300
committer: leshe4ka46 <alex9102naid1@ya.ru> 2025-10-18 12:25:53 +0300
commit: 910a222fa60ce6ea0831f2956470b8a0b9f62670 (patch)
tree: 1d6bbccafb667731ad127f93390761100fc11b53 /Fundamentals_of_Accelerated_Data_Science/4-02_find_infected.ipynb
parent: 35b9040e4104b0e79bf243a2c9769c589f96e2c4 (diff)
1 files changed, 1007 insertions, 0 deletions
diff --git a/Fundamentals_of_Accelerated_Data_Science/4-02_find_infected.ipynb b/Fundamentals_of_Accelerated_Data_Science/4-02_find_infected.ipynb
new file mode 100644
index 0000000..9b857e2
--- /dev/null
+++ b/Fundamentals_of_Accelerated_Data_Science/4-02_find_infected.ipynb
@@ -0,0 +1,1007 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a href=\"https://www.nvidia.com/dli\"><img src=\"images/DLI_Header.png\" alt=\"Header\" style=\"width: 400px;\"/></a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Week 1: Find Clusters of Infected People"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<span style=\"color:red\">\n",
+    "**URGENT WARNING**\n",
+    "\n",
+    "We have been receiving reports from health facilities that a new, fast-spreading virus has been discovered in the population. To prepare our response, we need to understand the geospatial distribution of those who have been infected. Find out whether there are identifiable clusters of infected individuals and where they are.    \n",
+    "</span>\n",
+    "\n",
+    "Your goal for this notebook will be to estimate the location of dense geographic clusters of infected people using incoming data from week 1 of the simulated epidemic."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Imports"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%load_ext cudf.pandas\n",
+    "import pandas as pd\n",
+    "import cuml\n",
+    "\n",
+    "import cupy as cp"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Load Data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Begin by loading the data you've received about week 1 of the outbreak into a cuDF-accelerated pandas DataFrame. The data is located at `'./data/week1.csv'`. For this notebook you will only need the `'lat'`, `'long'`, and `'infected'` columns. Either drop the columns after loading, or use the `pd.read_csv` named argument `usecols` to provide a list of only the columns you need."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>lat</th>\n",
+       "      <th>long</th>\n",
+       "      <th>infected</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>54.522510</td>\n",
+       "      <td>-1.571896</td>\n",
+       "      <td>False</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>54.554030</td>\n",
+       "      <td>-1.524968</td>\n",
+       "      <td>False</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>54.552486</td>\n",
+       "      <td>-1.435203</td>\n",
+       "      <td>False</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>54.537189</td>\n",
+       "      <td>-1.566215</td>\n",
+       "      <td>False</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>54.528212</td>\n",
+       "      <td>-1.588462</td>\n",
+       "      <td>False</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>...</th>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>58479889</th>\n",
+       "      <td>51.634416</td>\n",
+       "      <td>-2.925863</td>\n",
+       "      <td>False</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>58479890</th>\n",
+       "      <td>51.556972</td>\n",
+       "      <td>-3.036290</td>\n",
+       "      <td>False</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>58479891</th>\n",
+       "      <td>51.588992</td>\n",
+       "      <td>-2.921915</td>\n",
+       "      <td>False</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>58479892</th>\n",
+       "      <td>51.590974</td>\n",
+       "      <td>-2.954539</td>\n",
+       "      <td>False</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>58479893</th>\n",
+       "      <td>51.576716</td>\n",
+       "      <td>-2.952142</td>\n",
+       "      <td>False</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "<p>58479894 rows × 3 columns</p>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                lat      long  infected\n",
+       "0         54.522510 -1.571896     False\n",
+       "1         54.554030 -1.524968     False\n",
+       "2         54.552486 -1.435203     False\n",
+       "3         54.537189 -1.566215     False\n",
+       "4         54.528212 -1.588462     False\n",
+       "...             ...       ...       ...\n",
+       "58479889  51.634416 -2.925863     False\n",
+       "58479890  51.556972 -3.036290     False\n",
+       "58479891  51.588992 -2.921915     False\n",
+       "58479892  51.590974 -2.954539     False\n",
+       "58479893  51.576716 -2.952142     False\n",
+       "\n",
+       "[58479894 rows x 3 columns]"
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df = pd.read_csv('./data/week1.csv', usecols = [\"lat\", \"long\", \"infected\"])\n",
+    "df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Make Data Frame of the Infected"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Make a new DataFrame `infected_df` that contains only the infected members of the population.\n",
+    "\n",
+    "**Tip**: Reset the index of `infected_df` with `.reset_index(drop=True)`. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "infected_df = df[df[\"infected\"]]\n",
+    "infected_df = infected_df.reset_index(drop=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Make Grid Coordinates for Infected Locations"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Provided for you in the next cell (which you can expand by clicking on the \"...\" and contract again after executing by clicking on the blue left border of the cell) is the lat/long to OSGB36 grid coordinates converter you used earlier in the workshop. Use this converter to create grid coordinate values stored in `northing` and `easting` columns of the `infected_df` you created in the last step."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# https://www.ordnancesurvey.co.uk/docs/support/guide-coordinate-systems-great-britain.pdf\n",
+    "\n",
+    "def latlong2osgbgrid_cupy(lat, long, input_degrees=True):\n",
+    "    '''\n",
+    "    Converts latitude and longitude (ellipsoidal) coordinates into northing and easting (grid) coordinates, using a Transverse Mercator projection.\n",
+    "    \n",
+    "    Inputs:\n",
+    "    lat: latitude coordinate (N)\n",
+    "    long: longitude coordinate (E)\n",
+    "    input_degrees: if True (default), interprets the coordinates as degrees; otherwise, interprets coordinates as radians\n",
+    "    \n",
+    "    Output:\n",
+    "    (northing, easting)\n",
+    "    '''\n",
+    "    \n",
+    "    if input_degrees:\n",
+    "        lat = lat * cp.pi/180\n",
+    "        long = long * cp.pi/180\n",
+    "\n",
+    "    a = 6377563.396\n",
+    "    b = 6356256.909\n",
+    "    e2 = (a**2 - b**2) / a**2\n",
+    "\n",
+    "    N0 = -100000 # northing of true origin\n",
+    "    E0 = 400000 # easting of true origin\n",
+    "    F0 = .9996012717 # scale factor on central meridian\n",
+    "    phi0 = 49 * cp.pi / 180 # latitude of true origin\n",
+    "    lambda0 = -2 * cp.pi / 180 # longitude of true origin and central meridian\n",
+    "    \n",
+    "    sinlat = cp.sin(lat)\n",
+    "    coslat = cp.cos(lat)\n",
+    "    tanlat = cp.tan(lat)\n",
+    "    \n",
+    "    latdiff = lat-phi0\n",
+    "    longdiff = long-lambda0\n",
+    "\n",
+    "    n = (a-b) / (a+b)\n",
+    "    nu = a * F0 * (1 - e2 * sinlat ** 2) ** -.5\n",
+    "    rho = a * F0 * (1 - e2) * (1 - e2 * sinlat ** 2) ** -1.5\n",
+    "    eta2 = nu / rho - 1\n",
+    "    M = b * F0 * ((1 + n + 5/4 * (n**2 + n**3)) * latdiff - \n",
+    "                  (3*(n+n**2) + 21/8 * n**3) * cp.sin(latdiff) * cp.cos(lat+phi0) +\n",
+    "                  15/8 * (n**2 + n**3) * cp.sin(2*(latdiff)) * cp.cos(2*(lat+phi0)) - \n",
+    "                  35/24 * n**3 * cp.sin(3*(latdiff)) * cp.cos(3*(lat+phi0)))\n",
+    "    I = M + N0\n",
+    "    II = nu/2 * sinlat * coslat\n",
+    "    III = nu/24 * sinlat * coslat ** 3 * (5 - tanlat ** 2 + 9 * eta2)\n",
+    "    IIIA = nu/720 * sinlat * coslat ** 5 * (61-58 * tanlat**2 + tanlat**4)\n",
+    "    IV = nu * coslat\n",
+    "    V = nu / 6 * coslat**3 * (nu/rho - cp.tan(lat)**2)\n",
+    "    VI = nu / 120 * coslat ** 5 * (5 - 18 * tanlat**2 + tanlat**4 + 14 * eta2 - 58 * tanlat**2 * eta2)\n",
+    "\n",
+    "    northing = I + II * longdiff**2 + III * longdiff**4 + IIIA * longdiff**6\n",
+    "    easting = E0 + IV * longdiff + V * longdiff**3 + VI * longdiff**5\n",
+    "\n",
+    "    return(northing, easting)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>lat</th>\n",
+       "      <th>long</th>\n",
+       "      <th>infected</th>\n",
+       "      <th>northing</th>\n",
+       "      <th>easting</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>54.472766</td>\n",
+       "      <td>-1.654932</td>\n",
+       "      <td>True</td>\n",
+       "      <td>508670.060234</td>\n",
+       "      <td>422359.759523</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>54.529717</td>\n",
+       "      <td>-1.667143</td>\n",
+       "      <td>True</td>\n",
+       "      <td>515002.666798</td>\n",
+       "      <td>421538.547038</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>54.512986</td>\n",
+       "      <td>-1.589866</td>\n",
+       "      <td>True</td>\n",
+       "      <td>513167.535850</td>\n",
+       "      <td>426549.874086</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>54.522322</td>\n",
+       "      <td>-1.380694</td>\n",
+       "      <td>True</td>\n",
+       "      <td>514305.280055</td>\n",
+       "      <td>440081.234798</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>54.541660</td>\n",
+       "      <td>-1.613490</td>\n",
+       "      <td>True</td>\n",
+       "      <td>516349.132042</td>\n",
+       "      <td>425003.005560</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>...</th>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>18143</th>\n",
+       "      <td>52.428347</td>\n",
+       "      <td>-3.322932</td>\n",
+       "      <td>True</td>\n",
+       "      <td>282016.338253</td>\n",
+       "      <td>310060.098268</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>18144</th>\n",
+       "      <td>52.415895</td>\n",
+       "      <td>-3.263942</td>\n",
+       "      <td>True</td>\n",
+       "      <td>280559.681381</td>\n",
+       "      <td>314046.146547</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>18145</th>\n",
+       "      <td>52.539934</td>\n",
+       "      <td>-3.617128</td>\n",
+       "      <td>True</td>\n",
+       "      <td>294832.815870</td>\n",
+       "      <td>290338.202721</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>18146</th>\n",
+       "      <td>52.435490</td>\n",
+       "      <td>-3.597263</td>\n",
+       "      <td>True</td>\n",
+       "      <td>283187.465568</td>\n",
+       "      <td>291428.293249</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>18147</th>\n",
+       "      <td>52.700105</td>\n",
+       "      <td>-3.375221</td>\n",
+       "      <td>True</td>\n",
+       "      <td>312306.356272</td>\n",
+       "      <td>307081.485707</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "<p>18148 rows × 5 columns</p>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "             lat      long  infected       northing        easting\n",
+       "0      54.472766 -1.654932      True  508670.060234  422359.759523\n",
+       "1      54.529717 -1.667143      True  515002.666798  421538.547038\n",
+       "2      54.512986 -1.589866      True  513167.535850  426549.874086\n",
+       "3      54.522322 -1.380694      True  514305.280055  440081.234798\n",
+       "4      54.541660 -1.613490      True  516349.132042  425003.005560\n",
+       "...          ...       ...       ...            ...            ...\n",
+       "18143  52.428347 -3.322932      True  282016.338253  310060.098268\n",
+       "18144  52.415895 -3.263942      True  280559.681381  314046.146547\n",
+       "18145  52.539934 -3.617128      True  294832.815870  290338.202721\n",
+       "18146  52.435490 -3.597263      True  283187.465568  291428.293249\n",
+       "18147  52.700105 -3.375221      True  312306.356272  307081.485707\n",
+       "\n",
+       "[18148 rows x 5 columns]"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "cupy_lat = cp.asarray(infected_df['lat'])\n",
+    "cupy_long = cp.asarray(infected_df['long'])\n",
+    "\n",
+    "infected_df['northing'], infected_df['easting'] = latlong2osgbgrid_cupy(cupy_lat, cupy_long)\n",
+    "infected_df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Find Clusters of Infected People"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Use DBSCAN to find clusters of at least 25 infected people where no member is more than 2000m from at least one other cluster member. Create a new column in `infected_df` which contains the cluster to which each infected person belongs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>lat</th>\n",
+       "      <th>long</th>\n",
+       "      <th>infected</th>\n",
+       "      <th>northing</th>\n",
+       "      <th>easting</th>\n",
+       "      <th>cluster</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>54.472766</td>\n",
+       "      <td>-1.654932</td>\n",
+       "      <td>True</td>\n",
+       "      <td>508670.060234</td>\n",
+       "      <td>422359.759523</td>\n",
+       "      <td>-1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>54.529717</td>\n",
+       "      <td>-1.667143</td>\n",
+       "      <td>True</td>\n",
+       "      <td>515002.666798</td>\n",
+       "      <td>421538.547038</td>\n",
+       "      <td>-1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>54.512986</td>\n",
+       "      <td>-1.589866</td>\n",
+       "      <td>True</td>\n",
+       "      <td>513167.535850</td>\n",
+       "      <td>426549.874086</td>\n",
+       "      <td>-1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>54.522322</td>\n",
+       "      <td>-1.380694</td>\n",
+       "      <td>True</td>\n",
+       "      <td>514305.280055</td>\n",
+       "      <td>440081.234798</td>\n",
+       "      <td>-1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>54.541660</td>\n",
+       "      <td>-1.613490</td>\n",
+       "      <td>True</td>\n",
+       "      <td>516349.132042</td>\n",
+       "      <td>425003.005560</td>\n",
+       "      <td>-1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>...</th>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>18143</th>\n",
+       "      <td>52.428347</td>\n",
+       "      <td>-3.322932</td>\n",
+       "      <td>True</td>\n",
+       "      <td>282016.338253</td>\n",
+       "      <td>310060.098268</td>\n",
+       "      <td>-1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>18144</th>\n",
+       "      <td>52.415895</td>\n",
+       "      <td>-3.263942</td>\n",
+       "      <td>True</td>\n",
+       "      <td>280559.681381</td>\n",
+       "      <td>314046.146547</td>\n",
+       "      <td>-1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>18145</th>\n",
+       "      <td>52.539934</td>\n",
+       "      <td>-3.617128</td>\n",
+       "      <td>True</td>\n",
+       "      <td>294832.815870</td>\n",
+       "      <td>290338.202721</td>\n",
+       "      <td>-1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>18146</th>\n",
+       "      <td>52.435490</td>\n",
+       "      <td>-3.597263</td>\n",
+       "      <td>True</td>\n",
+       "      <td>283187.465568</td>\n",
+       "      <td>291428.293249</td>\n",
+       "      <td>-1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>18147</th>\n",
+       "      <td>52.700105</td>\n",
+       "      <td>-3.375221</td>\n",
+       "      <td>True</td>\n",
+       "      <td>312306.356272</td>\n",
+       "      <td>307081.485707</td>\n",
+       "      <td>-1</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "<p>18148 rows × 6 columns</p>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "             lat      long  infected       northing        easting  cluster\n",
+       "0      54.472766 -1.654932      True  508670.060234  422359.759523       -1\n",
+       "1      54.529717 -1.667143      True  515002.666798  421538.547038       -1\n",
+       "2      54.512986 -1.589866      True  513167.535850  426549.874086       -1\n",
+       "3      54.522322 -1.380694      True  514305.280055  440081.234798       -1\n",
+       "4      54.541660 -1.613490      True  516349.132042  425003.005560       -1\n",
+       "...          ...       ...       ...            ...            ...      ...\n",
+       "18143  52.428347 -3.322932      True  282016.338253  310060.098268       -1\n",
+       "18144  52.415895 -3.263942      True  280559.681381  314046.146547       -1\n",
+       "18145  52.539934 -3.617128      True  294832.815870  290338.202721       -1\n",
+       "18146  52.435490 -3.597263      True  283187.465568  291428.293249       -1\n",
+       "18147  52.700105 -3.375221      True  312306.356272  307081.485707       -1\n",
+       "\n",
+       "[18148 rows x 6 columns]"
+      ]
+     },
+     "execution_count": 16,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "dbscan = cuml.DBSCAN(eps = 2000, min_samples=25, metric='euclidean')\n",
+    "infected_df['cluster'] = dbscan.fit_predict(infected_df[['northing','easting']])\n",
+    "infected_df"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "15"
+      ]
+     },
+     "execution_count": 17,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "infected_df['cluster'].nunique()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Find the Centroid of Each Cluster"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Use grouping to find the mean `northing` and `easting` values for each cluster identified above."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>northing</th>\n",
+       "      <th>easting</th>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>cluster</th>\n",
+       "      <th></th>\n",
+       "      <th></th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>-1</th>\n",
+       "      <td>378085.504251</td>\n",
+       "      <td>401877.070477</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>397661.052147</td>\n",
+       "      <td>371410.022807</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>436475.467158</td>\n",
+       "      <td>332980.455514</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>347062.237166</td>\n",
+       "      <td>389386.821165</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>359668.638420</td>\n",
+       "      <td>379638.020073</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>391630.079963</td>\n",
+       "      <td>431158.142881</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>5</th>\n",
+       "      <td>386471.292123</td>\n",
+       "      <td>426559.091880</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>6</th>\n",
+       "      <td>434970.334950</td>\n",
+       "      <td>406985.282976</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>7</th>\n",
+       "      <td>412772.647531</td>\n",
+       "      <td>410069.665645</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>8</th>\n",
+       "      <td>415807.314112</td>\n",
+       "      <td>414765.634582</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>9</th>\n",
+       "      <td>417322.517251</td>\n",
+       "      <td>409583.740733</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>10</th>\n",
+       "      <td>334208.230907</td>\n",
+       "      <td>435937.780795</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>11</th>\n",
+       "      <td>300567.933051</td>\n",
+       "      <td>391901.512758</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>12</th>\n",
+       "      <td>291539.411185</td>\n",
+       "      <td>401640.667572</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>13</th>\n",
+       "      <td>289854.874937</td>\n",
+       "      <td>394518.294994</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "              northing        easting\n",
+       "cluster                              \n",
+       "-1       378085.504251  401877.070477\n",
+       " 0       397661.052147  371410.022807\n",
+       " 1       436475.467158  332980.455514\n",
+       " 2       347062.237166  389386.821165\n",
+       " 3       359668.638420  379638.020073\n",
+       " 4       391630.079963  431158.142881\n",
+       " 5       386471.292123  426559.091880\n",
+       " 6       434970.334950  406985.282976\n",
+       " 7       412772.647531  410069.665645\n",
+       " 8       415807.314112  414765.634582\n",
+       " 9       417322.517251  409583.740733\n",
+       " 10      334208.230907  435937.780795\n",
+       " 11      300567.933051  391901.512758\n",
+       " 12      291539.411185  401640.667572\n",
+       " 13      289854.874937  394518.294994"
+      ]
+     },
+     "execution_count": 18,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "centroids_df = infected_df[['northing', 'easting', 'cluster']].groupby('cluster').mean()\n",
+    "centroids_df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Find the number of people in each cluster by counting the number of appearances of each cluster's label in the column produced by DBSCAN."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "cluster\n",
+       " 0     8638\n",
+       "-1     8449\n",
+       " 2      403\n",
+       " 8       94\n",
+       " 12      72\n",
+       " 13      71\n",
+       " 1       68\n",
+       " 11      68\n",
+       " 4       66\n",
+       " 10      64\n",
+       " 5       43\n",
+       " 7       39\n",
+       " 6       27\n",
+       " 3       25\n",
+       " 9       21\n",
+       "Name: count, dtype: int64"
+      ]
+     },
+     "execution_count": 19,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "infected_df['cluster'].value_counts()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 29,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "northing    397661.052147\n",
+       "easting     371410.022807\n",
+       "Name: 0, dtype: float64"
+      ]
+     },
+     "execution_count": 29,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
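+    "# Skip the DBSCAN noise label (-1) when picking the most populous cluster\n",
+    "cluster_sizes = infected_df['cluster'].value_counts().drop(-1, errors='ignore')\n",
+    "largest_centroid = centroids_df.loc[cluster_sizes.idxmax()]\n",
+    "largest_centroid"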
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Find the Centroid of the Cluster with the Most Members ##"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Use the label of the cluster with the most people to filter `centroids_df` and write the answer to `my_assessment/question_1.json`. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/opt/conda/lib/python3.10/site-packages/cudf/io/json.py:194: UserWarning: Using CPU via Pandas to write JSON dataset\n",
+      "  warnings.warn(\"Using CPU via Pandas to write JSON dataset\")\n"
+     ]
+    }
+   ],
+   "source": [
+    "cluster_sizes = infected_df['cluster'].value_counts().drop(-1, errors='ignore')  # ignore DBSCAN noise\n",
+    "centroids_df.loc[cluster_sizes.idxmax()].to_json('my_assessment/question_1.json')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Check Submission ##"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 30,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{\"northing\":397661.0521473523,\"easting\":371410.0228066591}"
+     ]
+    }
+   ],
+   "source": [
+    "!cat my_assessment/question_1.json"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Tip**: Your submission file should contain one line of text, similar to: \n",
+    "\n",
+    "```\n",
+    "{\"northing\":XXX.XX,\"easting\":XXX.XX}\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<div align=\"center\"><h2>Please Restart the Kernel</h2></div>"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 31,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'status': 'ok', 'restart': True}"
+      ]
+     },
+     "execution_count": 31,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import IPython\n",
+    "app = IPython.Application.instance()\n",
+    "app.kernel.do_shutdown(True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a href=\"https://www.nvidia.com/dli\"><img src=\"images/DLI_Header.png\" alt=\"Header\" style=\"width: 400px;\"/></a>"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.15"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
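
The clustering and centroid steps in the notebook above can be sketched on the CPU with only the Python standard library. This is a hedged illustration, not the cuML implementation: the `dbscan` and `largest_cluster_centroid` helpers and the toy coordinates are hypothetical, the neighbour search is brute force, and each point is assumed to count as its own neighbour when compared against `min_samples`. The toy data is built so that noise points outnumber every real cluster, which is the case where picking the most frequent label without excluding `-1` would return the "centroid" of the noise.

```python
from collections import Counter, deque
from math import hypot

NOISE = -1  # label used for points that do not belong to any cluster


def dbscan(points, eps, min_samples):
    """Label each (northing, easting) point with a cluster id, or NOISE."""
    n = len(points)
    # Brute-force neighbourhoods; every point is within eps of itself.
    neighbours = [
        [j for j in range(n)
         if hypot(points[i][0] - points[j][0], points[i][1] - points[j][1]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    next_label = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE  # may still become a border point later
            continue
        labels[i] = next_label
        queue = deque(neighbours[i])
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = next_label  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = next_label
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point, keep expanding
        next_label += 1
    return labels


def largest_cluster_centroid(points, labels):
    """Mean (northing, easting) of the most populous cluster, ignoring noise."""
    sizes = Counter(label for label in labels if label != NOISE)
    if not sizes:
        return None  # everything was noise
    biggest = sizes.most_common(1)[0][0]
    members = [p for p, label in zip(points, labels) if label == biggest]
    return (sum(p[0] for p in members) / len(members),
            sum(p[1] for p in members) / len(members))


# Toy grid coordinates in metres (made up for illustration).
points = [
    (0, 0), (0, 1000), (1000, 0), (1000, 1000),          # dense group of 4
    (10000, 10000), (10000, 11000), (11000, 10000),      # dense group of 3
    (50000, 0), (0, 50000), (50000, 50000),              # isolated points
    (-50000, 0), (0, -50000),
]
labels = dbscan(points, eps=2000, min_samples=3)
centroid = largest_cluster_centroid(points, labels)
print(labels)
print(centroid)
```

In this toy run the five isolated points all receive the noise label, so noise is the most frequent label even though the largest real cluster has only four members. Filtering out `-1` before taking the maximum keeps the reported centroid tied to an actual cluster.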