path: root/Fundamentals_of_Accelerated_Data_Science/3-03_dbscan.ipynb
author	leshe4ka46 <alex9102naid1@ya.ru>	2025-10-18 12:25:53 +0300
committer	leshe4ka46 <alex9102naid1@ya.ru>	2025-10-18 12:25:53 +0300
commit	910a222fa60ce6ea0831f2956470b8a0b9f62670 (patch)
tree	1d6bbccafb667731ad127f93390761100fc11b53 /Fundamentals_of_Accelerated_Data_Science/3-03_dbscan.ipynb
parent	35b9040e4104b0e79bf243a2c9769c589f96e2c4 (diff)
nvidia2
Diffstat (limited to 'Fundamentals_of_Accelerated_Data_Science/3-03_dbscan.ipynb')
-rw-r--r--	Fundamentals_of_Accelerated_Data_Science/3-03_dbscan.ipynb	241
1 file changed, 241 insertions(+), 0 deletions(-)
diff --git a/Fundamentals_of_Accelerated_Data_Science/3-03_dbscan.ipynb b/Fundamentals_of_Accelerated_Data_Science/3-03_dbscan.ipynb
new file mode 100644
index 0000000..e27f31e
--- /dev/null
+++ b/Fundamentals_of_Accelerated_Data_Science/3-03_dbscan.ipynb
@@ -0,0 +1,241 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<img src=\"./images/DLI_Header.png\" width=400/>"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Fundamentals of Accelerated Data Science # "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 03 - DBSCAN ##\n",
+ "\n",
+ "**Table of Contents**\n",
+ "<br>\n",
+    "This notebook uses GPU-accelerated DBSCAN to identify clusters of infected people. It covers the following sections:\n",
+ "1. [Environment](#Environment)\n",
+ "2. [Load Data](#Load-Data)\n",
+ "3. [DBSCAN Clustering](#DBSCAN-Clustering)\n",
+ " * [Exercise #1 - Make Another DBSCAN Instance](#Exercise-#1---Make-Another-DBSCAN-Instance)\n",
+ "4. [Visualize the Clusters](#Visualize-the-Clusters)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Environment ##"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import cudf\n",
+ "import cuml\n",
+ "\n",
+ "import cuxfilter as cxf"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Load Data ##\n",
+ "For this notebook, we again load a subset of our population data with only the columns we need. An `infected` column has been added to the data to indicate whether or not a person is known to be infected with our simulated virus."
+ ]
+ },
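cuDF's `read_csv` mirrors the pandas API, so the typed load below can be sketched on CPU. This is a minimal illustration with a made-up two-row in-memory CSV standing in for `pop_sample.csv` (which is not reproduced here); note that pandas takes `dtype` as a mapping, whereas cuDF additionally accepts a positional list as used in the cell below.

```python
import io

import pandas as pd

# Hypothetical stand-in for the course's pop_sample.csv
csv = io.StringIO(
    "northing,easting,infected\n"
    "180000.0,530000.0,0.0\n"
    "180500.0,530250.0,1.0\n"
)

# Explicit dtypes avoid the default float64 and halve memory use
df = pd.read_csv(
    csv,
    dtype={"northing": "float32", "easting": "float32", "infected": "float32"},
)
print(df.dtypes)  # all three columns load as float32
print(df.shape)   # (2, 3)
```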
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "gdf = cudf.read_csv('./data/pop_sample.csv', dtype=['float32', 'float32', 'float32'])\n",
+ "print(gdf.dtypes)\n",
+ "gdf.shape"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "gdf.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "gdf['infected'].value_counts()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## DBSCAN Clustering ##\n",
+ "DBSCAN is another unsupervised clustering algorithm that is particularly effective when the number of clusters is not known up front and the clusters may have concave or other unusual shapes--a situation that often applies in geospatial analytics.\n",
+ "\n",
+ "In this series of exercises you will use DBSCAN to identify clusters of infected people by location, which may help us identify groups becoming infected from common patient zeroes and assist in response planning.\n",
+ "\n",
+    "Create a DBSCAN instance with `cuml.DBSCAN`, setting the named argument `eps` (the maximum distance between two points for one to be counted in the other's neighborhood) to `5000`. Since the `northing` and `easting` values we created are measured in meters, this lets us identify clusters of infected people in which individuals may be separated from the rest of the cluster by up to 5 kilometers.\n",
+ "\n",
+    "Below we fit DBSCAN. We start by creating a new dataframe, `infected_df`, from the rows of the original dataframe where `infected` is `1` (true)--be sure to reset the dataframe's index afterward. We then use `dbscan.fit_predict` to cluster on the `northing` and `easting` columns of `infected_df`, store the resulting series as a new \"cluster\" column in `infected_df`, and finally count the clusters DBSCAN identified. Note that DBSCAN labels noise points `-1`, so `nunique` includes the noise label in that count."
+ ]
+ },
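cuML's `DBSCAN` follows the scikit-learn API, so the effect of `eps` can be sketched on CPU with scikit-learn and a handful of made-up coordinates (not the course data). With a small `eps` the two groups below stay separate; with a larger `eps` they merge, and the isolated point is labeled `-1` (noise) either way:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of hypothetical coordinates (meters), plus one far-away point
pts = np.array([
    [0.0, 0.0], [100.0, 0.0], [0.0, 100.0],  # group A
    [1000.0, 1000.0], [1100.0, 1000.0],      # group B, ~1.4 km from group A
    [50000.0, 50000.0],                      # isolated point
])

for eps in (1000, 5000):
    labels = DBSCAN(eps=eps, min_samples=2).fit_predict(pts)
    print(eps, labels.tolist())
# eps=1000 -> two clusters plus noise: [0, 0, 0, 1, 1, -1]
# eps=5000 -> the groups merge:        [0, 0, 0, 0, 0, -1]
```

A `nunique`-style count of the `eps=5000` labels returns 2, even though only one real cluster was found, because the `-1` noise label is counted too.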
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dbscan = cuml.DBSCAN(eps=5000)\n",
+ "# dbscan = cuml.DBSCAN(eps=10000)\n",
+ "\n",
+ "infected_df = gdf[gdf['infected'] == 1].reset_index()\n",
+ "infected_df['cluster'] = dbscan.fit_predict(infected_df[['northing', 'easting']])\n",
+ "infected_df['cluster'].nunique()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Exercise #1 - Make Another DBSCAN Instance ###\n",
+ "\n",
+ "**Instructions**: <br>\n",
+    "* Modify the `<FIXME>` in the first cell below and execute it to instantiate a DBSCAN instance with `10000` for `eps`.\n",
+    "* Modify the `<FIXME>` in the second cell below and execute it to fit the data and identify the infected clusters. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+    "dbscan = cuml.DBSCAN(<FIXME>)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "infected_df = gdf[gdf['infected'] == 1].reset_index()\n",
+    "infected_df['cluster'] = dbscan.<FIXME>(infected_df[['northing', 'easting']])\n",
+ "infected_df['cluster'].nunique()"
+ ]
+ },
+ {
+ "cell_type": "raw",
+ "metadata": {
+ "jupyter": {
+ "source_hidden": true
+ }
+ },
+ "source": [
+ "\n",
+ "dbscan = cuml.DBSCAN(eps=10000)\n",
+ "\n",
+ "infected_df = gdf[gdf['infected'] == 1].reset_index()\n",
+ "infected_df['cluster'] = dbscan.fit_predict(infected_df[['northing', 'easting']])\n",
+ "infected_df['cluster'].nunique()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Click ... for solution. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## Visualize the Clusters ##"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Because we have the same column names as in the K-means example--`easting`, `northing`, and `cluster`--we can use the same code to visualize the clusters."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "infected_df.to_pandas().plot(kind='scatter', x='easting', y='northing', c='cluster')"
+ ]
+ },
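One caveat with the single-colormap scatter above: the `-1` noise label is mapped onto the same color scale as the real clusters. A sketch of one way to make noise visually distinct, using synthetic blobs rather than the course's `infected_df` (all names and values here are made up for illustration):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic stand-in for infected_df: two blobs plus scattered noise points
easting = np.concatenate(
    [rng.normal(0, 1, 50), rng.normal(10, 1, 50), rng.uniform(-5, 15, 10)]
)
northing = np.concatenate(
    [rng.normal(0, 1, 50), rng.normal(10, 1, 50), rng.uniform(-5, 15, 10)]
)
cluster = np.concatenate([np.zeros(50, int), np.ones(50, int), np.full(10, -1)])

fig, ax = plt.subplots()
noise = cluster == -1
# Real clusters get the colormap; noise gets a fixed gray 'x' marker
ax.scatter(easting[~noise], northing[~noise], c=cluster[~noise], cmap="viridis", s=10)
ax.scatter(easting[noise], northing[noise], c="lightgray", marker="x", s=20,
           label="noise (-1)")
ax.set_xlabel("easting")
ax.set_ylabel("northing")
ax.legend()
fig.savefig("dbscan_clusters.png")
```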
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import IPython\n",
+ "app = IPython.Application.instance()\n",
+ "app.kernel.do_shutdown(True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Well Done!** Let's move to the [next notebook](3-04_logistic_regression.ipynb). "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<img src=\"./images/DLI_Header.png\" width=400/>"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.15"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}