path: root/Fundamentals_of_Accelerated_Data_Science/3-03_dbscan.ipynb
author	leshe4ka46 <alex9102naid1@ya.ru>	2025-10-18 12:25:53 +0300
committer	leshe4ka46 <alex9102naid1@ya.ru>	2025-10-18 12:25:53 +0300
commit	910a222fa60ce6ea0831f2956470b8a0b9f62670 (patch)
tree	1d6bbccafb667731ad127f93390761100fc11b53 /Fundamentals_of_Accelerated_Data_Science/3-03_dbscan.ipynb
parent	35b9040e4104b0e79bf243a2c9769c589f96e2c4 (diff)
nvidia2
Diffstat (limited to 'Fundamentals_of_Accelerated_Data_Science/3-03_dbscan.ipynb')
-rw-r--r--	Fundamentals_of_Accelerated_Data_Science/3-03_dbscan.ipynb	241
1 file changed, 241 insertions(+), 0 deletions(-)
diff --git a/Fundamentals_of_Accelerated_Data_Science/3-03_dbscan.ipynb b/Fundamentals_of_Accelerated_Data_Science/3-03_dbscan.ipynb
new file mode 100644
index 0000000..e27f31e
--- /dev/null
+++ b/Fundamentals_of_Accelerated_Data_Science/3-03_dbscan.ipynb
@@ -0,0 +1,241 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<img src=\"./images/DLI_Header.png\" width=400/>"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Fundamentals of Accelerated Data Science # "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 03 - DBSCAN ##\n",
+ "\n",
+ "**Table of Contents**\n",
+ "<br>\n",
+    "This notebook uses GPU-accelerated DBSCAN to identify clusters of infected people. It covers the following sections:\n",
+ "1. [Environment](#Environment)\n",
+ "2. [Load Data](#Load-Data)\n",
+ "3. [DBSCAN Clustering](#DBSCAN-Clustering)\n",
+ " * [Exercise #1 - Make Another DBSCAN Instance](#Exercise-#1---Make-Another-DBSCAN-Instance)\n",
+ "4. [Visualize the Clusters](#Visualize-the-Clusters)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Environment ##"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import cudf\n",
+ "import cuml\n",
+ "\n",
+ "import cuxfilter as cxf"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Load Data ##\n",
+ "For this notebook, we again load a subset of our population data with only the columns we need. An `infected` column has been added to the data to indicate whether or not a person is known to be infected with our simulated virus."
+ ]
+ },
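cuDF's `read_csv` mirrors the pandas API, so the typed load below can be sketched on CPU. This is a minimal illustration with a made-up two-row in-memory CSV standing in for `pop_sample.csv` (which is not reproduced here); note that pandas takes `dtype` as a mapping, whereas cuDF additionally accepts a positional list as used in the cell below.

```python
import io

import pandas as pd

# Hypothetical stand-in for the course's pop_sample.csv
csv = io.StringIO(
    "northing,easting,infected\n"
    "180000.0,530000.0,0.0\n"
    "180500.0,530250.0,1.0\n"
)

# Explicit dtypes avoid the default float64 and halve memory use
df = pd.read_csv(
    csv,
    dtype={"northing": "float32", "easting": "float32", "infected": "float32"},
)
print(df.dtypes)  # all three columns load as float32
print(df.shape)   # (2, 3)
```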
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "gdf = cudf.read_csv('./data/pop_sample.csv', dtype=['float32', 'float32', 'float32'])\n",
+ "print(gdf.dtypes)\n",
+ "gdf.shape"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "gdf.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "gdf['infected'].value_counts()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## DBSCAN Clustering ##\n",
+ "DBSCAN is another unsupervised clustering algorithm that is particularly effective when the number of clusters is not known up front and the clusters may have concave or other unusual shapes--a situation that often applies in geospatial analytics.\n",
+ "\n",
+ "In this series of exercises you will use DBSCAN to identify clusters of infected people by location, which may help us identify groups becoming infected from common patient zeroes and assist in response planning.\n",
+ "\n",
+    "Create a DBSCAN instance with `cuml.DBSCAN`, setting the named argument `eps` (the maximum distance between two points for one to be counted in the other's neighborhood) to `5000`. Since the `northing` and `easting` values we created are measured in meters, this lets us identify clusters of infected people in which individuals may be separated from the rest of the cluster by up to 5 kilometers.\n",
+ "\n",
+    "Below we fit DBSCAN. We start by creating a new dataframe, `infected_df`, from the rows of the original dataframe where `infected` is `1` (true)--be sure to reset the dataframe's index afterward. We then use `dbscan.fit_predict` to cluster on the `northing` and `easting` columns of `infected_df`, store the resulting series as a new \"cluster\" column in `infected_df`, and finally count the clusters DBSCAN identified. Note that DBSCAN labels noise points `-1`, so `nunique` includes the noise label in that count."
+ ]
+ },
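cuML's `DBSCAN` follows the scikit-learn API, so the effect of `eps` can be sketched on CPU with scikit-learn and a handful of made-up coordinates (not the course data). With a small `eps` the two groups below stay separate; with a larger `eps` they merge, and the isolated point is labeled `-1` (noise) either way:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of hypothetical coordinates (meters), plus one far-away point
pts = np.array([
    [0.0, 0.0], [100.0, 0.0], [0.0, 100.0],  # group A
    [1000.0, 1000.0], [1100.0, 1000.0],      # group B, ~1.4 km from group A
    [50000.0, 50000.0],                      # isolated point
])

for eps in (1000, 5000):
    labels = DBSCAN(eps=eps, min_samples=2).fit_predict(pts)
    print(eps, labels.tolist())
# eps=1000 -> two clusters plus noise: [0, 0, 0, 1, 1, -1]
# eps=5000 -> the groups merge:        [0, 0, 0, 0, 0, -1]
```

A `nunique`-style count of the `eps=5000` labels returns 2, even though only one real cluster was found, because the `-1` noise label is counted too.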
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dbscan = cuml.DBSCAN(eps=5000)\n",
+ "# dbscan = cuml.DBSCAN(eps=10000)\n",
+ "\n",
+ "infected_df = gdf[gdf['infected'] == 1].reset_index()\n",
+ "infected_df['cluster'] = dbscan.fit_predict(infected_df[['northing', 'easting']])\n",
+ "infected_df['cluster'].nunique()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Exercise #1 - Make Another DBSCAN Instance ###\n",
+ "\n",
+ "**Instructions**: <br>\n",
+    "* Modify the `<FIXME>` in the first cell below and execute it to instantiate a DBSCAN instance with `10000` for `eps`.\n",
+    "* Modify the `<FIXME>` in the second cell below and execute it to fit the data and identify the infected clusters. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+    "dbscan = cuml.DBSCAN(<FIXME>)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "infected_df = gdf[gdf['infected'] == 1].reset_index()\n",
+    "infected_df['cluster'] = dbscan.<FIXME>(infected_df[['northing', 'easting']])\n",
+ "infected_df['cluster'].nunique()"
+ ]
+ },
+ {
+ "cell_type": "raw",
+ "metadata": {
+ "jupyter": {
+ "source_hidden": true
+ }
+ },
+ "source": [
+ "\n",
+ "dbscan = cuml.DBSCAN(eps=10000)\n",
+ "\n",
+ "infected_df = gdf[gdf['infected'] == 1].reset_index()\n",
+ "infected_df['cluster'] = dbscan.fit_predict(infected_df[['northing', 'easting']])\n",
+ "infected_df['cluster'].nunique()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Click ... for solution. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## Visualize the Clusters ##"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Because we have the same column names as in the K-means example--`easting`, `northing`, and `cluster`--we can use the same code to visualize the clusters."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "infected_df.to_pandas().plot(kind='scatter', x='easting', y='northing', c='cluster')"
+ ]
+ },
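One caveat with the single-colormap scatter above: the `-1` noise label is mapped onto the same color scale as the real clusters. A sketch of one way to make noise visually distinct, using synthetic blobs rather than the course's `infected_df` (all names and values here are made up for illustration):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic stand-in for infected_df: two blobs plus scattered noise points
easting = np.concatenate(
    [rng.normal(0, 1, 50), rng.normal(10, 1, 50), rng.uniform(-5, 15, 10)]
)
northing = np.concatenate(
    [rng.normal(0, 1, 50), rng.normal(10, 1, 50), rng.uniform(-5, 15, 10)]
)
cluster = np.concatenate([np.zeros(50, int), np.ones(50, int), np.full(10, -1)])

fig, ax = plt.subplots()
noise = cluster == -1
# Real clusters get the colormap; noise gets a fixed gray 'x' marker
ax.scatter(easting[~noise], northing[~noise], c=cluster[~noise], cmap="viridis", s=10)
ax.scatter(easting[noise], northing[noise], c="lightgray", marker="x", s=20,
           label="noise (-1)")
ax.set_xlabel("easting")
ax.set_ylabel("northing")
ax.legend()
fig.savefig("dbscan_clusters.png")
```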
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import IPython\n",
+ "app = IPython.Application.instance()\n",
+ "app.kernel.do_shutdown(True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Well Done!** Let's move to the [next notebook](3-04_logistic_regression.ipynb). "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<img src=\"./images/DLI_Header.png\" width=400/>"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.15"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}