{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Fundamentals of Accelerated Data Science # " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 04 - Logistic Regression ##\n", "\n", "**Table of Contents**\n", "
\n", "This notebook uses GPU-accelerated logistic regression to predict infection risk from the age and sex of our population members. It covers the following sections:\n", "1. [Environment](#Environment)\n", "2. [Load Data](#Load-Data)\n", "3. [Logistic Regression](#Logistic-Regression)\n", " * [Viewing the Regression](#Viewing-the-Regression)\n", " * [Estimate Probability of Infection](#Estimate-Probability-of-Infection)\n", "4. [Model Explainability](#Model-Explainability)\n", " * [Show Infection Prevalence is Related to Age](#Show-Infection-Prevalence-is-Related-to-Age)\n", " * [Exercise #1 - Show Infection Prevalence is Related to Sex](#Exercise-#1---Show-Infection-Prevalence-is-Related-to-Sex)\n", "5. [Making Predictions with Separate Training and Test Data](#Making-Predictions-with-Separate-Training-and-Test-Data)\n", " * [Exercise #2 - Fit Logistic Regression Model Using Training Data](#Exercise-#2---Fit-Logistic-Regression-Model-Using-Training-Data)\n", " * [Use Test Data to Validate Model](#Use-Test-Data-to-Validate-Model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Environment ##" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import cudf\n", "import cuml\n", "\n", "import cupy as cp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Data ##" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "gdf = cudf.read_csv('./data/clean_uk_pop_full.csv', usecols=['age', 'sex', 'infected'])" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "age float64\n", "sex float64\n", "infected float64\n", "dtype: object" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf.dtypes" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(58479894, 3)" ] }, "execution_count": 4, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "gdf.shape" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " age sex infected\n", "0 0.0 0.0 0.0\n", "1 0.0 0.0 0.0\n", "2 0.0 0.0 0.0\n", "3 0.0 0.0 0.0\n", "4 0.0 0.0 0.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Logistic Regression ##\n", "Logistic regression can be used to estimate the probability of an outcome as a function of some (assumed independent) inputs. In our case, we would like to estimate infection risk based on population members' age and sex.\n", "\n", "The model maps a linear combination of the inputs to a probability with the sigmoid function $\\sigma(z) = \\frac{1}{1+e^{-z}}$, where $z$ is the intercept plus the weighted sum of the inputs.\n", "\n", "Below we train a logistic regression model. We first create a cuML logistic regression instance `logreg`. The `logreg.fit` method takes two arguments: the model's independent variables *X* and the dependent variable *y*. We fit the `logreg` model using the `gdf` columns `age` and `sex` as *X* and the `infected` column as *y*." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "LogisticRegression()" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logreg = cuml.LogisticRegression()\n", "logreg.fit(gdf[['age', 'sex']], gdf['infected'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Viewing the Regression ###\n", "After fitting the model, we could use `logreg.predict` to estimate whether someone has more than a 50% chance of being infected. However, because the virus has low prevalence in the population (around 1-2% in this data set), individual probabilities of infection are well below 50%, and the model should correctly predict that no one is individually likely to have the infection.\n", "\n", "We also have access to the model coefficients at `logreg.coef_` and the intercept at `logreg.intercept_`. As the `type` checks below show, `logreg.coef_` is a cuDF DataFrame and `logreg.intercept_` is a cuDF Series.\n", "\n", "Below we view these values. Notice that, with the coefficients printed below, changing sex from 0 to 1 has the same effect as increasing age by about 47 years (0.695666 / 0.014861 ≈ 46.8)." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "cudf.core.dataframe.DataFrame" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(logreg.coef_)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "cudf.core.series.Series" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(logreg.intercept_)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Coefficients: [age, sex]\n", "[0 0.014861\n", "Name: 0, dtype: float64, 0 0.695666\n", "Name: 1, dtype: float64]\n", "Intercept:\n", "-5.222369426308725\n" ] } ], "source": [ "logreg_coef = logreg.coef_\n", "logreg_int = logreg.intercept_\n", "\n", "print(\"Coefficients: [age, sex]\")\n", "print([logreg_coef[0], logreg_coef[1]])\n", "\n", "print(\"Intercept:\")\n", "print(logreg_int[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Estimate Probability of Infection ###\n", "As with any logistic regression, the coefficients let us calculate the logit for each individual; applying the sigmoid to that logit gives the estimated probability of infection. Here, `logreg.predict_proba` returns these class probabilities directly.\n", "\n", "**Note**: Because a 1 indicates 'infected', we assign the probability of class 1 to a new `risk` column in the original dataframe." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1\n", "0 0.994634 0.005366\n", "1 0.994634 0.005366\n", "2 0.994634 0.005366\n", "3 0.994634 0.005366\n", "4 0.994634 0.005366\n", "... ... ...\n", "58479889 0.960428 0.039572\n", "58479890 0.960428 0.039572\n", "58479891 0.960428 0.039572\n", "58479892 0.960428 0.039572\n", "58479893 0.960428 0.039572\n", "\n", "[58479894 rows x 2 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "class_probs = logreg.predict_proba(gdf[['age', 'sex']])\n", "class_probs" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "gdf['risk'] = class_probs[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the original records with their new estimated risks, we can see how estimated risk varies across individuals." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " age sex infected risk\n", "57475949 84.0 1.0 0.0 0.036319\n", "43073547 39.0 1.0 0.0 0.018944\n", "46406333 48.0 1.0 0.0 0.021596\n", "17837831 48.0 0.0 0.0 0.010889\n", "16785013 45.0 0.0 0.0 0.010419" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf.take(cp.random.choice(gdf.shape[0], size=5, replace=False))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Explainability ##\n", "Model explainability refers to the ability to understand and explain the decisions and reasoning underlying the predictions from machine learning models. It can be achieved by investigating how the feature variables are related to the target variable. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Show Infection Prevalence is Related to Age ###\n", "The positive coefficient on age suggests that the virus is more prevalent in older people, even when controlling for sex.\n", "\n", "For this exercise, show that infection prevalence has some relationship to age by printing the mean `infected` values for the oldest and youngest members of the population when grouped by age:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " infected\n", "age \n", "66.0 0.020700\n", "71.0 0.021292\n", "64.0 0.020675\n", "77.0 0.022102\n", "82.0 0.022929\n", " infected\n", "age \n", "33.0 0.015707\n", "76.0 0.021928\n", "74.0 0.021807\n", "79.0 0.022518\n", "86.0 0.023417\n" ] } ], "source": [ "# %load solutions/risk_by_age\n", "age_groups = gdf[['age', 'infected']].groupby(['age'])\n", "print(age_groups.mean().head())\n", "print(age_groups.mean().tail())\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise #1 - Show Infection Prevalence is Related to Sex ###\n", "Similarly, the positive coefficient on sex suggests that the virus is more prevalent in people with sex = `1` (females), even 
when controlling for age.\n", "\n", "**Instructions**:
\n", "* Complete and execute the cell below to show that infection prevalence is related to sex by printing the mean `infected` value for each sex." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " infected\n", "sex \n", "0.0 0.010140\n", "1.0 0.020713" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sex_groups = gdf[['sex', 'infected']].groupby(['sex'])\n", "sex_groups.mean()" ] }, { "cell_type": "raw", "metadata": { "scrolled": true }, "source": [ "\n", "sex_groups = gdf[['sex', 'infected']].groupby(['sex'])\n", "sex_groups.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Click ... for solution. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Making Predictions with Separate Training and Test Data ##\n", "The typical process involves training the model on the training set, then using the test set to evaluate its performance. This provides a more realistic assessment of how well the model will perform on new, unseen data in real-world applications. By testing on a separate dataset, you can detect if your model is **overfitting** to the training data. Overfitting occurs when a model performs well on training data but poorly on new data. In many cases, you don't have access to truly new data, so splitting your existing data simulates this scenario. \n", "\n", "cuML gives us a simple method for producing paired training/testing data:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = cuml.train_test_split(gdf[['age', 'sex']], gdf['infected'], train_size=0.9)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise #2 - Fit Logistic Regression Model Using Training Data ###\n", "\n", "**Instructions**:
\n", "* Execute the first cell below to create a new logistic regression model `logreg`.\n", "* Complete and execute the second cell below to fit the new model with the *X* and *y* training data just created." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "logreg = cuml.LogisticRegression()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "LogisticRegression()" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logreg.fit(X_train, y_train)" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "\n", "logreg.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Click ... for solution. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use Test Data to Validate Model ###\n", "We can now use the same procedure as above to predict infection risk using the test data:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "16557172 0.010267\n", "21666454 0.012426\n", "25019343 0.014598\n", "32613710 0.012409\n", "5458911 0.006700\n", " ... \n", "29193786 0.010716\n", "10832641 0.008235\n", "50674662 0.024982\n", "15628357 0.009970\n", "44635132 0.020095\n", "Name: 1, Length: 5847990, dtype: float64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_test_pred = logreg.predict_proba(X_test, convert_dtype=True)[1]\n", "y_test_pred.index = X_test.index\n", "y_test_pred" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we saw before, very few people in the population are actually infected, even among the highest-risk groups. As a simple check of our model, we split the test set into above-average and below-average predicted risk, then confirm that the observed prevalence of infection in each group closely matches that group's mean predicted risk." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " age sex infected predicted_risk\n", "high_risk \n", "False 29.536875 0.252218 0.009992 0.010328\n", "True 56.187385 0.889923 0.023676 0.023323" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_results = cudf.DataFrame()\n", "test_results['age'] = X_test['age']\n", "test_results['sex'] = X_test['sex']\n", "test_results['infected'] = y_test\n", "test_results['predicted_risk'] = y_test_pred\n", "\n", "test_results['high_risk'] = test_results['predicted_risk'] > test_results['predicted_risk'].mean()\n", "\n", "risk_groups = test_results.groupby('high_risk')\n", "risk_groups.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, in a few hundred milliseconds, we can do a two-tier analysis by sex and age:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 310 ms, sys: 56.2 ms, total: 367 ms\n", "Wall time: 366 ms\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " infected predicted_risk\n", "sex age \n", "0.0 6.0 0.003474 0.005867\n", " 1.0 0.000764 0.005449\n", " 14.0 0.006123 0.006602\n", "1.0 5.0 0.006264 0.011532\n", "0.0 60.0 0.013377 0.012984\n", "... ... ...\n", " 35.0 0.010341 0.008995\n", " 13.0 0.005650 0.006505\n", " 32.0 0.010420 0.008606\n", " 19.0 0.007208 0.007107\n", "1.0 12.0 0.010666 0.012778\n", "\n", "[182 rows x 2 columns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "s_groups = test_results[['sex', 'age', 'infected', 'predicted_risk']].groupby(['sex', 'age'])\n", "s_groups.mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import IPython\n", "app = IPython.Application.instance()\n", "app.kernel.do_shutdown(True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Well Done!** Let's move to the [next notebook](3-05_knn.ipynb). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.15" } }, "nbformat": 4, "nbformat_minor": 4 }