{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\"Header\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Week 3: Identify Risk Factors for Infection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "**UPDATE**\n", "\n", "Thank you again for the previous analysis. We will next be publishing a public health advisory that warns of specific infection risk factors of which individuals should be aware. Please advise as to which population characteristics are associated with higher infection rates. \n", "\n", "\n", "Your goal for this notebook will be to identify key potential demographic and economic risk factors for infection by comparing the infected and uninfected populations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext cudf.pandas\n", "import pandas as pd\n", "import cuml" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Begin by loading the data you've received about week 3 of the outbreak into a cuDF-accelerated pandas DataFrame. The data is located at `./data/week3.csv`. For this notebook you will need all columns of the data." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " age sex employment infected\n", "0 0 m U 0.0\n", "1 0 m U 0.0\n", "2 0 m U 0.0\n", "3 0 m U 0.0\n", "4 0 m U 0.0\n", "... ... .. ... ...\n", "58479889 90 f V 0.0\n", "58479890 90 f V 0.0\n", "58479891 90 f V 0.0\n", "58479892 90 f V 0.0\n", "58479893 90 f V 0.0\n", "\n", "[58479894 rows x 4 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('./data/week3.csv')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculate Infection Rates by Employment Code" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Convert the `infected` column to type `float32`. For people who are not infected, the float32 `infected` value should be `0.0`, and for infected people it should be `1.0`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " age sex employment infected\n", "0 0 m U 0.0\n", "1 0 m U 0.0\n", "2 0 m U 0.0\n", "3 0 m U 0.0\n", "4 0 m U 0.0\n", "... ... .. ... ...\n", "58479889 90 f V 0.0\n", "58479890 90 f V 0.0\n", "58479891 90 f V 0.0\n", "58479892 90 f V 0.0\n", "58479893 90 f V 0.0\n", "\n", "[58479894 rows x 4 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['infected'] = df['infected'].astype('float32')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, produce a list of employment types and their associated **rates** of infection, sorted from highest to lowest rate of infection.\n", "\n", "**NOTE**: The infection **rate** for each employment type should be the percentage of total individuals within an employment type who are infected. Therefore, if employment type \"X\" has 1000 people, and 10 of them are infected, the infection **rate** would be .01. If employment type \"Z\" has 10,000 people, and 50 of them are infected, the infection rate would be .005, and would be **lower** than for type \"X\", even though more people within that employment type were infected." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " total_people infected_people\n", "employment \n", "A 305755 1178.0\n", "B, D, E 486785 1837.0\n", "C 2653753 10301.0\n", "F 2075628 6604.0\n", "G 3549465 17561.0" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "emp_groups = df.groupby('employment').agg(\n", " total_people=('employment', 'size'),\n", " infected_people=('infected','sum')\n", ")\n", "emp_groups.head()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " total_people infected_people infection_rate\n", "employment \n", "A 305755 1178.0 0.003853\n", "B, D, E 486785 1837.0 0.003774\n", "C 2653753 10301.0 0.003882\n", "F 2075628 6604.0 0.003182\n", "G 3549465 17561.0 0.004948" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "emp_groups['infection_rate'] = emp_groups['infected_people'] / emp_groups['total_people']\n", "emp_groups.head()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " total_people infected_people infection_rate\n", "employment \n", "Q 3802602 48505.0 0.012756\n", "I 1556575 16116.0 0.010354\n", "V 10098466 76648.0 0.007590\n", "P 3006149 18609.0 0.006190\n", "Z 7161907 40498.0 0.005655" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "emp_rate_df = emp_groups.sort_values(by='infection_rate', ascending=False)\n", "emp_rate_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# emp_groups = df[['employment', 'infected']].groupby('employment')['infected']\n", "# emp_rate_df = emp_groups.mean().rename('infection_rate').sort_values(by='infection_rate', ascending=False)\n", "# emp_rate_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, read in the employment codes guide from `./data/code_guide.csv` to interpret which employment types are seeing the highest rates of infection." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Code Field\n", "0 A Agriculture, forestry & fishing\n", "1 B, D, E Mining, energy and water supply\n", "2 C Manufacturing\n", "3 F Construction\n", "4 G Wholesale, retail & repair of motor vehicles\n", "5 H Transport & storage\n", "6 I Accommodation & food services\n", "7 J Information & communication\n", "8 K Financial & insurance activities\n", "9 L Real estate activities\n", "10 M Professional, scientific & technical activities\n", "11 N Administrative & support services\n", "12 O Public admin & defence; social security\n", "13 P Education\n", "14 Q Human health & social work activities\n", "15 R, S, T Other services\n", "16 U Student\n", "17 V Retired\n", "18 X Outside the UK or not specified\n", "19 Y Pre-school child\n", "20 Z Not formally employed" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "emp_codes = pd.read_csv('./data/code_guide.csv')\n", "emp_codes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get Top 2 Employment Type with Highest Rate of Infection ###" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we ask you to get the top two employment types that have the highest rate of infection. We start by using `.sort_values()` to sort `emp_rate_df` by the rate of infection. We then take the first 2 results. \n", "\n", "We will also need to index `emp_codes` to get the respeictve field name. " ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " total_people infected_people infection_rate\n", "employment \n", "Q 3802602 48505.0 0.012756\n", "I 1556575 16116.0 0.010354" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top_inf_emp = emp_rate_df.sort_values(by='infection_rate', ascending=False).iloc[:2]\n", "top_inf_emp.head()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6 Accommodation & food services\n", "14 Human health & social work activities\n", "Name: Field, dtype: object" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top_inf_emp = emp_rate_df.sort_values(by='infection_rate', ascending=False).iloc[:2].index\n", "top_inf_emp_df = emp_codes.loc[emp_codes['Code'].isin(top_inf_emp), 'Field']\n", "top_inf_emp_df" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/conda/lib/python3.10/site-packages/cudf/io/json.py:194: UserWarning: Using CPU via Pandas to write JSON dataset\n", " warnings.warn(\"Using CPU via Pandas to write JSON dataset\")\n" ] } ], "source": [ "top_inf_emp_df.to_json('my_assessment/question_3.json', orient='records')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculate Infection Rates by Employment Code and Sex" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to see if there is an effect of `sex` on infection rate, either in addition to `employment` or confounding it. Group by both `employment` and `sex` simultaneously to get the infection rate for the intersection of those categories." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " age infected\n", "employment sex \n", "I f 41.377627 0.015064\n", "Q f 41.385400 0.014947\n", "V f 76.022214 0.010852\n", "B, D, E f 41.425618 0.007973\n", "R, S, T f 41.371672 0.007748\n", "O f 41.396246 0.007719\n", "K f 41.377495 0.007672\n", "M f 41.401898 0.007645\n", "J f 41.385772 0.007645\n", "C f 41.391365 0.007630\n", "Z f 41.384618 0.007629\n", "P f 41.399302 0.007584\n", "F f 41.415191 0.007577\n", "G f 41.394893 0.007556\n", "A f 41.385010 0.007491\n", "X f 41.401447 0.007391\n", "N f 41.386565 0.007389\n", "H f 41.416857 0.007385\n", "L f 41.412315 0.007221\n", "Q m 41.015871 0.005120\n", "I m 40.984340 0.005117\n", "V m 74.962331 0.003685\n", "G m 41.001163 0.002596\n", "P m 40.973641 0.002577\n", "C m 40.997851 0.002569\n", "J m 41.007800 0.002546\n", "O m 41.017959 0.002543\n", "Z m 41.010787 0.002543\n", "R, S, T m 41.021642 0.002542\n", "N m 40.975332 0.002538\n", "F m 41.007458 0.002535\n", "M m 40.989591 0.002520\n", "A m 40.992202 0.002514\n", "K m 40.988683 0.002490\n", "H m 40.978731 0.002482\n", "B, D, E m 41.014025 0.002462\n", "X m 41.024242 0.002435\n", "L m 40.953436 0.002197\n", "U f 8.330251 0.000329\n", " m 8.331235 0.000110" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "simul_groups = df.groupby(['employment', 'sex'])\n", "simul_groups.mean().sort_values('infected', ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check Submission ##" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\"Accommodation & food services\",\"Human health & social work activities\"]" ] } ], "source": [ "!cat my_assessment/question_3.json" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Tip**: Your submission file should contain one line of text, similar to: \n", "\n", "```\n", "[\"Agriculture, forestry & fishing\",\"Mining, energy and water supply\"]" ] 
}, { "cell_type": "markdown", "metadata": {}, "source": [ "

Please Restart the Kernel

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you plan to continue work in other notebooks, please shutdown the kernel." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'status': 'ok', 'restart': True}" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import IPython\n", "app = IPython.Application.instance()\n", "app.kernel.do_shutdown(True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"Header\"" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.15" } }, "nbformat": 4, "nbformat_minor": 4 }