{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://www.nvidia.com/dli\"><img src=\"images/DLI_Header.png\" alt=\"Header\" style=\"width: 400px;\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Week 3: Identify Risk Factors for Infection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span style=\"color:red\">\n",
    "**UPDATE**\n",
    "\n",
    "Thank you again for the previous analysis. We will next be publishing a public health advisory that warns of specific infection risk factors of which individuals should be aware. Please advise as to which population characteristics are associated with higher infection rates. \n",
    "</span>\n",
    "\n",
    "Your goal for this notebook will be to identify key potential demographic and economic risk factors for infection by comparing the infected and uninfected populations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext cudf.pandas\n",
    "import pandas as pd\n",
    "import cuml"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Begin by loading the data you've received about week 3 of the outbreak into a cuDF-accelerated pandas DataFrame. The data is located at `./data/week3.csv`. For this notebook you will need all columns of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>employment</th>\n",
       "      <th>infected</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>m</td>\n",
       "      <td>U</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>m</td>\n",
       "      <td>U</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>m</td>\n",
       "      <td>U</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>m</td>\n",
       "      <td>U</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>m</td>\n",
       "      <td>U</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>58479889</th>\n",
       "      <td>90</td>\n",
       "      <td>f</td>\n",
       "      <td>V</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>58479890</th>\n",
       "      <td>90</td>\n",
       "      <td>f</td>\n",
       "      <td>V</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>58479891</th>\n",
       "      <td>90</td>\n",
       "      <td>f</td>\n",
       "      <td>V</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>58479892</th>\n",
       "      <td>90</td>\n",
       "      <td>f</td>\n",
       "      <td>V</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>58479893</th>\n",
       "      <td>90</td>\n",
       "      <td>f</td>\n",
       "      <td>V</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>58479894 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "          age sex employment  infected\n",
       "0           0   m          U       0.0\n",
       "1           0   m          U       0.0\n",
       "2           0   m          U       0.0\n",
       "3           0   m          U       0.0\n",
       "4           0   m          U       0.0\n",
       "...       ...  ..        ...       ...\n",
       "58479889   90   f          V       0.0\n",
       "58479890   90   f          V       0.0\n",
       "58479891   90   f          V       0.0\n",
       "58479892   90   f          V       0.0\n",
       "58479893   90   f          V       0.0\n",
       "\n",
       "[58479894 rows x 4 columns]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv('./data/week3.csv')\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Calculate Infection Rates by Employment Code"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Convert the `infected` column to type `float32`. For people who are not infected, the float32 `infected` value should be `0.0`, and for infected people it should be `1.0`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>employment</th>\n",
       "      <th>infected</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>m</td>\n",
       "      <td>U</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>m</td>\n",
       "      <td>U</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>m</td>\n",
       "      <td>U</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>m</td>\n",
       "      <td>U</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>m</td>\n",
       "      <td>U</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>58479889</th>\n",
       "      <td>90</td>\n",
       "      <td>f</td>\n",
       "      <td>V</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>58479890</th>\n",
       "      <td>90</td>\n",
       "      <td>f</td>\n",
       "      <td>V</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>58479891</th>\n",
       "      <td>90</td>\n",
       "      <td>f</td>\n",
       "      <td>V</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>58479892</th>\n",
       "      <td>90</td>\n",
       "      <td>f</td>\n",
       "      <td>V</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>58479893</th>\n",
       "      <td>90</td>\n",
       "      <td>f</td>\n",
       "      <td>V</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>58479894 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "          age sex employment  infected\n",
       "0           0   m          U       0.0\n",
       "1           0   m          U       0.0\n",
       "2           0   m          U       0.0\n",
       "3           0   m          U       0.0\n",
       "4           0   m          U       0.0\n",
       "...       ...  ..        ...       ...\n",
       "58479889   90   f          V       0.0\n",
       "58479890   90   f          V       0.0\n",
       "58479891   90   f          V       0.0\n",
       "58479892   90   f          V       0.0\n",
       "58479893   90   f          V       0.0\n",
       "\n",
       "[58479894 rows x 4 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['infected'] = df['infected'].astype('float32')\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, produce a list of employment types and their associated **rates** of infection, sorted from highest to lowest rate of infection.\n",
    "\n",
    "**NOTE**: The infection **rate** for each employment type should be the percentage of total individuals within an employment type who are infected. Therefore, if employment type \"X\" has 1000 people, and 10 of them are infected, the infection **rate** would be .01. If employment type \"Z\" has 10,000 people, and 50 of them are infected, the infection rate would be .005, and would be **lower** than for type \"X\", even though more people within that employment type were infected."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>total_people</th>\n",
       "      <th>infected_people</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>employment</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>A</th>\n",
       "      <td>305755</td>\n",
       "      <td>1178.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>B, D, E</th>\n",
       "      <td>486785</td>\n",
       "      <td>1837.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C</th>\n",
       "      <td>2653753</td>\n",
       "      <td>10301.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>F</th>\n",
       "      <td>2075628</td>\n",
       "      <td>6604.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>G</th>\n",
       "      <td>3549465</td>\n",
       "      <td>17561.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            total_people  infected_people\n",
       "employment                               \n",
       "A                 305755           1178.0\n",
       "B, D, E           486785           1837.0\n",
       "C                2653753          10301.0\n",
       "F                2075628           6604.0\n",
       "G                3549465          17561.0"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "emp_groups = df.groupby('employment').agg(\n",
    "    total_people=('employment', 'size'),\n",
    "    infected_people=('infected','sum')\n",
    ")\n",
    "emp_groups.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>total_people</th>\n",
       "      <th>infected_people</th>\n",
       "      <th>infection_rate</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>employment</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>A</th>\n",
       "      <td>305755</td>\n",
       "      <td>1178.0</td>\n",
       "      <td>0.003853</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>B, D, E</th>\n",
       "      <td>486785</td>\n",
       "      <td>1837.0</td>\n",
       "      <td>0.003774</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C</th>\n",
       "      <td>2653753</td>\n",
       "      <td>10301.0</td>\n",
       "      <td>0.003882</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>F</th>\n",
       "      <td>2075628</td>\n",
       "      <td>6604.0</td>\n",
       "      <td>0.003182</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>G</th>\n",
       "      <td>3549465</td>\n",
       "      <td>17561.0</td>\n",
       "      <td>0.004948</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            total_people  infected_people  infection_rate\n",
       "employment                                               \n",
       "A                 305755           1178.0        0.003853\n",
       "B, D, E           486785           1837.0        0.003774\n",
       "C                2653753          10301.0        0.003882\n",
       "F                2075628           6604.0        0.003182\n",
       "G                3549465          17561.0        0.004948"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "emp_groups['infection_rate'] = emp_groups['infected_people'] / emp_groups['total_people']\n",
    "emp_groups.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>total_people</th>\n",
       "      <th>infected_people</th>\n",
       "      <th>infection_rate</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>employment</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Q</th>\n",
       "      <td>3802602</td>\n",
       "      <td>48505.0</td>\n",
       "      <td>0.012756</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>I</th>\n",
       "      <td>1556575</td>\n",
       "      <td>16116.0</td>\n",
       "      <td>0.010354</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>V</th>\n",
       "      <td>10098466</td>\n",
       "      <td>76648.0</td>\n",
       "      <td>0.007590</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>P</th>\n",
       "      <td>3006149</td>\n",
       "      <td>18609.0</td>\n",
       "      <td>0.006190</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Z</th>\n",
       "      <td>7161907</td>\n",
       "      <td>40498.0</td>\n",
       "      <td>0.005655</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            total_people  infected_people  infection_rate\n",
       "employment                                               \n",
       "Q                3802602          48505.0        0.012756\n",
       "I                1556575          16116.0        0.010354\n",
       "V               10098466          76648.0        0.007590\n",
       "P                3006149          18609.0        0.006190\n",
       "Z                7161907          40498.0        0.005655"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "emp_rate_df = emp_groups.sort_values(by='infection_rate', ascending=False)\n",
    "emp_rate_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# emp_groups = df[['employment', 'infected']].groupby('employment')['infected']\n",
    "# emp_rate_df = emp_groups.mean().rename('infection_rate').sort_values(by='infection_rate', ascending=False)\n",
    "# emp_rate_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, read in the employment codes guide from `./data/code_guide.csv` to interpret which employment types are seeing the highest rates of infection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Code</th>\n",
       "      <th>Field</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A</td>\n",
       "      <td>Agriculture, forestry &amp; fishing</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>B, D, E</td>\n",
       "      <td>Mining, energy and water supply</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>C</td>\n",
       "      <td>Manufacturing</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>F</td>\n",
       "      <td>Construction</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>G</td>\n",
       "      <td>Wholesale, retail &amp; repair of motor vehicles</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>H</td>\n",
       "      <td>Transport &amp; storage</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>I</td>\n",
       "      <td>Accommodation &amp; food services</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>J</td>\n",
       "      <td>Information &amp; communication</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>K</td>\n",
       "      <td>Financial &amp; insurance activities</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>L</td>\n",
       "      <td>Real estate activities</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>M</td>\n",
       "      <td>Professional, scientific &amp; technical activities</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>N</td>\n",
       "      <td>Administrative &amp; support services</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>O</td>\n",
       "      <td>Public admin &amp; defence; social security</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>P</td>\n",
       "      <td>Education</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>Q</td>\n",
       "      <td>Human health &amp; social work activities</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>R, S, T</td>\n",
       "      <td>Other services</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>U</td>\n",
       "      <td>Student</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>V</td>\n",
       "      <td>Retired</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>X</td>\n",
       "      <td>Outside the UK or not specified</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>Y</td>\n",
       "      <td>Pre-school child</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>Z</td>\n",
       "      <td>Not formally employed</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       Code                                            Field\n",
       "0         A                  Agriculture, forestry & fishing\n",
       "1   B, D, E                  Mining, energy and water supply\n",
       "2         C                                    Manufacturing\n",
       "3         F                                     Construction\n",
       "4         G     Wholesale, retail & repair of motor vehicles\n",
       "5         H                              Transport & storage\n",
       "6         I                    Accommodation & food services\n",
       "7         J                      Information & communication\n",
       "8         K                 Financial & insurance activities\n",
       "9         L                           Real estate activities\n",
       "10        M  Professional, scientific & technical activities\n",
       "11        N                Administrative & support services\n",
       "12        O          Public admin & defence; social security\n",
       "13        P                                        Education\n",
       "14        Q            Human health & social work activities\n",
       "15  R, S, T                                   Other services\n",
       "16        U                                          Student\n",
       "17        V                                          Retired\n",
       "18        X                  Outside the UK or not specified\n",
       "19        Y                                 Pre-school child\n",
       "20        Z                            Not formally employed"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "emp_codes = pd.read_csv('./data/code_guide.csv')\n",
    "emp_codes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get Top 2 Employment Type with Highest Rate of Infection ###"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we ask you to get the top two employment types that have the highest rate of infection. We start by using `.sort_values()` to sort `emp_rate_df` by the rate of infection. We then take the first 2 results. \n",
    "\n",
    "We will also need to index `emp_codes` to get the respeictve field name. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>total_people</th>\n",
       "      <th>infected_people</th>\n",
       "      <th>infection_rate</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>employment</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Q</th>\n",
       "      <td>3802602</td>\n",
       "      <td>48505.0</td>\n",
       "      <td>0.012756</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>I</th>\n",
       "      <td>1556575</td>\n",
       "      <td>16116.0</td>\n",
       "      <td>0.010354</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            total_people  infected_people  infection_rate\n",
       "employment                                               \n",
       "Q                3802602          48505.0        0.012756\n",
       "I                1556575          16116.0        0.010354"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "top_inf_emp = emp_rate_df.sort_values(by='infection_rate', ascending=False).iloc[:2]\n",
    "top_inf_emp.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6             Accommodation & food services\n",
       "14    Human health & social work activities\n",
       "Name: Field, dtype: object"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "top_inf_emp = emp_rate_df.sort_values(by='infection_rate', ascending=False).iloc[:2].index\n",
    "top_inf_emp_df = emp_codes.loc[emp_codes['Code'].isin(top_inf_emp), 'Field']\n",
    "top_inf_emp_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/opt/conda/lib/python3.10/site-packages/cudf/io/json.py:194: UserWarning: Using CPU via Pandas to write JSON dataset\n",
      "  warnings.warn(\"Using CPU via Pandas to write JSON dataset\")\n"
     ]
    }
   ],
   "source": [
    "top_inf_emp_df.to_json('my_assessment/question_3.json', orient='records')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Calculate Infection Rates by Employment Code and Sex"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We want to see if there is an effect of `sex` on infection rate, either in addition to `employment` or confounding it. Group by both `employment` and `sex` simultaneously to get the infection rate for the intersection of those categories."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>infected</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>employment</th>\n",
       "      <th>sex</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>I</th>\n",
       "      <th>f</th>\n",
       "      <td>41.377627</td>\n",
       "      <td>0.015064</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Q</th>\n",
       "      <th>f</th>\n",
       "      <td>41.385400</td>\n",
       "      <td>0.014947</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>V</th>\n",
       "      <th>f</th>\n",
       "      <td>76.022214</td>\n",
       "      <td>0.010852</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>B, D, E</th>\n",
       "      <th>f</th>\n",
       "      <td>41.425618</td>\n",
       "      <td>0.007973</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>R, S, T</th>\n",
       "      <th>f</th>\n",
       "      <td>41.371672</td>\n",
       "      <td>0.007748</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>O</th>\n",
       "      <th>f</th>\n",
       "      <td>41.396246</td>\n",
       "      <td>0.007719</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>K</th>\n",
       "      <th>f</th>\n",
       "      <td>41.377495</td>\n",
       "      <td>0.007672</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>M</th>\n",
       "      <th>f</th>\n",
       "      <td>41.401898</td>\n",
       "      <td>0.007645</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>J</th>\n",
       "      <th>f</th>\n",
       "      <td>41.385772</td>\n",
       "      <td>0.007645</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C</th>\n",
       "      <th>f</th>\n",
       "      <td>41.391365</td>\n",
       "      <td>0.007630</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Z</th>\n",
       "      <th>f</th>\n",
       "      <td>41.384618</td>\n",
       "      <td>0.007629</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>P</th>\n",
       "      <th>f</th>\n",
       "      <td>41.399302</td>\n",
       "      <td>0.007584</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>F</th>\n",
       "      <th>f</th>\n",
       "      <td>41.415191</td>\n",
       "      <td>0.007577</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>G</th>\n",
       "      <th>f</th>\n",
       "      <td>41.394893</td>\n",
       "      <td>0.007556</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>A</th>\n",
       "      <th>f</th>\n",
       "      <td>41.385010</td>\n",
       "      <td>0.007491</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>X</th>\n",
       "      <th>f</th>\n",
       "      <td>41.401447</td>\n",
       "      <td>0.007391</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>N</th>\n",
       "      <th>f</th>\n",
       "      <td>41.386565</td>\n",
       "      <td>0.007389</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>H</th>\n",
       "      <th>f</th>\n",
       "      <td>41.416857</td>\n",
       "      <td>0.007385</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>L</th>\n",
       "      <th>f</th>\n",
       "      <td>41.412315</td>\n",
       "      <td>0.007221</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Q</th>\n",
       "      <th>m</th>\n",
       "      <td>41.015871</td>\n",
       "      <td>0.005120</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>I</th>\n",
       "      <th>m</th>\n",
       "      <td>40.984340</td>\n",
       "      <td>0.005117</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>V</th>\n",
       "      <th>m</th>\n",
       "      <td>74.962331</td>\n",
       "      <td>0.003685</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>G</th>\n",
       "      <th>m</th>\n",
       "      <td>41.001163</td>\n",
       "      <td>0.002596</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>P</th>\n",
       "      <th>m</th>\n",
       "      <td>40.973641</td>\n",
       "      <td>0.002577</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C</th>\n",
       "      <th>m</th>\n",
       "      <td>40.997851</td>\n",
       "      <td>0.002569</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>J</th>\n",
       "      <th>m</th>\n",
       "      <td>41.007800</td>\n",
       "      <td>0.002546</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>O</th>\n",
       "      <th>m</th>\n",
       "      <td>41.017959</td>\n",
       "      <td>0.002543</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Z</th>\n",
       "      <th>m</th>\n",
       "      <td>41.010787</td>\n",
       "      <td>0.002543</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>R, S, T</th>\n",
       "      <th>m</th>\n",
       "      <td>41.021642</td>\n",
       "      <td>0.002542</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>N</th>\n",
       "      <th>m</th>\n",
       "      <td>40.975332</td>\n",
       "      <td>0.002538</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>F</th>\n",
       "      <th>m</th>\n",
       "      <td>41.007458</td>\n",
       "      <td>0.002535</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>M</th>\n",
       "      <th>m</th>\n",
       "      <td>40.989591</td>\n",
       "      <td>0.002520</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>A</th>\n",
       "      <th>m</th>\n",
       "      <td>40.992202</td>\n",
       "      <td>0.002514</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>K</th>\n",
       "      <th>m</th>\n",
       "      <td>40.988683</td>\n",
       "      <td>0.002490</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>H</th>\n",
       "      <th>m</th>\n",
       "      <td>40.978731</td>\n",
       "      <td>0.002482</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>B, D, E</th>\n",
       "      <th>m</th>\n",
       "      <td>41.014025</td>\n",
       "      <td>0.002462</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>X</th>\n",
       "      <th>m</th>\n",
       "      <td>41.024242</td>\n",
       "      <td>0.002435</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>L</th>\n",
       "      <th>m</th>\n",
       "      <td>40.953436</td>\n",
       "      <td>0.002197</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">U</th>\n",
       "      <th>f</th>\n",
       "      <td>8.330251</td>\n",
       "      <td>0.000329</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>m</th>\n",
       "      <td>8.331235</td>\n",
       "      <td>0.000110</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                      age  infected\n",
       "employment sex                     \n",
       "I          f    41.377627  0.015064\n",
       "Q          f    41.385400  0.014947\n",
       "V          f    76.022214  0.010852\n",
       "B, D, E    f    41.425618  0.007973\n",
       "R, S, T    f    41.371672  0.007748\n",
       "O          f    41.396246  0.007719\n",
       "K          f    41.377495  0.007672\n",
       "M          f    41.401898  0.007645\n",
       "J          f    41.385772  0.007645\n",
       "C          f    41.391365  0.007630\n",
       "Z          f    41.384618  0.007629\n",
       "P          f    41.399302  0.007584\n",
       "F          f    41.415191  0.007577\n",
       "G          f    41.394893  0.007556\n",
       "A          f    41.385010  0.007491\n",
       "X          f    41.401447  0.007391\n",
       "N          f    41.386565  0.007389\n",
       "H          f    41.416857  0.007385\n",
       "L          f    41.412315  0.007221\n",
       "Q          m    41.015871  0.005120\n",
       "I          m    40.984340  0.005117\n",
       "V          m    74.962331  0.003685\n",
       "G          m    41.001163  0.002596\n",
       "P          m    40.973641  0.002577\n",
       "C          m    40.997851  0.002569\n",
       "J          m    41.007800  0.002546\n",
       "O          m    41.017959  0.002543\n",
       "Z          m    41.010787  0.002543\n",
       "R, S, T    m    41.021642  0.002542\n",
       "N          m    40.975332  0.002538\n",
       "F          m    41.007458  0.002535\n",
       "M          m    40.989591  0.002520\n",
       "A          m    40.992202  0.002514\n",
       "K          m    40.988683  0.002490\n",
       "H          m    40.978731  0.002482\n",
       "B, D, E    m    41.014025  0.002462\n",
       "X          m    41.024242  0.002435\n",
       "L          m    40.953436  0.002197\n",
       "U          f     8.330251  0.000329\n",
       "           m     8.331235  0.000110"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "simul_groups = df.groupby(['employment', 'sex'])\n",
    "simul_groups.mean().sort_values('infected', ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Check Submission ##"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[\"Accommodation & food services\",\"Human health & social work activities\"]"
     ]
    }
   ],
   "source": [
    "!cat my_assessment/question_3.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Tip**: Your submission file should contain one line of text, similar to: \n",
    "\n",
    "```\n",
    "[\"Agriculture, forestry & fishing\",\"Mining, energy and water supply\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"center\"><h2>Please Restart the Kernel</h2></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you plan to continue work in other notebooks, please shutdown the kernel."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'status': 'ok', 'restart': True}"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import IPython\n",
    "app = IPython.Application.instance()\n",
    "app.kernel.do_shutdown(True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://www.nvidia.com/dli\"><img src=\"images/DLI_Header.png\" alt=\"Header\" style=\"width: 400px;\"/></a>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}