diff options
Diffstat (limited to 'Fundamentals_of_Accelerated_Data_Science/1-05_grouping.ipynb')
| -rw-r--r-- | Fundamentals_of_Accelerated_Data_Science/1-05_grouping.ipynb | 1350 |
1 files changed, 1350 insertions, 0 deletions
diff --git a/Fundamentals_of_Accelerated_Data_Science/1-05_grouping.ipynb b/Fundamentals_of_Accelerated_Data_Science/1-05_grouping.ipynb new file mode 100644 index 0000000..0595f20 --- /dev/null +++ b/Fundamentals_of_Accelerated_Data_Science/1-05_grouping.ipynb @@ -0,0 +1,1350 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "b53a7b12-538d-4459-b82a-a35c8c417849", + "metadata": {}, + "source": [ + "<img src=\"./images/DLI_Header.png\" width=400/>" + ] + }, + { + "cell_type": "markdown", + "id": "ae497b71-bc43-471e-8970-88a1878e7cf9", + "metadata": {}, + "source": [ + "# Fundamentals of Accelerated Data Science # " + ] + }, + { + "cell_type": "markdown", + "id": "a149b6d1-1880-4a5d-9d71-f963d3097aa4", + "metadata": {}, + "source": [ + "## 05 - Grouping ##\n", + "\n", + "**Table of Contents**\n", + "<br>\n", + "This notebook discusses and demonstrates how grouping in used in data science. This notebook covers the below sections: \n", + "1. [Grouping](#Grouping)\n", + " * [Split, Apply, and Combine](#Split,-Apply,-and-Combine)\n", + " * [Exercise #1 - Average Age Per County](#Exercise-#1---Average-Age-Per-County)\n", + "2. [Binning](#Binning)\n", + " * [Exercise #2 - Using the Profiler](#Exercise-#2---Using-the-Profiler)\n", + "3. [Advanced Groupby Operations](#Advanced-Groupby-Operations)\n", + " * [`.apply()`](#.apply())\n", + " * [`.transform()`](#.transform())\n", + "4. [Pivot Table](#Pivot-Table)" + ] + }, + { + "cell_type": "markdown", + "id": "f4f2e800-7d61-4370-8a96-ebef3a6d0c0a", + "metadata": {}, + "source": [ + "## Grouping ##\n", + "In data science, we often would like to split data into groups and perform further analysis on them such as: \n", + "* Aggregate based on the grouping\n", + "* Compare metrics across different groups\n", + "* Understand patterns in data across different groups\n", + "* Remove duplicates or fill missing values based on group-level information\n", + "* Create new features based on group-level statistics\n", + "* Integrate with visualization" + ] + }, + { + "cell_type": "markdown", + "id": "93d943bf-aaad-42c8-9b2c-70c8b6c0bf44", + "metadata": {}, + "source": [ + "Below we load in our dataset. " + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "89c435b1-35d5-4971-ade1-549ae77d22db", + "metadata": {}, + "outputs": [], + "source": [ + "# DO NOT CHANGE THIS CELL\n", + "%load_ext cudf.pandas\n", + "import pandas as pd\n", + "import time" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "8dc82f3c-f1cb-436b-becb-a97635ec5f6a", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>age</th>\n", + " <th>sex</th>\n", + " <th>county</th>\n", + " <th>lat</th>\n", + " <th>long</th>\n", + " <th>name</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>0</td>\n", + " <td>m</td>\n", + " <td>DARLINGTON</td>\n", + " <td>54.533638</td>\n", + " <td>-1.524400</td>\n", + " <td>FRANCIS</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>0</td>\n", + " <td>m</td>\n", + " <td>DARLINGTON</td>\n", + " <td>54.426254</td>\n", + " <td>-1.465314</td>\n", + " <td>EDWARD</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>0</td>\n", + " <td>m</td>\n", + " <td>DARLINGTON</td>\n", + " <td>54.555199</td>\n", + " <td>-1.496417</td>\n", + " <td>TEDDY</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>0</td>\n", + " <td>m</td>\n", + " <td>DARLINGTON</td>\n", + " <td>54.547909</td>\n", + " <td>-1.572342</td>\n", + " <td>ANGUS</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4</th>\n", + " <td>0</td>\n", + " <td>m</td>\n", + " <td>DARLINGTON</td>\n", + " <td>54.477638</td>\n", + " <td>-1.605995</td>\n", + " <td>CHARLIE</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " age sex county lat long name\n", + "0 0 m DARLINGTON 54.533638 -1.524400 FRANCIS\n", + "1 0 m DARLINGTON 54.426254 -1.465314 EDWARD\n", + "2 0 m DARLINGTON 54.555199 -1.496417 TEDDY\n", + "3 0 m DARLINGTON 54.547909 -1.572342 ANGUS\n", + "4 0 m DARLINGTON 54.477638 -1.605995 CHARLIE" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# DO NOT CHANGE THIS CELL\n", + "dtype_dict={\n", + " 'age': 'int8', \n", + " 'sex': 'category', \n", + " 'county': 'category', \n", + " 'lat': 'float32', \n", + " 'long': 'float32', \n", + " 'name': 'category'\n", + "}\n", + " \n", + "df=pd.read_csv('./data/uk_pop.csv', dtype=dtype_dict)\n", + "df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "c8bbdeac-9134-443b-9b38-7a980b19decf", + "metadata": {}, + "source": [ + "## Split, Apply, and Combine ##\n", + "We use the `.groupby()` method to to group large amounts of data and compute operations on these groups. A groupby operation involves some combination of splitting the object, applying a function, and combining the results. cuDF implements record grouping in a manner comparable to Pandas, but with some notable differences. \n", + "\n", + "<p><img src='images/groupby.png' width=720></p>\n", + "\n", + "cuDF supports a number of common `DataFrameGroupBy` computations and descriptive statistics, such as `.size()`, `.mean()`, `.count()`, `.cov()`, `.cumprod()`, `.cumsum()`, `.max()`, `.min()`, `.nunique()`. \n", + "\n", + "**Note**: More information about how `.groupby()` behaves for pandas and how it differs from cuDF can be found in the links below: \n", + "* [pandas](https://pandas.pydata.org/docs/user_guide/groupby.html)\n", + "* [cuDF](https://docs.rapids.ai/api/cudf/stable/user_guide/groupby/)\n", + "\n", + "Below we find the number of people in each county. " + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "72b646d3-b175-4a4e-804f-43afcb6910a2", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "county\n", + "BARKING AND DAGENHAM 211998\n", + "BARNET 392140\n", + "BARNSLEY 245199\n", + "BATH AND NORTH EAST SOMERSET 192106\n", + "BEDFORD 171623\n", + " ... \n", + "WOKINGHAM 167979\n", + "WOLVERHAMPTON 262008\n", + "WORCESTERSHIRE 592057\n", + "WREXHAM 136126\n", + "YORK 209893\n", + "Length: 171, dtype: int64" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# DO NOT CHANGE THIS CELL\n", + "df.groupby('county').size()" + ] + }, + { + "cell_type": "markdown", + "id": "a4775215-c6d4-49b0-ba8d-4d10a7e9f434", + "metadata": {}, + "source": [ + "**Note**: The results is unsorted. We can sort the output using the `.sort_index()` or `.sort_values()` method. \n", + "\n", + "Below we count the number of people with the most and least popular names. " + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "6b22bb2b-8b86-45cb-ba26-3b845be5ac00", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "name\n", + "AKASHDEEP 213\n", + "DALHA 214\n", + "BOGOMIL 215\n", + "REMMY 217\n", + "KAIYAAN 219\n", + " ... \n", + "GEORGE 459096\n", + "HARRY 459346\n", + "AMELIA 460659\n", + "OLIVIA 483789\n", + "OLIVER 576135\n", + "Length: 13212, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "name\n", + "A 752\n", + "A'ISHA 889\n", + "A'NIYAH 298\n", + "A-JAY 785\n", + "AABAN 223\n", + " ... \n", + "ZYANA 857\n", + "ZYLA 285\n", + "ZYLAN 251\n", + "ZYON 501\n", + "ZYRAH 472\n", + "Length: 13212, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# DO NOT CHANGE THIS CELL\n", + "display(df.groupby('name').size().sort_values())\n", + "display(df.groupby('name').size().sort_index())" + ] + }, + { + "cell_type": "markdown", + "id": "fb83bfcd-d9b2-434d-a0b3-d8d435901ece", + "metadata": {}, + "source": [ + "Below we find the approximate centers of each county using `.groupby().mean()`. When performing groupby operations, we should **only** include columns that are being used. " + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "9f0efddf-7c0a-4fbc-8c88-b0fad245b05f", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>lat</th>\n", + " <th>long</th>\n", + " </tr>\n", + " <tr>\n", + " <th>county</th>\n", + " <th></th>\n", + " <th></th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>BARKING AND DAGENHAM</th>\n", + " <td>51.621062</td>\n", + " <td>0.129583</td>\n", + " </tr>\n", + " <tr>\n", + " <th>BARNET</th>\n", + " <td>51.812557</td>\n", + " <td>-0.218212</td>\n", + " </tr>\n", + " <tr>\n", + " <th>BARNSLEY</th>\n", + " <td>53.571895</td>\n", + " <td>-1.548719</td>\n", + " </tr>\n", + " <tr>\n", + " <th>BATH AND NORTH EAST SOMERSET</th>\n", + " <td>51.354950</td>\n", + " <td>-2.486675</td>\n", + " </tr>\n", + " <tr>\n", + " <th>BEDFORD</th>\n", + " <td>52.145476</td>\n", + " <td>-0.454974</td>\n", + " </tr>\n", + " <tr>\n", + " <th>...</th>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " </tr>\n", + " <tr>\n", + " <th>WOKINGHAM</th>\n", + " <td>51.459659</td>\n", + " <td>-0.899371</td>\n", + " </tr>\n", + " <tr>\n", + " <th>WOLVERHAMPTON</th>\n", + " <td>52.716852</td>\n", + " <td>-2.127595</td>\n", + " </tr>\n", + " <tr>\n", + " <th>WORCESTERSHIRE</th>\n", + " <td>52.057964</td>\n", + " <td>-2.209184</td>\n", + " </tr>\n", + " <tr>\n", + " <th>WREXHAM</th>\n", + " <td>53.000793</td>\n", + " <td>-2.991957</td>\n", + " </tr>\n", + " <tr>\n", + " <th>YORK</th>\n", + " <td>53.992329</td>\n", + " <td>-1.073789</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "<p>171 rows × 2 columns</p>\n", + "</div>" + ], + "text/plain": [ + " lat long\n", + "county \n", + "BARKING AND DAGENHAM 51.621062 0.129583\n", + "BARNET 51.812557 -0.218212\n", + "BARNSLEY 53.571895 -1.548719\n", + "BATH AND NORTH EAST SOMERSET 51.354950 -2.486675\n", + "BEDFORD 52.145476 -0.454974\n", + "... ... ...\n", + "WOKINGHAM 51.459659 -0.899371\n", + "WOLVERHAMPTON 52.716852 -2.127595\n", + "WORCESTERSHIRE 52.057964 -2.209184\n", + "WREXHAM 53.000793 -2.991957\n", + "YORK 53.992329 -1.073789\n", + "\n", + "[171 rows x 2 columns]" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-style: italic\"> </span>\n", + "<span style=\"font-style: italic\"> Total time elapsed: 0.588 seconds </span>\n", + "<span style=\"font-style: italic\"> </span>\n", + "<span style=\"font-style: italic\"> Stats </span>\n", + "<span style=\"font-style: italic\"> </span>\n", + "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n", + "┃<span style=\"font-weight: bold\"> Line no. </span>┃<span style=\"font-weight: bold\"> Line </span>┃<span style=\"font-weight: bold\"> GPU TIME(s) </span>┃<span style=\"font-weight: bold\"> CPU TIME(s) </span>┃\n", + "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n", + "│ 3 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> county_center_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">df[[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'county'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'lat'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'long'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]]</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">groupby(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'county'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">)[[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'…</span> │ 0.096854766 │ │\n", + "│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n", + "│ 4 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> display(county_center_df)</span><span style=\"background-color: #272822\"> </span> │ 0.177956740 │ │\n", + "│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n", + "└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n", + "</pre>\n" + ], + "text/plain": [ + "\u001b[3m \u001b[0m\n", + "\u001b[3m Total time elapsed: 0.588 seconds \u001b[0m\n", + "\u001b[3m \u001b[0m\n", + "\u001b[3m Stats \u001b[0m\n", + "\u001b[3m \u001b[0m\n", + "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n", + "┃\u001b[1m \u001b[0m\u001b[1mLine no.\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mLine \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mGPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mCPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\n", + "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n", + "│ 3 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcounty_center_df\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mcounty\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mlat\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mlong\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mgroupby\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mcounty\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m…\u001b[0m │ 0.096854766 │ │\n", + "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n", + "│ 4 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdisplay\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcounty_center_df\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ 0.177956740 │ │\n", + "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n", + "└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "%%cudf.pandas.line_profile\n", + "# DO NOT CHANGE THIS CELL\n", + "\n", + "county_center_df=df[['county', 'lat', 'long']].groupby('county')[['lat', 'long']].mean()\n", + "display(county_center_df)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b99556b3-b75d-4755-9b03-141d4addb023", + "metadata": {}, + "outputs": [], + "source": [ + "# DO NOT CHANGE THIS CELL\n", + "county_center_df.columns=['lat_county_center', 'long_county_center']\n", + "county_center_df.to_csv('county_centroid.csv')" + ] + }, + { + "cell_type": "markdown", + "id": "49a498f4-b298-446c-8255-60def6d3b0ed", + "metadata": {}, + "source": [ + "### Exercise #1 - Average Age Per County ###\n", + "We would like to find the average age for each county. We will need to use both `.groupby()` and `.sort_values()`. Using the `.mean()` method on the data grouped by `county`, identify the 5 counties with the highest average age. \n", + "\n", + "**Instructions**: <br>\n", + "* Modify the `<FIXME>` only and execute the below cell find the average age for each county. " + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "0772c391-4ad6-4fcc-8754-97b575bca1c5", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "county\n", + "DORSET 46.577193\n", + "ISLE OF WIGHT 46.149253\n", + "CONWY 45.854473\n", + "POWYS 45.849366\n", + "ISLES OF SCILLY 45.467440\n", + "Name: age, dtype: float64" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[['county', 'age']].groupby('county')['age']\\\n", + " .mean()\\\n", + " .sort_values(ascending=False)\\\n", + " .head()" + ] + }, + { + "cell_type": "raw", + "id": "cd43f771-0ccb-426d-8746-0c6ad63deff4", + "metadata": { + "jupyter": { + "source_hidden": true + }, + "scrolled": true + }, + "source": [ + "\n", + "%%cudf.pandas.line_profile\n", + "# DO NOT CHANGE THIS CELL\n", + "display(\n", + " df[['county', 'age']].groupby('county')['age']\\\n", + " .mean()\\\n", + " .sort_values(ascending=False)\\\n", + " .head()\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "0c4a3e80-c5cb-410f-a5ba-fde705594c66", + "metadata": {}, + "source": [ + "Click ... for solution. " + ] + }, + { + "cell_type": "markdown", + "id": "73fc5851-33dc-42a2-b086-6b5d7d2d65bb", + "metadata": {}, + "source": [ + "## Binning ##\n", + "When grouping continuous numerical data, it is sometimes helpful to bin numbers into discrete intervals or buckets. There are primarily two ways of binning: \n", + "* Equal-width binning: divide the range into equal-sized intervals\n", + "* Custom binning: define custom bins based on domain knowledge or specific criteria\n", + "\n", + "The `.cut()` function can be used to bin values into discrete intervals" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "ccf6ff71-40e5-46d6-9d1a-0e3d60fd0b51", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "age_bucket\n", + "0 7874941\n", + "1 6630853\n", + "2 7758863\n", + "3 7691036\n", + "4 7598003\n", + "5 7712976\n", + "6 6124213\n", + "7 4530946\n", + "8 2558063\n", + "dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-style: italic\"> </span>\n", + "<span style=\"font-style: italic\"> Total time elapsed: 0.532 seconds </span>\n", + "<span style=\"font-style: italic\"> </span>\n", + "<span style=\"font-style: italic\"> Stats </span>\n", + "<span style=\"font-style: italic\"> </span>\n", + "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n", + "┃<span style=\"font-weight: bold\"> Line no. </span>┃<span style=\"font-weight: bold\"> Line </span>┃<span style=\"font-weight: bold\"> GPU TIME(s) </span>┃<span style=\"font-weight: bold\"> CPU TIME(s) </span>┃\n", + "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n", + "│ 3 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> bins</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">[</span><span style=\"color: #ae81ff; text-decoration-color: #ae81ff; background-color: #272822\">0</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #ae81ff; text-decoration-color: #ae81ff; background-color: #272822\">10</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #ae81ff; text-decoration-color: #ae81ff; background-color: #272822\">20</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #ae81ff; text-decoration-color: #ae81ff; background-color: #272822\">30</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #ae81ff; text-decoration-color: #ae81ff; background-color: #272822\">40</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #ae81ff; text-decoration-color: #ae81ff; background-color: #272822\">50</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #ae81ff; text-decoration-color: #ae81ff; background-color: #272822\">60</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #ae81ff; text-decoration-color: #ae81ff; background-color: #272822\">70</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #ae81ff; text-decoration-color: #ae81ff; background-color: #272822\">80</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #ae81ff; text-decoration-color: #ae81ff; background-color: #272822\">90</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #ae81ff; text-decoration-color: #ae81ff; background-color: #272822\">100</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]</span><span style=\"background-color: #272822\"> </span> │ │ │\n", + "│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n", + "│ 5 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> df[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'age_bucket'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">pd</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">cut(df[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'age'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">values, bins</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">bins, right</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #66d9ef; text-decoration-color: #66d9ef; background-color: #272822\">True</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, in…</span> │ 0.145499838 │ │\n", + "│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n", + "│ 6 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> display(df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">groupby(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'age_bucket'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">)</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">size())</span><span style=\"background-color: #272822\"> </span> │ 0.049299458 │ 0.006484685 │\n", + "│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n", + "└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n", + "</pre>\n" + ], + "text/plain": [ + "\u001b[3m \u001b[0m\n", + "\u001b[3m Total time elapsed: 0.532 seconds \u001b[0m\n", + "\u001b[3m \u001b[0m\n", + "\u001b[3m Stats \u001b[0m\n", + "\u001b[3m \u001b[0m\n", + "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n", + "┃\u001b[1m \u001b[0m\u001b[1mLine no.\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mLine \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mGPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mCPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\n", + "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n", + "│ 3 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mbins\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m0\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m10\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m20\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m30\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m40\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m50\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m60\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m70\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m80\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m90\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m100\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n", + "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n", + "│ 5 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mage_bucket\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mpd\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcut\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mage\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mvalues\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mbins\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mbins\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mright\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;102;217;239;48;2;39;40;34mTrue\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34min…\u001b[0m │ 0.145499838 │ │\n", + "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n", + "│ 6 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdisplay\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mgroupby\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mage_bucket\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34msize\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ 0.049299458 │ 0.006484685 │\n", + "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n", + "└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "%%cudf.pandas.line_profile\n", + "# DO NOT CHANGE THIS CELL\n", + "\n", + "bins=[0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]\n", + "\n", + "df['age_bucket']=pd.cut(df['age'].values, bins=bins, right=True, include_lowest=True, labels=False)\n", + "display(df.groupby('age_bucket').size())" + ] + }, + { + "cell_type": "markdown", + "id": "db8ec7e5-f6f4-4041-8187-578a15c73808", + "metadata": {}, + "source": [ + "### Exercise #2 - Using the Profiler ###\n", + "cuDF pandas will attempt to use the GPU whenever possible and fall back to CPU for certain operations. Running the code with the `cudf.pandas.line_profile` magic command generates a report showing which operations used the GPU and which used the CPU. \n", + "\n", + "**Instructions**: <br>\n", + "* Notice that the below cell is a very similar operation as before, except that it uses the `range()` function for the `bins` parameter. As it stands, this is not supported in cuDF. \n", + "* Execute the cell below to run the binning operation on the CPU.\n", + "* Compare the time it takes to run the similar operation above. " + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "991c437f-afee-408e-9144-006c4313f9f3", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "age_bucket\n", + "0 7874941\n", + "1 6630853\n", + "2 7758863\n", + "3 7691036\n", + "4 7598003\n", + "5 7712976\n", + "6 6124213\n", + "7 4530946\n", + "8 2558063\n", + "dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-style: italic\"> </span>\n", + "<span style=\"font-style: italic\"> Total time elapsed: 1.724 seconds </span>\n", + "<span style=\"font-style: italic\"> </span>\n", + "<span style=\"font-style: italic\"> Stats </span>\n", + "<span style=\"font-style: italic\"> </span>\n", + "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n", + "┃<span style=\"font-weight: bold\"> Line no. </span>┃<span style=\"font-weight: bold\"> Line </span>┃<span style=\"font-weight: bold\"> GPU TIME(s) </span>┃<span style=\"font-weight: bold\"> CPU TIME(s) </span>┃\n", + "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n", + "│ 3 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> df[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'age_bucket'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">pd</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">cut(df[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'age'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">values, bins</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">range(</span><span style=\"color: #ae81ff; text-decoration-color: #ae81ff; background-color: #272822\">0</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #ae81ff; text-decoration-color: #ae81ff; background-color: #272822\">100</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #ae81ff; text-decoration-color: #ae81ff; background-color: #272822\">10</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">), r…</span> │ 0.333567078 │ 1.021651556 │\n", + "│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n", + "│ 4 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> display(df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">groupby(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'age_bucket'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">)</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">size())</span><span style=\"background-color: #272822\"> </span> │ 0.073766744 │ 0.006484282 │\n", + "│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n", + "└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n", + "</pre>\n" + ], + "text/plain": [ + "\u001b[3m \u001b[0m\n", + "\u001b[3m Total time elapsed: 1.724 seconds \u001b[0m\n", + "\u001b[3m \u001b[0m\n", + "\u001b[3m Stats \u001b[0m\n", + "\u001b[3m \u001b[0m\n", + "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n", + "┃\u001b[1m \u001b[0m\u001b[1mLine no.\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mLine \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mGPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mCPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\n", + "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n", + "│ 3 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mage_bucket\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mpd\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcut\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mage\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mvalues\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mbins\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mrange\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m0\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m100\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m10\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mr…\u001b[0m │ 0.333567078 │ 1.021651556 │\n", + "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n", + "│ 4 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdisplay\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mgroupby\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mage_bucket\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34msize\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ 0.073766744 │ 0.006484282 │\n", + "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n", + "└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "%%cudf.pandas.line_profile\n", + "# DO NOT CHANGE THIS CELL\n", + "\n", + "df['age_bucket']=pd.cut(df['age'].values, bins=range(0, 100, 10), right=True, include_lowest=True, labels=False)\n", + "display(df.groupby('age_bucket').size())" + ] + }, + { + "cell_type": "markdown", + "id": "95ef539b-6ab4-4f28-b84e-8b83f32d0c0c", + "metadata": {}, + "source": [ + "**Note**: The profiler can help us identify parts of our code that could be rewritten to be more GPU-friendly. " + ] + }, + { + "cell_type": "markdown", + "id": "7c76d718-693c-4550-8eec-3de742565a9d", + "metadata": {}, + "source": [ + "## Advanced Groupby Operations ##\n", + "We can also use function application helpers on `DataFrameGroupBy` instances: \n", + "* `DataFrameGroupby.aggregate()` / `Groupby.agg()`(alias): used when we have specific computation for different columns or more than one computation on the same column\n", + "* `DataFrameGroupby.apply()`: used when we want to perform a specific user-defined function to each group\n", + "* `DataFrameGroupby.transform()`: used when the resulting values should be broadcast across the whole group and return a same-indexed dataframe" + ] + }, + { + "cell_type": "markdown", + "id": "059bded6-fb78-4129-b723-6fa1de7ed8f0", + "metadata": {}, + "source": [ + "### `.apply()` ###\n", + "The `.apply()` method will **sequentially** apply the function group-wise and concatenate the results together. We can pass a callable function to be performed on the entire DataFrame for each group. \n", + "\n", + "Below we calculate the distance of each person from their respective county center. " + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "6906dc46-32fd-45f4-8be1-ff30d944d226", + "metadata": {}, + "outputs": [], + "source": [ + "# DO NOT CHANGE THIS CELL\n", + "\n", + "# define distance function\n", + "def distance(lat_1, long_1, lat_2, long_2): \n", + " return ((lat_2-lat_1)**2+(long_2-long_1)**2)**0.5" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "63906fdc-10a5-4239-951e-e9551ff8d60b", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/opt/conda/lib/python3.10/site-packages/cudf/core/groupby/groupby.py:1449: RuntimeWarning: GroupBy.apply() performance scales poorly with number of groups. Got 171 groups. Some functions may perform better by passing engine='jit'\n", + " warnings.warn(\n" + ] + }, + { + "data": { + "text/html": [ + "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-style: italic\"> </span>\n", + "<span style=\"font-style: italic\"> Total time elapsed: 20.268 seconds </span>\n", + "<span style=\"font-style: italic\"> </span>\n", + "<span style=\"font-style: italic\"> Stats </span>\n", + "<span style=\"font-style: italic\"> </span>\n", + "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n", + "┃<span style=\"font-weight: bold\"> Line no. </span>┃<span style=\"font-weight: bold\"> Line </span>┃<span style=\"font-weight: bold\"> GPU TIME(s) </span>┃<span style=\"font-weight: bold\"> CPU TIME(s) </span>┃\n", + "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n", + "│ 3 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> distance_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">groupby(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'county'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">)[[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'lat'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'long'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]]</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">apply(</span><span style=\"color: #66d9ef; text-decoration-color: #66d9ef; background-color: #272822\">lambda</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> x: d…</span> │ 3.190437470 │ │\n", + "│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n", + "│ 4 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> df[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'R_1'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">distance_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">reset_index(level</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #ae81ff; text-decoration-color: #ae81ff; background-color: #272822\">0</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, drop</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #66d9ef; text-decoration-color: #66d9ef; background-color: #272822\">True</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">)</span><span style=\"background-color: #272822\"> </span> │ 0.613676883 │ │\n", + "│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n", + "└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n", + "</pre>\n" + ], + "text/plain": [ + "\u001b[3m \u001b[0m\n", + "\u001b[3m Total time elapsed: 20.268 seconds \u001b[0m\n", + "\u001b[3m \u001b[0m\n", + "\u001b[3m Stats \u001b[0m\n", + "\u001b[3m \u001b[0m\n", + "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n", + "┃\u001b[1m \u001b[0m\u001b[1mLine no.\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mLine \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mGPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mCPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\n", + "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n", + "│ 3 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdistance_df\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mgroupby\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mcounty\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mlat\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mlong\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mapply\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;102;217;239;48;2;39;40;34mlambda\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mx\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m:\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34md…\u001b[0m │ 3.190437470 │ │\n", + "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n", + "│ 4 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mR_1\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdistance_df\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mreset_index\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mlevel\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m0\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdrop\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;102;217;239;48;2;39;40;34mTrue\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ 0.613676883 │ │\n", + "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n", + "└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "%%cudf.pandas.line_profile\n", + "# DO NOT CHANGE THIS CELL\n", + "\n", + "distance_df=df.groupby('county')[['lat', 'long']].apply(lambda x: distance(x['lat'], x['long'], x['lat'].mean(), x['long'].mean()))\n", + "df['R_1']=distance_df.reset_index(level=0, drop=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "fc047fb9-8121-40ce-91ba-7156f5fb98c3", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>lat</th>\n", + " <th>long</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>54.533638</td>\n", + " <td>-1.524400</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>54.426254</td>\n", + " <td>-1.465314</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>54.555199</td>\n", + " <td>-1.496417</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>54.547909</td>\n", + " <td>-1.572342</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4</th>\n", + " <td>54.477638</td>\n", + " <td>-1.605995</td>\n", + " </tr>\n", + " <tr>\n", + " <th>...</th>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " </tr>\n", + " <tr>\n", + " <th>339338</th>\n", + " <td>51.624126</td>\n", + " <td>-2.814977</td>\n", + " </tr>\n", + " <tr>\n", + " <th>339339</th>\n", + " <td>51.616467</td>\n", + " <td>-2.859555</td>\n", + " </tr>\n", + " <tr>\n", + " <th>339340</th>\n", + " <td>51.604843</td>\n", + " <td>-2.915709</td>\n", + " </tr>\n", + " <tr>\n", + " <th>339341</th>\n", + " <td>51.584824</td>\n", + " <td>-2.848425</td>\n", + " </tr>\n", + " <tr>\n", + " <th>339342</th>\n", + " <td>51.623924</td>\n", + " <td>-2.901835</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "<p>855 rows × 2 columns</p>\n", + "</div>" + ], + "text/plain": [ + " lat long\n", + "0 54.533638 -1.524400\n", + "1 54.426254 -1.465314\n", + "2 54.555199 -1.496417\n", + "3 54.547909 -1.572342\n", + "4 54.477638 -1.605995\n", + "... ... ...\n", + "339338 51.624126 -2.814977\n", + "339339 51.616467 -2.859555\n", + "339340 51.604843 -2.915709\n", + "339341 51.584824 -2.848425\n", + "339342 51.623924 -2.901835\n", + "\n", + "[855 rows x 2 columns]" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "display(df.groupby('county')[['lat', 'long']].head())" + ] + }, + { + "cell_type": "markdown", + "id": "447dcea4-3fc7-4501-9f67-17801f914875", + "metadata": {}, + "source": [ + "We can also define the function in-line. " + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "97a963d9-e61c-4af6-bd32-89a5b892b09b", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/opt/conda/lib/python3.10/site-packages/cudf/core/groupby/groupby.py:1449: RuntimeWarning: GroupBy.apply() performance scales poorly with number of groups. Got 171 groups. Some functions may perform better by passing engine='jit'\n", + " warnings.warn(\n" + ] + }, + { + "data": { + "text/html": [ + "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-style: italic\"> </span>\n", + "<span style=\"font-style: italic\"> Total time elapsed: 19.596 seconds </span>\n", + "<span style=\"font-style: italic\"> </span>\n", + "<span style=\"font-style: italic\"> Stats </span>\n", + "<span style=\"font-style: italic\"> </span>\n", + "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n", + "┃<span style=\"font-weight: bold\"> Line no. </span>┃<span style=\"font-weight: bold\"> Line </span>┃<span style=\"font-weight: bold\"> GPU TIME(s) </span>┃<span style=\"font-weight: bold\"> CPU TIME(s) </span>┃\n", + "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n", + "│ 3 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> df[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'R_2'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">groupby(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'county'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">)[[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'lat'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'long'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]]</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">apply(</span><span style=\"color: #66d9ef; text-decoration-color: #66d9ef; background-color: #272822\">lambda</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> x: ((x…</span> │ 3.577081005 │ │\n", + "│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n", + "└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n", + "</pre>\n" + ], + "text/plain": [ + "\u001b[3m \u001b[0m\n", + "\u001b[3m Total time elapsed: 19.596 seconds \u001b[0m\n", + "\u001b[3m \u001b[0m\n", + "\u001b[3m Stats \u001b[0m\n", + "\u001b[3m \u001b[0m\n", + "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n", + "┃\u001b[1m \u001b[0m\u001b[1mLine no.\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mLine \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mGPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mCPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\n", + "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n", + "│ 3 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mR_2\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mgroupby\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mcounty\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mlat\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mlong\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mapply\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;102;217;239;48;2;39;40;34mlambda\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mx\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m:\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mx\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m…\u001b[0m │ 3.577081005 │ │\n", + "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n", + "└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "%%cudf.pandas.line_profile\n", + "# DO NOT CHANGE THIS CELL\n", + "\n", + "df['R_2']=df.groupby('county')[['lat', 'long']].apply(lambda x: ((x['lat'].mean()-x['lat'])**2+(x['long'].mean()-x['long'])**2)**0.5).reset_index(level=0, drop=True)" + ] + }, + { + "cell_type": "markdown", + "id": "e15cfd17-9b05-4de4-9c9a-fa9167f2d05f", + "metadata": {}, + "source": [ + "**Note**: This is quite slow due to the iterative nature of the `.apply()` method. " + ] + }, + { + "cell_type": "markdown", + "id": "7b7dc9d6-8c1d-4e1f-a6c3-85d036c62c26", + "metadata": {}, + "source": [ + "### `.transform()` ###\n", + "The `.transform()` method aggregates each group, and broadcasts the result to the group size, resulting in a DataFrame that is the same size and index as the input DataFrame. Underneath the hood, the `.transform()` method passes each column individually as a Series to the function. \n", + "\n", + "Below we group the DataFrame by `county` and transform the columns `lat` and `long` using `mean`. We will subtract the transformed mean from the original columns, then apply the distance formula to calculate the resulting distance. By keeping the DataFrame the same shape, we can perform cuDF operations quickly, resulting in performance gain. " + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "b1a3b520-1da5-4486-b024-ff9c4d3e1df3", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "# DO NOT CHANGE THIS CELL\n", + "# make data types more precise\n", + "df[['lat', 'long']]=df[['lat', 'long']].astype('float64')" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "d6aa82c0-e7df-4f6a-837a-a9bdd3657787", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/html": [ + "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-style: italic\"> </span>\n", + "<span style=\"font-style: italic\"> Total time elapsed: 1.283 seconds </span>\n", + "<span style=\"font-style: italic\"> </span>\n", + "<span style=\"font-style: italic\"> Stats </span>\n", + "<span style=\"font-style: italic\"> </span>\n", + "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n", + "┃<span style=\"font-weight: bold\"> Line no. </span>┃<span style=\"font-weight: bold\"> Line </span>┃<span style=\"font-weight: bold\"> GPU TIME(s) </span>┃<span style=\"font-weight: bold\"> CPU TIME(s) </span>┃\n", + "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n", + "│ 3 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> c</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'lat'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'long'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]</span><span style=\"background-color: #272822\"> </span> │ │ │\n", + "│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n", + "│ 4 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> df[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'R_3'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">((df[c] </span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">-</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">groupby(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'county'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">)[c]</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">transform(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'mean'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">)) </span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">**</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #ae81ff; text-decoration-color: #ae81ff; background-color: #272822\">2</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">…</span> │ 1.137889476 │ │\n", + "│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n", + "└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n", + "</pre>\n" + ], + "text/plain": [ + "\u001b[3m \u001b[0m\n", + "\u001b[3m Total time elapsed: 1.283 seconds \u001b[0m\n", + "\u001b[3m \u001b[0m\n", + "\u001b[3m Stats \u001b[0m\n", + "\u001b[3m \u001b[0m\n", + "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n", + "┃\u001b[1m \u001b[0m\u001b[1mLine no.\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mLine \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mGPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mCPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\n", + "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n", + "│ 3 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mc\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mlat\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mlong\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n", + "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n", + "│ 4 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mR_3\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mc\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m-\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mgroupby\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mcounty\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mc\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mtransform\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mmean\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m*\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m*\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m2\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m…\u001b[0m │ 1.137889476 │ │\n", + "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n", + "└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "%%cudf.pandas.line_profile\n", + "# DO NOT CHANGE THIS CELL\n", + "\n", + "c=['lat', 'long']\n", + "df['R_3']=((df[c] - df.groupby('county')[c].transform('mean')) ** 2).sum(axis=1) ** 0.5" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "05a40a5f-138b-4535-a86b-d07271d8fa1c", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>age</th>\n", + " <th>sex</th>\n", + " <th>county</th>\n", + " <th>lat</th>\n", + " <th>long</th>\n", + " <th>name</th>\n", + " <th>age_bucket</th>\n", + " <th>R_1</th>\n", + " <th>R_2</th>\n", + " <th>R_3</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>0</td>\n", + " <td>m</td>\n", + " <td>DARLINGTON</td>\n", + " <td>54.533638</td>\n", + " <td>-1.524400</td>\n", + " <td>FRANCIS</td>\n", + " <td>0</td>\n", + " <td>0.043671</td>\n", + " <td>0.043671</td>\n", + " <td>0.043671</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>0</td>\n", + " <td>m</td>\n", + " <td>DARLINGTON</td>\n", + " <td>54.426254</td>\n", + " <td>-1.465314</td>\n", + " <td>EDWARD</td>\n", + " <td>0</td>\n", + " <td>0.149973</td>\n", + " <td>0.149973</td>\n", + " <td>0.149973</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>0</td>\n", + " <td>m</td>\n", + " <td>DARLINGTON</td>\n", + " <td>54.555199</td>\n", + " <td>-1.496417</td>\n", + " <td>TEDDY</td>\n", + " <td>0</td>\n", + " <td>0.074265</td>\n", + " <td>0.074265</td>\n", + " <td>0.074265</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>0</td>\n", + " <td>m</td>\n", + " <td>DARLINGTON</td>\n", + " <td>54.547909</td>\n", + " <td>-1.572342</td>\n", + " <td>ANGUS</td>\n", + " <td>0</td>\n", + " <td>0.013108</td>\n", + " <td>0.013108</td>\n", + " <td>0.013108</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4</th>\n", + " <td>0</td>\n", + " <td>m</td>\n", + " <td>DARLINGTON</td>\n", + " <td>54.477638</td>\n", + " <td>-1.605995</td>\n", + " <td>CHARLIE</td>\n", + " <td>0</td>\n", + " <td>0.069230</td>\n", + " <td>0.069230</td>\n", + " <td>0.069230</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " age sex county lat long name age_bucket R_1 \\\n", + "0 0 m DARLINGTON 54.533638 -1.524400 FRANCIS 0 0.043671 \n", + "1 0 m DARLINGTON 54.426254 -1.465314 EDWARD 0 0.149973 \n", + "2 0 m DARLINGTON 54.555199 -1.496417 TEDDY 0 0.074265 \n", + "3 0 m DARLINGTON 54.547909 -1.572342 ANGUS 0 0.013108 \n", + "4 0 m DARLINGTON 54.477638 -1.605995 CHARLIE 0 0.069230 \n", + "\n", + " R_2 R_3 \n", + "0 0.043671 0.043671 \n", + "1 0.149973 0.149973 \n", + "2 0.074265 0.074265 \n", + "3 0.013108 0.013108 \n", + "4 0.069230 0.069230 " + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "858b309c-4533-4798-a22c-fe6858257cbc", + "metadata": {}, + "source": [ + "Although the `.apply()` method is more flexible and can handle complex operations, it is generally slower. On the other hand, the `.transform()` method can be much faster. When we design the procedures to use vector operations, we will realize significant performance benefits. \n", + "\n", + "**Note**: `Groupby.apply()` doesn't scale well with the number of groups, therefore this performance difference will be more pronounced with higher number of groups. " + ] + }, + { + "cell_type": "markdown", + "id": "7bc715eb-e357-4a92-8778-a144007a9697", + "metadata": {}, + "source": [ + "## Pivot Table ##\n", + "Pivot tables allow us to summarize and aggregate large datasets into a more manageable format for analysis. When using `DataFrame.pivot_table()`, we provide the `index`, `columns`, and `values` arguments, as well as `aggfunc`. This will group the data based on `index` and `columns`, and perform the aggregation on `values`. We can apply multiple aggregation functions, which is generally faster and more memory-efficient than manual grouping and aggregation for large datasets. \n", + "\n", + "Below we create a pivot table that counts the number of each sex in each county. Furthermore, we derive the percentage of the total for each county. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a6c8cd25-9892-458e-a9a0-8af10647f9c7", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "%%cudf.pandas.line_profile\n", + "# DO NOT CHANGE THIS CELL\n", + "\n", + "pvt_tbl=df[['county', 'sex', 'name']].pivot_table(index=['county'], columns=['sex'], values='name', aggfunc='count')\n", + "pvt_tbl=pvt_tbl.apply(lambda x: x/sum(x), axis=1)\n", + "display(pvt_tbl)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9ba52ec8-18bd-4f30-a3dd-ccbdface4bb4", + "metadata": {}, + "outputs": [], + "source": [ + "import IPython\n", + "app = IPython.Application.instance()\n", + "app.kernel.do_shutdown(True)" + ] + }, + { + "cell_type": "markdown", + "id": "9e99bf3f-be76-4ebf-aed3-624d4e8695ac", + "metadata": {}, + "source": [ + "**Well Done!** Let's move to the [next notebook](1-06_data_visualization.ipynb). " + ] + }, + { + "cell_type": "markdown", + "id": "81e47f0a-547e-4714-878d-34eb9b75c835", + "metadata": {}, + "source": [ + "<img src=\"./images/DLI_Header.png\" width=400/>" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.15" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} |
