{ "cells": [ { "cell_type": "markdown", "id": "b53a7b12-538d-4459-b82a-a35c8c417849", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "ae497b71-bc43-471e-8970-88a1878e7cf9", "metadata": {}, "source": [ "# Fundamentals of Accelerated Data Science # " ] }, { "cell_type": "markdown", "id": "a149b6d1-1880-4a5d-9d71-f963d3097aa4", "metadata": {}, "source": [ "## 02 - Data Manipulation ##\n", "\n", "**Table of Contents**\n", "
\n", "This notebook explores the fundamentals of data acquisition and manipulation using DataFrame APIs, covering essential techniques for handling and processing datasets. This notebook covers the below sections: \n", "1. [Data Background](#Data-Background)\n", "1. [cuDF and pandas](#cuDF-and-pandas)\n", " * [pandas](#pandas)\n", " * [cuDF](#cuDF)\n", "3. [Data Acquisition](#Data-Acquisition)\n", "4. [Initial Data Exploration](#Initial-Data-Exploration)\n", "5. [Indexing and Data Selection with `.loc` Accessor](#Indexing-and-Data-Selection-with-.loc-Accessor)\n", "6. [Basic Operations](#Basic-Operations)\n", " * [Exercise #1 - Convert `county` Column to Title Case](#Exercise-#1---Convert-county-Column-to-Title-Case)\n", "7. [Aggregation](#Aggregation)\n", "8. [Applying User-Defined Functions (UDFs) with `.map()` and `.apply()`](#Applying-User-Defined-Functions-(UDFs)-with-.map()-and-.apply())\n", "9. [Filtering with `.loc` and Boolean Mask](#Filtering-with-.loc-and-Boolean-Mask)\n", " * [Exercise #2 - Counties North of Sunderland](#Exercise-#2---Counties-North-of-Sunderland)\n", "10. [Creating New Columns](#Creating-New-Columns)\n", "11. [pandas vs. cuDF](#pandas-vs.-cuDF)\n", "12. [cuDF pandas](#cuDF-pandas)\n", " * [Exercise #3 - Automatic Acceleration](#Exercise-#3---Automatic-Acceleration)" ] }, { "cell_type": "markdown", "id": "8b739635-4883-40b2-94e9-7a08f853871c", "metadata": {}, "source": [ "## Data Background ##\n", "For this workshop, we will be reading almost 60 million records (corresponding to the entire population of England and Wales) which were synthesized from official UK census data. " ] }, { "cell_type": "markdown", "id": "95e6bbed-1c08-4002-837c-392d5a12658f", "metadata": {}, "source": [ "## cuDF and pandas ##" ] }, { "cell_type": "markdown", "id": "050926cb-1dee-447a-9da8-49ebb1292d55", "metadata": {}, "source": [ "### pandas ###\n", "[pandas](https://pandas.pydata.org/) is a widely-used open-source library for data manipulation and analysis in Python. It provides high-performance, easy-to-use data structures and tools for working with structured data. It popularized the term DataFrame as a data structure for statistical computing. In data science, pandas is used for: \n", "* **Data loading and writing**: reads from and writes to various file formats like CSV, Excel, JSON, and SQL databases\n", "* **Data cleaning and processing/preprocessing**: helps users with handling missing data, merging datasets, and reshaping data\n", "* **Data analysis**: performs grouping, aggregating, and statistical operations\n", "\n", "**Note**: Data preprocessing refers to the process of transforming raw data into a format that is more suitable for analysis and other downstream tasks. " ] }, { "cell_type": "markdown", "id": "79e09f10-be1d-4ffe-9247-1e605e3f450f", "metadata": {}, "source": [ "### cuDF ###\n", "Similarly, [cuDF](https://docs.rapids.ai/api/cudf/stable/) is a Python GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data. cuDF is designed to accelerate data science workflows by utilizing the parallel processing power of GPUs, potentially offering significant speed improvements over CPU-based alternatives for large datasets. The key features of cuDF include: \n", "* **GPU Acceleration**: leverages NVIDIA GPUs for fast data processing and analysis\n", "* **pandas-like API**: provides users a familiar interface and transition to GPU-based computing\n", "* **Integration with other RAPIDS libraries**: works seamlessly with other GPU-accelerated tools in the RAPIDS ecosystem" ] }, { "cell_type": "markdown", "id": "ff5519e2-f77f-4160-b362-979301705733", "metadata": {}, "source": [ "**Note**: Both Pandas and cuDF serve similar purposes in data manipulation and analysis, but cuDF is specifically optimized for GPU acceleration, making it particularly useful for working with large datasets where performance is critical." ] }, { "cell_type": "markdown", "id": "770fb1d8-73c5-4c45-a1e4-599f66e6b833", "metadata": {}, "source": [ "## Data Acquisition ##\n", "In our context, data acquisition refers to the process of collecting and importing data from various sources into a Python environment for analysis, processing, and manipulation. Data can come from a variety of sources: \n", "* Local file in various formats\n", "* Databases\n", "* APIs\n", "* Web scraping\n", "\n", "It's worth noting that Python's rich ecosystem of libraries makes it versatile for acquiring data from various sources, allowing data scientists to work with diverse datasets efficiently. CPU processing will be responsible for acquiring data from APIs or Web Scraping. In most cases, network bandwidth will likely be the bottleneck. Furthermore, cuDF doesn't have a way to get transactions from SQL data bases directly into GPU memory. The recommended approach for reading data from a database is to first use CPU-based methods (i.e. pandas), then convert to cuDF for GPU-accelerated processing. \n", "\n", "Below we use the `head` linux command to display the beginning of the data file. This allows us to understand how to read the data correctly. " ] }, { "cell_type": "code", "execution_count": 1, "id": "247d2b96-1bce-4e26-89bd-d659df3528d7", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\"age\",\"sex\",\"county\",\"lat\",\"long\",\"name\"\n", "0,\"m\",\"DARLINGTON\",54.53364379,-1.524400639,\"FRANCIS\"\n", "0,\"m\",\"DARLINGTON\",54.42625551,-1.465313919,\"EDWARD\"\n", "0,\"m\",\"DARLINGTON\",54.55520036,-1.496417277,\"TEDDY\"\n", "0,\"m\",\"DARLINGTON\",54.54790635,-1.572341399,\"ANGUS\"\n" ] } ], "source": [ "# DO NOT CHANGE THIS CELL\n", "!head -n 5 data/uk_pop.csv" ] }, { "cell_type": "markdown", "id": "be7bd168-eb68-45cc-8009-64569974a187", "metadata": {}, "source": [ "One row will represent one person. We have information about their `age`, `sex`, `county`, location, and `name`. Using cuDF, the RAPIDS API providing a GPU-accelerated DataFrame, we can read data from [a variety of formats](https://rapidsai.github.io/projects/cudf/en/0.10.0/api.html#module-cudf.io.csv), including csv, json, parquet, feather, orc, and pandas DataFrames, among others. " ] }, { "cell_type": "code", "execution_count": 2, "id": "89c435b1-35d5-4971-ade1-549ae77d22db", "metadata": {}, "outputs": [], "source": [ "# DO NOT CHANGE THIS CELL\n", "import cudf\n", "import cupy as cp\n", "import numpy as np\n", "\n", "from datetime import datetime\n", "import random\n", "import time" ] }, { "cell_type": "markdown", "id": "9cb9faf3-4dc9-42bf-b481-98fb4155033e", "metadata": {}, "source": [ "Below we read the data from a local csv file directly into GPU memory with the `read_csv()` function. " ] }, { "cell_type": "code", "execution_count": 3, "id": "a343a943-fd64-45f6-abd5-a991810cf5f3", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Duration: 3.72 seconds\n" ] } ], "source": [ "# DO NOT CHANGE THIS CELL\n", "start=time.time()\n", "df=cudf.read_csv('./data/uk_pop.csv')\n", "print(f'Duration: {round(time.time()-start, 2)} seconds')" ] }, { "cell_type": "markdown", "id": "c406541c-884a-49c3-b5cb-7aaf21b60403", "metadata": {}, "source": [ "**Note**: Because of the sophisticated GPU memory management behind the scenes in cuDF, the first data load into a fresh RAPIDS memory environment is sometimes substantially slower than subsequent loads. The [RAPIDS Memory Manager](https://github.com/rapidsai/rmm) is preparing additional memory to accommodate the array of data science operations that we may be interested in using on the data, rather than allocating and deallocating the memory repeatedly throughout the workflow. \n", "\n", "Below we get the general information about the DataFrame with the `DataFrame.info()` method. " ] }, { "cell_type": "code", "execution_count": 5, "id": "18cd5602-9129-4809-a95f-1e30940558c5", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 58479894 entries, 0 to 58479893\n", "Data columns (total 6 columns):\n", " # Column Dtype\n", "--- ------ -----\n", " 0 age int64\n", " 1 sex object\n", " 2 county object\n", " 3 lat float64\n", " 4 long float64\n", " 5 name object\n", "dtypes: float64(2), int64(1), object(3)\n", "memory usage: 2.9 GB\n" ] } ], "source": [ "# DO NOT CHANGE THIS CELL\n", "df.info(memory_usage='deep')" ] }, { "cell_type": "markdown", "id": "a8289385-f1ac-4ccd-8ba4-6b127b200b42", "metadata": {}, "source": [ "The **DataFrame** is a two-dimensional labeled data structure. It's organized in rows and columns, similar to a spreadsheet or SQL table. Both rows and columns have labels. Rows are typically labeled with an index, while columns have column names. Data is aligned based on row and column labels when performing operations. This is useful for enabling highly efficient vectorized operations across columns or rows. A **Series** refers to a one-dimensional array and is typically associated with a single column of data with an index. \n", "\n", "There are ~60MM records across 6 columns. cuDF is able to read data from local files directly into the GPU very efficiently. By default, cuDF samples the dataset to infer the most appropriate data types for each columns. \n", "\n", "**Note**: The DataFrame has `.dtypes` and `.columns` attributes that can be used to get similar information. " ] }, { "cell_type": "markdown", "id": "af6127ab-437e-4a60-b9bd-5f9671c10602", "metadata": {}, "source": [ "## Initial Data Exploration ##\n", "Now that we have some data loaded, let's do some initial exploration. \n", "\n", "Below we preview the DataFrame with the `DataFrame.head()` method. " ] }, { "cell_type": "code", "execution_count": 6, "id": "7faf372e-644c-4120-8080-779f3a23a152", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agesexcountylatlongname
00mDARLINGTON54.533644-1.524401FRANCIS
10mDARLINGTON54.426256-1.465314EDWARD
20mDARLINGTON54.555200-1.496417TEDDY
30mDARLINGTON54.547906-1.572341ANGUS
40mDARLINGTON54.477639-1.605995CHARLIE
\n", "
" ], "text/plain": [ " age sex county lat long name\n", "0 0 m DARLINGTON 54.533644 -1.524401 FRANCIS\n", "1 0 m DARLINGTON 54.426256 -1.465314 EDWARD\n", "2 0 m DARLINGTON 54.555200 -1.496417 TEDDY\n", "3 0 m DARLINGTON 54.547906 -1.572341 ANGUS\n", "4 0 m DARLINGTON 54.477639 -1.605995 CHARLIE" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DO NOT CHANGE THIS CELL\n", "df.head()" ] }, { "cell_type": "markdown", "id": "ee4649b1-2730-47e9-a9d7-331fe7514241", "metadata": {}, "source": [ "## Indexing and Data Selection with `.loc` Accessor ##\n", "The `.loc` accessor in cuDF DataFrames is used for label-based indexing and selection of data. It allows us to access and manipulate data in a DataFrame based on row and column labels. We can use `DataFrame.loc[row_label(s), column_label(s)]` to access a group of rows and columns. When selecting multiple labels, a list (`[]`) is used. Furthermore, we can use the slicing operator (`:`, i.e. `start:end`) to specify a range of elements. " ] }, { "cell_type": "code", "execution_count": 7, "id": "40d63289-8f23-424c-b3f4-0b7098c9b5a1", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "0" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "----------------------------------------\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agesexcounty
00mDARLINGTON
10mDARLINGTON
20mDARLINGTON
\n", "
" ], "text/plain": [ " age sex county\n", "0 0 m DARLINGTON\n", "1 0 m DARLINGTON\n", "2 0 m DARLINGTON" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "----------------------------------------\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agesexcounty
00mDARLINGTON
10mDARLINGTON
20mDARLINGTON
30mDARLINGTON
40mDARLINGTON
50mDARLINGTON
\n", "
" ], "text/plain": [ " age sex county\n", "0 0 m DARLINGTON\n", "1 0 m DARLINGTON\n", "2 0 m DARLINGTON\n", "3 0 m DARLINGTON\n", "4 0 m DARLINGTON\n", "5 0 m DARLINGTON" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "----------------------------------------\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agesexcountylatlongname
00mDARLINGTON54.533644-1.524401FRANCIS
10mDARLINGTON54.426256-1.465314EDWARD
20mDARLINGTON54.555200-1.496417TEDDY
30mDARLINGTON54.547906-1.572341ANGUS
40mDARLINGTON54.477639-1.605995CHARLIE
50mDARLINGTON54.522900-1.599255VICTOR
60mDARLINGTON54.501872-1.667874EAMONN
70mDARLINGTON54.554709-1.494506HARRY
80mDARLINGTON54.602288-1.586457HECTOR
90mDARLINGTON54.489992-1.652537THEODORE
100mDARLINGTON54.551315-1.519593MUHAMMAD
\n", "
" ], "text/plain": [ " age sex county lat long name\n", "0 0 m DARLINGTON 54.533644 -1.524401 FRANCIS\n", "1 0 m DARLINGTON 54.426256 -1.465314 EDWARD\n", "2 0 m DARLINGTON 54.555200 -1.496417 TEDDY\n", "3 0 m DARLINGTON 54.547906 -1.572341 ANGUS\n", "4 0 m DARLINGTON 54.477639 -1.605995 CHARLIE\n", "5 0 m DARLINGTON 54.522900 -1.599255 VICTOR\n", "6 0 m DARLINGTON 54.501872 -1.667874 EAMONN\n", "7 0 m DARLINGTON 54.554709 -1.494506 HARRY\n", "8 0 m DARLINGTON 54.602288 -1.586457 HECTOR\n", "9 0 m DARLINGTON 54.489992 -1.652537 THEODORE\n", "10 0 m DARLINGTON 54.551315 -1.519593 MUHAMMAD" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# DO NOT CHANGE THIS CELL\n", "# get first cell\n", "display(df.loc[0, 'age'])\n", "print('-'*40)\n", "\n", "# get multiple rows and columns\n", "display(df.loc[[0, 1, 2], ['age', 'sex', 'county']])\n", "print('-'*40)\n", "\n", "# slice a range of rows and columns\n", "display(df.loc[0:5, 'age':'county'])\n", "print('-'*40)\n", "\n", "# slice a range of rows and columns\n", "display(df.loc[:10, :'name'])" ] }, { "cell_type": "markdown", "id": "a451118f-b986-49b6-ae03-2526e44007a7", "metadata": {}, "source": [ "**Note**: `df[column_label(s)]` is another way to access specific columns, similar to `df.loc[:, column_labels]`. " ] }, { "cell_type": "markdown", "id": "055f8828-db5b-419a-aab5-bf149b9fd829", "metadata": {}, "source": [ "## Basic Operations ##\n", "cuDF support a wide range of operations for numerical data. Although strings are not a data type traditionally associated with GPUs, cuDF supports powerful accelerated string operations.\n", "* Numerical operations:\n", " * Arithmetic operations: addition, subtraction, multiplication, division\n", "* String operations:\n", " * Case conversion: `.upper()`, `.lower()`, `.title()`\n", " * String manipulation: concatenation, substring, extraction, padding\n", " * Pattern matching: `contains()`\n", " * Splitting: `.split()`\n", "* Comparison operations: greater than, less than, equal to, etc.\n", "\n", "These operations will be performed element-wise for each row. This allows for efficient **vectorized operations** across entire columns. These operations are implemented as vector operations instead of iteration because vector operations can be applied to entire arrays of data, instead of iterating through each element individually. Vectorization is significantly faster than iterating over elements, especially for large datasets. When operating on multiple columns, operations are aligned by index, ensuring that calculations are performed on the correct corresponding elements across columns. These element-wise operations are typically highly optimized and can be much faster than explicit loops, especially for large datasets. We can get the underlying array of data with the `.values` attribute. This is useful when we want to perform operations on the underlying data. \n", "\n", "**Note**: Iterating over a cuDF Series, DataFrame or Index is not supported. This is because iterating over data that resides on the GPU will yield extremely poor performance, as GPUs are optimized for highly parallel operations rather than sequential operations. \n", "\n", "Below we calculate the birth year for each person. " ] }, { "cell_type": "code", "execution_count": 8, "id": "d4286299-3a43-4e53-a9fb-04e1f20a40a4", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "0 2025\n", "1 2025\n", "2 2025\n", "3 2025\n", "4 2025\n", " ... \n", "58479889 1935\n", "58479890 1935\n", "58479891 1935\n", "58479892 1935\n", "58479893 1935\n", "Name: age, Length: 58479894, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "array([2025, 2025, 2025, ..., 1935, 1935, 1935])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DO NOT CHANGE THIS CELL\n", "# get current year\n", "current_year=datetime.now().year\n", "\n", "# derive the birth year\n", "display(current_year-df.loc[:, 'age'])\n", "\n", "# get the age array (CuPy for cuDF)\n", "age_ary=df.loc[:, 'age'].values\n", "\n", "# derive the birth year\n", "current_year-age_ary" ] }, { "cell_type": "markdown", "id": "00213228-e32e-4a88-853e-eef53fad4da8", "metadata": {}, "source": [ "When performing operations between a DataFrame and a scalar value, the scalar is \"broadcast\" to match the shape of the DataFrame, effectively applying it to each element. \n", "\n", "```\n", "current_year - df.loc[:, 'age']\n", "-------------------------------\n", " (scalar) (array) \n", " 2024, - 0\n", " 2024, - 0\n", " 2024, - 0\n", " 2024, - 0\n", " 2024, - 0\n", " ... - ...\n", "```\n", "\n", "This partially explains why cuDF provides significant performance improvements over pandas, especially for large datasets. The parallel processing architecture of GPUs are designed with thousands of small, specialized cores that can execute many operations simultaneously. This architecture is ideal for vectorized operations, which perform the same instruction on multiple data elements in parallel. " ] }, { "cell_type": "markdown", "id": "760a2729-9c7a-4602-83ac-5e171cc4f5f9", "metadata": {}, "source": [ "\n", "### Exercise #1 - Convert `county` Column to Title Case ###\n", "As it stands, all of the counties are UPPERCASE. We want to convert the `county` column to title case. \n", "\n", "**Instructions**:
\n", "* Modify the `` only and execute the below cell to convert the `county` column to title case. " ] }, { "cell_type": "code", "execution_count": 10, "id": "5365b5b6-6a00-47e7-8839-0cc8b183c6d7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Darlington\n", "1 Darlington\n", "2 Darlington\n", "3 Darlington\n", "4 Darlington\n", " ... \n", "58479889 Newport\n", "58479890 Newport\n", "58479891 Newport\n", "58479892 Newport\n", "58479893 Newport\n", "Name: county, Length: 58479894, dtype: object" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['county'].str.title()" ] }, { "cell_type": "raw", "id": "738892d3-4bab-4404-af83-b8623804ca5d", "metadata": {}, "source": [ "\n", "df['county'].str.title()" ] }, { "cell_type": "markdown", "id": "35a2520b-5eec-4d59-a956-3f31b43a98b2", "metadata": {}, "source": [ "Click ... for solution. " ] }, { "cell_type": "markdown", "id": "d163438b-9993-41e7-856c-76101135a9ad", "metadata": {}, "source": [ "Performing comparison operations or applying conditions create boolean values (`True`/`False`) that corresponds element-wise. \n", "\n", "Below we check if each person is an adult. " ] }, { "cell_type": "code", "execution_count": 11, "id": "481218f1-ec09-4776-bda6-b039ccc190ab", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", " ... \n", "58479889 True\n", "58479890 True\n", "58479891 True\n", "58479892 True\n", "58479893 True\n", "Name: age, Length: 58479894, dtype: bool" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DO NOT CHANGE THIS CELL\n", "df['age']>=18" ] }, { "cell_type": "markdown", "id": "0b98c380-828a-404b-9fb2-1d215337eff0", "metadata": {}, "source": [ "## Aggregation ##\n", "Aggregation is an important operation for data science tasks, allowing us to summarize and analyze grouped data. It's commonly used for tasks like calculating totals, averages, counts, etc. cuDF supports common aggregations like `.sum()`, `.mean()`, `.min()`, `.max()`, `.count()`, `.std()`(standard deviation), etc. It also supports more advanced aggregations like `.quantile()` and `.corr()` (correlation). With the `axis` parameter, aggregation operations can be applied column-wise (`0`) or row-wise (`1`). \n", "\n", "When the aggregation is implemented as a vector operation, specifically a reduction operation, it is very efficient on the GPU becasue a large number of data elements can be processed simultaneously and in parallel. Column-wise operations also benefit from the [Apache Arrow columnar memory format](https://arrow.apache.org/docs/format/Columnar.html). \n", "\n", "

\n", "\n", "Below we calculate the arithmetic mean of `lat` and `long` to get an approximate center. " ] }, { "cell_type": "code", "execution_count": 14, "id": "bee8be82-8631-4863-a1fa-e46eed47e334", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "lat 52.350600\n", "long -1.304956\n", "dtype: float64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DO NOT CHANGE THIS CELL\n", "df[['lat', 'long']].mean()" ] }, { "cell_type": "markdown", "id": "8a53f0f4-7dc5-40fd-af07-3e82d6556393", "metadata": {}, "source": [ "## Applying User-Defined Functions (UDFs) with `.map()` and `.apply()` ##\n", "The `.map()` and `.apply()` methods are the primary ways of applying user-defined functions element-wise, and row or column-wise, respectively. We can pass a callable function (built-in or user-defined) as the argument, which is then applied to the entire data structure. Not all operations can be vectorized, especially complex custom logic. In such cases, methods like `.apply()` or custom UDFs might be necessary.\n", "\n", "Below we use `.apply()` to check if each person is an adult. " ] }, { "cell_type": "code", "execution_count": 15, "id": "c65e80f8-1cc3-453c-85f2-910dab451228", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 0\n", "2 0\n", "3 0\n", "4 0\n", " ..\n", "58479889 1\n", "58479890 1\n", "58479891 1\n", "58479892 1\n", "58479893 1\n", "Length: 58479894, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Duration: 2.19 seconds\n" ] } ], "source": [ "# DO NOT CHNAGE THIS CELL\n", "# define a function to check if age is greater than or equal to 18\n", "start=time.time()\n", "def is_adult(row): \n", " if row['age']>=18: \n", " return 1\n", " else: \n", " return 0\n", "\n", "# derive the birth year\n", "display(df.apply(is_adult, axis=1))\n", "print(f'Duration: {round(time.time()-start, 2)} seconds')" ] }, { "cell_type": "markdown", "id": "02828781-6f5b-49d5-87a5-aa5ef08adf15", "metadata": {}, "source": [ "We can also use a [**lambda function**](https://docs.python.org/3/glossary.html#term-lambda) when the function is simple. Lambda functions are limited to a single expression but can include a conditional statement and mulitple arguments. " ] }, { "cell_type": "code", "execution_count": 16, "id": "31d6e7fb-f435-4b1f-8e74-e732cc406b51", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 0\n", "2 0\n", "3 0\n", "4 0\n", " ..\n", "58479889 1\n", "58479890 1\n", "58479891 1\n", "58479892 1\n", "58479893 1\n", "Length: 58479894, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Duration: 0.05 seconds\n" ] } ], "source": [ "# DO NOT CHANGE THIS CELL\n", "# derive the birth year\n", "start=time.time()\n", "display(df.apply(lambda x: 1 if x['age']>=18 else 0, axis=1))\n", "print(f'Duration: {round(time.time()-start, 2)} seconds')" ] }, { "cell_type": "markdown", "id": "475cc2f1-4dc9-4492-aab5-9e51ebf54ebb", "metadata": {}, "source": [ "**Note**: The `.apply()` function in pandas accepts any user-defined function that can include arbitrary operations that are applied to each value of a Series and DataFrame. cuDF also supports `.apply()`, but it relies on Numba to JIT compile the UDF (not in scope) and execute it on the GPU. This can be extremely fast, but imposes a few limitations on what operations are allowed in the UDF. See the docs on [UDFs](https://docs.rapids.ai/api/cudf/stable/user_guide/guide-to-udfs/) for details." ] }, { "cell_type": "code", "execution_count": 17, "id": "ecadefaa-380c-412c-87af-05c63d3f7871", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 0\n", "2 0\n", "3 0\n", "4 0\n", " ..\n", "58479889 1\n", "58479890 1\n", "58479891 1\n", "58479892 1\n", "58479893 1\n", "Name: age, Length: 58479894, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Duration: 0.02 seconds\n" ] } ], "source": [ "# DO NOT CHANGE THIS CELL\n", "# derive the birth year\n", "start=time.time()\n", "display((df['age']>=18).astype('int'))\n", "print(f'Duration: {round(time.time()-start, 2)} seconds')" ] }, { "cell_type": "markdown", "id": "84f0c41b-ad56-4cc7-b074-f2390f41cc70", "metadata": {}, "source": [ "Below we use `Series.map()` to determine the number of characters in each person's name. " ] }, { "cell_type": "code", "execution_count": null, "id": "c4a3e4e1-fd83-4024-bcbf-29216c11016f", "metadata": { "scrolled": true }, "outputs": [], "source": [ "# DO NOT CHANGE THIS CELL\n", "df['name'].map(lambda x: len(x))" ] }, { "cell_type": "markdown", "id": "7c717ada-69b2-4982-b81b-8594af6d9bf1", "metadata": {}, "source": [ "## Filtering with `.loc` and Boolean Mask ##\n", "A boolean mask is an array of `True`/`False` values that corresponds element-wise to another array or data structure. It's used for filtering and selecting data based on certain conditions. In this context, the mask can be used to index or filter a DataFrame with `.loc`, selecting only the elements where the mask is `True`. \n", "\n", "**Note**: Boolean masking is often more efficient than iterative approaches, especially for large datasets, as it leverages vectorized operations. \n", "\n", "Below we use the `.loc` accessor and a boolean mask to filter people whose names start with an `E`. " ] }, { "cell_type": "code", "execution_count": 18, "id": "cf9cc540-1de6-4e50-986a-5bf9bd9056a6", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agesexcountylatlongname
10mDARLINGTON54.426256-1.465314EDWARD
60mDARLINGTON54.501872-1.667874EAMONN
340mDARLINGTON54.483065-1.501312ETHAN
450mDARLINGTON54.640205-1.558986ELVIN
490mDARLINGTON54.575450-1.600592EDWARD
.....................
5847985990fNEWPORT51.576452-2.891774EDIE
5847986790fNEWPORT51.555083-3.080259ELEANOR
5847987190fNEWPORT51.515820-2.839532EMERSON
5847987590fNEWPORT51.510140-3.004406ELLA
5847988590fNEWPORT51.586575-2.799302ELIN
\n", "

5081794 rows × 6 columns

\n", "
" ], "text/plain": [ " age sex county lat long name\n", "1 0 m DARLINGTON 54.426256 -1.465314 EDWARD\n", "6 0 m DARLINGTON 54.501872 -1.667874 EAMONN\n", "34 0 m DARLINGTON 54.483065 -1.501312 ETHAN\n", "45 0 m DARLINGTON 54.640205 -1.558986 ELVIN\n", "49 0 m DARLINGTON 54.575450 -1.600592 EDWARD\n", "... ... .. ... ... ... ...\n", "58479859 90 f NEWPORT 51.576452 -2.891774 EDIE\n", "58479867 90 f NEWPORT 51.555083 -3.080259 ELEANOR\n", "58479871 90 f NEWPORT 51.515820 -2.839532 EMERSON\n", "58479875 90 f NEWPORT 51.510140 -3.004406 ELLA\n", "58479885 90 f NEWPORT 51.586575 -2.799302 ELIN\n", "\n", "[5081794 rows x 6 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DO NOT CHANGE THIS CELL\n", "boolean_mask=df['name'].str.startswith('E')\n", "df.loc[boolean_mask]" ] }, { "cell_type": "markdown", "id": "5d76a6dd-5d0b-4fd2-868a-235255375af0", "metadata": {}, "source": [ "Multiple conditions can be combined using logical operators (`&` and `|`). \n", "\n", "**Note**: When using multiple conditions, it's important to wrap each condition in parentheses (`(` and `)`) to ensure correct order of operations. \n", "\n", "Below we use the `.loc` accessor and multiple conditions to filter adults whose names start with an `E`. " ] }, { "cell_type": "code", "execution_count": 19, "id": "03713403-6575-437d-99f0-c7f8ec3cb13b", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agesexcountylatlongname
10mDARLINGTON54.426256-1.465314EDWARD
60mDARLINGTON54.501872-1.667874EAMONN
340mDARLINGTON54.483065-1.501312ETHAN
450mDARLINGTON54.640205-1.558986ELVIN
490mDARLINGTON54.575450-1.600592EDWARD
.....................
5847988990fNEWPORT51.626744-2.859381FREYA
5847989090fNEWPORT51.546043-2.897815GEORGINA
5847989190fNEWPORT51.605268-2.849656REBECCA
5847989290fNEWPORT51.554649-2.934364JESSICA
5847989390fNEWPORT51.578787-2.827954FLORENCE
\n", "

47085782 rows × 6 columns

\n", "
" ], "text/plain": [ " age sex county lat long name\n", "1 0 m DARLINGTON 54.426256 -1.465314 EDWARD\n", "6 0 m DARLINGTON 54.501872 -1.667874 EAMONN\n", "34 0 m DARLINGTON 54.483065 -1.501312 ETHAN\n", "45 0 m DARLINGTON 54.640205 -1.558986 ELVIN\n", "49 0 m DARLINGTON 54.575450 -1.600592 EDWARD\n", "... ... .. ... ... ... ...\n", "58479889 90 f NEWPORT 51.626744 -2.859381 FREYA\n", "58479890 90 f NEWPORT 51.546043 -2.897815 GEORGINA\n", "58479891 90 f NEWPORT 51.605268 -2.849656 REBECCA\n", "58479892 90 f NEWPORT 51.554649 -2.934364 JESSICA\n", "58479893 90 f NEWPORT 51.578787 -2.827954 FLORENCE\n", "\n", "[47085782 rows x 6 columns]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DO NOT CHANGE THIS CELL\n", "df[(df['age']>=18) | (df['name'].str.startswith('E'))]" ] }, { "cell_type": "markdown", "id": "69b7f7ef-b270-4eae-8852-d6f48bf83086", "metadata": {}, "source": [ "\n", "### Exercise #2 - Counties North of Sunderland ###\n", "This exercise will require to use the `.loc` accessor, and several of the techniques described above. We want to identify the latitude of the northernmost resident of Sunderland county (the person with the maximum `lat` value), and then determine which counties have any residents north of this resident. Use the `Series.unique()` method of to de-duplicate the result.\n", "\n", "**Instructions**:
\n", "* Modify the `` only and execute the below cell to identify counties north of Sunderland. " ] }, { "cell_type": "code", "execution_count": 20, "id": "8c1394db-6bca-473b-a053-61a6066bd835", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 COUNTY DURHAM\n", "1 NORTHUMBERLAND\n", "2 GATESHEAD\n", "3 NEWCASTLE UPON TYNE\n", "4 NORTH TYNESIDE\n", "5 SOUTH TYNESIDE\n", "6 CUMBRIA\n", "7 NORTH YORKSHIRE\n", "Name: county, dtype: object" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sunderland_residents=df.loc[df['county'].str.upper() == 'SUNDERLAND']\n", "northmost_sunderland_lat=sunderland_residents['lat'].max()\n", "df.loc[df['lat'] > northmost_sunderland_lat]['county'].unique()" ] }, { "cell_type": "raw", "id": "0cf99881-1a27-409a-822b-7e62b5953f3a", "metadata": {}, "source": [ "\n", "sunderland_residents=df.loc[df['county'] == 'SUNDERLAND']\n", "northmost_sunderland_lat=sunderland_residents['lat'].max()\n", "df.loc[df['lat'] > northmost_sunderland_lat]['county'].unique()" ] }, { "cell_type": "markdown", "id": "0594efe7-97d4-4884-bffb-a26f5144ad54", "metadata": {}, "source": [ "Click ... for solution. " ] }, { "cell_type": "markdown", "id": "d5ead43e-37bc-4a3d-a64f-bb82ad01ad99", "metadata": {}, "source": [ "## Creating New Columns ##" ] }, { "cell_type": "markdown", "id": "45ada779-bc14-4fba-88d8-62c282345a63", "metadata": {}, "source": [ "We can create new columns by assigning values to the column label. The new column should have the same number of rows as the existing DataFrame. Typically, we create new columns by performing operations on existing columns. \n", "\n", "Below we create a few additional columns. " ] }, { "cell_type": "code", "execution_count": 21, "id": "977fdb2b-dbf1-4842-ab0f-31b9af65e0d1", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agesexcountylatlongnamebirth_yearsex_normalizecounty_normalize
00mDARLINGTON54.533644-1.524401Francis2025MDarlington
10mDARLINGTON54.426256-1.465314Edward2025MDarlington
20mDARLINGTON54.555200-1.496417Teddy2025MDarlington
30mDARLINGTON54.547906-1.572341Angus2025MDarlington
40mDARLINGTON54.477639-1.605995Charlie2025MDarlington
\n", "
" ], "text/plain": [ " age sex county lat long name birth_year \\\n", "0 0 m DARLINGTON 54.533644 -1.524401 Francis 2025 \n", "1 0 m DARLINGTON 54.426256 -1.465314 Edward 2025 \n", "2 0 m DARLINGTON 54.555200 -1.496417 Teddy 2025 \n", "3 0 m DARLINGTON 54.547906 -1.572341 Angus 2025 \n", "4 0 m DARLINGTON 54.477639 -1.605995 Charlie 2025 \n", "\n", " sex_normalize county_normalize \n", "0 M Darlington \n", "1 M Darlington \n", "2 M Darlington \n", "3 M Darlington \n", "4 M Darlington " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DO NOT CHANGE THIS CELL\n", "# get current year\n", "current_year=datetime.now().year\n", "\n", "# numerical operations\n", "df['birth_year']=current_year-df['age']\n", "\n", "# string operations\n", "df['sex_normalize']=df['sex'].str.upper()\n", "df['county_normalize']=df['county'].str.title().str.replace(' ', '_')\n", "df['name']=df['name'].str.title()\n", "\n", "# preview\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 22, "id": "66c5d332-a6ef-4f8d-9560-fa860ea1679a", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'status': 'ok', 'restart': True}" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DO NOT CHANGE THIS CELL\n", "import IPython\n", "app = IPython.Application.instance()\n", "app.kernel.do_shutdown(True)" ] }, { "cell_type": "markdown", "id": "cd7bed0a-d46d-402a-884e-e2e17c98738c", "metadata": {}, "source": [ "## pandas vs. cuDF ##\n", "Except for being much more performant with large datasets, cuDF looks and feels a lot like Pandas. By way of review, cuDF and pandas share the below similarities: \n", "* **API similarity**: cuDF provides a pandas-like API that is familiar to data engineers and data scientists. It aims to implement many of the same functions and operations as pandas, allowing users to easily accelerate their existing pandas workflows.\n", "* **Similar operations**: cuDF implements many common pandas operations such as filtering, joining, aggregating, and groupby.\n", "\n", "

\n", "\n", "Comparing to pandas, cuDF tends to perform better for large datasets because of the follow features: \n", "* GPUs excel at parallel computation, which is advantageous for many data science and machine learning tasks.\n", "* GPUs typically have much higher memory bandwidth than CPUs, allowing for faster data access in memory-bound operations.\n", "* cuDF leverages GPU's ability to perform vectorized operations efficiently, which is particularly beneficial for large datasets.\n", "* cuDF uses a columnar data format, which can lead to more efficient memory access patterns on GPUs. When performing data operations on cuDF Dataframes, column operations are typically much more performant than row-wise operations.\n", "\n", "**Note**: It's important to note that the performance advantage of cuDF over pandas can vary depending on the specific operation, data size, and hardware configuration. For smaller datasets or simpler operations, the overhead of GPU initialization might make pandas on CPU faster." ] }, { "cell_type": "markdown", "id": "c9c00b32-f5f2-46de-a50a-fc54f3244dab", "metadata": {}, "source": [ "## cuDF pandas ##\n", "Starting with version `23.10.01`, cuDF introduced a **pandas accelerator mode** (`cudf.pandas`) that supports 100% of the pandas API. This mode allows users to accelerate pandas code on the GPU without requiring any code changes. Not all operations can be performed on the GPU. When using `cudf.pandas`, operations that can be accelerated will run on the GPU, while unsupported operations will automatically fall back to pandas on the CPU. For example, `.read_sql()`. this will first read sql with pandas and move the data to cuDF. \n", "\n", "There are two ways to activate cuDF pandas:\n", "- Jupyter Magic Command\n", "```\n", "%load_ext cudf.pandas\n", "import pandas\n", "...\n", "```\n", "- Python Import\n", "```\n", "import cudf.pandas\n", "cudf.pandas.install()\n", "\n", "import pandas as pd\n", "...\n", "```\n", "\n", "**Note**: There are no other changes required - this is useful to quickly accelerate existing workloads with minimum code change. More information about cuDF pandas can be found [here](https://docs.rapids.ai/api/cudf/stable/cudf_pandas/). \n", "\n", "cuDF pandas is a no code change accelerator for pandas for automatic acceleration of any supported pandas call. \n", "\n", "Below we run some basic DataFrame operations with pandas, before demonstrating how cudf pandas is enabled. " ] }, { "cell_type": "code", "execution_count": 1, "id": "5fed82ae-0ecb-4471-bb8f-060b1bf4542f", "metadata": {}, "outputs": [], "source": [ "# DO NOT CHANGE THIS CELL\n", "%load_ext cudf.pandas" ] }, { "cell_type": "code", "execution_count": 2, "id": "7671791e-c491-4831-bd1b-956de6b455e5", "metadata": {}, "outputs": [], "source": [ "# DO NOT CHANGE THIS CELL\n", "import pandas as pd\n", "import time\n", "from datetime import datetime" ] }, { "cell_type": "code", "execution_count": 3, "id": "47c87c9f-5b97-4a0d-bfa7-a26c1369314f", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Duration: 1.99 seconds\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agesexcountylatlongnamebirth_yearsex_normalizecounty_normalize
00mDARLINGTON54.533644-1.524401Francis2025MDarlington
10mDARLINGTON54.426256-1.465314Edward2025MDarlington
20mDARLINGTON54.555200-1.496417Teddy2025MDarlington
30mDARLINGTON54.547906-1.572341Angus2025MDarlington
40mDARLINGTON54.477639-1.605995Charlie2025MDarlington
\n", "
" ], "text/plain": [ " age sex county lat long name birth_year \\\n", "0 0 m DARLINGTON 54.533644 -1.524401 Francis 2025 \n", "1 0 m DARLINGTON 54.426256 -1.465314 Edward 2025 \n", "2 0 m DARLINGTON 54.555200 -1.496417 Teddy 2025 \n", "3 0 m DARLINGTON 54.547906 -1.572341 Angus 2025 \n", "4 0 m DARLINGTON 54.477639 -1.605995 Charlie 2025 \n", "\n", " sex_normalize county_normalize \n", "0 M Darlington \n", "1 M Darlington \n", "2 M Darlington \n", "3 M Darlington \n", "4 M Darlington " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
                                                                                                                   \n",
       "                                             Total time elapsed: 3.759 seconds                                     \n",
       "                                                                                                                   \n",
       "                                                           Stats                                                   \n",
       "                                                                                                                   \n",
       "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n",
       "┃ Line no.  Line                                                                      GPU TIME(s)  CPU TIME(s) ┃\n",
       "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n",
       "│ 2        │     start=time.time()                                                    │             │             │\n",
       "│          │                                                                          │             │             │\n",
       "│ 4        │     df=pd.read_csv('./data/uk_pop.csv')                                  │ 1.540004319 │             │\n",
       "│          │                                                                          │             │             │\n",
       "│ 5        │     current_year=datetime.now().year                                     │             │             │\n",
       "│          │                                                                          │             │             │\n",
       "│ 7        │     df['birth_year']=current_year-df['age']                              │ 0.011366894 │             │\n",
       "│          │                                                                          │             │             │\n",
       "│ 9        │     df['sex_normalize']=df['sex'].str.upper()                            │ 0.015399261 │             │\n",
       "│          │                                                                          │             │             │\n",
       "│ 10       │     df['county_normalize']=df['county'].str.title().str.replace(' ', '_… │ 0.066522101 │             │\n",
       "│          │                                                                          │             │             │\n",
       "│ 11       │     df['name']=df['name'].str.title()                                    │ 0.037128624 │             │\n",
       "│          │                                                                          │             │             │\n",
       "│ 13       │     print(f'Duration: {round(time.time()-start, 2)} seconds')            │             │             │\n",
       "│          │                                                                          │             │             │\n",
       "│ 15       │     display(df.head())                                                   │ 1.463330139 │             │\n",
       "│          │                                                                          │             │             │\n",
       "└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n",
       "
\n" ], "text/plain": [ "\u001b[3m \u001b[0m\n", "\u001b[3m Total time elapsed: 3.759 seconds \u001b[0m\n", "\u001b[3m \u001b[0m\n", "\u001b[3m Stats \u001b[0m\n", "\u001b[3m \u001b[0m\n", "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n", "┃\u001b[1m \u001b[0m\u001b[1mLine no.\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mLine \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mGPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mCPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\n", "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n", "│ 2 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstart\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mtime\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mtime\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n", "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n", "│ 4 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mpd\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mread_csv\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m./data/uk_pop.csv\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ 1.540004319 │ │\n", "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n", "│ 5 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcurrent_year\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdatetime\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mnow\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34myear\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n", "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n", "│ 7 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mbirth_year\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcurrent_year\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m-\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mage\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ 0.011366894 │ │\n", "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n", "│ 9 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34msex_normalize\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34msex\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstr\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mupper\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ 0.015399261 │ │\n", "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n", "│ 10 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mcounty_normalize\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mcounty\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstr\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mtitle\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstr\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mreplace\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m_\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m…\u001b[0m │ 0.066522101 │ │\n", "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n", "│ 11 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mname\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mname\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstr\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mtitle\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ 0.037128624 │ │\n", "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n", "│ 13 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mprint\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mf\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mDuration: \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m{\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mround\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mtime\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mtime\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m-\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstart\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m2\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m}\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m seconds\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n", "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n", "│ 15 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdisplay\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mhead\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ 1.463330139 │ │\n", "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n", "└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%cudf.pandas.line_profile\n", "# DO NOT CHANGE THIS CELL\n", "start=time.time()\n", "\n", "df=pd.read_csv('./data/uk_pop.csv')\n", "current_year=datetime.now().year\n", "\n", "df['birth_year']=current_year-df['age']\n", "\n", "df['sex_normalize']=df['sex'].str.upper()\n", "df['county_normalize']=df['county'].str.title().str.replace(' ', '_')\n", "df['name']=df['name'].str.title()\n", "\n", "print(f'Duration: {round(time.time()-start, 2)} seconds')\n", "\n", "display(df.head())" ] }, { "cell_type": "markdown", "id": "7ebe113c-9fc0-4da5-932c-0a68af0d5a31", "metadata": {}, "source": [ "\n", "### Exercise #3 - Automatic Acceleration ###\n", "**Instructions**:
\n", "* Go back to the top of this subsection, re-execute the cells and uncomment the `%load_ext` magic command to accelerate with cuDF pandas. \n", "* Observe the acceleration. \n", "* Go back to the top of this subsection, re-execute the cells and uncomment the `%%cudf.pandas.line_profile` magic command to use the line profiler. \n", "* Observe the output from the line profiler. " ] }, { "cell_type": "code", "execution_count": 4, "id": "f1688462-783c-4fea-ae18-5d37524d26d8", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'status': 'ok', 'restart': True}" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DO NOT CHANGE THIS CELL\n", "import IPython\n", "app = IPython.Application.instance()\n", "app.kernel.do_shutdown(True)" ] }, { "cell_type": "markdown", "id": "0a691784-dac5-4485-89fb-5e405f10c05c", "metadata": {}, "source": [ "**Well Done!** Let's move to the [next notebook](1-03_memory_management.ipynb). " ] }, { "cell_type": "markdown", "id": "81e47f0a-547e-4714-878d-34eb9b75c835", "metadata": {}, "source": [ "" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.15" } }, "nbformat": 4, "nbformat_minor": 5 }