path: root/Fundamentals_of_Accelerated_Data_Science/1-03_memory_management.ipynb
author leshe4ka46 <alex9102naid1@ya.ru> 2025-10-18 12:25:53 +0300
committer leshe4ka46 <alex9102naid1@ya.ru> 2025-10-18 12:25:53 +0300
commit 910a222fa60ce6ea0831f2956470b8a0b9f62670
tree 1d6bbccafb667731ad127f93390761100fc11b53 /Fundamentals_of_Accelerated_Data_Science/1-03_memory_management.ipynb
parent 35b9040e4104b0e79bf243a2c9769c589f96e2c4
nvidia2
Diffstat (limited to 'Fundamentals_of_Accelerated_Data_Science/1-03_memory_management.ipynb')
-rw-r--r-- Fundamentals_of_Accelerated_Data_Science/1-03_memory_management.ipynb 1135
1 files changed, 1135 insertions, 0 deletions
diff --git a/Fundamentals_of_Accelerated_Data_Science/1-03_memory_management.ipynb b/Fundamentals_of_Accelerated_Data_Science/1-03_memory_management.ipynb
new file mode 100644
index 0000000..cae5373
--- /dev/null
+++ b/Fundamentals_of_Accelerated_Data_Science/1-03_memory_management.ipynb
@@ -0,0 +1,1135 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "def31b0f-921a-43eb-9807-8b9b31eb7b32",
+ "metadata": {},
+ "source": [
+ "<img src=\"./images/DLI_Header.png\" width=400/>"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4a0fd4dd-f7be-4c90-8ddd-384a760ac04f",
+ "metadata": {},
+ "source": [
+ "# Fundamentals of Accelerated Data Science # "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6a8fdf2e-a481-455e-8a52-8be8472b63bf",
+ "metadata": {},
+ "source": [
+ "## 03 - Memory Management ##\n",
+ "\n",
+ "**Table of Contents**\n",
+ "<br>\n",
+ "This notebook explores the relationship between data and memory. It covers the following sections: \n",
+ "1. [Memory Management](#Memory-Management)\n",
+ " * [Memory Usage](#Memory-Usage)\n",
+ "2. [Data Types](#Data-Types)\n",
+ " * [Convert Data Types](#Convert-Data-Types)\n",
+ " * [Exercise #1 - Modify `dtypes`](#Exercise-#1---Modify-dtypes)\n",
+ " * [Categorical](#Categorical)\n",
+ "3. [Efficient Data Loading](#Efficient-Data-Loading)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1b59367c-48bc-4c72-b1f4-4cfdfa5470cf",
+ "metadata": {},
+ "source": [
+ "## Memory Management ##\n",
+ "During the data acquisition process, data is transferred to memory in order to be operated on by the processor. Memory management is crucial for cuDF and GPU operations for several key reasons: \n",
+ "* **Limited GPU memory**: GPUs typically have less on-board memory than the host system has RAM, so efficient memory management is essential to make the most of the available GPU memory, especially for large datasets.\n",
+ "* **Data transfer overhead**: Transferring data between CPU and GPU memory is relatively slow compared to GPU computation speed. Minimizing these transfers through smart memory management is critical for performance.\n",
+ "* **Performance tuning**: Understanding and optimizing memory usage is key to achieving peak performance in GPU-accelerated data processing tasks.\n",
+ "\n",
+ "When done correctly, keeping the data on the GPU can enable cuDF and the RAPIDS ecosystem to achieve significant performance improvements, handle larger datasets, and provide more efficient data processing capabilities. \n",
+ "\n",
+ "Below we import the data from the CSV file. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "b7b8a623-f799-4dad-aca9-0e571bb6e527",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# DO NOT CHANGE THIS CELL\n",
+ "import pandas as pd\n",
+ "import random\n",
+ "import time"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "711d0a7f-8598-49fc-949c-5caf6029ce47",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>age</th>\n",
+ " <th>sex</th>\n",
+ " <th>county</th>\n",
+ " <th>lat</th>\n",
+ " <th>long</th>\n",
+ " <th>name</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>0</td>\n",
+ " <td>m</td>\n",
+ " <td>DARLINGTON</td>\n",
+ " <td>54.533644</td>\n",
+ " <td>-1.524401</td>\n",
+ " <td>FRANCIS</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>0</td>\n",
+ " <td>m</td>\n",
+ " <td>DARLINGTON</td>\n",
+ " <td>54.426256</td>\n",
+ " <td>-1.465314</td>\n",
+ " <td>EDWARD</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>0</td>\n",
+ " <td>m</td>\n",
+ " <td>DARLINGTON</td>\n",
+ " <td>54.555200</td>\n",
+ " <td>-1.496417</td>\n",
+ " <td>TEDDY</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>0</td>\n",
+ " <td>m</td>\n",
+ " <td>DARLINGTON</td>\n",
+ " <td>54.547906</td>\n",
+ " <td>-1.572341</td>\n",
+ " <td>ANGUS</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>0</td>\n",
+ " <td>m</td>\n",
+ " <td>DARLINGTON</td>\n",
+ " <td>54.477639</td>\n",
+ " <td>-1.605995</td>\n",
+ " <td>CHARLIE</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " age sex county lat long name\n",
+ "0 0 m DARLINGTON 54.533644 -1.524401 FRANCIS\n",
+ "1 0 m DARLINGTON 54.426256 -1.465314 EDWARD\n",
+ "2 0 m DARLINGTON 54.555200 -1.496417 TEDDY\n",
+ "3 0 m DARLINGTON 54.547906 -1.572341 ANGUS\n",
+ "4 0 m DARLINGTON 54.477639 -1.605995 CHARLIE"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# DO NOT CHANGE THIS CELL\n",
+ "df=pd.read_csv('./data/uk_pop.csv')\n",
+ "\n",
+ "# preview\n",
+ "df.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "36416fd0-7081-42aa-bf31-d1231b81ec0b",
+ "metadata": {},
+ "source": [
+ "### Memory Usage ###\n",
+ "Memory utilization of a DataFrame depends on the data types of its columns.\n",
+ "\n",
+ "<p><img src='images/dtypes.png' width=720></p>\n",
+ "\n",
+ "We can use `DataFrame.memory_usage()` to see the memory usage of each column (in bytes). Most common data types, such as `int`, `float`, `datetime`, and `bool`, have a fixed size in memory, so a column's memory usage is the size of one element multiplied by the number of data points. For the `string` data type, the memory usage reported _for pandas_ by default is the number of elements times 8 bytes. This accounts for the 64-bit (8-byte) pointer to each string's address in memory, but not for the memory used by the string values themselves. Each string value requires an additional 49 bytes of fixed overhead plus one byte per character (for ASCII strings in the Python environment used here). Setting the `deep` parameter to `True` gives a more accurate report that includes the system-level memory consumed by the contained `string` values. \n",
+ "\n",
+ "Below we get the memory usage. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "8378207b-2d9e-4102-8408-c2dddafc8a40",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "<class 'pandas.core.frame.DataFrame'>\n",
+ "RangeIndex: 58479894 entries, 0 to 58479893\n",
+ "Data columns (total 6 columns):\n",
+ " # Column Dtype \n",
+ "--- ------ ----- \n",
+ " 0 age int64 \n",
+ " 1 sex object \n",
+ " 2 county object \n",
+ " 3 lat float64\n",
+ " 4 long float64\n",
+ " 5 name object \n",
+ "dtypes: float64(2), int64(1), object(3)\n",
+ "memory usage: 11.5 GB\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "Index 128\n",
+ "age 467839152\n",
+ "sex 3391833852\n",
+ "county 3934985133\n",
+ "lat 467839152\n",
+ "long 467839152\n",
+ "name 3666922374\n",
+ "dtype: int64"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# DO NOT CHANGE THIS CELL\n",
+ "# pandas memory utilization\n",
+ "df.info(memory_usage='deep')\n",
+ "mem_usage_df=df.memory_usage(deep=True)\n",
+ "mem_usage_df"
+ ]
+ },
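+ {
+ "cell_type": "markdown",
+ "id": "added-sketch-string-overhead",
+ "metadata": {},
+ "source": [
+ "The per-string cost described above can be checked on a few sample values. The cell below is a minimal sketch, not part of the original course material: the sample strings are made up for illustration, and the fixed overhead reported by `sys.getsizeof()` depends on the CPython version and build, so it is measured rather than assumed to be 49 bytes. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "added-sketch-string-overhead-code",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import sys\n",
+ "\n",
+ "# hypothetical sample values; any ASCII strings work\n",
+ "samples=['m', 'YORK', 'DARLINGTON']\n",
+ "\n",
+ "# fixed per-object overhead of an ASCII str in this interpreter\n",
+ "overhead=sys.getsizeof('')\n",
+ "\n",
+ "for s in samples:\n",
+ "    # CPython stores ASCII text with one additional byte per character\n",
+ "    assert sys.getsizeof(s)==overhead+len(s)\n",
+ "    print(f'{s!r}: {sys.getsizeof(s)} bytes for the object + 8 bytes for the pointer')"
+ ]
+ },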
+ {
+ "cell_type": "markdown",
+ "id": "07c24bb1-c4f7-440c-a949-d4c57800ec61",
+ "metadata": {},
+ "source": [
+ "Below we define a `make_decimal()` function to convert a memory size into units based on powers of 2. Unlike units based on powers of 10, this convention is customary for reporting memory capacity. More information about the two definitions can be found [here](https://en.wikipedia.org/wiki/Byte#Multiple-byte_units). "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "5ae42218-1547-49fd-9123-ab508a2b03de",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# DO NOT CHANGE THIS CELL\n",
+ "suffixes = ['B', 'kB', 'MB', 'GB', 'TB', 'PB']\n",
+ "def make_decimal(nbytes):\n",
+ " i=0\n",
+ " while nbytes >= 1024 and i < len(suffixes)-1:\n",
+ " nbytes/=1024.\n",
+ " i+=1\n",
+ " f=('%.2f' % nbytes).rstrip('0').rstrip('.')\n",
+ " return '%s %s' % (f, suffixes[i])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "e6d4a613-3eea-4dce-8e71-39593ff6f226",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'11.55 GB'"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "make_decimal(mem_usage_df.sum())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a352c0b2-65aa-4231-b753-556aca46ff49",
+ "metadata": {},
+ "source": [
+ "Below we calculate the memory usage manually based on the data types. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "630327b9-6dc1-4b70-9fdf-9f7763ec4d50",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Numerical columns use 467839152 bytes of memory\n"
+ ]
+ }
+ ],
+ "source": [
+ "# DO NOT CHANGE THIS CELL\n",
+ "# get number of rows\n",
+ "num_rows=len(df)\n",
+ "\n",
+ "# each 64-bit number uses 8 bytes of memory, so every 64-bit column uses num_rows*8 bytes\n",
+ "print(f'Numerical columns use {num_rows*8} bytes of memory')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "bb22b5f4-e38f-438e-9426-61746b509e50",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "county column uses 3934985133 bytes of memory.\n"
+ ]
+ }
+ ],
+ "source": [
+ "# DO NOT CHANGE THIS CELL\n",
+ "# check random string-typed column\n",
+ "string_cols=[col for col in df.columns if df[col].dtype=='object' ]\n",
+ "column_to_check=random.choice(string_cols)\n",
+ "\n",
+ "overhead=49\n",
+ "pointer_size=8\n",
+ "\n",
+ "# a missing value is a float NaN, which is truthy and has no len(), so test the type instead\n",
+ "# a NaN uses 32 bytes of memory (pointer included)\n",
+ "string_col_mem_usage_df=df[column_to_check].map(lambda x: len(x)+overhead+pointer_size if isinstance(x, str) else 32)\n",
+ "string_col_mem_usage=string_col_mem_usage_df.sum()\n",
+ "print(f'{column_to_check} column uses {string_col_mem_usage} bytes of memory.')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "94e393c2-c0d0-40ee-82d2-730c4667e9b8",
+ "metadata": {},
+ "source": [
+ "**Note**: The `string` data type is stored differently in cuDF than it is in pandas. More information about how `libcudf` stores string data using the [Arrow format](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) can be found [here](https://developer.nvidia.com/blog/mastering-string-transformations-in-rapids-libcudf/). "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "737ff50b-9426-4e08-a00a-d7ee69f48b9f",
+ "metadata": {},
+ "source": [
+ "## Data Types ##\n",
+ "By default, pandas (and cuDF) use 64-bit types for numerical values. 64-bit numbers provide the highest precision, but many applications do not require 64-bit precision when aggregating over a very large number of data points. When possible, using 32-bit numbers cuts storage and memory requirements in half, and it typically also speeds up computations considerably because only half as much data needs to be accessed in memory. "
+ ]
+ },
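+ {
+ "cell_type": "markdown",
+ "id": "added-sketch-dtype-width",
+ "metadata": {},
+ "source": [
+ "The halving described above follows directly from the element size. The cell below is a minimal sketch, not part of the original course material: it uses a made-up column of one million integers (the name `n` and the values are assumptions for illustration) and compares the bytes used by the values before and after narrowing the type. The last line illustrates the trade-off for floats: the 32-bit value may differ slightly from the 64-bit one. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "added-sketch-dtype-width-code",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "\n",
+ "# hypothetical number of rows\n",
+ "n=1_000_000\n",
+ "\n",
+ "s64=pd.Series(np.arange(n, dtype='int64'))\n",
+ "s32=s64.astype('int32')\n",
+ "\n",
+ "# with index=False only the values are counted: n * bytes per element\n",
+ "assert s64.memory_usage(index=False)==n*8\n",
+ "assert s32.memory_usage(index=False)==n*4\n",
+ "\n",
+ "# narrowing floats trades precision for space\n",
+ "print(np.float64(54.533644), np.float32(54.533644))"
+ ]
+ },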
+ {
+ "cell_type": "markdown",
+ "id": "0b77d450-c415-44b8-87ac-20ce616ec809",
+ "metadata": {},
+ "source": [
+ "### Convert Data Types ###\n",
+ "The `.astype()` method can be used to convert numerical data types to use different bit-size containers. Here we convert the `age` column from `int64` to `int8`, which is safe as long as every value fits in the `int8` range of -128 to 127. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "603f7c70-134e-4466-a790-8a18b9088ca6",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "age int8\n",
+ "sex object\n",
+ "county object\n",
+ "lat float64\n",
+ "long float64\n",
+ "name object\n",
+ "dtype: object"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# DO NOT CHANGE THIS CELL\n",
+ "df['age']=df['age'].astype('int8')\n",
+ "\n",
+ "df.dtypes"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "973a6dd4-2aef-44d9-8b01-8853032eddae",
+ "metadata": {},
+ "source": [
+ "### Exercise #1 - Modify `dtypes` ###\n",
+ "**Instructions**: <br>\n",
+ "* Modify the `<FIXME>` only and execute the cell below to convert any 64-bit data types to their 32-bit counterparts."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "beb7d71b-6672-462e-b65c-a64dbe5f7a57",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df['lat']=df['lat'].astype('float32')\n",
+ "df['long']=df['long'].astype('float32')"
+ ]
+ },
+ {
+ "cell_type": "raw",
+ "id": "3b44fb22-a0f1-4e43-a332-1ccbad50caee",
+ "metadata": {},
+ "source": [
+ "\n",
+ "df['lat']=df['lat'].astype('float32')\n",
+ "df['long']=df['long'].astype('float32')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "98b6542d-22cc-4926-b600-a3e052c37c96",
+ "metadata": {},
+ "source": [
+ "Click ... for solution. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7b2cd622-977c-4915-a87f-2fe03c1793f5",
+ "metadata": {},
+ "source": [
+ "### Categorical ###\n",
+ "Categorical data represents discrete, distinct categories or groups. The categories can have a meaningful order or ranking, but they generally cannot be used for numerical operations. When appropriate, using the `category` data type can reduce memory usage and lead to faster operations. It can also be used to define and maintain a custom order of categories. \n",
+ "\n",
+ "Below we get the number of unique values in the string columns. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "f249e4b8-5d7a-4b44-ac15-bd3360a43f2a",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "sex 2\n",
+ "county 171\n",
+ "name 13212\n",
+ "dtype: int64"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# DO NOT CHANGE THIS CELL\n",
+ "df.select_dtypes(include='object').nunique()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f1d8bd88-b39b-4043-9039-d8bd75fe851a",
+ "metadata": {},
+ "source": [
+ "Below we convert columns with few discrete values to `category`. A `category` column exposes its `.categories` and `.codes` properties through the `.cat` accessor. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "a99bebbf-2e5b-4720-96f9-9fd7d42d2fe8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# DO NOT CHANGE THIS CELL\n",
+ "df['sex']=df['sex'].astype('category')\n",
+ "df['county']=df['county'].astype('category')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "41b7b290-cfcf-4ff6-b6b4-454c19b44a62",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Index(['BARKING AND DAGENHAM', 'BARNET', 'BARNSLEY',\n",
+ " 'BATH AND NORTH EAST SOMERSET', 'BEDFORD', 'BEXLEY', 'BIRMINGHAM',\n",
+ " 'BLACKBURN WITH DARWEN', 'BLACKPOOL', 'BLAENAU GWENT',\n",
+ " ...\n",
+ " 'WESTMINSTER', 'WIGAN', 'WILTSHIRE', 'WINDSOR AND MAIDENHEAD', 'WIRRAL',\n",
+ " 'WOKINGHAM', 'WOLVERHAMPTON', 'WORCESTERSHIRE', 'WREXHAM', 'YORK'],\n",
+ " dtype='object', length=171)"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "----------------------------------------\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "0 37\n",
+ "1 37\n",
+ "2 37\n",
+ "3 37\n",
+ "4 37\n",
+ " ..\n",
+ "58479889 96\n",
+ "58479890 96\n",
+ "58479891 96\n",
+ "58479892 96\n",
+ "58479893 96\n",
+ "Length: 58479894, dtype: int16"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# DO NOT CHANGE THIS CELL\n",
+ "display(df['county'].cat.categories)\n",
+ "print('-'*40)\n",
+ "display(df['county'].cat.codes)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "01d12a78-5f70-4152-a708-68ee68046a1e",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "age int8\n",
+ "sex category\n",
+ "county category\n",
+ "lat float32\n",
+ "long float32\n",
+ "name object\n",
+ "dtype: object"
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df.dtypes"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "24138ffc-80b2-46ea-930d-1c1ab9706b10",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>age</th>\n",
+ " <th>sex</th>\n",
+ " <th>county</th>\n",
+ " <th>lat</th>\n",
+ " <th>long</th>\n",
+ " <th>name</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>0</td>\n",
+ " <td>m</td>\n",
+ " <td>DARLINGTON</td>\n",
+ " <td>54.533646</td>\n",
+ " <td>-1.524401</td>\n",
+ " <td>FRANCIS</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>0</td>\n",
+ " <td>m</td>\n",
+ " <td>DARLINGTON</td>\n",
+ " <td>54.426254</td>\n",
+ " <td>-1.465314</td>\n",
+ " <td>EDWARD</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>0</td>\n",
+ " <td>m</td>\n",
+ " <td>DARLINGTON</td>\n",
+ " <td>54.555199</td>\n",
+ " <td>-1.496417</td>\n",
+ " <td>TEDDY</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>0</td>\n",
+ " <td>m</td>\n",
+ " <td>DARLINGTON</td>\n",
+ " <td>54.547905</td>\n",
+ " <td>-1.572341</td>\n",
+ " <td>ANGUS</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>0</td>\n",
+ " <td>m</td>\n",
+ " <td>DARLINGTON</td>\n",
+ " <td>54.477638</td>\n",
+ " <td>-1.605994</td>\n",
+ " <td>CHARLIE</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " age sex county lat long name\n",
+ "0 0 m DARLINGTON 54.533646 -1.524401 FRANCIS\n",
+ "1 0 m DARLINGTON 54.426254 -1.465314 EDWARD\n",
+ "2 0 m DARLINGTON 54.555199 -1.496417 TEDDY\n",
+ "3 0 m DARLINGTON 54.547905 -1.572341 ANGUS\n",
+ "4 0 m DARLINGTON 54.477638 -1.605994 CHARLIE"
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df.head()"
+ ]
+ },
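+ {
+ "cell_type": "markdown",
+ "id": "added-sketch-category-memory",
+ "metadata": {},
+ "source": [
+ "The memory effect of the `category` conversion above can be reproduced on a small scale. The cell below is a minimal sketch, not part of the original course material: the toy column and its three values are assumptions for illustration. Each row stores only a small integer code, and pandas picks the smallest signed integer type that can index all categories, which is why three categories give `int8` codes while the 171 counties above give `int16`. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "added-sketch-category-memory-code",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "# hypothetical toy column: few distinct values repeated many times\n",
+ "s_obj=pd.Series(['DARLINGTON', 'YORK', 'WIGAN']*100_000)\n",
+ "s_cat=s_obj.astype('category')\n",
+ "\n",
+ "# deep=True includes the memory of the string values themselves\n",
+ "print('string column:  ', s_obj.memory_usage(index=False, deep=True), 'bytes')\n",
+ "print('category column:', s_cat.memory_usage(index=False, deep=True), 'bytes')\n",
+ "\n",
+ "# three categories fit in 8-bit codes\n",
+ "assert s_cat.cat.codes.dtype=='int8'"
+ ]
+ },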
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3d0addcc-c078-42f5-a66a-3bb9a969d7e8",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "id": "737385ab-677c-4bef-a86a-10aa3119e29a",
+ "metadata": {},
+ "source": [
+ "**Note**: `.astype()` can also be used to convert data to `datetime` or `object` to enable datetime and string methods. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "552c47c2-0fbc-455e-8745-cb98fc777243",
+ "metadata": {},
+ "source": [
+ "## Efficient Data Loading ##\n",
+ "It is often advantageous to specify the most appropriate data type for each column, based on its range, precision requirements, and how it is used. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "id": "c2b9f0c3-8598-4a28-9481-ce28fea7544b",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Index 128\n",
+ "age 467839152\n",
+ "sex 3391833852\n",
+ "county 3934985133\n",
+ "lat 467839152\n",
+ "long 467839152\n",
+ "name 3666922374\n",
+ "dtype: int64"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Loading 11.55 GB took 33.87 seconds.\n"
+ ]
+ }
+ ],
+ "source": [
+ "# DO NOT CHANGE THIS CELL\n",
+ "start=time.time()\n",
+ "df=pd.read_csv('./data/uk_pop.csv')\n",
+ "duration=time.time()-start\n",
+ "\n",
+ "mem_usage_df=df.memory_usage(deep=True)\n",
+ "display(mem_usage_df)\n",
+ "\n",
+ "print(f'Loading {make_decimal(mem_usage_df.sum())} took {round(duration, 2)} seconds.')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5729520e-3ed8-4ec6-ae1f-ba46d642f48d",
+ "metadata": {},
+ "source": [
+ "Below we enable `cudf.pandas` to see the difference. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "99aa0f32-4d2a-43a7-bec1-f1b88bcc37c2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# DO NOT CHANGE THIS CELL\n",
+ "%load_ext cudf.pandas\n",
+ "\n",
+ "import pandas as pd\n",
+ "import time"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "id": "2b724201-9ad1-4e9b-b712-f3b31bdc4104",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# DO NOT CHANGE THIS CELL\n",
+ "suffixes = ['B', 'kB', 'MB', 'GB', 'TB', 'PB']\n",
+ "def make_decimal(nbytes):\n",
+ " i=0\n",
+ " while nbytes >= 1024 and i < len(suffixes)-1:\n",
+ " nbytes/=1024.\n",
+ " i+=1\n",
+ " f=('%.2f' % nbytes).rstrip('0').rstrip('.')\n",
+ " return '%s %s' % (f, suffixes[i])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "id": "99bdd7b0-8563-41db-bd8e-3a7279394ede",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "age 58479894\n",
+ "sex 58479908\n",
+ "county 58482446\n",
+ "lat 467839152\n",
+ "long 467839152\n",
+ "name 117096917\n",
+ "Index 0\n",
+ "dtype: int64"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Loading 1.14 GB took 2.12 seconds.\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-style: italic\"> </span>\n",
+ "<span style=\"font-style: italic\"> Total time elapsed: 2.687 seconds </span>\n",
+ "<span style=\"font-style: italic\"> </span>\n",
+ "<span style=\"font-style: italic\"> Stats </span>\n",
+ "<span style=\"font-style: italic\"> </span>\n",
+ "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n",
+ "┃<span style=\"font-weight: bold\"> Line no. </span>┃<span style=\"font-weight: bold\"> Line </span>┃<span style=\"font-weight: bold\"> GPU TIME(s) </span>┃<span style=\"font-weight: bold\"> CPU TIME(s) </span>┃\n",
+ "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n",
+ "│ 2 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> start</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">time</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">time()</span><span style=\"background-color: #272822\"> </span> │ │ │\n",
+ "│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
+ "│ 5 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> dtype_dict</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">{</span><span style=\"background-color: #272822\"> </span> │ │ │\n",
+ "│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
+ "│ 6 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'age'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">: </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'int8'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"background-color: #272822\"> </span> │ │ │\n",
+ "│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
+ "│ 7 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'sex'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">: </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'category'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"background-color: #272822\"> </span> │ │ │\n",
+ "│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
+ "│ 8 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'county'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">: </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'category'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"background-color: #272822\"> </span> │ │ │\n",
+ "│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
+ "│ 9 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'lat'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">: </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'float64'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"background-color: #272822\"> </span> │ │ │\n",
+ "│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
+ "│ 10 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'long'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">: </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'float64'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"background-color: #272822\"> </span> │ │ │\n",
+ "│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
+ "│ 11 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'name'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">: </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'category'</span><span style=\"background-color: #272822\"> </span> │ │ │\n",
+ "│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
+ "│ 14 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> efficient_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">pd</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">read_csv(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'./data/uk_pop.csv'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, dtype</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">dtype_dict)</span><span style=\"background-color: #272822\"> </span> │ 1.718211215 │ │\n",
+ "│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
+ "│ 15 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> duration</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">time</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">time()</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">-</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">start</span><span style=\"background-color: #272822\"> </span> │ │ │\n",
+ "│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
+ "│ 17 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> mem_usage_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">efficient_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">memory_usage(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'deep'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">)</span><span style=\"background-color: #272822\"> </span> │ 0.005751408 │ │\n",
+ "│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
+ "│ 18 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> display(mem_usage_df)</span><span style=\"background-color: #272822\"> </span> │ 0.011270449 │ 0.007067286 │\n",
+ "│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
+ "│ 20 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> print(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">f'Loading {</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">make_decimal(mem_usage_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">sum())</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">} took {</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">round(dura…</span> │ 0.004789912 │ │\n",
+ "│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
+ "└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n",
+ "</pre>\n"
+ ],
+ "text/plain": [
+ "\u001b[3m \u001b[0m\n",
+ "\u001b[3m Total time elapsed: 2.687 seconds \u001b[0m\n",
+ "\u001b[3m \u001b[0m\n",
+ "\u001b[3m Stats \u001b[0m\n",
+ "\u001b[3m \u001b[0m\n",
+ "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n",
+ "┃\u001b[1m \u001b[0m\u001b[1mLine no.\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mLine \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mGPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mCPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\n",
+ "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n",
+ "│ 2 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstart\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mtime\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mtime\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
+ "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
+ "│ 5 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdtype_dict\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m{\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
+ "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
+ "│ 6 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mage\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m:\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mint8\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
+ "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
+ "│ 7 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34msex\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m:\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mcategory\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
+ "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
+ "│ 8 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mcounty\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m:\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mcategory\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
+ "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
+ "│ 9 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mlat\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m:\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mfloat64\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
+ "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
+ "│ 10 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mlong\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m:\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mfloat64\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
+ "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
+ "│ 11 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mname\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m:\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mcategory\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
+ "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
+ "│ 14 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mefficient_df\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mpd\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mread_csv\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m./data/uk_pop.csv\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdtype\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdtype_dict\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ 1.718211215 │ │\n",
+ "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
+ "│ 15 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mduration\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mtime\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mtime\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m-\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstart\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
+ "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
+ "│ 17 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mmem_usage_df\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mefficient_df\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mmemory_usage\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mdeep\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ 0.005751408 │ │\n",
+ "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
+ "│ 18 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdisplay\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mmem_usage_df\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ 0.011270449 │ 0.007067286 │\n",
+ "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
+ "│ 20 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mprint\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mf\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mLoading \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m{\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mmake_decimal\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mmem_usage_df\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34msum\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m}\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m took \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m{\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mround\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdura…\u001b[0m │ 0.004789912 │ │\n",
+ "│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
+ "└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "%%cudf.pandas.line_profile\n",
+ "# DO NOT CHANGE THIS CELL\n",
+ "start=time.time()\n",
+ "\n",
+ "# define data types for each column\n",
+ "dtype_dict={\n",
+ " 'age': 'int8', \n",
+ " 'sex': 'category', \n",
+ " 'county': 'category', \n",
+ " 'lat': 'float64', \n",
+ " 'long': 'float64', \n",
+ " 'name': 'category'\n",
+ "}\n",
+ " \n",
+ "efficient_df=pd.read_csv('./data/uk_pop.csv', dtype=dtype_dict)\n",
+ "duration=time.time()-start\n",
+ "\n",
+ "mem_usage_df=efficient_df.memory_usage('deep')\n",
+ "display(mem_usage_df)\n",
+ "\n",
+ "print(f'Loading {make_decimal(mem_usage_df.sum())} took {round(duration, 2)} seconds.')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0f4607d8-6de3-4b27-96d4-a9720d268333",
+ "metadata": {},
+ "source": [
+    "By specifying the data types up front, we were able to load the data faster and with a smaller memory footprint. \n",
+ "\n",
+    "**Note**: The memory in use on the GPU is larger than the memory used by the DataFrame itself. This is expected: loading the data involves intermediate steps, in this case parsing the CSV file, that temporarily require additional memory. \n",
+ "\n",
+ "```\n",
+ "+-----------------------------------------------------------------------------+\n",
+ "| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |\n",
+ "|-------------------------------+----------------------+----------------------+\n",
+ "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
+ "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n",
+ "| | | MIG M. |\n",
+ "|===============================+======================+======================|\n",
+ "| 0 Tesla T4 Off | 00000000:00:1B.0 Off | 0 |\n",
+ "| N/A 32C P0 26W / 70W | 1378MiB / 15360MiB | 0% Default |\n",
+ "| | | N/A |\n",
+ "+-------------------------------+----------------------+----------------------+\n",
+ "| 1 Tesla T4 Off | 00000000:00:1C.0 Off | 0 |\n",
+ "| N/A 31C P0 26W / 70W | 168MiB / 15360MiB | 0% Default |\n",
+ "| | | N/A |\n",
+ "+-------------------------------+----------------------+----------------------+\n",
+ "| 2 Tesla T4 Off | 00000000:00:1D.0 Off | 0 |\n",
+ "| N/A 30C P0 26W / 70W | 168MiB / 15360MiB | 0% Default |\n",
+ "| | | N/A |\n",
+ "+-------------------------------+----------------------+----------------------+\n",
+ "| 3 Tesla T4 Off | 00000000:00:1E.0 Off | 0 |\n",
+ "| N/A 30C P0 26W / 70W | 168MiB / 15360MiB | 0% Default |\n",
+ "| | | N/A |\n",
+ "+-------------------------------+----------------------+----------------------+\n",
+ " \n",
+ "+-----------------------------------------------------------------------------+\n",
+ "| Processes: |\n",
+ "| GPU GI CI PID Type Process name GPU Memory |\n",
+ "| ID ID Usage |\n",
+ "|=============================================================================|\n",
+ "+-----------------------------------------------------------------------------+\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "id": "92f7ee37-4acb-46aa-bb73-4c0139d3f6b8",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Sat Oct 11 16:44:59 2025 \n",
+ "+-----------------------------------------------------------------------------+\n",
+ "| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |\n",
+ "|-------------------------------+----------------------+----------------------+\n",
+ "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
+ "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n",
+ "| | | MIG M. |\n",
+ "|===============================+======================+======================|\n",
+ "| 0 Tesla T4 On | 00000000:00:1B.0 Off | 0 |\n",
+ "| N/A 30C P0 25W / 70W | 11338MiB / 15360MiB | 0% Default |\n",
+ "| | | N/A |\n",
+ "+-------------------------------+----------------------+----------------------+\n",
+ "| 1 Tesla T4 On | 00000000:00:1C.0 Off | 0 |\n",
+ "| N/A 31C P0 25W / 70W | 168MiB / 15360MiB | 0% Default |\n",
+ "| | | N/A |\n",
+ "+-------------------------------+----------------------+----------------------+\n",
+ "| 2 Tesla T4 On | 00000000:00:1D.0 Off | 0 |\n",
+ "| N/A 31C P0 25W / 70W | 168MiB / 15360MiB | 0% Default |\n",
+ "| | | N/A |\n",
+ "+-------------------------------+----------------------+----------------------+\n",
+ "| 3 Tesla T4 On | 00000000:00:1E.0 Off | 0 |\n",
+ "| N/A 31C P0 25W / 70W | 168MiB / 15360MiB | 0% Default |\n",
+ "| | | N/A |\n",
+ "+-------------------------------+----------------------+----------------------+\n",
+ " \n",
+ "+-----------------------------------------------------------------------------+\n",
+ "| Processes: |\n",
+ "| GPU GI CI PID Type Process name GPU Memory |\n",
+ "| ID ID Usage |\n",
+ "|=============================================================================|\n",
+ "+-----------------------------------------------------------------------------+\n"
+ ]
+ }
+ ],
+ "source": [
+ "# DO NOT CHANGE THIS CELL\n",
+ "!nvidia-smi"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c031d2c7-03cb-4ac7-a195-70fc25cb191d",
+ "metadata": {},
+ "source": [
+    "When loading data this way, we may be able to fit more data. The optimal dataset size depends on several factors, including the operations being performed, the complexity of the workload, and the available GPU memory. To maximize acceleration, a dataset should ideally fit within GPU memory, with ample headroom left for operations that temporarily spike memory usage. As a general rule of thumb, cuDF recommends datasets smaller than 50% of the GPU memory capacity. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "id": "ec6cefea-dc64-4f13-815e-081cd35651b9",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "We can load 408997980 rows.\n"
+ ]
+ }
+ ],
+ "source": [
+ "# DO NOT CHANGE THIS CELL\n",
+    "# 1 gibibyte (GiB) = 1073741824 bytes\n",
+ "mem_capacity=16*1073741824\n",
+ "\n",
+ "mem_per_record=mem_usage_df.sum()/len(efficient_df)\n",
+ "\n",
+ "print(f'We can load {int(mem_capacity/2/mem_per_record)} rows.')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "id": "ddaaa1ac-66ec-4323-9842-2543c6d85e4e",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'status': 'ok', 'restart': True}"
+ ]
+ },
+ "execution_count": 23,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# DO NOT CHANGE THIS CELL\n",
+ "import IPython\n",
+ "app = IPython.Application.instance()\n",
+ "app.kernel.do_shutdown(True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "658e9847-775f-4d12-af4e-8f896df4e6fe",
+ "metadata": {},
+ "source": [
+ "**Well Done!** Let's move to the [next notebook](1-04_interoperability.ipynb). "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b86451cf-60e6-4733-b431-1bc0bd586bc2",
+ "metadata": {},
+ "source": [
+ "<img src=\"./images/DLI_Header.png\" width=400/>"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.15"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}