From 910a222fa60ce6ea0831f2956470b8a0b9f62670 Mon Sep 17 00:00:00 2001 From: leshe4ka46 Date: Sat, 18 Oct 2025 12:25:53 +0300 Subject: nvidia2 --- .../1-08_cudf-polars.ipynb | 1357 ++++++++++++++++++++ 1 file changed, 1357 insertions(+) create mode 100644 Fundamentals_of_Accelerated_Data_Science/1-08_cudf-polars.ipynb (limited to 'Fundamentals_of_Accelerated_Data_Science/1-08_cudf-polars.ipynb') diff --git a/Fundamentals_of_Accelerated_Data_Science/1-08_cudf-polars.ipynb b/Fundamentals_of_Accelerated_Data_Science/1-08_cudf-polars.ipynb new file mode 100644 index 0000000..c0f1115 --- /dev/null +++ b/Fundamentals_of_Accelerated_Data_Science/1-08_cudf-polars.ipynb @@ -0,0 +1,1357 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "734557aa-90fb-468f-9ed0-0e6f295bb9eb", + "metadata": {}, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "id": "6dbb572c-1291-4011-9ed4-120eb2ec7b29", + "metadata": {}, + "source": [ + "# Fundamentals of Accelerated Data Science # " + ] + }, + { + "cell_type": "markdown", + "id": "377ba9f0-0acc-4574-9a43-c475cfd52dd5", + "metadata": {}, + "source": [ + "## 08 - Introduction to cuDF Polars ##\n", + "\n", + "**Table of Contents**\n", + "
\n", + "This notebook briefly introduces Polars and covers the new GPU engine. This notebook covers the following sections: \n", + "1. [Introduction to Polars](#Introduction-to-Polars)\n", + " * [Installation](#Installation)\n", + " * [Creating a DataFrame](#Creating-a-DataFrame)\n", + " * [Running Basic Operations](#Running-Basic-Operations)\n", + " * [Pandas Comparison](#Pandas-Comparison)\n", + " * [cuDF Pandas Comparison](#cuDF-Pandas-Comparison)\n", + "2. [Basic Polars Operations](#Basic-Polars-Operations)\n", + " * [Polars Eager Execution API Reference](#Polars-Eager-Execution-API-Reference)\n", + " * [Exercise #1 - Load Data](#Exercise-#1---Load-Data)\n", + " * [Exercise #2 - Calculate Average Age of Population](#Exercise-#2---Calculate-Average-Age-of-Population)\n", + " * [Exercise #3 - Group By and Aggregation](#Exercise-#3---Group-By-and-Aggregation)\n", + " * [Exercise #4 - Gender Distribution](#Exercise-#4---Gender-Distribution)\n", + "3. [Lazy Execution](#Lazy-Execution)\n", + " * [Polars Lazy Execution API Reference](#Polars-Lazy-Execution-API-Reference)\n", + " * [Execution Graph](#Execution-Graph)\n", + " * [Exercise #5 - Creating a Lazy Dataframe](#Exercise-#5---Creating-a-Lazy-Dataframe)\n", + " * [Exercise #6 - Query Creation](#Exercise-#6---Query-Creation)\n", + "4. [cuDF Polars](#cuDF-Polars)\n", + " * [Accelerate Previous Code](#Accelerate-Previous-Code)\n", + " * [Verify Results Across Engines](#Verify-Results-Across-Engines)\n", + " * [Fallback](#Fallback)\n", + " * [Exercise #7 - Enable GPU Engine](#Exercise-#7---Enable-GPU-Engine)" + ] + }, + { + "cell_type": "markdown", + "id": "e7dd978a-a785-4b67-822e-048d56e231e2", + "metadata": {}, + "source": [ + "## Introduction to Polars ##\n", + "Polars is a data analysis and manipulation library that is designed for large data processing (10-100GB) on a single machine and is known for its speed and memory efficiency.
While Pandas makes use of eager execution, Polars additionally has the capability for lazy execution through the built-in query optimizer and makes use of zero-copy optimization techniques. Due to these improvements, Polars typically performs common operations 5-10x faster than Pandas, and requires 2-4 times less RAM. NVIDIA brings hardware acceleration to Polars through a new GPU engine named cuDF Polars, which is available as a pip install." + ] + }, + { + "cell_type": "markdown", + "id": "a2187687-9a53-44f9-ba55-ba9e1081cac6", + "metadata": {}, + "source": [ + "### Creating a DataFrame ###\n", + "Now let's see how the syntax looks! We will create a dataframe to use within Polars." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "be96aa2e-f1fc-4521-bf11-89505a2981ac", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Time Taken: 0.7773 seconds\n" + ] + } + ], + "source": [ + "import polars as pl\n", + "import time\n", + "\n", + "start_time = time.time()\n", + "\n", + "polars_df = pl.read_csv('./data/uk_pop.csv')\n", + "\n", + "polars_time = time.time() - start_time\n", + "\n", + "print(f\"Time Taken: {polars_time:.4f} seconds\")" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "90dedd84-0792-4a54-9380-4b3a020bccd1", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "shape: (5, 6)
" + ], + "text/plain": [ + "shape: (5, 6)\n", + "┌─────┬─────┬────────────┬───────────┬───────────┬─────────┐\n", + "│ age ┆ sex ┆ county ┆ lat ┆ long ┆ name │\n", + "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ i64 ┆ str ┆ str ┆ f64 ┆ f64 ┆ str │\n", + "╞═════╪═════╪════════════╪═══════════╪═══════════╪═════════╡\n", + "│ 0 ┆ m ┆ DARLINGTON ┆ 54.533644 ┆ -1.524401 ┆ FRANCIS │\n", + "│ 0 ┆ m ┆ DARLINGTON ┆ 54.426256 ┆ -1.465314 ┆ EDWARD │\n", + "│ 0 ┆ m ┆ DARLINGTON ┆ 54.5552 ┆ -1.496417 ┆ TEDDY │\n", + "│ 0 ┆ m ┆ DARLINGTON ┆ 54.547906 ┆ -1.572341 ┆ ANGUS │\n", + "│ 0 ┆ m ┆ DARLINGTON ┆ 54.477639 ┆ -1.605995 ┆ CHARLIE │\n", + "└─────┴─────┴────────────┴───────────┴───────────┴─────────┘" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "polars_df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "12c66adc-5ac0-49a1-95fe-dc78fbb3950a", + "metadata": {}, + "source": [ + "### Running Basic Operations ###\n", + "That was simple- now let's try running a few operations on the dataset! We will be loading the dataset again for a fair comparison with Pandas later." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "7fc2280f-a89c-42dd-befc-39ffce2f01ec", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "shape: (5, 6)\n", + "┌─────┬─────┬──────────────────────────┬───────────┬───────────┬───────┐\n", + "│ age ┆ sex ┆ county ┆ lat ┆ long ┆ name │\n", + "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ i64 ┆ str ┆ str ┆ f64 ┆ f64 ┆ str │\n", + "╞═════╪═════╪══════════════════════════╪═══════════╪═══════════╪═══════╡\n", + "│ 1 ┆ f ┆ EAST RIDING OF YORKSHIRE ┆ 53.737344 ┆ -0.638535 ┆ ZYRAH │\n", + "│ 1 ┆ f ┆ SHEFFIELD ┆ 53.35529 ┆ -1.669447 ┆ ZYRAH │\n", + "│ 1 ┆ f ┆ LINCOLNSHIRE ┆ 53.164176 ┆ 0.015812 ┆ ZYRAH │\n", + "│ 1 ┆ f ┆ WORCESTERSHIRE ┆ 52.258629 ┆ -2.31696 ┆ ZYRAH │\n", + "│ 1 ┆ f ┆ HERTFORDSHIRE ┆ 51.731816 ┆ -0.377476 ┆ ZYRAH │\n", + "└─────┴─────┴──────────────────────────┴───────────┴───────────┴───────┘\n", + "Time Taken: 6.6932 seconds\n" + ] + } + ], + "source": [ + "start_time = time.time()\n", + "\n", + "#load data\n", + "polars_df = pl.read_csv('./data/uk_pop.csv')\n", + "\n", + "# Filter for ages above 0\n", + "filtered_df = polars_df.filter(pl.col('age') > 0.0)\n", + "\n", + "#Sort by name\n", + "sorted_df = filtered_df.sort('name', descending=True)\n", + "\n", + "print(sorted_df.head())\n", + "polars_time = time.time() - start_time\n", + "print(f\"Time Taken: {polars_time:.4f} seconds\")" + ] + }, + { + "cell_type": "markdown", + "id": "bf06893b-a772-4c15-a867-e4cf6aa0984b", + "metadata": {}, + "source": [ + "### Pandas Comparison ###\n", + "Let's see how long this would've taken in Pandas." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "99468ce0-ecb1-4da2-9e96-44c41c73605a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Time Taken: 147.3623 seconds\n", + "\n" + ] + } + ], + "source": [ + "import pandas as pd\n", + "import time\n", + "start_time = time.time()\n", + "pandas_df = pd.read_csv('./data/uk_pop.csv')\n", + "\n", + "filtered_df = pandas_df[pandas_df['age'] > 0.0]\n", + "\n", + "sorted_df = filtered_df.sort_values(by=['name'], ascending=False)\n", + "\n", + "pandas_time = time.time() - start_time\n", + "print(f\"Time Taken: {pandas_time:.4f} seconds\\n\")" + ] + }, + { + "cell_type": "markdown", + "id": "d5f25cae-1ee8-462e-a487-68af7d9004ce", + "metadata": {}, + "source": [ + "### cuDF Pandas Comparison ###\n", + "Wow! That took quite some time to execute. Let's see if we can run it faster with cuDF Pandas." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "23484488-4d35-4359-804d-80bb60463d2e", + "metadata": {}, + "outputs": [], + "source": [ + "# Activate cuDF Pandas\n", + "%load_ext cudf.pandas\n", + "import pandas as pd" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "714b14dd-0ebd-4814-a341-75cbf8e44fd5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Time Taken for cuDF Pandas: 7.2949 seconds\n", + "\n" + ] + } + ], + "source": [ + "import pandas as pd\n", + "import time\n", + "start_time = time.time()\n", + "pandas_df = pd.read_csv('./data/uk_pop.csv')\n", + "\n", + "filtered_df = pandas_df[pandas_df['age'] > 0.0]\n", + "\n", + "sorted_df = filtered_df.sort_values(by=['name'], ascending=False)\n", + "\n", + "pandas_time = time.time() - start_time\n", + "print(f\"Time Taken for cuDF Pandas: {pandas_time:.4f} seconds\\n\")" + ] + }, + { + "cell_type": "markdown", + "id": "d9a3692e-fe36-4bb2-bf1d-84ce5d4834f7", + "metadata": {}, + "source": [ + "**Note**: Even with cuDF 
Pandas, we sometimes notice that the performance can be slower than Polars." + ] + }, + { + "cell_type": "markdown", + "id": "dd499b82-9b82-472b-934c-5d7e34743458", + "metadata": {}, + "source": [ + "## Basic Polars Operations ##\n", + "Please refer to the following API reference guide to complete the exercises below.\n", + "\n", + "1. Load data\n", + "2. Calculate average age of population\n", + "3. Group By and Aggregation\n", + "4. Gender Distribution" + ] + }, + { + "cell_type": "markdown", + "id": "b3972070-b02e-4870-90c2-bcae0c43f82e", + "metadata": {}, + "source": [ + "### Polars Eager Execution API Reference ###\n", + "\n", + "**DataFrame**\n", + "\n", + "The main data structure for eager execution in Polars.\n", + "\n", + "- `pl.DataFrame(data)`: Create a DataFrame from data\n", + "- `pl.read_csv(file)`: Read CSV file into DataFrame\n", + "- `pl.read_parquet(file)`: Read Parquet file into DataFrame\n", + "\n", + "**Key Methods**\n", + "\n", + "- `filter(mask)`: Filter rows based on a boolean mask\n", + "- `select(columns)`: Select specific columns\n", + "- `with_columns(expressions)`: Add or modify columns\n", + "- `group_by(columns)`: Group by specified columns\n", + "- `agg(aggregations)`: Perform aggregations on grouped data\n", + "- `sort(columns)`: Sort the data by specified columns\n", + "- `join(other, on)`: Join with another DataFrame\n", + "\n", + "**Expressions**\n", + "\n", + "Used to define operations on columns:\n", + "\n", + "- `pl.col(\"column\")`: Reference a column\n", + "- `pl.lit(value)`: Create a literal value\n", + "- `pl.when(predicate).then(value).otherwise(other)`: Conditional expression\n", + "\n", + "**Series Operations**\n", + "\n", + "- `series.sum()`: Calculate sum of series\n", + "- `series.mean()`: Calculate mean of series\n", + "- `series.max()`: Find maximum value in series\n", + "- `series.min()`: Find minimum value in series\n", + "- `series.sort()`: Sort series values\n", + "\n", + "**Data Types**\n", + "\n", + "- 
`pl.Int64`: 64-bit integer\n", + "- `pl.Float64`: 64-bit float\n", + "- `pl.String` (alias `pl.Utf8`): String\n", + "- `pl.Boolean`: Boolean\n", + "- `pl.Date`: Date\n", + "\n", + "**Utilities**\n", + "\n", + "- `pl.concat([df1, df2])`: Concatenate DataFrames\n", + "- `df.describe()`: Generate summary statistics\n", + "- `df.write_csv(file)`: Write DataFrame to CSV\n", + "- `df.write_parquet(file)`: Write DataFrame to Parquet\n", + "\n", + "The eager API executes operations immediately, providing direct access to results. It's suitable for interactive data exploration and smaller datasets." + ] + }, + { + "cell_type": "markdown", + "id": "d87e5c16-46df-4a5b-8454-64fc7bb761a4", + "metadata": {}, + "source": [ + "### Exercise #1 - Load Data ###\n", + "Load the CSV file into a DataFrame using Polars." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "e138575b-8817-44e0-981c-128b07d6bf8f", + "metadata": {}, + "outputs": [], + "source": [ + "import polars as pl\n", + "\n", + "polars_df = pl.read_csv('./data/uk_pop.csv')" + ] + }, + { + "cell_type": "raw", + "id": "8e2c5846-9762-46f9-a440-b8fab6949b73", + "metadata": {}, + "source": [ + "\n", + "df = pl.read_csv('./data/uk_pop.csv')\n", + "\n", + "print(df.head())" + ] + }, + { + "cell_type": "markdown", + "id": "d7833215-ea20-4b6b-bcd1-be4c8ca983ca", + "metadata": {}, + "source": [ + "Click ... for solution. " + ] + }, + { + "cell_type": "markdown", + "id": "51b230dd-7c17-46fb-be96-1dd11dfbf90d", + "metadata": {}, + "source": [ + "### Exercise #2 - Calculate Average Age of Population ###\n", + "Now, filter for individuals aged 65 and above, and sort by ascending age."
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "d22e37d3-a0c0-4b05-891a-c74086a79e59", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "shape: (5, 6)\n", + "┌─────┬─────┬──────────────────────────┬───────────┬───────────┬──────┐\n", + "│ age ┆ sex ┆ county ┆ lat ┆ long ┆ name │\n", + "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ i64 ┆ str ┆ str ┆ f64 ┆ f64 ┆ str │\n", + "╞═════╪═════╪══════════════════════════╪═══════════╪═══════════╪══════╡\n", + "│ 65 ┆ m ┆ CHESHIRE EAST ┆ 53.320049 ┆ -2.581048 ┆ A │\n", + "│ 65 ┆ m ┆ LANCASHIRE ┆ 53.835459 ┆ -2.6268 ┆ A │\n", + "│ 65 ┆ m ┆ EAST RIDING OF YORKSHIRE ┆ 53.846112 ┆ -0.731101 ┆ A │\n", + "│ 65 ┆ m ┆ NORTHAMPTONSHIRE ┆ 52.179982 ┆ -0.962304 ┆ A │\n", + "│ 65 ┆ m ┆ BRIGHTON AND HOVE ┆ 50.828285 ┆ -0.143481 ┆ A │\n", + "└─────┴─────┴──────────────────────────┴───────────┴───────────┴──────┘\n" + ] + } + ], + "source": [ + "filtered_df = polars_df.filter(pl.col('age') >= 65.0)\n", + "sorted_df = filtered_df.sort('name', descending=False)\n", + "print(sorted_df.head())" + ] + }, + { + "cell_type": "raw", + "id": "de1b9849-e5fe-43a7-b38e-68ec4aa20bce", + "metadata": {}, + "source": [ + "\n", + "filtered = (\n", + " df.filter(pl.col(\"age\") >= 65)\n", + " .sort(\"age\", descending=False)\n", + ")\n", + "\n", + "print(filtered)" + ] + }, + { + "cell_type": "markdown", + "id": "dd6a015f-d086-4856-972a-510d4073f614", + "metadata": {}, + "source": [ + "Click ... for solution. " + ] + }, + { + "cell_type": "markdown", + "id": "2af1bb9b-91bc-4665-a371-699b66067cd2", + "metadata": {}, + "source": [ + "### Exercise #3 - Group By and Aggregation ###\n", + "Next, group by county and calculate the total population and average age." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "191a03a2-f420-4669-96ed-ddb290ce09f7", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "shape: (5, 3)
" + ], + "text/plain": [ + "shape: (5, 3)\n", + "┌───────────────────┬────────┬───────────┐\n", + "│ county ┆ len ┆ age │\n", + "│ --- ┆ --- ┆ --- │\n", + "│ str ┆ u32 ┆ f64 │\n", + "╞═══════════════════╪════════╪═══════════╡\n", + "│ MIDDLESBROUGH ┆ 140545 ┆ 38.067509 │\n", + "│ WARWICKSHIRE ┆ 571010 ┆ 41.794508 │\n", + "│ NEATH PORT TALBOT ┆ 142906 ┆ 42.042343 │\n", + "│ HOUNSLOW ┆ 270782 ┆ 36.310597 │\n", + "│ SOMERSET ┆ 559399 ┆ 44.004834 │\n", + "└───────────────────┴────────┴───────────┘" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "grouped = polars_df.group_by(\"county\")\n", + "aggregated = grouped.agg([pl.len(), pl.mean(\"age\")])\n", + "aggregated.head()" + ] + }, + { + "cell_type": "raw", + "id": "456a1ac3-a6dd-4c06-b2b4-7adaf59783fe", + "metadata": {}, + "source": [ + "\n", + "agg = (\n", + " df.group_by(\"county\")\n", + " .agg([\n", + " pl.len().alias(\"population\"),\n", + " pl.mean(\"age\").alias(\"average_age\")\n", + " ])\n", + " .sort(\"population\", descending=True)\n", + ")\n", + "\n", + "print(agg.head())" + ] + }, + { + "cell_type": "markdown", + "id": "59956d51-c6e1-4129-aa5f-11e29ab44a9a", + "metadata": {}, + "source": [ + "Click ... for solution. " + ] + }, + { + "cell_type": "markdown", + "id": "fb6e2475-3823-43a7-b75b-460e971da230", + "metadata": {}, + "source": [ + "### Exercise #4 - Gender Distribution ###\n", + "Lastly, let's calculate the percentage of males to females in the sample data." + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "1fa57226-f174-4967-a75a-b11d9df8765f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "shape: (2, 3)
" + ], + "text/plain": [ + "shape: (2, 3)\n", + "┌─────┬──────────┬────────────┐\n", + "│ sex ┆ count ┆ percentage │\n", + "│ --- ┆ --- ┆ --- │\n", + "│ str ┆ u32 ┆ f64 │\n", + "╞═════╪══════════╪════════════╡\n", + "│ m ┆ 28900781 ┆ 49.42003 │\n", + "│ f ┆ 29579113 ┆ 50.57997 │\n", + "└─────┴──────────┴────────────┘" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "grouped_age = polars_df.group_by(\"sex\")\n", + "aggregated_age = grouped_age.agg(pl.len().alias(\"count\")).with_columns(\n", + " (pl.col(\"count\") / polars_df.shape[0] * 100).alias(\"percentage\")\n", + " )\n", + "aggregated_age.head()" + ] + }, + { + "cell_type": "raw", + "id": "0a635638-1c6b-4752-b2c2-dadee0b73162", + "metadata": {}, + "source": [ + "\n", + "gender = (\n", + " df.group_by(\"sex\")\n", + " .agg(pl.len().alias(\"count\"))\n", + " .with_columns(\n", + " (pl.col(\"count\") / df.shape[0] * 100).alias(\"percentage\")\n", + " )\n", + ")\n", + "\n", + "print(gender)" + ] + }, + { + "cell_type": "markdown", + "id": "799cf9d3-1805-4a82-8d78-b780c73ab25e", + "metadata": {}, + "source": [ + "Click ... for solution. " + ] + }, + { + "cell_type": "markdown", + "id": "3b44f79a-a542-492d-ae01-b1ee6f9203dc", + "metadata": {}, + "source": [ + "## Lazy Execution ##\n", + "Polars utilizes a technique called lazy execution to perform operations. Unlike eager execution, where operations are performed immediately, Polars defines and stores operations in a computational graph that isn't executed until explicitly required. This allows Polars to optimize the sequence of operations to minimize computation overhead and apply optimization techniques such as: applying filters early (predicate pushdown), selecting only necessary columns (projection pushdown), and executing operations in parallel. 
 To make use of lazy execution in Polars, a \"LazyFrame\" data structure is used.\n", + "\n", + "Now, let's run the same operations with lazy execution and visualize the graph!" + ] + }, + { + "cell_type": "markdown", + "id": "0a709c71-6fae-4eca-8236-853e3d97e12a", + "metadata": {}, + "source": [ + "### Polars Lazy Execution API Reference ###\n", + "\n", + "**LazyFrame**\n", + "\n", + "The main entry point for lazy execution in Polars. Created from a DataFrame or data source.\n", + "\n", + "- `pl.LazyFrame(data)`: Create a LazyFrame from data.\n", + "- `pl.scan_csv(file)`: Lazily scan a CSV file into a LazyFrame.\n", + "- `df.lazy()`: Convert a DataFrame to LazyFrame.\n", + "\n", + "**Key Methods**\n", + "\n", + "- `filter(predicate)`: Filter rows based on a condition.\n", + "- `select(columns)`: Select specific columns.\n", + "- `with_columns(expressions)`: Add or modify columns.\n", + "- `group_by(columns)`: Group by specified columns.\n", + "- `agg(aggregations)`: Perform aggregations on grouped data.\n", + "- `sort(columns)`: Sort the data by specified columns.\n", + "- `join(other, on)`: Join with another LazyFrame.\n", + "- `collect()`: Execute the lazy query and return a DataFrame.\n", + "\n", + "**Expressions**\n", + "\n", + "Used to define operations on columns:\n", + "\n", + "- `pl.col(\"column\")`: Reference a column.\n", + "- `pl.lit(value)`: Create a literal value.\n", + "- `pl.when(predicate).then(value).otherwise(other)`: Define a conditional expression.\n", + "\n", + "**Execution**\n", + "\n", + "- `collect()`: Execute and return a DataFrame.\n", + "- `head(n).collect()`: Execute and return the first n rows (`fetch(n)` is deprecated in recent Polars releases).\n", + "- `explain()`: Return the query plan as a string (pass `optimized=False` for the unoptimized plan).\n", + "- `show_graph()`: Render the query plan as a graph (pass `optimized=False` for the unoptimized plan).\n", + "\n", + "**Optimization**\n", + "\n", + "- `cache()`: Cache intermediate results for faster access.\n", + "- Query optimizations such as predicate and projection pushdown are applied automatically when `collect()` runs.\n", + "\n", + "The lazy API allows building complex queries that are optimized before execution, enabling better
performance for large datasets." + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "f183fa43-99e2-45dc-930c-44fee40d9135", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "shape: (5, 6)\n", + "┌─────┬─────┬──────────────────────────┬───────────┬───────────┬───────┐\n", + "│ age ┆ sex ┆ county ┆ lat ┆ long ┆ name │\n", + "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ i64 ┆ str ┆ str ┆ f64 ┆ f64 ┆ str │\n", + "╞═════╪═════╪══════════════════════════╪═══════════╪═══════════╪═══════╡\n", + "│ 1 ┆ f ┆ EAST RIDING OF YORKSHIRE ┆ 53.737344 ┆ -0.638535 ┆ ZYRAH │\n", + "│ 1 ┆ f ┆ SHEFFIELD ┆ 53.35529 ┆ -1.669447 ┆ ZYRAH │\n", + "│ 1 ┆ f ┆ LINCOLNSHIRE ┆ 53.164176 ┆ 0.015812 ┆ ZYRAH │\n", + "│ 1 ┆ f ┆ WORCESTERSHIRE ┆ 52.258629 ┆ -2.31696 ┆ ZYRAH │\n", + "│ 1 ┆ f ┆ HERTFORDSHIRE ┆ 51.731816 ┆ -0.377476 ┆ ZYRAH │\n", + "└─────┴─────┴──────────────────────────┴───────────┴───────────┴───────┘\n", + "Time Taken: 6.6778 seconds\n" + ] + } + ], + "source": [ + "import polars as pl\n", + "import time\n", + "\n", + "start_time = time.time()\n", + "\n", + "# Create a lazy DataFrame\n", + "lazy_df = pl.scan_csv('./data/uk_pop.csv')\n", + "\n", + "# Define the lazy operations\n", + "lazy_result = (\n", + " lazy_df\n", + " .filter(pl.col('age') > 0.0)\n", + " .sort('name', descending=True)\n", + ")\n", + "\n", + "# Execute the lazy query and collect the results\n", + "result = lazy_result.collect()\n", + "\n", + "print(result.head())\n", + "polars_time = time.time() - start_time\n", + "print(f\"Time Taken: {polars_time:.4f} seconds\")" + ] + }, + { + "cell_type": "markdown", + "id": "0ae1e01f-da85-453f-a708-37afca4753f6", + "metadata": {}, + "source": [ + "### Execution Graph ###" + ] + }, + { + "cell_type": "markdown", + "id": "4affd1a7-80b2-4f9b-88d8-26a7008f6dc9", + "metadata": {}, + "source": [ + "Let's see how the unoptimized execution graph looks." 
 + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "d43e9843-ef8e-48aa-b49c-482454d6e93f", + "metadata": {}, + "outputs": [ + { + "data": { + "image/svg+xml": [ + "\n", + "\n", + "polars_query\n", + "\n", + "\n", + "\n", + "p1\n", + "\n", + "SORT BY [col("name")]\n", + "\n", + "\n", + "\n", + "p2\n", + "\n", + "FILTER BY [(col("age").cast(Unknown(Float))) > (dyn float: 0.0)]\n", + "\n", + "\n", + "\n", + "p1--p2\n", + "\n", + "\n", + "\n", + "\n", + "p3\n", + "\n", + "Csv SCAN [./data/uk_pop.csv]\n", + "π */6;\n", + "\n", + "\n", + "\n", + "p2--p3\n", + "\n", + "\n", + "\n", + "" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Show unoptimized Graph\n", + "lazy_result.show_graph(optimized=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "6f5883f3-6238-4f98-970b-f06adabfb50e", + "metadata": {}, + "outputs": [ + { + "data": { + "image/svg+xml": [ + "\n", + "\n", + "polars_query\n", + "\n", + "\n", + "\n", + "p1\n", + "\n", + "SORT BY [col("name")]\n", + "\n", + "\n", + "\n", + "p2\n", + "\n", + "Csv SCAN [./data/uk_pop.csv]\n", + "π */6;\n", + "σ [(col("age").cast(Unknown(Float))) > (dyn float: 0.0)]\n", + "\n", + "\n", + "\n", + "p1--p2\n", + "\n", + "\n", + "\n", + "" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Show optimized Graph\n", + "lazy_result.show_graph(optimized=True)" + ] + }, + { + "cell_type": "markdown", + "id": "7d0f6912-9b19-4473-80d3-4223271efe26", + "metadata": {}, + "source": [ + "As we can see, in the optimized plan Polars pushed the age filter into the CSV scan (predicate pushdown, shown as σ in the scan node), so rows are filtered while the file is being read instead of in a separate step afterwards. These types of optimizations are part of the reason why Polars is such a powerful data science tool."
+ ] + }, + { + "cell_type": "markdown", + "id": "5301bcfb-634e-4659-b413-e8279c9bc2ce", + "metadata": {}, + "source": [ + "### Exercise #5 - Creating a Lazy Dataframe ###\n", + "First, let's load the csv as a lazy dataframe." + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "f096b49a-1395-4ddb-bf28-f6e04fb8b4f1", + "metadata": {}, + "outputs": [], + "source": [ + "lazy_df = pl.scan_csv('./data/uk_pop.csv')" + ] + }, + { + "cell_type": "raw", + "id": "c43adefb-2176-4a9b-9217-76260033c29a", + "metadata": {}, + "source": [ + "\n", + "lazy_df = pl.scan_csv('./data/uk_pop.csv')" + ] + }, + { + "cell_type": "markdown", + "id": "5e287242-6d01-4848-a795-3f434ae1901a", + "metadata": {}, + "source": [ + "Click ... for solution. " + ] + }, + { + "cell_type": "markdown", + "id": "3581e118-7902-4d04-b4b4-9887c6ff73ba", + "metadata": {}, + "source": [ + "### Exercise #6 - Query Creation ###\n", + "Now, let's create a query to find the 5 most common names for individuals under 30. " + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "1615956e-c6b5-485c-85d3-c0cf89cedbca", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "shape: (5, 2)\n", + "┌────────┬────────┐\n", + "│ name ┆ count │\n", + "│ --- ┆ --- │\n", + "│ str ┆ u32 │\n", + "╞════════╪════════╡\n", + "│ OLIVER ┆ 218505 │\n", + "│ GEORGE ┆ 174261 │\n", + "│ HARRY ┆ 173862 │\n", + "│ OLIVIA ┆ 171424 │\n", + "│ AMELIA ┆ 163302 │\n", + "└────────┴────────┘\n" + ] + } + ], + "source": [ + "lazy_result = (\n", + " lazy_df\n", + " .filter(pl.col('age') < 30.0)\n", + " .group_by(\"name\")\n", + " .agg(pl.len().alias(\"count\"))\n", + " .sort('count', descending=True)\n", + " .limit(5)\n", + " .select([\"name\",\"count\"])\n", + ")\n", + "\n", + "top=lazy_result.collect()\n", + "print(top)" + ] + }, + { + "cell_type": "raw", + "id": "fef093ee-a9e5-44df-9c10-d7514dcc9eea", + "metadata": {}, + "source": [ + "\n", + "result = (\n", + " 
lazy_df.filter(pl.col(\"age\") < 30)\n", + " .group_by(\"name\")\n", + " .agg(pl.len().alias(\"count\"))\n", + " .sort(\"count\", descending=True)\n", + " .limit(5)\n", + " .select([\"name\", \"count\"])\n", + ")\n", + "\n", + "top_5_names=result.collect()\n", + "print(top_5_names)" + ] + }, + { + "cell_type": "markdown", + "id": "e2832dcd-9f2e-4970-932b-8d0732f9c4ab", + "metadata": {}, + "source": [ + "Click ... for solution. " + ] + }, + { + "cell_type": "markdown", + "id": "07cf331d-0e3d-407c-9990-c7c0873565da", + "metadata": {}, + "source": [ + "## cuDF Polars ##\n", + "cuDF Polars is built directly into the Polars Lazy API. The only requirement is to pass engine=\"gpu\" to the collect operation. Polars also allows defining an instance of the GPU engine for greater customization!" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "id": "9c6506b1-5c96-4b43-9d53-5826b5a8a44d", + "metadata": {}, + "outputs": [], + "source": [ + "lazy_df = pl.scan_csv('./data/uk_pop.csv').collect(engine=\"gpu\")" + ] + }, + { + "cell_type": "markdown", + "id": "676d16d0-9da0-48a1-a9cc-dc60bde2b223", + "metadata": {}, + "source": [ + "Now let's try defining our own engine object!" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "id": "66672414-a47d-48c4-9ad9-5329ae159765", + "metadata": {}, + "outputs": [], + "source": [ + "import polars as pl\n", + "import time\n", + "\n", + "gpu_engine = pl.GPUEngine(\n", + " device=0, # This is the default\n", + " raise_on_fail=True, # Fail loudly if we can't run on the GPU.\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "id": "a939efd7-6c74-456b-baf0-8a2d9905956f", + "metadata": {}, + "outputs": [], + "source": [ + "lazy_df = pl.scan_csv('./data/uk_pop.csv').collect(engine=gpu_engine)" + ] + }, + { + "cell_type": "markdown", + "id": "7eafdfb6-92b0-4f0e-a2ea-90a2ee8a222c", + "metadata": {}, + "source": [ + "Now that the GPU is warmed up, let's try accelerating the same code as before! 
Notice that we added an engine parameter to the collect call." + ] + }, + { + "cell_type": "markdown", + "id": "52995c50-bf64-44be-a22e-ba41c778f509", + "metadata": {}, + "source": [ + "### Accelerate Previous Code ###" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "id": "69e5074d-d15b-471a-a9dd-e1f7a52013a5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "shape: (5, 6)\n", + "┌─────┬─────┬──────────────────────────┬───────────┬───────────┬───────┐\n", + "│ age ┆ sex ┆ county ┆ lat ┆ long ┆ name │\n", + "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ i64 ┆ str ┆ str ┆ f64 ┆ f64 ┆ str │\n", + "╞═════╪═════╪══════════════════════════╪═══════════╪═══════════╪═══════╡\n", + "│ 1 ┆ f ┆ EAST RIDING OF YORKSHIRE ┆ 53.737344 ┆ -0.638535 ┆ ZYRAH │\n", + "│ 1 ┆ f ┆ SHEFFIELD ┆ 53.35529 ┆ -1.669447 ┆ ZYRAH │\n", + "│ 1 ┆ f ┆ LINCOLNSHIRE ┆ 53.164176 ┆ 0.015812 ┆ ZYRAH │\n", + "│ 1 ┆ f ┆ WORCESTERSHIRE ┆ 52.258629 ┆ -2.31696 ┆ ZYRAH │\n", + "│ 1 ┆ f ┆ HERTFORDSHIRE ┆ 51.731816 ┆ -0.377476 ┆ ZYRAH │\n", + "└─────┴─────┴──────────────────────────┴───────────┴───────────┴───────┘\n", + "Time Taken: 5.9006 seconds\n" + ] + } + ], + "source": [ + "start_time = time.time()\n", + "\n", + "# Create a lazy DataFrame\n", + "lazy_df = pl.scan_csv('./data/uk_pop.csv')\n", + "\n", + "# Define the lazy operations\n", + "lazy_result = (\n", + " lazy_df\n", + " .filter(pl.col('age') > 0.0)\n", + " .sort('name', descending=True)\n", + ")\n", + "\n", + "# Switch to gpu_engine\n", + "result = lazy_result.collect(engine=gpu_engine)\n", + "\n", + "print(result.head())\n", + "polars_time = time.time() - start_time\n", + "print(f\"Time Taken: {polars_time:.4f} seconds\")" + ] + }, + { + "cell_type": "markdown", + "id": "025606e3-9d16-416b-9501-b4cf702fb317", + "metadata": {}, + "source": [ + "### Verify Results Across Engines ###\n", + "How do we know the results are the same with both the CPU and GPU engine? 
 Luckily with Polars, we can execute the same query on both engines and compare the results using the built-in testing module! " + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "id": "759ad96e-6bc1-4c79-bc50-ab89430fdb42", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The test frames are equal\n" + ] + } + ], + "source": [ + "from polars.testing import assert_frame_equal\n", + "\n", + "# Run on the CPU\n", + "result_cpu = lazy_result.collect()\n", + "\n", + "# Run on the GPU\n", + "result_gpu = lazy_result.collect(engine=\"gpu\")\n", + "\n", + "# Assert both results are equal - raises AssertionError if not equal, returns None otherwise\n", + "if assert_frame_equal(result_gpu, result_cpu) is None:\n", + " print(\"The test frames are equal\")" + ] + }, + { + "cell_type": "markdown", + "id": "5b4325cd-0e2d-4cae-8f71-83967708c7b9", + "metadata": {}, + "source": [ + "### Fallback ###\n", + "What happens when an operation isn't supported? " + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "id": "30401f1d-45ad-4365-a00f-e5e79e01412f", + "metadata": {}, + "outputs": [ + { + "ename": "ComputeError", + "evalue": "'cuda' conversion failed: NotImplementedError: ('Query execution with GPU not possible: unsupported operations.\\nThe errors were:\\n- NotImplementedError: rolling mean', [NotImplementedError('rolling mean')])", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mComputeError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[0;32mIn[47], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m result \u001b[38;5;241m=\u001b[39m (\n\u001b[0;32m----> 2\u001b[0m \u001b[43mlazy_df\u001b[49m\n\u001b[1;32m 3\u001b[0m \u001b[43m 
\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mwith_columns\u001b[49m\u001b[43m(\u001b[49m\u001b[43mpl\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mcol\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mage\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mrolling_mean\u001b[49m\u001b[43m(\u001b[49m\u001b[43mwindow_size\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;241;43m7\u001b[39;49m\u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43malias\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mage_rolling_mean\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m)\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 4\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfilter\u001b[49m\u001b[43m(\u001b[49m\u001b[43mpl\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mcol\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mage\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m)\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m>\u001b[39;49m\u001b[43m \u001b[49m\u001b[38;5;241;43m0.0\u001b[39;49m\u001b[43m)\u001b[49m\u001b[43m \u001b[49m\n\u001b[1;32m 5\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mcollect\u001b[49m\u001b[43m(\u001b[49m\u001b[43mengine\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mgpu_engine\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 6\u001b[0m )\n\u001b[1;32m 7\u001b[0m \u001b[38;5;28mprint\u001b[39m(result[::\u001b[38;5;241m7\u001b[39m])\n", + "File \u001b[0;32m/opt/conda/lib/python3.10/site-packages/polars/lazyframe/frame.py:2029\u001b[0m, in \u001b[0;36mLazyFrame.collect\u001b[0;34m(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, collapse_joins, no_optimization, streaming, engine, background, _eager, 
**_kwargs)\u001b[0m\n\u001b[1;32m 2027\u001b[0m \u001b[38;5;66;03m# Only for testing purposes\u001b[39;00m\n\u001b[1;32m 2028\u001b[0m callback \u001b[38;5;241m=\u001b[39m _kwargs\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mpost_opt_callback\u001b[39m\u001b[38;5;124m\"\u001b[39m, callback)\n\u001b[0;32m-> 2029\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m wrap_df(\u001b[43mldf\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mcollect\u001b[49m\u001b[43m(\u001b[49m\u001b[43mcallback\u001b[49m\u001b[43m)\u001b[49m)\n", + "\u001b[0;31mComputeError\u001b[0m: 'cuda' conversion failed: NotImplementedError: ('Query execution with GPU not possible: unsupported operations.\\nThe errors were:\\n- NotImplementedError: rolling mean', [NotImplementedError('rolling mean')])" + ] + } + ], + "source": [ + "result = (\n", + " lazy_df\n", + " .with_columns(pl.col('age').rolling_mean(window_size=7).alias('age_rolling_mean'))\n", + " .filter(pl.col('age') > 0.0) \n", + " .collect(engine=gpu_engine)\n", + ")\n", + "print(result[::7])" + ] + }, + { + "cell_type": "markdown", + "id": "bad531f3-ecfc-4214-b3c7-6506a3321e11", + "metadata": {}, + "source": [ + "We initially constructed the GPU engine with raise_on_fail=True to ensure all operations ran on the GPU. As the error shows, the rolling mean operation is not currently supported, so the whole query fails instead of executing. To enable fallback to the CPU, we can simply change the raise_on_fail parameter to False." + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "id": "96b95c34-d3ea-494a-95b9-5d2fd8d49938", + "metadata": {}, + "outputs": [], + "source": [ + "gpu_engine_with_fallback = pl.GPUEngine(\n", + " device=0, # This is the default\n", + " raise_on_fail=False, # Fallback to CPU if we can't run on the GPU (this is the default)\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "a7e15a86-dad3-4b71-8ccb-c03b61a35e3c", + "metadata": {}, + "source": [ + "Now let's try this query again." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 49, + "id": "73c30f3c-4075-44aa-8715-648b49fec1e9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "shape: (8_259_508, 7)\n", + "┌─────┬─────┬────────────┬───────────┬───────────┬───────────┬──────────────────┐\n", + "│ age ┆ sex ┆ county ┆ lat ┆ long ┆ name ┆ age_rolling_mean │\n", + "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ i64 ┆ str ┆ str ┆ f64 ┆ f64 ┆ str ┆ f64 │\n", + "╞═════╪═════╪════════════╪═══════════╪═══════════╪═══════════╪══════════════════╡\n", + "│ 1 ┆ m ┆ DARLINGTON ┆ 54.580675 ┆ -1.51359 ┆ PHILIP ┆ 0.142857 │\n", + "│ 1 ┆ m ┆ DARLINGTON ┆ 54.589555 ┆ -1.533749 ┆ SCOTT ┆ 1.0 │\n", + "│ 1 ┆ m ┆ DARLINGTON ┆ 54.526772 ┆ -1.557881 ┆ ISAAC ┆ 1.0 │\n", + "│ 1 ┆ m ┆ DARLINGTON ┆ 54.617086 ┆ -1.557996 ┆ SEBASTIAN ┆ 1.0 │\n", + "│ 1 ┆ m ┆ DARLINGTON ┆ 54.551081 ┆ -1.50386 ┆ FINN ┆ 1.0 │\n", + "│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │\n", + "│ 90 ┆ f ┆ NEWPORT ┆ 51.589194 ┆ -2.825451 ┆ FREYA ┆ 90.0 │\n", + "│ 90 ┆ f ┆ NEWPORT ┆ 51.51582 ┆ -2.839532 ┆ EMERSON ┆ 90.0 │\n", + "│ 90 ┆ f ┆ NEWPORT ┆ 51.569834 ┆ -2.803327 ┆ THEA ┆ 90.0 │\n", + "│ 90 ┆ f ┆ NEWPORT ┆ 51.586575 ┆ -2.799302 ┆ ELIN ┆ 90.0 │\n", + "│ 90 ┆ f ┆ NEWPORT ┆ 51.554649 ┆ -2.934364 ┆ JESSICA ┆ 90.0 │\n", + "└─────┴─────┴────────────┴───────────┴───────────┴───────────┴──────────────────┘\n" + ] + } + ], + "source": [ + "result = (\n", + " lazy_df\n", + " .with_columns(pl.col('age').rolling_mean(window_size=7).alias('age_rolling_mean'))\n", + " .filter(pl.col('age') > 0.0) \n", + " .collect(engine=gpu_engine_with_fallback)\n", + ")\n", + "print(result[::7])" + ] + }, + { + "cell_type": "markdown", + "id": "19be5382-ba87-4bc3-8862-df73049598f7", + "metadata": {}, + "source": [ + "### Exercise #7 - Enable GPU Engine ###\n", + "The below code calculates the average latitude and longitude for each county. Let's try enabling the GPU Engine for this query!" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "94243bde-3a3a-4ce9-887c-13d0b0875750", + "metadata": {}, + "outputs": [], + "source": [ + "# Create the lazy query with column pruning\n", + "gpu_engine = pl.GPUEngine(\n", + " device=0, # This is the default\n", + " raise_on_fail=True, # Fail loudly if we can't run on the GPU.\n", + ")\n", + "\n", + "\n", + "lazy_query = (\n", + " lazy_df\n", + " .select([\"county\", \"lat\", \"long\"]) # Column pruning: select only necessary columns\n", + " .group_by(\"county\")\n", + " .agg([\n", + " pl.col(\"lat\").mean().alias(\"avg_latitude\"),\n", + " pl.col(\"long\").mean().alias(\"avg_longitude\")\n", + " ])\n", + " .sort(\"county\")\n", + ")\n", + "\n", + "# Execute the query\n", + "result = lazy_query.collect(engine=gpu_engine)\n", + "\n", + "print(\"\\nAverage latitude and longitude for each county:\")\n", + "print(result.head()) # Display first few rows" + ] + }, + { + "cell_type": "raw", + "id": "42acb118-cb9d-478f-9636-694a5ddcb071", + "metadata": {}, + "source": [ + "\n", + "# Create the lazy query with column pruning\n", + "lazy_query = (\n", + " lazy_df\n", + " .select([\"county\", \"lat\", \"long\"]) # Column pruning: select only necessary columns\n", + " .group_by(\"county\")\n", + " .agg([\n", + " pl.col(\"lat\").mean().alias(\"avg_latitude\"),\n", + " pl.col(\"long\").mean().alias(\"avg_longitude\")\n", + " ])\n", + " .sort(\"county\")\n", + ")\n", + "\n", + "# Execute the query\n", + "result = lazy_query.collect(engine=\"gpu\")\n", + "\n", + "print(\"\\nAverage latitude and longitude for each county:\")\n", + "print(result.head()) # Display first few rows" + ] + }, + { + "cell_type": "markdown", + "id": "b942cf78-b3cc-4fa1-8d62-b355e20dd2ae", + "metadata": {}, + "source": [ + "Click ... for solution. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cf1158c3-8429-4637-a9c2-e8ca91a59965", + "metadata": {}, + "outputs": [], + "source": [ + "import IPython\n", + "app = IPython.Application.instance()\n", + "app.kernel.do_shutdown(True)" + ] + }, + { + "cell_type": "markdown", + "id": "cdecd196-8e0d-4b36-a428-4411bd480778", + "metadata": {}, + "source": [ + "**Well Done!**" + ] + }, + { + "cell_type": "markdown", + "id": "ed99fbcd-6e89-401f-b0cc-0b4fe3f112ad", + "metadata": {}, + "source": [ + "" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.15" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} -- cgit v1.2.3