{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Reshape Data in Polars Efficiently from Wide to Long Form - Part I" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Introduction**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You've got data in a wide form (lots of columns), and you wish to transform it into a long form (fewer columns, more rows), to make it tidy, or for ease of analysis. In polars, this is achieved with the [unpivot](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.unpivot.html#polars.DataFrame.unpivot) method. There are scenarios however where some processing has to be done before and/or after unpivoting. This article shows how to reshape data for such scenarios.\n", "\n", "Most of the examples here are based on the [tidyr pivoting vignette](https://tidyr.tidyverse.org/articles/pivot.html#longer). \n", "\n", "Polars' [eager API](https://docs.pola.rs/api/python/stable/reference/dataframe/index.html) is used in the examples below; for more performance, or in production mode, it is recommended to use the [lazy API](https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "polars version : 1.7.1\n", "python version : 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]\n" ] } ], "source": [ "import polars as pl\n", "import polars.selectors as cs\n", "import sys\n", "\n", "print(\"polars version :\", pl.__version__)\n", "print(\"python version :\", sys.version)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(String-data-in-column-names)=\n", "## **String data in column names**" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (18, 11)
religion<$10k$10-20k$20-30k$30-40k$40-50k$50-75k$75-100k$100-150k>150kDon't know/refused
stri64i64i64i64i64i64i64i64i64i64
"Agnostic"27346081761371221098496
"Atheist"12273752357073597476
"Buddhist"27213034335862395354
"Catholic"41861773267063811169497926331489
"Don’t know/refused"151415111035211718116
"Orthodox"13172332324738424673
"Other Christian"971113131418141218
"Other Faiths"20334046496346404171
"Other World Religions"5234273448
"Unaffiliated"217299374365341528407321258597
" ], "text/plain": [ "shape: (18, 11)\n", "┌────────────────────┬───────┬─────────┬─────────┬───┬──────────┬───────────┬───────┬──────────────┐\n", "│ religion ┆ <$10k ┆ $10-20k ┆ $20-30k ┆ … ┆ $75-100k ┆ $100-150k ┆ >150k ┆ Don't │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ know/refused │\n", "│ str ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ --- │\n", "│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ i64 │\n", "╞════════════════════╪═══════╪═════════╪═════════╪═══╪══════════╪═══════════╪═══════╪══════════════╡\n", "│ Agnostic ┆ 27 ┆ 34 ┆ 60 ┆ … ┆ 122 ┆ 109 ┆ 84 ┆ 96 │\n", "│ Atheist ┆ 12 ┆ 27 ┆ 37 ┆ … ┆ 73 ┆ 59 ┆ 74 ┆ 76 │\n", "│ Buddhist ┆ 27 ┆ 21 ┆ 30 ┆ … ┆ 62 ┆ 39 ┆ 53 ┆ 54 │\n", "│ Catholic ┆ 418 ┆ 617 ┆ 732 ┆ … ┆ 949 ┆ 792 ┆ 633 ┆ 1489 │\n", "│ Don’t know/refused ┆ 15 ┆ 14 ┆ 15 ┆ … ┆ 21 ┆ 17 ┆ 18 ┆ 116 │\n", "│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │\n", "│ Orthodox ┆ 13 ┆ 17 ┆ 23 ┆ … ┆ 38 ┆ 42 ┆ 46 ┆ 73 │\n", "│ Other Christian ┆ 9 ┆ 7 ┆ 11 ┆ … ┆ 18 ┆ 14 ┆ 12 ┆ 18 │\n", "│ Other Faiths ┆ 20 ┆ 33 ┆ 40 ┆ … ┆ 46 ┆ 40 ┆ 41 ┆ 71 │\n", "│ Other World ┆ 5 ┆ 2 ┆ 3 ┆ … ┆ 3 ┆ 4 ┆ 4 ┆ 8 │\n", "│ Religions ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ Unaffiliated ┆ 217 ┆ 299 ┆ 374 ┆ … ┆ 407 ┆ 321 ┆ 258 ┆ 597 │\n", "└────────────────────┴───────┴─────────┴─────────┴───┴──────────┴───────────┴───────┴──────────────┘" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "relig_income = pl.read_csv(\"./Data_files/relig_income.csv\")\n", "relig_income" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> The relig_income dataset stores counts based on a survey which (among other things) asked people about their religion and annual income. 
\n", "> This dataset contains three variables:\n", "\n", "> - religion, stored in the rows,\n", "> - income spread across the column names, and\n", "> - count stored in the cell values.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can flip this into long form, and ensure each variable has its own column - this can be achieved with the [unpivot](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.unpivot.html#polars.DataFrame.unpivot) method:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (180, 3)
religionincomecount
strstri64
"Agnostic""<$10k"27
"Atheist""<$10k"12
"Buddhist""<$10k"27
"Catholic""<$10k"418
"Don’t know/refused""<$10k"15
"Orthodox""Don't know/refused"73
"Other Christian""Don't know/refused"18
"Other Faiths""Don't know/refused"71
"Other World Religions""Don't know/refused"8
"Unaffiliated""Don't know/refused"597
" ], "text/plain": [ "shape: (180, 3)\n", "┌───────────────────────┬────────────────────┬───────┐\n", "│ religion ┆ income ┆ count │\n", "│ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ i64 │\n", "╞═══════════════════════╪════════════════════╪═══════╡\n", "│ Agnostic ┆ <$10k ┆ 27 │\n", "│ Atheist ┆ <$10k ┆ 12 │\n", "│ Buddhist ┆ <$10k ┆ 27 │\n", "│ Catholic ┆ <$10k ┆ 418 │\n", "│ Don’t know/refused ┆ <$10k ┆ 15 │\n", "│ … ┆ … ┆ … │\n", "│ Orthodox ┆ Don't know/refused ┆ 73 │\n", "│ Other Christian ┆ Don't know/refused ┆ 18 │\n", "│ Other Faiths ┆ Don't know/refused ┆ 71 │\n", "│ Other World Religions ┆ Don't know/refused ┆ 8 │\n", "│ Unaffiliated ┆ Don't know/refused ┆ 597 │\n", "└───────────────────────┴────────────────────┴───────┘" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "relig_income.unpivot(index=\"religion\", variable_name=\"income\", value_name=\"count\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No preprocessing required here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Numeric data in column names**" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (317, 81)
yearartisttracktimedate.enteredwk1wk2wk3wk4wk5wk6wk7wk8wk9wk10wk11wk12wk13wk14wk15wk16wk17wk18wk19wk20wk21wk22wk23wk24wk25wk26wk27wk28wk29wk30wk31wk32wk40wk41wk42wk43wk44wk45wk46wk47wk48wk49wk50wk51wk52wk53wk54wk55wk56wk57wk58wk59wk60wk61wk62wk63wk64wk65wk66wk67wk68wk69wk70wk71wk72wk73wk74wk75wk76
i64strstrstrstri32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32i32
2000"2 Pac""Baby Don't Cry (Keep...""4:22""2000-02-26"87827277879499nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
2000"2Ge+her""The Hardest Part Of ...""3:15""2000-09-02"918792nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
2000"3 Doors Down""Kryptonite""3:53""2000-04-08"817068676657545351515151474438282218181412766655444431514131416172122242833424249nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
2000"3 Doors Down""Loser""4:24""2000-10-21"7676726967655559626161596166727675677370nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
2000"504 Boyz""Wobble Wobble""3:35""2000-04-15"573425171731364953576470757678859296nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
2000"Yankee Grey""Another Nine Minutes""3:10""2000-04-29"8683777483798895nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
2000"Yearwood, Trisha""Real Live Woman""3:55""2000-04-01"858383828191nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
2000"Ying Yang Twins""Whistle While You Tw...""4:19""2000-03-18"9594918584787478858997969999nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
2000"Zombie Nation""Kernkraft 400""3:30""2000-09-02"9999nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
2000"matchbox twenty""Bent""4:12""2000-04-29"60372924222118161312861232234544691213192020242928nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
" ], "text/plain": [ "shape: (317, 81)\n", "┌──────┬──────────────────┬─────────────────────────┬──────┬───┬──────┬──────┬──────┬──────┐\n", "│ year ┆ artist ┆ track ┆ time ┆ … ┆ wk73 ┆ wk74 ┆ wk75 ┆ wk76 │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ i64 ┆ str ┆ str ┆ str ┆ ┆ i32 ┆ i32 ┆ i32 ┆ i32 │\n", "╞══════╪══════════════════╪═════════════════════════╪══════╪═══╪══════╪══════╪══════╪══════╡\n", "│ 2000 ┆ 2 Pac ┆ Baby Don't Cry (Keep... ┆ 4:22 ┆ … ┆ null ┆ null ┆ null ┆ null │\n", "│ 2000 ┆ 2Ge+her ┆ The Hardest Part Of ... ┆ 3:15 ┆ … ┆ null ┆ null ┆ null ┆ null │\n", "│ 2000 ┆ 3 Doors Down ┆ Kryptonite ┆ 3:53 ┆ … ┆ null ┆ null ┆ null ┆ null │\n", "│ 2000 ┆ 3 Doors Down ┆ Loser ┆ 4:24 ┆ … ┆ null ┆ null ┆ null ┆ null │\n", "│ 2000 ┆ 504 Boyz ┆ Wobble Wobble ┆ 3:35 ┆ … ┆ null ┆ null ┆ null ┆ null │\n", "│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │\n", "│ 2000 ┆ Yankee Grey ┆ Another Nine Minutes ┆ 3:10 ┆ … ┆ null ┆ null ┆ null ┆ null │\n", "│ 2000 ┆ Yearwood, Trisha ┆ Real Live Woman ┆ 3:55 ┆ … ┆ null ┆ null ┆ null ┆ null │\n", "│ 2000 ┆ Ying Yang Twins ┆ Whistle While You Tw... ┆ 4:19 ┆ … ┆ null ┆ null ┆ null ┆ null │\n", "│ 2000 ┆ Zombie Nation ┆ Kernkraft 400 ┆ 3:30 ┆ … ┆ null ┆ null ┆ null ┆ null │\n", "│ 2000 ┆ matchbox twenty ┆ Bent ┆ 4:12 ┆ … ┆ null ┆ null ┆ null ┆ null │\n", "└──────┴──────────────────┴─────────────────────────┴──────┴───┴──────┴──────┴──────┴──────┘" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "billboard = pl.read_csv(\"./Data_files/billboard.csv\", null_values=\"NA\")\n", "billboard = billboard.with_columns(cs.starts_with(\"wk\").cast(pl.Int32))\n", "billboard" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">The billboard dataset records the billboard rank of songs in the year 2000. It has a form similar to the relig_income data, but the data encoded in the column names is really a number, not a string." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's unpivot the data into a tidy form:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5_307, 7)
yearartisttracktimedate.enteredweekrank
i64strstrstrstrstri32
2000"2 Pac""Baby Don't Cry (Keep...""4:22""2000-02-26""wk1"87
2000"2Ge+her""The Hardest Part Of ...""3:15""2000-09-02""wk1"91
2000"3 Doors Down""Kryptonite""3:53""2000-04-08""wk1"81
2000"3 Doors Down""Loser""4:24""2000-10-21""wk1"76
2000"504 Boyz""Wobble Wobble""3:35""2000-04-15""wk1"57
2000"Creed""Higher""5:16""1999-09-11""wk63"50
2000"Lonestar""Amazed""4:25""1999-06-05""wk63"45
2000"Creed""Higher""5:16""1999-09-11""wk64"50
2000"Lonestar""Amazed""4:25""1999-06-05""wk64"50
2000"Creed""Higher""5:16""1999-09-11""wk65"49
" ], "text/plain": [ "shape: (5_307, 7)\n", "┌──────┬──────────────┬─────────────────────────┬──────┬──────────────┬──────┬──────┐\n", "│ year ┆ artist ┆ track ┆ time ┆ date.entered ┆ week ┆ rank │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ i64 ┆ str ┆ str ┆ str ┆ str ┆ str ┆ i32 │\n", "╞══════╪══════════════╪═════════════════════════╪══════╪══════════════╪══════╪══════╡\n", "│ 2000 ┆ 2 Pac ┆ Baby Don't Cry (Keep... ┆ 4:22 ┆ 2000-02-26 ┆ wk1 ┆ 87 │\n", "│ 2000 ┆ 2Ge+her ┆ The Hardest Part Of ... ┆ 3:15 ┆ 2000-09-02 ┆ wk1 ┆ 91 │\n", "│ 2000 ┆ 3 Doors Down ┆ Kryptonite ┆ 3:53 ┆ 2000-04-08 ┆ wk1 ┆ 81 │\n", "│ 2000 ┆ 3 Doors Down ┆ Loser ┆ 4:24 ┆ 2000-10-21 ┆ wk1 ┆ 76 │\n", "│ 2000 ┆ 504 Boyz ┆ Wobble Wobble ┆ 3:35 ┆ 2000-04-15 ┆ wk1 ┆ 57 │\n", "│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │\n", "│ 2000 ┆ Creed ┆ Higher ┆ 5:16 ┆ 1999-09-11 ┆ wk63 ┆ 50 │\n", "│ 2000 ┆ Lonestar ┆ Amazed ┆ 4:25 ┆ 1999-06-05 ┆ wk63 ┆ 45 │\n", "│ 2000 ┆ Creed ┆ Higher ┆ 5:16 ┆ 1999-09-11 ┆ wk64 ┆ 50 │\n", "│ 2000 ┆ Lonestar ┆ Amazed ┆ 4:25 ┆ 1999-06-05 ┆ wk64 ┆ 50 │\n", "│ 2000 ┆ Creed ┆ Higher ┆ 5:16 ┆ 1999-09-11 ┆ wk65 ┆ 49 │\n", "└──────┴──────────────┴─────────────────────────┴──────┴──────────────┴──────┴──────┘" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " billboard.unpivot(\n", " index=~cs.starts_with(\"wk\"), variable_name=\"week\", value_name=\"rank\"\n", " )\n", " # irrelevant nulls\n", " .drop_nulls(subset=\"rank\")\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> It would be nice to easily determine how long each song stayed in the charts, but to do that, we’ll need to convert the week variable to an integer.\n", "\n", "This is where the post processing comes into play after the unpivoting - extract the number from the string column, and cast the extracted column to an integer dtype:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5_307, 7)
yearartisttracktimedate.enteredweekrank
i64strstrstrstri16i32
2000"2 Pac""Baby Don't Cry (Keep...""4:22""2000-02-26"187
2000"2Ge+her""The Hardest Part Of ...""3:15""2000-09-02"191
2000"3 Doors Down""Kryptonite""3:53""2000-04-08"181
2000"3 Doors Down""Loser""4:24""2000-10-21"176
2000"504 Boyz""Wobble Wobble""3:35""2000-04-15"157
2000"Creed""Higher""5:16""1999-09-11"350
2000"Lonestar""Amazed""4:25""1999-06-05"345
2000"Creed""Higher""5:16""1999-09-11"450
2000"Lonestar""Amazed""4:25""1999-06-05"450
2000"Creed""Higher""5:16""1999-09-11"549
" ], "text/plain": [ "shape: (5_307, 7)\n", "┌──────┬──────────────┬─────────────────────────┬──────┬──────────────┬──────┬──────┐\n", "│ year ┆ artist ┆ track ┆ time ┆ date.entered ┆ week ┆ rank │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ i64 ┆ str ┆ str ┆ str ┆ str ┆ i16 ┆ i32 │\n", "╞══════╪══════════════╪═════════════════════════╪══════╪══════════════╪══════╪══════╡\n", "│ 2000 ┆ 2 Pac ┆ Baby Don't Cry (Keep... ┆ 4:22 ┆ 2000-02-26 ┆ 1 ┆ 87 │\n", "│ 2000 ┆ 2Ge+her ┆ The Hardest Part Of ... ┆ 3:15 ┆ 2000-09-02 ┆ 1 ┆ 91 │\n", "│ 2000 ┆ 3 Doors Down ┆ Kryptonite ┆ 3:53 ┆ 2000-04-08 ┆ 1 ┆ 81 │\n", "│ 2000 ┆ 3 Doors Down ┆ Loser ┆ 4:24 ┆ 2000-10-21 ┆ 1 ┆ 76 │\n", "│ 2000 ┆ 504 Boyz ┆ Wobble Wobble ┆ 3:35 ┆ 2000-04-15 ┆ 1 ┆ 57 │\n", "│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │\n", "│ 2000 ┆ Creed ┆ Higher ┆ 5:16 ┆ 1999-09-11 ┆ 3 ┆ 50 │\n", "│ 2000 ┆ Lonestar ┆ Amazed ┆ 4:25 ┆ 1999-06-05 ┆ 3 ┆ 45 │\n", "│ 2000 ┆ Creed ┆ Higher ┆ 5:16 ┆ 1999-09-11 ┆ 4 ┆ 50 │\n", "│ 2000 ┆ Lonestar ┆ Amazed ┆ 4:25 ┆ 1999-06-05 ┆ 4 ┆ 50 │\n", "│ 2000 ┆ Creed ┆ Higher ┆ 5:16 ┆ 1999-09-11 ┆ 5 ┆ 49 │\n", "└──────┴──────────────┴─────────────────────────┴──────┴──────────────┴──────┴──────┘" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " billboard.unpivot(\n", " index=~cs.starts_with(\"wk\"), variable_name=\"week\", value_name=\"rank\"\n", " )\n", " # irrelevant nulls\n", " .drop_nulls(subset=\"rank\").with_columns(\n", " week=pl.col(\"week\").str.slice(-1).cast(pl.Int16)\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Many variables in column names**" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (7_240, 60)
countryiso2iso3yearnew_sp_m014new_sp_m1524new_sp_m2534new_sp_m3544new_sp_m4554new_sp_m5564new_sp_m65new_sp_f014new_sp_f1524new_sp_f2534new_sp_f3544new_sp_f4554new_sp_f5564new_sp_f65new_sn_m014new_sn_m1524new_sn_m2534new_sn_m3544new_sn_m4554new_sn_m5564new_sn_m65new_sn_f014new_sn_f1524new_sn_f2534new_sn_f3544new_sn_f4554new_sn_f5564new_sn_f65new_ep_m014new_ep_m1524new_ep_m2534new_ep_m3544new_ep_m4554new_ep_m5564new_ep_m65new_ep_f014new_ep_f1524new_ep_f2534new_ep_f3544new_ep_f4554new_ep_f5564new_ep_f65newrel_m014newrel_m1524newrel_m2534newrel_m3544newrel_m4554newrel_m5564newrel_m65newrel_f014newrel_f1524newrel_f2534newrel_f3544newrel_f4554newrel_f5564newrel_f65
strstrstri64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64i64
"Afghanistan""AF""AFG"1980nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
"Afghanistan""AF""AFG"1981nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
"Afghanistan""AF""AFG"1982nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
"Afghanistan""AF""AFG"1983nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
"Afghanistan""AF""AFG"1984nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
"Zimbabwe""ZW""ZWE"2009125578null3471681293192180873null30284192291261560860null6496165588286114251334null702315517295142442660192249123122321039401944438182138nullnullnullnullnullnullnullnullnullnullnullnullnullnull
"Zimbabwe""ZW""ZWE"201015071022081682761350252173974218512834902651711826821334232701545882864173212824013285113777895632702439028684182291922203191058677338181146nullnullnullnullnullnullnullnullnullnullnullnullnullnull
"Zimbabwe""ZW""ZWE"2011152784246720717803772781741084216113864482741601364596247328131264702728127194727542216962587495250195746796342172172209318802640284137129nullnullnullnullnullnullnullnullnullnullnullnullnullnull
"Zimbabwe""ZW""ZWE"201212078324212086796360271173939205312864832311611169613230226571154708796100888822871957829516432233214658789331178182208319710579228140143nullnullnullnullnullnullnullnullnullnullnullnullnullnull
"Zimbabwe""ZW""ZWE"2013nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull131516425331536323491206120812522069464935261453811725
" ], "text/plain": [ "shape: (7_240, 60)\n", "┌─────────────┬──────┬──────┬──────┬───┬──────────────┬──────────────┬──────────────┬────────────┐\n", "│ country ┆ iso2 ┆ iso3 ┆ year ┆ … ┆ newrel_f3544 ┆ newrel_f4554 ┆ newrel_f5564 ┆ newrel_f65 │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │\n", "╞═════════════╪══════╪══════╪══════╪═══╪══════════════╪══════════════╪══════════════╪════════════╡\n", "│ Afghanistan ┆ AF ┆ AFG ┆ 1980 ┆ … ┆ null ┆ null ┆ null ┆ null │\n", "│ Afghanistan ┆ AF ┆ AFG ┆ 1981 ┆ … ┆ null ┆ null ┆ null ┆ null │\n", "│ Afghanistan ┆ AF ┆ AFG ┆ 1982 ┆ … ┆ null ┆ null ┆ null ┆ null │\n", "│ Afghanistan ┆ AF ┆ AFG ┆ 1983 ┆ … ┆ null ┆ null ┆ null ┆ null │\n", "│ Afghanistan ┆ AF ┆ AFG ┆ 1984 ┆ … ┆ null ┆ null ┆ null ┆ null │\n", "│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │\n", "│ Zimbabwe ┆ ZW ┆ ZWE ┆ 2009 ┆ … ┆ null ┆ null ┆ null ┆ null │\n", "│ Zimbabwe ┆ ZW ┆ ZWE ┆ 2010 ┆ … ┆ null ┆ null ┆ null ┆ null │\n", "│ Zimbabwe ┆ ZW ┆ ZWE ┆ 2011 ┆ … ┆ null ┆ null ┆ null ┆ null │\n", "│ Zimbabwe ┆ ZW ┆ ZWE ┆ 2012 ┆ … ┆ null ┆ null ┆ null ┆ null │\n", "│ Zimbabwe ┆ ZW ┆ ZWE ┆ 2013 ┆ … ┆ 3526 ┆ 1453 ┆ 811 ┆ 725 │\n", "└─────────────┴──────┴──────┴──────┴───┴──────────────┴──────────────┴──────────────┴────────────┘" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "who = pl.read_csv(\"./Data_files/who.csv\", null_values=\"NA\")\n", "who" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> A more challenging situation occurs when you have multiple variables crammed into the column names. For example, take the who dataset above.\n", ">\n", ">- country, iso2, iso3, and year are already variables, so they can be left as is. But the columns from new_sp_m014 to newrel_f65 encode four variables in their names:\n", ">\n", ">- the new_/new prefix indicates these are counts of new cases. 
This dataset only contains new cases, so we’ll ignore it here because it’s constant.\n", ">\n", ">- sp/rel/ep describe how the case was diagnosed.\n", ">\n", ">- m/f gives the gender.\n", ">\n", ">- 014/1524/2534/3544/4554/5564/65 supplies the age range." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sticking to the same pattern as [before](#String-data-in-column-names), we unpivot into long form; however, after unpivoting, we extract the relevant fields into new columns using Polars' [efficient string methods](https://docs.pola.rs/api/python/stable/reference/series/string.html) - for this use case a regex is a natural fit:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (405_440, 8)
countryiso2iso3yeardiagnosisgenderagecount
strstrstri64strstrstri64
"Afghanistan""AF""AFG"1980"sp""m""014"null
"Afghanistan""AF""AFG"1981"sp""m""014"null
"Afghanistan""AF""AFG"1982"sp""m""014"null
"Afghanistan""AF""AFG"1983"sp""m""014"null
"Afghanistan""AF""AFG"1984"sp""m""014"null
"Zimbabwe""ZW""ZWE"2009"rel""f""65"null
"Zimbabwe""ZW""ZWE"2010"rel""f""65"null
"Zimbabwe""ZW""ZWE"2011"rel""f""65"null
"Zimbabwe""ZW""ZWE"2012"rel""f""65"null
"Zimbabwe""ZW""ZWE"2013"rel""f""65"725
" ], "text/plain": [ "shape: (405_440, 8)\n", "┌─────────────┬──────┬──────┬──────┬───────────┬────────┬─────┬───────┐\n", "│ country ┆ iso2 ┆ iso3 ┆ year ┆ diagnosis ┆ gender ┆ age ┆ count │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ i64 ┆ str ┆ str ┆ str ┆ i64 │\n", "╞═════════════╪══════╪══════╪══════╪═══════════╪════════╪═════╪═══════╡\n", "│ Afghanistan ┆ AF ┆ AFG ┆ 1980 ┆ sp ┆ m ┆ 014 ┆ null │\n", "│ Afghanistan ┆ AF ┆ AFG ┆ 1981 ┆ sp ┆ m ┆ 014 ┆ null │\n", "│ Afghanistan ┆ AF ┆ AFG ┆ 1982 ┆ sp ┆ m ┆ 014 ┆ null │\n", "│ Afghanistan ┆ AF ┆ AFG ┆ 1983 ┆ sp ┆ m ┆ 014 ┆ null │\n", "│ Afghanistan ┆ AF ┆ AFG ┆ 1984 ┆ sp ┆ m ┆ 014 ┆ null │\n", "│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │\n", "│ Zimbabwe ┆ ZW ┆ ZWE ┆ 2009 ┆ rel ┆ f ┆ 65 ┆ null │\n", "│ Zimbabwe ┆ ZW ┆ ZWE ┆ 2010 ┆ rel ┆ f ┆ 65 ┆ null │\n", "│ Zimbabwe ┆ ZW ┆ ZWE ┆ 2011 ┆ rel ┆ f ┆ 65 ┆ null │\n", "│ Zimbabwe ┆ ZW ┆ ZWE ┆ 2012 ┆ rel ┆ f ┆ 65 ┆ null │\n", "│ Zimbabwe ┆ ZW ┆ ZWE ┆ 2013 ┆ rel ┆ f ┆ 65 ┆ 725 │\n", "└─────────────┴──────┴──────┴──────┴───────────┴────────┴─────┴───────┘" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex = r\"new_?(?.*)_(?.)(?.*)\"\n", "expression = pl.col(\"variable\").str.extract_groups(regex)\n", "(\n", " who.unpivot(\n", " index=[\"country\", \"iso2\", \"iso3\", \"year\"],\n", " value_name=\"count\",\n", " )\n", " .with_columns(variable=expression)\n", " .unnest(\"variable\")\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(group-related-columns)=\n", "## **Group related sets of columns**" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (2, 5)
hr1hr2teamyear1year2
i64i64stri64i64
514545"Red Sox"20072008
573526"Yankees"20072008
" ], "text/plain": [ "shape: (2, 5)\n", "┌─────┬─────┬─────────┬───────┬───────┐\n", "│ hr1 ┆ hr2 ┆ team ┆ year1 ┆ year2 │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ i64 ┆ i64 ┆ str ┆ i64 ┆ i64 │\n", "╞═════╪═════╪═════════╪═══════╪═══════╡\n", "│ 514 ┆ 545 ┆ Red Sox ┆ 2007 ┆ 2008 │\n", "│ 573 ┆ 526 ┆ Yankees ┆ 2007 ┆ 2008 │\n", "└─────┴─────┴─────────┴───────┴───────┘" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball = pl.DataFrame(\n", " {\n", " \"hr1\": [514, 573],\n", " \"hr2\": [545, 526],\n", " \"team\": [\"Red Sox\", \"Yankees\"],\n", " \"year1\": [2007, 2007],\n", " \"year2\": [2008, 2008],\n", " }\n", ")\n", "baseball" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the dataframe above, we wish to combine the years (`year1` and `year2`) into a single `year` column, and the hours(`hr1` and `hr2`) into a single `hour` column.\n", "\n", "For this use case, we have to do some processing before unpivoting - we create structs of the `year*` and `hr*` columns, then we unpivot:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (4, 3)
teamhouryear
stri64i64
"Red Sox"5142007
"Yankees"5732007
"Red Sox"5452008
"Yankees"5262008
" ], "text/plain": [ "shape: (4, 3)\n", "┌─────────┬──────┬──────┐\n", "│ team ┆ hour ┆ year │\n", "│ --- ┆ --- ┆ --- │\n", "│ str ┆ i64 ┆ i64 │\n", "╞═════════╪══════╪══════╡\n", "│ Red Sox ┆ 514 ┆ 2007 │\n", "│ Yankees ┆ 573 ┆ 2007 │\n", "│ Red Sox ┆ 545 ┆ 2008 │\n", "│ Yankees ┆ 526 ┆ 2008 │\n", "└─────────┴──────┴──────┘" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "expression = [\n", " \"team\",\n", " pl.struct(hour=\"hr1\", year=\"year1\").alias(\"1\"),\n", " pl.struct(hour=\"hr2\", year=\"year2\").alias(\"2\"),\n", "]\n", "baseball.select(expression).unpivot(index=\"team\").drop(\"variable\").unnest(\"value\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above is a variant of [multiple observations per row](#multiple-observations-per-row) which is discussed in the next section." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(multiple-observations-per-row)=\n", "## **Multiple observations per row**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> So far, we have been working with data frames that have one observation per row, but many important pivoting problems involve multiple observations per row. You can usually recognise this case because the name of the column that you want to appear in the output is part of the column name in the input. In this section, you’ll learn how to pivot this sort of data." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 5)
familydob_child1dob_child2name_child1name_child2
i64strstrstrstr
1"1998-11-26""2000-01-29""Susan""Jose"
2"1996-06-22"null"Mark"null
3"2002-07-11""2004-04-05""Sam""Seth"
4"2004-10-10""2009-08-27""Craig""Khai"
5"2000-12-05""2005-02-28""Parker""Gracie"
" ], "text/plain": [ "shape: (5, 5)\n", "┌────────┬────────────┬────────────┬─────────────┬─────────────┐\n", "│ family ┆ dob_child1 ┆ dob_child2 ┆ name_child1 ┆ name_child2 │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ i64 ┆ str ┆ str ┆ str ┆ str │\n", "╞════════╪════════════╪════════════╪═════════════╪═════════════╡\n", "│ 1 ┆ 1998-11-26 ┆ 2000-01-29 ┆ Susan ┆ Jose │\n", "│ 2 ┆ 1996-06-22 ┆ null ┆ Mark ┆ null │\n", "│ 3 ┆ 2002-07-11 ┆ 2004-04-05 ┆ Sam ┆ Seth │\n", "│ 4 ┆ 2004-10-10 ┆ 2009-08-27 ┆ Craig ┆ Khai │\n", "│ 5 ┆ 2000-12-05 ┆ 2005-02-28 ┆ Parker ┆ Gracie │\n", "└────────┴────────────┴────────────┴─────────────┴─────────────┘" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "household = pl.read_csv(\"./Data_files/household.csv\", null_values=\"NA\")\n", "household" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Note that we have two pieces of information (or values) for each child: their name and their dob (date of birth). These need to go into separate columns in the result. \n", "\n", "Steps involved:\n", "- unpivot into long form\n", "- string split into individual columns. \n", "- [pivot](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.pivot.html) the dataframe, to keep `dob` and `name` as headers" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (9, 4)
familychilddobname
i64strstrstr
1"child1""1998-11-26""Susan"
2"child1""1996-06-22""Mark"
3"child1""2002-07-11""Sam"
4"child1""2004-10-10""Craig"
5"child1""2000-12-05""Parker"
1"child2""2000-01-29""Jose"
3"child2""2004-04-05""Seth"
4"child2""2009-08-27""Khai"
5"child2""2005-02-28""Gracie"
" ], "text/plain": [ "shape: (9, 4)\n", "┌────────┬────────┬────────────┬────────┐\n", "│ family ┆ child ┆ dob ┆ name │\n", "│ --- ┆ --- ┆ --- ┆ --- │\n", "│ i64 ┆ str ┆ str ┆ str │\n", "╞════════╪════════╪════════════╪════════╡\n", "│ 1 ┆ child1 ┆ 1998-11-26 ┆ Susan │\n", "│ 2 ┆ child1 ┆ 1996-06-22 ┆ Mark │\n", "│ 3 ┆ child1 ┆ 2002-07-11 ┆ Sam │\n", "│ 4 ┆ child1 ┆ 2004-10-10 ┆ Craig │\n", "│ 5 ┆ child1 ┆ 2000-12-05 ┆ Parker │\n", "│ 1 ┆ child2 ┆ 2000-01-29 ┆ Jose │\n", "│ 3 ┆ child2 ┆ 2004-04-05 ┆ Seth │\n", "│ 4 ┆ child2 ┆ 2009-08-27 ┆ Khai │\n", "│ 5 ┆ child2 ┆ 2005-02-28 ┆ Gracie │\n", "└────────┴────────┴────────────┴────────┘" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "expression = pl.col(\"variable\").str.split(\"_\")\n", "expression = expression.list.to_struct(fields=[\"header\", \"child\"])\n", "(\n", " household.unpivot(index=\"family\")\n", " .drop_nulls(\"value\")\n", " .with_columns(variable=expression)\n", " .unnest(\"variable\")\n", " .pivot(index=[\"family\", \"child\"], on=\"header\", values=\"value\")\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the household data above, we had just one header to retain from the column names; what if we had [multiple](https://stackoverflow.com/q/64107566/7175713)?" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (3, 9)
Sony | TV | Model | valueSony | TV | Quantity | valueSony | TV | Max-quant | valuePanasonic | TV | Model | valuePanasonic | TV | Quantity | valuePanasonic | TV | Max-quant | valueSanyo | Radio | Model | valueSanyo | Radio | Quantity | valueSanyo | Radio | Max-quant | value
stri64i64stri64i64stri64i64
"A222"510"T232"110"S111"49
"A234"59"S3424"512"S1s1"29
"A4345"49"X3421"111"S1s2"410
" ], "text/plain": [ "shape: (3, 9)\n", "┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐\n", "│ Sony | TV ┆ Sony | TV ┆ Sony | TV ┆ Panasonic ┆ … ┆ Panasonic ┆ Sanyo | ┆ Sanyo | ┆ Sanyo | │\n", "│ | Model | ┆ | ┆ | ┆ | TV | ┆ ┆ | TV | ┆ Radio | ┆ Radio | ┆ Radio | │\n", "│ value ┆ Quantity ┆ Max-quant ┆ Model | ┆ ┆ Max-quant ┆ Model | ┆ Quantity ┆ Max-quan │\n", "│ --- ┆ | value ┆ | value ┆ value ┆ ┆ | v… ┆ value ┆ | val… ┆ t | va… │\n", "│ str ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ ┆ i64 ┆ i64 ┆ str ┆ ┆ i64 ┆ str ┆ i64 ┆ i64 │\n", "╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡\n", "│ A222 ┆ 5 ┆ 10 ┆ T232 ┆ … ┆ 10 ┆ S111 ┆ 4 ┆ 9 │\n", "│ A234 ┆ 5 ┆ 9 ┆ S3424 ┆ … ┆ 12 ┆ S1s1 ┆ 2 ┆ 9 │\n", "│ A4345 ┆ 4 ┆ 9 ┆ X3421 ┆ … ┆ 11 ┆ S1s2 ┆ 4 ┆ 10 │\n", "└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "manufacturer = pl.DataFrame(\n", " {\n", " \"Sony | TV | Model | value\": [\"A222\", \"A234\", \"A4345\"],\n", " \"Sony | TV | Quantity | value\": [5, 5, 4],\n", " \"Sony | TV | Max-quant | value\": [10, 9, 9],\n", " \"Panasonic | TV | Model | value\": [\"T232\", \"S3424\", \"X3421\"],\n", " \"Panasonic | TV | Quantity | value\": [1, 5, 1],\n", " \"Panasonic | TV | Max-quant | value\": [10, 12, 11],\n", " \"Sanyo | Radio | Model | value\": [\"S111\", \"S1s1\", \"S1s2\"],\n", " \"Sanyo | Radio | Quantity | value\": [4, 2, 4],\n", " \"Sanyo | Radio | Max-quant | value\": [9, 9, 10],\n", " }\n", ")\n", "\n", "manufacturer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, we stick to the same pattern as with the `household` reshaping - unpivot into long form, split into individual columns, and finally pivot the relevant column:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, 
"outputs": [ { "data": { "text/html": [ "
\n", "shape: (9, 5)
ManufacturerDeviceModelQuantityMax-quant
strstrstri16i16
"Sony""TV""A222"510
"Sony""TV""A234"59
"Sony""TV""A4345"49
"Panasonic""TV""T232"110
"Panasonic""TV""S3424"512
"Panasonic""TV""X3421"111
"Sanyo""Radio""S111"49
"Sanyo""Radio""S1s1"29
"Sanyo""Radio""S1s2"410
" ], "text/plain": [ "shape: (9, 5)\n", "┌──────────────┬────────┬───────┬──────────┬───────────┐\n", "│ Manufacturer ┆ Device ┆ Model ┆ Quantity ┆ Max-quant │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ i16 ┆ i16 │\n", "╞══════════════╪════════╪═══════╪══════════╪═══════════╡\n", "│ Sony ┆ TV ┆ A222 ┆ 5 ┆ 10 │\n", "│ Sony ┆ TV ┆ A234 ┆ 5 ┆ 9 │\n", "│ Sony ┆ TV ┆ A4345 ┆ 4 ┆ 9 │\n", "│ Panasonic ┆ TV ┆ T232 ┆ 1 ┆ 10 │\n", "│ Panasonic ┆ TV ┆ S3424 ┆ 5 ┆ 12 │\n", "│ Panasonic ┆ TV ┆ X3421 ┆ 1 ┆ 11 │\n", "│ Sanyo ┆ Radio ┆ S111 ┆ 4 ┆ 9 │\n", "│ Sanyo ┆ Radio ┆ S1s1 ┆ 2 ┆ 9 │\n", "│ Sanyo ┆ Radio ┆ S1s2 ┆ 4 ┆ 10 │\n", "└──────────────┴────────┴───────┴──────────┴───────────┘" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "expression = pl.col(\"variable\").str.split(\"|\")\n", "expression = expression.list.to_struct(\n", " fields=[\"Manufacturer\", \"Device\", \"header\", \"counts\"]\n", ")\n", "(\n", " manufacturer.unpivot(index=None)\n", " .with_columns(variable=expression)\n", " .unnest(\"variable\")\n", " .with_columns(cs.exclude(\"value\").str.strip_chars())\n", " .with_row_index(name=\"index\")\n", " # ensure the pivoting operation does not encounter duplicate entries\n", " .with_columns(pl.cum_count(\"index\").over(\"Manufacturer\", \"Device\", \"header\"))\n", " .pivot(index=[\"index\", \"Manufacturer\", \"Device\"], on=\"header\", values=\"value\")\n", " .drop(\"index\")\n", " .with_columns(cs.starts_with(\"Q\", \"Max\").as_expr().cast(pl.Int16))\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if our header is actually composed of different parts of a column name? This is best explained with an example:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (1, 5)
x_1_meanx_2_meany_1_meany_2_meanunit
i64i64i64i64i64
1020304050
" ], "text/plain": [ "shape: (1, 5)\n", "┌──────────┬──────────┬──────────┬──────────┬──────┐\n", "│ x_1_mean ┆ x_2_mean ┆ y_1_mean ┆ y_2_mean ┆ unit │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │\n", "╞══════════╪══════════╪══════════╪══════════╪══════╡\n", "│ 10 ┆ 20 ┆ 30 ┆ 40 ┆ 50 │\n", "└──────────┴──────────┴──────────┴──────────┴──────┘" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_mean = pl.DataFrame(\n", " {\n", " \"x_1_mean\": [10],\n", " \"x_2_mean\": [20],\n", " \"y_1_mean\": [30],\n", " \"y_2_mean\": [40],\n", " \"unit\": [50],\n", " }\n", ")\n", "\n", "df_mean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the dataframe above, we wish to lump similar columns together - specifically, we want to lump the x* columns into one, same goes for the y* columns.\n", "\n", "Steps:\n", "- unpivot into long form\n", "- split string and unnest into individucal columns\n", "- combine the relevant columns into a single header column\n", "- pivot the dataframe to get the desired output" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (2, 4)
numunitxmeanymean
i8i64i64i64
1501030
2502040
" ], "text/plain": [ "shape: (2, 4)\n", "┌─────┬──────┬───────┬───────┐\n", "│ num ┆ unit ┆ xmean ┆ ymean │\n", "│ --- ┆ --- ┆ --- ┆ --- │\n", "│ i8 ┆ i64 ┆ i64 ┆ i64 │\n", "╞═════╪══════╪═══════╪═══════╡\n", "│ 1 ┆ 50 ┆ 10 ┆ 30 │\n", "│ 2 ┆ 50 ┆ 20 ┆ 40 │\n", "└─────┴──────┴───────┴───────┘" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "out = df_mean.unpivot(index=\"unit\")\n", "expression = pl.col(\"variable\").str.split(\"_\")\n", "expression = expression.list.to_struct(fields=[\"letter\", \"num\", \"agg\"])\n", "out = out.with_columns(variable=expression).unnest(\"variable\")\n", "expression = pl.concat_str(\"letter\", \"agg\").alias(\"header\")\n", "(\n", " out.select(cs.exclude(\"letter\", \"agg\"), expression)\n", " .pivot(index=[\"num\", \"unit\"], on=\"header\", values=\"value\")\n", " .cast({\"num\": pl.Int8})\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if we want multiple header columns and multiple value columns? Let's have a look at this [scenario](https://stackoverflow.com/q/51519101/7175713):" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (3, 8)
CityStateNameMangoOrangeWatermelonGinVodka
strstrstri64i64i64i64i64
"Houston""Texas""Aria"410401620
"Austin""Texas""Penelope"1089920033
"Hoover""Alabama""Niko"9014433418
" ], "text/plain": [ "shape: (3, 8)\n", "┌─────────┬─────────┬──────────┬───────┬────────┬────────────┬─────┬───────┐\n", "│ City ┆ State ┆ Name ┆ Mango ┆ Orange ┆ Watermelon ┆ Gin ┆ Vodka │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │\n", "╞═════════╪═════════╪══════════╪═══════╪════════╪════════════╪═════╪═══════╡\n", "│ Houston ┆ Texas ┆ Aria ┆ 4 ┆ 10 ┆ 40 ┆ 16 ┆ 20 │\n", "│ Austin ┆ Texas ┆ Penelope ┆ 10 ┆ 8 ┆ 99 ┆ 200 ┆ 33 │\n", "│ Hoover ┆ Alabama ┆ Niko ┆ 90 ┆ 14 ┆ 43 ┆ 34 ┆ 18 │\n", "└─────────┴─────────┴──────────┴───────┴────────┴────────────┴─────┴───────┘" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pl.DataFrame(\n", " {\n", " \"City\": [\"Houston\", \"Austin\", \"Hoover\"],\n", " \"State\": [\"Texas\", \"Texas\", \"Alabama\"],\n", " \"Name\": [\"Aria\", \"Penelope\", \"Niko\"],\n", " \"Mango\": [4, 10, 90],\n", " \"Orange\": [10, 8, 14],\n", " \"Watermelon\": [40, 99, 43],\n", " \"Gin\": [16, 200, 34],\n", " \"Vodka\": [20, 33, 18],\n", " },\n", ")\n", "\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The desired output is below:\n", "\n", "```\n", " City State Fruit Pounds Drink Ounces\n", "0 Houston Texas Mango 4 Gin 16.0\n", "1 Austin Texas Mango 10 Gin 200.0\n", "2 Hoover Alabama Mango 90 Gin 34.0\n", "3 Houston Texas Orange 10 Vodka 20.0\n", "4 Austin Texas Orange 8 Vodka 33.0\n", "5 Hoover Alabama Orange 14 Vodka 18.0\n", "6 Houston Texas Watermelon 40 nan NaN\n", "7 Austin Texas Watermelon 99 nan NaN\n", "8 Hoover Alabama Watermelon 43 nan NaN\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "````{margin}\n", "```{note}\n", "This is not necessarily a tidy approach\n", "```\n", "````\n", "From the above expected output, the goal is to pair the Fruits(headers) with Drinks(values), in separate columns. 
How can we achieve this in Polars?\n", "\n", "We need to create a pairing between the fruits and the drinks :\n", "```\n", "('Mango', 'Gin'), ('Orange', 'Vodka'), ('Watermelon', None)\n", "```\n", "\n", "We can reuse an idea from [earlier](#group-related-columns), by pairing the relevant columns, using structs:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (9, 6)
CityStateFruitDrinkPoundsOunces
strstrstrstri64i64
"Houston""Texas""Mango""Gin"416
"Austin""Texas""Mango""Gin"10200
"Hoover""Alabama""Mango""Gin"9034
"Houston""Texas""Orange""Vodka"1020
"Austin""Texas""Orange""Vodka"833
"Hoover""Alabama""Orange""Vodka"1418
"Houston""Texas""Watermelon"null40null
"Austin""Texas""Watermelon"null99null
"Hoover""Alabama""Watermelon"null43null
" ], "text/plain": [ "shape: (9, 6)\n", "┌─────────┬─────────┬────────────┬───────┬────────┬────────┐\n", "│ City ┆ State ┆ Fruit ┆ Drink ┆ Pounds ┆ Ounces │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ str ┆ i64 ┆ i64 │\n", "╞═════════╪═════════╪════════════╪═══════╪════════╪════════╡\n", "│ Houston ┆ Texas ┆ Mango ┆ Gin ┆ 4 ┆ 16 │\n", "│ Austin ┆ Texas ┆ Mango ┆ Gin ┆ 10 ┆ 200 │\n", "│ Hoover ┆ Alabama ┆ Mango ┆ Gin ┆ 90 ┆ 34 │\n", "│ Houston ┆ Texas ┆ Orange ┆ Vodka ┆ 10 ┆ 20 │\n", "│ Austin ┆ Texas ┆ Orange ┆ Vodka ┆ 8 ┆ 33 │\n", "│ Hoover ┆ Alabama ┆ Orange ┆ Vodka ┆ 14 ┆ 18 │\n", "│ Houston ┆ Texas ┆ Watermelon ┆ null ┆ 40 ┆ null │\n", "│ Austin ┆ Texas ┆ Watermelon ┆ null ┆ 99 ┆ null │\n", "│ Hoover ┆ Alabama ┆ Watermelon ┆ null ┆ 43 ┆ null │\n", "└─────────┴─────────┴────────────┴───────┴────────┴────────┘" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "index = [\"City\", \"State\"]\n", "expression = [\n", " pl.struct(Pounds=\"Mango\", Ounces=\"Gin\").alias(\"Mango|Gin\"),\n", " pl.struct(Pounds=\"Orange\", Ounces=\"Vodka\").alias(\"Orange|Vodka\"),\n", " pl.struct(Pounds=\"Watermelon\").alias(\"Watermelon\"),\n", "]\n", "(\n", " df.select(index + expression)\n", " .unpivot(index=index)\n", " .unnest(\"value\")\n", " .with_columns(\n", " variable=pl.col(\"variable\")\n", " .str.split(\"|\")\n", " .list.to_struct(fields=[\"Fruit\", \"Drink\"])\n", " )\n", " .unnest(\"variable\")\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the steps:\n", "\n", "- create the structs to pair relevant columns - preprocessing\n", "- unpivot\n", "- unnest - post processing\n", "- string splitting - post processing\n", "- unnest - post processing\n" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-07-30T10:57:18.265994Z", "iopub.status.busy": "2020-07-30T10:57:18.265751Z", "iopub.status.idle": "2020-07-30T10:57:18.283104Z", 
"shell.execute_reply": "2020-07-30T10:57:18.280684Z", "shell.execute_reply.started": "2020-07-30T10:57:18.265970Z" } }, "source": [ "## **Summary**\n", "\n", "This blog post shows various ways to reshape data into long form, in Polars. There are scenarios however, where we may need to tweak our approach to significantly improve performance. In another [blog post](Reshape-Data-in-Polars-Wide-to_Long-Part-II.ipynb), we'll see how to make this reshaping more efficient for such conditions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comments\n", "\n" ] } ], "metadata": { "jupytext": { "formats": "ipynb,md" }, "kernelspec": { "display_name": "blogger", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 4 }