Reshape Data in Polars Efficiently from Wide to Long Form - Part I#
Introduction#
You’ve got data in a wide form (lots of columns), and you wish to transform it into a long form (fewer columns, more rows), to make it tidy or for ease of analysis. In Polars, this is achieved with the unpivot method. There are scenarios, however, where some processing has to be done before and/or after unpivoting. This article shows how to reshape data for such scenarios.
Most of the examples here are based on the tidyr pivoting vignette.
Polars’ eager API is used in the examples below; for better performance, or in production, the lazy API is recommended (a minimal lazy sketch follows the first example below).
import polars as pl
import polars.selectors as cs
import sys
print("polars version :", pl.__version__)
print("python version :", sys.version)
polars version : 1.7.1
python version : 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]
String data in column names#
relig_income = pl.read_csv("./Data_files/relig_income.csv")
relig_income
religion | <$10k | $10-20k | $20-30k | $30-40k | $40-50k | $50-75k | $75-100k | $100-150k | >150k | Don't know/refused |
---|---|---|---|---|---|---|---|---|---|---|
str | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
"Agnostic" | 27 | 34 | 60 | 81 | 76 | 137 | 122 | 109 | 84 | 96 |
"Atheist" | 12 | 27 | 37 | 52 | 35 | 70 | 73 | 59 | 74 | 76 |
"Buddhist" | 27 | 21 | 30 | 34 | 33 | 58 | 62 | 39 | 53 | 54 |
"Catholic" | 418 | 617 | 732 | 670 | 638 | 1116 | 949 | 792 | 633 | 1489 |
"Don’t know/refused" | 15 | 14 | 15 | 11 | 10 | 35 | 21 | 17 | 18 | 116 |
… | … | … | … | … | … | … | … | … | … | … |
"Orthodox" | 13 | 17 | 23 | 32 | 32 | 47 | 38 | 42 | 46 | 73 |
"Other Christian" | 9 | 7 | 11 | 13 | 13 | 14 | 18 | 14 | 12 | 18 |
"Other Faiths" | 20 | 33 | 40 | 46 | 49 | 63 | 46 | 40 | 41 | 71 |
"Other World Religions" | 5 | 2 | 3 | 4 | 2 | 7 | 3 | 4 | 4 | 8 |
"Unaffiliated" | 217 | 299 | 374 | 365 | 341 | 528 | 407 | 321 | 258 | 597 |
The relig_income dataset stores counts based on a survey which (among other things) asked people about their religion and annual income. This dataset contains three variables:
religion, stored in the rows,
income spread across the column names, and
count stored in the cell values.
We can flip this into long form, and ensure each variable has its own column - this can be achieved with the unpivot method:
relig_income.unpivot(index="religion", variable_name="income", value_name="count")
religion | income | count |
---|---|---|
str | str | i64 |
"Agnostic" | "<$10k" | 27 |
"Atheist" | "<$10k" | 12 |
"Buddhist" | "<$10k" | 27 |
"Catholic" | "<$10k" | 418 |
"Don’t know/refused" | "<$10k" | 15 |
… | … | … |
"Orthodox" | "Don't know/refused" | 73 |
"Other Christian" | "Don't know/refused" | 18 |
"Other Faiths" | "Don't know/refused" | 71 |
"Other World Religions" | "Don't know/refused" | 8 |
"Unaffiliated" | "Don't know/refused" | 597 |
No preprocessing required here.
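As mentioned in the introduction, the lazy API is recommended for better performance. Here is a minimal lazy sketch of the same reshaping - an illustration, assuming the same CSV path as above; unpivot is available on LazyFrame as well:
(
    pl.scan_csv("./Data_files/relig_income.csv")  # lazy scan: nothing is read yet
    .unpivot(index="religion", variable_name="income", value_name="count")
    .collect()  # materialise the query into a DataFrame
)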
Numeric data in column names#
billboard = pl.read_csv("./Data_files/billboard.csv", null_values="NA")
billboard = billboard.with_columns(cs.starts_with("wk").cast(pl.Int32))
billboard
year | artist | track | time | date.entered | wk1 | wk2 | wk3 | wk4 | wk5 | wk6 | wk7 | wk8 | wk9 | wk10 | wk11 | wk12 | wk13 | wk14 | wk15 | wk16 | wk17 | wk18 | wk19 | wk20 | wk21 | wk22 | wk23 | wk24 | wk25 | wk26 | wk27 | wk28 | wk29 | wk30 | wk31 | wk32 | … | wk40 | wk41 | wk42 | wk43 | wk44 | wk45 | wk46 | wk47 | wk48 | wk49 | wk50 | wk51 | wk52 | wk53 | wk54 | wk55 | wk56 | wk57 | wk58 | wk59 | wk60 | wk61 | wk62 | wk63 | wk64 | wk65 | wk66 | wk67 | wk68 | wk69 | wk70 | wk71 | wk72 | wk73 | wk74 | wk75 | wk76 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | … | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 |
2000 | "2 Pac" | "Baby Don't Cry (Keep..." | "4:22" | "2000-02-26" | 87 | 82 | 72 | 77 | 87 | 94 | 99 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "2Ge+her" | "The Hardest Part Of ..." | "3:15" | "2000-09-02" | 91 | 87 | 92 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "3 Doors Down" | "Kryptonite" | "3:53" | "2000-04-08" | 81 | 70 | 68 | 67 | 66 | 57 | 54 | 53 | 51 | 51 | 51 | 51 | 47 | 44 | 38 | 28 | 22 | 18 | 18 | 14 | 12 | 7 | 6 | 6 | 6 | 5 | 5 | 4 | 4 | 4 | 4 | 3 | … | 15 | 14 | 13 | 14 | 16 | 17 | 21 | 22 | 24 | 28 | 33 | 42 | 42 | 49 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "3 Doors Down" | "Loser" | "4:24" | "2000-10-21" | 76 | 76 | 72 | 69 | 67 | 65 | 55 | 59 | 62 | 61 | 61 | 59 | 61 | 66 | 72 | 76 | 75 | 67 | 73 | 70 | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "504 Boyz" | "Wobble Wobble" | "3:35" | "2000-04-15" | 57 | 34 | 25 | 17 | 17 | 31 | 36 | 49 | 53 | 57 | 64 | 70 | 75 | 76 | 78 | 85 | 92 | 96 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2000 | "Yankee Grey" | "Another Nine Minutes" | "3:10" | "2000-04-29" | 86 | 83 | 77 | 74 | 83 | 79 | 88 | 95 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "Yearwood, Trisha" | "Real Live Woman" | "3:55" | "2000-04-01" | 85 | 83 | 83 | 82 | 81 | 91 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "Ying Yang Twins" | "Whistle While You Tw..." | "4:19" | "2000-03-18" | 95 | 94 | 91 | 85 | 84 | 78 | 74 | 78 | 85 | 89 | 97 | 96 | 99 | 99 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "Zombie Nation" | "Kernkraft 400" | "3:30" | "2000-09-02" | 99 | 99 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "matchbox twenty" | "Bent" | "4:12" | "2000-04-29" | 60 | 37 | 29 | 24 | 22 | 21 | 18 | 16 | 13 | 12 | 8 | 6 | 1 | 2 | 3 | 2 | 2 | 3 | 4 | 5 | 4 | 4 | 6 | 9 | 12 | 13 | 19 | 20 | 20 | 24 | 29 | 28 | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
The billboard dataset records the billboard rank of songs in the year 2000. It has a form similar to the relig_income data, but the data encoded in the column names is really a number, not a string.
Let’s unpivot the data into a tidy form:
(
billboard.unpivot(
index=~cs.starts_with("wk"), variable_name="week", value_name="rank"
)
# irrelevant nulls
.drop_nulls(subset="rank")
)
year | artist | track | time | date.entered | week | rank |
---|---|---|---|---|---|---|
i64 | str | str | str | str | str | i32 |
2000 | "2 Pac" | "Baby Don't Cry (Keep..." | "4:22" | "2000-02-26" | "wk1" | 87 |
2000 | "2Ge+her" | "The Hardest Part Of ..." | "3:15" | "2000-09-02" | "wk1" | 91 |
2000 | "3 Doors Down" | "Kryptonite" | "3:53" | "2000-04-08" | "wk1" | 81 |
2000 | "3 Doors Down" | "Loser" | "4:24" | "2000-10-21" | "wk1" | 76 |
2000 | "504 Boyz" | "Wobble Wobble" | "3:35" | "2000-04-15" | "wk1" | 57 |
… | … | … | … | … | … | … |
2000 | "Creed" | "Higher" | "5:16" | "1999-09-11" | "wk63" | 50 |
2000 | "Lonestar" | "Amazed" | "4:25" | "1999-06-05" | "wk63" | 45 |
2000 | "Creed" | "Higher" | "5:16" | "1999-09-11" | "wk64" | 50 |
2000 | "Lonestar" | "Amazed" | "4:25" | "1999-06-05" | "wk64" | 50 |
2000 | "Creed" | "Higher" | "5:16" | "1999-09-11" | "wk65" | 49 |
It would be nice to easily determine how long each song stayed in the charts, but to do that, we’ll need to convert the week variable to an integer.
This is where post-processing comes into play after the unpivoting - extract the number from the string column, and cast the extracted column to an integer dtype:
(
billboard.unpivot(
index=~cs.starts_with("wk"), variable_name="week", value_name="rank"
)
# irrelevant nulls
.drop_nulls(subset="rank").with_columns(
        week=pl.col("week").str.slice(2).cast(pl.Int16)  # drop the leading "wk"
)
)
year | artist | track | time | date.entered | week | rank |
---|---|---|---|---|---|---|
i64 | str | str | str | str | i16 | i32 |
2000 | "2 Pac" | "Baby Don't Cry (Keep..." | "4:22" | "2000-02-26" | 1 | 87 |
2000 | "2Ge+her" | "The Hardest Part Of ..." | "3:15" | "2000-09-02" | 1 | 91 |
2000 | "3 Doors Down" | "Kryptonite" | "3:53" | "2000-04-08" | 1 | 81 |
2000 | "3 Doors Down" | "Loser" | "4:24" | "2000-10-21" | 1 | 76 |
2000 | "504 Boyz" | "Wobble Wobble" | "3:35" | "2000-04-15" | 1 | 57 |
… | … | … | … | … | … | … |
2000 | "Creed" | "Higher" | "5:16" | "1999-09-11" | 63 | 50 |
2000 | "Lonestar" | "Amazed" | "4:25" | "1999-06-05" | 63 | 45 |
2000 | "Creed" | "Higher" | "5:16" | "1999-09-11" | 64 | 50 |
2000 | "Lonestar" | "Amazed" | "4:25" | "1999-06-05" | 64 | 50 |
2000 | "Creed" | "Higher" | "5:16" | "1999-09-11" | 65 | 49 |
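Now that week is an integer, the earlier question - how long each song stayed in the charts - reduces to a simple aggregation. A minimal sketch (an addition for illustration; the row order of a group_by result is not guaranteed):
tidy = (
    billboard.unpivot(
        index=~cs.starts_with("wk"), variable_name="week", value_name="rank"
    )
    .drop_nulls(subset="rank")
    .with_columns(week=pl.col("week").str.slice(2).cast(pl.Int16))
)
# each surviving row is one charted week, so the group size is the stay length
tidy.group_by("artist", "track").agg(pl.len().alias("weeks_on_chart"))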
Many variables in column names#
who = pl.read_csv("./Data_files/who.csv", null_values="NA")
who
country | iso2 | iso3 | year | new_sp_m014 | new_sp_m1524 | new_sp_m2534 | new_sp_m3544 | new_sp_m4554 | new_sp_m5564 | new_sp_m65 | new_sp_f014 | new_sp_f1524 | new_sp_f2534 | new_sp_f3544 | new_sp_f4554 | new_sp_f5564 | new_sp_f65 | new_sn_m014 | new_sn_m1524 | new_sn_m2534 | new_sn_m3544 | new_sn_m4554 | new_sn_m5564 | new_sn_m65 | new_sn_f014 | new_sn_f1524 | new_sn_f2534 | new_sn_f3544 | new_sn_f4554 | new_sn_f5564 | new_sn_f65 | new_ep_m014 | new_ep_m1524 | new_ep_m2534 | new_ep_m3544 | new_ep_m4554 | new_ep_m5564 | new_ep_m65 | new_ep_f014 | new_ep_f1524 | new_ep_f2534 | new_ep_f3544 | new_ep_f4554 | new_ep_f5564 | new_ep_f65 | newrel_m014 | newrel_m1524 | newrel_m2534 | newrel_m3544 | newrel_m4554 | newrel_m5564 | newrel_m65 | newrel_f014 | newrel_f1524 | newrel_f2534 | newrel_f3544 | newrel_f4554 | newrel_f5564 | newrel_f65 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
str | str | str | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
"Afghanistan" | "AF" | "AFG" | 1980 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Afghanistan" | "AF" | "AFG" | 1981 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Afghanistan" | "AF" | "AFG" | 1982 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Afghanistan" | "AF" | "AFG" | 1983 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Afghanistan" | "AF" | "AFG" | 1984 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
"Zimbabwe" | "ZW" | "ZWE" | 2009 | 125 | 578 | null | 3471 | 681 | 293 | 192 | 180 | 873 | null | 3028 | 419 | 229 | 126 | 1560 | 860 | null | 6496 | 1655 | 882 | 861 | 1425 | 1334 | null | 7023 | 1551 | 729 | 514 | 244 | 266 | 0 | 1922 | 491 | 231 | 223 | 210 | 394 | 0 | 1944 | 438 | 182 | 138 | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Zimbabwe" | "ZW" | "ZWE" | 2010 | 150 | 710 | 2208 | 1682 | 761 | 350 | 252 | 173 | 974 | 2185 | 1283 | 490 | 265 | 171 | 1826 | 821 | 3342 | 3270 | 1545 | 882 | 864 | 1732 | 1282 | 4013 | 2851 | 1377 | 789 | 563 | 270 | 243 | 902 | 868 | 418 | 229 | 192 | 220 | 319 | 1058 | 677 | 338 | 181 | 146 | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Zimbabwe" | "ZW" | "ZWE" | 2011 | 152 | 784 | 2467 | 2071 | 780 | 377 | 278 | 174 | 1084 | 2161 | 1386 | 448 | 274 | 160 | 1364 | 596 | 2473 | 2813 | 1264 | 702 | 728 | 1271 | 947 | 2754 | 2216 | 962 | 587 | 495 | 250 | 195 | 746 | 796 | 342 | 172 | 172 | 209 | 318 | 802 | 640 | 284 | 137 | 129 | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Zimbabwe" | "ZW" | "ZWE" | 2012 | 120 | 783 | 2421 | 2086 | 796 | 360 | 271 | 173 | 939 | 2053 | 1286 | 483 | 231 | 161 | 1169 | 613 | 2302 | 2657 | 1154 | 708 | 796 | 1008 | 888 | 2287 | 1957 | 829 | 516 | 432 | 233 | 214 | 658 | 789 | 331 | 178 | 182 | 208 | 319 | 710 | 579 | 228 | 140 | 143 | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Zimbabwe" | "ZW" | "ZWE" | 2013 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | 1315 | 1642 | 5331 | 5363 | 2349 | 1206 | 1208 | 1252 | 2069 | 4649 | 3526 | 1453 | 811 | 725 |
A more challenging situation occurs when you have multiple variables crammed into the column names. For example, take the who dataset above.
country, iso2, iso3, and year are already variables, so they can be left as is. But the columns from new_sp_m014 to newrel_f65 encode four variables in their names:
the new_/new prefix indicates these are counts of new cases. This dataset only contains new cases, so we’ll ignore it here because it’s constant.
sp/rel/ep describe how the case was diagnosed.
m/f gives the gender.
014/1524/2534/3544/4554/5564/65 supplies the age range.
Sticking to the same pattern as before, we unpivot into long form; however, after unpivoting, we extract the relevant fields into new columns using Polars’ efficient string methods - for this use case a regex is a natural fit:
regex = r"new_?(?<diagnosis>.*)_(?<gender>.)(?<age>.*)"
expression = pl.col("variable").str.extract_groups(regex)
(
who.unpivot(
index=["country", "iso2", "iso3", "year"],
value_name="count",
)
.with_columns(variable=expression)
.unnest("variable")
)
country | iso2 | iso3 | year | diagnosis | gender | age | count |
---|---|---|---|---|---|---|---|
str | str | str | i64 | str | str | str | i64 |
"Afghanistan" | "AF" | "AFG" | 1980 | "sp" | "m" | "014" | null |
"Afghanistan" | "AF" | "AFG" | 1981 | "sp" | "m" | "014" | null |
"Afghanistan" | "AF" | "AFG" | 1982 | "sp" | "m" | "014" | null |
"Afghanistan" | "AF" | "AFG" | 1983 | "sp" | "m" | "014" | null |
"Afghanistan" | "AF" | "AFG" | 1984 | "sp" | "m" | "014" | null |
… | … | … | … | … | … | … | … |
"Zimbabwe" | "ZW" | "ZWE" | 2009 | "rel" | "f" | "65" | null |
"Zimbabwe" | "ZW" | "ZWE" | 2010 | "rel" | "f" | "65" | null |
"Zimbabwe" | "ZW" | "ZWE" | 2011 | "rel" | "f" | "65" | null |
"Zimbabwe" | "ZW" | "ZWE" | 2012 | "rel" | "f" | "65" | null |
"Zimbabwe" | "ZW" | "ZWE" | 2013 | "rel" | "f" | "65" | 725 |
Multiple observations per row#
So far, we have been working with data frames that have one observation per row, but many important pivoting problems involve multiple observations per row. You can usually recognise this case because the name of the column that you want to appear in the output is part of the column name in the input. In this section, you’ll learn how to pivot this sort of data.
household = pl.read_csv("./Data_files/household.csv", null_values="NA")
household
family | dob_child1 | dob_child2 | name_child1 | name_child2 |
---|---|---|---|---|
i64 | str | str | str | str |
1 | "1998-11-26" | "2000-01-29" | "Susan" | "Jose" |
2 | "1996-06-22" | null | "Mark" | null |
3 | "2002-07-11" | "2004-04-05" | "Sam" | "Seth" |
4 | "2004-10-10" | "2009-08-27" | "Craig" | "Khai" |
5 | "2000-12-05" | "2005-02-28" | "Parker" | "Gracie" |
Note that we have two pieces of information (or values) for each child: their name and their dob (date of birth). These need to go into separate columns in the result.
Steps involved:
unpivot into long form
split the string into individual columns
pivot the dataframe, to keep dob and name as headers
expression = pl.col("variable").str.split("_")
expression = expression.list.to_struct(fields=["header", "child"])
(
household.unpivot(index="family")
.drop_nulls("value")
.with_columns(variable=expression)
.unnest("variable")
.pivot(index=["family", "child"], on="header", values="value")
)
family | child | dob | name |
---|---|---|---|
i64 | str | str | str |
1 | "child1" | "1998-11-26" | "Susan" |
2 | "child1" | "1996-06-22" | "Mark" |
3 | "child1" | "2002-07-11" | "Sam" |
4 | "child1" | "2004-10-10" | "Craig" |
5 | "child1" | "2000-12-05" | "Parker" |
1 | "child2" | "2000-01-29" | "Jose" |
3 | "child2" | "2004-04-05" | "Seth" |
4 | "child2" | "2009-08-27" | "Khai" |
5 | "child2" | "2005-02-28" | "Gracie" |
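A possible follow-up (an illustration, not part of the original pipeline): dob is still a string after the pivot; parsing it into a proper Date dtype makes date arithmetic possible downstream:
(
    household.unpivot(index="family")
    .drop_nulls("value")
    .with_columns(variable=expression)
    .unnest("variable")
    .pivot(index=["family", "child"], on="header", values="value")
    # parse the ISO-formatted strings into a Date column
    .with_columns(pl.col("dob").str.to_date("%Y-%m-%d"))
)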
In the household data above, we had just one header to retain from the column names; what if we had multiple?
manufacturer = pl.DataFrame(
{
"Sony | TV | Model | value": ["A222", "A234", "A4345"],
"Sony | TV | Quantity | value": [5, 5, 4],
"Sony | TV | Max-quant | value": [10, 9, 9],
"Panasonic | TV | Model | value": ["T232", "S3424", "X3421"],
"Panasonic | TV | Quantity | value": [1, 5, 1],
"Panasonic | TV | Max-quant | value": [10, 12, 11],
"Sanyo | Radio | Model | value": ["S111", "S1s1", "S1s2"],
"Sanyo | Radio | Quantity | value": [4, 2, 4],
"Sanyo | Radio | Max-quant | value": [9, 9, 10],
}
)
manufacturer
Sony | TV | Model | value | Sony | TV | Quantity | value | Sony | TV | Max-quant | value | Panasonic | TV | Model | value | Panasonic | TV | Quantity | value | Panasonic | TV | Max-quant | value | Sanyo | Radio | Model | value | Sanyo | Radio | Quantity | value | Sanyo | Radio | Max-quant | value |
---|---|---|---|---|---|---|---|---|
str | i64 | i64 | str | i64 | i64 | str | i64 | i64 |
"A222" | 5 | 10 | "T232" | 1 | 10 | "S111" | 4 | 9 |
"A234" | 5 | 9 | "S3424" | 5 | 12 | "S1s1" | 2 | 9 |
"A4345" | 4 | 9 | "X3421" | 1 | 11 | "S1s2" | 4 | 10 |
Again, we stick to the same pattern as with the household reshaping - unpivot into long form, split into individual columns, and finally pivot the relevant column:
expression = pl.col("variable").str.split("|")
expression = expression.list.to_struct(
fields=["Manufacturer", "Device", "header", "counts"]
)
(
manufacturer.unpivot(index=None)
.with_columns(variable=expression)
.unnest("variable")
.with_columns(cs.exclude("value").str.strip_chars())
.with_row_index(name="index")
# ensure the pivoting operation does not encounter duplicate entries
.with_columns(pl.cum_count("index").over("Manufacturer", "Device", "header"))
.pivot(index=["index", "Manufacturer", "Device"], on="header", values="value")
.drop("index")
.with_columns(cs.starts_with("Q", "Max").as_expr().cast(pl.Int16))
)
Manufacturer | Device | Model | Quantity | Max-quant |
---|---|---|---|---|
str | str | str | i16 | i16 |
"Sony" | "TV" | "A222" | 5 | 10 |
"Sony" | "TV" | "A234" | 5 | 9 |
"Sony" | "TV" | "A4345" | 4 | 9 |
"Panasonic" | "TV" | "T232" | 1 | 10 |
"Panasonic" | "TV" | "S3424" | 5 | 12 |
"Panasonic" | "TV" | "X3421" | 1 | 11 |
"Sanyo" | "Radio" | "S111" | 4 | 9 |
"Sanyo" | "Radio" | "S1s1" | 2 | 9 |
"Sanyo" | "Radio" | "S1s2" | 4 | 10 |
What if our header is actually composed of different parts of a column name? This is best explained with an example:
df_mean = pl.DataFrame(
{
"x_1_mean": [10],
"x_2_mean": [20],
"y_1_mean": [30],
"y_2_mean": [40],
"unit": [50],
}
)
df_mean
x_1_mean | x_2_mean | y_1_mean | y_2_mean | unit |
---|---|---|---|---|
i64 | i64 | i64 | i64 | i64 |
10 | 20 | 30 | 40 | 50 |
For the dataframe above, we wish to lump similar columns together - specifically, we want to lump the x* columns into one column, and likewise for the y* columns.
Steps:
unpivot into long form
split the string and unnest into individual columns
combine the relevant columns into a single header column
pivot the dataframe to get the desired output
out = df_mean.unpivot(index="unit")
expression = pl.col("variable").str.split("_")
expression = expression.list.to_struct(fields=["letter", "num", "agg"])
out = out.with_columns(variable=expression).unnest("variable")
expression = pl.concat_str("letter", "agg").alias("header")
(
out.select(cs.exclude("letter", "agg"), expression)
.pivot(index=["num", "unit"], on="header", values="value")
.cast({"num": pl.Int8})
)
num | unit | xmean | ymean |
---|---|---|---|
i8 | i64 | i64 | i64 |
1 | 50 | 10 | 30 |
2 | 50 | 20 | 40 |
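A small design note: pl.concat_str joins the parts directly, which is why the headers come out as xmean and ymean. If you would rather keep a separator in the headers (x_mean, y_mean), concat_str accepts a separator argument:
expression = pl.concat_str("letter", "agg", separator="_").alias("header")  # "x_mean", "y_mean"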
What if we want multiple header columns and multiple value columns? Let’s have a look at this scenario:
df = pl.DataFrame(
{
"City": ["Houston", "Austin", "Hoover"],
"State": ["Texas", "Texas", "Alabama"],
"Name": ["Aria", "Penelope", "Niko"],
"Mango": [4, 10, 90],
"Orange": [10, 8, 14],
"Watermelon": [40, 99, 43],
"Gin": [16, 200, 34],
"Vodka": [20, 33, 18],
},
)
df
City | State | Name | Mango | Orange | Watermelon | Gin | Vodka |
---|---|---|---|---|---|---|---|
str | str | str | i64 | i64 | i64 | i64 | i64 |
"Houston" | "Texas" | "Aria" | 4 | 10 | 40 | 16 | 20 |
"Austin" | "Texas" | "Penelope" | 10 | 8 | 99 | 200 | 33 |
"Hoover" | "Alabama" | "Niko" | 90 | 14 | 43 | 34 | 18 |
The desired output is below:
City State Fruit Pounds Drink Ounces
0 Houston Texas Mango 4 Gin 16.0
1 Austin Texas Mango 10 Gin 200.0
2 Hoover Alabama Mango 90 Gin 34.0
3 Houston Texas Orange 10 Vodka 20.0
4 Austin Texas Orange 8 Vodka 33.0
5 Hoover Alabama Orange 14 Vodka 18.0
6 Houston Texas Watermelon 40 nan NaN
7 Austin Texas Watermelon 99 nan NaN
8 Hoover Alabama Watermelon 43 nan NaN
From the expected output above, the goal is to pair the Fruits (headers) with the Drinks (values), in separate columns. How can we achieve this in Polars?
We need to create a pairing between the fruits and the drinks:
('Mango', 'Gin'), ('Orange', 'Vodka'), ('Watermelon', None)
We can reuse an idea from earlier, by pairing the relevant columns, using structs:
index = ["City", "State"]
expression = [
pl.struct(Pounds="Mango", Ounces="Gin").alias("Mango|Gin"),
pl.struct(Pounds="Orange", Ounces="Vodka").alias("Orange|Vodka"),
pl.struct(Pounds="Watermelon").alias("Watermelon"),
]
(
df.select(index + expression)
.unpivot(index=index)
.unnest("value")
.with_columns(
variable=pl.col("variable")
.str.split("|")
.list.to_struct(fields=["Fruit", "Drink"])
)
.unnest("variable")
)
City | State | Fruit | Drink | Pounds | Ounces |
---|---|---|---|---|---|
str | str | str | str | i64 | i64 |
"Houston" | "Texas" | "Mango" | "Gin" | 4 | 16 |
"Austin" | "Texas" | "Mango" | "Gin" | 10 | 200 |
"Hoover" | "Alabama" | "Mango" | "Gin" | 90 | 34 |
"Houston" | "Texas" | "Orange" | "Vodka" | 10 | 20 |
"Austin" | "Texas" | "Orange" | "Vodka" | 8 | 33 |
"Hoover" | "Alabama" | "Orange" | "Vodka" | 14 | 18 |
"Houston" | "Texas" | "Watermelon" | null | 40 | null |
"Austin" | "Texas" | "Watermelon" | null | 99 | null |
"Hoover" | "Alabama" | "Watermelon" | null | 43 | null |
Note the steps:
create the structs to pair the relevant columns - preprocessing
unpivot
unnest - post-processing
string splitting - post-processing
unnest - post-processing
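For a longer pairing, the struct expressions need not be written out by hand. A sketch (an illustration, not in the original) that builds the same list of expressions from the pairing tuples:
pairs = [("Mango", "Gin"), ("Orange", "Vodka"), ("Watermelon", None)]
expression = [
    # an unpaired fruit gets a struct with only the Pounds field
    pl.struct(Pounds=fruit).alias(fruit)
    if drink is None
    else pl.struct(Pounds=fruit, Ounces=drink).alias(f"{fruit}|{drink}")
    for fruit, drink in pairs
]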
Summary#
This blog post shows various ways to reshape data into long form in Polars. There are scenarios, however, where we may need to tweak our approach to significantly improve performance. In another blog post, we’ll see how to make this reshaping more efficient in such scenarios.