Reshape Data in Polars Efficiently from Wide to Long Form - Part III
Introduction
This post continues from a previous post, where I highlighted steps to significantly improve performance when reshaping a DataFrame from wide to long form. The solutions proffered in that post, however, were quite verbose compared to the naive solutions, and were tailored to the specific problems.
In this article I introduce two functions - pivot_longer and pivot_longer_spec - from the pyjanitor library. These functions, ports of the functions of the same name in R, offer conciseness while still being performant.
I am a contributor to the pyjanitor library.
We will reuse most of the examples from a previous blog post to highlight the various usages of pivot_longer and pivot_longer_spec. First things first, let’s install and import the pyjanitor library:
# pip install pyjanitor
import janitor as jn
import janitor.polars
import polars as pl
import polars.selectors as cs
from janitor.polars import pivot_longer_spec
pl.__version__
'1.7.1'
jn.__version__
'0.29.1'
String data in column names
relig_income = pl.read_csv("./Data_files/relig_income.csv")
relig_income
religion | <$10k | $10-20k | $20-30k | $30-40k | $40-50k | $50-75k | $75-100k | $100-150k | >150k | Don't know/refused |
---|---|---|---|---|---|---|---|---|---|---|
str | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
"Agnostic" | 27 | 34 | 60 | 81 | 76 | 137 | 122 | 109 | 84 | 96 |
"Atheist" | 12 | 27 | 37 | 52 | 35 | 70 | 73 | 59 | 74 | 76 |
"Buddhist" | 27 | 21 | 30 | 34 | 33 | 58 | 62 | 39 | 53 | 54 |
"Catholic" | 418 | 617 | 732 | 670 | 638 | 1116 | 949 | 792 | 633 | 1489 |
"Don’t know/refused" | 15 | 14 | 15 | 11 | 10 | 35 | 21 | 17 | 18 | 116 |
… | … | … | … | … | … | … | … | … | … | … |
"Orthodox" | 13 | 17 | 23 | 32 | 32 | 47 | 38 | 42 | 46 | 73 |
"Other Christian" | 9 | 7 | 11 | 13 | 13 | 14 | 18 | 14 | 12 | 18 |
"Other Faiths" | 20 | 33 | 40 | 46 | 49 | 63 | 46 | 40 | 41 | 71 |
"Other World Religions" | 5 | 2 | 3 | 4 | 2 | 7 | 3 | 4 | 4 | 8 |
"Unaffiliated" | 217 | 299 | 374 | 365 | 341 | 528 | 407 | 321 | 258 | 597 |
This is an easy use case - pivot_longer simply calls unpivot under the hood:
relig_income.pivot_longer(
index="religion",
# all columns will be collated
# under names_to
names_to="income",
# all values in the columns
# will be collated into values_to
values_to="count",
)
religion | income | count |
---|---|---|
str | str | i64 |
"Agnostic" | "<$10k" | 27 |
"Atheist" | "<$10k" | 12 |
"Buddhist" | "<$10k" | 27 |
"Catholic" | "<$10k" | 418 |
"Don’t know/refused" | "<$10k" | 15 |
… | … | … |
"Orthodox" | "Don't know/refused" | 73 |
"Other Christian" | "Don't know/refused" | 18 |
"Other Faiths" | "Don't know/refused" | 71 |
"Other World Religions" | "Don't know/refused" | 8 |
"Unaffiliated" | "Don't know/refused" | 597 |
Numeric data in column names
billboard = pl.read_csv("./Data_files/billboard.csv", null_values="NA")
billboard = billboard.with_columns(cs.starts_with("wk").cast(pl.Int32))
billboard
year | artist | track | time | date.entered | wk1 | wk2 | wk3 | wk4 | wk5 | wk6 | wk7 | wk8 | wk9 | wk10 | wk11 | wk12 | wk13 | wk14 | wk15 | wk16 | wk17 | wk18 | wk19 | wk20 | wk21 | wk22 | wk23 | wk24 | wk25 | wk26 | wk27 | wk28 | wk29 | wk30 | wk31 | wk32 | … | wk40 | wk41 | wk42 | wk43 | wk44 | wk45 | wk46 | wk47 | wk48 | wk49 | wk50 | wk51 | wk52 | wk53 | wk54 | wk55 | wk56 | wk57 | wk58 | wk59 | wk60 | wk61 | wk62 | wk63 | wk64 | wk65 | wk66 | wk67 | wk68 | wk69 | wk70 | wk71 | wk72 | wk73 | wk74 | wk75 | wk76 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | … | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 |
2000 | "2 Pac" | "Baby Don't Cry (Keep..." | "4:22" | "2000-02-26" | 87 | 82 | 72 | 77 | 87 | 94 | 99 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "2Ge+her" | "The Hardest Part Of ..." | "3:15" | "2000-09-02" | 91 | 87 | 92 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "3 Doors Down" | "Kryptonite" | "3:53" | "2000-04-08" | 81 | 70 | 68 | 67 | 66 | 57 | 54 | 53 | 51 | 51 | 51 | 51 | 47 | 44 | 38 | 28 | 22 | 18 | 18 | 14 | 12 | 7 | 6 | 6 | 6 | 5 | 5 | 4 | 4 | 4 | 4 | 3 | … | 15 | 14 | 13 | 14 | 16 | 17 | 21 | 22 | 24 | 28 | 33 | 42 | 42 | 49 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "3 Doors Down" | "Loser" | "4:24" | "2000-10-21" | 76 | 76 | 72 | 69 | 67 | 65 | 55 | 59 | 62 | 61 | 61 | 59 | 61 | 66 | 72 | 76 | 75 | 67 | 73 | 70 | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "504 Boyz" | "Wobble Wobble" | "3:35" | "2000-04-15" | 57 | 34 | 25 | 17 | 17 | 31 | 36 | 49 | 53 | 57 | 64 | 70 | 75 | 76 | 78 | 85 | 92 | 96 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2000 | "Yankee Grey" | "Another Nine Minutes" | "3:10" | "2000-04-29" | 86 | 83 | 77 | 74 | 83 | 79 | 88 | 95 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "Yearwood, Trisha" | "Real Live Woman" | "3:55" | "2000-04-01" | 85 | 83 | 83 | 82 | 81 | 91 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "Ying Yang Twins" | "Whistle While You Tw..." | "4:19" | "2000-03-18" | 95 | 94 | 91 | 85 | 84 | 78 | 74 | 78 | 85 | 89 | 97 | 96 | 99 | 99 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "Zombie Nation" | "Kernkraft 400" | "3:30" | "2000-09-02" | 99 | 99 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "matchbox twenty" | "Bent" | "4:12" | "2000-04-29" | 60 | 37 | 29 | 24 | 22 | 21 | 18 | 16 | 13 | 12 | 8 | 6 | 1 | 2 | 3 | 2 | 2 | 3 | 4 | 5 | 4 | 4 | 6 | 9 | 12 | 13 | 19 | 20 | 20 | 24 | 29 | 28 | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
The aim here is to flip the data into long form, extract the numbers from the wk columns, and convert them to an integer dtype. Let’s see below how pivot_longer handles this:
(
billboard.pivot_longer(
index=["year", "artist", "track", "time", "date.entered"],
names_to="week",
values_to="rank",
names_pattern=r".+(\d)",
names_transform=pl.col("week").cast(pl.Int32),
).drop_nulls(subset="rank")
)
year | artist | track | time | date.entered | week | rank |
---|---|---|---|---|---|---|
i64 | str | str | str | str | i32 | i32 |
2000 | "2 Pac" | "Baby Don't Cry (Keep..." | "4:22" | "2000-02-26" | 1 | 87 |
2000 | "2Ge+her" | "The Hardest Part Of ..." | "3:15" | "2000-09-02" | 1 | 91 |
2000 | "3 Doors Down" | "Kryptonite" | "3:53" | "2000-04-08" | 1 | 81 |
2000 | "3 Doors Down" | "Loser" | "4:24" | "2000-10-21" | 1 | 76 |
2000 | "504 Boyz" | "Wobble Wobble" | "3:35" | "2000-04-15" | 1 | 57 |
… | … | … | … | … | … | … |
2000 | "Creed" | "Higher" | "5:16" | "1999-09-11" | 3 | 50 |
2000 | "Lonestar" | "Amazed" | "4:25" | "1999-06-05" | 3 | 45 |
2000 | "Creed" | "Higher" | "5:16" | "1999-09-11" | 4 | 50 |
2000 | "Lonestar" | "Amazed" | "4:25" | "1999-06-05" | 4 | 50 |
2000 | "Creed" | "Higher" | "5:16" | "1999-09-11" | 5 | 49 |
Two things to note here:

- The names_pattern parameter offers a flexible way to extract/separate labels within column names. In this case we are only interested in the integers attached to the wk columns, and the regex achieves this easily. Only the capture groups are retained; the non-groups are discarded. The other option for splitting columns is the names_sep parameter.
- The names_transform parameter offers a way to convert the data type of the flipped columns. It uses a polars expression to achieve this.
You can use the names_sep parameter to achieve something similar to the above, though the regex is a better fit here. Splitting on wk leaves an empty string before the separator, so we capture it into a throwaway column and drop it:
(
billboard.pivot_longer(
index=["year", "artist", "track", "time", "date.entered"],
names_to=["", "week"],
values_to="rank",
names_sep="wk",
names_transform=pl.col("week").cast(pl.Int32),
)
.drop("")
.drop_nulls(subset="rank")
)
year | artist | track | time | date.entered | week | rank |
---|---|---|---|---|---|---|
i64 | str | str | str | str | i32 | i32 |
2000 | "2 Pac" | "Baby Don't Cry (Keep..." | "4:22" | "2000-02-26" | 1 | 87 |
2000 | "2Ge+her" | "The Hardest Part Of ..." | "3:15" | "2000-09-02" | 1 | 91 |
2000 | "3 Doors Down" | "Kryptonite" | "3:53" | "2000-04-08" | 1 | 81 |
2000 | "3 Doors Down" | "Loser" | "4:24" | "2000-10-21" | 1 | 76 |
2000 | "504 Boyz" | "Wobble Wobble" | "3:35" | "2000-04-15" | 1 | 57 |
… | … | … | … | … | … | … |
2000 | "Creed" | "Higher" | "5:16" | "1999-09-11" | 63 | 50 |
2000 | "Lonestar" | "Amazed" | "4:25" | "1999-06-05" | 63 | 45 |
2000 | "Creed" | "Higher" | "5:16" | "1999-09-11" | 64 | 50 |
2000 | "Lonestar" | "Amazed" | "4:25" | "1999-06-05" | 64 | 50 |
2000 | "Creed" | "Higher" | "5:16" | "1999-09-11" | 65 | 49 |
Many variables in column names
who = pl.read_csv("./Data_files/who.csv", null_values="NA")
who
country | iso2 | iso3 | year | new_sp_m014 | new_sp_m1524 | new_sp_m2534 | new_sp_m3544 | new_sp_m4554 | new_sp_m5564 | new_sp_m65 | new_sp_f014 | new_sp_f1524 | new_sp_f2534 | new_sp_f3544 | new_sp_f4554 | new_sp_f5564 | new_sp_f65 | new_sn_m014 | new_sn_m1524 | new_sn_m2534 | new_sn_m3544 | new_sn_m4554 | new_sn_m5564 | new_sn_m65 | new_sn_f014 | new_sn_f1524 | new_sn_f2534 | new_sn_f3544 | new_sn_f4554 | new_sn_f5564 | new_sn_f65 | new_ep_m014 | new_ep_m1524 | new_ep_m2534 | new_ep_m3544 | new_ep_m4554 | new_ep_m5564 | new_ep_m65 | new_ep_f014 | new_ep_f1524 | new_ep_f2534 | new_ep_f3544 | new_ep_f4554 | new_ep_f5564 | new_ep_f65 | newrel_m014 | newrel_m1524 | newrel_m2534 | newrel_m3544 | newrel_m4554 | newrel_m5564 | newrel_m65 | newrel_f014 | newrel_f1524 | newrel_f2534 | newrel_f3544 | newrel_f4554 | newrel_f5564 | newrel_f65 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
str | str | str | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
"Afghanistan" | "AF" | "AFG" | 1980 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Afghanistan" | "AF" | "AFG" | 1981 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Afghanistan" | "AF" | "AFG" | 1982 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Afghanistan" | "AF" | "AFG" | 1983 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Afghanistan" | "AF" | "AFG" | 1984 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
"Zimbabwe" | "ZW" | "ZWE" | 2009 | 125 | 578 | null | 3471 | 681 | 293 | 192 | 180 | 873 | null | 3028 | 419 | 229 | 126 | 1560 | 860 | null | 6496 | 1655 | 882 | 861 | 1425 | 1334 | null | 7023 | 1551 | 729 | 514 | 244 | 266 | 0 | 1922 | 491 | 231 | 223 | 210 | 394 | 0 | 1944 | 438 | 182 | 138 | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Zimbabwe" | "ZW" | "ZWE" | 2010 | 150 | 710 | 2208 | 1682 | 761 | 350 | 252 | 173 | 974 | 2185 | 1283 | 490 | 265 | 171 | 1826 | 821 | 3342 | 3270 | 1545 | 882 | 864 | 1732 | 1282 | 4013 | 2851 | 1377 | 789 | 563 | 270 | 243 | 902 | 868 | 418 | 229 | 192 | 220 | 319 | 1058 | 677 | 338 | 181 | 146 | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Zimbabwe" | "ZW" | "ZWE" | 2011 | 152 | 784 | 2467 | 2071 | 780 | 377 | 278 | 174 | 1084 | 2161 | 1386 | 448 | 274 | 160 | 1364 | 596 | 2473 | 2813 | 1264 | 702 | 728 | 1271 | 947 | 2754 | 2216 | 962 | 587 | 495 | 250 | 195 | 746 | 796 | 342 | 172 | 172 | 209 | 318 | 802 | 640 | 284 | 137 | 129 | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Zimbabwe" | "ZW" | "ZWE" | 2012 | 120 | 783 | 2421 | 2086 | 796 | 360 | 271 | 173 | 939 | 2053 | 1286 | 483 | 231 | 161 | 1169 | 613 | 2302 | 2657 | 1154 | 708 | 796 | 1008 | 888 | 2287 | 1957 | 829 | 516 | 432 | 233 | 214 | 658 | 789 | 331 | 178 | 182 | 208 | 319 | 710 | 579 | 228 | 140 | 143 | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Zimbabwe" | "ZW" | "ZWE" | 2013 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | 1315 | 1642 | 5331 | 5363 | 2349 | 1206 | 1208 | 1252 | 2069 | 4649 | 3526 | 1453 | 811 | 725 |
The aim is to separate the columns into diagnosis, gender and age columns. We can reuse the idea from the previous exercise and pass a regular expression to the names_pattern parameter:
who.pivot_longer(
index=["country", "iso2", "iso3", "year"],
names_to=("diagnosis", "gender", "age"),
names_pattern=r"new_?(.*)_(.)(.*)",
values_to="count",
)
country | iso2 | iso3 | year | diagnosis | gender | age | count |
---|---|---|---|---|---|---|---|
str | str | str | i64 | str | str | str | i64 |
"Afghanistan" | "AF" | "AFG" | 1980 | "sp" | "m" | "014" | null |
"Afghanistan" | "AF" | "AFG" | 1981 | "sp" | "m" | "014" | null |
"Afghanistan" | "AF" | "AFG" | 1982 | "sp" | "m" | "014" | null |
"Afghanistan" | "AF" | "AFG" | 1983 | "sp" | "m" | "014" | null |
"Afghanistan" | "AF" | "AFG" | 1984 | "sp" | "m" | "014" | null |
… | … | … | … | … | … | … | … |
"Zimbabwe" | "ZW" | "ZWE" | 2009 | "rel" | "f" | "65" | null |
"Zimbabwe" | "ZW" | "ZWE" | 2010 | "rel" | "f" | "65" | null |
"Zimbabwe" | "ZW" | "ZWE" | 2011 | "rel" | "f" | "65" | null |
"Zimbabwe" | "ZW" | "ZWE" | 2012 | "rel" | "f" | "65" | null |
"Zimbabwe" | "ZW" | "ZWE" | 2013 | "rel" | "f" | "65" | 725 |
Multiple observations per row
household = pl.read_csv("./Data_files/household.csv", null_values="NA")
household
family | dob_child1 | dob_child2 | name_child1 | name_child2 |
---|---|---|---|---|
i64 | str | str | str | str |
1 | "1998-11-26" | "2000-01-29" | "Susan" | "Jose" |
2 | "1996-06-22" | null | "Mark" | null |
3 | "2002-07-11" | "2004-04-05" | "Sam" | "Seth" |
4 | "2004-10-10" | "2009-08-27" | "Craig" | "Khai" |
5 | "2000-12-05" | "2005-02-28" | "Parker" | "Gracie" |
In the DataFrame above, we wish to split the dob and name parts of the column names into separate columns. How does pivot_longer handle this?
household.pivot_longer(index="family", names_to=(".value", "child"), names_sep="_")
family | child | dob | name |
---|---|---|---|
i64 | str | str | str |
1 | "child1" | "1998-11-26" | "Susan" |
2 | "child1" | "1996-06-22" | "Mark" |
3 | "child1" | "2002-07-11" | "Sam" |
4 | "child1" | "2004-10-10" | "Craig" |
5 | "child1" | "2000-12-05" | "Parker" |
1 | "child2" | "2000-01-29" | "Jose" |
2 | "child2" | null | null |
3 | "child2" | "2004-04-05" | "Seth" |
4 | "child2" | "2009-08-27" | "Khai" |
5 | "child2" | "2005-02-28" | "Gracie" |
If you want to retain a part of the column name as a header, use the .value placeholder; the label paired with .value remains as a header, while the unpaired parts, if any, are flipped into rows.

Let’s break it down a bit. The names_to parameter is (.value, child), which pairs with (dob, child1), (dob, child2), (name, child1), (name, child2) after the column names have been split by the names_sep argument. .value is paired with dob and name, while child is paired with child1 and child2. In tabular form, it can be portrayed as below:
.value names_sep child
dob _ child1
dob _ child2
name _ child1
name _ child2
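Since the reshaped dob column is still a string, a natural follow-up - a minimal sketch, assuming the ISO "%Y-%m-%d" format shown above - is to parse it into a proper date dtype:

(
    household.pivot_longer(index="family", names_to=(".value", "child"), names_sep="_")
    # parse the dob strings into pl.Date values
    .with_columns(pl.col("dob").str.to_date(format="%Y-%m-%d"))
)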
Let’s look at another example:
manufacturer = pl.DataFrame(
{
"Sony | TV | Model | value": ["A222", "A234", "A4345"],
"Sony | TV | Quantity | value": [5, 5, 4],
"Sony | TV | Max-quant | value": [10, 9, 9],
"Panasonic | TV | Model | value": ["T232", "S3424", "X3421"],
"Panasonic | TV | Quantity | value": [1, 5, 1],
"Panasonic | TV | Max-quant | value": [10, 12, 11],
"Sanyo | Radio | Model | value": ["S111", "S1s1", "S1s2"],
"Sanyo | Radio | Quantity | value": [4, 2, 4],
"Sanyo | Radio | Max-quant | value": [9, 9, 10],
}
)
manufacturer
Sony | TV | Model | value | Sony | TV | Quantity | value | Sony | TV | Max-quant | value | Panasonic | TV | Model | value | Panasonic | TV | Quantity | value | Panasonic | TV | Max-quant | value | Sanyo | Radio | Model | value | Sanyo | Radio | Quantity | value | Sanyo | Radio | Max-quant | value |
---|---|---|---|---|---|---|---|---|
str | i64 | i64 | str | i64 | i64 | str | i64 | i64 |
"A222" | 5 | 10 | "T232" | 1 | 10 | "S111" | 4 | 9 |
"A234" | 5 | 9 | "S3424" | 5 | 12 | "S1s1" | 2 | 9 |
"A4345" | 4 | 9 | "X3421" | 1 | 11 | "S1s2" | 4 | 10 |
We wish to retain Model, Quantity and Max-quant as headers, and flip the remaining parts of the columns (the trailing value label is captured into its own column and then dropped):
(
manufacturer.pivot_longer(
index=None,
names_to=("Manufacturer", "Device", ".value", "value"),
names_sep=" | ",
).drop("value")
)
Manufacturer | Device | Model | Quantity | Max-quant |
---|---|---|---|---|
str | str | str | i64 | i64 |
"Sanyo" | "Radio" | "S111" | 4 | 9 |
"Sanyo" | "Radio" | "S1s1" | 2 | 9 |
"Sanyo" | "Radio" | "S1s2" | 4 | 10 |
"Panasonic" | "TV" | "T232" | 1 | 10 |
"Panasonic" | "TV" | "S3424" | 5 | 12 |
"Panasonic" | "TV" | "X3421" | 1 | 11 |
"Sony" | "TV" | "A222" | 5 | 10 |
"Sony" | "TV" | "A234" | 5 | 9 |
"Sony" | "TV" | "A4345" | 4 | 9 |
(
manufacturer.pivot_longer(
index=None,
names_to=("Manufacturer", "Device", ".value"),
names_pattern=r"([a-zA-Z]+)\s?\|\s?([a-zA-Z]+)\s?\|\s?([a-zA-Z-]+)\s?\|.*",
)
)
Manufacturer | Device | Model | Quantity | Max-quant |
---|---|---|---|---|
str | str | str | i64 | i64 |
"Sony" | "TV" | "A222" | 5 | 10 |
"Sony" | "TV" | "A234" | 5 | 9 |
"Sony" | "TV" | "A4345" | 4 | 9 |
"Sanyo" | "Radio" | "S111" | 4 | 9 |
"Sanyo" | "Radio" | "S1s1" | 2 | 9 |
"Sanyo" | "Radio" | "S1s2" | 4 | 10 |
"Panasonic" | "TV" | "T232" | 1 | 10 |
"Panasonic" | "TV" | "S3424" | 5 | 12 |
"Panasonic" | "TV" | "X3421" | 1 | 11 |
What if our header is actually composed of different parts of a column name? This is best explained with an example:
df_mean = pl.DataFrame(
{
"x_1_mean": [10],
"x_2_mean": [20],
"y_1_mean": [30],
"y_2_mean": [40],
"unit": [50],
}
)
df_mean
x_1_mean | x_2_mean | y_1_mean | y_2_mean | unit |
---|---|---|---|---|
i64 | i64 | i64 | i64 | i64 |
10 | 20 | 30 | 40 | 50 |
For the DataFrame above, we wish to lump similar columns together - specifically, we want to combine the x* columns into one column, and likewise the y* columns. With pivot_longer we pass multiple .value placeholders to handle this (you can pass as many .value entries as the reshaping requires):
(
df_mean.pivot_longer(
index="unit",
names_to=(".value", "number", ".value"),
names_sep="_",
names_transform=pl.col("number").cast(pl.Int8),
)
)
unit | number | xmean | ymean |
---|---|---|---|
i64 | i8 | i64 | i64 |
50 | 1 | 10 | 30 |
50 | 2 | 20 | 40 |
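The same reshape should also be expressible with names_pattern instead of names_sep - a sketch, assuming three capture groups, where the two .value groups are concatenated to form the headers (xmean, ymean):

df_mean.pivot_longer(
    index="unit",
    names_to=(".value", "number", ".value"),
    # group 1 -> x/y, group 2 -> the number, group 3 -> the mean suffix
    names_pattern=r"(.)_(\d)_(mean)",
    names_transform=pl.col("number").cast(pl.Int8),
)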
Describe the output with pivot_longer_spec
pivot_longer offers a powerful, flexible and concise way to reshape data from wide to long form. However, there are situations where pivot_longer is inadequate for the transformation, or you need more control over the reshaping. This is where pivot_longer_spec shines - you describe how the new DataFrame should look, based on the source DataFrame.
Let’s look at an example, adapted from a data.table vignette:
iris = pl.DataFrame(
{
"Sepal.Length": [5.1, 5.9],
"Sepal.Width": [3.5, 3.0],
"Petal.Length": [1.4, 5.1],
"Petal.Width": [0.2, 1.8],
"Species": ["setosa", "virginica"],
}
)
iris
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
f64 | f64 | f64 | f64 | str |
5.1 | 3.5 | 1.4 | 0.2 | "setosa" |
5.9 | 3.0 | 5.1 | 1.8 | "virginica" |
For the DataFrame above, we wish to consolidate the columns into part and dimension columns - the Sepal/Petal prefixes go into the part column, while the Length/Width suffixes go into the dimension column. Let’s translate the specification into tabular form:
source_column_name new_column_name part dimension
Sepal.Length measurement Sepal Length
Sepal.Width measurement Sepal Width
Petal.Length measurement Petal Length
Petal.Width measurement Petal Width
The table above shows a clear relation between the source and target; let’s take this relationship and create a polars DataFrame of this specification:
spec = {
".name": ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"],
".value": ["measurement", "measurement", "measurement", "measurement"],
"part": ["Sepal", "Sepal", "Petal", "Petal"],
"dimension": ["Length", "Width", "Length", "Width"],
}
spec = pl.DataFrame(spec)
spec
.name | .value | part | dimension |
---|---|---|---|
str | str | str | str |
"Sepal.Length" | "measurement" | "Sepal" | "Length" |
"Sepal.Width" | "measurement" | "Sepal" | "Width" |
"Petal.Length" | "measurement" | "Petal" | "Length" |
"Petal.Width" | "measurement" | "Petal" | "Width" |
pivot_longer_spec takes the specification above and returns the expected new DataFrame:
pivot_longer_spec(df=iris, spec=spec)
Species | part | dimension | measurement |
---|---|---|---|
str | str | str | f64 |
"setosa" | "Petal" | "Length" | 1.4 |
"virginica" | "Petal" | "Length" | 5.1 |
"setosa" | "Sepal" | "Width" | 3.5 |
"virginica" | "Sepal" | "Width" | 3.0 |
"setosa" | "Petal" | "Width" | 0.2 |
"virginica" | "Petal" | "Width" | 1.8 |
"setosa" | "Sepal" | "Length" | 5.1 |
"virginica" | "Sepal" | "Length" | 5.9 |
Let’s take it further - the Sepal/Petal prefixes go into a part column, while the Length/Width suffixes remain as headers:
source_column_name new_column_name part
Sepal.Length Length Sepal
Sepal.Width Width Sepal
Petal.Length Length Petal
Petal.Width Width Petal
Again, the relationship is clear. Let’s create the spec, and then unpivot:
spec = {
".name": ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"],
"part": ["Sepal", "Sepal", "Petal", "Petal"],
".value": ["Length", "Width", "Length", "Width"],
}
spec = pl.DataFrame(spec)
spec
.name | part | .value |
---|---|---|
str | str | str |
"Sepal.Length" | "Sepal" | "Length" |
"Sepal.Width" | "Sepal" | "Width" |
"Petal.Length" | "Petal" | "Length" |
"Petal.Width" | "Petal" | "Width" |
pivot_longer_spec(df=iris, spec=spec)
Species | part | Length | Width |
---|---|---|---|
str | str | f64 | f64 |
"setosa" | "Petal" | 1.4 | 0.2 |
"virginica" | "Petal" | 5.1 | 1.8 |
"setosa" | "Sepal" | 5.1 | 3.5 |
"virginica" | "Sepal" | 5.9 | 3.0 |
The idea is simple - you know your data well, so describe the new DataFrame based on the current DataFrame, and pivot_longer_spec will take care of the transformation.
Performance
pivot_longer is not just simple, convenient, and straightforward; it is efficient as well. We can refer to Part II of this series to compare speeds:
from zipfile import ZipFile
zip_file = ZipFile("./Data_files/household_large.csv.zip")
zip_file = zip_file.read("household_large.csv")
household_large = pl.read_csv(zip_file, null_values="NA")
household_large
family | name_child0 | dob_child0 | name_child1 | dob_child1 | name_child2 | dob_child2 | name_child3 | dob_child3 | name_child4 | dob_child4 | name_child5 | dob_child5 | name_child6 | dob_child6 | name_child7 | dob_child7 | name_child8 | dob_child8 | name_child9 | dob_child9 | name_child10 | dob_child10 | name_child11 | dob_child11 | name_child12 | dob_child12 | name_child13 | dob_child13 | name_child14 | dob_child14 | name_child15 | dob_child15 | name_child16 | dob_child16 | name_child17 | dob_child17 | name_child18 | dob_child18 | name_child19 | dob_child19 | name_child20 | dob_child20 | name_child21 | dob_child21 | name_child22 | dob_child22 | name_child23 | dob_child23 | name_child24 | dob_child24 | name_child25 | dob_child25 | name_child26 | dob_child26 | name_child27 | dob_child27 | name_child28 | dob_child28 | name_child29 | dob_child29 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str |
0 | "Brittany Mcdonald" | "1989-08-24" | "Karen Mendoza MD" | "1938-12-01" | "Stephanie Casey" | "1959-08-07" | "Rhonda Hartman" | "1984-01-17" | "Elizabeth Sampson" | "1947-01-25" | "Jacqueline Barrett" | "1949-02-15" | "Rebecca Harvey" | "1974-12-03" | "John Hopkins" | "2009-11-27" | "Diane Ali" | "1993-03-04" | "Emily Jones" | "1933-01-27" | "Rickey Bates" | "1938-06-03" | "Walter Stuart" | "1971-12-04" | "Megan Smith" | "1992-06-05" | "Amanda Morris" | "1976-02-01" | "Jennifer Greer" | "1971-12-12" | "Megan Haney" | "1935-09-11" | "Alyssa Garcia" | "1963-09-03" | "Daniel Turner" | "1938-11-17" | "Brittany Simmons" | "1919-11-21" | "Kurt Schwartz" | "1956-02-27" | "Tanya Brandt" | "1943-08-16" | "Jennifer White" | "1960-08-27" | "Eddie Hamilton" | "1988-01-15" | "Sean Chavez" | "1968-12-08" | "Rhonda White" | "1948-08-30" | "Linda Hicks" | "1936-08-03" | "Brittany Beard" | "1996-03-08" | "Nicole Schmidt" | "1948-12-06" | "Nancy Clark" | "1914-12-12" | "Dr. David Walters" | "1922-10-24" |
1 | "Diana Welch" | "1989-02-11" | "Brandon Pittman" | "2019-09-02" | "Alicia Smith" | "1923-06-27" | "Sarah Dean DVM" | "2001-02-10" | "Katherine Landry" | "1935-06-26" | "Sarah Hughes" | "1911-01-19" | "Amanda Peters" | "1922-06-02" | "Shawn Sanders" | "1936-11-17" | "David White" | "1998-02-25" | "Justin Mccormick" | "1994-03-06" | "James Kelly" | "2005-02-13" | "Kimberly Lynch MD" | "1986-02-18" | "Kimberly Hall" | "2001-07-23" | "Susan Anderson" | "1908-10-26" | "Anthony Hall" | "2012-10-10" | "Lisa Gardner" | "1939-08-17" | "Andrew Burke" | "1981-07-13" | "Brittney Evans" | "1993-06-06" | "Taylor Wood" | "1951-07-29" | "Tanya Taylor" | "2019-02-01" | "Ashley Schultz" | "1993-04-15" | "Amanda Sanchez" | "1923-07-14" | "Harry Stevens" | "1986-11-10" | "David Flores" | "1931-09-16" | "Timothy Gray" | "2017-05-07" | "Jay Ramsey" | "1936-05-08" | "Charles Bell" | "2014-12-05" | "Jacob Williams" | "1963-04-17" | "Kimberly Gonzalez" | "1991-11-15" | "Monica Newton" | "1923-10-27" |
2 | "Jamie Richards" | "1973-11-04" | "Kristin Dunn" | "1973-04-02" | "Kathy Vasquez" | "1966-03-20" | "William Johnson" | "2006-03-08" | "Anthony Brewer" | "1987-06-21" | "James Hendricks" | "1981-03-29" | "Emily Torres" | "1990-11-09" | "Jennifer Santos" | "1982-02-13" | "Travis Snyder" | "2018-09-04" | "Misty Hudson" | "1974-01-03" | "Mathew Price" | "1965-07-09" | "Charles Anthony" | "2003-07-20" | "Sean Cook" | "1919-03-26" | "Krista Hall" | "1951-02-12" | "Sarah Hurst" | "1962-09-22" | "Angela May" | "2009-11-07" | "Antonio Ayers" | "1980-12-11" | "Kathryn Beck" | "1944-05-27" | "Barbara Davis" | "1929-04-19" | "Vanessa Gonzalez" | "1944-06-13" | "Tina Benjamin" | "1911-02-01" | "Anthony Wade" | "1938-10-10" | "Rachel Green" | "2001-02-26" | "Laura Wright" | "1950-01-11" | "Kenneth Stewart" | "1976-03-10" | "Megan Glass" | "1987-10-25" | "Jennifer King" | "1962-01-26" | "Cheyenne Ramirez" | "2009-12-21" | "Hayden Williamson" | "1993-12-27" | "Thomas Buchanan" | "2017-06-23" |
3 | "Michele Reed" | "1994-11-05" | "John Perry" | "1915-07-08" | "Benjamin Garcia" | "2006-03-12" | "Mrs. Carolyn Chavez" | "1999-08-03" | "Bailey Stevens" | "2000-04-09" | "Gilbert Lynch" | "1978-10-06" | "Jennifer Burton" | "2000-12-13" | "Jose Nguyen" | "2020-11-15" | "Yvonne Wall" | "1910-10-16" | "Jacob Owens" | "1943-11-14" | "William Gibson" | "1935-09-11" | "Nicholas Garcia" | "2014-12-28" | "Jamie Diaz" | "1952-10-05" | "Brian Burns" | "1934-03-03" | "Ryan Taylor" | "1978-11-15" | "Ashley Miller" | "1915-01-14" | "Francisco Brewer" | "2006-10-21" | "Melissa Ramos" | "1966-05-01" | "Antonio Moore" | "1914-11-28" | "Darin Russell" | "1994-07-29" | "Eric Santos" | "1928-02-18" | "Mariah Cunningham" | "1958-07-21" | "Deanna White" | "1914-08-30" | "Peggy Williams" | "1973-03-17" | "Michael Richardson" | "1968-02-20" | "Debra Haney" | "1972-04-02" | "Alexander Roberts" | "1977-04-14" | "Derrick Martin" | "1999-09-02" | "Teresa Stephens" | "1928-04-11" | "Andrew Murphy" | "1910-03-21" |
4 | "Steven Burgess" | "1997-12-24" | "Robert Willis" | "2011-04-18" | "Amy Forbes" | "1943-06-26" | "Chase Gutierrez" | "1976-06-21" | "Kevin Gonzales" | "1977-10-03" | "Michael Lyons" | "1973-08-06" | "Brian Carlson" | "1964-07-16" | "Jonathan Brown" | "1983-07-06" | "Ryan Lewis" | "2011-09-30" | "Brandon Drake" | "1995-07-27" | "Christine Mueller" | "1945-07-15" | "Joseph Sanders" | "1995-01-22" | "Patricia Robertson" | "1953-01-29" | "Kevin Young" | "1962-08-12" | "Amber Gray" | "1992-05-30" | "Laura Powers" | "1934-10-28" | "Carrie Ross" | "1974-11-03" | "Anthony Taylor" | "2002-03-31" | "Bill Hawkins" | "1923-01-26" | "Erin Mcbride" | "2009-01-12" | "Roger Meza" | "1934-12-10" | "Sandra Davis" | "1918-11-05" | "Kristie Whitehead" | "1979-12-01" | "James Anderson" | "1954-02-02" | "Sara Aguilar" | "2008-03-09" | "Hector Reynolds" | "1990-08-17" | "Cynthia Klein" | "1931-12-16" | "Richard Copeland" | "1978-11-24" | "Rickey Hudson" | "1926-09-29" | "Carmen Smith" | "1963-11-05" |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
99995 | "Donna Vazquez" | "1938-12-26" | "William Curry" | "1943-03-27" | "Brian Green" | "1941-08-20" | "Cathy Krause" | "1910-10-22" | "Christopher Johnson" | "2006-10-22" | "Amanda Anderson" | "1910-07-21" | "Paul Casey" | "1965-02-23" | "Franklin Strickland" | "2004-03-26" | "Brian Murphy" | "1951-06-04" | "Jennifer Martinez" | "1921-08-12" | "Tony Long" | "1928-04-01" | "Anthony Garcia" | "2015-12-03" | "Todd Gonzales" | "2001-10-31" | "Justin Anderson" | "2021-05-04" | "Chad Garrett" | "1939-06-03" | "Mr. Christopher Harper" | "2002-07-22" | "Jennifer Flores" | "1933-06-17" | "Alexis Porter" | "2020-08-27" | "Jason Pitts" | "1969-03-11" | "Adam Sanchez" | "1959-02-02" | "Angela Cooper" | "1972-02-16" | "Elizabeth Thomas" | "1959-10-25" | "Mark Aguilar" | "1989-03-28" | "Chad Chapman" | "1919-03-01" | "Jessica Doyle" | "1938-04-14" | "Kim Miller" | "2017-08-04" | "Angela Hughes" | "1930-10-31" | "Jamie Thompson" | "2017-02-28" | "Brandy Frederick" | "1932-07-31" | "Patrick Solomon" | "1920-03-12" |
99996 | "Monica Davis" | "1987-10-07" | "Laura Donovan" | "1992-07-30" | "Morgan Garcia" | "1997-01-09" | "Stacey Chambers" | "1944-01-10" | "Stephen Fields" | "1935-10-03" | "Ann Barnes" | "1996-01-04" | "Nicholas Jones" | "1967-02-01" | "Karen Dunn" | "2012-07-21" | "Sheryl Hill" | "1989-02-21" | "Kyle Fox" | "2005-01-04" | "Mark Adkins" | "1981-12-01" | "Spencer Blankenship" | "1934-04-05" | "Catherine Smith" | "2008-12-06" | "Andrea Walker" | "1940-07-01" | "Ricky Barr" | "1912-11-17" | "Desiree Parker" | "1939-11-15" | "Amanda Ruiz" | "2018-08-09" | "Sharon Miller" | "2019-05-05" | "Larry Smith" | "1981-02-24" | "James Harris" | "1990-12-14" | "Justin Smith" | "1994-10-18" | "Tamara Roman" | "1921-04-01" | "Bridget Serrano" | "1974-05-13" | "Ann Melendez" | "1986-07-06" | "Susan Huff" | "2018-09-05" | "Antonio Juarez" | "1995-08-11" | "Ashley Lamb" | "1955-05-31" | "Patrick Morse" | "1932-08-08" | "Thomas Gregory" | "1963-04-11" | "Steven Smith" | "1987-09-22" |
99997 | "Vanessa Byrd" | "1967-03-17" | "Tammy Mcpherson" | "1987-11-06" | "Jacob Schwartz" | "2019-07-07" | "Barry Smith" | "1996-03-17" | "Samuel Martinez" | "1980-02-27" | "Erin Flores" | "1959-04-12" | "Melissa Perez" | "2001-07-28" | "Victoria Wilson" | "1975-02-12" | "Timothy Ramirez" | "1912-08-05" | "Kyle Johnson" | "2015-04-01" | "David Roberts" | "2001-10-01" | "Michael Alexander" | "1922-10-18" | "Robert Green" | "2004-03-13" | "Brent Nelson" | "1973-02-24" | "Curtis Smith" | "1978-11-04" | "Walter Lyons" | "2002-03-03" | "Angela Johnson" | "2015-06-24" | "Amanda Smith" | "1913-01-08" | "Jacqueline Smith" | "2012-01-07" | "Brian Ho" | "1948-12-11" | "Robert Lopez" | "1967-02-04" | "Charles Rogers" | "1996-04-25" | "Marcus Dixon" | "1952-03-11" | "Alexis Moore" | "1952-07-26" | "Joshua Robles" | "1956-11-12" | "Rachel Hull" | "1979-05-28" | "Jessica Smith" | "2006-01-17" | "Joshua Hernandez" | "2005-06-26" | "Melanie Crawford" | "1927-07-12" | "Scott Carlson" | "1950-05-23" |
99998 | "Gary Harmon" | "2002-05-09" | "William Reeves" | "1960-01-23" | "Linda Cohen" | "1953-05-31" | "Damon Soto" | "1948-11-15" | "Christina Hull" | "1992-04-07" | "Sean Boone" | "2022-08-10" | "Benjamin Graham" | "1971-12-30" | "Daniel Pacheco" | "1926-08-16" | "Jennifer Hopkins" | "2016-04-15" | "Anne Kelly" | "1988-12-03" | "Cristina Garcia" | "1997-05-03" | "Lindsay Wright" | "1979-12-03" | "Teresa Berg" | "1953-11-01" | "Lisa Murphy" | "1930-09-15" | "Jacob Hernandez" | "2017-04-25" | "Amber Campbell" | "1983-10-31" | "Ashley Mitchell" | "1962-09-07" | "Jacqueline Singleton" | "1970-06-14" | "Melinda Miller" | "1978-05-08" | "Brenda Gibbs" | "1964-08-13" | "Rebecca Cooper" | "2022-10-04" | "Jonathan Kelly" | "1935-11-19" | "Stephen Sanchez" | "2010-05-15" | "Adam Stark" | "1911-07-12" | "Denise George" | "1930-02-08" | "Brenda Lee" | "2003-09-22" | "Tracey Nelson" | "1910-09-10" | "Greg Esparza" | "2000-07-07" | "Tanner Bell" | "1928-01-15" | "Veronica Collins" | "1934-09-23" |
99999 | "Sheila Elliott" | "1965-08-06" | "Charlene Lee" | "2014-06-22" | "Gabriella Brown" | "2015-03-17" | "Joseph Warren" | "2015-03-18" | "Martin Frederick" | "1992-05-01" | "Cameron Pruitt" | "1992-12-19" | "Jennifer Wilson" | "1929-02-17" | "Scott Soto" | "2014-02-18" | "Lisa Oneill" | "1909-10-27" | "Joy Jones" | "1963-11-07" | "Diana Rodriguez" | "1951-12-27" | "Rachel Estes" | "1986-01-22" | "Susan Smith" | "1985-01-30" | "Cody Perez" | "1932-02-02" | "Pamela Stevenson DDS" | "2020-07-06" | "Katelyn Oconnor" | "1975-11-13" | "James Blair" | "1976-03-29" | "Betty Rowe" | "1991-03-26" | "William Briggs" | "1932-10-01" | "James Hudson" | "1998-01-17" | "Shawn Martinez" | "2013-11-03" | "Carla Shelton" | "2018-10-05" | "Emily Burke" | "1965-04-03" | "Matthew Brown" | "1932-06-30" | "Mrs. Jill Rogers MD" | "2000-05-07" | "Martin Schmidt" | "1977-11-22" | "Cody Collins" | "1994-08-01" | "Kevin Duncan" | "2009-10-15" | "Jerry Brown" | "2004-01-29" | "Nancy Benjamin" | "1966-05-18" |
%%timeit
# naive approach from Part II: unpivot everything into long form,
# split the variable column into header/child, then pivot back to wide
expression = pl.col("variable").str.split("_")
expression = expression.list.to_struct(fields=["header", "child"])
(
    household_large.unpivot(index="family")
    .drop_nulls("value")
    .with_columns(variable=expression)
    .unnest("variable")
    .pivot(index=["family", "child"], on="header", values="value")
)
1.2 s ± 8.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
# optimized approach from Part II: build one struct expression per child,
# then unpivot once - avoiding the expensive pivot back to wide form
index = "family"
columns = pl.Series("header", household_large.columns)
mask = columns.is_in([index]).not_()
columns = columns.filter(mask)
columns_ = columns.str.split("_")
columns_ = columns_.list.to_struct(fields=["header", "child"])
columns_ = columns_.struct.unnest()
columns_ = columns_.with_columns(columns=columns)
# group the source column names by their child suffix
grouped = columns_.group_by("child")
grouped = grouped.agg(pl.all())
expression = [index]
zipped = zip(
    grouped.get_column("header").to_list(),
    grouped.get_column("columns").to_list(),
)
for num, (heads, column_names) in enumerate(zipped):
    # one struct per child, mapping each header to its source column
    expr_ = dict(zip(heads, column_names))
    expr_ = pl.struct(**expr_).alias(str(num))
    expression.append(expr_)
out = household_large.select(expression)
out = out.select(pl.implode("*"))
out = out.unpivot(index=index)
out = out.with_columns(grouped.get_column("child"))
out = out.drop("variable")
out = out.explode(["family", "value"])
out = out.unnest("value")
out
81.3 ms ± 4.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit household_large.pivot_longer(index='family', names_to=('.value','child'), names_sep='_')
55.8 ms ± 959 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
spec = (
pl.DataFrame({".name": household_large.columns[1:]})
.with_columns(
other=pl.col(".name").str.split("_").list.to_struct(fields=[".value", "child"])
)
.unnest("other")
)
pivot_longer_spec(df=household_large, spec=spec)
56.5 ms ± 747 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Your mileage may vary with these speed tests.
Summary
This blog post shows various ways to reshape data into long form, using pivot_longer and pivot_longer_spec. They are meant to be concise, straightforward, and efficient.