Reshape Data in Polars Efficiently from Wide to Long Form - Part III#

Introduction#

This post continues from a previous post, where I highlighted steps to significantly improve performance when reshaping a DataFrame from wide to long form. The solutions proffered there, however, were quite verbose compared to the naive solutions, and were tailored to the specific problems.

In this article I introduce two functions - pivot_longer and pivot_longer_spec - from the pyjanitor library. These functions, ports of the functions of the same name in R's tidyr package, offer conciseness while still being performant.

I am a contributor to the pyjanitor library.

We will reuse most of the examples from a previous blog post to highlight the various usages of pivot_longer and pivot_longer_spec. First things first, let’s install and import the pyjanitor library:

# pip install pyjanitor
import janitor as jn
import janitor.polars
import polars as pl
import polars.selectors as cs
from janitor.polars import pivot_longer_spec
pl.__version__
'1.7.1'
jn.__version__
'0.29.1'

String data in column names#

relig_income = pl.read_csv("./Data_files/relig_income.csv")
relig_income
shape: (18, 11)
religion                 <$10k  $10-20k  $20-30k  $30-40k  $40-50k  $50-75k  $75-100k  $100-150k  >150k  Don't know/refused
str                      i64    i64      i64      i64      i64      i64      i64       i64        i64    i64
"Agnostic"               27     34       60       81       76       137      122       109        84     96
"Atheist"                12     27       37       52       35       70       73        59         74     76
"Buddhist"               27     21       30       34       33       58       62        39         53     54
"Catholic"               418    617      732      670      638      1116     949       792        633    1489
"Don’t know/refused"     15     14       15       11       10       35       21        17         18     116
…
"Orthodox"               13     17       23       32       32       47       38        42         46     73
"Other Christian"        9      7        11       13       13       14       18        14         12     18
"Other Faiths"           20     33       40       46       49       63       46        40         41     71
"Other World Religions"  5      2        3        4        2        7        3         4          4      8
"Unaffiliated"           217    299      374      365      341      528      407       321        258    597

This is an easy use case - pivot_longer simply calls unpivot under the hood:

relig_income.pivot_longer(
    index="religion",
    # all columns will be collated
    # under names_to
    names_to="income",
    # all values in the columns
    # will be collated into values_to
    values_to="count",
)
shape: (180, 3)
religion                 income                count
str                      str                   i64
"Agnostic"               "<$10k"               27
"Atheist"                "<$10k"               12
"Buddhist"               "<$10k"               27
"Catholic"               "<$10k"               418
"Don’t know/refused"     "<$10k"               15
…
"Orthodox"               "Don't know/refused"  73
"Other Christian"        "Don't know/refused"  18
"Other Faiths"           "Don't know/refused"  71
"Other World Religions"  "Don't know/refused"  8
"Unaffiliated"           "Don't know/refused"  597

Numeric data in column names#

billboard = pl.read_csv("./Data_files/billboard.csv", null_values="NA")
billboard = billboard.with_columns(cs.starts_with("wk").cast(pl.Int32))
billboard
shape: (317, 81)
(317 rows × 81 columns: year, artist, track, time, date.entered, and weekly rank columns wk1 through wk76; output truncated)

The aim here is to flip the wk columns into long form, extract the week numbers from the column names, and convert them to an integer dtype. Let’s see below how pivot_longer handles this:

(
    billboard.pivot_longer(
        index=["year", "artist", "track", "time", "date.entered"],
        names_to="week",
        values_to="rank",
        names_pattern=r".+?(\d+)",
        names_transform=pl.col("week").cast(pl.Int32),
    ).drop_nulls(subset="rank")
)
shape: (5_307, 7)
year  artist          track                      time    date.entered  week  rank
i64   str             str                        str     str           i32   i32
2000  "2 Pac"         "Baby Don't Cry (Keep..."  "4:22"  "2000-02-26"  1     87
2000  "2Ge+her"       "The Hardest Part Of ..."  "3:15"  "2000-09-02"  1     91
2000  "3 Doors Down"  "Kryptonite"               "3:53"  "2000-04-08"  1     81
2000  "3 Doors Down"  "Loser"                    "4:24"  "2000-10-21"  1     76
2000  "504 Boyz"      "Wobble Wobble"            "3:35"  "2000-04-15"  1     57
…
2000  "Creed"         "Higher"                   "5:16"  "1999-09-11"  63    50
2000  "Lonestar"      "Amazed"                   "4:25"  "1999-06-05"  63    45
2000  "Creed"         "Higher"                   "5:16"  "1999-09-11"  64    50
2000  "Lonestar"      "Amazed"                   "4:25"  "1999-06-05"  64    50
2000  "Creed"         "Higher"                   "5:16"  "1999-09-11"  65    49

Two things to note here:

  • names_pattern offers a flexible way to extract/separate labels within column names. In this case we are only interested in the integers attached to the wk columns, and the regex captures exactly those digits. Only the captured groups are retained; everything outside the groups is discarded. The other option for splitting columns is the names_sep parameter.

  • names_transform offers a way to convert the data type of the flipped columns. It takes a polars expression.

You can use the names_sep parameter to achieve something similar; here though, the regex is a better fit.

(
    billboard.pivot_longer(
        index=["year", "artist", "track", "time", "date.entered"],
        names_to=["", "week"],
        values_to="rank",
        names_sep="wk",
        names_transform=pl.col("week").cast(pl.Int32),
    )
    .drop("")
    .drop_nulls(subset="rank")
)
shape: (5_307, 7)
year  artist          track                      time    date.entered  week  rank
i64   str             str                        str     str           i32   i32
2000  "2 Pac"         "Baby Don't Cry (Keep..."  "4:22"  "2000-02-26"  1     87
2000  "2Ge+her"       "The Hardest Part Of ..."  "3:15"  "2000-09-02"  1     91
2000  "3 Doors Down"  "Kryptonite"               "3:53"  "2000-04-08"  1     81
2000  "3 Doors Down"  "Loser"                    "4:24"  "2000-10-21"  1     76
2000  "504 Boyz"      "Wobble Wobble"            "3:35"  "2000-04-15"  1     57
…
2000  "Creed"         "Higher"                   "5:16"  "1999-09-11"  63    50
2000  "Lonestar"      "Amazed"                   "4:25"  "1999-06-05"  63    45
2000  "Creed"         "Higher"                   "5:16"  "1999-09-11"  64    50
2000  "Lonestar"      "Amazed"                   "4:25"  "1999-06-05"  64    50
2000  "Creed"         "Higher"                   "5:16"  "1999-09-11"  65    49

Many variables in column names#

who = pl.read_csv("./Data_files/who.csv", null_values="NA")
who
shape: (7_240, 60)
(7,240 rows × 60 columns: country, iso2, iso3, year, and count columns of the form new_sp_m014 … newrel_f65; output truncated)

The aim is to separate the column names into diagnosis, gender and age columns. We can reuse the idea from the previous exercise and pass a regular expression to the names_pattern parameter:

who.pivot_longer(
    index=["country", "iso2", "iso3", "year"],
    names_to=("diagnosis", "gender", "age"),
    names_pattern=r"new_?(.*)_(.)(.*)",
    values_to="count",
)
shape: (405_440, 8)
country        iso2  iso3   year  diagnosis  gender  age    count
str            str   str    i64   str        str     str    i64
"Afghanistan"  "AF"  "AFG"  1980  "sp"       "m"     "014"  null
"Afghanistan"  "AF"  "AFG"  1981  "sp"       "m"     "014"  null
"Afghanistan"  "AF"  "AFG"  1982  "sp"       "m"     "014"  null
"Afghanistan"  "AF"  "AFG"  1983  "sp"       "m"     "014"  null
"Afghanistan"  "AF"  "AFG"  1984  "sp"       "m"     "014"  null
…
"Zimbabwe"     "ZW"  "ZWE"  2009  "rel"      "f"     "65"   null
"Zimbabwe"     "ZW"  "ZWE"  2010  "rel"      "f"     "65"   null
"Zimbabwe"     "ZW"  "ZWE"  2011  "rel"      "f"     "65"   null
"Zimbabwe"     "ZW"  "ZWE"  2012  "rel"      "f"     "65"   null
"Zimbabwe"     "ZW"  "ZWE"  2013  "rel"      "f"     "65"   725
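To see exactly how the names_pattern regex carves up each label, here is a quick check with Python's re module:

```python
import re

# the same pattern passed to names_pattern above
pattern = re.compile(r"new_?(\w*)_(.)(.*)".replace(r"\w*", ".*"))

# "new" is followed by an optional underscore, so both spellings match;
# the three groups become diagnosis, gender and age
assert pattern.fullmatch("new_sp_m014").groups() == ("sp", "m", "014")
assert pattern.fullmatch("newrel_f65").groups() == ("rel", "f", "65")
```

Each capture group maps, in order, to an entry of names_to.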

Multiple observations per row#

household = pl.read_csv("./Data_files/household.csv", null_values="NA")
household
shape: (5, 5)
family  dob_child1    dob_child2    name_child1  name_child2
i64     str           str           str          str
1       "1998-11-26"  "2000-01-29"  "Susan"      "Jose"
2       "1996-06-22"  null          "Mark"       null
3       "2002-07-11"  "2004-04-05"  "Sam"        "Seth"
4       "2004-10-10"  "2009-08-27"  "Craig"      "Khai"
5       "2000-12-05"  "2005-02-28"  "Parker"     "Gracie"

In the DataFrame above, we wish to keep dob and name as separate columns, while flipping the child labels into rows. How does pivot_longer handle this?

household.pivot_longer(index="family", names_to=(".value", "child"), names_sep="_")
shape: (10, 4)
family  child     dob           name
i64     str       str           str
1       "child1"  "1998-11-26"  "Susan"
2       "child1"  "1996-06-22"  "Mark"
3       "child1"  "2002-07-11"  "Sam"
4       "child1"  "2004-10-10"  "Craig"
5       "child1"  "2000-12-05"  "Parker"
1       "child2"  "2000-01-29"  "Jose"
2       "child2"  null          null
3       "child2"  "2004-04-05"  "Seth"
4       "child2"  "2009-08-27"  "Khai"
5       "child2"  "2005-02-28"  "Gracie"

If you want to retain part of a column name as a header, use the .value placeholder; the part paired with .value remains as a header, while the remaining parts, if any, are flipped into rows.

Let’s break it down a bit. The names_to parameter is (".value", "child"), which pairs with (dob, child1), (dob, child2), (name, child1) and (name, child2) after each column name has been split by the names_sep argument. .value is paired with dob and name, while child is paired with child1 and child2. In tabular form, it can be portrayed as below:

.value    names_sep    child
dob       _            child1
dob       _            child2
name      _            child1
name      _            child2
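The pairing can be reproduced with plain Python; a small sketch (the variable names are mine, not pyjanitor's):

```python
# split each source column the way names_sep="_" would
columns = ["dob_child1", "dob_child2", "name_child1", "name_child2"]
names_to = (".value", "child")

pairs = [dict(zip(names_to, column.split("_"))) for column in columns]

# the ".value" parts become headers; the "child" parts become row labels
headers = sorted({p[".value"] for p in pairs})
row_labels = sorted({p["child"] for p in pairs})
```

This is why dob and name survive as columns while child1 and child2 end up as values in the child column.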

Let’s look at another example:

manufacturer = pl.DataFrame(
    {
        "Sony | TV | Model | value": ["A222", "A234", "A4345"],
        "Sony | TV | Quantity | value": [5, 5, 4],
        "Sony | TV | Max-quant | value": [10, 9, 9],
        "Panasonic | TV | Model | value": ["T232", "S3424", "X3421"],
        "Panasonic | TV | Quantity | value": [1, 5, 1],
        "Panasonic | TV | Max-quant | value": [10, 12, 11],
        "Sanyo | Radio | Model | value": ["S111", "S1s1", "S1s2"],
        "Sanyo | Radio | Quantity | value": [4, 2, 4],
        "Sanyo | Radio | Max-quant | value": [9, 9, 10],
    }
)

manufacturer
shape: (3, 9)
(3 rows × 9 columns, as constructed above; output truncated)

We wish to retain Model, Quantity and Max-quant as headers, and flip the remaining parts of the columns:

(
    manufacturer.pivot_longer(
        index=None,
        names_to=("Manufacturer", "Device", ".value", "value"),
        names_sep=" | ",
    ).drop("value")
)
shape: (9, 5)
Manufacturer  Device   Model    Quantity  Max-quant
str           str      str      i64       i64
"Sanyo"       "Radio"  "S111"   4         9
"Sanyo"       "Radio"  "S1s1"   2         9
"Sanyo"       "Radio"  "S1s2"   4         10
"Panasonic"   "TV"     "T232"   1         10
"Panasonic"   "TV"     "S3424"  5         12
"Panasonic"   "TV"     "X3421"  1         11
"Sony"        "TV"     "A222"   5         10
"Sony"        "TV"     "A234"   5         9
"Sony"        "TV"     "A4345"  4         9
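A quick check, in plain Python, of how the " | " separator carves each label into the four parts consumed by names_to:

```python
column = "Sony | TV | Model | value"

# splitting on the literal " | " separator yields the four parts
# that line up with ("Manufacturer", "Device", ".value", "value")
parts = column.split(" | ")
```

The third part pairs with .value, so Model, Quantity and Max-quant stay as headers.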
(
    manufacturer.pivot_longer(
        index=None,
        names_to=("Manufacturer", "Device", ".value"),
        names_pattern=r"([a-zA-Z]+)\s?\|\s?([a-zA-Z]+)\s?\|\s?([a-zA-Z-]+)\s?\|.*",
    )
)
shape: (9, 5)
Manufacturer  Device   Model    Quantity  Max-quant
str           str      str      i64       i64
"Sony"        "TV"     "A222"   5         10
"Sony"        "TV"     "A234"   5         9
"Sony"        "TV"     "A4345"  4         9
"Sanyo"       "Radio"  "S111"   4         9
"Sanyo"       "Radio"  "S1s1"   2         9
"Sanyo"       "Radio"  "S1s2"   4         10
"Panasonic"   "TV"     "T232"   1         10
"Panasonic"   "TV"     "S3424"  5         12
"Panasonic"   "TV"     "X3421"  1         11

What if our header is actually composed of different parts of a column name? This is best explained with an example:

df_mean = pl.DataFrame(
    {
        "x_1_mean": [10],
        "x_2_mean": [20],
        "y_1_mean": [30],
        "y_2_mean": [40],
        "unit": [50],
    }
)

df_mean
shape: (1, 5)
x_1_mean  x_2_mean  y_1_mean  y_2_mean  unit
i64       i64       i64       i64       i64
10        20        30        40        50

For the DataFrame above, we wish to lump similar columns together - specifically, the x* columns into one column, and likewise the y* columns.

With pivot_longer we pass multiple .value to handle this (you can pass as many .value as required for the reshaping):

(
    df_mean.pivot_longer(
        index="unit",
        names_to=(".value", "number", ".value"),
        names_sep="_",
        names_transform=pl.col("number").cast(pl.Int8),
    )
)
shape: (2, 4)
unit  number  xmean  ymean
i64   i8      i64    i64
50    1       10     30
50    2       20     40
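The header construction with multiple .value entries can be sketched in plain Python - the parts paired with .value are concatenated, in order, into the new header (variable names here are mine):

```python
names_to = (".value", "number", ".value")
column = "x_1_mean"

# what names_sep="_" produces for this column
parts = column.split("_")

# ".value" parts are glued together into the header; the rest become row data
header = "".join(part for spec, part in zip(names_to, parts) if spec == ".value")
number = next(part for spec, part in zip(names_to, parts) if spec == "number")
```

That is why the output above has xmean and ymean columns, with the middle digits collected under number.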

Describe the output with pivot_longer_spec#

pivot_longer offers a powerful, flexible and concise way to reshape data from wide to long form. However, there are situations where pivot_longer is inadequate for the transformation, or you need more control over the reshaping. This is where pivot_longer_spec shines - you describe how the new DataFrame should look, based on the source DataFrame.

Let’s look at an example, adapted from a data.table vignette:

iris = pl.DataFrame(
    {
        "Sepal.Length": [5.1, 5.9],
        "Sepal.Width": [3.5, 3.0],
        "Petal.Length": [1.4, 5.1],
        "Petal.Width": [0.2, 1.8],
        "Species": ["setosa", "virginica"],
    }
)

iris
shape: (2, 5)
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
f64           f64          f64           f64          str
5.1           3.5          1.4           0.2          "setosa"
5.9           3.0          5.1           1.8          "virginica"

For the DataFrame above, we wish to consolidate the columns into part and dimension columns - the Sepal/Petal prefixes go into the part column, while Width/Length go into the dimension column. Let’s translate the specification into a table form:

source_column_name    new_column_name    part     dimension
Sepal.Length          measurement        Sepal    Length
Sepal.Width           measurement        Sepal    Width
Petal.Length          measurement        Petal    Length
Petal.Width           measurement        Petal    Width

The table above shows a clear relation between the source and target; let’s take this relationship and create a polars DataFrame of this specification:

spec = {
    ".name": ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"],
    ".value": ["measurement", "measurement", "measurement", "measurement"],
    "part": ["Sepal", "Sepal", "Petal", "Petal"],
    "dimension": ["Length", "Width", "Length", "Width"],
}
spec = pl.DataFrame(spec)
spec
shape: (4, 4)
.name           .value         part     dimension
str             str            str      str
"Sepal.Length"  "measurement"  "Sepal"  "Length"
"Sepal.Width"   "measurement"  "Sepal"  "Width"
"Petal.Length"  "measurement"  "Petal"  "Length"
"Petal.Width"   "measurement"  "Petal"  "Width"

pivot_longer_spec takes the specification above and returns the expected new DataFrame:

pivot_longer_spec(df=iris, spec=spec)
shape: (8, 4)
Species      part     dimension  measurement
str          str      str        f64
"setosa"     "Petal"  "Length"   1.4
"virginica"  "Petal"  "Length"   5.1
"setosa"     "Sepal"  "Width"    3.5
"virginica"  "Sepal"  "Width"    3.0
"setosa"     "Petal"  "Width"    0.2
"virginica"  "Petal"  "Width"    1.8
"setosa"     "Sepal"  "Length"   5.1
"virginica"  "Sepal"  "Length"   5.9

Let’s take it further - the Sepal/Petal prefixes go into a part column, while the Width/Length suffixes remain as headers:

source_column_name    new_column_name    part
Sepal.Length          Length             Sepal
Sepal.Width           Width              Sepal
Petal.Length          Length             Petal
Petal.Width           Width              Petal

Again, the relationship is clear. Let’s create the spec, and then unpivot:

spec = {
    ".name": ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"],
    "part": ["Sepal", "Sepal", "Petal", "Petal"],
    ".value": ["Length", "Width", "Length", "Width"],
}
spec = pl.DataFrame(spec)
spec
shape: (4, 3)
.name           part     .value
str             str      str
"Sepal.Length"  "Sepal"  "Length"
"Sepal.Width"   "Sepal"  "Width"
"Petal.Length"  "Petal"  "Length"
"Petal.Width"   "Petal"  "Width"
pivot_longer_spec(df=iris, spec=spec)
shape: (4, 4)
Species      part     Length  Width
str          str      f64     f64
"setosa"     "Petal"  1.4     0.2
"virginica"  "Petal"  5.1     1.8
"setosa"     "Sepal"  5.1     3.5
"virginica"  "Sepal"  5.9     3.0

The idea is simple - you know your data well, so describe the new DataFrame based on the current DataFrame, and pivot_longer_spec will take care of the transformation.

Performance#

pivot_longer is not just simple, convenient, and straightforward; it is efficient as well. We can reuse the example from part II of this series to compare speeds:

from zipfile import ZipFile

zip_file = ZipFile("./Data_files/household_large.csv.zip")
zip_file = zip_file.read("household_large.csv")
household_large = pl.read_csv(zip_file, null_values="NA")
household_large
shape: (100_000, 61)
(100,000 rows × 61 columns: family, plus name_child0/dob_child0 through name_child29/dob_child29; output truncated)
%%timeit
expression = pl.col("variable").str.split("_")
expression = expression.list.to_struct(fields=["header", "child"])
(
    household_large.unpivot(index="family")
    .drop_nulls("value")
    .with_columns(variable=expression)
    .unnest("variable")
    .pivot(index=["family", "child"], on="header", values="value")
)
1.2 s ± 8.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
index = "family"
columns = pl.Series("header", household_large.columns)
mask = columns.is_in([index]).not_()
columns = columns.filter(mask)
columns_ = columns.str.split("_")
columns_ = columns_.list.to_struct(fields=["header", "child"])
columns_ = columns_.struct.unnest()
columns_ = columns_.with_columns(columns=columns)
grouped = columns_.group_by("child")
grouped = grouped.agg(pl.all())
expression = [index]
zipped = zip(
    grouped.get_column("header").to_list(),
    grouped.get_column("columns").to_list(),
)
for num, (heads, column_names) in enumerate(zipped):
    expr_ = dict(zip(heads, column_names))
    expr_ = pl.struct(**expr_).alias(str(num))
    expression.append(expr_)
out = household_large.select(expression)
out = out.select(pl.implode('*'))
out = out.unpivot(index=index)
out = out.with_columns(grouped.get_column('child'))
out = out.drop('variable')
out = out.explode(['family','value'])
out = out.unnest('value')
out
81.3 ms ± 4.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit household_large.pivot_longer(index='family', names_to=('.value','child'), names_sep='_')
55.8 ms ± 959 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
spec = (
    pl.DataFrame({".name": household_large.columns[1:]})
    .with_columns(
        other=pl.col(".name").str.split("_").list.to_struct(fields=[".value", "child"])
    )
    .unnest("other")
)
pivot_longer_spec(df=household_large, spec=spec)
56.5 ms ± 747 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Your mileage may vary with these speed tests.

Summary#

This blog post shows various ways to reshape data into long form, using pivot_longer and pivot_longer_spec. They are meant to be concise, straightforward and efficient.
