pivot_longer: Reshape Data in pandas from Wide to Long, Efficiently and with Ease#

Introduction#

You’ve got data in wide form (lots of columns), and you wish to transform it into long form (fewer columns, more rows). In pandas, depending on the shape of the data, this can be achieved with pd.stack, pd.melt, or pd.wide_to_long, and it may require a bit of data massaging before arriving at the final, desired form.

This article suggests a simple, efficient, opinionated approach that resolves most wide-to-long transformations within a single function: pivot_longer from pyjanitor.

Let’s look at an example, based on real data - this is a subset of the who data from R:

import pandas as pd
import numpy as np
import sys

print(
    " pandas version :",
    pd.__version__,
    "\n",
    "numpy version :",
    np.__version__,
    "\n",
    "python version :",
    sys.version,
)
 pandas version : 2.2.2 
 numpy version : 2.0.2 
 python version : 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]
df = pd.DataFrame({"id": [1], "new_sp_m5564": [2], "newrel_f65": [3]})
df
id new_sp_m5564 newrel_f65
0 1 2 3

Looking at the dataframe above, we have multiple variables crammed into the column names. Using the definitions as described here:

    - new_/new prefix indicates these are counts of new cases
    - sp/rel/ep describe how the case was diagnosed
    - m/f gives the gender
    - 5564/65 supplies the age range

Our goal is to separate the dataframe into individual columns of count, diagnosis, gender, and age. How do we pull this off with the current tools in pandas?

pd.wide_to_long is not applicable here, as there is no unique separator. Our best option would be to use a regular expression to pull out the individual columns.

With pd.stack:

# split the columns using a regex, before stacking:
DF = df.set_index("id")
regex = r"new_?(.+)_(.)(\d+)"
columns = DF.columns.str.extract(regex)
new_names = ["diagnosis", "gender", "age"]
columns = pd.MultiIndex.from_frame(columns, names=new_names)
DF.columns = columns
DF.stack(new_names, future_stack=True).rename("count").reset_index()
id diagnosis gender age count
0 1 sp m 5564 2
1 1 rel f 65 3

With pd.melt:

DF = df.melt("id")
out = DF.variable.str.extract(regex).set_axis(new_names, axis=1)
pd.concat([DF.drop(columns="variable"), out], axis=1)
id value diagnosis gender age
0 1 2 sp m 5564
1 1 3 rel f 65

Note the extra steps we had to take to get to our final form. However, we know the pattern (a regex that matches the columns and the subgroups we wish to extract), so why not take advantage of that? This is where pivot_longer from pyjanitor shines:

# pip install pyjanitor
import janitor

df.pivot_longer(index="id", names_to=new_names, names_pattern=regex)
id diagnosis gender age value
0 1 sp m 5564 2
1 1 rel f 65 3

Straight to the point. If there is a pattern, pivot_longer will figure out what to do and deliver the results. It is a simple, efficient reshaping abstraction.

Basics#

Let’s walk through some of the parameters in pivot_longer - a small sketch after the list ties them together:

  • index : These are the columns that will not be transformed to long form - the values are repeated as necessary.

  • column_names: These are the columns that will be transformed to long form.

  • names_to : These will be the new column names of the reshaped DataFrame.

  • values_to: These will contain the values collated from the original DataFrame.

  • names_sep: extract individual columns with a separator.

  • names_pattern: extract individual columns with a regular expression, or a list of regular expressions.

  • names_transform: efficiently convert the columns aggregated from the column_names from string to another dtype, as long as it is supported by DataFrame.astype.
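Tying these together, here is a minimal sketch on the who-style frame from the introduction (the values_to name and the names_transform dict here are illustrative choices, not required values):

df.pivot_longer(
    index="id",  # stays as-is; values are repeated as needed
    column_names=["new_sp_m5564", "newrel_f65"],  # flipped to long form
    names_to=["diagnosis", "gender", "age"],  # new columns built from the old names
    values_to="count",  # where the collated values go
    names_pattern=r"new_?(.+)_(.)(\d+)",  # how the old names are split up
    names_transform={"age": "int64"},  # convert the age strings to integers
)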

Keep in mind that pivot_longer works on patterns - deciphering what pattern to use is the key to its simplicity.

The rest of the article outlines the various patterns that pivot_longer works with.

Pattern 1 - Melt all Columns#

This is basic and is similar to what you get with pd.melt.

Let’s use another realistic dataset - relig_income - the original example is from here:

url = "https://raw.githubusercontent.com/tidyverse/tidyr/main/data-raw/relig_income.csv"
religion = pd.read_csv(url)
religion.head()
religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k $100-150k >150k Don't know/refused
0 Agnostic 27 34 60 81 76 137 122 109 84 96
1 Atheist 12 27 37 52 35 70 73 59 74 76
2 Buddhist 27 21 30 34 33 58 62 39 53 54
3 Catholic 418 617 732 670 638 1116 949 792 633 1489
4 Don’t know/refused 15 14 15 11 10 35 21 17 18 116
religion.pivot_longer(index="religion", names_to="income", values_to="count")
religion income count
0 Agnostic <$10k 27
1 Atheist <$10k 12
2 Buddhist <$10k 27
3 Catholic <$10k 418
4 Don’t know/refused <$10k 15
... ... ... ...
175 Orthodox Don't know/refused 73
176 Other Christian Don't know/refused 18
177 Other Faiths Don't know/refused 71
178 Other World Religions Don't know/refused 8
179 Unaffiliated Don't know/refused 597

180 rows × 3 columns

If there are a lot of columns, we can dynamically select the index, or column_names, using the janitor.select syntax. Let’s apply it to another dataset related to billboard charts - original example is here:

url = "https://raw.githubusercontent.com/tidyverse/tidyr/main/data-raw/billboard.csv"
billboard = pd.read_csv(url)
billboard.head()
year artist track time date.entered wk1 wk2 wk3 wk4 wk5 ... wk67 wk68 wk69 wk70 wk71 wk72 wk73 wk74 wk75 wk76
0 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 87 82.0 72.0 77.0 87.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02 91 87.0 92.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2000 3 Doors Down Kryptonite 3:53 2000-04-08 81 70.0 68.0 67.0 66.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2000 3 Doors Down Loser 4:24 2000-10-21 76 76.0 72.0 69.0 67.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2000 504 Boyz Wobble Wobble 3:35 2000-04-15 57 34.0 25.0 17.0 17.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 81 columns

The columns that will be flipped to long form are the columns that start with wk - these will be passed to the column_names parameter. That is a lot of columns. Let’s see how pivot_longer simplifies this:

billboard.pivot_longer(column_names="wk*", names_to="week")
year artist track time date.entered week value
0 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 wk1 87.0
1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02 wk1 91.0
2 2000 3 Doors Down Kryptonite 3:53 2000-04-08 wk1 81.0
3 2000 3 Doors Down Loser 4:24 2000-10-21 wk1 76.0
4 2000 504 Boyz Wobble Wobble 3:35 2000-04-15 wk1 57.0
... ... ... ... ... ... ... ...
24087 2000 Yankee Grey Another Nine Minutes 3:10 2000-04-29 wk76 NaN
24088 2000 Yearwood, Trisha Real Live Woman 3:55 2000-04-01 wk76 NaN
24089 2000 Ying Yang Twins Whistle While You Tw... 4:19 2000-03-18 wk76 NaN
24090 2000 Zombie Nation Kernkraft 400 3:30 2000-09-02 wk76 NaN
24091 2000 matchbox twenty Bent 4:12 2000-04-29 wk76 NaN

24092 rows × 7 columns
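The select syntax accepts more than glob strings; a compiled regular expression should select the same columns (a sketch, assuming pyjanitor's select support for compiled regexes - check the select documentation for the full range of options):

import re

billboard.pivot_longer(column_names=re.compile(r"^wk"), names_to="week")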

If we want to use the index parameter instead, that’s easy as well:

billboard.pivot_longer(index=slice("year", "date.entered"), names_to="week")
year artist track time date.entered week value
0 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 wk1 87.0
1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02 wk1 91.0
2 2000 3 Doors Down Kryptonite 3:53 2000-04-08 wk1 81.0
3 2000 3 Doors Down Loser 4:24 2000-10-21 wk1 76.0
4 2000 504 Boyz Wobble Wobble 3:35 2000-04-15 wk1 57.0
... ... ... ... ... ... ... ...
24087 2000 Yankee Grey Another Nine Minutes 3:10 2000-04-29 wk76 NaN
24088 2000 Yearwood, Trisha Real Live Woman 3:55 2000-04-01 wk76 NaN
24089 2000 Ying Yang Twins Whistle While You Tw... 4:19 2000-03-18 wk76 NaN
24090 2000 Zombie Nation Kernkraft 400 3:30 2000-09-02 wk76 NaN
24091 2000 matchbox twenty Bent 4:12 2000-04-29 wk76 NaN

24092 rows × 7 columns

Pattern 2A - Melt into multiple columns - new column names#

There are scenarios where multiple variables are crammed into the column names - we had a look at that already with the who example.

Let’s look at an example, adapted from data.table vignette:

iris = pd.DataFrame(
    {
        "Sepal.Length": [5.1, 5.9],
        "Sepal.Width": [3.5, 3.0],
        "Petal.Length": [1.4, 5.1],
        "Petal.Width": [0.2, 1.8],
        "Species": ["setosa", "virginica"],
    }
)

iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 5.9 3.0 5.1 1.8 virginica

For the data above, we wish to consolidate the columns into part and dimension - the Sepal/Petal prefixes go into the part column, while Width/Length go into the dimension column.

In this case, we can use either the names_sep or the names_pattern parameter - for this particular scenario, the names_sep option is a better fit:

iris.pivot_longer(index="Species", names_to=["part", "dimension"], names_sep=r".")
Species part dimension value
0 setosa Sepal Length 5.1
1 virginica Sepal Length 5.9
2 setosa Sepal Width 3.5
3 virginica Sepal Width 3.0
4 setosa Petal Length 1.4
5 virginica Petal Length 5.1
6 setosa Petal Width 0.2
7 virginica Petal Width 1.8

Using names_pattern:

iris.pivot_longer(
    index="Species", names_to=["part", "dimension"], names_pattern=r"(.+)\.(.+)"
)
Species part dimension value
0 setosa Sepal Length 5.1
1 virginica Sepal Length 5.9
2 setosa Sepal Width 3.5
3 virginica Sepal Width 3.0
4 setosa Petal Length 1.4
5 virginica Petal Length 5.1
6 setosa Petal Width 0.2
7 virginica Petal Width 1.8

Pattern 2B - Melt into multiple columns - Use variables within the column names as new names#

Instead of new column names, what if we wanted to use existing variables within the column names?

Let’s look at an example adapted from data.table vignette:

df = pd.DataFrame(
    {
        "family": [1, 2, 3, 4, 5],
        "dob_child1": [
            "1998-11-26",
            "1996-06-22",
            "2002-07-11",
            "2004-10-10",
            "2000-12-05",
        ],
        "dob_child2": [
            "2000-01-29",
            np.nan,
            "2004-04-05",
            "2009-08-27",
            "2005-02-28",
        ],
        "gender_child1": [1, 2, 2, 1, 2],
        "gender_child2": [2.0, np.nan, 2.0, 1.0, 1.0],
    }
)

df
family dob_child1 dob_child2 gender_child1 gender_child2
0 1 1998-11-26 2000-01-29 1 2.0
1 2 1996-06-22 NaN 2 NaN
2 3 2002-07-11 2004-04-05 2 2.0
3 4 2004-10-10 2009-08-27 1 1.0
4 5 2000-12-05 2005-02-28 2 1.0

For the data above, we wish to keep dob and gender as column names, while moving child1 and child2 to a new column. How does pivot_longer reshape this?

df.pivot_longer(index="family", names_to=(".value", "child"), names_sep="_")
family child dob gender
0 1 child1 1998-11-26 1.0
1 2 child1 1996-06-22 2.0
2 3 child1 2002-07-11 2.0
3 4 child1 2004-10-10 1.0
4 5 child1 2000-12-05 2.0
5 1 child2 2000-01-29 2.0
6 2 child2 NaN NaN
7 3 child2 2004-04-05 2.0
8 4 child2 2009-08-27 1.0
9 5 child2 2005-02-28 1.0

For patterns such as this, where a part of the column name is to be retained as header in the new dataframe, the .value placeholder comes in handy. The .value placeholder tells pivot_longer - hey, this variable in the column name is to be retained as a header. pivot_longer gets the message and acts accordingly.

Take column label dob_child1 - the first part is dob and the second part is child1. This pairs with (.value, child), meaning dob stays as header, and the rest gets lumped into the child column. The names_sep value - _ - determines how the column is split into the new columns.
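For comparison, this kind of reshape - keep a stub as a header, move the suffix into its own column - is what pd.wide_to_long handles in vanilla pandas. A rough sketch on the df above (note the non-default suffix regex, since child1/child2 are not purely numeric):

pd.wide_to_long(
    df,
    stubnames=["dob", "gender"],  # the parts retained as headers
    i="family",  # plays the role of index
    j="child",  # the new column holding the suffixes
    sep="_",
    suffix=r"child\d",  # the default suffix regex matches numbers only
).reset_index()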

With names_pattern we can have more control on the output, selecting just the numbers in the child column:

df.pivot_longer(
    index="family", names_to=(".value", "child"), names_pattern=r"(.+)_child(\d)"
)
family child dob gender
0 1 1 1998-11-26 1.0
1 2 1 1996-06-22 2.0
2 3 1 2002-07-11 2.0
3 4 1 2004-10-10 1.0
4 5 1 2000-12-05 2.0
5 1 2 2000-01-29 2.0
6 2 2 NaN NaN
7 3 2 2004-04-05 2.0
8 4 2 2009-08-27 1.0
9 5 2 2005-02-28 1.0

We can also efficiently convert the data type of child to integer, using the names_transform option:

df.pivot_longer(
    index="family",
    names_to=(".value", "child"),
    names_pattern=r"(.+)_child(\d)",
    names_transform=np.int8,
)
family child dob gender
0 1 1 1998-11-26 1.0
1 2 1 1996-06-22 2.0
2 3 1 2002-07-11 2.0
3 4 1 2004-10-10 1.0
4 5 1 2000-12-05 2.0
5 1 2 2000-01-29 2.0
6 2 2 NaN NaN
7 3 2 2004-04-05 2.0
8 4 2 2009-08-27 1.0
9 5 2 2005-02-28 1.0
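The printed frame looks identical, but the child column is now an 8-bit integer instead of a string; a quick check, assigning the result first:

out = df.pivot_longer(
    index="family",
    names_to=(".value", "child"),
    names_pattern=r"(.+)_child(\d)",
    names_transform=np.int8,
)
out["child"].dtype  # dtype('int8')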

The flexibility of .value means you can place it anywhere - you are not limited by position - as long as the right argument is passed to the names_sep or names_pattern parameter, your output should be fine.

Let’s apply this to the iris dataset to test the flexibility of .value:

iris.pivot_longer(index="Species", names_to=("part", ".value"), names_sep=".")
Species part Length Width
0 setosa Sepal 5.1 3.5
1 virginica Sepal 5.9 3.0
2 setosa Petal 1.4 0.2
3 virginica Petal 5.1 1.8

You can use multiple .value placeholders if required - let’s apply this to a question from Stack Overflow:

df = pd.DataFrame(
    [
        {
            "ID": 1,
            "DateRange1Start": "1/1/90",
            "DateRange1End": "3/1/90",
            "Value1": 4.4,
            "DateRange2Start": "4/5/91",
            "DateRange2End": "6/7/91",
            "Value2": 6.2,
            "DateRange3Start": "5/5/95",
            "DateRange3End": "6/6/96",
            "Value3": 3.3,
        }
    ]
)

df
ID DateRange1Start DateRange1End Value1 DateRange2Start DateRange2End Value2 DateRange3Start DateRange3End Value3
0 1 1/1/90 3/1/90 4.4 4/5/91 6/7/91 6.2 5/5/95 6/6/96 3.3
df.pivot_longer(index="ID", names_to=(".value", ".value"), names_pattern=r"(.+)\d(.*)")
ID DateRangeStart DateRangeEnd Value
0 1 1/1/90 3/1/90 4.4
1 1 4/5/91 6/7/91 6.2
2 1 5/5/95 6/6/96 3.3

Again, as long as the pattern can be identified, pivot_longer will take care of the rest.

One more example, adapted from Stack Overflow:

df = pd.DataFrame(
    {
        "Sony | TV | Model | value": {0: "A222", 1: "A234", 2: "A4345"},
        "Sony | TV | Quantity | value": {0: 5, 1: 5, 2: 4},
        "Sony | TV | Max-quant | value": {0: 10, 1: 9, 2: 9},
        "Panasonic | TV | Model | value": {0: "T232", 1: "S3424", 2: "X3421"},
        "Panasonic | TV | Quantity | value": {0: 1, 1: 5, 2: 1},
        "Panasonic | TV | Max-quant | value": {0: 10, 1: 12, 2: 11},
        "Sanyo | Radio | Model | value": {0: "S111", 1: "S1s1", 2: "S1s2"},
        "Sanyo | Radio | Quantity | value": {0: 4, 1: 2, 2: 4},
        "Sanyo | Radio | Max-quant | value": {0: 9, 1: 9, 2: 10},
    }
)

df
Sony | TV | Model | value Sony | TV | Quantity | value Sony | TV | Max-quant | value Panasonic | TV | Model | value Panasonic | TV | Quantity | value Panasonic | TV | Max-quant | value Sanyo | Radio | Model | value Sanyo | Radio | Quantity | value Sanyo | Radio | Max-quant | value
0 A222 5 10 T232 1 10 S111 4 9
1 A234 5 9 S3424 5 12 S1s1 2 9
2 A4345 4 9 X3421 1 11 S1s2 4 10
df.pivot_longer(
    names_to=("Manufacturer", "Device", ".value"),
    names_pattern=r"(.+)\s\|(.+)\s\|(.+)\s\|.*",
)
Manufacturer Device Model Quantity Max-quant
0 Sony TV A222 5 10
1 Sony TV A234 5 9
2 Sony TV A4345 4 9
3 Panasonic TV T232 1 10
4 Panasonic TV S3424 5 12
5 Panasonic TV X3421 1 11
6 Sanyo Radio S111 4 9
7 Sanyo Radio S1s1 2 9
8 Sanyo Radio S1s2 4 10

Pattern 3A - Group similar columns together#

This is best explained with an example:

df_mean = pd.DataFrame(
    {
        "x_1_mean": [10],
        "x_2_mean": [20],
        "y_1_mean": [30],
        "y_2_mean": [40],
        "unit": [50],
    }
)

df_mean
x_1_mean x_2_mean y_1_mean y_2_mean unit
0 10 20 30 40 50

For the dataframe above, we wish to lump similar columns together - specifically, the x* columns go into one column, and the y* columns into another.

For this pattern, we pass a list of names to names_to, while passing a list of regular expressions to names_pattern:

df_mean.pivot_longer(
    index="unit", names_to=["x_mean", "y_mean"], names_pattern=["x", "y"]
)
unit x_mean y_mean
0 50 10 30
1 50 20 40

Using a list of regular expressions for names_pattern implies that you want to group similar items under the specified columns in names_to.

Basically, it is a pairing: the first x* column, x_1_mean, pairs with the first y* column it encounters, which is y_1_mean; the next x* column, x_2_mean, pairs with y_2_mean, since that is the next available y*. The pairing is on a first come, first served basis.
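For intuition, a vanilla-pandas equivalent of that pairing (a sketch only - pivot_longer does this in a single call) melts each group separately and concatenates the pieces side by side:

# melt the x* columns and the y* columns separately
x = df_mean.melt(id_vars="unit", value_vars=["x_1_mean", "x_2_mean"], value_name="x_mean")
y = df_mean.melt(value_vars=["y_1_mean", "y_2_mean"], value_name="y_mean")
# first x pairs with first y, second x with second y - first come, first served
pd.concat([x.drop(columns="variable"), y.drop(columns="variable")], axis=1)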

Let’s look at another example:

url = "https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/movie_altered.csv"
movies = pd.read_csv(url)
movies.head()
title rating year duration director_1 director_fb_likes_1 actor_1 actor_2 actor_3 actor_fb_likes_1 actor_fb_likes_2 actor_fb_likes_3
0 Avatar PG-13 2009.0 178.0 James Cameron 0.0 CCH Pounder Joel David Moore Wes Studi 1000.0 936.0 855.0
1 Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 Gore Verbinski 563.0 Johnny Depp Orlando Bloom Jack Davenport 40000.0 5000.0 1000.0
2 Spectre PG-13 2015.0 148.0 Sam Mendes 0.0 Christoph Waltz Rory Kinnear Stephanie Sigman 11000.0 393.0 161.0
3 The Dark Knight Rises PG-13 2012.0 164.0 Christopher Nolan 22000.0 Tom Hardy Christian Bale Joseph Gordon-Levitt 27000.0 23000.0 23000.0
4 Star Wars: Episode VII - The Force Awakens NaN NaN NaN Doug Walker 131.0 Doug Walker Rob Walker NaN 131.0 12.0 NaN

The goal for the dataframe above is to group the directors into one column, the actors into another, and their respective facebook likes into separate columns. The pattern here is obvious - we are grouping similar columns, so we pass a list of regular expressions to names_pattern, with an equal number of names to names_to:

movies.pivot_longer(
    index=slice("title", "duration"),
    names_to=["director", "director_fb_likes", "actor", "actor_fb_likes"],
    names_pattern=["director_\\d$", "director_fb_likes", "actor_\\d$", "actor.+"],
)
title rating year duration director director_fb_likes actor actor_fb_likes
0 Avatar PG-13 2009.0 178.0 James Cameron 0.0 CCH Pounder 1000.0
1 Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 Gore Verbinski 563.0 Johnny Depp 40000.0
2 Spectre PG-13 2015.0 148.0 Sam Mendes 0.0 Christoph Waltz 11000.0
3 The Dark Knight Rises PG-13 2012.0 164.0 Christopher Nolan 22000.0 Tom Hardy 27000.0
4 Star Wars: Episode VII - The Force Awakens NaN NaN NaN Doug Walker 131.0 Doug Walker 131.0
... ... ... ... ... ... ... ... ...
14743 Signed Sealed Delivered NaN 2013.0 87.0 NaN NaN Crystal Lowe 319.0
14744 The Following TV-14 NaN 43.0 NaN NaN Sam Underwood 319.0
14745 A Plague So Pleasant NaN 2013.0 76.0 NaN NaN David Chandler 0.0
14746 Shanghai Calling PG-13 2012.0 100.0 NaN NaN Eliza Coupe 490.0
14747 My Date with Drew PG 2004.0 90.0 NaN NaN Jon Gunn 16.0

14748 rows × 8 columns

Based on the same first come, first served approach, director_1 is paired with director_fb_likes_1, actor_1 is paired with actor_fb_likes_1; the pairing continues for the rest.

This could also have been solved using the .value approach:

(
    movies.pivot_longer(
        index=slice("title", "duration"), names_to=".value", names_pattern="(.+)_\\d"
    )
)
title rating year duration director director_fb_likes actor actor_fb_likes
0 Avatar PG-13 2009.0 178.0 James Cameron 0.0 CCH Pounder 1000.0
1 Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 Gore Verbinski 563.0 Johnny Depp 40000.0
2 Spectre PG-13 2015.0 148.0 Sam Mendes 0.0 Christoph Waltz 11000.0
3 The Dark Knight Rises PG-13 2012.0 164.0 Christopher Nolan 22000.0 Tom Hardy 27000.0
4 Star Wars: Episode VII - The Force Awakens NaN NaN NaN Doug Walker 131.0 Doug Walker 131.0
... ... ... ... ... ... ... ... ...
14743 Signed Sealed Delivered NaN 2013.0 87.0 NaN NaN Crystal Lowe 319.0
14744 The Following TV-14 NaN 43.0 NaN NaN Sam Underwood 319.0
14745 A Plague So Pleasant NaN 2013.0 76.0 NaN NaN David Chandler 0.0
14746 Shanghai Calling PG-13 2012.0 100.0 NaN NaN Eliza Coupe 490.0
14747 My Date with Drew PG 2004.0 90.0 NaN NaN Jon Gunn 16.0

14748 rows × 8 columns

Again, it is all about identifying the pattern, and using whichever option is more suitable/convenient. One more example:

from pandas import NA

treatments = dict(
    id=range(1, 6),
    A=("A", NA, "A", NA, NA),
    A_date=(1, NA, 2, NA, NA),
    B=(NA, "B", "B", NA, NA),
    B_date=(NA, 3, 2, NA, NA),
    other=(NA, NA, NA, "C", "D"),
    other_date=(NA, NA, NA, 1, 5),
)
treatments = pd.DataFrame(treatments)
treatments
id A A_date B B_date other other_date
0 1 A 1 <NA> <NA> <NA> <NA>
1 2 <NA> <NA> B 3 <NA> <NA>
2 3 A 2 B 2 <NA> <NA>
3 4 <NA> <NA> <NA> <NA> C 1
4 5 <NA> <NA> <NA> <NA> D 5

The goal is to transform the dataframe into three columns, an id column, a treatment column, and a date column. Since we are grouping similar columns, we pass a list of regular expressions to names_pattern:

treatments.pivot_longer(
    index="id",
    names_to=["treatment", "date"],
    names_pattern=["A$|B$|other$", ".+date$"],
    dropna=True,
)
id treatment date
0 1 A 1
1 3 A 2
2 2 B 3
3 3 B 2
4 4 C 1
5 5 D 5

Another way is to access the dates before the treatments, by reversing the order of the regular expressions:

treatments.pivot_longer(
    index="id",
    names_to=["date", "treatment"],
    names_pattern=[".+date$", ".+"],
    dropna=True,
)
id treatment date
0 1 A 1
1 3 A 2
2 2 B 3
3 3 B 2
4 4 C 1
5 5 D 5

Pattern 3B - Group similar columns and values together#

This is best illustrated with an example:

df = pd.DataFrame(
    {
        "City": ["Houston", "Austin", "Hoover"],
        "State": ["Texas", "Texas", "Alabama"],
        "Name": ["Aria", "Penelope", "Niko"],
        "Mango": [4, 10, 90],
        "Orange": [10, 8, 14],
        "Watermelon": [40, 99, 43],
        "Gin": [16, 200, 34],
        "Vodka": [20, 33, 18],
    },
    columns=["City", "State", "Name", "Mango", "Orange", "Watermelon", "Gin", "Vodka"],
)


df
City State Name Mango Orange Watermelon Gin Vodka
0 Houston Texas Aria 4 10 40 16 20
1 Austin Texas Penelope 10 8 99 200 33
2 Hoover Alabama Niko 90 14 43 34 18

The goal in this case is to group similar columns and values together - the fruits go into one column and the drinks into another, with the values for the fruits collected into one column and the values for the drinks into another. That’s the pattern; for this pattern we continue with the idea from earlier, passing a list of regular expressions to names_pattern, an equal number of names to names_to, and an equal number of names to values_to:

df.pivot_longer(
    index=["City", "State"],
    column_names=slice("Mango", "Vodka"),
    names_to=("Fruit", "Drink"),
    values_to=("Pounds", "Ounces"),
    names_pattern=[r"M|O|W", r"G|V"],
)
City State Fruit Drink Pounds Ounces
0 Houston Texas Mango Gin 4 16.0
1 Austin Texas Mango Gin 10 200.0
2 Hoover Alabama Mango Gin 90 34.0
3 Houston Texas Orange Vodka 10 20.0
4 Austin Texas Orange Vodka 8 33.0
5 Hoover Alabama Orange Vodka 14 18.0
6 Houston Texas Watermelon None 40 NaN
7 Austin Texas Watermelon None 99 NaN
8 Hoover Alabama Watermelon None 43 NaN

Simple, straightforward, to the point, and efficient. Let’s look at another example:

df = {
    "State": ["CA", "CA", "FL", "FL"],
    "ItemN": [1, 2, 3, 4],
    "x1": [6, 7, 3, 9],
    "x2": [4, 3, 2, 4],
    "x3": [3, 1, 1, 2],
    "y1": [7, 15, 5, 16],
    "y2": [5, 10, 3, 14],
    "y3": [3, 5, 2, 12],
    "z1": [11, 4, 13, 14],
    "z2": [5, 2, 7, 5],
    "z3": [1, 1, 2, 4],
}

df = pd.DataFrame(df)

df
State ItemN x1 x2 x3 y1 y2 y3 z1 z2 z3
0 CA 1 6 4 3 7 5 3 11 5 1
1 CA 2 7 3 1 15 10 5 4 2 1
2 FL 3 3 2 1 5 3 2 13 7 2
3 FL 4 9 4 2 16 14 12 14 5 4

The goal is to group the x columns into one column, the y columns into another, and the z columns into a third, with the associated values grouped into individual columns as well. We are grouping similar columns and values - that’s the pattern - so again we pass a list of regular expressions to names_pattern, a list of names to names_to, and a list of names to values_to:

df.pivot_longer(
    index=["State", "ItemN"],
    names_to=["x", "y", "z"],
    values_to=["xvalue", "yvalue", "zvalue"],
    names_pattern=["x", "y", "z"],
)
State ItemN x y z xvalue yvalue zvalue
0 CA 1 x1 y1 z1 6 7 11
1 CA 2 x1 y1 z1 7 15 4
2 FL 3 x1 y1 z1 3 5 13
3 FL 4 x1 y1 z1 9 16 14
4 CA 1 x2 y2 z2 4 5 5
5 CA 2 x2 y2 z2 3 10 2
6 FL 3 x2 y2 z2 2 3 7
7 FL 4 x2 y2 z2 4 14 5
8 CA 1 x3 y3 z3 3 3 1
9 CA 2 x3 y3 z3 1 5 1
10 FL 3 x3 y3 z3 1 2 2
11 FL 4 x3 y3 z3 2 12 4

Performance#

pivot_longer is not just simple, convenient, and straightforward; it is efficient as well:

%timeit religion.pivot_longer(index = 'religion', names_to = 'income', values_to = 'count')
488 μs ± 18 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit religion.melt(id_vars = 'religion', var_name = 'income', value_name = 'count')
573 μs ± 3.03 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Sample data, using 1,000 columns (you can increase the number of columns to 1e6 or 1e8, depending on the power of your CPU):
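If you do not have the CSV locally, a comparable frame can be simulated - a rough sketch (the random letters and dimensions are arbitrary stand-ins, not the exact file read below):

rng = np.random.default_rng(9)
letters = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
sparse_wide = pd.DataFrame(
    rng.choice(letters, size=(5_000, 1_000)),
    columns=[f"x{i}" for i in range(1, 1_001)],
)
sparse_wide.insert(0, "grp", rng.integers(1, 1_001, size=len(sparse_wide)))
sparse_wide.insert(0, "id", range(1, len(sparse_wide) + 1))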

sparse_wide = pd.read_csv("Data_files/sparse-wide.csv")
sparse_wide.head()
id grp x1 x2 x3 x4 x5 x6 x7 x8 ... x991 x992 x993 x994 x995 x996 x997 x998 x999 x1000
0 1 1000 A I A M O E F D ... C L G I R X O W L L
1 2 528 I X U Q B X I J ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 3 151 J Q H F B X B F ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 4 343 P A K K N R E E ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 5 460 L D G X I O C W ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 1002 columns

%timeit sparse_wide.melt(id_vars = ['id', 'grp'])
50 ms ± 313 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit sparse_wide.pivot_longer(index = ['id', 'grp'])
15.4 ms ± 65 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Let’s test on the movies dataset:

wl = pd.wide_to_long(
    movies,
    stubnames=["director", "director_fb_likes", "actor", "actor_fb_likes"],
    i=["title", "rating", "year", "duration"],
    j="num",
    sep="_",
)
wl
director director_fb_likes actor actor_fb_likes
title rating year duration num
Avatar PG-13 2009.0 178.0 1 James Cameron 0.0 CCH Pounder 1000.0
2 NaN NaN Joel David Moore 936.0
3 NaN NaN Wes Studi 855.0
Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 1 Gore Verbinski 563.0 Johnny Depp 40000.0
2 NaN NaN Orlando Bloom 5000.0
... ... ... ... ... ... ... ... ...
Shanghai Calling PG-13 2012.0 100.0 2 NaN NaN Daniel Henney 719.0
3 NaN NaN Eliza Coupe 490.0
My Date with Drew PG 2004.0 90.0 1 Jon Gunn 16.0 John August 86.0
2 NaN NaN Brian Herzlinger 23.0
3 NaN NaN Jon Gunn 16.0

14748 rows × 4 columns

A more efficient approach than pd.wide_to_long, using pd.stack:

index = ["title", "rating", "year", "duration"]
vid = movies.set_index(index)
vid.columns = vid.columns.str.rsplit("_", n=1, expand=True)
vid.columns.names = [None, "num"]
vid = vid.stack(level="num", future_stack=True)
vid
director director_fb_likes actor actor_fb_likes
title rating year duration num
Avatar PG-13 2009.0 178.0 1 James Cameron 0.0 CCH Pounder 1000.0
2 NaN NaN Joel David Moore 936.0
3 NaN NaN Wes Studi 855.0
Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 1 Gore Verbinski 563.0 Johnny Depp 40000.0
2 NaN NaN Orlando Bloom 5000.0
... ... ... ... ... ... ... ... ...
Shanghai Calling PG-13 2012.0 100.0 2 NaN NaN Daniel Henney 719.0
3 NaN NaN Eliza Coupe 490.0
My Date with Drew PG 2004.0 90.0 1 Jon Gunn 16.0 John August 86.0
2 NaN NaN Brian Herzlinger 23.0
3 NaN NaN Jon Gunn 16.0

14748 rows × 4 columns

out = movies.pivot_longer(
    index=slice("title", "duration"),
    names_to=".value",
    names_pattern="(.+)_\\d",
    sort_by_appearance=True,
)

out
title rating year duration director director_fb_likes actor actor_fb_likes
0 Avatar PG-13 2009.0 178.0 James Cameron 0.0 CCH Pounder 1000.0
1 Avatar PG-13 2009.0 178.0 NaN NaN Joel David Moore 936.0
2 Avatar PG-13 2009.0 178.0 NaN NaN Wes Studi 855.0
3 Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 Gore Verbinski 563.0 Johnny Depp 40000.0
4 Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 NaN NaN Orlando Bloom 5000.0
... ... ... ... ... ... ... ... ...
14743 Shanghai Calling PG-13 2012.0 100.0 NaN NaN Daniel Henney 719.0
14744 Shanghai Calling PG-13 2012.0 100.0 NaN NaN Eliza Coupe 490.0
14745 My Date with Drew PG 2004.0 90.0 Jon Gunn 16.0 John August 86.0
14746 My Date with Drew PG 2004.0 90.0 NaN NaN Brian Herzlinger 23.0
14747 My Date with Drew PG 2004.0 90.0 NaN NaN Jon Gunn 16.0

14748 rows × 8 columns

wl = wl.droplevel("num").reset_index()
wl.equals(out)
True
vid = (
    vid.droplevel("num")
    .loc[:, ["director", "director_fb_likes", "actor", "actor_fb_likes"]]
    .reset_index()
)

wl.equals(vid)
True
vid.equals(out)
True
%%timeit 
pd.wide_to_long(
    movies, 
    stubnames=['director', 'director_fb_likes', 'actor', 'actor_fb_likes'], 
    i = ['title', 'rating', 'year', 'duration'], 
    j = 'num', 
    sep='_')
51.8 ms ± 293 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
index = ['title', 'rating', 'year', 'duration']
vid = movies.set_index(index)
vid.columns = vid.columns.str.rsplit("_", n = 1, expand = True)
vid.stack(level = 1, future_stack=True)
5.15 ms ± 174 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
(movies
.pivot_longer(
    index = slice('title', 'duration'),
    names_to = ".value",
    names_pattern = "(.+)_\\d",
    sort_by_appearance = True)
)
1.66 ms ± 56.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
movies.pivot_longer(
    index = slice('title', 'duration'),
    names_to = ['director', 'director_fb_likes', 'actor', 'actor_fb_likes'],
    names_pattern = ['director_\\d$', 'director_fb_likes', 'actor_\\d$', 'actor.+'],
    sort_by_appearance = True
)
1.9 ms ± 65.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Your mileage may vary with these speed tests.

Summary#

This blog post shows various ways to reshape data into long form, using pivot_longer, based on some common patterns. Once the pattern is identified, use the appropriate syntax for pivot_longer. It is meant to be simple, straightforward, and efficient. For more control, have a look at pivot_longer_spec.
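A spec is simply a dataframe that maps each original column name to its destination. A rough sketch for the iris example, assuming the pivot_longer_spec API (check the pyjanitor documentation for the exact signature):

spec = pd.DataFrame(
    {
        ".name": ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"],
        ".value": ["Length", "Width", "Length", "Width"],
        "part": ["Sepal", "Sepal", "Petal", "Petal"],
    }
)
janitor.pivot_longer_spec(df=iris, spec=spec)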
