pivot_longer: Reshape Data in Pandas from Wide to Long, Efficiently and with Ease#
Introduction#
You’ve got data in a wide form (lots of columns), and you wish to transform it into a long form (fewer columns, more rows). In pandas, depending on the form of the data, this can be achieved with DataFrame.stack, pd.melt, or pd.wide_to_long, and it sometimes requires a bit of data massaging before arriving at the final, desired form.
This article suggests a simple, efficient, opinionated way to resolve most wide-to-long transformations within a single function: pivot_longer from pyjanitor.
Let’s look at an example, based on real data - this is a subset of the who data from R:
import pandas as pd
import numpy as np
import sys
print(
" pandas version :",
pd.__version__,
"\n",
"numpy version :",
np.__version__,
"\n",
"python version :",
sys.version,
)
pandas version : 2.2.2
numpy version : 2.0.2
python version : 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]
df = pd.DataFrame({"id": [1], "new_sp_m5564": [2], "newrel_f65": [3]})
df
id | new_sp_m5564 | newrel_f65 | |
---|---|---|---|
0 | 1 | 2 | 3 |
Looking at the dataframe above, we have multiple variables crammed into the column names. Using the definitions as described here:
- new_/new prefix indicates these are counts of new cases
- sp/rel/ep describe how the case was diagnosed
- m/f gives the gender
- 5564/65 supplies the age range
Our goal here is to separate the dataframe into individual columns of count, diagnosis, gender, and age. How do we pull this off with the current tools in pandas?
pd.wide_to_long is not applicable here, as there is no unique separator. Our best option would be to use a regular expression to pull out the individual columns.
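Before reshaping, it helps to confirm that the regular expression we are about to use captures the three pieces we care about - a quick check on a couple of the labels with plain re (just for illustration):
import re
regex = r"new_?(.+)_(.)(\d+)"
re.match(regex, "new_sp_m5564").groups()  # ('sp', 'm', '5564')
re.match(regex, "newrel_f65").groups()  # ('rel', 'f', '65')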
With DataFrame.stack:
# split the columns using a regex, before stacking:
DF = df.set_index("id")
regex = r"new_?(.+)_(.)(\d+)"
columns = DF.columns.str.extract(regex)
new_names = ["diagnosis", "gender", "age"]
columns = pd.MultiIndex.from_frame(columns, names=new_names)
DF.columns = columns
DF.stack(new_names, future_stack=True).rename("count").reset_index()
id | diagnosis | gender | age | count | |
---|---|---|---|---|---|
0 | 1 | sp | m | 5564 | 2 |
1 | 1 | rel | f | 65 | 3 |
With pd.melt:
DF = df.melt("id")
out = DF.variable.str.extract(regex).set_axis(new_names, axis=1)
pd.concat([DF.drop(columns="variable"), out], axis=1)
id | value | diagnosis | gender | age | |
---|---|---|---|---|---|
0 | 1 | 2 | sp | m | 5564 |
1 | 1 | 3 | rel | f | 65 |
Note the extra steps we had to take to get to our final form. However, we know the pattern (a regex that matches the columns and the subgroups within the columns that we wish to extract), so why not take advantage of that? This is where pivot_longer from pyjanitor shines:
# pip install pyjanitor
import janitor
df.pivot_longer(index="id", names_to=new_names, names_pattern=regex)
id | diagnosis | gender | age | value | |
---|---|---|---|---|---|
0 | 1 | sp | m | 5564 | 2 |
1 | 1 | rel | f | 65 | 3 |
Straight to the point. If there is a pattern, pivot_longer will figure out what to do and deliver the results. It is a simple, efficient reshaping abstraction.
Basics#
Let’s walk through some of the parameters in pivot_longer - a short sketch putting them together follows the list:
- index: the columns that will not be transformed to long form - their values are repeated as necessary.
- column_names: the columns that will be transformed to long form.
- names_to: the new column name(s) of the reshaped DataFrame.
- values_to: the column(s) that will contain the values collated from the original DataFrame.
- names_sep: extract individual columns with a separator.
- names_pattern: extract individual columns with a regular expression, or a list of regular expressions.
- names_transform: efficiently convert the columns extracted from the column_names from string to another dtype, as long as the dtype is supported by astype.
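Here is a minimal sketch that puts most of these parameters together on a tiny made-up frame (the y2019/y2020 column names are purely illustrative; pandas and janitor are imported as above):
toy = pd.DataFrame({"id": [1, 2], "y2019": [10, 30], "y2020": [20, 40]})
toy.pivot_longer(
    index="id",  # kept as-is; values are repeated as needed
    column_names=["y2019", "y2020"],  # columns flipped to long form
    names_to="year",  # receives what is extracted from the old column names
    values_to="score",  # receives the collected values
    names_pattern=r"y(\d+)",  # keep only the digits from the old names
    names_transform=int,  # convert the extracted years from string to integer
)
The result has three columns - id, year, score - with one row per (id, year) pair.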
Keep in mind that pivot_longer works on patterns - deciphering what pattern to use is the key to its simplicity.
The rest of the article outlines the various patterns that pivot_longer works with.
Pattern 1 - Melt all Columns#
This is basic and is similar to what you get with pd.melt.
Let’s use another realistic dataset - relig_income - the original example is from here:
url = "https://raw.githubusercontent.com/tidyverse/tidyr/main/data-raw/relig_income.csv"
religion = pd.read_csv(url)
religion.head()
religion | <$10k | $10-20k | $20-30k | $30-40k | $40-50k | $50-75k | $75-100k | $100-150k | >150k | Don't know/refused | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Agnostic | 27 | 34 | 60 | 81 | 76 | 137 | 122 | 109 | 84 | 96 |
1 | Atheist | 12 | 27 | 37 | 52 | 35 | 70 | 73 | 59 | 74 | 76 |
2 | Buddhist | 27 | 21 | 30 | 34 | 33 | 58 | 62 | 39 | 53 | 54 |
3 | Catholic | 418 | 617 | 732 | 670 | 638 | 1116 | 949 | 792 | 633 | 1489 |
4 | Don’t know/refused | 15 | 14 | 15 | 11 | 10 | 35 | 21 | 17 | 18 | 116 |
religion.pivot_longer(index="religion", names_to="income", values_to="count")
religion | income | count | |
---|---|---|---|
0 | Agnostic | <$10k | 27 |
1 | Atheist | <$10k | 12 |
2 | Buddhist | <$10k | 27 |
3 | Catholic | <$10k | 418 |
4 | Don’t know/refused | <$10k | 15 |
... | ... | ... | ... |
175 | Orthodox | Don't know/refused | 73 |
176 | Other Christian | Don't know/refused | 18 |
177 | Other Faiths | Don't know/refused | 71 |
178 | Other World Religions | Don't know/refused | 8 |
179 | Unaffiliated | Don't know/refused | 597 |
180 rows × 3 columns
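For reference, the plain pandas equivalent of this simplest pattern is a single pd.melt call - the same call reappears in the timing comparison later in the article:
religion.melt(id_vars="religion", var_name="income", value_name="count")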
If there are a lot of columns, we can dynamically select the index or column_names, using the janitor.select syntax. Let’s apply it to another dataset related to billboard charts - the original example is here:
url = "https://raw.githubusercontent.com/tidyverse/tidyr/main/data-raw/billboard.csv"
billboard = pd.read_csv(url)
billboard.head()
year | artist | track | time | date.entered | wk1 | wk2 | wk3 | wk4 | wk5 | ... | wk67 | wk68 | wk69 | wk70 | wk71 | wk72 | wk73 | wk74 | wk75 | wk76 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2000 | 2 Pac | Baby Don't Cry (Keep... | 4:22 | 2000-02-26 | 87 | 82.0 | 72.0 | 77.0 | 87.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2000 | 2Ge+her | The Hardest Part Of ... | 3:15 | 2000-09-02 | 91 | 87.0 | 92.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 2000 | 3 Doors Down | Kryptonite | 3:53 | 2000-04-08 | 81 | 70.0 | 68.0 | 67.0 | 66.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 2000 | 3 Doors Down | Loser | 4:24 | 2000-10-21 | 76 | 76.0 | 72.0 | 69.0 | 67.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 2000 | 504 Boyz | Wobble Wobble | 3:35 | 2000-04-15 | 57 | 34.0 | 25.0 | 17.0 | 17.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 81 columns
The columns that will be flipped to long form are the columns that start with wk - these will be passed to the column_names parameter. That is a lot of columns. Let’s see how pivot_longer simplifies this:
billboard.pivot_longer(column_names="wk*", names_to="week")
year | artist | track | time | date.entered | week | value | |
---|---|---|---|---|---|---|---|
0 | 2000 | 2 Pac | Baby Don't Cry (Keep... | 4:22 | 2000-02-26 | wk1 | 87.0 |
1 | 2000 | 2Ge+her | The Hardest Part Of ... | 3:15 | 2000-09-02 | wk1 | 91.0 |
2 | 2000 | 3 Doors Down | Kryptonite | 3:53 | 2000-04-08 | wk1 | 81.0 |
3 | 2000 | 3 Doors Down | Loser | 4:24 | 2000-10-21 | wk1 | 76.0 |
4 | 2000 | 504 Boyz | Wobble Wobble | 3:35 | 2000-04-15 | wk1 | 57.0 |
... | ... | ... | ... | ... | ... | ... | ... |
24087 | 2000 | Yankee Grey | Another Nine Minutes | 3:10 | 2000-04-29 | wk76 | NaN |
24088 | 2000 | Yearwood, Trisha | Real Live Woman | 3:55 | 2000-04-01 | wk76 | NaN |
24089 | 2000 | Ying Yang Twins | Whistle While You Tw... | 4:19 | 2000-03-18 | wk76 | NaN |
24090 | 2000 | Zombie Nation | Kernkraft 400 | 3:30 | 2000-09-02 | wk76 | NaN |
24091 | 2000 | matchbox twenty | Bent | 4:12 | 2000-04-29 | wk76 | NaN |
24092 rows × 7 columns
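As an aside, if you would rather have the week as a plain number, the same call can strip the wk prefix and convert the dtype in one pass - a sketch using the names_pattern and names_transform parameters described earlier:
billboard.pivot_longer(
    column_names="wk*",
    names_to="week",
    names_pattern=r"wk(\d+)",  # keep only the digits from the column names
    names_transform=int,  # convert the extracted week strings to integers
)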
If we want to use the index parameter instead, that’s easy as well:
billboard.pivot_longer(index=slice("year", "date.entered"), names_to="week")
year | artist | track | time | date.entered | week | value | |
---|---|---|---|---|---|---|---|
0 | 2000 | 2 Pac | Baby Don't Cry (Keep... | 4:22 | 2000-02-26 | wk1 | 87.0 |
1 | 2000 | 2Ge+her | The Hardest Part Of ... | 3:15 | 2000-09-02 | wk1 | 91.0 |
2 | 2000 | 3 Doors Down | Kryptonite | 3:53 | 2000-04-08 | wk1 | 81.0 |
3 | 2000 | 3 Doors Down | Loser | 4:24 | 2000-10-21 | wk1 | 76.0 |
4 | 2000 | 504 Boyz | Wobble Wobble | 3:35 | 2000-04-15 | wk1 | 57.0 |
... | ... | ... | ... | ... | ... | ... | ... |
24087 | 2000 | Yankee Grey | Another Nine Minutes | 3:10 | 2000-04-29 | wk76 | NaN |
24088 | 2000 | Yearwood, Trisha | Real Live Woman | 3:55 | 2000-04-01 | wk76 | NaN |
24089 | 2000 | Ying Yang Twins | Whistle While You Tw... | 4:19 | 2000-03-18 | wk76 | NaN |
24090 | 2000 | Zombie Nation | Kernkraft 400 | 3:30 | 2000-09-02 | wk76 | NaN |
24091 | 2000 | matchbox twenty | Bent | 4:12 | 2000-04-29 | wk76 | NaN |
24092 rows × 7 columns
Pattern 2A - Melt into multiple columns - new column names#
There are scenarios where multiple variables are crammed into the column names - we had a look at that already with the who example.
Let’s look at an example, adapted from the data.table vignette:
iris = pd.DataFrame(
{
"Sepal.Length": [5.1, 5.9],
"Sepal.Width": [3.5, 3.0],
"Petal.Length": [1.4, 5.1],
"Petal.Width": [0.2, 1.8],
"Species": ["setosa", "virginica"],
}
)
iris
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
For the data above, we wish to split the column names into part and dimension columns - the Sepal/Petal prefixes go into the part column, while Length/Width go into the dimension column.
In this case, we can use either the names_sep or the names_pattern parameter - for this particular scenario, the names_sep option is a better fit:
iris.pivot_longer(index="Species", names_to=["part", "dimension"], names_sep=r".")
Species | part | dimension | value | |
---|---|---|---|---|
0 | setosa | Sepal | Length | 5.1 |
1 | virginica | Sepal | Length | 5.9 |
2 | setosa | Sepal | Width | 3.5 |
3 | virginica | Sepal | Width | 3.0 |
4 | setosa | Petal | Length | 1.4 |
5 | virginica | Petal | Length | 5.1 |
6 | setosa | Petal | Width | 0.2 |
7 | virginica | Petal | Width | 1.8 |
Using names_pattern:
iris.pivot_longer(
index="Species", names_to=["part", "dimension"], names_pattern=r"(.+)\.(.+)"
)
Species | part | dimension | value | |
---|---|---|---|---|
0 | setosa | Sepal | Length | 5.1 |
1 | virginica | Sepal | Length | 5.9 |
2 | setosa | Sepal | Width | 3.5 |
3 | virginica | Sepal | Width | 3.0 |
4 | setosa | Petal | Length | 1.4 |
5 | virginica | Petal | Length | 5.1 |
6 | setosa | Petal | Width | 0.2 |
7 | virginica | Petal | Width | 1.8 |
Pattern 2B - Melt into multiple columns - Use variables within the column names as new names#
Instead of new column names, what if we wanted to use existing variables within the column names?
Let’s look at an example adapted from the data.table vignette:
df = pd.DataFrame(
{
"family": [1, 2, 3, 4, 5],
"dob_child1": [
"1998-11-26",
"1996-06-22",
"2002-07-11",
"2004-10-10",
"2000-12-05",
],
"dob_child2": [
"2000-01-29",
np.nan,
"2004-04-05",
"2009-08-27",
"2005-02-28",
],
"gender_child1": [1, 2, 2, 1, 2],
"gender_child2": [2.0, np.nan, 2.0, 1.0, 1.0],
}
)
df
family | dob_child1 | dob_child2 | gender_child1 | gender_child2 | |
---|---|---|---|---|---|
0 | 1 | 1998-11-26 | 2000-01-29 | 1 | 2.0 |
1 | 2 | 1996-06-22 | NaN | 2 | NaN |
2 | 3 | 2002-07-11 | 2004-04-05 | 2 | 2.0 |
3 | 4 | 2004-10-10 | 2009-08-27 | 1 | 1.0 |
4 | 5 | 2000-12-05 | 2005-02-28 | 2 | 1.0 |
For the data above, we wish to keep dob and gender as column names, while moving child1 and child2 to a new column. How does pivot_longer reshape this?
df.pivot_longer(index="family", names_to=(".value", "child"), names_sep="_")
family | child | dob | gender | |
---|---|---|---|---|
0 | 1 | child1 | 1998-11-26 | 1.0 |
1 | 2 | child1 | 1996-06-22 | 2.0 |
2 | 3 | child1 | 2002-07-11 | 2.0 |
3 | 4 | child1 | 2004-10-10 | 1.0 |
4 | 5 | child1 | 2000-12-05 | 2.0 |
5 | 1 | child2 | 2000-01-29 | 2.0 |
6 | 2 | child2 | NaN | NaN |
7 | 3 | child2 | 2004-04-05 | 2.0 |
8 | 4 | child2 | 2009-08-27 | 1.0 |
9 | 5 | child2 | 2005-02-28 | 1.0 |
For patterns such as this, where a part of the column name is to be retained as a header in the new dataframe, the .value placeholder comes in handy. The .value placeholder tells pivot_longer: hey, this part of the column name is to be retained as a header. pivot_longer gets the message and acts accordingly.
Take the column label dob_child1 - the first part is dob and the second part is child1. This pairs with (.value, child), meaning dob stays as a header, while child1 gets lumped into the child column. The names_sep value - _ - determines how the column name is split into the new columns.
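To make the pairing concrete, here is a tiny sketch of the split that happens conceptually for a single label (plain Python, nothing pyjanitor-specific):
label = "dob_child1"
parts = label.split("_")  # ['dob', 'child1']
spec = (".value", "child")  # what we passed to names_to
dict(zip(spec, parts))  # {'.value': 'dob', 'child': 'child1'}
# '.value' -> 'dob' stays as a column header; 'child' -> 'child1' becomes a row value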
With names_pattern we can have more control over the output, selecting just the numbers in the child column:
df.pivot_longer(
index="family", names_to=(".value", "child"), names_pattern=r"(.+)_child(\d)"
)
family | child | dob | gender | |
---|---|---|---|---|
0 | 1 | 1 | 1998-11-26 | 1.0 |
1 | 2 | 1 | 1996-06-22 | 2.0 |
2 | 3 | 1 | 2002-07-11 | 2.0 |
3 | 4 | 1 | 2004-10-10 | 1.0 |
4 | 5 | 1 | 2000-12-05 | 2.0 |
5 | 1 | 2 | 2000-01-29 | 2.0 |
6 | 2 | 2 | NaN | NaN |
7 | 3 | 2 | 2004-04-05 | 2.0 |
8 | 4 | 2 | 2009-08-27 | 1.0 |
9 | 5 | 2 | 2005-02-28 | 1.0 |
We can also efficiently convert the data type of child to integer, using the names_transform option:
df.pivot_longer(
index="family",
names_to=(".value", "child"),
names_pattern=r"(.+)_child(\d)",
names_transform=np.int8,
)
family | child | dob | gender | |
---|---|---|---|---|
0 | 1 | 1 | 1998-11-26 | 1.0 |
1 | 2 | 1 | 1996-06-22 | 2.0 |
2 | 3 | 1 | 2002-07-11 | 2.0 |
3 | 4 | 1 | 2004-10-10 | 1.0 |
4 | 5 | 1 | 2000-12-05 | 2.0 |
5 | 1 | 2 | 2000-01-29 | 2.0 |
6 | 2 | 2 | NaN | NaN |
7 | 3 | 2 | 2004-04-05 | 2.0 |
8 | 4 | 2 | 2009-08-27 | 1.0 |
9 | 5 | 2 | 2005-02-28 | 1.0 |
The flexibility of .value means you can place it anywhere - you are not limited by position - as long as the right argument is passed to the names_sep or names_pattern parameter, your output should be fine.
Let’s apply this to the iris dataset to test the flexibility of .value:
iris.pivot_longer(index="Species", names_to=("part", ".value"), names_sep=".")
Species | part | Length | Width | |
---|---|---|---|---|
0 | setosa | Sepal | 5.1 | 3.5 |
1 | virginica | Sepal | 5.9 | 3.0 |
2 | setosa | Petal | 1.4 | 0.2 |
3 | virginica | Petal | 5.1 | 1.8 |
You can use multiple .value placeholders if required - let’s apply this to a question from Stack Overflow:
df = pd.DataFrame(
[
{
"ID": 1,
"DateRange1Start": "1/1/90",
"DateRange1End": "3/1/90",
"Value1": 4.4,
"DateRange2Start": "4/5/91",
"DateRange2End": "6/7/91",
"Value2": 6.2,
"DateRange3Start": "5/5/95",
"DateRange3End": "6/6/96",
"Value3": 3.3,
}
]
)
df
ID | DateRange1Start | DateRange1End | Value1 | DateRange2Start | DateRange2End | Value2 | DateRange3Start | DateRange3End | Value3 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1/1/90 | 3/1/90 | 4.4 | 4/5/91 | 6/7/91 | 6.2 | 5/5/95 | 6/6/96 | 3.3 |
df.pivot_longer(index="ID", names_to=(".value", ".value"), names_pattern=r"(.+)\d(.*)")
ID | DateRangeStart | DateRangeEnd | Value | |
---|---|---|---|---|
0 | 1 | 1/1/90 | 3/1/90 | 4.4 |
1 | 1 | 4/5/91 | 6/7/91 | 6.2 |
2 | 1 | 5/5/95 | 6/6/96 | 3.3 |
Again, as long as the pattern can be identified, pivot_longer will take care of the rest.
One more example, adapted from Stack Overflow:
df = pd.DataFrame(
{
"Sony | TV | Model | value": {0: "A222", 1: "A234", 2: "A4345"},
"Sony | TV | Quantity | value": {0: 5, 1: 5, 2: 4},
"Sony | TV | Max-quant | value": {0: 10, 1: 9, 2: 9},
"Panasonic | TV | Model | value": {0: "T232", 1: "S3424", 2: "X3421"},
"Panasonic | TV | Quantity | value": {0: 1, 1: 5, 2: 1},
"Panasonic | TV | Max-quant | value": {0: 10, 1: 12, 2: 11},
"Sanyo | Radio | Model | value": {0: "S111", 1: "S1s1", 2: "S1s2"},
"Sanyo | Radio | Quantity | value": {0: 4, 1: 2, 2: 4},
"Sanyo | Radio | Max-quant | value": {0: 9, 1: 9, 2: 10},
}
)
df
Sony | TV | Model | value | Sony | TV | Quantity | value | Sony | TV | Max-quant | value | Panasonic | TV | Model | value | Panasonic | TV | Quantity | value | Panasonic | TV | Max-quant | value | Sanyo | Radio | Model | value | Sanyo | Radio | Quantity | value | Sanyo | Radio | Max-quant | value | |
---|---|---|---|---|---|---|---|---|---|
0 | A222 | 5 | 10 | T232 | 1 | 10 | S111 | 4 | 9 |
1 | A234 | 5 | 9 | S3424 | 5 | 12 | S1s1 | 2 | 9 |
2 | A4345 | 4 | 9 | X3421 | 1 | 11 | S1s2 | 4 | 10 |
df.pivot_longer(
names_to=("Manufacturer", "Device", ".value"),
names_pattern=r"(.+)\s\|(.+)\s\|(.+)\s\|.*",
)
Manufacturer | Device | Model | Quantity | Max-quant | |
---|---|---|---|---|---|
0 | Sony | TV | A222 | 5 | 10 |
1 | Sony | TV | A234 | 5 | 9 |
2 | Sony | TV | A4345 | 4 | 9 |
3 | Panasonic | TV | T232 | 1 | 10 |
4 | Panasonic | TV | S3424 | 5 | 12 |
5 | Panasonic | TV | X3421 | 1 | 11 |
6 | Sanyo | Radio | S111 | 4 | 9 |
7 | Sanyo | Radio | S1s1 | 2 | 9 |
8 | Sanyo | Radio | S1s2 | 4 | 10 |
Pattern 3A - Group similar columns together#
This is best explained with an example:
df_mean = pd.DataFrame(
{
"x_1_mean": [10],
"x_2_mean": [20],
"y_1_mean": [30],
"y_2_mean": [40],
"unit": [50],
}
)
df_mean
x_1_mean | x_2_mean | y_1_mean | y_2_mean | unit | |
---|---|---|---|---|---|
0 | 10 | 20 | 30 | 40 | 50 |
For the dataframe above, we wish to lump similar columns together - specifically, we want to lump the x* columns into one; the same goes for the y* columns.
For this pattern, we pass a list of names to names_to, while passing a list of regular expressions to names_pattern:
df_mean.pivot_longer(
index="unit", names_to=["x_mean", "y_mean"], names_pattern=["x", "y"]
)
unit | x_mean | y_mean | |
---|---|---|---|
0 | 50 | 10 | 30 |
1 | 50 | 20 | 40 |
Using a list of regular expressions for names_pattern implies that you want to group similar items under the specified columns in names_to.
Basically, it is a pairing: the first x column, x_1_mean, pairs with the first y column it encounters, which is y_1_mean; the next x column, x_2_mean, pairs with y_2_mean, since that is the next available y column. The pairing is on a first come, first served basis.
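For intuition, the same pairing can be reproduced manually with plain pandas - melt the x columns and the y columns separately and lay them side by side (a sketch, using the df_mean frame defined above):
x_part = df_mean.melt(id_vars="unit", value_vars=["x_1_mean", "x_2_mean"], value_name="x_mean")
y_part = df_mean.melt(value_vars=["y_1_mean", "y_2_mean"], value_name="y_mean")
# the first x pairs with the first y, the second x with the second y
pd.concat([x_part.drop(columns="variable"), y_part[["y_mean"]]], axis=1)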
Let’s look at another example:
url = "https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/movie_altered.csv"
movies = pd.read_csv(url)
movies.head()
title | rating | year | duration | director_1 | director_fb_likes_1 | actor_1 | actor_2 | actor_3 | actor_fb_likes_1 | actor_fb_likes_2 | actor_fb_likes_3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Avatar | PG-13 | 2009.0 | 178.0 | James Cameron | 0.0 | CCH Pounder | Joel David Moore | Wes Studi | 1000.0 | 936.0 | 855.0 |
1 | Pirates of the Caribbean: At World's End | PG-13 | 2007.0 | 169.0 | Gore Verbinski | 563.0 | Johnny Depp | Orlando Bloom | Jack Davenport | 40000.0 | 5000.0 | 1000.0 |
2 | Spectre | PG-13 | 2015.0 | 148.0 | Sam Mendes | 0.0 | Christoph Waltz | Rory Kinnear | Stephanie Sigman | 11000.0 | 393.0 | 161.0 |
3 | The Dark Knight Rises | PG-13 | 2012.0 | 164.0 | Christopher Nolan | 22000.0 | Tom Hardy | Christian Bale | Joseph Gordon-Levitt | 27000.0 | 23000.0 | 23000.0 |
4 | Star Wars: Episode VII - The Force Awakens | NaN | NaN | NaN | Doug Walker | 131.0 | Doug Walker | Rob Walker | NaN | 131.0 | 12.0 | NaN |
The goal for the dataframe above is to group the actors into one column, and the Facebook likes into a separate column. The pattern here is obvious - we are grouping similar columns, so we pass a list of regular expressions to names_pattern, with an equal number of names to names_to:
movies.pivot_longer(
index=slice("title", "duration"),
names_to=["director", "director_fb_likes", "actor", "actor_fb_likes"],
names_pattern=["director_\\d$", "director_fb_likes", "actor_\\d$", "actor.+"],
)
title | rating | year | duration | director | director_fb_likes | actor | actor_fb_likes | |
---|---|---|---|---|---|---|---|---|
0 | Avatar | PG-13 | 2009.0 | 178.0 | James Cameron | 0.0 | CCH Pounder | 1000.0 |
1 | Pirates of the Caribbean: At World's End | PG-13 | 2007.0 | 169.0 | Gore Verbinski | 563.0 | Johnny Depp | 40000.0 |
2 | Spectre | PG-13 | 2015.0 | 148.0 | Sam Mendes | 0.0 | Christoph Waltz | 11000.0 |
3 | The Dark Knight Rises | PG-13 | 2012.0 | 164.0 | Christopher Nolan | 22000.0 | Tom Hardy | 27000.0 |
4 | Star Wars: Episode VII - The Force Awakens | NaN | NaN | NaN | Doug Walker | 131.0 | Doug Walker | 131.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
14743 | Signed Sealed Delivered | NaN | 2013.0 | 87.0 | NaN | NaN | Crystal Lowe | 319.0 |
14744 | The Following | TV-14 | NaN | 43.0 | NaN | NaN | Sam Underwood | 319.0 |
14745 | A Plague So Pleasant | NaN | 2013.0 | 76.0 | NaN | NaN | David Chandler | 0.0 |
14746 | Shanghai Calling | PG-13 | 2012.0 | 100.0 | NaN | NaN | Eliza Coupe | 490.0 |
14747 | My Date with Drew | PG | 2004.0 | 90.0 | NaN | NaN | Jon Gunn | 16.0 |
14748 rows × 8 columns
Based on the same first come, first served approach, director_1 is paired with director_fb_likes_1, actor_1 is paired with actor_fb_likes_1, and the pairing continues for the rest.
This could also have been solved using the .value approach:
(
movies.pivot_longer(
index=slice("title", "duration"), names_to=".value", names_pattern="(.+)_\\d"
)
)
title | rating | year | duration | director | director_fb_likes | actor | actor_fb_likes | |
---|---|---|---|---|---|---|---|---|
0 | Avatar | PG-13 | 2009.0 | 178.0 | James Cameron | 0.0 | CCH Pounder | 1000.0 |
1 | Pirates of the Caribbean: At World's End | PG-13 | 2007.0 | 169.0 | Gore Verbinski | 563.0 | Johnny Depp | 40000.0 |
2 | Spectre | PG-13 | 2015.0 | 148.0 | Sam Mendes | 0.0 | Christoph Waltz | 11000.0 |
3 | The Dark Knight Rises | PG-13 | 2012.0 | 164.0 | Christopher Nolan | 22000.0 | Tom Hardy | 27000.0 |
4 | Star Wars: Episode VII - The Force Awakens | NaN | NaN | NaN | Doug Walker | 131.0 | Doug Walker | 131.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
14743 | Signed Sealed Delivered | NaN | 2013.0 | 87.0 | NaN | NaN | Crystal Lowe | 319.0 |
14744 | The Following | TV-14 | NaN | 43.0 | NaN | NaN | Sam Underwood | 319.0 |
14745 | A Plague So Pleasant | NaN | 2013.0 | 76.0 | NaN | NaN | David Chandler | 0.0 |
14746 | Shanghai Calling | PG-13 | 2012.0 | 100.0 | NaN | NaN | Eliza Coupe | 490.0 |
14747 | My Date with Drew | PG | 2004.0 | 90.0 | NaN | NaN | Jon Gunn | 16.0 |
14748 rows × 8 columns
Again, it is all about identifying the pattern, and using whichever option is more suitable/convenient. One more example:
from pandas import NA
treatments = dict(
id=range(1, 6),
A=("A", NA, "A", NA, NA),
A_date=(1, NA, 2, NA, NA),
B=(NA, "B", "B", NA, NA),
B_date=(NA, 3, 2, NA, NA),
other=(NA, NA, NA, "C", "D"),
other_date=(NA, NA, NA, 1, 5),
)
treatments = pd.DataFrame(treatments)
treatments
id | A | A_date | B | B_date | other | other_date | |
---|---|---|---|---|---|---|---|
0 | 1 | A | 1 | <NA> | <NA> | <NA> | <NA> |
1 | 2 | <NA> | <NA> | B | 3 | <NA> | <NA> |
2 | 3 | A | 2 | B | 2 | <NA> | <NA> |
3 | 4 | <NA> | <NA> | <NA> | <NA> | C | 1 |
4 | 5 | <NA> | <NA> | <NA> | <NA> | D | 5 |
The goal is to transform the dataframe into three columns: an id column, a treatment column, and a date column. Since we are grouping similar columns, we pass a list of regular expressions to names_pattern, adding dropna=True to drop the rows with no associated values:
treatments.pivot_longer(
index="id",
names_to=["treatment", "date"],
names_pattern=["A$|B$|other$", ".+date$"],
dropna=True,
)
id | treatment | date | |
---|---|---|---|
0 | 1 | A | 1 |
1 | 3 | A | 2 |
2 | 2 | B | 3 |
3 | 3 | B | 2 |
4 | 4 | C | 1 |
5 | 5 | D | 5 |
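To see what dropna=True is doing, here is the same call without it (a sketch) - the treatment/date pairs that have no recorded value are kept as <NA> rows:
treatments.pivot_longer(
    index="id",
    names_to=["treatment", "date"],
    names_pattern=["A$|B$|other$", ".+date$"],
)
# e.g. id=1 has no B or other treatment, so those combinations show up as <NA> rows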
Another way is to access the dates before the treatments, by reversing the order of the regular expressions:
treatments.pivot_longer(
index="id",
names_to=["date", "treatment"],
names_pattern=[".+date$", ".+"],
dropna=True,
)
id | treatment | date | |
---|---|---|---|
0 | 1 | A | 1 |
1 | 3 | A | 2 |
2 | 2 | B | 3 |
3 | 3 | B | 2 |
4 | 4 | C | 1 |
5 | 5 | D | 5 |
Pattern 3B - Group similar columns and values together#
This is best illustrated with an example:
df = pd.DataFrame(
{
"City": ["Houston", "Austin", "Hoover"],
"State": ["Texas", "Texas", "Alabama"],
"Name": ["Aria", "Penelope", "Niko"],
"Mango": [4, 10, 90],
"Orange": [10, 8, 14],
"Watermelon": [40, 99, 43],
"Gin": [16, 200, 34],
"Vodka": [20, 33, 18],
},
columns=["City", "State", "Name", "Mango", "Orange", "Watermelon", "Gin", "Vodka"],
)
df
City | State | Name | Mango | Orange | Watermelon | Gin | Vodka | |
---|---|---|---|---|---|---|---|---|
0 | Houston | Texas | Aria | 4 | 10 | 40 | 16 | 20 |
1 | Austin | Texas | Penelope | 10 | 8 | 99 | 200 | 33 |
2 | Hoover | Alabama | Niko | 90 | 14 | 43 | 34 | 18 |
The goal in this case is to group similar columns and values together - group the drinks into one column, and the fruits into another column. When that is done, group the values for the drinks into one column, and the values for the fruits into another column. That’s the pattern, and for this pattern we continue with the idea from earlier, by passing a list of regular expressions to names_pattern, an equal number of names to names_to, and an equal number of names to values_to:
df.pivot_longer(
index=["City", "State"],
column_names=slice("Mango", "Vodka"),
names_to=("Fruit", "Drink"),
values_to=("Pounds", "Ounces"),
names_pattern=[r"M|O|W", r"G|V"],
)
City | State | Fruit | Drink | Pounds | Ounces | |
---|---|---|---|---|---|---|
0 | Houston | Texas | Mango | Gin | 4 | 16.0 |
1 | Austin | Texas | Mango | Gin | 10 | 200.0 |
2 | Hoover | Alabama | Mango | Gin | 90 | 34.0 |
3 | Houston | Texas | Orange | Vodka | 10 | 20.0 |
4 | Austin | Texas | Orange | Vodka | 8 | 33.0 |
5 | Hoover | Alabama | Orange | Vodka | 14 | 18.0 |
6 | Houston | Texas | Watermelon | None | 40 | NaN |
7 | Austin | Texas | Watermelon | None | 99 | NaN |
8 | Hoover | Alabama | Watermelon | None | 43 | NaN |
Simple, straightforward, to the point, and efficient. Note that there are three fruits but only two drinks, so the unpaired Watermelon rows end up with missing values in the Drink and Ounces columns. Let’s look at another example:
df = {
"State": ["CA", "CA", "FL", "FL"],
"ItemN": [1, 2, 3, 4],
"x1": [6, 7, 3, 9],
"x2": [4, 3, 2, 4],
"x3": [3, 1, 1, 2],
"y1": [7, 15, 5, 16],
"y2": [5, 10, 3, 14],
"y3": [3, 5, 2, 12],
"z1": [11, 4, 13, 14],
"z2": [5, 2, 7, 5],
"z3": [1, 1, 2, 4],
}
df = pd.DataFrame(df)
df
State | ItemN | x1 | x2 | x3 | y1 | y2 | y3 | z1 | z2 | z3 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | CA | 1 | 6 | 4 | 3 | 7 | 5 | 3 | 11 | 5 | 1 |
1 | CA | 2 | 7 | 3 | 1 | 15 | 10 | 5 | 4 | 2 | 1 |
2 | FL | 3 | 3 | 2 | 1 | 5 | 3 | 2 | 13 | 7 | 2 |
3 | FL | 4 | 9 | 4 | 2 | 16 | 14 | 12 | 14 | 5 | 4 |
The goal is to group all the x columns into one column, the y columns into another, the z columns into a third, and group the associated values into individual columns as well. We are grouping similar columns and values - that’s the pattern - so again we pass a list of regular expressions to names_pattern, a list of names to names_to, and a list of names to values_to:
df.pivot_longer(
index=["State", "ItemN"],
names_to=["x", "y", "z"],
values_to=["xvalue", "yvalue", "zvalue"],
names_pattern=["x", "y", "z"],
)
State | ItemN | x | y | z | xvalue | yvalue | zvalue | |
---|---|---|---|---|---|---|---|---|
0 | CA | 1 | x1 | y1 | z1 | 6 | 7 | 11 |
1 | CA | 2 | x1 | y1 | z1 | 7 | 15 | 4 |
2 | FL | 3 | x1 | y1 | z1 | 3 | 5 | 13 |
3 | FL | 4 | x1 | y1 | z1 | 9 | 16 | 14 |
4 | CA | 1 | x2 | y2 | z2 | 4 | 5 | 5 |
5 | CA | 2 | x2 | y2 | z2 | 3 | 10 | 2 |
6 | FL | 3 | x2 | y2 | z2 | 2 | 3 | 7 |
7 | FL | 4 | x2 | y2 | z2 | 4 | 14 | 5 |
8 | CA | 1 | x3 | y3 | z3 | 3 | 3 | 1 |
9 | CA | 2 | x3 | y3 | z3 | 1 | 5 | 1 |
10 | FL | 3 | x3 | y3 | z3 | 1 | 2 | 2 |
11 | FL | 4 | x3 | y3 | z3 | 2 | 12 | 4 |
Performance#
pivot_longer is not just simple, convenient, and straightforward; it is efficient as well:
%timeit religion.pivot_longer(index = 'religion', names_to = 'income', values_to = 'count')
488 μs ± 18 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit religion.melt(id_vars = 'religion', var_name = 'income', value_name = 'count')
573 μs ± 3.03 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Sample data, using 1000 columns (you can increase the number of columns further, depending on the power of your machine):
sparse_wide = pd.read_csv("Data_files/sparse-wide.csv")
sparse_wide.head()
id | grp | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | ... | x991 | x992 | x993 | x994 | x995 | x996 | x997 | x998 | x999 | x1000 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1000 | A | I | A | M | O | E | F | D | ... | C | L | G | I | R | X | O | W | L | L |
1 | 2 | 528 | I | X | U | Q | B | X | I | J | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 3 | 151 | J | Q | H | F | B | X | B | F | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 4 | 343 | P | A | K | K | N | R | E | E | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 5 | 460 | L | D | G | X | I | O | C | W | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 1002 columns
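If you do not have the CSV on hand, a roughly comparable wide frame can be generated on the fly to run the benchmark yourself (a sketch with made-up values, not the exact file used above):
rng = np.random.default_rng(0)
n_rows, n_cols = 5_000, 1_000
sample = pd.DataFrame(
    rng.choice(list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"), size=(n_rows, n_cols)),
    columns=[f"x{i}" for i in range(1, n_cols + 1)],
)
sample.insert(0, "grp", rng.integers(1, 1001, size=n_rows))
sample.insert(0, "id", range(1, n_rows + 1))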
%timeit sparse_wide.melt(id_vars = ['id', 'grp'])
50 ms ± 313 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit sparse_wide.pivot_longer(index = ['id', 'grp'])
15.4 ms ± 65 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Let’s test on the movies dataset:
wl = pd.wide_to_long(
movies,
stubnames=["director", "director_fb_likes", "actor", "actor_fb_likes"],
i=["title", "rating", "year", "duration"],
j="num",
sep="_",
)
wl
director | director_fb_likes | actor | actor_fb_likes | |||||
---|---|---|---|---|---|---|---|---|
title | rating | year | duration | num | ||||
Avatar | PG-13 | 2009.0 | 178.0 | 1 | James Cameron | 0.0 | CCH Pounder | 1000.0 |
2 | NaN | NaN | Joel David Moore | 936.0 | ||||
3 | NaN | NaN | Wes Studi | 855.0 | ||||
Pirates of the Caribbean: At World's End | PG-13 | 2007.0 | 169.0 | 1 | Gore Verbinski | 563.0 | Johnny Depp | 40000.0 |
2 | NaN | NaN | Orlando Bloom | 5000.0 | ||||
... | ... | ... | ... | ... | ... | ... | ... | ... |
Shanghai Calling | PG-13 | 2012.0 | 100.0 | 2 | NaN | NaN | Daniel Henney | 719.0 |
3 | NaN | NaN | Eliza Coupe | 490.0 | ||||
My Date with Drew | PG | 2004.0 | 90.0 | 1 | Jon Gunn | 16.0 | John August | 86.0 |
2 | NaN | NaN | Brian Herzlinger | 23.0 | ||||
3 | NaN | NaN | Jon Gunn | 16.0 |
14748 rows × 4 columns
A more efficient approach than pd.wide_to_long, using DataFrame.stack:
index = ["title", "rating", "year", "duration"]
vid = movies.set_index(index)
vid.columns = vid.columns.str.rsplit("_", n=1, expand=True)
vid.columns.names = [None, "num"]
vid = vid.stack(level="num", future_stack=True)
vid
director | director_fb_likes | actor | actor_fb_likes | |||||
---|---|---|---|---|---|---|---|---|
title | rating | year | duration | num | ||||
Avatar | PG-13 | 2009.0 | 178.0 | 1 | James Cameron | 0.0 | CCH Pounder | 1000.0 |
2 | NaN | NaN | Joel David Moore | 936.0 | ||||
3 | NaN | NaN | Wes Studi | 855.0 | ||||
Pirates of the Caribbean: At World's End | PG-13 | 2007.0 | 169.0 | 1 | Gore Verbinski | 563.0 | Johnny Depp | 40000.0 |
2 | NaN | NaN | Orlando Bloom | 5000.0 | ||||
... | ... | ... | ... | ... | ... | ... | ... | ... |
Shanghai Calling | PG-13 | 2012.0 | 100.0 | 2 | NaN | NaN | Daniel Henney | 719.0 |
3 | NaN | NaN | Eliza Coupe | 490.0 | ||||
My Date with Drew | PG | 2004.0 | 90.0 | 1 | Jon Gunn | 16.0 | John August | 86.0 |
2 | NaN | NaN | Brian Herzlinger | 23.0 | ||||
3 | NaN | NaN | Jon Gunn | 16.0 |
14748 rows × 4 columns
out = movies.pivot_longer(
index=slice("title", "duration"),
names_to=".value",
names_pattern="(.+)_\\d",
sort_by_appearance=True,
)
out
title | rating | year | duration | director | director_fb_likes | actor | actor_fb_likes | |
---|---|---|---|---|---|---|---|---|
0 | Avatar | PG-13 | 2009.0 | 178.0 | James Cameron | 0.0 | CCH Pounder | 1000.0 |
1 | Avatar | PG-13 | 2009.0 | 178.0 | NaN | NaN | Joel David Moore | 936.0 |
2 | Avatar | PG-13 | 2009.0 | 178.0 | NaN | NaN | Wes Studi | 855.0 |
3 | Pirates of the Caribbean: At World's End | PG-13 | 2007.0 | 169.0 | Gore Verbinski | 563.0 | Johnny Depp | 40000.0 |
4 | Pirates of the Caribbean: At World's End | PG-13 | 2007.0 | 169.0 | NaN | NaN | Orlando Bloom | 5000.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
14743 | Shanghai Calling | PG-13 | 2012.0 | 100.0 | NaN | NaN | Daniel Henney | 719.0 |
14744 | Shanghai Calling | PG-13 | 2012.0 | 100.0 | NaN | NaN | Eliza Coupe | 490.0 |
14745 | My Date with Drew | PG | 2004.0 | 90.0 | Jon Gunn | 16.0 | John August | 86.0 |
14746 | My Date with Drew | PG | 2004.0 | 90.0 | NaN | NaN | Brian Herzlinger | 23.0 |
14747 | My Date with Drew | PG | 2004.0 | 90.0 | NaN | NaN | Jon Gunn | 16.0 |
14748 rows × 8 columns
wl = wl.droplevel("num").reset_index()
wl.equals(out)
True
vid = (
vid.droplevel("num")
.loc[:, ["director", "director_fb_likes", "actor", "actor_fb_likes"]]
.reset_index()
)
wl.equals(vid)
True
vid.equals(out)
True
%%timeit
pd.wide_to_long(
movies,
stubnames=['director', 'director_fb_likes', 'actor', 'actor_fb_likes'],
i = ['title', 'rating', 'year', 'duration'],
j = 'num',
sep='_')
51.8 ms ± 293 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
index = ['title', 'rating', 'year', 'duration']
vid = movies.set_index(index)
vid.columns = vid.columns.str.rsplit("_", n = 1, expand = True)
vid.stack(level = 1, future_stack=True)
5.15 ms ± 174 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
(movies
.pivot_longer(
index = slice('title', 'duration'),
names_to = ".value",
names_pattern = "(.+)_\\d",
sort_by_appearance = True)
)
1.66 ms ± 56.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
movies.pivot_longer(
index = slice('title', 'duration'),
names_to = ['director', 'director_fb_likes', 'actor', 'actor_fb_likes'],
names_pattern = ['director_\\d$', 'director_fb_likes', 'actor_\\d$', 'actor.+'],
sort_by_appearance = True
)
1.9 ms ± 65.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Your mileage may vary with these speed tests.
Summary#
This blog post shows various ways to reshape data into long form using pivot_longer, based on some common patterns. Once the pattern is identified, use the appropriate syntax for pivot_longer. It is meant to be simple, straightforward, and efficient. For more control, have a look at pivot_longer_spec.