{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dplyr's across: Replicating within Pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[dplyr](https://dplyr.tidyverse.org/index.html) has the [across](https://dplyr.tidyverse.org/reference/across.html) function, which is meant to make column wise processing easy. Pandas has always supported column-wise operations, and in a relatively simple manner. \n",
"\n",
"This article aims to replicate solutions in the dplyr [column-wise operations](https://dplyr.tidyverse.org/articles/colwise.html) vignette with Pandas."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's load in the relevant libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" pandas version : 2.2.2 \n",
" numpy version : 2.0.0 \n",
" python version : 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:13:44) [Clang 16.0.6 ]\n"
]
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import sys\n",
"\n",
"print(\" pandas version :\", pd.__version__,\"\\n\", \n",
" \"numpy version :\", np.__version__, \"\\n\", \n",
" \"python version :\", sys.version)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" height | \n",
" mass | \n",
" hair_color | \n",
" skin_color | \n",
" eye_color | \n",
" birth_year | \n",
" sex | \n",
" gender | \n",
" homeworld | \n",
" species | \n",
" films | \n",
" vehicles | \n",
" starships | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Luke Skywalker | \n",
" 172.0 | \n",
" 77.0 | \n",
" blond | \n",
" fair | \n",
" blue | \n",
" 19.0 | \n",
" male | \n",
" masculine | \n",
" Tatooine | \n",
" Human | \n",
" A New Hope, The Empire Strikes Back, Return of... | \n",
" Snowspeeder, Imperial Speeder Bike | \n",
" X-wing, Imperial shuttle | \n",
"
\n",
" \n",
" 1 | \n",
" C-3PO | \n",
" 167.0 | \n",
" 75.0 | \n",
" NaN | \n",
" gold | \n",
" yellow | \n",
" 112.0 | \n",
" none | \n",
" masculine | \n",
" Tatooine | \n",
" Droid | \n",
" A New Hope, The Empire Strikes Back, Return of... | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" R2-D2 | \n",
" 96.0 | \n",
" 32.0 | \n",
" NaN | \n",
" white, blue | \n",
" red | \n",
" 33.0 | \n",
" none | \n",
" masculine | \n",
" Naboo | \n",
" Droid | \n",
" A New Hope, The Empire Strikes Back, Return of... | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" Darth Vader | \n",
" 202.0 | \n",
" 136.0 | \n",
" none | \n",
" white | \n",
" yellow | \n",
" 41.9 | \n",
" male | \n",
" masculine | \n",
" Tatooine | \n",
" Human | \n",
" A New Hope, The Empire Strikes Back, Return of... | \n",
" NaN | \n",
" TIE Advanced x1 | \n",
"
\n",
" \n",
" 4 | \n",
" Leia Organa | \n",
" 150.0 | \n",
" 49.0 | \n",
" brown | \n",
" light | \n",
" brown | \n",
" 19.0 | \n",
" female | \n",
" feminine | \n",
" Alderaan | \n",
" Human | \n",
" A New Hope, The Empire Strikes Back, Return of... | \n",
" Imperial Speeder Bike | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name height mass hair_color skin_color eye_color \\\n",
"0 Luke Skywalker 172.0 77.0 blond fair blue \n",
"1 C-3PO 167.0 75.0 NaN gold yellow \n",
"2 R2-D2 96.0 32.0 NaN white, blue red \n",
"3 Darth Vader 202.0 136.0 none white yellow \n",
"4 Leia Organa 150.0 49.0 brown light brown \n",
"\n",
" birth_year sex gender homeworld species \\\n",
"0 19.0 male masculine Tatooine Human \n",
"1 112.0 none masculine Tatooine Droid \n",
"2 33.0 none masculine Naboo Droid \n",
"3 41.9 male masculine Tatooine Human \n",
"4 19.0 female feminine Alderaan Human \n",
"\n",
" films \\\n",
"0 A New Hope, The Empire Strikes Back, Return of... \n",
"1 A New Hope, The Empire Strikes Back, Return of... \n",
"2 A New Hope, The Empire Strikes Back, Return of... \n",
"3 A New Hope, The Empire Strikes Back, Return of... \n",
"4 A New Hope, The Empire Strikes Back, Return of... \n",
"\n",
" vehicles starships \n",
"0 Snowspeeder, Imperial Speeder Bike X-wing, Imperial shuttle \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN TIE Advanced x1 \n",
"4 Imperial Speeder Bike NaN "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url = \"https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv\"\n",
"starwars = pd.read_csv(url)\n",
"starwars.head()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"name 87\n",
"hair_color 12\n",
"skin_color 31\n",
"eye_color 15\n",
"sex 5\n",
"gender 3\n",
"homeworld 49\n",
"species 38\n",
"films 24\n",
"vehicles 11\n",
"starships 16\n",
"dtype: int64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# dplyr : \n",
"# starwars %>% \n",
"# summarise(across(where(is.character), n_distinct))\n",
"(starwars\n",
".select_dtypes('object')\n",
".nunique(dropna=False))"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# dplyr :\n",
"# starwars %>% \n",
"# group_by(species) %>% \n",
"# filter(n() > 1) %>% \n",
"# summarise(across(c(sex, gender, homeworld), n_distinct))"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sex | \n",
" gender | \n",
" homeworld | \n",
"
\n",
" \n",
" species | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Droid | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" Gungan | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" Human | \n",
" 2 | \n",
" 2 | \n",
" 15 | \n",
"
\n",
" \n",
" Kaminoan | \n",
" 2 | \n",
" 2 | \n",
" 1 | \n",
"
\n",
" \n",
" Mirialan | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" Twi'lek | \n",
" 2 | \n",
" 2 | \n",
" 1 | \n",
"
\n",
" \n",
" Wookiee | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" Zabrak | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sex gender homeworld\n",
"species \n",
"Droid 1 2 3\n",
"Gungan 1 1 1\n",
"Human 2 2 15\n",
"Kaminoan 2 2 1\n",
"Mirialan 1 1 1\n",
"Twi'lek 2 2 1\n",
"Wookiee 1 1 1\n",
"Zabrak 1 1 2"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grp = starwars.groupby('species')\n",
"\n",
"(grp[['sex', 'gender', 'homeworld']]\n",
" .nunique(dropna=False)\n",
" .loc[grp.size().gt(1)]\n",
" )"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Another approach is with the [pipe](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.pipe.html) function:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sex | \n",
" gender | \n",
" homeworld | \n",
"
\n",
" \n",
" species | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Droid | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" Gungan | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" Human | \n",
" 2 | \n",
" 2 | \n",
" 15 | \n",
"
\n",
" \n",
" Kaminoan | \n",
" 2 | \n",
" 2 | \n",
" 1 | \n",
"
\n",
" \n",
" Mirialan | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" Twi'lek | \n",
" 2 | \n",
" 2 | \n",
" 1 | \n",
"
\n",
" \n",
" Wookiee | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" Zabrak | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sex gender homeworld\n",
"species \n",
"Droid 1 2 3\n",
"Gungan 1 1 1\n",
"Human 2 2 15\n",
"Kaminoan 2 2 1\n",
"Mirialan 1 1 1\n",
"Twi'lek 2 2 1\n",
"Wookiee 1 1 1\n",
"Zabrak 1 1 2"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(starwars\n",
".filter(['species', 'sex', 'gender', 'homeworld'])\n",
".groupby('species')\n",
".pipe(lambda df: df.nunique(dropna = False).loc[df.size().gt(1)])\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# dplyr : \n",
"# starwars %>% \n",
"# group_by(homeworld) %>% \n",
"# filter(n() > 1) %>% \n",
"# summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" height | \n",
" mass | \n",
" birth_year | \n",
"
\n",
" \n",
" homeworld | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Alderaan | \n",
" 176.333333 | \n",
" 64.000000 | \n",
" 43.000000 | \n",
"
\n",
" \n",
" Corellia | \n",
" 175.000000 | \n",
" 78.500000 | \n",
" 25.000000 | \n",
"
\n",
" \n",
" Coruscant | \n",
" 173.666667 | \n",
" 50.000000 | \n",
" 91.000000 | \n",
"
\n",
" \n",
" Kamino | \n",
" 208.333333 | \n",
" 83.100000 | \n",
" 31.500000 | \n",
"
\n",
" \n",
" Kashyyyk | \n",
" 231.000000 | \n",
" 124.000000 | \n",
" 200.000000 | \n",
"
\n",
" \n",
" Mirial | \n",
" 168.000000 | \n",
" 53.100000 | \n",
" 49.000000 | \n",
"
\n",
" \n",
" Naboo | \n",
" 177.272727 | \n",
" 64.166667 | \n",
" 55.000000 | \n",
"
\n",
" \n",
" Ryloth | \n",
" 179.000000 | \n",
" 55.000000 | \n",
" 48.000000 | \n",
"
\n",
" \n",
" Tatooine | \n",
" 169.800000 | \n",
" 85.375000 | \n",
" 54.644444 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" height mass birth_year\n",
"homeworld \n",
"Alderaan 176.333333 64.000000 43.000000\n",
"Corellia 175.000000 78.500000 25.000000\n",
"Coruscant 173.666667 50.000000 91.000000\n",
"Kamino 208.333333 83.100000 31.500000\n",
"Kashyyyk 231.000000 124.000000 200.000000\n",
"Mirial 168.000000 53.100000 49.000000\n",
"Naboo 177.272727 64.166667 55.000000\n",
"Ryloth 179.000000 55.000000 48.000000\n",
"Tatooine 169.800000 85.375000 54.644444"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# get the grouper\n",
"grp = starwars.set_index('homeworld').select_dtypes('number').groupby('homeworld')\n",
"# get the size:\n",
"n = grp.size()\n",
"# filter averages with the size:\n",
"grp.mean().loc[n.gt(1)]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"With the [pipe](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.pipe.html) function:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" height | \n",
" mass | \n",
" birth_year | \n",
"
\n",
" \n",
" homeworld | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Alderaan | \n",
" 176.333333 | \n",
" 64.000000 | \n",
" 43.000000 | \n",
"
\n",
" \n",
" Corellia | \n",
" 175.000000 | \n",
" 78.500000 | \n",
" 25.000000 | \n",
"
\n",
" \n",
" Coruscant | \n",
" 173.666667 | \n",
" 50.000000 | \n",
" 91.000000 | \n",
"
\n",
" \n",
" Kamino | \n",
" 208.333333 | \n",
" 83.100000 | \n",
" 31.500000 | \n",
"
\n",
" \n",
" Kashyyyk | \n",
" 231.000000 | \n",
" 124.000000 | \n",
" 200.000000 | \n",
"
\n",
" \n",
" Mirial | \n",
" 168.000000 | \n",
" 53.100000 | \n",
" 49.000000 | \n",
"
\n",
" \n",
" Naboo | \n",
" 177.272727 | \n",
" 64.166667 | \n",
" 55.000000 | \n",
"
\n",
" \n",
" Ryloth | \n",
" 179.000000 | \n",
" 55.000000 | \n",
" 48.000000 | \n",
"
\n",
" \n",
" Tatooine | \n",
" 169.800000 | \n",
" 85.375000 | \n",
" 54.644444 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" height mass birth_year\n",
"homeworld \n",
"Alderaan 176.333333 64.000000 43.000000\n",
"Corellia 175.000000 78.500000 25.000000\n",
"Coruscant 173.666667 50.000000 91.000000\n",
"Kamino 208.333333 83.100000 31.500000\n",
"Kashyyyk 231.000000 124.000000 200.000000\n",
"Mirial 168.000000 53.100000 49.000000\n",
"Naboo 177.272727 64.166667 55.000000\n",
"Ryloth 179.000000 55.000000 48.000000\n",
"Tatooine 169.800000 85.375000 54.644444"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(starwars\n",
".set_index('homeworld')\n",
".select_dtypes('number')\n",
".groupby('homeworld')\n",
".pipe(lambda df: df.mean().loc[df.size().gt(1)])\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An alternate approach, where the groupby is reused, is with [pyjanitor's](https://pyjanitor-devs.github.io/pyjanitor/) [get_columns](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.utils.get_columns) function:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" height | \n",
" mass | \n",
" birth_year | \n",
"
\n",
" \n",
" homeworld | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Alderaan | \n",
" 176.333333 | \n",
" 64.000000 | \n",
" 43.000000 | \n",
"
\n",
" \n",
" Corellia | \n",
" 175.000000 | \n",
" 78.500000 | \n",
" 25.000000 | \n",
"
\n",
" \n",
" Coruscant | \n",
" 173.666667 | \n",
" 50.000000 | \n",
" 91.000000 | \n",
"
\n",
" \n",
" Kamino | \n",
" 208.333333 | \n",
" 83.100000 | \n",
" 31.500000 | \n",
"
\n",
" \n",
" Kashyyyk | \n",
" 231.000000 | \n",
" 124.000000 | \n",
" 200.000000 | \n",
"
\n",
" \n",
" Mirial | \n",
" 168.000000 | \n",
" 53.100000 | \n",
" 49.000000 | \n",
"
\n",
" \n",
" Naboo | \n",
" 177.272727 | \n",
" 64.166667 | \n",
" 55.000000 | \n",
"
\n",
" \n",
" Ryloth | \n",
" 179.000000 | \n",
" 55.000000 | \n",
" 48.000000 | \n",
"
\n",
" \n",
" Tatooine | \n",
" 169.800000 | \n",
" 85.375000 | \n",
" 54.644444 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" height mass birth_year\n",
"homeworld \n",
"Alderaan 176.333333 64.000000 43.000000\n",
"Corellia 175.000000 78.500000 25.000000\n",
"Coruscant 173.666667 50.000000 91.000000\n",
"Kamino 208.333333 83.100000 31.500000\n",
"Kashyyyk 231.000000 124.000000 200.000000\n",
"Mirial 168.000000 53.100000 49.000000\n",
"Naboo 177.272727 64.166667 55.000000\n",
"Ryloth 179.000000 55.000000 48.000000\n",
"Tatooine 169.800000 85.375000 54.644444"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from janitor import get_columns\n",
"from pandas.api.types import is_numeric_dtype\n",
"\n",
"grp = starwars.groupby('homeworld')\n",
"(grp\n",
" .pipe(get_columns, is_numeric_dtype)\n",
" .mean()\n",
" .loc[grp.size().gt(1)]\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" hair_color | \n",
" skin_color | \n",
" eye_color | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" blond | \n",
" fair | \n",
" blue | \n",
"
\n",
" \n",
" 1 | \n",
" NaN | \n",
" gold | \n",
" yellow | \n",
"
\n",
" \n",
" 2 | \n",
" NaN | \n",
" white, blue | \n",
" red | \n",
"
\n",
" \n",
" 3 | \n",
" none | \n",
" white | \n",
" yellow | \n",
"
\n",
" \n",
" 4 | \n",
" brown | \n",
" light | \n",
" brown | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 80 | \n",
" none | \n",
" pale | \n",
" white | \n",
"
\n",
" \n",
" 82 | \n",
" black | \n",
" dark | \n",
" dark | \n",
"
\n",
" \n",
" 83 | \n",
" brown | \n",
" light | \n",
" hazel | \n",
"
\n",
" \n",
" 85 | \n",
" none | \n",
" none | \n",
" black | \n",
"
\n",
" \n",
" 86 | \n",
" none | \n",
" none | \n",
" unknown | \n",
"
\n",
" \n",
"
\n",
"
67 rows × 3 columns
\n",
"
"
],
"text/plain": [
" hair_color skin_color eye_color\n",
"0 blond fair blue\n",
"1 NaN gold yellow\n",
"2 NaN white, blue red\n",
"3 none white yellow\n",
"4 brown light brown\n",
".. ... ... ...\n",
"80 none pale white\n",
"82 black dark dark\n",
"83 brown light hazel\n",
"85 none none black\n",
"86 none none unknown\n",
"\n",
"[67 rows x 3 columns]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# dplyr:\n",
"# starwars %>% distinct(across(contains(\"color\")))\n",
"starwars.filter(like='color').drop_duplicates()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"hair_color skin_color eye_color\n",
"brown light brown 6\n",
" fair blue 4\n",
"none grey black 4\n",
"black dark brown 3\n",
"blond fair blue 3\n",
" ..\n",
"none dark brown 1\n",
" green black 1\n",
"auburn, grey fair blue 1\n",
"none green red 1\n",
"NaN white, red red 1\n",
"Name: count, Length: 67, dtype: int64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# dplyr\n",
"# starwars %>% count(across(contains(\"color\")), sort = TRUE)\n",
"starwars.filter(like='color').value_counts(dropna = False)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" height | \n",
" mass | \n",
" hair_color | \n",
" skin_color | \n",
" eye_color | \n",
" birth_year | \n",
" sex | \n",
" gender | \n",
" homeworld | \n",
" species | \n",
" films | \n",
" vehicles | \n",
" starships | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Luke Skywalker | \n",
" 172.0 | \n",
" 77.0 | \n",
" blond | \n",
" fair | \n",
" blue | \n",
" 19.0 | \n",
" male | \n",
" masculine | \n",
" Tatooine | \n",
" Human | \n",
" A New Hope, The Empire Strikes Back, Return of... | \n",
" Snowspeeder, Imperial Speeder Bike | \n",
" X-wing, Imperial shuttle | \n",
"
\n",
" \n",
" 1 | \n",
" C-3PO | \n",
" 167.0 | \n",
" 75.0 | \n",
" NaN | \n",
" gold | \n",
" yellow | \n",
" 112.0 | \n",
" none | \n",
" masculine | \n",
" Tatooine | \n",
" Droid | \n",
" A New Hope, The Empire Strikes Back, Return of... | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" R2-D2 | \n",
" 96.0 | \n",
" 32.0 | \n",
" NaN | \n",
" white, blue | \n",
" red | \n",
" 33.0 | \n",
" none | \n",
" masculine | \n",
" Naboo | \n",
" Droid | \n",
" A New Hope, The Empire Strikes Back, Return of... | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" Darth Vader | \n",
" 202.0 | \n",
" 136.0 | \n",
" none | \n",
" white | \n",
" yellow | \n",
" 41.9 | \n",
" male | \n",
" masculine | \n",
" Tatooine | \n",
" Human | \n",
" A New Hope, The Empire Strikes Back, Return of... | \n",
" NaN | \n",
" TIE Advanced x1 | \n",
"
\n",
" \n",
" 4 | \n",
" Leia Organa | \n",
" 150.0 | \n",
" 49.0 | \n",
" brown | \n",
" light | \n",
" brown | \n",
" 19.0 | \n",
" female | \n",
" feminine | \n",
" Alderaan | \n",
" Human | \n",
" A New Hope, The Empire Strikes Back, Return of... | \n",
" Imperial Speeder Bike | \n",
" NaN | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 82 | \n",
" Finn | \n",
" NaN | \n",
" NaN | \n",
" black | \n",
" dark | \n",
" dark | \n",
" NaN | \n",
" male | \n",
" masculine | \n",
" NaN | \n",
" Human | \n",
" The Force Awakens | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 83 | \n",
" Rey | \n",
" NaN | \n",
" NaN | \n",
" brown | \n",
" light | \n",
" hazel | \n",
" NaN | \n",
" female | \n",
" feminine | \n",
" NaN | \n",
" Human | \n",
" The Force Awakens | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 84 | \n",
" Poe Dameron | \n",
" NaN | \n",
" NaN | \n",
" brown | \n",
" light | \n",
" brown | \n",
" NaN | \n",
" male | \n",
" masculine | \n",
" NaN | \n",
" Human | \n",
" The Force Awakens | \n",
" NaN | \n",
" X-wing | \n",
"
\n",
" \n",
" 85 | \n",
" BB8 | \n",
" NaN | \n",
" NaN | \n",
" none | \n",
" none | \n",
" black | \n",
" NaN | \n",
" none | \n",
" masculine | \n",
" NaN | \n",
" Droid | \n",
" The Force Awakens | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 86 | \n",
" Captain Phasma | \n",
" NaN | \n",
" NaN | \n",
" none | \n",
" none | \n",
" unknown | \n",
" NaN | \n",
" female | \n",
" feminine | \n",
" NaN | \n",
" Human | \n",
" The Force Awakens | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
87 rows × 14 columns
\n",
"
"
],
"text/plain": [
" name height mass hair_color skin_color eye_color \\\n",
"0 Luke Skywalker 172.0 77.0 blond fair blue \n",
"1 C-3PO 167.0 75.0 NaN gold yellow \n",
"2 R2-D2 96.0 32.0 NaN white, blue red \n",
"3 Darth Vader 202.0 136.0 none white yellow \n",
"4 Leia Organa 150.0 49.0 brown light brown \n",
".. ... ... ... ... ... ... \n",
"82 Finn NaN NaN black dark dark \n",
"83 Rey NaN NaN brown light hazel \n",
"84 Poe Dameron NaN NaN brown light brown \n",
"85 BB8 NaN NaN none none black \n",
"86 Captain Phasma NaN NaN none none unknown \n",
"\n",
" birth_year sex gender homeworld species \\\n",
"0 19.0 male masculine Tatooine Human \n",
"1 112.0 none masculine Tatooine Droid \n",
"2 33.0 none masculine Naboo Droid \n",
"3 41.9 male masculine Tatooine Human \n",
"4 19.0 female feminine Alderaan Human \n",
".. ... ... ... ... ... \n",
"82 NaN male masculine NaN Human \n",
"83 NaN female feminine NaN Human \n",
"84 NaN male masculine NaN Human \n",
"85 NaN none masculine NaN Droid \n",
"86 NaN female feminine NaN Human \n",
"\n",
" films \\\n",
"0 A New Hope, The Empire Strikes Back, Return of... \n",
"1 A New Hope, The Empire Strikes Back, Return of... \n",
"2 A New Hope, The Empire Strikes Back, Return of... \n",
"3 A New Hope, The Empire Strikes Back, Return of... \n",
"4 A New Hope, The Empire Strikes Back, Return of... \n",
".. ... \n",
"82 The Force Awakens \n",
"83 The Force Awakens \n",
"84 The Force Awakens \n",
"85 The Force Awakens \n",
"86 The Force Awakens \n",
"\n",
" vehicles starships \n",
"0 Snowspeeder, Imperial Speeder Bike X-wing, Imperial shuttle \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN TIE Advanced x1 \n",
"4 Imperial Speeder Bike NaN \n",
".. ... ... \n",
"82 NaN NaN \n",
"83 NaN NaN \n",
"84 NaN X-wing \n",
"85 NaN NaN \n",
"86 NaN NaN \n",
"\n",
"[87 rows x 14 columns]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# starwars %>% \n",
"# filter(if_any(everything(), ~ !is.na(.x)))\n",
"starwars.loc[starwars.notna().any(axis=1)]"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" height | \n",
" mass | \n",
" hair_color | \n",
" skin_color | \n",
" eye_color | \n",
" birth_year | \n",
" sex | \n",
" gender | \n",
" homeworld | \n",
" species | \n",
" films | \n",
" vehicles | \n",
" starships | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Luke Skywalker | \n",
" 172.0 | \n",
" 77.0 | \n",
" blond | \n",
" fair | \n",
" blue | \n",
" 19.0 | \n",
" male | \n",
" masculine | \n",
" Tatooine | \n",
" Human | \n",
" A New Hope, The Empire Strikes Back, Return of... | \n",
" Snowspeeder, Imperial Speeder Bike | \n",
" X-wing, Imperial shuttle | \n",
"
\n",
" \n",
" 9 | \n",
" Obi-Wan Kenobi | \n",
" 182.0 | \n",
" 77.0 | \n",
" auburn, white | \n",
" fair | \n",
" blue-gray | \n",
" 57.0 | \n",
" male | \n",
" masculine | \n",
" Stewjon | \n",
" Human | \n",
" A New Hope, The Empire Strikes Back, Return of... | \n",
" Tribubble bongo | \n",
" Jedi starfighter, Trade Federation cruiser, Na... | \n",
"
\n",
" \n",
" 10 | \n",
" Anakin Skywalker | \n",
" 188.0 | \n",
" 84.0 | \n",
" blond | \n",
" fair | \n",
" blue | \n",
" 41.9 | \n",
" male | \n",
" masculine | \n",
" Tatooine | \n",
" Human | \n",
" The Phantom Menace, Attack of the Clones, Reve... | \n",
" Zephyr-G swoop bike, XJ-6 airspeeder | \n",
" Naboo fighter, Trade Federation cruiser, Jedi ... | \n",
"
\n",
" \n",
" 12 | \n",
" Chewbacca | \n",
" 228.0 | \n",
" 112.0 | \n",
" brown | \n",
" unknown | \n",
" blue | \n",
" 200.0 | \n",
" male | \n",
" masculine | \n",
" Kashyyyk | \n",
" Wookiee | \n",
" A New Hope, The Empire Strikes Back, Return of... | \n",
" AT-ST | \n",
" Millennium Falcon, Imperial shuttle | \n",
"
\n",
" \n",
" 16 | \n",
" Wedge Antilles | \n",
" 170.0 | \n",
" 77.0 | \n",
" brown | \n",
" fair | \n",
" hazel | \n",
" 21.0 | \n",
" male | \n",
" masculine | \n",
" Corellia | \n",
" Human | \n",
" A New Hope, The Empire Strikes Back, Return of... | \n",
" Snowspeeder | \n",
" X-wing | \n",
"
\n",
" \n",
" 42 | \n",
" Darth Maul | \n",
" 175.0 | \n",
" 80.0 | \n",
" none | \n",
" red | \n",
" yellow | \n",
" 54.0 | \n",
" male | \n",
" masculine | \n",
" Dathomir | \n",
" Zabrak | \n",
" The Phantom Menace | \n",
" Sith speeder | \n",
" Scimitar | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name height mass hair_color skin_color eye_color \\\n",
"0 Luke Skywalker 172.0 77.0 blond fair blue \n",
"9 Obi-Wan Kenobi 182.0 77.0 auburn, white fair blue-gray \n",
"10 Anakin Skywalker 188.0 84.0 blond fair blue \n",
"12 Chewbacca 228.0 112.0 brown unknown blue \n",
"16 Wedge Antilles 170.0 77.0 brown fair hazel \n",
"42 Darth Maul 175.0 80.0 none red yellow \n",
"\n",
" birth_year sex gender homeworld species \\\n",
"0 19.0 male masculine Tatooine Human \n",
"9 57.0 male masculine Stewjon Human \n",
"10 41.9 male masculine Tatooine Human \n",
"12 200.0 male masculine Kashyyyk Wookiee \n",
"16 21.0 male masculine Corellia Human \n",
"42 54.0 male masculine Dathomir Zabrak \n",
"\n",
" films \\\n",
"0 A New Hope, The Empire Strikes Back, Return of... \n",
"9 A New Hope, The Empire Strikes Back, Return of... \n",
"10 The Phantom Menace, Attack of the Clones, Reve... \n",
"12 A New Hope, The Empire Strikes Back, Return of... \n",
"16 A New Hope, The Empire Strikes Back, Return of... \n",
"42 The Phantom Menace \n",
"\n",
" vehicles \\\n",
"0 Snowspeeder, Imperial Speeder Bike \n",
"9 Tribubble bongo \n",
"10 Zephyr-G swoop bike, XJ-6 airspeeder \n",
"12 AT-ST \n",
"16 Snowspeeder \n",
"42 Sith speeder \n",
"\n",
" starships \n",
"0 X-wing, Imperial shuttle \n",
"9 Jedi starfighter, Trade Federation cruiser, Na... \n",
"10 Naboo fighter, Trade Federation cruiser, Jedi ... \n",
"12 Millennium Falcon, Imperial shuttle \n",
"16 X-wing \n",
"42 Scimitar "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# starwars %>% \n",
"# filter(if_any(everything(), ~ !is.na(.x)))\n",
"starwars.loc[starwars.notna().all(axis=1)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Output for the above is different from the vignette's because Pandas treats none as NaN, and dplyr does not (at least in the vignette)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at another solution, from [Stack Overflow](https://stackoverflow.com/questions/63200530/python-pandas-equivalent-to-dplyr-1-0-0-summarizeacross):"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" model | \n",
" mpg | \n",
" cyl | \n",
" disp | \n",
" hp | \n",
" drat | \n",
" wt | \n",
" qsec | \n",
" vs | \n",
" am | \n",
" gear | \n",
" carb | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Mazda RX4 | \n",
" 21.0 | \n",
" 6 | \n",
" 160.0 | \n",
" 110 | \n",
" 3.90 | \n",
" 2.620 | \n",
" 16.46 | \n",
" 0 | \n",
" 1 | \n",
" 4 | \n",
" 4 | \n",
"
\n",
" \n",
" 1 | \n",
" Mazda RX4 Wag | \n",
" 21.0 | \n",
" 6 | \n",
" 160.0 | \n",
" 110 | \n",
" 3.90 | \n",
" 2.875 | \n",
" 17.02 | \n",
" 0 | \n",
" 1 | \n",
" 4 | \n",
" 4 | \n",
"
\n",
" \n",
" 2 | \n",
" Datsun 710 | \n",
" 22.8 | \n",
" 4 | \n",
" 108.0 | \n",
" 93 | \n",
" 3.85 | \n",
" 2.320 | \n",
" 18.61 | \n",
" 1 | \n",
" 1 | \n",
" 4 | \n",
" 1 | \n",
"
\n",
" \n",
" 3 | \n",
" Hornet 4 Drive | \n",
" 21.4 | \n",
" 6 | \n",
" 258.0 | \n",
" 110 | \n",
" 3.08 | \n",
" 3.215 | \n",
" 19.44 | \n",
" 1 | \n",
" 0 | \n",
" 3 | \n",
" 1 | \n",
"
\n",
" \n",
" 4 | \n",
" Hornet Sportabout | \n",
" 18.7 | \n",
" 8 | \n",
" 360.0 | \n",
" 175 | \n",
" 3.15 | \n",
" 3.440 | \n",
" 17.02 | \n",
" 0 | \n",
" 0 | \n",
" 3 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" model mpg cyl disp hp drat wt qsec vs am gear \\\n",
"0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 \n",
"1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 \n",
"2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 \n",
"3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 \n",
"4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 \n",
"\n",
" carb \n",
"0 4 \n",
"1 4 \n",
"2 1 \n",
"3 1 \n",
"4 2 "
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url = \"https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv\"\n",
"cars = pd.read_csv(url)\n",
"cars.head()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" disp | \n",
" hp | \n",
" drat | \n",
" wt | \n",
"
\n",
" \n",
" cyl | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 4 | \n",
" 1156.5 | \n",
" 909 | \n",
" 4.070909 | \n",
" 2.285727 | \n",
"
\n",
" \n",
" 6 | \n",
" 1283.2 | \n",
" 856 | \n",
" 3.585714 | \n",
" 3.117143 | \n",
"
\n",
" \n",
" 8 | \n",
" 4943.4 | \n",
" 2929 | \n",
" 3.229286 | \n",
" 3.999214 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" disp hp drat wt\n",
"cyl \n",
"4 1156.5 909 4.070909 2.285727\n",
"6 1283.2 856 3.585714 3.117143\n",
"8 4943.4 2929 3.229286 3.999214"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# dplyr:\n",
"# dat <- group_by(mtcars, cyl) \n",
"# summarize(dat, across(ends_with('p'), sum), across(ends_with('t'), mean))\n",
"aggs = {key:\"mean\" if key.endswith(\"t\") else \"sum\" for key in cars if key.endswith(('p', 't'))}\n",
"cars.groupby('cyl').agg(aggs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An alternate approach, where the groupby is reused, is with [pyjanitor's](https://pyjanitor-devs.github.io/pyjanitor/) [get_columns](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.utils.get_columns) function:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" disp | \n",
" hp | \n",
" drat | \n",
" wt | \n",
"
\n",
" \n",
" cyl | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 4 | \n",
" 1156.5 | \n",
" 909 | \n",
" 4.070909 | \n",
" 2.285727 | \n",
"
\n",
" \n",
" 6 | \n",
" 1283.2 | \n",
" 856 | \n",
" 3.585714 | \n",
" 3.117143 | \n",
"
\n",
" \n",
" 8 | \n",
" 4943.4 | \n",
" 2929 | \n",
" 3.229286 | \n",
" 3.999214 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" disp hp drat wt\n",
"cyl \n",
"4 1156.5 909 4.070909 2.285727\n",
"6 1283.2 856 3.585714 3.117143\n",
"8 4943.4 2929 3.229286 3.999214"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grp = cars.groupby('cyl')\n",
"summary = [get_columns(grp, '*p').sum(), get_columns(grp, '*t').mean()]\n",
"pd.concat(summary, axis = 1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Comments\n",
""
]
}
],
"metadata": {
"jupytext": {
"formats": "ipynb,md"
},
"kernelspec": {
"display_name": "blogger",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
}
},
"nbformat": 4,
"nbformat_minor": 4
}