Data Wrangling with Python Datatable - Selecting Columns#

This article highlights various ways to select columns in python datatable. The examples used here are based off the excellent article by Susan Baert.

The data file can be accessed here

Selecting Columns#

The Basics#

from datatable import dt, f, ltype, stype
import re

file_path='Data_files/msleep.txt'
DT = dt.fread(file_path)
DT.head(5)
namegenusvoreorderconservationsleep_totalsleep_remsleep_cycleawakebrainwtbodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivoralc12.1NANA11.9NA50
1Owl monkeyAotusomniPrimatesNA171.8NA70.01550.48
2Mountain beaverAplodontiaherbiRodentiant14.42.4NA9.6NA1.35
3Greater short-tailed shrewBlarinaomniSoricomorphalc14.92.30.1333339.10.000290.019
4CowBosherbiArtiodactyladomesticated40.70.666667200.423600

You can select columns by name or position in the j section:

DT[:, 'genus'].head(5)
genus
▪▪▪▪
0Acinonyx
1Aotus
2Aplodontia
3Blarina
4Bos
DT[:, 1].head()
genus
▪▪▪▪
0Acinonyx
1Aotus
2Aplodontia
3Blarina
4Bos
5Bradypus
6Callorhinus
7Calomys
8Canis
9Capreolus
DT[:, -10].head()
genus
▪▪▪▪
0Acinonyx
1Aotus
2Aplodontia
3Blarina
4Bos
5Bradypus
6Callorhinus
7Calomys
8Canis
9Capreolus

If you are selecting a single column, you can pass it into the brackets without specifying the i section:

DT['genus'].head(5)
genus
▪▪▪▪
0Acinonyx
1Aotus
2Aplodontia
3Blarina
4Bos

For the rest of this article, I will be focusing on column selection by name.

You can select columns by passing a list/tuple of the column names:

columns_to_select = ["name", "genus", "sleep_total", "awake"]
DT[:, columns_to_select].head(5)
namegenussleep_totalawake
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyx12.111.9
1Owl monkeyAotus177
2Mountain beaverAplodontia14.49.6
3Greater short-tailed shrewBlarina14.99.1
4CowBos420

You can pass a list/tuple of booleans:

columns_to_select = [True, True, False, False, False, True,False,True,True,False,False]
DT[:, columns_to_select].head(5)
namegenussleep_totalsleep_cycleawake
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyx12.1NA11.9
1Owl monkeyAotus17NA7
2Mountain beaverAplodontia14.4NA9.6
3Greater short-tailed shrewBlarina14.90.1333339.1
4CowBos40.66666720

You can select chunks of columns using python’s slice syntax or via the start:end shortcut:

DT[:, slice("name", "order")].head(5)
namegenusvoreorder
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivora
1Owl monkeyAotusomniPrimates
2Mountain beaverAplodontiaherbiRodentia
3Greater short-tailed shrewBlarinaomniSoricomorpha
4CowBosherbiArtiodactyla
DT[:, "name" : "order"].head(5)
namegenusvoreorder
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivora
1Owl monkeyAotusomniPrimates
2Mountain beaverAplodontiaherbiRodentia
3Greater short-tailed shrewBlarinaomniSoricomorpha
4CowBosherbiArtiodactyla

Multiple chunk selection is possible:

columns_to_select = [slice("name", "order"), slice("sleep_total", "sleep_cycle")]
DT[:, columns_to_select].head(5)
namegenusvoreordersleep_totalsleep_remsleep_cycle
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivora12.1NANA
1Owl monkeyAotusomniPrimates171.8NA
2Mountain beaverAplodontiaherbiRodentia14.42.4NA
3Greater short-tailed shrewBlarinaomniSoricomorpha14.92.30.133333
4CowBosherbiArtiodactyla40.70.666667

For the shortcut notation, for multiple selections, it has to be prefixed with datatable’s f symbol:

columns_to_select = [f["name" : "order", "sleep_total" : "sleep_cycle"]]
DT[:, columns_to_select].head(5)
namegenusvoreordersleep_totalsleep_remsleep_cycle
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivora12.1NANA
1Owl monkeyAotusomniPrimates171.8NA
2Mountain beaverAplodontiaherbiRodentia14.42.4NA
3Greater short-tailed shrewBlarinaomniSoricomorpha14.92.30.133333
4CowBosherbiArtiodactyla40.70.666667

To deselect/drop columns you can use the remove function:

columns_to_remove = [f["sleep_total" : "awake", "conservation"]]
DT[:, f[:].remove(columns_to_remove)].head(5)
namegenusvoreorderbrainwtbodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivoraNA50
1Owl monkeyAotusomniPrimates0.01550.48
2Mountain beaverAplodontiaherbiRodentiaNA1.35
3Greater short-tailed shrewBlarinaomniSoricomorpha0.000290.019
4CowBosherbiArtiodactyla0.423600

You can deselect a whole chunk, and then re-add a column again; this combines the remove and extend functions:

DT[:, f[:].remove(f["name" : "awake"]).extend(f["conservation"])].head(5)
brainwtbodywtconservation
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0NA50lc
10.01550.48NA
2NA1.35nt
30.000290.019lc
40.423600domesticated

Selecting Columns based on Partial Names#

You can use python’s string functions to filter for columns with partial matching:

columns_to_select = [name.startswith("sleep") for name in DT.names]
DT[:, columns_to_select].head(5)
sleep_totalsleep_remsleep_cycle
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
012.1NANA
1171.8NA
214.42.4NA
314.92.30.133333
440.70.666667
columns_to_select = ["eep" in name or name.endswith("wt") for name in DT.names]
DT[:, columns_to_select].head(5)
sleep_totalsleep_remsleep_cyclebrainwtbodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
012.1NANANA50
1171.8NA0.01550.48
214.42.4NANA1.35
314.92.30.1333330.000290.019
440.70.6666670.423600

Selecting Columns based on Regex#

Python’s re module can be used to select columns based on a regular expression:

# this returns a list of booleans
columns_to_select = [True if re.search(r"o.+er", name) else False for name in DT.names]
DT[:, columns_to_select].head(5)
orderconservation
▪▪▪▪▪▪▪▪
0Carnivoralc
1PrimatesNA
2Rodentiant
3Soricomorphalc
4Artiodactyladomesticated

Selecting columns by their data type#

You can pass a data type in the j section:

DT[:, str].head(5)
namegenusvoreorderconservation
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivoralc
1Owl monkeyAotusomniPrimatesNA
2Mountain beaverAplodontiaherbiRodentiant
3Greater short-tailed shrewBlarinaomniSoricomorphalc
4CowBosherbiArtiodactyladomesticated

You can pass a list of data types:

DT[:, [int, float]].head(5)
sleep_totalsleep_remsleep_cycleawakebrainwtbodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
012.1NANA11.9NA50
1171.8NA70.01550.48
214.42.4NA9.6NA1.35
314.92.30.1333339.10.000290.019
440.70.666667200.423600

You can also pass datatable’s stype or ltype data types:

DT[:, ltype.str].head(5)
namegenusvoreorderconservation
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivoralc
1Owl monkeyAotusomniPrimatesNA
2Mountain beaverAplodontiaherbiRodentiant
3Greater short-tailed shrewBlarinaomniSoricomorphalc
4CowBosherbiArtiodactyladomesticated
DT[:, stype.float64].head(5)
sleep_totalsleep_remsleep_cycleawakebrainwtbodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
012.1NANA11.9NA50
1171.8NA70.01550.48
214.42.4NA9.6NA1.35
314.92.30.1333339.10.000290.019
440.70.666667200.423600

You can remove columns based on their data type:

columns_to_remove = [f[int, float]]
DT[:, f[:].remove(columns_to_remove)].head(5)
namegenusvoreorderconservation
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivoralc
1Owl monkeyAotusomniPrimatesNA
2Mountain beaverAplodontiaherbiRodentiant
3Greater short-tailed shrewBlarinaomniSoricomorphalc
4CowBosherbiArtiodactyladomesticated

An alternative is to preselect the columns you intend to keep:

# creates a list of booleans
columns_to_select = [
    dtype not in (ltype.int, ltype.real)
    for name, dtype in zip(DT.names, DT.ltypes) 
]

DT[:, columns_to_select].head(5)
namegenusvoreorderconservation
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivoralc
1Owl monkeyAotusomniPrimatesNA
2Mountain beaverAplodontiaherbiRodentiant
3Greater short-tailed shrewBlarinaomniSoricomorphalc
4CowBosherbiArtiodactyladomesticated

You could also iterate through the frame and check each column’s type, before recombining with cbind:

matching_frames = [frame for frame in DT if frame.ltypes[0] not in (ltype.real, ltype.int)]
dt.cbind(matching_frames).head(5)
namegenusvoreorderconservation
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivoralc
1Owl monkeyAotusomniPrimatesNA
2Mountain beaverAplodontiaherbiRodentiant
3Greater short-tailed shrewBlarinaomniSoricomorphalc
4CowBosherbiArtiodactyladomesticated

Each column in a frame is treated as a frame, allowing for the list comprehension above.

You could also pass the matching frames to the j section of DT:

DT[:, matching_frames].head(5)
namegenusvoreorderconservation
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivoralc
1Owl monkeyAotusomniPrimatesNA
2Mountain beaverAplodontiaherbiRodentiant
3Greater short-tailed shrewBlarinaomniSoricomorphalc
4CowBosherbiArtiodactyladomesticated

Selecting columns by logical expressions#

The ideas expressed in the previous section allows for more nifty column selection.

Say we wish to select columns that are numeric, and have a mean greater than 10:

# returns a list of booleans
columns_to_select = [
    ltype in (ltype.real, ltype.int) and DT[name].mean()[0, 0] > 10
    for name, ltype in zip(DT.names, DT.ltypes)
]
DT[:, columns_to_select].head(5)
sleep_totalawakebodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
012.111.950
11770.48
214.49.61.35
314.99.10.019
4420600

The code above preselects the columns before passing it to datatable. Note the use of [0,0] to return a scalar value; this allows us to compare with the scalar value 10.

Alternatively, in the list comprehension, instead of a list of booleans, you could return the column names:

columns_to_select = [
    name
    for name, ltype in zip(DT.names, DT.ltypes)
    if ltype in (ltype.real, ltype.int) and DT[name].mean()[0, 0] > 10
]
DT[:, columns_to_select].head(5)
sleep_totalawakebodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
012.111.950
11770.48
214.49.61.35
314.99.10.019
4420600

You could also iterate through the frame in a list comprehension and check each column, before recombining with cbind:

matching_frames = [frame for frame in DT 
                    if frame.ltypes[0] in (ltype.int, ltype.real) 
                    and frame.mean()[0,0] > 10]
dt.cbind(matching_frames).head(5)
sleep_totalawakebodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
012.111.950
11770.48
214.49.61.35
314.99.10.019
4420600

Instead of recombining with cbind, you could pass the matching_frames to the j section:

DT[:, matching_frames].head(5)
sleep_totalawakebodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
012.111.950
11770.48
214.49.61.35
314.99.10.019
4420600

Let’s look at another example, where we select only columns where the number of distinct values is less than 10:

# returns a list of booleans
columns_to_select = [frame.nunique()[0, 0] < 10 for frame in DT]
DT[:, columns_to_select].head(5)
voreconservation
▪▪▪▪▪▪▪▪
0carnilc
1omniNA
2herbint
3omnilc
4herbidomesticated
matching_frames = [frame for frame in DT if frame.nunique()[0,0] < 10]
dt.cbind(matching_frames).head(5)
voreconservation
▪▪▪▪▪▪▪▪
0carnilc
1omniNA
2herbint
3omnilc
4herbidomesticated

Or pass matching_frames to the j section in DT:

DT[:, matching_frames].head(5)
voreconservation
▪▪▪▪▪▪▪▪
0carnilc
1omniNA
2herbint
3omnilc
4herbidomesticated

Reordering Columns#

You can select columns in the order that you want:

columns_to_select = ['conservation', 'sleep_total', 'name']
DT[:, columns_to_select].head(5)
conservationsleep_totalname
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0lc12.1Cheetah
1NA17Owl monkey
2nt14.4Mountain beaver
3lc14.9Greater short-tailed shrew
4domesticated4Cow

To move some columns to the front, you could write a function to cover that:

def move_to_the_front(frame, front_columns):
    column_names = list(frame.names)
    for name in front_columns:
        column_names.remove(name)
    front_columns.extend(column_names)
    return front_columns
DT[:, move_to_the_front(DT, ['conservation', 'sleep_total'])].head(5)
conservationsleep_totalnamegenusvoreordersleep_remsleep_cycleawakebrainwtbodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0lc12.1CheetahAcinonyxcarniCarnivoraNANA11.9NA50
1NA17Owl monkeyAotusomniPrimates1.8NA70.01550.48
2nt14.4Mountain beaverAplodontiaherbiRodentia2.4NA9.6NA1.35
3lc14.9Greater short-tailed shrewBlarinaomniSoricomorpha2.30.1333339.10.000290.019
4domesticated4CowBosherbiArtiodactyla0.70.666667200.423600

Column Names#

Renaming Columns#

Columns with new names can be created within the j section by passing a dictionary:

new_names = {"animal": f.name, "extinction_threat": f.conservation}
DT[:, f.sleep_total.extend(new_names)].head(5)
sleep_totalanimalextinction_threat
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
012.1Cheetahlc
117Owl monkeyNA
214.4Mountain beavernt
314.9Greater short-tailed shrewlc
44Cowdomesticated

You can also rename the columns via a dictionary that maps the old column name to the new column name, and assign it to DT.names:

DT_copy = DT.copy()
DT_copy.names = {"name": "animal", "conservation": "extinction_threat"}
DT_copy[:, ['animal', 'sleep_total', 'extinction_threat']].head(5)
animalsleep_totalextinction_threat
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0Cheetah12.1lc
1Owl monkey17NA
2Mountain beaver14.4nt
3Greater short-tailed shrew14.9lc
4Cow4domesticated
DT_copy.head(5)
animalgenusvoreorderextinction_threatsleep_totalsleep_remsleep_cycleawakebrainwtbodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivoralc12.1NANA11.9NA50
1Owl monkeyAotusomniPrimatesNA171.8NA70.01550.48
2Mountain beaverAplodontiaherbiRodentiant14.42.4NA9.6NA1.35
3Greater short-tailed shrewBlarinaomniSoricomorphalc14.92.30.1333339.10.000290.019
4CowBosherbiArtiodactyladomesticated40.70.666667200.423600

Reformatting all Column Names#

You can use python’s string functions to reformat column names.

Let’s convert all column names to uppercase:

DT_copy.names = [name.upper() for name in DT.names] # or list(map(str.upper, DT.names))
DT_copy.head(5)
NAMEGENUSVOREORDERCONSERVATIONSLEEP_TOTALSLEEP_REMSLEEP_CYCLEAWAKEBRAINWTBODYWT
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivoralc12.1NANA11.9NA50
1Owl monkeyAotusomniPrimatesNA171.8NA70.01550.48
2Mountain beaverAplodontiaherbiRodentiant14.42.4NA9.6NA1.35
3Greater short-tailed shrewBlarinaomniSoricomorphalc14.92.30.1333339.10.000290.019
4CowBosherbiArtiodactyladomesticated40.70.666667200.423600

Resources: