Data Wrangling with Python Datatable - Selecting Columns

This article highlights various ways to select columns in Python datatable. The examples are based on the excellent article by Suzan Baert.

The data file can be accessed here

Selecting Columns#

The Basics#

from datatable import dt, f, ltype, stype
import re

file_path='Data_files/msleep.txt'
DT = dt.fread(file_path)
DT.head(5)

	name	genus	vore	order	conservation	sleep_total	sleep_rem	sleep_cycle	awake	brainwt	bodywt
	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	Cheetah	Acinonyx	carni	Carnivora	lc	12.1	NA	NA	11.9	NA	50
1	Owl monkey	Aotus	omni	Primates	NA	17	1.8	NA	7	0.0155	0.48
2	Mountain beaver	Aplodontia	herbi	Rodentia	nt	14.4	2.4	NA	9.6	NA	1.35
3	Greater short-tailed shrew	Blarina	omni	Soricomorpha	lc	14.9	2.3	0.133333	9.1	0.00029	0.019
4	Cow	Bos	herbi	Artiodactyla	domesticated	4	0.7	0.666667	20	0.423	600

You can select columns by name or position in the j section:

DT[:, 'genus'].head(5)

	genus
	▪▪▪▪
0	Acinonyx
1	Aotus
2	Aplodontia
3	Blarina
4	Bos

DT[:, 1].head()

	genus
	▪▪▪▪
0	Acinonyx
1	Aotus
2	Aplodontia
3	Blarina
4	Bos
5	Bradypus
6	Callorhinus
7	Calomys
8	Canis
9	Capreolus

DT[:, -10].head()

	genus
	▪▪▪▪
0	Acinonyx
1	Aotus
2	Aplodontia
3	Blarina
4	Bos
5	Bradypus
6	Callorhinus
7	Calomys
8	Canis
9	Capreolus

If you are selecting a single column, you can pass it into the brackets without specifying the i section:

DT['genus'].head(5)

	genus
	▪▪▪▪
0	Acinonyx
1	Aotus
2	Aplodontia
3	Blarina
4	Bos

For the rest of this article, I will be focusing on column selection by name.

You can select columns by passing a list/tuple of the column names:

columns_to_select = ["name", "genus", "sleep_total", "awake"]
DT[:, columns_to_select].head(5)

	name	genus	sleep_total	awake
	▪▪▪▪	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	Cheetah	Acinonyx	12.1	11.9
1	Owl monkey	Aotus	17	7
2	Mountain beaver	Aplodontia	14.4	9.6
3	Greater short-tailed shrew	Blarina	14.9	9.1
4	Cow	Bos	4	20

You can pass a list/tuple of booleans:

columns_to_select = [True, True, False, False, False, True, False, True, True, False, False]
DT[:, columns_to_select].head(5)

	name	genus	sleep_total	sleep_cycle	awake
	▪▪▪▪	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	Cheetah	Acinonyx	12.1	NA	11.9
1	Owl monkey	Aotus	17	NA	7
2	Mountain beaver	Aplodontia	14.4	NA	9.6
3	Greater short-tailed shrew	Blarina	14.9	0.133333	9.1
4	Cow	Bos	4	0.666667	20
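Writing the boolean list by hand gets error-prone as the number of columns grows. As a minimal sketch, the same mask can be derived from the names you want to keep. The `names` tuple below is copied from the header shown earlier and stands in for `DT.names`; `wanted` is a hypothetical variable introduced only for this illustration.

```python
# Stand-in for DT.names, copied from the msleep header shown earlier.
names = (
    "name", "genus", "vore", "order", "conservation", "sleep_total",
    "sleep_rem", "sleep_cycle", "awake", "brainwt", "bodywt",
)

# Hypothetical set of columns to keep.
wanted = {"name", "genus", "sleep_total", "sleep_cycle", "awake"}

# One boolean per column, in the same order as the frame's columns.
mask = [name in wanted for name in names]
print(mask)
```

The resulting `mask` has one entry per column, so it could be passed to the j section in the same way as the hand-written list.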

You can select chunks of columns using Python’s slice syntax or via the start:end shortcut:

DT[:, slice("name", "order")].head(5)

	name	genus	vore	order
	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪
0	Cheetah	Acinonyx	carni	Carnivora
1	Owl monkey	Aotus	omni	Primates
2	Mountain beaver	Aplodontia	herbi	Rodentia
3	Greater short-tailed shrew	Blarina	omni	Soricomorpha
4	Cow	Bos	herbi	Artiodactyla

DT[:, "name" : "order"].head(5)

	name	genus	vore	order
	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪
0	Cheetah	Acinonyx	carni	Carnivora
1	Owl monkey	Aotus	omni	Primates
2	Mountain beaver	Aplodontia	herbi	Rodentia
3	Greater short-tailed shrew	Blarina	omni	Soricomorpha
4	Cow	Bos	herbi	Artiodactyla

Multiple chunk selection is possible:

columns_to_select = [slice("name", "order"), slice("sleep_total", "sleep_cycle")]
DT[:, columns_to_select].head(5)

	name	genus	vore	order	sleep_total	sleep_rem	sleep_cycle
	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	Cheetah	Acinonyx	carni	Carnivora	12.1	NA	NA
1	Owl monkey	Aotus	omni	Primates	17	1.8	NA
2	Mountain beaver	Aplodontia	herbi	Rodentia	14.4	2.4	NA
3	Greater short-tailed shrew	Blarina	omni	Soricomorpha	14.9	2.3	0.133333
4	Cow	Bos	herbi	Artiodactyla	4	0.7	0.666667

To make multiple selections with the shortcut notation, pass the slices to datatable’s f symbol:

columns_to_select = [f["name" : "order", "sleep_total" : "sleep_cycle"]]
DT[:, columns_to_select].head(5)

	name	genus	vore	order	sleep_total	sleep_rem	sleep_cycle
	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	Cheetah	Acinonyx	carni	Carnivora	12.1	NA	NA
1	Owl monkey	Aotus	omni	Primates	17	1.8	NA
2	Mountain beaver	Aplodontia	herbi	Rodentia	14.4	2.4	NA
3	Greater short-tailed shrew	Blarina	omni	Soricomorpha	14.9	2.3	0.133333
4	Cow	Bos	herbi	Artiodactyla	4	0.7	0.666667

To deselect/drop columns you can use the remove function:

columns_to_remove = [f["sleep_total" : "awake", "conservation"]]
DT[:, f[:].remove(columns_to_remove)].head(5)

	name	genus	vore	order	brainwt	bodywt
	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	Cheetah	Acinonyx	carni	Carnivora	NA	50
1	Owl monkey	Aotus	omni	Primates	0.0155	0.48
2	Mountain beaver	Aplodontia	herbi	Rodentia	NA	1.35
3	Greater short-tailed shrew	Blarina	omni	Soricomorpha	0.00029	0.019
4	Cow	Bos	herbi	Artiodactyla	0.423	600

You can deselect a whole chunk, and then re-add a column again; this combines the remove and extend functions:

DT[:, f[:].remove(f["name" : "awake"]).extend(f["conservation"])].head(5)

	brainwt	bodywt	conservation
	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪
0	NA	50	lc
1	0.0155	0.48	NA
2	NA	1.35	nt
3	0.00029	0.019	lc
4	0.423	600	domesticated

Selecting Columns based on Partial Names#

You can use Python’s string methods to select columns by partial name matching:

columns_to_select = [name.startswith("sleep") for name in DT.names]
DT[:, columns_to_select].head(5)

	sleep_total	sleep_rem	sleep_cycle
	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	12.1	NA	NA
1	17	1.8	NA
2	14.4	2.4	NA
3	14.9	2.3	0.133333
4	4	0.7	0.666667

columns_to_select = ["eep" in name or name.endswith("wt") for name in DT.names]
DT[:, columns_to_select].head(5)

	sleep_total	sleep_rem	sleep_cycle	brainwt	bodywt
	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	12.1	NA	NA	NA	50
1	17	1.8	NA	0.0155	0.48
2	14.4	2.4	NA	NA	1.35
3	14.9	2.3	0.133333	0.00029	0.019
4	4	0.7	0.666667	0.423	600
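If you prefer shell-style wildcards to string methods, the standard library's `fnmatch` module offers another way to express partial matches. This is only a sketch on a plain tuple of names (a stand-in for `DT.names`, copied from the header shown earlier); `names_like` is a hypothetical helper, not part of datatable.

```python
from fnmatch import fnmatchcase

# Stand-in for DT.names, copied from the msleep header shown earlier.
names = (
    "name", "genus", "vore", "order", "conservation", "sleep_total",
    "sleep_rem", "sleep_cycle", "awake", "brainwt", "bodywt",
)

def names_like(pattern, names):
    """Return the names that match a shell-style pattern (hypothetical helper)."""
    # fnmatchcase is case-sensitive on every platform, unlike fnmatch.
    return [name for name in names if fnmatchcase(name, pattern)]

sleep_columns = names_like("sleep_*", names)   # names starting with "sleep_"
weight_columns = names_like("*wt", names)      # names ending with "wt"
print(sleep_columns, weight_columns)
```

Because the helper returns column names rather than booleans, its result could be passed to the j section like any other list of names.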

Selecting Columns based on Regex#

Python’s re module can be used to select columns based on a regular expression:

# this returns a list of booleans; re.search matches anywhere in the name
columns_to_select = [bool(re.search(r"o.+er", name)) for name in DT.names]
DT[:, columns_to_select].head(5)

	order	conservation
	▪▪▪▪	▪▪▪▪
0	Carnivora	lc
1	Primates	NA
2	Rodentia	nt
3	Soricomorpha	lc
4	Artiodactyla	domesticated
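Note that `re.search` looks for the pattern anywhere in the name, which is why `conservation` is picked up alongside `order`. The sketch below, run on a plain tuple standing in for `DT.names`, contrasts the unanchored pattern with an anchored one; `names_matching` is a hypothetical helper introduced for illustration.

```python
import re

# Stand-in for DT.names, copied from the msleep header shown earlier.
names = (
    "name", "genus", "vore", "order", "conservation", "sleep_total",
    "sleep_rem", "sleep_cycle", "awake", "brainwt", "bodywt",
)

def names_matching(pattern, names):
    """Return the names in which the regex is found (hypothetical helper)."""
    regex = re.compile(pattern)  # compile once, reuse for every name
    return [name for name in names if regex.search(name)]

unanchored = names_matching(r"o.+er", names)   # match anywhere in the name
anchored = names_matching(r"^o.+er", names)    # match only at the start
print(unanchored, anchored)
```

Anchoring with `^` (or `$`) is a simple way to tighten a pattern that selects more columns than intended.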

Selecting columns by their data type#

You can pass a data type in the j section:

DT[:, str].head(5)

	name	genus	vore	order	conservation
	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪
0	Cheetah	Acinonyx	carni	Carnivora	lc
1	Owl monkey	Aotus	omni	Primates	NA
2	Mountain beaver	Aplodontia	herbi	Rodentia	nt
3	Greater short-tailed shrew	Blarina	omni	Soricomorpha	lc
4	Cow	Bos	herbi	Artiodactyla	domesticated

You can pass a list of data types:

DT[:, [int, float]].head(5)

	sleep_total	sleep_rem	sleep_cycle	awake	brainwt	bodywt
	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	12.1	NA	NA	11.9	NA	50
1	17	1.8	NA	7	0.0155	0.48
2	14.4	2.4	NA	9.6	NA	1.35
3	14.9	2.3	0.133333	9.1	0.00029	0.019
4	4	0.7	0.666667	20	0.423	600

You can also pass datatable’s stype or ltype data types:

DT[:, ltype.str].head(5)

	name	genus	vore	order	conservation
	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪
0	Cheetah	Acinonyx	carni	Carnivora	lc
1	Owl monkey	Aotus	omni	Primates	NA
2	Mountain beaver	Aplodontia	herbi	Rodentia	nt
3	Greater short-tailed shrew	Blarina	omni	Soricomorpha	lc
4	Cow	Bos	herbi	Artiodactyla	domesticated

DT[:, stype.float64].head(5)

	sleep_total	sleep_rem	sleep_cycle	awake	brainwt	bodywt
	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	12.1	NA	NA	11.9	NA	50
1	17	1.8	NA	7	0.0155	0.48
2	14.4	2.4	NA	9.6	NA	1.35
3	14.9	2.3	0.133333	9.1	0.00029	0.019
4	4	0.7	0.666667	20	0.423	600

You can remove columns based on their data type:

columns_to_remove = [f[int, float]]
DT[:, f[:].remove(columns_to_remove)].head(5)

	name	genus	vore	order	conservation
	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪
0	Cheetah	Acinonyx	carni	Carnivora	lc
1	Owl monkey	Aotus	omni	Primates	NA
2	Mountain beaver	Aplodontia	herbi	Rodentia	nt
3	Greater short-tailed shrew	Blarina	omni	Soricomorpha	lc
4	Cow	Bos	herbi	Artiodactyla	domesticated

An alternative is to preselect the columns you intend to keep:

# creates a list of booleans, one per column
columns_to_select = [dtype not in (ltype.int, ltype.real) for dtype in DT.ltypes]

DT[:, columns_to_select].head(5)

	name	genus	vore	order	conservation
	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪
0	Cheetah	Acinonyx	carni	Carnivora	lc
1	Owl monkey	Aotus	omni	Primates	NA
2	Mountain beaver	Aplodontia	herbi	Rodentia	nt
3	Greater short-tailed shrew	Blarina	omni	Soricomorpha	lc
4	Cow	Bos	herbi	Artiodactyla	domesticated

You could also iterate through the frame and check each column’s type, before recombining with cbind:

matching_frames = [frame for frame in DT if frame.ltypes[0] not in (ltype.real, ltype.int)]
dt.cbind(matching_frames).head(5)

	name	genus	vore	order	conservation
	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪
0	Cheetah	Acinonyx	carni	Carnivora	lc
1	Owl monkey	Aotus	omni	Primates	NA
2	Mountain beaver	Aplodontia	herbi	Rodentia	nt
3	Greater short-tailed shrew	Blarina	omni	Soricomorpha	lc
4	Cow	Bos	herbi	Artiodactyla	domesticated

Each column in a frame is treated as a frame, allowing for the list comprehension above.

You could also pass the matching frames to the j section of DT:

DT[:, matching_frames].head(5)

	name	genus	vore	order	conservation
	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪
0	Cheetah	Acinonyx	carni	Carnivora	lc
1	Owl monkey	Aotus	omni	Primates	NA
2	Mountain beaver	Aplodontia	herbi	Rodentia	nt
3	Greater short-tailed shrew	Blarina	omni	Soricomorpha	lc
4	Cow	Bos	herbi	Artiodactyla	domesticated

Selecting columns by logical expressions#

The ideas in the previous section allow for more flexible column selection.

Say we wish to select columns that are numeric, and have a mean greater than 10:

# returns a list of booleans
# the loop variable is named col_type so that it does not shadow the imported ltype
columns_to_select = [
    col_type in (ltype.real, ltype.int) and DT[name].mean()[0, 0] > 10
    for name, col_type in zip(DT.names, DT.ltypes)
]
DT[:, columns_to_select].head(5)

	sleep_total	awake	bodywt
	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	12.1	11.9	50
1	17	7	0.48
2	14.4	9.6	1.35
3	14.9	9.1	0.019
4	4	20	600

The code above preselects the columns before passing them to datatable. Note the use of [0, 0] to extract a scalar from the one-cell frame returned by mean(); this lets us compare it with the scalar value 10.

Alternatively, in the list comprehension, instead of a list of booleans, you could return the column names:

# the loop variable is named col_type so that it does not shadow the imported ltype
columns_to_select = [
    name
    for name, col_type in zip(DT.names, DT.ltypes)
    if col_type in (ltype.real, ltype.int) and DT[name].mean()[0, 0] > 10
]
DT[:, columns_to_select].head(5)

	sleep_total	awake	bodywt
	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	12.1	11.9	50
1	17	7	0.48
2	14.4	9.6	1.35
3	14.9	9.1	0.019
4	4	20	600

You could also iterate through the frame in a list comprehension and check each column, before recombining with cbind:

matching_frames = [frame for frame in DT 
                    if frame.ltypes[0] in (ltype.int, ltype.real) 
                    and frame.mean()[0,0] > 10]
dt.cbind(matching_frames).head(5)

	sleep_total	awake	bodywt
	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	12.1	11.9	50
1	17	7	0.48
2	14.4	9.6	1.35
3	14.9	9.1	0.019
4	4	20	600

Instead of recombining with cbind, you could pass the matching_frames to the j section:

DT[:, matching_frames].head(5)

	sleep_total	awake	bodywt
	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	12.1	11.9	50
1	17	7	0.48
2	14.4	9.6	1.35
3	14.9	9.1	0.019
4	4	20	600

Let’s look at another example, where we select only columns where the number of distinct values is less than 10:

# returns a list of booleans
columns_to_select = [frame.nunique()[0, 0] < 10 for frame in DT]
DT[:, columns_to_select].head(5)

	vore	conservation
	▪▪▪▪	▪▪▪▪
0	carni	lc
1	omni	NA
2	herbi	nt
3	omni	lc
4	herbi	domesticated

matching_frames = [frame for frame in DT if frame.nunique()[0,0] < 10]
dt.cbind(matching_frames).head(5)

	vore	conservation
	▪▪▪▪	▪▪▪▪
0	carni	lc
1	omni	NA
2	herbi	nt
3	omni	lc
4	herbi	domesticated

Or pass matching_frames to the j section in DT:

DT[:, matching_frames].head(5)

	vore	conservation
	▪▪▪▪	▪▪▪▪
0	carni	lc
1	omni	NA
2	herbi	nt
3	omni	lc
4	herbi	domesticated

Reordering Columns#

You can select columns in the order that you want:

columns_to_select = ['conservation', 'sleep_total', 'name']
DT[:, columns_to_select].head(5)

	conservation	sleep_total	name
	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪
0	lc	12.1	Cheetah
1	NA	17	Owl monkey
2	nt	14.4	Mountain beaver
3	lc	14.9	Greater short-tailed shrew
4	domesticated	4	Cow

To move some columns to the front, you could write a function to cover that:

def move_to_the_front(frame, front_columns):
    # keep the remaining columns in their original order;
    # build a new list so the caller's front_columns is not modified
    remaining = [name for name in frame.names if name not in front_columns]
    return [*front_columns, *remaining]

DT[:, move_to_the_front(DT, ['conservation', 'sleep_total'])].head(5)

	conservation	sleep_total	name	genus	vore	order	sleep_rem	sleep_cycle	awake	brainwt	bodywt
	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	lc	12.1	Cheetah	Acinonyx	carni	Carnivora	NA	NA	11.9	NA	50
1	NA	17	Owl monkey	Aotus	omni	Primates	1.8	NA	7	0.0155	0.48
2	nt	14.4	Mountain beaver	Aplodontia	herbi	Rodentia	2.4	NA	9.6	NA	1.35
3	lc	14.9	Greater short-tailed shrew	Blarina	omni	Soricomorpha	2.3	0.133333	9.1	0.00029	0.019
4	domesticated	4	Cow	Bos	herbi	Artiodactyla	0.7	0.666667	20	0.423	600
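The same idea works in the other direction. As a rough sketch, a hypothetical `move_to_the_back` helper can compute the new order from a plain sequence of names (here a tuple copied from the header shown earlier, standing in for `DT.names`):

```python
# Stand-in for DT.names, copied from the msleep header shown earlier.
names = (
    "name", "genus", "vore", "order", "conservation", "sleep_total",
    "sleep_rem", "sleep_cycle", "awake", "brainwt", "bodywt",
)

def move_to_the_back(names, back_columns):
    """Return all names with back_columns moved to the end (hypothetical helper)."""
    # Keep every other column in its original relative order.
    remaining = [name for name in names if name not in back_columns]
    # Build a new list; neither input is modified.
    return remaining + list(back_columns)

new_order = move_to_the_back(names, ["name", "genus"])
print(new_order)
```

The returned list of names could then be passed to the j section to produce the reordered frame.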

Column Names#

Renaming Columns#

Columns with new names can be created within the j section by passing a dictionary:

new_names = {"animal": f.name, "extinction_threat": f.conservation}
DT[:, f.sleep_total.extend(new_names)].head(5)

	sleep_total	animal	extinction_threat
	▪▪▪▪▪▪▪▪	▪▪▪▪	▪▪▪▪
0	12.1	Cheetah	lc
1	17	Owl monkey	NA
2	14.4	Mountain beaver	nt
3	14.9	Greater short-tailed shrew	lc
4	4	Cow	domesticated

You can also rename the columns via a dictionary that maps the old column name to the new column name, and assign it to DT.names:

DT_copy = DT.copy()
DT_copy.names = {"name": "animal", "conservation": "extinction_threat"}
DT_copy[:, ['animal', 'sleep_total', 'extinction_threat']].head(5)

	animal	sleep_total	extinction_threat
	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪
0	Cheetah	12.1	lc
1	Owl monkey	17	NA
2	Mountain beaver	14.4	nt
3	Greater short-tailed shrew	14.9	lc
4	Cow	4	domesticated

DT_copy.head(5)

	animal	genus	vore	order	extinction_threat	sleep_total	sleep_rem	sleep_cycle	awake	brainwt	bodywt
	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	Cheetah	Acinonyx	carni	Carnivora	lc	12.1	NA	NA	11.9	NA	50
1	Owl monkey	Aotus	omni	Primates	NA	17	1.8	NA	7	0.0155	0.48
2	Mountain beaver	Aplodontia	herbi	Rodentia	nt	14.4	2.4	NA	9.6	NA	1.35
3	Greater short-tailed shrew	Blarina	omni	Soricomorpha	lc	14.9	2.3	0.133333	9.1	0.00029	0.019
4	Cow	Bos	herbi	Artiodactyla	domesticated	4	0.7	0.666667	20	0.423	600

Reformatting all Column Names#

You can use Python’s string methods to reformat column names.

Let’s convert all column names to uppercase:

# the new names are built from DT.names, so the earlier renames on DT_copy are overwritten
DT_copy.names = [name.upper() for name in DT.names]  # or list(map(str.upper, DT.names))
DT_copy.head(5)

	NAME	GENUS	VORE	ORDER	CONSERVATION	SLEEP_TOTAL	SLEEP_REM	SLEEP_CYCLE	AWAKE	BRAINWT	BODYWT
	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	Cheetah	Acinonyx	carni	Carnivora	lc	12.1	NA	NA	11.9	NA	50
1	Owl monkey	Aotus	omni	Primates	NA	17	1.8	NA	7	0.0155	0.48
2	Mountain beaver	Aplodontia	herbi	Rodentia	nt	14.4	2.4	NA	9.6	NA	1.35
3	Greater short-tailed shrew	Blarina	omni	Soricomorpha	lc	14.9	2.3	0.133333	9.1	0.00029	0.019
4	Cow	Bos	herbi	Artiodactyla	domesticated	4	0.7	0.666667	20	0.423	600
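When only some names need reformatting, a dictionary comprehension can build an old-to-new mapping of the kind used in the renaming section. This is a sketch on a plain tuple standing in for `DT.names`; the `slp_` prefix is an arbitrary choice made for illustration.

```python
# Stand-in for DT.names, copied from the msleep header shown earlier.
names = (
    "name", "genus", "vore", "order", "conservation", "sleep_total",
    "sleep_rem", "sleep_cycle", "awake", "brainwt", "bodywt",
)

# Map old name -> new name, only for the columns that actually change.
# The "slp_" prefix is a made-up abbreviation for this example.
renames = {
    name: name.replace("sleep_", "slp_", 1)
    for name in names
    if name.startswith("sleep_")
}
print(renames)
```

A mapping of this shape could then be assigned to the names attribute of a fresh copy of the frame, as was done with the hand-written dictionary earlier.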

Resources:

datatable docs