Filtering Rows in Datatable

Filtering Rows in Datatable#

This article highlights various ways of filtering rows in python datatable. The examples used here are based off the excellent article by Susan Baert.

The data file can be accessed here

Basic Row Filters#

from datatable import dt, f
from operator import  or_, xor

file_path = "Data_files/msleep.txt"

DT = dt.fread(file_path)

DT.head(5)

	name	genus	vore	order	conservation	sleep_total	sleep_rem	sleep_cycle	awake	brainwt	bodywt
	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	Cheetah	Acinonyx	carni	Carnivora	lc	12.1	NA	NA	11.9	NA	50
1	Owl monkey	Aotus	omni	Primates	NA	17	1.8	NA	7	0.0155	0.48
2	Mountain beaver	Aplodontia	herbi	Rodentia	nt	14.4	2.4	NA	9.6	NA	1.35
3	Greater short-tailed shrew	Blarina	omni	Soricomorpha	lc	14.9	2.3	0.133333	9.1	0.00029	0.019
4	Cow	Bos	herbi	Artiodactyla	domesticated	4	0.7	0.666667	20	0.423	600

Filtering Rows Based on a Numeric Variable#

You can filter numeric variables based on their values. A number of commonly used operators include: >, >=, <, <=, == and !=.

Note that in datatable, filtration occurs in the i section:

DT[f.sleep_total > 18,  ["name", "sleep_total"]]

	name	sleep_total
	▪▪▪▪	▪▪▪▪▪▪▪▪
0	Big brown bat	19.7
1	Thick-tailed opposum	19.4
2	Little brown bat	19.9
3	Giant armadillo	18.1

To select a range of values, you can use two logical requirements; in the example below, only rows where sleep_total is greater than or equal to 16, and less than or equal to 18 are selected:

DT[(f.sleep_total >= 16) & (f.sleep_total <= 18),  ["name", "sleep_total"]]

	name	sleep_total
	▪▪▪▪	▪▪▪▪▪▪▪▪
0	Owl monkey	17
1	Long-nosed armadillo	17.4
2	North American Opossum	18
3	Arctic ground squirrel	16.6

Note in the code above, that each condition is wrapped in parentheses; this is to ensure that the correct output is obtained.

In Pandas/dplyr/rdatatable, there is a between function that makes selection such as the above easier; at the moment, there is no equivalent function in datatable; you can create a temporary between function:

def between(column, left, right):
    l = f[column]>=left
    r = f[column]<=right
    return l & r

DT[between('sleep_total', 16, 18),  ['name', 'sleep_total']]

	name	sleep_total
	▪▪▪▪	▪▪▪▪▪▪▪▪
0	Owl monkey	17
1	Long-nosed armadillo	17.4
2	North American Opossum	18
3	Arctic ground squirrel	16.6

There are scenarios where you may want to select rows where the value is nearly a given value. You may also want to specify a tolerance value to indicate how far the values can be.

This can be replicated with the isclose function in the datatable.math submodule.

Let’s assume that the tolerance should be within one standard deviation of 17:

from datatable.math import isclose

# calculate tolerance
tolerance = DT['sleep_total'].sd1()

DT[isclose(f.sleep_total, 17, atol = tolerance),  ['name', 'sleep_total']]

	name	sleep_total
	▪▪▪▪	▪▪▪▪▪▪▪▪
0	Owl monkey	17
1	Mountain beaver	14.4
2	Greater short-tailed shrew	14.9
3	Three-toed sloth	14.4
4	Long-nosed armadillo	17.4
5	North American Opossum	18
6	Big brown bat	19.7
7	Western american chipmunk	14.9
8	Thick-tailed opposum	19.4
9	Mongolian gerbil	14.2
10	Golden hamster	14.3
11	Vole	12.8
12	Little brown bat	19.9
13	Round-tailed muskrat	14.6
14	Northern grasshopper mouse	14.5
15	Tiger	15.8
16	Lion	13.5
17	Phalanger	13.7
18	Giant armadillo	18.1
19	Laboratory rat	13
20	Arctic ground squirrel	16.6
21	Thirteen-lined ground squirrel	13.8
22	Golden-mantled ground squirrel	15.9
23	Musk shrew	12.8
24	Eastern american chipmunk	15.8
25	Tenrec	15.6

Filtering based on String Matches#

You can select on string matches as well; in the example below, the == comparison operator is used to select a specific group of animals:

DT[f.order == "Didelphimorphia",  ["order", "name", "sleep_total"]]

	order	name	sleep_total
	▪▪▪▪	▪▪▪▪	▪▪▪▪▪▪▪▪
0	Didelphimorphia	North American Opossum	18
1	Didelphimorphia	Thick-tailed opposum	19.4

Other operators can be used also:

DT[f.order != 'Rodentia',  ['order', 'name', 'sleep_total']]

	order	name	sleep_total
	▪▪▪▪	▪▪▪▪	▪▪▪▪▪▪▪▪
0	Carnivora	Cheetah	12.1
1	Primates	Owl monkey	17
2	Soricomorpha	Greater short-tailed shrew	14.9
3	Artiodactyla	Cow	4
4	Pilosa	Three-toed sloth	14.4
5	Carnivora	Northern fur seal	8.7
6	Carnivora	Dog	10.1
7	Artiodactyla	Roe deer	3
8	Artiodactyla	Goat	5.3
9	Primates	Grivet	10
10	Soricomorpha	Star-nosed mole	10.3
11	Soricomorpha	Lesser short-tailed shrew	9.1
12	Cingulata	Long-nosed armadillo	17.4
13	Hyracoidea	Tree hyrax	5.3
14	Didelphimorphia	North American Opossum	18
⋮	⋮	⋮	⋮
56	Scandentia	Tree shrew	8.9
57	Cetacea	Bottle-nosed dolphin	5.2
58	Carnivora	Genet	6.3
59	Carnivora	Arctic fox	12.5
60	Carnivora	Red fox	9.8

DT[f.name > 'V', ['order', 'name', 'sleep_total']]

	order	name	sleep_total
	▪▪▪▪	▪▪▪▪	▪▪▪▪▪▪▪▪
0	Rodentia	Vesper mouse	7
1	Rodentia	Western american chipmunk	14.9
2	Rodentia	Vole	12.8

In the examples above, only one animal is used; to select more animals, you could pass a list of conditions, with the | (or) symbol:

rows = (f.order == "Didelphimorphia") | (f.order == "Diprotodontia")

columns = ['order', 'name', 'sleep_total']

DT[rows, columns]

	order	name	sleep_total
	▪▪▪▪	▪▪▪▪	▪▪▪▪▪▪▪▪
0	Didelphimorphia	North American Opossum	18
1	Didelphimorphia	Thick-tailed opposum	19.4
2	Diprotodontia	Phalanger	13.7
3	Diprotodontia	Potoroo	11.1

However, this can become unwieldy, as the number of animals increase. At the moment, there is no equivalent of python’s in operator in datatable; let’s create a temporary function to help with this:

from functools import reduce

def isin(column, sequence_of_labels):
    func = lambda x: f[column] == x
    return reduce(or_, map(func, sequence_of_labels))

labels =  ("Didelphimorphia", "Diprotodontia")

columns = ["order", "name", "sleep_total"]

DT[isin('order', labels), columns]

	order	name	sleep_total
	▪▪▪▪	▪▪▪▪	▪▪▪▪▪▪▪▪
0	Didelphimorphia	North American Opossum	18
1	Didelphimorphia	Thick-tailed opposum	19.4
2	Diprotodontia	Phalanger	13.7
3	Diprotodontia	Potoroo	11.1

You can also deselect certain groups using the isin function above, and combine it with the ~ symbol:

labels = ("Rodentia", "Carnivora", "Primates")

columns = ['order', 'name', 'sleep_total']

DT[~isin('order', labels), columns]

	order	name	sleep_total
	▪▪▪▪	▪▪▪▪	▪▪▪▪▪▪▪▪
0	Soricomorpha	Greater short-tailed shrew	14.9
1	Artiodactyla	Cow	4
2	Pilosa	Three-toed sloth	14.4
3	Artiodactyla	Roe deer	3
4	Artiodactyla	Goat	5.3
5	Soricomorpha	Star-nosed mole	10.3
6	Soricomorpha	Lesser short-tailed shrew	9.1
7	Cingulata	Long-nosed armadillo	17.4
8	Hyracoidea	Tree hyrax	5.3
9	Didelphimorphia	North American Opossum	18
10	Proboscidea	Asian elephant	3.9
11	Chiroptera	Big brown bat	19.7
12	Perissodactyla	Horse	2.9
13	Perissodactyla	Donkey	3.1
14	Erinaceomorpha	European hedgehog	10.1
⋮	⋮	⋮	⋮
32	Monotremata	Short-nosed echidna	8.6
33	Perissodactyla	Brazilian tapir	4.4
34	Afrosoricida	Tenrec	15.6
35	Scandentia	Tree shrew	8.9
36	Cetacea	Bottle-nosed dolphin	5.2

Filtering Rows Based on Regex#

There are scenarios where you need to filter string columns based on partial matches; a regular expression comes in handy here.

At the moment, there are very few string functions in datatable; However, we can make do with the re_match function, which is similar to Python’s re.match to get by.

Let’s filter for rows where mouse can be found in the column name:

columns = ['name', 'sleep_total']

# returns a boolean column 
row = dt.re.match(f.name, '.*mouse.*')

DT[rows, columns]

	name	sleep_total
	▪▪▪▪	▪▪▪▪▪▪▪▪
0	North American Opossum	18
1	Thick-tailed opposum	19.4
2	Phalanger	13.7
3	Potoroo	11.1

Filtering Rows based on Multiple Conditions#

Select rows with a bodywt above 100 and either have a sleep_total above 15, or are not part of the Carnivora order:

rows = (f.bodywt > 100) & ((f.sleep_total > 15) | (f.order != "Carnivora"))

columns = ["name", "order", slice("sleep_total", "bodywt")]

DT[rows, columns]

	name	order	sleep_total	sleep_rem	sleep_cycle	awake	brainwt	bodywt
	▪▪▪▪	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	Cow	Artiodactyla	4	0.7	0.666667	20	0.423	600
1	Asian elephant	Proboscidea	3.9	NA	NA	20.1	4.603	2547
2	Horse	Perissodactyla	2.9	0.6	1	21.1	0.655	521
3	Donkey	Perissodactyla	3.1	0.4	NA	20.9	0.419	187
4	Giraffe	Artiodactyla	1.9	0.4	NA	22.1	NA	899.995
5	Pilot whale	Cetacea	2.7	0.1	NA	21.35	NA	800
6	African elephant	Proboscidea	3.3	NA	NA	20.7	5.712	6654
7	Tiger	Carnivora	15.8	NA	NA	8.2	NA	162.564
8	Brazilian tapir	Perissodactyla	4.4	1	0.9	19.6	0.169	207.501
9	Bottle-nosed dolphin	Cetacea	5.2	NA	NA	18.8	NA	173.33

Return rows where bodywt is either greater than 100 or brainwt greater than 1, but not both:

rows = xor((f.bodywt > 100), (f.brainwt > 1))

columns = ["name", slice("bodywt", "brainwt")]

DT[rows, columns]

	name	bodywt	brainwt
	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	Cow	600	0.423
1	Horse	521	0.655
2	Donkey	187	0.419
3	Giraffe	899.995	NA
4	Pilot whale	800	NA
5	Human	62	1.32
6	Tiger	162.564	NA
7	Lion	161.499	NA
8	Brazilian tapir	207.501	0.169
9	Bottle-nosed dolphin	173.33	NA

Select all rows where brainwt is larger than 1, but bodywt does not exceed 100:

rows = (f.bodywt <= 100) & (f.brainwt > 1)

columns = ["name", "sleep_total", "brainwt", "bodywt"]

DT[rows, columns]

	name	sleep_total	brainwt	bodywt
	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	Human	8	1.32	62

Filtering out Empty Rows#

There are two options for filtering out empty rows - comparing with None, or using the isna function:

rows = f.conservation != None 

columns = ["name", slice("conservation", "sleep_cycle")]

DT[rows, columns]

	name	conservation	sleep_total	sleep_rem	sleep_cycle
	▪▪▪▪	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	Cheetah	lc	12.1	NA	NA
1	Mountain beaver	nt	14.4	2.4	NA
2	Greater short-tailed shrew	lc	14.9	2.3	0.133333
3	Cow	domesticated	4	0.7	0.666667
4	Northern fur seal	vu	8.7	1.4	0.383333
5	Dog	domesticated	10.1	2.9	0.333333
6	Roe deer	lc	3	NA	NA
7	Goat	lc	5.3	0.6	NA
8	Guinea pig	domesticated	9.4	0.8	0.216667
9	Grivet	lc	10	0.7	NA
10	Chinchilla	domesticated	12.5	1.5	0.116667
11	Star-nosed mole	lc	10.3	2.2	NA
12	Lesser short-tailed shrew	lc	9.1	1.4	0.15
13	Long-nosed armadillo	lc	17.4	3.1	0.383333
14	Tree hyrax	lc	5.3	0.5	NA
⋮	⋮	⋮	⋮	⋮	⋮
49	Arctic ground squirrel	lc	16.6	NA	NA
50	Thirteen-lined ground squirrel	lc	13.8	3.4	0.216667
51	Golden-mantled ground squirrel	lc	15.9	3	NA
52	Pig	domesticated	9.1	2.4	0.5
53	Brazilian tapir	vu	4.4	1	0.9

rows = ~dt.isna(f.conservation) 

columns = ["name", slice("conservation", "sleep_cycle")]

DT[rows, columns]

	name	conservation	sleep_total	sleep_rem	sleep_cycle
	▪▪▪▪	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	Cheetah	lc	12.1	NA	NA
1	Mountain beaver	nt	14.4	2.4	NA
2	Greater short-tailed shrew	lc	14.9	2.3	0.133333
3	Cow	domesticated	4	0.7	0.666667
4	Northern fur seal	vu	8.7	1.4	0.383333
5	Dog	domesticated	10.1	2.9	0.333333
6	Roe deer	lc	3	NA	NA
7	Goat	lc	5.3	0.6	NA
8	Guinea pig	domesticated	9.4	0.8	0.216667
9	Grivet	lc	10	0.7	NA
10	Chinchilla	domesticated	12.5	1.5	0.116667
11	Star-nosed mole	lc	10.3	2.2	NA
12	Lesser short-tailed shrew	lc	9.1	1.4	0.15
13	Long-nosed armadillo	lc	17.4	3.1	0.383333
14	Tree hyrax	lc	5.3	0.5	NA
⋮	⋮	⋮	⋮	⋮	⋮
49	Arctic ground squirrel	lc	16.6	NA	NA
50	Thirteen-lined ground squirrel	lc	13.8	3.4	0.216667
51	Golden-mantled ground squirrel	lc	15.9	3	NA
52	Pig	domesticated	9.1	2.4	0.5
53	Brazilian tapir	vu	4.4	1	0.9

Filtering across Multiple Columns#

Filter across all Columns#

It is possible to filter for rows based on values across columns.

One thing to note, and be careful about, is that in datatable, within the same bracket, operations in the i section, occur before any operation within the j section; as such, depending on the context, and to ensure the right output, it is better to select the columns first, then chain the row filtration via another bracket. The examples below should make this clearer.

Let’s filter for rows across the selected columns, keeping only rows where any column has the pattern Ca inside:

columns = f['name':'order', 'sleep_total'].remove(f.vore)
rows = dt.re.match(f[str], ".*Ca.*").rowany()

DT[rows, columns]

	name	genus	order	sleep_total
	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪▪▪▪▪
0	Cheetah	Acinonyx	Carnivora	12.1
1	Northern fur seal	Callorhinus	Carnivora	8.7
2	Vesper mouse	Calomys	Rodentia	7
3	Dog	Canis	Carnivora	10.1
4	Roe deer	Capreolus	Artiodactyla	3
5	Goat	Capri	Artiodactyla	5.3
6	Guinea pig	Cavis	Rodentia	9.4
7	Domestic cat	Felis	Carnivora	12.5
8	Gray seal	Haliochoerus	Carnivora	6.2
9	Tiger	Panthera	Carnivora	15.8
10	Jaguar	Panthera	Carnivora	10.4
11	Lion	Panthera	Carnivora	13.5
12	Caspian seal	Phoca	Carnivora	3.5
13	Genet	Genetta	Carnivora	6.3
14	Arctic fox	Vulpes	Carnivora	12.5
15	Red fox	Vulpes	Carnivora	9.8

Let’s look at another example, to filter for rows, across selected columns, where any column has a value less than 0.1:

columns = f['name', 'sleep_total':'bodywt']

rows = (f[int, float] < 0.1).rowany()

DT[rows, columns]

	name	sleep_total	sleep_rem	sleep_cycle	awake	brainwt	bodywt
	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	Owl monkey	17	1.8	NA	7	0.0155	0.48
1	Greater short-tailed shrew	14.9	2.3	0.133333	9.1	0.00029	0.019
2	Vesper mouse	7	NA	NA	17	NA	0.045
3	Dog	10.1	2.9	0.333333	13.9	0.07	14
4	Roe deer	3	NA	NA	21	0.0982	14.8
5	Guinea pig	9.4	0.8	0.216667	14.6	0.0055	0.728
6	Chinchilla	12.5	1.5	0.116667	11.5	0.0064	0.42
7	Star-nosed mole	10.3	2.2	NA	13.7	0.001	0.06
8	African giant pouched rat	8.3	2	NA	15.7	0.0066	1
9	Lesser short-tailed shrew	9.1	1.4	0.15	14.9	0.00014	0.005
10	Long-nosed armadillo	17.4	3.1	0.383333	6.6	0.0108	3.5
11	Tree hyrax	5.3	0.5	NA	18.7	0.0123	2.95
12	North American Opossum	18	4.9	0.333333	6	0.0063	1.7
13	Big brown bat	19.7	3.9	0.116667	4.3	0.0003	0.023
14	European hedgehog	10.1	3.5	0.283333	13.9	0.0035	0.77
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
42	Tenrec	15.6	2.3	NA	8.4	0.0026	0.9
43	Tree shrew	8.9	2.6	0.233333	15.1	0.0025	0.104
44	Genet	6.3	1.3	NA	17.7	0.0175	2
45	Arctic fox	12.5	NA	NA	11.5	0.0445	3.38
46	Red fox	9.8	2.4	0.35	14.2	0.0504	4.23

The above example only requires that at least one column has a value less than 0.1. What if the goal is to return only rows where all the columns have values above 1?

columns = f['name', 'sleep_total' : 'bodywt'].remove(f.awake)

rows = (f[int, float] > 1).rowall()

DT[rows, columns]

	name	sleep_total	sleep_rem	sleep_cycle	brainwt	bodywt
	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	Human	8	1.9	1.5	1.32	62

Note the change from rowany to rowall; rowany will return True for rows where ANY column matches the condition, whereas rowall will only return True for rows where ALL columns match the condition.

All the examples so far combine i and j within a single bracket; so why all the noise about context and selecting columns first before rows? The next section should shed more light.

You can also limit the filtration to columns that match a particular type; the examples above show how that can be done. This can be handy in certain situations where the target is not limited to one data type.

Consider the example below, where only rows that have nulls should be returned. Nulls can be both in numeric and string columns:

columns = f['name' : 'order', 'sleep_total' : 'sleep_rem']

rows = (f[:] == None).rowany()

DT[:, columns][rows, :]

	name	genus	vore	order	sleep_total	sleep_rem
	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	Cheetah	Acinonyx	carni	Carnivora	12.1	NA
1	Vesper mouse	Calomys	NA	Rodentia	7	NA
2	Roe deer	Capreolus	herbi	Artiodactyla	3	NA
3	Asian elephant	Elephas	herbi	Proboscidea	3.9	NA
4	Western american chipmunk	Eutamias	herbi	Rodentia	14.9	NA
5	African elephant	Loxodonta	herbi	Proboscidea	3.3	NA
6	Vole	Microtus	herbi	Rodentia	12.8	NA
7	Round-tailed muskrat	Neofiber	herbi	Rodentia	14.6	NA
8	Slow loris	Nyctibeus	carni	Primates	11	NA
9	Northern grasshopper mouse	Onychomys	carni	Rodentia	14.5	NA
10	Tiger	Panthera	carni	Carnivora	15.8	NA
11	Jaguar	Panthera	carni	Carnivora	10.4	NA
12	Lion	Panthera	carni	Carnivora	13.5	NA
13	Desert hedgehog	Paraechinus	NA	Erinaceomorpha	10.3	2.7
14	Potto	Perodicticus	omni	Primates	11	NA
15	Deer mouse	Peromyscus	NA	Rodentia	11.5	NA
16	Phalanger	Phalanger	NA	Diprotodontia	13.7	1.8
17	Common porpoise	Phocoena	carni	Cetacea	5.6	NA
18	Rock hyrax	Procavia	NA	Hyracoidea	5.4	0.5
19	African striped mouse	Rhabdomys	omni	Rodentia	8.7	NA
20	Mole rat	Spalax	NA	Rodentia	10.6	2.4
21	Arctic ground squirrel	Spermophilus	herbi	Rodentia	16.6	NA
22	Musk shrew	Suncus	NA	Soricomorpha	12.8	2
23	Short-nosed echidna	Tachyglossus	insecti	Monotremata	8.6	NA
24	Eastern american chipmunk	Tamias	herbi	Rodentia	15.8	NA
25	Bottle-nosed dolphin	Tursiops	carni	Cetacea	5.2	NA
26	Arctic fox	Vulpes	carni	Carnivora	12.5	NA

However, we are only interested in rows where the string columns are null; simply modifying the selected columns in rows should resolve this:

rows = (f[str] == None).rowany()

DT[rows, columns]

	name	genus	vore	order	sleep_total	sleep_rem
	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	Owl monkey	Aotus	omni	Primates	17	1.8
1	Three-toed sloth	Bradypus	herbi	Pilosa	14.4	2.2
2	Vesper mouse	Calomys	NA	Rodentia	7	NA
3	African giant pouched rat	Cricetomys	omni	Rodentia	8.3	2
4	Western american chipmunk	Eutamias	herbi	Rodentia	14.9	NA
5	Galago	Galago	omni	Primates	9.8	1.1
6	Human	Homo	omni	Primates	8	1.9
7	Macaque	Macaca	omni	Primates	10.1	1.2
8	Vole	Microtus	herbi	Rodentia	12.8	NA
9	Little brown bat	Myotis	insecti	Chiroptera	19.9	2
10	Slow loris	Nyctibeus	carni	Primates	11	NA
11	Chimpanzee	Pan	omni	Primates	9.7	1.4
12	Baboon	Papio	omni	Primates	9.4	1
13	Desert hedgehog	Paraechinus	NA	Erinaceomorpha	10.3	2.7
14	Deer mouse	Peromyscus	NA	Rodentia	11.5	NA
⋮	⋮	⋮	⋮	⋮	⋮	⋮
26	Tree shrew	Tupaia	omni	Scandentia	8.9	2.6
27	Bottle-nosed dolphin	Tursiops	carni	Cetacea	5.2	NA
28	Genet	Genetta	carni	Carnivora	6.3	1.3
29	Arctic fox	Vulpes	carni	Carnivora	12.5	NA
30	Red fox	Vulpes	carni	Carnivora	9.8	2.4

That doesn’t seem right. There are no missing values in rows 0 and 1, same for rows 26 and 30. What’s going on?

As mentioned earlier, operations in i occur before j; in the code above, ALL the string columns in the frame were filtered, and not restricted to the selected columns. The right way about it is to select the columns first, then chain the row filtration:

DT[:, columns][rows, :]

	name	genus	vore	order	sleep_total	sleep_rem
	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	Vesper mouse	Calomys	NA	Rodentia	7	NA
1	Desert hedgehog	Paraechinus	NA	Erinaceomorpha	10.3	2.7
2	Deer mouse	Peromyscus	NA	Rodentia	11.5	NA
3	Phalanger	Phalanger	NA	Diprotodontia	13.7	1.8
4	Rock hyrax	Procavia	NA	Hyracoidea	5.4	0.5
5	Mole rat	Spalax	NA	Rodentia	10.6	2.4
6	Musk shrew	Suncus	NA	Soricomorpha	12.8	2

Again, the filtration process depend on the context, and should be adapted accordingly.

Filter at#

It is also possible to filter rows based on specific columns:

columns = f['name', 'sleep_total' : 'sleep_rem', 'brainwt' : 'bodywt']

rows = (f['sleep_total', 'sleep_rem'] > 5).rowall()

DT[rows, columns]

	name	sleep_total	sleep_rem	brainwt	bodywt
	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	Thick-tailed opposum	19.4	6.6	NA	0.37
1	Giant armadillo	18.1	6.1	0.081	60

Note in the example above, the rows and columns are within the same bracket, because the columns are explicitly specified; only values from those columns will be used for the filtration.

Another example below that uses a different select option:

columns = f['name', 'sleep_total' : 'sleep_rem', 'brainwt' : 'bodywt']

rows = [name for name in DT[:, columns].names if 'sleep' in name]

rows = (f[rows]>5).rowall()

DT[rows, columns]

	name	sleep_total	sleep_rem	brainwt	bodywt
	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	Thick-tailed opposum	19.4	6.6	NA	0.37
1	Giant armadillo	18.1	6.1	0.081	60

Resources:

datatable docs
Based on datatable version 1.1