Filtering Rows in Datatable#
This article highlights various ways of filtering rows in python datatable. The examples used here are based off the excellent article by Susan Baert.
The data file can be accessed here
Basic Row Filters#
from datatable import dt, f
from operator import or_, xor
file_path = "Data_files/msleep.txt"
DT = dt.fread(file_path)
DT.head(5)
name | genus | vore | order | conservation | sleep_total | sleep_rem | sleep_cycle | awake | brainwt | bodywt | |
---|---|---|---|---|---|---|---|---|---|---|---|
▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | |
0 | Cheetah | Acinonyx | carni | Carnivora | lc | 12.1 | NA | NA | 11.9 | NA | 50 |
1 | Owl monkey | Aotus | omni | Primates | NA | 17 | 1.8 | NA | 7 | 0.0155 | 0.48 |
2 | Mountain beaver | Aplodontia | herbi | Rodentia | nt | 14.4 | 2.4 | NA | 9.6 | NA | 1.35 |
3 | Greater short-tailed shrew | Blarina | omni | Soricomorpha | lc | 14.9 | 2.3 | 0.133333 | 9.1 | 0.00029 | 0.019 |
4 | Cow | Bos | herbi | Artiodactyla | domesticated | 4 | 0.7 | 0.666667 | 20 | 0.423 | 600 |
Filtering Rows Based on a Numeric Variable#
You can filter numeric variables based on their values. A number of commonly used operators include: >, >=, <, <=, == and !=.
Note that in datatable, filtration occurs in the i
section:
DT[f.sleep_total > 18, ["name", "sleep_total"]]
name | sleep_total | |
---|---|---|
▪▪▪▪ | ▪▪▪▪▪▪▪▪ | |
0 | Big brown bat | 19.7 |
1 | Thick-tailed opposum | 19.4 |
2 | Little brown bat | 19.9 |
3 | Giant armadillo | 18.1 |
To select a range of values, you can use two logical requirements; in the example below, only rows where sleep_total
is greater than or equal to 16, and less than or equal to 18 are selected:
DT[(f.sleep_total >= 16) & (f.sleep_total <= 18), ["name", "sleep_total"]]
name | sleep_total | |
---|---|---|
▪▪▪▪ | ▪▪▪▪▪▪▪▪ | |
0 | Owl monkey | 17 |
1 | Long-nosed armadillo | 17.4 |
2 | North American Opossum | 18 |
3 | Arctic ground squirrel | 16.6 |
Note in the code above, that each condition is wrapped in parentheses; this is to ensure that the correct output is obtained.
In Pandas/dplyr/rdatatable, there is a between
function that makes selection such as the above easier; at the moment, there is no equivalent function in datatable; you can create a temporary between
function:
def between(column, left, right):
l = f[column]>=left
r = f[column]<=right
return l & r
DT[between('sleep_total', 16, 18), ['name', 'sleep_total']]
name | sleep_total | |
---|---|---|
▪▪▪▪ | ▪▪▪▪▪▪▪▪ | |
0 | Owl monkey | 17 |
1 | Long-nosed armadillo | 17.4 |
2 | North American Opossum | 18 |
3 | Arctic ground squirrel | 16.6 |
There are scenarios where you may want to select rows where the value is nearly a given value. You may also want to specify a tolerance value to indicate how far the values can be.
This can be replicated with the isclose function in the datatable.math submodule.
Let’s assume that the tolerance should be within one standard deviation of 17:
from datatable.math import isclose
# calculate tolerance
tolerance = DT['sleep_total'].sd1()
DT[isclose(f.sleep_total, 17, atol = tolerance), ['name', 'sleep_total']]
name | sleep_total | |
---|---|---|
▪▪▪▪ | ▪▪▪▪▪▪▪▪ | |
0 | Owl monkey | 17 |
1 | Mountain beaver | 14.4 |
2 | Greater short-tailed shrew | 14.9 |
3 | Three-toed sloth | 14.4 |
4 | Long-nosed armadillo | 17.4 |
5 | North American Opossum | 18 |
6 | Big brown bat | 19.7 |
7 | Western american chipmunk | 14.9 |
8 | Thick-tailed opposum | 19.4 |
9 | Mongolian gerbil | 14.2 |
10 | Golden hamster | 14.3 |
11 | Vole | 12.8 |
12 | Little brown bat | 19.9 |
13 | Round-tailed muskrat | 14.6 |
14 | Northern grasshopper mouse | 14.5 |
15 | Tiger | 15.8 |
16 | Lion | 13.5 |
17 | Phalanger | 13.7 |
18 | Giant armadillo | 18.1 |
19 | Laboratory rat | 13 |
20 | Arctic ground squirrel | 16.6 |
21 | Thirteen-lined ground squirrel | 13.8 |
22 | Golden-mantled ground squirrel | 15.9 |
23 | Musk shrew | 12.8 |
24 | Eastern american chipmunk | 15.8 |
25 | Tenrec | 15.6 |
Filtering based on String Matches#
You can select on string matches as well; in the example below, the ==
comparison operator is used to select a specific group of animals:
DT[f.order == "Didelphimorphia", ["order", "name", "sleep_total"]]
order | name | sleep_total | |
---|---|---|---|
▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | |
0 | Didelphimorphia | North American Opossum | 18 |
1 | Didelphimorphia | Thick-tailed opposum | 19.4 |
Other operators can be used also:
DT[f.order != 'Rodentia', ['order', 'name', 'sleep_total']]
order | name | sleep_total | |
---|---|---|---|
▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | |
0 | Carnivora | Cheetah | 12.1 |
1 | Primates | Owl monkey | 17 |
2 | Soricomorpha | Greater short-tailed shrew | 14.9 |
3 | Artiodactyla | Cow | 4 |
4 | Pilosa | Three-toed sloth | 14.4 |
5 | Carnivora | Northern fur seal | 8.7 |
6 | Carnivora | Dog | 10.1 |
7 | Artiodactyla | Roe deer | 3 |
8 | Artiodactyla | Goat | 5.3 |
9 | Primates | Grivet | 10 |
10 | Soricomorpha | Star-nosed mole | 10.3 |
11 | Soricomorpha | Lesser short-tailed shrew | 9.1 |
12 | Cingulata | Long-nosed armadillo | 17.4 |
13 | Hyracoidea | Tree hyrax | 5.3 |
14 | Didelphimorphia | North American Opossum | 18 |
⋮ | ⋮ | ⋮ | ⋮ |
56 | Scandentia | Tree shrew | 8.9 |
57 | Cetacea | Bottle-nosed dolphin | 5.2 |
58 | Carnivora | Genet | 6.3 |
59 | Carnivora | Arctic fox | 12.5 |
60 | Carnivora | Red fox | 9.8 |
DT[f.name > 'V', ['order', 'name', 'sleep_total']]
order | name | sleep_total | |
---|---|---|---|
▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | |
0 | Rodentia | Vesper mouse | 7 |
1 | Rodentia | Western american chipmunk | 14.9 |
2 | Rodentia | Vole | 12.8 |
In the examples above, only one animal is used; to select more animals, you could pass a list of conditions, with the |
(or) symbol:
rows = (f.order == "Didelphimorphia") | (f.order == "Diprotodontia")
columns = ['order', 'name', 'sleep_total']
DT[rows, columns]
order | name | sleep_total | |
---|---|---|---|
▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | |
0 | Didelphimorphia | North American Opossum | 18 |
1 | Didelphimorphia | Thick-tailed opposum | 19.4 |
2 | Diprotodontia | Phalanger | 13.7 |
3 | Diprotodontia | Potoroo | 11.1 |
However, this can become unwieldy, as the number of animals increase. At the moment, there is no equivalent of python’s in operator in datatable; let’s create a temporary function to help with this:
from functools import reduce
def isin(column, sequence_of_labels):
func = lambda x: f[column] == x
return reduce(or_, map(func, sequence_of_labels))
labels = ("Didelphimorphia", "Diprotodontia")
columns = ["order", "name", "sleep_total"]
DT[isin('order', labels), columns]
order | name | sleep_total | |
---|---|---|---|
▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | |
0 | Didelphimorphia | North American Opossum | 18 |
1 | Didelphimorphia | Thick-tailed opposum | 19.4 |
2 | Diprotodontia | Phalanger | 13.7 |
3 | Diprotodontia | Potoroo | 11.1 |
You can also deselect certain groups using the isin
function above, and combine it with the ~
symbol:
labels = ("Rodentia", "Carnivora", "Primates")
columns = ['order', 'name', 'sleep_total']
DT[~isin('order', labels), columns]
order | name | sleep_total | |
---|---|---|---|
▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | |
0 | Soricomorpha | Greater short-tailed shrew | 14.9 |
1 | Artiodactyla | Cow | 4 |
2 | Pilosa | Three-toed sloth | 14.4 |
3 | Artiodactyla | Roe deer | 3 |
4 | Artiodactyla | Goat | 5.3 |
5 | Soricomorpha | Star-nosed mole | 10.3 |
6 | Soricomorpha | Lesser short-tailed shrew | 9.1 |
7 | Cingulata | Long-nosed armadillo | 17.4 |
8 | Hyracoidea | Tree hyrax | 5.3 |
9 | Didelphimorphia | North American Opossum | 18 |
10 | Proboscidea | Asian elephant | 3.9 |
11 | Chiroptera | Big brown bat | 19.7 |
12 | Perissodactyla | Horse | 2.9 |
13 | Perissodactyla | Donkey | 3.1 |
14 | Erinaceomorpha | European hedgehog | 10.1 |
⋮ | ⋮ | ⋮ | ⋮ |
32 | Monotremata | Short-nosed echidna | 8.6 |
33 | Perissodactyla | Brazilian tapir | 4.4 |
34 | Afrosoricida | Tenrec | 15.6 |
35 | Scandentia | Tree shrew | 8.9 |
36 | Cetacea | Bottle-nosed dolphin | 5.2 |
Filtering Rows Based on Regex#
There are scenarios where you need to filter string columns based on partial matches; a regular expression comes in handy here.
At the moment, there are very few string functions in datatable; However, we can make do with the re_match
function, which is similar to Python’s re.match to get by.
Let’s filter for rows where mouse
can be found in the column name
:
columns = ['name', 'sleep_total']
# returns a boolean column
row = dt.re.match(f.name, '.*mouse.*')
DT[rows, columns]
name | sleep_total | |
---|---|---|
▪▪▪▪ | ▪▪▪▪▪▪▪▪ | |
0 | North American Opossum | 18 |
1 | Thick-tailed opposum | 19.4 |
2 | Phalanger | 13.7 |
3 | Potoroo | 11.1 |
Filtering Rows based on Multiple Conditions#
Select rows with a bodywt
above 100 and either have a sleep_total
above 15, or are not part of the Carnivora
order
:
rows = (f.bodywt > 100) & ((f.sleep_total > 15) | (f.order != "Carnivora"))
columns = ["name", "order", slice("sleep_total", "bodywt")]
DT[rows, columns]
name | order | sleep_total | sleep_rem | sleep_cycle | awake | brainwt | bodywt | |
---|---|---|---|---|---|---|---|---|
▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | |
0 | Cow | Artiodactyla | 4 | 0.7 | 0.666667 | 20 | 0.423 | 600 |
1 | Asian elephant | Proboscidea | 3.9 | NA | NA | 20.1 | 4.603 | 2547 |
2 | Horse | Perissodactyla | 2.9 | 0.6 | 1 | 21.1 | 0.655 | 521 |
3 | Donkey | Perissodactyla | 3.1 | 0.4 | NA | 20.9 | 0.419 | 187 |
4 | Giraffe | Artiodactyla | 1.9 | 0.4 | NA | 22.1 | NA | 899.995 |
5 | Pilot whale | Cetacea | 2.7 | 0.1 | NA | 21.35 | NA | 800 |
6 | African elephant | Proboscidea | 3.3 | NA | NA | 20.7 | 5.712 | 6654 |
7 | Tiger | Carnivora | 15.8 | NA | NA | 8.2 | NA | 162.564 |
8 | Brazilian tapir | Perissodactyla | 4.4 | 1 | 0.9 | 19.6 | 0.169 | 207.501 |
9 | Bottle-nosed dolphin | Cetacea | 5.2 | NA | NA | 18.8 | NA | 173.33 |
Return rows where bodywt
is either greater than 100 or brainwt
greater than 1, but not both:
rows = xor((f.bodywt > 100), (f.brainwt > 1))
columns = ["name", slice("bodywt", "brainwt")]
DT[rows, columns]
name | bodywt | brainwt | |
---|---|---|---|
▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | |
0 | Cow | 600 | 0.423 |
1 | Horse | 521 | 0.655 |
2 | Donkey | 187 | 0.419 |
3 | Giraffe | 899.995 | NA |
4 | Pilot whale | 800 | NA |
5 | Human | 62 | 1.32 |
6 | Tiger | 162.564 | NA |
7 | Lion | 161.499 | NA |
8 | Brazilian tapir | 207.501 | 0.169 |
9 | Bottle-nosed dolphin | 173.33 | NA |
Select all rows where brainwt
is larger than 1, but bodywt
does not exceed 100:
rows = (f.bodywt <= 100) & (f.brainwt > 1)
columns = ["name", "sleep_total", "brainwt", "bodywt"]
DT[rows, columns]
name | sleep_total | brainwt | bodywt | |
---|---|---|---|---|
▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | |
0 | Human | 8 | 1.32 | 62 |
Filtering out Empty Rows#
There are two options for filtering out empty rows - comparing with None
, or using the isna
function:
rows = f.conservation != None
columns = ["name", slice("conservation", "sleep_cycle")]
DT[rows, columns]
name | conservation | sleep_total | sleep_rem | sleep_cycle | |
---|---|---|---|---|---|
▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | |
0 | Cheetah | lc | 12.1 | NA | NA |
1 | Mountain beaver | nt | 14.4 | 2.4 | NA |
2 | Greater short-tailed shrew | lc | 14.9 | 2.3 | 0.133333 |
3 | Cow | domesticated | 4 | 0.7 | 0.666667 |
4 | Northern fur seal | vu | 8.7 | 1.4 | 0.383333 |
5 | Dog | domesticated | 10.1 | 2.9 | 0.333333 |
6 | Roe deer | lc | 3 | NA | NA |
7 | Goat | lc | 5.3 | 0.6 | NA |
8 | Guinea pig | domesticated | 9.4 | 0.8 | 0.216667 |
9 | Grivet | lc | 10 | 0.7 | NA |
10 | Chinchilla | domesticated | 12.5 | 1.5 | 0.116667 |
11 | Star-nosed mole | lc | 10.3 | 2.2 | NA |
12 | Lesser short-tailed shrew | lc | 9.1 | 1.4 | 0.15 |
13 | Long-nosed armadillo | lc | 17.4 | 3.1 | 0.383333 |
14 | Tree hyrax | lc | 5.3 | 0.5 | NA |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
49 | Arctic ground squirrel | lc | 16.6 | NA | NA |
50 | Thirteen-lined ground squirrel | lc | 13.8 | 3.4 | 0.216667 |
51 | Golden-mantled ground squirrel | lc | 15.9 | 3 | NA |
52 | Pig | domesticated | 9.1 | 2.4 | 0.5 |
53 | Brazilian tapir | vu | 4.4 | 1 | 0.9 |
rows = ~dt.isna(f.conservation)
columns = ["name", slice("conservation", "sleep_cycle")]
DT[rows, columns]
name | conservation | sleep_total | sleep_rem | sleep_cycle | |
---|---|---|---|---|---|
▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | |
0 | Cheetah | lc | 12.1 | NA | NA |
1 | Mountain beaver | nt | 14.4 | 2.4 | NA |
2 | Greater short-tailed shrew | lc | 14.9 | 2.3 | 0.133333 |
3 | Cow | domesticated | 4 | 0.7 | 0.666667 |
4 | Northern fur seal | vu | 8.7 | 1.4 | 0.383333 |
5 | Dog | domesticated | 10.1 | 2.9 | 0.333333 |
6 | Roe deer | lc | 3 | NA | NA |
7 | Goat | lc | 5.3 | 0.6 | NA |
8 | Guinea pig | domesticated | 9.4 | 0.8 | 0.216667 |
9 | Grivet | lc | 10 | 0.7 | NA |
10 | Chinchilla | domesticated | 12.5 | 1.5 | 0.116667 |
11 | Star-nosed mole | lc | 10.3 | 2.2 | NA |
12 | Lesser short-tailed shrew | lc | 9.1 | 1.4 | 0.15 |
13 | Long-nosed armadillo | lc | 17.4 | 3.1 | 0.383333 |
14 | Tree hyrax | lc | 5.3 | 0.5 | NA |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
49 | Arctic ground squirrel | lc | 16.6 | NA | NA |
50 | Thirteen-lined ground squirrel | lc | 13.8 | 3.4 | 0.216667 |
51 | Golden-mantled ground squirrel | lc | 15.9 | 3 | NA |
52 | Pig | domesticated | 9.1 | 2.4 | 0.5 |
53 | Brazilian tapir | vu | 4.4 | 1 | 0.9 |
Filtering across Multiple Columns#
Filter across all Columns#
It is possible to filter for rows based on values across columns.
One thing to note, and be careful about, is that in datatable, within the same bracket, operations in the i
section, occur before any operation within the j
section; as such, depending on the context, and to ensure the right output, it is better to select the columns first, then chain the row filtration via another bracket. The examples below should make this clearer.
Let’s filter for rows across the selected columns, keeping only rows where any column has the pattern Ca
inside:
columns = f['name':'order', 'sleep_total'].remove(f.vore)
rows = dt.re.match(f[str], ".*Ca.*").rowany()
DT[rows, columns]
name | genus | order | sleep_total | |
---|---|---|---|---|
▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | |
0 | Cheetah | Acinonyx | Carnivora | 12.1 |
1 | Northern fur seal | Callorhinus | Carnivora | 8.7 |
2 | Vesper mouse | Calomys | Rodentia | 7 |
3 | Dog | Canis | Carnivora | 10.1 |
4 | Roe deer | Capreolus | Artiodactyla | 3 |
5 | Goat | Capri | Artiodactyla | 5.3 |
6 | Guinea pig | Cavis | Rodentia | 9.4 |
7 | Domestic cat | Felis | Carnivora | 12.5 |
8 | Gray seal | Haliochoerus | Carnivora | 6.2 |
9 | Tiger | Panthera | Carnivora | 15.8 |
10 | Jaguar | Panthera | Carnivora | 10.4 |
11 | Lion | Panthera | Carnivora | 13.5 |
12 | Caspian seal | Phoca | Carnivora | 3.5 |
13 | Genet | Genetta | Carnivora | 6.3 |
14 | Arctic fox | Vulpes | Carnivora | 12.5 |
15 | Red fox | Vulpes | Carnivora | 9.8 |
Let’s look at another example, to filter for rows, across selected columns, where any column has a value less than 0.1:
columns = f['name', 'sleep_total':'bodywt']
rows = (f[int, float] < 0.1).rowany()
DT[rows, columns]
name | sleep_total | sleep_rem | sleep_cycle | awake | brainwt | bodywt | |
---|---|---|---|---|---|---|---|
▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | |
0 | Owl monkey | 17 | 1.8 | NA | 7 | 0.0155 | 0.48 |
1 | Greater short-tailed shrew | 14.9 | 2.3 | 0.133333 | 9.1 | 0.00029 | 0.019 |
2 | Vesper mouse | 7 | NA | NA | 17 | NA | 0.045 |
3 | Dog | 10.1 | 2.9 | 0.333333 | 13.9 | 0.07 | 14 |
4 | Roe deer | 3 | NA | NA | 21 | 0.0982 | 14.8 |
5 | Guinea pig | 9.4 | 0.8 | 0.216667 | 14.6 | 0.0055 | 0.728 |
6 | Chinchilla | 12.5 | 1.5 | 0.116667 | 11.5 | 0.0064 | 0.42 |
7 | Star-nosed mole | 10.3 | 2.2 | NA | 13.7 | 0.001 | 0.06 |
8 | African giant pouched rat | 8.3 | 2 | NA | 15.7 | 0.0066 | 1 |
9 | Lesser short-tailed shrew | 9.1 | 1.4 | 0.15 | 14.9 | 0.00014 | 0.005 |
10 | Long-nosed armadillo | 17.4 | 3.1 | 0.383333 | 6.6 | 0.0108 | 3.5 |
11 | Tree hyrax | 5.3 | 0.5 | NA | 18.7 | 0.0123 | 2.95 |
12 | North American Opossum | 18 | 4.9 | 0.333333 | 6 | 0.0063 | 1.7 |
13 | Big brown bat | 19.7 | 3.9 | 0.116667 | 4.3 | 0.0003 | 0.023 |
14 | European hedgehog | 10.1 | 3.5 | 0.283333 | 13.9 | 0.0035 | 0.77 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
42 | Tenrec | 15.6 | 2.3 | NA | 8.4 | 0.0026 | 0.9 |
43 | Tree shrew | 8.9 | 2.6 | 0.233333 | 15.1 | 0.0025 | 0.104 |
44 | Genet | 6.3 | 1.3 | NA | 17.7 | 0.0175 | 2 |
45 | Arctic fox | 12.5 | NA | NA | 11.5 | 0.0445 | 3.38 |
46 | Red fox | 9.8 | 2.4 | 0.35 | 14.2 | 0.0504 | 4.23 |
The above example only requires that at least one column has a value less than 0.1. What if the goal is to return only rows where all the columns have values above 1?
columns = f['name', 'sleep_total' : 'bodywt'].remove(f.awake)
rows = (f[int, float] > 1).rowall()
DT[rows, columns]
name | sleep_total | sleep_rem | sleep_cycle | brainwt | bodywt | |
---|---|---|---|---|---|---|
▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | |
0 | Human | 8 | 1.9 | 1.5 | 1.32 | 62 |
Note the change from rowany to rowall; rowany will return True
for rows where ANY
column matches the condition, whereas rowall will only return True
for rows where ALL
columns match the condition.
All the examples so far combine i
and j
within a single bracket; so why all the noise about context and selecting columns first before rows? The next section should shed more light.
You can also limit the filtration to columns that match a particular type; the examples above show how that can be done. This can be handy in certain situations where the target is not limited to one data type.
Consider the example below, where only rows that have nulls should be returned. Nulls can be both in numeric and string columns:
columns = f['name' : 'order', 'sleep_total' : 'sleep_rem']
rows = (f[:] == None).rowany()
DT[:, columns][rows, :]
name | genus | vore | order | sleep_total | sleep_rem | |
---|---|---|---|---|---|---|
▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | |
0 | Cheetah | Acinonyx | carni | Carnivora | 12.1 | NA |
1 | Vesper mouse | Calomys | NA | Rodentia | 7 | NA |
2 | Roe deer | Capreolus | herbi | Artiodactyla | 3 | NA |
3 | Asian elephant | Elephas | herbi | Proboscidea | 3.9 | NA |
4 | Western american chipmunk | Eutamias | herbi | Rodentia | 14.9 | NA |
5 | African elephant | Loxodonta | herbi | Proboscidea | 3.3 | NA |
6 | Vole | Microtus | herbi | Rodentia | 12.8 | NA |
7 | Round-tailed muskrat | Neofiber | herbi | Rodentia | 14.6 | NA |
8 | Slow loris | Nyctibeus | carni | Primates | 11 | NA |
9 | Northern grasshopper mouse | Onychomys | carni | Rodentia | 14.5 | NA |
10 | Tiger | Panthera | carni | Carnivora | 15.8 | NA |
11 | Jaguar | Panthera | carni | Carnivora | 10.4 | NA |
12 | Lion | Panthera | carni | Carnivora | 13.5 | NA |
13 | Desert hedgehog | Paraechinus | NA | Erinaceomorpha | 10.3 | 2.7 |
14 | Potto | Perodicticus | omni | Primates | 11 | NA |
15 | Deer mouse | Peromyscus | NA | Rodentia | 11.5 | NA |
16 | Phalanger | Phalanger | NA | Diprotodontia | 13.7 | 1.8 |
17 | Common porpoise | Phocoena | carni | Cetacea | 5.6 | NA |
18 | Rock hyrax | Procavia | NA | Hyracoidea | 5.4 | 0.5 |
19 | African striped mouse | Rhabdomys | omni | Rodentia | 8.7 | NA |
20 | Mole rat | Spalax | NA | Rodentia | 10.6 | 2.4 |
21 | Arctic ground squirrel | Spermophilus | herbi | Rodentia | 16.6 | NA |
22 | Musk shrew | Suncus | NA | Soricomorpha | 12.8 | 2 |
23 | Short-nosed echidna | Tachyglossus | insecti | Monotremata | 8.6 | NA |
24 | Eastern american chipmunk | Tamias | herbi | Rodentia | 15.8 | NA |
25 | Bottle-nosed dolphin | Tursiops | carni | Cetacea | 5.2 | NA |
26 | Arctic fox | Vulpes | carni | Carnivora | 12.5 | NA |
However, we are only interested in rows where the string columns are null; simply modifying the selected columns in rows
should resolve this:
rows = (f[str] == None).rowany()
DT[rows, columns]
name | genus | vore | order | sleep_total | sleep_rem | |
---|---|---|---|---|---|---|
▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | |
0 | Owl monkey | Aotus | omni | Primates | 17 | 1.8 |
1 | Three-toed sloth | Bradypus | herbi | Pilosa | 14.4 | 2.2 |
2 | Vesper mouse | Calomys | NA | Rodentia | 7 | NA |
3 | African giant pouched rat | Cricetomys | omni | Rodentia | 8.3 | 2 |
4 | Western american chipmunk | Eutamias | herbi | Rodentia | 14.9 | NA |
5 | Galago | Galago | omni | Primates | 9.8 | 1.1 |
6 | Human | Homo | omni | Primates | 8 | 1.9 |
7 | Macaque | Macaca | omni | Primates | 10.1 | 1.2 |
8 | Vole | Microtus | herbi | Rodentia | 12.8 | NA |
9 | Little brown bat | Myotis | insecti | Chiroptera | 19.9 | 2 |
10 | Slow loris | Nyctibeus | carni | Primates | 11 | NA |
11 | Chimpanzee | Pan | omni | Primates | 9.7 | 1.4 |
12 | Baboon | Papio | omni | Primates | 9.4 | 1 |
13 | Desert hedgehog | Paraechinus | NA | Erinaceomorpha | 10.3 | 2.7 |
14 | Deer mouse | Peromyscus | NA | Rodentia | 11.5 | NA |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
26 | Tree shrew | Tupaia | omni | Scandentia | 8.9 | 2.6 |
27 | Bottle-nosed dolphin | Tursiops | carni | Cetacea | 5.2 | NA |
28 | Genet | Genetta | carni | Carnivora | 6.3 | 1.3 |
29 | Arctic fox | Vulpes | carni | Carnivora | 12.5 | NA |
30 | Red fox | Vulpes | carni | Carnivora | 9.8 | 2.4 |
That doesn’t seem right. There are no missing values in rows 0 and 1, same for rows 26 and 30. What’s going on?
As mentioned earlier, operations in i
occur before j
; in the code above, ALL the string columns in the frame were filtered, and not restricted to the selected columns. The right way about it is to select the columns first, then chain the row filtration:
DT[:, columns][rows, :]
name | genus | vore | order | sleep_total | sleep_rem | |
---|---|---|---|---|---|---|
▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | |
0 | Vesper mouse | Calomys | NA | Rodentia | 7 | NA |
1 | Desert hedgehog | Paraechinus | NA | Erinaceomorpha | 10.3 | 2.7 |
2 | Deer mouse | Peromyscus | NA | Rodentia | 11.5 | NA |
3 | Phalanger | Phalanger | NA | Diprotodontia | 13.7 | 1.8 |
4 | Rock hyrax | Procavia | NA | Hyracoidea | 5.4 | 0.5 |
5 | Mole rat | Spalax | NA | Rodentia | 10.6 | 2.4 |
6 | Musk shrew | Suncus | NA | Soricomorpha | 12.8 | 2 |
Again, the filtration process depend on the context, and should be adapted accordingly.
Filter at#
It is also possible to filter rows based on specific columns:
columns = f['name', 'sleep_total' : 'sleep_rem', 'brainwt' : 'bodywt']
rows = (f['sleep_total', 'sleep_rem'] > 5).rowall()
DT[rows, columns]
name | sleep_total | sleep_rem | brainwt | bodywt | |
---|---|---|---|---|---|
▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | |
0 | Thick-tailed opposum | 19.4 | 6.6 | NA | 0.37 |
1 | Giant armadillo | 18.1 | 6.1 | 0.081 | 60 |
Note in the example above, the rows
and columns
are within the same bracket, because the columns are explicitly specified; only values from those columns will be used for the filtration.
Another example below that uses a different select option:
columns = f['name', 'sleep_total' : 'sleep_rem', 'brainwt' : 'bodywt']
rows = [name for name in DT[:, columns].names if 'sleep' in name]
rows = (f[rows]>5).rowall()
DT[rows, columns]
name | sleep_total | sleep_rem | brainwt | bodywt | |
---|---|---|---|---|---|
▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | |
0 | Thick-tailed opposum | 19.4 | 6.6 | NA | 0.37 |
1 | Giant armadillo | 18.1 | 6.1 | 0.081 | 60 |
Resources:
Based on datatable version
1.1
Comments#