Filtering Rows in Datatable#

This article highlights various ways of filtering rows in python datatable. The examples used here are based off the excellent article by Susan Baert.

The data file can be accessed here

Basic Row Filters#

from datatable import dt, f
from operator import  or_, xor
file_path = "Data_files/msleep.txt"

DT = dt.fread(file_path)

DT.head(5)
namegenusvoreorderconservationsleep_totalsleep_remsleep_cycleawakebrainwtbodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivoralc12.1NANA11.9NA50
1Owl monkeyAotusomniPrimatesNA171.8NA70.01550.48
2Mountain beaverAplodontiaherbiRodentiant14.42.4NA9.6NA1.35
3Greater short-tailed shrewBlarinaomniSoricomorphalc14.92.30.1333339.10.000290.019
4CowBosherbiArtiodactyladomesticated40.70.666667200.423600

Filtering Rows Based on a Numeric Variable#

You can filter numeric variables based on their values. A number of commonly used operators include: >, >=, <, <=, == and !=.

Note that in datatable, filtration occurs in the i section:

DT[f.sleep_total > 18,  ["name", "sleep_total"]]
namesleep_total
▪▪▪▪▪▪▪▪▪▪▪▪
0Big brown bat19.7
1Thick-tailed opposum19.4
2Little brown bat19.9
3Giant armadillo18.1

To select a range of values, you can use two logical requirements; in the example below, only rows where sleep_total is greater than or equal to 16, and less than or equal to 18 are selected:

DT[(f.sleep_total >= 16) & (f.sleep_total <= 18),  ["name", "sleep_total"]]
namesleep_total
▪▪▪▪▪▪▪▪▪▪▪▪
0Owl monkey17
1Long-nosed armadillo17.4
2North American Opossum18
3Arctic ground squirrel16.6

Note in the code above, that each condition is wrapped in parentheses; this is to ensure that the correct output is obtained.

In Pandas/dplyr/rdatatable, there is a between function that makes selection such as the above easier; at the moment, there is no equivalent function in datatable; you can create a temporary between function:

def between(column, left, right):
    l = f[column]>=left
    r = f[column]<=right
    return l & r

DT[between('sleep_total', 16, 18),  ['name', 'sleep_total']]
namesleep_total
▪▪▪▪▪▪▪▪▪▪▪▪
0Owl monkey17
1Long-nosed armadillo17.4
2North American Opossum18
3Arctic ground squirrel16.6

There are scenarios where you may want to select rows where the value is nearly a given value. You may also want to specify a tolerance value to indicate how far the values can be.

This can be replicated with the isclose function in the datatable.math submodule.

Let’s assume that the tolerance should be within one standard deviation of 17:

from datatable.math import isclose

# calculate tolerance
tolerance = DT['sleep_total'].sd1()

DT[isclose(f.sleep_total, 17, atol = tolerance),  ['name', 'sleep_total']]
namesleep_total
▪▪▪▪▪▪▪▪▪▪▪▪
0Owl monkey17
1Mountain beaver14.4
2Greater short-tailed shrew14.9
3Three-toed sloth14.4
4Long-nosed armadillo17.4
5North American Opossum18
6Big brown bat19.7
7Western american chipmunk14.9
8Thick-tailed opposum19.4
9Mongolian gerbil14.2
10Golden hamster14.3
11Vole12.8
12Little brown bat19.9
13Round-tailed muskrat14.6
14Northern grasshopper mouse14.5
15Tiger15.8
16Lion13.5
17Phalanger13.7
18Giant armadillo18.1
19Laboratory rat13
20Arctic ground squirrel16.6
21Thirteen-lined ground squirrel13.8
22Golden-mantled ground squirrel15.9
23Musk shrew12.8
24Eastern american chipmunk15.8
25Tenrec15.6

Filtering based on String Matches#

You can select on string matches as well; in the example below, the == comparison operator is used to select a specific group of animals:

DT[f.order == "Didelphimorphia",  ["order", "name", "sleep_total"]]
ordernamesleep_total
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0DidelphimorphiaNorth American Opossum18
1DidelphimorphiaThick-tailed opposum19.4

Other operators can be used also:

DT[f.order != 'Rodentia',  ['order', 'name', 'sleep_total']]
ordernamesleep_total
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CarnivoraCheetah12.1
1PrimatesOwl monkey17
2SoricomorphaGreater short-tailed shrew14.9
3ArtiodactylaCow4
4PilosaThree-toed sloth14.4
5CarnivoraNorthern fur seal8.7
6CarnivoraDog10.1
7ArtiodactylaRoe deer3
8ArtiodactylaGoat5.3
9PrimatesGrivet10
10SoricomorphaStar-nosed mole10.3
11SoricomorphaLesser short-tailed shrew9.1
12CingulataLong-nosed armadillo17.4
13HyracoideaTree hyrax5.3
14DidelphimorphiaNorth American Opossum18
56ScandentiaTree shrew8.9
57CetaceaBottle-nosed dolphin5.2
58CarnivoraGenet6.3
59CarnivoraArctic fox12.5
60CarnivoraRed fox9.8
DT[f.name > 'V', ['order', 'name', 'sleep_total']]
ordernamesleep_total
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0RodentiaVesper mouse7
1RodentiaWestern american chipmunk14.9
2RodentiaVole12.8

In the examples above, only one animal is used; to select more animals, you could pass a list of conditions, with the | (or) symbol:

rows = (f.order == "Didelphimorphia") | (f.order == "Diprotodontia")

columns = ['order', 'name', 'sleep_total']

DT[rows, columns]
ordernamesleep_total
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0DidelphimorphiaNorth American Opossum18
1DidelphimorphiaThick-tailed opposum19.4
2DiprotodontiaPhalanger13.7
3DiprotodontiaPotoroo11.1

However, this can become unwieldy, as the number of animals increase. At the moment, there is no equivalent of python’s in operator in datatable; let’s create a temporary function to help with this:

from functools import reduce

def isin(column, sequence_of_labels):
    func = lambda x: f[column] == x
    return reduce(or_, map(func, sequence_of_labels))


labels =  ("Didelphimorphia", "Diprotodontia")

columns = ["order", "name", "sleep_total"]

DT[isin('order', labels), columns]
ordernamesleep_total
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0DidelphimorphiaNorth American Opossum18
1DidelphimorphiaThick-tailed opposum19.4
2DiprotodontiaPhalanger13.7
3DiprotodontiaPotoroo11.1

You can also deselect certain groups using the isin function above, and combine it with the ~ symbol:

labels = ("Rodentia", "Carnivora", "Primates")

columns = ['order', 'name', 'sleep_total']

DT[~isin('order', labels), columns]
ordernamesleep_total
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0SoricomorphaGreater short-tailed shrew14.9
1ArtiodactylaCow4
2PilosaThree-toed sloth14.4
3ArtiodactylaRoe deer3
4ArtiodactylaGoat5.3
5SoricomorphaStar-nosed mole10.3
6SoricomorphaLesser short-tailed shrew9.1
7CingulataLong-nosed armadillo17.4
8HyracoideaTree hyrax5.3
9DidelphimorphiaNorth American Opossum18
10ProboscideaAsian elephant3.9
11ChiropteraBig brown bat19.7
12PerissodactylaHorse2.9
13PerissodactylaDonkey3.1
14ErinaceomorphaEuropean hedgehog10.1
32MonotremataShort-nosed echidna8.6
33PerissodactylaBrazilian tapir4.4
34AfrosoricidaTenrec15.6
35ScandentiaTree shrew8.9
36CetaceaBottle-nosed dolphin5.2

Filtering Rows Based on Regex#

There are scenarios where you need to filter string columns based on partial matches; a regular expression comes in handy here.

At the moment, there are very few string functions in datatable; However, we can make do with the re_match function, which is similar to Python’s re.match to get by.

Let’s filter for rows where mouse can be found in the column name:

columns = ['name', 'sleep_total']

# returns a boolean column 
row = dt.re.match(f.name, '.*mouse.*')

DT[rows, columns]
namesleep_total
▪▪▪▪▪▪▪▪▪▪▪▪
0North American Opossum18
1Thick-tailed opposum19.4
2Phalanger13.7
3Potoroo11.1

Filtering Rows based on Multiple Conditions#

Select rows with a bodywt above 100 and either have a sleep_total above 15, or are not part of the Carnivora order:

rows = (f.bodywt > 100) & ((f.sleep_total > 15) | (f.order != "Carnivora"))

columns = ["name", "order", slice("sleep_total", "bodywt")]

DT[rows, columns]
nameordersleep_totalsleep_remsleep_cycleawakebrainwtbodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CowArtiodactyla40.70.666667200.423600
1Asian elephantProboscidea3.9NANA20.14.6032547
2HorsePerissodactyla2.90.6121.10.655521
3DonkeyPerissodactyla3.10.4NA20.90.419187
4GiraffeArtiodactyla1.90.4NA22.1NA899.995
5Pilot whaleCetacea2.70.1NA21.35NA800
6African elephantProboscidea3.3NANA20.75.7126654
7TigerCarnivora15.8NANA8.2NA162.564
8Brazilian tapirPerissodactyla4.410.919.60.169207.501
9Bottle-nosed dolphinCetacea5.2NANA18.8NA173.33

Return rows where bodywt is either greater than 100 or brainwt greater than 1, but not both:

rows = xor((f.bodywt > 100), (f.brainwt > 1))

columns = ["name", slice("bodywt", "brainwt")]

DT[rows, columns]
namebodywtbrainwt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0Cow6000.423
1Horse5210.655
2Donkey1870.419
3Giraffe899.995NA
4Pilot whale800NA
5Human621.32
6Tiger162.564NA
7Lion161.499NA
8Brazilian tapir207.5010.169
9Bottle-nosed dolphin173.33NA

Select all rows where brainwt is larger than 1, but bodywt does not exceed 100:

rows = (f.bodywt <= 100) & (f.brainwt > 1)

columns = ["name", "sleep_total", "brainwt", "bodywt"]

DT[rows, columns]
namesleep_totalbrainwtbodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0Human81.3262

Filtering out Empty Rows#

There are two options for filtering out empty rows - comparing with None, or using the isna function:

rows = f.conservation != None 

columns = ["name", slice("conservation", "sleep_cycle")]

DT[rows, columns]
nameconservationsleep_totalsleep_remsleep_cycle
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0Cheetahlc12.1NANA
1Mountain beavernt14.42.4NA
2Greater short-tailed shrewlc14.92.30.133333
3Cowdomesticated40.70.666667
4Northern fur sealvu8.71.40.383333
5Dogdomesticated10.12.90.333333
6Roe deerlc3NANA
7Goatlc5.30.6NA
8Guinea pigdomesticated9.40.80.216667
9Grivetlc100.7NA
10Chinchilladomesticated12.51.50.116667
11Star-nosed molelc10.32.2NA
12Lesser short-tailed shrewlc9.11.40.15
13Long-nosed armadillolc17.43.10.383333
14Tree hyraxlc5.30.5NA
49Arctic ground squirrellc16.6NANA
50Thirteen-lined ground squirrellc13.83.40.216667
51Golden-mantled ground squirrellc15.93NA
52Pigdomesticated9.12.40.5
53Brazilian tapirvu4.410.9
rows = ~dt.isna(f.conservation) 

columns = ["name", slice("conservation", "sleep_cycle")]

DT[rows, columns]
nameconservationsleep_totalsleep_remsleep_cycle
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0Cheetahlc12.1NANA
1Mountain beavernt14.42.4NA
2Greater short-tailed shrewlc14.92.30.133333
3Cowdomesticated40.70.666667
4Northern fur sealvu8.71.40.383333
5Dogdomesticated10.12.90.333333
6Roe deerlc3NANA
7Goatlc5.30.6NA
8Guinea pigdomesticated9.40.80.216667
9Grivetlc100.7NA
10Chinchilladomesticated12.51.50.116667
11Star-nosed molelc10.32.2NA
12Lesser short-tailed shrewlc9.11.40.15
13Long-nosed armadillolc17.43.10.383333
14Tree hyraxlc5.30.5NA
49Arctic ground squirrellc16.6NANA
50Thirteen-lined ground squirrellc13.83.40.216667
51Golden-mantled ground squirrellc15.93NA
52Pigdomesticated9.12.40.5
53Brazilian tapirvu4.410.9

Filtering across Multiple Columns#

Filter across all Columns#

It is possible to filter for rows based on values across columns.

One thing to note, and be careful about, is that in datatable, within the same bracket, operations in the i section, occur before any operation within the j section; as such, depending on the context, and to ensure the right output, it is better to select the columns first, then chain the row filtration via another bracket. The examples below should make this clearer.

Let’s filter for rows across the selected columns, keeping only rows where any column has the pattern Ca inside:

columns = f['name':'order', 'sleep_total'].remove(f.vore)
rows = dt.re.match(f[str], ".*Ca.*").rowany()

DT[rows, columns]
namegenusordersleep_total
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxCarnivora12.1
1Northern fur sealCallorhinusCarnivora8.7
2Vesper mouseCalomysRodentia7
3DogCanisCarnivora10.1
4Roe deerCapreolusArtiodactyla3
5GoatCapriArtiodactyla5.3
6Guinea pigCavisRodentia9.4
7Domestic catFelisCarnivora12.5
8Gray sealHaliochoerusCarnivora6.2
9TigerPantheraCarnivora15.8
10JaguarPantheraCarnivora10.4
11LionPantheraCarnivora13.5
12Caspian sealPhocaCarnivora3.5
13GenetGenettaCarnivora6.3
14Arctic foxVulpesCarnivora12.5
15Red foxVulpesCarnivora9.8

Let’s look at another example, to filter for rows, across selected columns, where any column has a value less than 0.1:

columns = f['name', 'sleep_total':'bodywt']

rows = (f[int, float] < 0.1).rowany()

DT[rows, columns]
namesleep_totalsleep_remsleep_cycleawakebrainwtbodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0Owl monkey171.8NA70.01550.48
1Greater short-tailed shrew14.92.30.1333339.10.000290.019
2Vesper mouse7NANA17NA0.045
3Dog10.12.90.33333313.90.0714
4Roe deer3NANA210.098214.8
5Guinea pig9.40.80.21666714.60.00550.728
6Chinchilla12.51.50.11666711.50.00640.42
7Star-nosed mole10.32.2NA13.70.0010.06
8African giant pouched rat8.32NA15.70.00661
9Lesser short-tailed shrew9.11.40.1514.90.000140.005
10Long-nosed armadillo17.43.10.3833336.60.01083.5
11Tree hyrax5.30.5NA18.70.01232.95
12North American Opossum184.90.33333360.00631.7
13Big brown bat19.73.90.1166674.30.00030.023
14European hedgehog10.13.50.28333313.90.00350.77
42Tenrec15.62.3NA8.40.00260.9
43Tree shrew8.92.60.23333315.10.00250.104
44Genet6.31.3NA17.70.01752
45Arctic fox12.5NANA11.50.04453.38
46Red fox9.82.40.3514.20.05044.23

The above example only requires that at least one column has a value less than 0.1. What if the goal is to return only rows where all the columns have values above 1?

columns = f['name', 'sleep_total' : 'bodywt'].remove(f.awake)

rows = (f[int, float] > 1).rowall()

DT[rows, columns]
namesleep_totalsleep_remsleep_cyclebrainwtbodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0Human81.91.51.3262

Note the change from rowany to rowall; rowany will return True for rows where ANY column matches the condition, whereas rowall will only return True for rows where ALL columns match the condition.

All the examples so far combine i and j within a single bracket; so why all the noise about context and selecting columns first before rows? The next section should shed more light.

You can also limit the filtration to columns that match a particular type; the examples above show how that can be done. This can be handy in certain situations where the target is not limited to one data type.

Consider the example below, where only rows that have nulls should be returned. Nulls can be both in numeric and string columns:

columns = f['name' : 'order', 'sleep_total' : 'sleep_rem']

rows = (f[:] == None).rowany()

DT[:, columns][rows, :]
namegenusvoreordersleep_totalsleep_rem
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivora12.1NA
1Vesper mouseCalomysNARodentia7NA
2Roe deerCapreolusherbiArtiodactyla3NA
3Asian elephantElephasherbiProboscidea3.9NA
4Western american chipmunkEutamiasherbiRodentia14.9NA
5African elephantLoxodontaherbiProboscidea3.3NA
6VoleMicrotusherbiRodentia12.8NA
7Round-tailed muskratNeofiberherbiRodentia14.6NA
8Slow lorisNyctibeuscarniPrimates11NA
9Northern grasshopper mouseOnychomyscarniRodentia14.5NA
10TigerPantheracarniCarnivora15.8NA
11JaguarPantheracarniCarnivora10.4NA
12LionPantheracarniCarnivora13.5NA
13Desert hedgehogParaechinusNAErinaceomorpha10.32.7
14PottoPerodicticusomniPrimates11NA
15Deer mousePeromyscusNARodentia11.5NA
16PhalangerPhalangerNADiprotodontia13.71.8
17Common porpoisePhocoenacarniCetacea5.6NA
18Rock hyraxProcaviaNAHyracoidea5.40.5
19African striped mouseRhabdomysomniRodentia8.7NA
20Mole ratSpalaxNARodentia10.62.4
21Arctic ground squirrelSpermophilusherbiRodentia16.6NA
22Musk shrewSuncusNASoricomorpha12.82
23Short-nosed echidnaTachyglossusinsectiMonotremata8.6NA
24Eastern american chipmunkTamiasherbiRodentia15.8NA
25Bottle-nosed dolphinTursiopscarniCetacea5.2NA
26Arctic foxVulpescarniCarnivora12.5NA

However, we are only interested in rows where the string columns are null; simply modifying the selected columns in rows should resolve this:

rows = (f[str] == None).rowany()

DT[rows, columns]
namegenusvoreordersleep_totalsleep_rem
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0Owl monkeyAotusomniPrimates171.8
1Three-toed slothBradypusherbiPilosa14.42.2
2Vesper mouseCalomysNARodentia7NA
3African giant pouched ratCricetomysomniRodentia8.32
4Western american chipmunkEutamiasherbiRodentia14.9NA
5GalagoGalagoomniPrimates9.81.1
6HumanHomoomniPrimates81.9
7MacaqueMacacaomniPrimates10.11.2
8VoleMicrotusherbiRodentia12.8NA
9Little brown batMyotisinsectiChiroptera19.92
10Slow lorisNyctibeuscarniPrimates11NA
11ChimpanzeePanomniPrimates9.71.4
12BaboonPapioomniPrimates9.41
13Desert hedgehogParaechinusNAErinaceomorpha10.32.7
14Deer mousePeromyscusNARodentia11.5NA
26Tree shrewTupaiaomniScandentia8.92.6
27Bottle-nosed dolphinTursiopscarniCetacea5.2NA
28GenetGenettacarniCarnivora6.31.3
29Arctic foxVulpescarniCarnivora12.5NA
30Red foxVulpescarniCarnivora9.82.4

That doesn’t seem right. There are no missing values in rows 0 and 1, same for rows 26 and 30. What’s going on?

As mentioned earlier, operations in i occur before j; in the code above, ALL the string columns in the frame were filtered, and not restricted to the selected columns. The right way about it is to select the columns first, then chain the row filtration:

DT[:, columns][rows, :]
namegenusvoreordersleep_totalsleep_rem
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0Vesper mouseCalomysNARodentia7NA
1Desert hedgehogParaechinusNAErinaceomorpha10.32.7
2Deer mousePeromyscusNARodentia11.5NA
3PhalangerPhalangerNADiprotodontia13.71.8
4Rock hyraxProcaviaNAHyracoidea5.40.5
5Mole ratSpalaxNARodentia10.62.4
6Musk shrewSuncusNASoricomorpha12.82

Again, the filtration process depend on the context, and should be adapted accordingly.

Filter at#

It is also possible to filter rows based on specific columns:

columns = f['name', 'sleep_total' : 'sleep_rem', 'brainwt' : 'bodywt']

rows = (f['sleep_total', 'sleep_rem'] > 5).rowall()

DT[rows, columns]
namesleep_totalsleep_rembrainwtbodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0Thick-tailed opposum19.46.6NA0.37
1Giant armadillo18.16.10.08160

Note in the example above, the rows and columns are within the same bracket, because the columns are explicitly specified; only values from those columns will be used for the filtration.

Another example below that uses a different select option:

columns = f['name', 'sleep_total' : 'sleep_rem', 'brainwt' : 'bodywt']

rows = [name for name in DT[:, columns].names if 'sleep' in name]

rows = (f[rows]>5).rowall()

DT[rows, columns]
namesleep_totalsleep_rembrainwtbodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0Thick-tailed opposum19.46.6NA0.37
1Giant armadillo18.16.10.08160

Resources:

Comments#