Selecting and Grouping Data with Python Datatable#

Datatable#

This is a Python package for manipulating 2-dimensional tabular data structures (aka data frames). It is close in spirit to pandas or SFrame; however we put specific emphasis on speed and big data support. As the name suggests, the package is closely related to R’s data.table and attempts to mimic its core algorithms and API.

I like datatable primarily because of its simple syntax. Yes, there are significant speed gains, which is the primary aim of the package, but the simplicity is compelling.

Do note that datatable is in active development - more features will be added. Check out the documentation for more information.

This introduction focuses on how to select rows, select and perform calculations on columns, and perform aggregations by group.

Datatable syntax#

datatable_syntax.png

Short syntax right? Let’s break it down :
- DT refers to the data frame. This is the fundamental building block in datatable. It is a 2-dimensional array with rows and columns, similar to an Excel/SQL table.
- The i part is used for subsetting on rows and shares a similar concept with SQL’s WHERE clause.
- The j part is used to select columns and act on them.
- ... are for extra modifiers, e.g grouping, sorting, joining, etc.

Let’s dive into some examples to see how datatable works. Our data will be the iris dataset.

#import datatable
from datatable import dt, f, by

#read in data
DT = dt.fread("Data_files/iris.csv")

DT.head()
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.13.51.40.2setosa
14.931.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
453.61.40.2setosa
55.43.91.70.4setosa
64.63.41.40.3setosa
753.41.50.2setosa
84.42.91.40.2setosa
94.93.11.50.1setosa

Notes :

  • dt refers to the datatable module.

  • All computations occur within the [] bracket.

  • dt.fread is a powerful and very fast function for reading in various text files, zip archives, and urls. It can even read in data from the command line.

  • by is a function for grouping.

  • f is a variable that provides a convenient way to reference the data frame’s column within the square brackets. It is really useful when performing computations or creating expressions.

  • In a jupyter notebook, the tab colours for the columns indicate various data types - blue is for float column, green is for integer column, red is for string column, yellow is for boolean, while black is for object column.

Some basic information about the data frame :

#shape of data
DT.shape
(150, 5)
#column names
DT.names
('sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species')
#data types of the columns
DT.stypes
(stype.float64, stype.float64, stype.float64, stype.float64, stype.str32)

The i part - Subset Rows#

Select the first three rows#

#We can use python's slicing syntax to get the rows. 
DT[:3, :]
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.13.51.40.2setosa
14.931.40.2setosa
24.73.21.30.2setosa

Notes :

  • No new syntax, just our knowledge of python’s sequence indexing.

  • Note also that nothing was selected in the columns. Passing None or ... in the j section will work as well.

Select the 2nd, 4th and 8th rows#

#Python has a zero based indexing notation,
#so, we will pass in a list [1,3,7]

DT[[1,3,7], :]
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
04.931.40.2setosa
14.63.11.50.2setosa
253.41.50.2setosa

Find rows where species == “versicolor”#

This is an expression. How do we talk to the dataframe to filter for only rows where the species column is equal to versicolor? Through the f variable :

result = DT[f.species == "versicolor", :]
result.head()
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
073.24.71.4versicolor
16.43.24.51.5versicolor
26.93.14.91.5versicolor
35.52.341.3versicolor
46.52.84.61.5versicolor
55.72.84.51.3versicolor
66.33.34.71.6versicolor
74.92.43.31versicolor
86.62.94.61.3versicolor
95.22.73.91.4versicolor

Notes :

  • We create a boolean expression, using the f symbol, to select rows that match the condition.

Find rows where species == “versicolor” and sepal_length == 7#

DT[(f.species == "versicolor") & (f.sepal_length == 7), :]
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
073.24.71.4versicolor

Notes :

  • Again, a filter expression is needed -> we use the f symbol to access the columns and create our expression.

  • Note also that we have two conditions; as such, each condition is wrapped in parentheses to ensure the operator precedence.

  • Operator & for and, | for or.

  • You can chain as many conditions as needed within the i section.

The j part - Columns#

Select species, petal_width and petal_length columns#

#Simply pass a list of the names in the j section 
result = DT[:, ["species","petal_width","petal_length"]]

result.head()
speciespetal_widthpetal_length
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0setosa0.21.4
1setosa0.21.4
2setosa0.21.3
3setosa0.21.5
4setosa0.21.4
5setosa0.41.7
6setosa0.31.4
7setosa0.21.5
8setosa0.21.4
9setosa0.11.5

Notes :

  • Note that we passed : in the i section to indicate that all rows will be selected. None or ... works fine as well.

Select the last two columns#

#Again, we use python's indexing syntax
result = DT[:, -2:]

result.head()
petal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪
00.2setosa
10.2setosa
20.2setosa
30.2setosa
40.2setosa
50.4setosa
60.3setosa
70.2setosa
80.2setosa
90.1setosa

Select only columns whose names end with length#

result = DT[:, [col.endswith("length") for col in DT.names]]

result.head()
sepal_lengthpetal_length
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.11.4
14.91.4
24.71.3
34.61.5
451.4
55.41.7
64.61.4
751.5
84.41.4
94.91.5

Notes:

  • For this, we pass a list comprehension to extract the matches. Again, we are using familar python constructs.

Calculate the mean value of sepal_length#

DT[:, f.sepal_length.mean()]
sepal_length
▪▪▪▪▪▪▪▪
05.84333
  • Here, we have to compute a value, and as such, the f symbol comes into play.

  • To get the average, we use the mean function from the datatable module.

REMEMBER: Any time you want to compute a value or create an expression, make use of the f symbol.

TIP : We can export the column names and use them, instead of the f symbol :

sepal_length,  *_ = DT.export_names()

#calculate the mean value of sepal_length
DT[:, sepal_length.mean()]
sepal_length
▪▪▪▪▪▪▪▪
05.84333

We can also export only the names we need :

petal_width, petal_length = DT[:, [2,3]].export_names()

#calculate mean value of petal_width and petal_length
DT[:, dt.mean([petal_width.mean(), petal_length])]
petal_lengthpetal_width
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
03.7581.19933

Notes:

  • To get aggregates of more than one column, pass a list of column names prefixed with the f symbol to j.

  • Alternatively, you can pass a dictionary of key - value pairs, where the keys will be the new column names, and the values are the computed columns.

For the rest of the tutorial, I will continue using f-expressions, as they are convenient to use, without having to explicitly export column names.

Select only species and petal_length columns for rows where the petal_length is greater than 1.5#

result = DT[f.petal_length > 1.5, ["species", "petal_length"]]

result.head()
speciespetal_length
▪▪▪▪▪▪▪▪▪▪▪▪
0setosa1.7
1setosa1.6
2setosa1.7
3setosa1.7
4setosa1.7
5setosa1.9
6setosa1.6
7setosa1.6
8setosa1.6
9setosa1.6

Select only string columns#

result = DT[:, f[str]]
result.head()
species
▪▪▪▪
0setosa
1setosa
2setosa
3setosa
4setosa
5setosa
6setosa
7setosa
8setosa
9setosa

Notes:

  • The above concept is applicable for any data type that we wish to include or exclude from the dataframe

Convert petal_length from float to integer#

  • Option 1. Direct assignment :

result = DT.copy()

result["petal_length"].stype
stype.float64
#change column type
result["petal_length"] = dt.int32

#check new data type
result[:, f.petal_length].stype
stype.int32
  • Option 2. Use the update method. The process is done in-place; no assignment is needed :

res = DT.copy()

res["petal_length"].stype
stype.float64
#update is an inplace operation
res[:, dt.update(petal_length = dt.int32(f.petal_length))]
#check data type after update 
res['petal_length'].stype
stype.int32
  • Option 3. Use the as_type method. This is the recommended way to change the column type.

DT[:, f.petal_length.as_type(dt.int32)].stype
stype.int32

TIP : The update method allows creation of new columns, or updating existing ones.

Multiply petal_length by 2 and add it as a new column petal_double#

  • Option 1. Direct Assignment :

sol_1 = DT.copy()

sol_1["petal_double"] = sol_1[:, 2 * f.petal_length]

sol_1.head(5)
sepal_lengthsepal_widthpetal_lengthpetal_widthspeciespetal_double
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.13.51.40.2setosa2.8
14.931.40.2setosa2.8
24.73.21.30.2setosa2.6
34.63.11.50.2setosa3
453.61.40.2setosa2.8
  • Option 2. Use the extend method and pass in a dictionary of key value pairs, where the key is the new column name, and the value is the computed column:

sol_2 = DT.copy()

#f[:] means the entire columns are selected
sol_2 = sol_2[:, f[:].extend({"petal_double" : 2 * f.petal_length})]

sol_2.head(5)
sepal_lengthsepal_widthpetal_lengthpetal_widthspeciespetal_double
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.13.51.40.2setosa2.8
14.931.40.2setosa2.8
24.73.21.30.2setosa2.6
34.63.11.50.2setosa3
453.61.40.2setosa2.8
  • Option 3. The update method :

sol_3 = DT.copy()

sol_3[:, dt.update(petal_double = 2 * f.petal_length)]
sol_3.head(5)
sepal_lengthsepal_widthpetal_lengthpetal_widthspeciespetal_double
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.13.51.40.2setosa2.8
14.931.40.2setosa2.8
24.73.21.30.2setosa2.6
34.63.11.50.2setosa3
453.61.40.2setosa2.8
  • Option 4. Use the alias method:

DT[:, [f[:], (f.petal_length*2).alias('petal_double')]]
sepal_lengthsepal_widthpetal_lengthpetal_widthspeciespetal_double
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.13.51.40.2setosa2.8
14.931.40.2setosa2.8
24.73.21.30.2setosa2.6
34.63.11.50.2setosa3
453.61.40.2setosa2.8
55.43.91.70.4setosa3.4
64.63.41.40.3setosa2.8
753.41.50.2setosa3
84.42.91.40.2setosa2.8
94.93.11.50.1setosa3
105.43.71.50.2setosa3
114.83.41.60.2setosa3.2
124.831.40.1setosa2.8
134.331.10.1setosa2.2
145.841.20.2setosa2.4
1456.735.22.3virginica10.4
1466.32.551.9virginica10
1476.535.22virginica10.4
1486.23.45.42.3virginica10.8
1495.935.11.8virginica10.2

Drop the sepal_width column from the table#

  • Option 1. Use the del keyword :

del sol_1["sepal_width"]
sol_1.names
('sepal_length', 'petal_length', 'petal_width', 'species', 'petal_double')
sol_1.head(3)
sepal_lengthpetal_lengthpetal_widthspeciespetal_double
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.11.40.2setosa2.8
14.91.40.2setosa2.8
24.71.30.2setosa2.6

Option 2. Use the remove method :

sol_3 = sol_3[:, f[:].remove(f.sepal_width)]
sol_3.names
('sepal_length', 'petal_length', 'petal_width', 'species', 'petal_double')
sol_3.head(3)
sepal_lengthpetal_lengthpetal_widthspeciespetal_double
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.11.40.2setosa2.8
14.91.40.2setosa2.8
24.71.30.2setosa2.6
  • Option 3. Use a list comprehension

DT[:, [name!='sepal_width' for name in DT.names]]
sepal_lengthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.11.40.2setosa
14.91.40.2setosa
24.71.30.2setosa
34.61.50.2setosa
451.40.2setosa
55.41.70.4setosa
64.61.40.3setosa
751.50.2setosa
84.41.40.2setosa
94.91.50.1setosa
105.41.50.2setosa
114.81.60.2setosa
124.81.40.1setosa
134.31.10.1setosa
145.81.20.2setosa
1456.75.22.3virginica
1466.351.9virginica
1476.55.22virginica
1486.25.42.3virginica
1495.95.11.8virginica

Extra Modifiers#

Notes :

  • We will focus on grouping and sorting.

Sort data by sepal_width#

outcome = DT[:, :, dt.sort('sepal_width')]

outcome.head()
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0523.51versicolor
162.241versicolor
26.22.24.51.5versicolor
362.251.5virginica
44.52.31.30.3setosa
55.52.341.3versicolor
66.32.34.41.3versicolor
752.33.31versicolor
84.92.43.31versicolor
95.52.43.81.1versicolor
  • That’s all there is to it, just pass in the column name, either as a string or as an f expression.

  • By default, sorting is in ascending order.

Sort by sepal_width ascending and petal_width descending#

outcome = DT[:, :, dt.sort([f.sepal_width, -f.petal_width])]

outcome.head()
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0523.51versicolor
16.22.24.51.5versicolor
262.251.5virginica
362.241versicolor
45.52.341.3versicolor
56.32.34.41.3versicolor
652.33.31versicolor
74.52.31.30.3setosa
85.52.43.81.1versicolor
94.92.43.31versicolor

Notes :

  • When sorting on multiple columns, pass a list of the column names to the sort function.

  • The - sign in front of petal_width instructs data table to sort in descending order. The - is symbolic and no actual negation occurs; as such it can be used on string columns.

  • Note that at this point, we had to switch to f-expression to get the desired result.

Group By - Get the average sepal length per species#

DT[:, f.sepal_length.mean(), by("species")] 
speciessepal_length
▪▪▪▪▪▪▪▪▪▪▪▪
0setosa5.006
1versicolor5.936
2virginica6.588

Notes :

  • by is the primary way to group data in the datatable.

  • Since we are computing a value (average), the f symbol is used.

Group By - Get the average sepal length and petal length per species#

DT[:, dt.mean([f.sepal_length, f.petal_length]), by("species")]
speciessepal_lengthpetal_length
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0setosa5.0061.462
1versicolor5.9364.26
2virginica6.5885.552

Group By - Filter the rows for sepal_width greater than or equal to 3, then get the average sepal length and petal length per species#

DT[f.sepal_width>=3,:][:,dt.mean([f.sepal_length, f.petal_length]), by("species")]
speciessepal_lengthpetal_length
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0setosa5.029171.46667
1versicolor6.218754.54375
2virginica6.768975.64138

Notes :

  • At the moment, computation on i, j and by at the same time is not yet implemented. As such, we have to break down the steps into two parts.

  • This also shows the chaining capabilities of datatable; simply put the next [], and datatable will compute the results from left to right.

Group By - Get the average sepal length and group on sepal width greater than 3#

DT[:, dt.mean(f.sepal_length), by((f.sepal_width > 3).alias('sepal_width_gt_3'))]
sepal_width_gt_3sepal_length
▪▪▪▪▪▪▪▪
005.97229
115.68358

Notes :

  • With by, you can group on boolean expressions. Note the use of f-expression in the by function.

  • Integers 0 and 1 signifiy False and True respectively.

Group By - Get the count of each species, and label the aggregation column as species_count#

DT[:, {"species_count" : dt.count()}, by("species")]
speciesspecies_count
▪▪▪▪▪▪▪▪▪▪▪▪
0setosa50
1versicolor50
2virginica50

Notes :

  • As with previous examples, a column can be renamed by passing a dictionary, where the key is the new column name and the value is the aggregation result.

  • count is another aggregation function within datatable.

Group By - Get the count of each species where the sepal width is greater than 3#

DT[:, dt.count(), by(f.species, (f.sepal_width > 3).alias('sepal_width_gt_3'))]
speciessepal_width_gt_3count
▪▪▪▪▪▪▪▪▪▪▪▪
0setosa08
1setosa142
2versicolor042
3versicolor18
4virginica033
5virginica117

Notes :

  • The by function can take a combination of columns, including booleans.

Summary#

This was an introduction to the syntax of python datatable. We will explore more features in subsequent blog posts.