Fast and Efficient Inequality Joins in Pandas#

Introduction#

Pandas supports equi joins, where the join keys are matched on equality; these are implemented via the merge and join functions. There are scenarios, however, where the keys are not compared for equality and some other logical condition drives the join. This is known as a non-equi join, or an inequality join.

Let’s look at some examples, culled from here, that illustrate equi and non-equi joins:

import pandas as pd

df1 = pd.DataFrame({'left': [1,2,3], 'col_a': ["A", "B", "C"]})
df2 = pd.DataFrame({'right': [0, 2, 3], 'col_b': ["Z", "X", "Y"]})
df1
left col_a
0 1 A
1 2 B
2 3 C
df2
right col_b
0 0 Z
1 2 X
2 3 Y

Join based on the condition df1.left == df2.right - this is an equi join, because the join operator is the equality operator; it should return rows where df1.left equals df2.right:

left col_a right col_b
2 B 2 X
3 C 3 Y
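
For reference, this equi join can be expressed directly with pandas' merge:

df1.merge(df2, left_on='left', right_on='right')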

Join based on the condition df1.left != df2.right - this is an example of a non-equi join; it should return rows where df1.left is not equal to df2.right:

left col_a right col_b
1 A 2 X
1 A 3 Y
2 B 3 Y
1 A 0 Z
2 B 0 Z
3 C 0 Z
3 C 2 X

Join based on the condition df1.left > df2.right - this is another example of a non-equi join; it should return rows where df1.left is greater than df2.right:

left col_a right col_b
1 A 0 Z
2 B 0 Z
3 C 0 Z
3 C 2 X
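
Plain pandas has no join operator for these conditions; the usual workaround, covered in more detail below, is a cross join followed by a filter. A minimal sketch for the greater-than case:

(df1
.merge(df2, how = 'cross')
.query('left > right')
)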

Non-equi joins are supported in SQL, and offer elegant solutions for specific use cases. A good example is provided here:

Consider a dataframe `df_process` that contains `process_id`, `process_start_date` and `process_end_date` columns for some business process, and a second dataframe `df_event` that contains an `event_id` and an `event_date` column for particular events. To find all events that occur during a business process, you can apply the conditional join statement:

(df_event.event_date >= df_process.process_start_date) & (df_event.event_date <= df_process.process_end_date)

The above scenario is an example of a range join, where the join condition checks whether a value falls within a range of values. Currently in Pandas, one option for solving non-equi joins is an IntervalIndex, provided the intervals do not overlap. Another, more common option is a cartesian join, where every row from the left dataframe is paired with every row from the right dataframe, before the result is filtered with the non-equi conditions:

(df_event
.merge(df_process, how = 'cross')
.query('process_start_date <= event_date <= process_end_date')
)
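
For completeness, here is a minimal sketch of the IntervalIndex option mentioned above, assuming the process intervals do not overlap; the tiny df_process and df_event frames below are made up purely to have something runnable:

df_process = pd.DataFrame({
    'process_id': [1, 2],
    'process_start_date': pd.to_datetime(['2024-01-01', '2024-02-01']),
    'process_end_date': pd.to_datetime(['2024-01-10', '2024-02-10']),
})
df_event = pd.DataFrame({
    'event_id': [10, 11, 12],
    'event_date': pd.to_datetime(['2024-01-05', '2024-02-03', '2024-03-01']),
})

# build an IntervalIndex from the start/end dates (closed on both ends)
intervals = pd.IntervalIndex.from_arrays(
    df_process['process_start_date'],
    df_process['process_end_date'],
    closed='both')
# for each event date, get the position of the interval containing it
# (-1 if the event falls outside every process window)
positions = intervals.get_indexer(df_event['event_date'])
mask = positions != -1
matched = pd.concat(
    [df_event.loc[mask].reset_index(drop=True),
     df_process.iloc[positions[mask]].reset_index(drop=True)],
    axis=1)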

This is not an efficient way to handle inequality joins, in terms of both memory and speed, especially for large data. This is where the conditional_join function from pyjanitor comes in: it handles non-equi joins with far less memory than a cartesian join while remaining performant. Under the hood it uses binary search, and for some cases, such as range joins, it switches to specialised algorithms to improve performance.

Strictly Non-equi Joins#

Let’s load the pyjanitor library:

# pip install pyjanitor
import janitor as jn
import sys
print("pandas version :", pd.__version__)
print("janitor version :", jn.__version__)
print("python version :", sys.version)
pandas version : 2.2.2
janitor version : 0.28.1
python version : 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]

Let’s walk through some of the parameters in conditional_join:

  • df : Left DataFrame.

  • right: Named Series or DataFrame to join to df.

  • conditions : Variable argument of tuple(s) of the form (left_on, right_on, op), where left_on is the column from df, right_on is the column from right, and op is the join operator. For multiple conditions, the and (&) operator is used to combine the results of the individual conditions.

  • how: indicates the type of join to be performed. Can be one of left, right, inner, or outer.

  • use_numba: Whether to use numba, for possible performance improvement.

  • df_columns: Select columns from df.

  • right_columns: Select columns from right.

Let’s apply the conditional_join function to solve the non-equi joins from earlier:

df1.conditional_join(df2, ('left', 'right', '!='))
left col_a right col_b
0 1 A 2 X
1 1 A 3 Y
2 2 B 3 Y
3 1 A 0 Z
4 2 B 0 Z
5 3 C 0 Z
6 3 C 2 X
df1.conditional_join(df2, ('left', 'right', '>'))
left col_a right col_b
0 1 A 0 Z
1 2 B 0 Z
2 3 C 0 Z
3 3 C 2 X

For multiple non-equi conditions, the results of the individual conditions are combined with the and (&) operator. Let’s look at a range join to see this in action - the example is adapted from here:

east = dict(id = range(100,103), 
            dur = [140, 100, 90], 
            rev = [9, 12, 5], 
            cores = [2, 8, 4])

west = dict(t_id = [404,498, 676, 742], 
            time = [100, 140, 80, 90], 
            cost = [6, 11, 10, 5], 
            cores = [4, 2, 1, 4])

east = pd.DataFrame(east)
west = pd.DataFrame(west)
east
id dur rev cores
0 100 140 9 2
1 101 100 12 8
2 102 90 5 4
west
t_id time cost cores
0 404 100 6 4
1 498 140 11 2
2 676 80 10 1
3 742 90 5 4

Task: Look for any customer from the West Coast who rented a virtual machine for more hours than any customer from the East Coast, but who paid less. Basically, get rows where east.dur < west.time and east.rev > west.cost.

This is a range join, and is solved easily and efficiently with conditional_join:

(east
.conditional_join(
    west, 
    ('dur', 'time', '<'), 
    ('rev', 'cost', '>'), 
    df_columns = ['id', 'dur', 'rev'],
    right_columns = ['t_id', 'time', 'cost']
    )
)
id dur rev t_id time cost
0 101 100 12 498 140 11

For range joins, when use_numba = False, conditional_join uses an algorithm based on this Vertica blog post.

You might get more performance with use_numba = True; the implementation is based on the algorithm in this publication.

Note that non-equi joins are supported only for numeric and date data types.

Equi and Non-equi Joins#

conditional_join also supports a combination of equi and non-equi join conditions - under the hood it uses Pandas’ internal merge function to get the matching pairs for the equi condition, before pruning with the non-equi conditions to get the final dataframe. Depending on the size of the dataframes, this might offer more performance than the classic merge and filter approach:

df1 = pd.DataFrame({'id': [1,1,1,2,2,3], 
                    'value_1': [2,5,7,1,3,4]})
                    
df2 = pd.DataFrame({'id': [1,1,1,1,2,2,2,3], 
                    'value_2A': [0,3,7,12,0,2,3,1], 
                    'value_2B': [1,5,9,15,1,4,6,3]})
df1
id value_1
0 1 2
1 1 5
2 1 7
3 2 1
4 2 3
5 3 4
df2
id value_2A value_2B
0 1 0 1
1 1 3 5
2 1 7 9
3 1 12 15
4 2 0 1
5 2 2 4
6 2 3 6
7 3 1 3

Classic Pandas’ merge and filter approach:

(df1
.merge(df2, on = 'id')
.query('value_2A <= value_1 <= value_2B')
)
id value_1 value_2A value_2B
5 1 5 3 5
10 1 7 7 9
12 2 1 0 1
16 2 3 2 4
17 2 3 3 6

With conditional_join:

(df1
.conditional_join(
    df2,
    ('id', 'id', '=='),
    ('value_1', 'value_2A', '>='),
    ('value_1', 'value_2B', '<=')
))
left right
id value_1 id value_2A value_2B
0 1 5 1 3 5
1 1 7 1 7 9
2 2 1 2 0 1
3 2 3 2 2 4
4 2 3 2 3 6

Note that the above returned MultiIndex columns; whenever the columns from the left and right dataframes have labels in common, a MultiIndex column is returned. Of course, you can select/rename columns with df_columns and right_columns to get a single-level index, if that is preferred.
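
For instance, selecting only the columns you need on each side (so that no label is shared between the two frames) should leave the result with ordinary single-level columns; a small sketch:

(df1
.conditional_join(
    df2,
    ('id', 'id', '=='),
    ('value_1', 'value_2A', '>='),
    ('value_1', 'value_2B', '<='),
    df_columns = ['id', 'value_1'],
    right_columns = ['value_2A', 'value_2B']
))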

Non-equi joins with OR#

For multiple non-equi joins, the and (&) operator is applied to combine the results of the individual conditions. There are scenarios, however, where the join might involve an OR condition. In this case, the joins are executed individually, and the resulting dataframes can then be concatenated into one. Let’s look at an example from Nelson Tang’s blog post:

sales_volume_table = pd.DataFrame.from_dict([
    {'date':'2021-11-15', 'quantity':1, 'brand':'Outdoor'},
    {'date':'2021-11-20', 'quantity':2, 'brand':'Leisure'},
    {'date':'2021-11-25', 'quantity':3, 'brand':'Athletic'},
    {'date':'2021-11-26', 'quantity':2, 'brand':'Outdoor'},
])

promos_table = pd.DataFrame.from_dict([
    {'start_date':'2021-11-01', 'end_date':'2021-11-25',
    'brand':'ANY', 'rebate_per_unit':3},
    {'start_date':'2021-11-25', 'end_date':'2021-11-26',
    'brand':'Outdoor', 'rebate_per_unit':5},
])

sales_volume_table['date'] = pd.to_datetime(sales_volume_table['date'])
promos_table['start_date'] = pd.to_datetime(promos_table['start_date'])
promos_table['end_date'] = pd.to_datetime(promos_table['end_date'])
sales_volume_table
date quantity brand
0 2021-11-15 1 Outdoor
1 2021-11-20 2 Leisure
2 2021-11-25 3 Athletic
3 2021-11-26 2 Outdoor
promos_table
start_date end_date brand rebate_per_unit
0 2021-11-01 2021-11-25 ANY 3
1 2021-11-25 2021-11-26 Outdoor 5

Problem statement, culled from the blog post:

  1. The date in the left table was between two dates (a start and end date) in the second table

  2. …AND the values in two other columns matched each other, OR the column on the right table was equal to ‘ANY’ (aka a ‘wildcard’ value)

Translating this into SQL is easy:

SELECT *
FROM sales_volume_table, promos_table
WHERE (sales_volume_table.brand=promos_table.brand or promos_table.brand='ANY')
AND (start_date <= date AND date <= end_date)

Replicating this in Pandas is more involved but doable:

top = (sales_volume_table
       .conditional_join(
            promos_table,
            ('brand', 'brand', '=='),
            ('date', 'start_date', '>='),
            ('date', 'end_date', '<=')
          )
    )

bottom = (sales_volume_table
          .conditional_join(
               promos_table.query('brand == "ANY"'),
               ('date', 'start_date', '>='),
               ('date', 'end_date', '<=')
               )
     )

pd.concat([top, bottom])
left right
date quantity brand start_date end_date brand rebate_per_unit
0 2021-11-26 2 Outdoor 2021-11-25 2021-11-26 Outdoor 5
0 2021-11-15 1 Outdoor 2021-11-01 2021-11-25 ANY 3
1 2021-11-20 2 Leisure 2021-11-01 2021-11-25 ANY 3
2 2021-11-25 3 Athletic 2021-11-01 2021-11-25 ANY 3
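
If the OR branches could ever match the same row pair (they cannot here, since the second join is restricted to the 'ANY' promotions), a drop_duplicates after the concatenation would remove the repeats, and ignore_index=True tidies up the repeated index labels:

(pd.concat([top, bottom], ignore_index=True)
.drop_duplicates()
)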

Performance#

Why bother with conditional_join? Simple syntax, plus memory and speed efficiency. Let’s see the memory and speed benefits compared to a cartesian join (the data used below is adapted from here):

%load_ext memory_profiler
from string import ascii_lowercase
import numpy as np
np.random.seed(1)
n = 2_000; k = 10_000

idx1 = np.random.randint(0, high = 100, size = n)
idx2 = np.random.randint(0, high = 100, size = n)

d1 = dict(x = np.random.choice(list(ascii_lowercase[:5]), size=n),
          start = np.minimum(idx1, idx2),
          end = np.maximum(idx1, idx2))

d2 = dict(x = np.random.choice(list(ascii_lowercase[:15]), size=k),
          pos1 = np.random.randint(low=60, high = 151, size=k))

d1 = pd.DataFrame(d1)
d2 = pd.DataFrame(d2)
d1.head()
x start end
0 c 37 61
1 d 12 16
2 b 72 79
3 c 9 74
4 d 75 76
d2.head()
x pos1
0 f 126
1 m 121
2 o 128
3 a 96
4 n 61
len(d1), len(d2)
(2000, 10000)
A = (d1
    .merge(d2, how = 'cross')
    .query("start <= pos1 <= end")
    .filter(['start', 'end', 'pos1'])
    )
A.shape
(2740898, 3)
A.head()
start end pos1
4 37 61 61
53 37 61 60
74 37 61 60
104 37 61 60
105 37 61 60
B = (d1
    .conditional_join(
        d2, 
        ('start', 'pos1', '<='), 
        ('end', 'pos1', '>='), 
        df_columns=['start', 'end'],
        right_columns='pos1',
        use_numba=False)
    )
B.shape
(2740898, 3)
B.head()
start end pos1
0 37 61 60
1 37 61 60
2 37 61 60
3 37 61 60
4 37 61 60
C = (d1
    .conditional_join(
        d2, 
        ('start', 'pos1', '<='), 
        ('end', 'pos1', '>='), 
        df_columns=['start', 'end'],
        right_columns='pos1',
        use_numba=True)
    )
C.shape
(2740898, 3)

Check that the outputs match:

cols = ['start', 'end', 'pos1']
A = A.sort_values(cols, ignore_index=True)
B = B.sort_values(cols, ignore_index=True)
C = C.sort_values(cols, ignore_index=True)
A.equals(B), A.equals(C), B.equals(C)
(True, True, True)

Let’s compare the speed:

%timeit d1.merge(d2, how = 'cross').query("start <= pos1 <= end")
1.11 s ± 31.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit d1.conditional_join(d2, ('start', 'pos1', '<='), ('end', 'pos1', '>='), use_numba=False)
38.9 ms ± 401 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit d1.conditional_join(d2, ('start', 'pos1', '<='), ('end', 'pos1', '>='), use_numba=True)
40.4 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Compare memory usage:

%memit d1.merge(d2, how = 'cross').query("start <= pos1 <= end")
peak memory: 3844.30 MiB, increment: 0.14 MiB
%memit d1.conditional_join(d2, ('start', 'pos1', '<='), ('end', 'pos1', '>='), use_numba=False)
peak memory: 3844.30 MiB, increment: 0.00 MiB
%memit d1.conditional_join(d2, ('start', 'pos1', '<='), ('end', 'pos1', '>='), use_numba=True)
peak memory: 3844.30 MiB, increment: 0.00 MiB

Let’s run another performance test - the code below is a self join to find overlapping events in a 30,000-row DataFrame, adapted from DuckDB:

url = 'https://raw.githubusercontent.com/samukweku/data-wrangling-blog/master/notebooks/Data_files/results.csv'
events = pd.read_csv(url, parse_dates=['start', 'end']).iloc[:, 1:]
events.head()
id name audience start sponsor end
0 1 Event 1 1178 2022-11-19 10:00:00 Sponsor 2 2022-11-19 10:15:00
1 2 Event 2 1446 2015-09-27 15:00:00 Sponsor 11 2015-09-27 15:11:00
2 3 Event 3 2261 2019-11-12 18:00:00 Sponsor 10 2019-11-12 18:53:00
3 4 Event 4 1471 2019-12-24 22:00:00 Sponsor 6 2019-12-24 22:11:00
4 5 Event 5 2605 2028-06-20 12:00:00 Sponsor 8 2028-06-20 12:31:00
(events
.conditional_join(
    events,
    ('start', 'end', '<='),
    ('end', 'start', '>='),
    ('id', 'id', '!='),
    use_numba = False,
    df_columns = ['id', 'start', 'end'],
    right_columns = ['id', 'start', 'end'])
)
left right
id start end id start end
0 10 1993-11-27 12:00:00 1993-11-27 12:37:00 2345 1993-11-27 10:00:00 1993-11-27 12:00:00
1 15 1993-04-04 16:00:00 1993-04-04 18:00:00 11178 1993-04-04 17:00:00 1993-04-04 17:22:00
2 17 2030-10-25 07:00:00 2030-10-25 07:27:00 19605 2030-10-25 06:00:00 2030-10-25 08:00:00
3 26 2005-10-04 17:00:00 2005-10-04 17:18:00 8218 2005-10-04 17:00:00 2005-10-04 17:27:00
4 35 2024-05-02 15:00:00 2024-05-02 15:35:00 6916 2024-05-02 15:00:00 2024-05-02 15:36:00
... ... ... ... ... ... ...
3697 29966 2000-08-26 11:00:00 2000-08-26 13:00:00 29375 2000-08-26 13:00:00 2000-08-26 13:53:00
3698 29971 2018-05-18 04:00:00 2018-05-18 04:18:00 24173 2018-05-18 04:00:00 2018-05-18 04:36:00
3699 29978 1992-06-07 22:00:00 1992-06-07 22:23:00 981 1992-06-07 22:00:00 1992-06-07 22:30:00
3700 29984 2025-06-05 03:00:00 2025-06-05 03:17:00 19051 2025-06-05 01:00:00 2025-06-05 03:00:00
3701 29995 2016-09-04 14:00:00 2016-09-04 14:32:00 12296 2016-09-04 14:00:00 2016-09-04 14:50:00

3702 rows × 6 columns

The cartesian join takes a very long time - too long for us to test here. The focus for this test will just be on conditional_join.

%%timeit
(events
.conditional_join(
    events,
    ('start', 'end', '<='),
    ('end', 'start', '>='),
    ('id', 'id', '!='),
    use_numba = False,
    df_columns = ['id', 'start', 'end'],
    right_columns = ['id', 'start', 'end'])
)
19.6 ms ± 2.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
(events
.conditional_join(
    events,
    ('start', 'end', '<='),
    ('end', 'start', '>='),
    ('id', 'id', '!='),
    use_numba = True,
    df_columns = ['id', 'start', 'end'],
    right_columns = ['id', 'start', 'end'])
)
22.2 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Let’s look at performance when an equi join is present:

from string import ascii_lowercase
np.random.seed(1)
n = 20_000; k = 50_000

idx1 = np.random.randint(0, high = 100, size = n)
idx2 = np.random.randint(0, high = 100, size = n)

d1 = dict(x = np.random.choice(list(ascii_lowercase[:5]), size=n),
          start = np.minimum(idx1, idx2),
          end = np.maximum(idx1, idx2))

d2 = dict(x = np.random.choice(list(ascii_lowercase[:15]), size=k),
          pos1 = np.random.randint(low=60, high = 151, size=k))

d1 = pd.DataFrame(d1)
d2 = pd.DataFrame(d2)
out = (d1
       .merge(d2, on = 'x')
       .query('start <= pos1 <= end')
       )
out.shape
(8886840, 4)
out = (d1
       .conditional_join(
            d2, 
            ('start', 'pos1', '<='), 
            ('end', 'pos1', '>='), 
            ('x', 'x', '=='), 
            use_numba=False)
    )
out.shape
(8886840, 5)

About 9 million rows are returned; let’s see how fast each approach runs:

%%timeit
(d1
.merge(d2, on = 'x')
.query('start <= pos1 <= end')
)
2.65 s ± 51.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(d1
.merge(d2, on = 'x')
.loc[lambda df: df.start.le(df.pos1) & df.pos1.le(df.end)]
)
2.56 s ± 50.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(d1
.conditional_join(
    d2, 
    ('start', 'pos1', '<='), 
    ('end', 'pos1', '>='), 
    ('x', 'x', '=='), 
    use_numba=False)
)
1.74 s ± 39.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%memit d1.merge(d2, on = 'x').query('start <= pos1 <= end')
peak memory: 7749.67 MiB, increment: 1846.95 MiB
%memit d1.conditional_join(d2, ('start', 'pos1', '<='), ('end', 'pos1', '>='), ('x', 'x', '=='), use_numba=False)
peak memory: 5711.58 MiB, increment: 0.00 MiB

conditional_join is noticeably faster here; this might not be the case all the time.

Depending on the distribution of data, using numba where an equi-join is present can give significant performance speedups:

from itertools import count
mapper = dict(zip(ascii_lowercase, count()))
d1['xx'] = d1.x.map(mapper)
d2['xx'] = d2.x.map(mapper)
d1.head()
x start end xx
0 d 37 82 3
1 c 2 12 2
2 b 46 72 1
3 b 7 9 1
4 c 39 75 2
d2.head()
x pos1 xx
0 c 143 2
1 i 85 8
2 k 121 10
3 m 68 12
4 o 97 14
%%timeit
(d1
.merge(d2, on = 'xx')
.query('start <= pos1 <= end')
)
4.44 s ± 222 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(d1
.merge(d2, on = 'xx')
.loc[lambda df: df.start.le(df.pos1) & df.pos1.le(df.end)]
)
4.23 s ± 156 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(d1
.conditional_join(
    d2, 
    ('start', 'pos1', '<='), 
    ('end', 'pos1', '>='), 
    ('xx', 'xx', '=='), 
    use_numba=False)
)
1.76 s ± 3.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(d1
.conditional_join(
    d2, 
    ('start', 'pos1', '<='), 
    ('end', 'pos1', '>='), 
    ('xx', 'xx', '=='), 
    use_numba=True)
)
159 ms ± 638 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)

The above numba implementation depends on a number of conditions:

  • Only a single equi join is supported

  • The columns in the equi join should be either numeric or datetime data types (see the sketch after this list for one way to encode a string key)

  • The equi-join columns contain many duplicates spread over large ranges, i.e. relatively few unique values
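
If the equi-join key is a string, as x is here, it needs to be encoded as numbers first; the mapper dict above does exactly that. A more general sketch, using a shared categorical coding so the integer codes agree on both sides (x_code is a hypothetical new column name):

# assumption: encode the shared string key as integers so it satisfies
# the numeric requirement; the categories must be identical on both sides
shared = pd.CategoricalDtype(sorted(set(d1['x']) | set(d2['x'])))
d1['x_code'] = d1['x'].astype(shared).cat.codes
d2['x_code'] = d2['x'].astype(shared).cat.codes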

Summary#

This blog post shows how to efficiently join on non-equi conditions, using conditional_join. If you are comfortable with SQL, you could get more performance with DuckDB, which allows querying Pandas DataFrames efficiently with SQL, supports non-equi joins as well, and has a performant implementation for range joins.
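
As a rough sketch of that DuckDB route (assuming the duckdb package is installed; duckdb can query in-scope pandas DataFrames by variable name), the range join from the performance section could be written as:

import duckdb

duckdb.sql(
    """
    SELECT d1."start", d1."end", d2.pos1
    FROM d1, d2
    WHERE d1."start" <= d2.pos1 AND d2.pos1 <= d1."end"
    """
).df()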
