Fast and Efficient Inequality Joins in Pandas#

Introduction#

Pandas supports equi joins, where the keys involved in the join are required to be equal. This is implemented via the merge and join functions. There are scenarios, however, where the keys involved might not be equal; some other logical condition is involved in the join. This is known as a non-equi join or an inequality join.

Let’s look at some examples, culled from here, that illustrate equi and non-equi joins:

import pandas as pd

df1 = pd.DataFrame({'left': [1,2,3], 'col_a': ["A", "B", "C"]})
df2 = pd.DataFrame({'right': [0, 2, 3], 'col_b': ["Z", "X", "Y"]})
df1
left col_a
0 1 A
1 2 B
2 3 C
df2
right col_b
0 0 Z
1 2 X
2 3 Y

Join based on the condition df1.left == df2.right - this is an equi join, because the join operator is an equality operator, where df1.left is equal to df2.right:

left col_a right col_b
2 B 2 X
3 C 3 Y
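
For reference, this equi join is exactly what pandas’ own merge already handles:

# equi join on df1.left == df2.right
df1.merge(df2, left_on='left', right_on='right')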

Join based on the condition df1.left != df2.right - this is an example of a non-equi join; it should return rows where df1.left is not equal to df2.right:

left col_a right col_b
1 A 2 X
1 A 3 Y
2 B 3 Y
1 A 0 Z
2 B 0 Z
3 C 0 Z
3 C 2 X

Join based on the condition df1.left > df2.right - this is another example of a non-equi join; it should return rows where df1.left is greater than df2.right:

left col_a right col_b
1 A 0 Z
2 B 0 Z
3 C 0 Z
3 C 2 X

Non-equi joins are supported in SQL, and offer elegant solutions for specific use cases. A good example is provided here -

Consider a dataframe `df_process` that contains `process_id`, `process_start_date` and `process_end_date` columns for some business process, and a second dataframe `df_event` that contains an `event_id` and an `event_date` column for particular events. If you want to find all events that occur during some business process, you can obtain them by applying the conditional join statement:

df_event.event_date >= df_process.process_start_date & df_event.event_date <= df_process.process_end_date

The above scenario is an example of a range join, where a value is matched against a range of values. Currently in Pandas, one option for solving non-equi joins is an IntervalIndex, if the intervals do not overlap. Another option, which is more common, involves a cartesian join, where every row from the left dataframe is joined to every row from the right dataframe, before filtering with the non-equi conditions:

(df_event
.merge(df_process, how = 'cross')
.query('process_start_date <= event_date <= process_end_date')
)

A cartesian join is not an efficient way to deal with inequality joins, in terms of memory and speed, especially for large data. This is where the conditional_join function from pyjanitor comes into play. It handles non-equi joins with far less memory than a cartesian join, while still being performant. Under the hood it uses binary search, and in some special cases, such as range joins, it uses dedicated algorithms to improve performance.
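
To get a feel for why this is cheaper than a cartesian product, here is a minimal sketch of the binary-search idea for a single condition such as df1.left > df2.right - purely an illustration, not pyjanitor’s actual implementation:

import numpy as np

# sort the right keys once, then binary-search each left key;
# every right row before the insertion point satisfies right < left
right_sorted = df2.sort_values('right', ignore_index=True)
positions = np.searchsorted(right_sorted['right'], df1['left'], side='left')

left_index = np.repeat(df1.index, positions)                      # matching rows of df1
right_index = np.concatenate([np.arange(p) for p in positions])   # matching rows of df2

pd.concat(
    [df1.loc[left_index].reset_index(drop=True),
     right_sorted.loc[right_index].reset_index(drop=True)],
    axis=1)

This returns the same four rows as the df1.left > df2.right join above, without ever materializing the full cross product - each left row only picks up the slice of right rows that satisfies the condition.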

Strictly Non-equi Joins#

Let’s load the pyjanitor library:

# pip install pyjanitor
import janitor as jn
import sys
from janitor import col
print(" pandas version :", pd.__version__,"\n", 
      "janitor version :", jn.__version__, "\n",
      "python version :", sys.version)
 pandas version : 2.2.0 
 janitor version : 0.26.0 
 python version : 3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:33:10) 
[GCC 12.3.0]

Let’s walk through some of the parameters in conditional_join:

  • df : Left DataFrame.

  • right: Named Series or DataFrame to join to df.

  • conditions : Variable argument of tuple(s) of the form (left_on, right_on, op) - left_on is the column from df, right_on is the column from right, while op is the join operator. For multiple conditions, the and (&) operator is used to combine the results of the individual conditions.

  • how: indicates the type of join to be performed. Can be one of left, right, inner, or outer.

  • use_numba: Whether to use numba, for possible performance improvement.

  • df_columns: Select columns from df.

  • right_columns: Select columns from right.

Let’s apply the conditional_join function to solve the non-equi joins earlier:

df1.conditional_join(df2, ('left', 'right', '!='))
left col_a right col_b
0 1 A 2 X
1 1 A 3 Y
2 2 B 3 Y
3 1 A 0 Z
4 2 B 0 Z
5 3 C 0 Z
6 3 C 2 X
df1.conditional_join(df2, ('left', 'right', '>'))
left col_a right col_b
0 1 A 0 Z
1 2 B 0 Z
2 3 C 0 Z
3 3 C 2 X

You can also use the col function:

df1.conditional_join(df2, col('left') != col('right'))
left col_a right col_b
0 1 A 2 X
1 1 A 3 Y
2 2 B 3 Y
3 1 A 0 Z
4 2 B 0 Z
5 3 C 0 Z
6 3 C 2 X
df1.conditional_join(df2, col('left') > col('right'))
left col_a right col_b
0 1 A 0 Z
1 2 B 0 Z
2 3 C 0 Z
3 3 C 2 X

For multiple non-equi conditions, the results of the individual conditions are combined with the and (&) operator. Let’s look at a range join to see this in action - the example is adapted from here:

east = dict(id = range(100,103), 
            dur = [140, 100, 90], 
            rev = [9, 12, 5], 
            cores = [2, 8, 4])

west = dict(t_id = [404,498, 676, 742], 
            time = [100, 140, 80, 90], 
            cost = [6, 11, 10, 5], 
            cores = [4, 2, 1, 4])

east = pd.DataFrame(east)
west = pd.DataFrame(west)
east
id dur rev cores
0 100 140 9 2
1 101 100 12 8
2 102 90 5 4
west
t_id time cost cores
0 404 100 6 4
1 498 140 11 2
2 676 80 10 1
3 742 90 5 4

Task: Look for any customer from the West Coast who rented a virtual machine for more hours than any customer from the East Coast, but who paid less. Basically, get rows where east.dur < west.time and east.rev > west.cost.

This is a range join, and is solved easily and efficiently with conditional_join:

(east
.conditional_join(
    west, 
    ('dur', 'time', '<'), 
    ('rev', 'cost', '>'), 
    df_columns = ['id', 'dur', 'rev'],
    right_columns = ['t_id', 'time', 'cost']
    )
)
id dur rev t_id time cost
0 101 100 12 498 140 11
(east
.conditional_join(
    west, 
    col('dur') < col('time'),
    col('rev') > col('cost'),
    df_columns = ['id', 'dur', 'rev'],
    right_columns = ['t_id', 'time', 'cost']
    )
)
id dur rev t_id time cost
0 101 100 12 498 140 11

For range joins, and when use_numba = False, conditional_join uses an algorithm based on this Vertica blog post.

You might get more performance with use_numba = True. The implementation is based on the algorithm in this publication.

Note that non-equi join is supported only for numeric and date types.

Equi and Non-equi Joins#

conditional_join also supports a combination of equi and non-equi join conditions - under the hood it uses Pandas’ internal merge function to get the matching pairs, before pruning with the non-equi conditions to get the final dataframe. Depending on the size of the dataframes, this might offer more performance than the classic merge and filter approach:

df1 = pd.DataFrame({'id': [1,1,1,2,2,3], 
                    'value_1': [2,5,7,1,3,4]})
                    
df2 = pd.DataFrame({'id': [1,1,1,1,2,2,2,3], 
                    'value_2A': [0,3,7,12,0,2,3,1], 
                    'value_2B': [1,5,9,15,1,4,6,3]})
df1
id value_1
0 1 2
1 1 5
2 1 7
3 2 1
4 2 3
5 3 4
df2
id value_2A value_2B
0 1 0 1
1 1 3 5
2 1 7 9
3 1 12 15
4 2 0 1
5 2 2 4
6 2 3 6
7 3 1 3

Classic Pandas’ merge and filter approach:

(df1
.merge(df2, on = 'id')
.query('value_2A <= value_1 <= value_2B')
)
id value_1 value_2A value_2B
5 1 5 3 5
10 1 7 7 9
12 2 1 0 1
16 2 3 2 4
17 2 3 3 6

With conditional_join:

(df1
.conditional_join(
    df2,
    ('id', 'id', '=='),
    ('value_1', 'value_2A', '>='),
    ('value_1', 'value_2B', '<=')
))
left right
id value_1 id value_2A value_2B
0 1 5 1 3 5
1 1 7 1 7 9
2 2 1 2 0 1
3 2 3 2 2 4
4 2 3 2 3 6
(df1
.conditional_join(
    df2,
    col('id') == col('id'),
    col('value_1') >= col('value_2A'),
    col('value_1') <= col('value_2B')
    )
)
left right
id value_1 id value_2A value_2B
0 1 5 1 3 5
1 1 7 1 7 9
2 2 1 2 0 1
3 2 3 2 2 4
4 2 3 2 3 6

Note that the above returned MultiIndex columns; whenever the left and right dataframes share column names, a MultiIndex column is returned. Of course, you can select/rename columns with df_columns and right_columns to get a single-level column index, if that is preferred; a plain-pandas way to flatten the columns is sketched below.
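
For example, flattening the MultiIndex columns after the join - a quick sketch using plain pandas; the resulting names, such as left_id and right_value_2A, are just for illustration:

out = df1.conditional_join(
    df2,
    ('id', 'id', '=='),
    ('value_1', 'value_2A', '>='),
    ('value_1', 'value_2B', '<=')
)
# collapse the (left/right, column) levels into single strings, e.g. left_id
out.columns = ['_'.join(level for level in col if level) for col in out.columns]
out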

Non-equi joins with OR#

For multiple non-equi joins, an and (&) operator is applied to combine the results of the individual conditions. There are scenarios, however, where the join might have an OR condition. In this case, the joins are executed individually, and the resulting dataframes can then be concatenated into one. Let’s look at an example from Nelson Tang’s blog post:

sales_volume_table = pd.DataFrame.from_dict([
    {'date':'2021-11-15', 'quantity':1, 'brand':'Outdoor'},
    {'date':'2021-11-20', 'quantity':2, 'brand':'Leisure'},
    {'date':'2021-11-25', 'quantity':3, 'brand':'Athletic'},
    {'date':'2021-11-26', 'quantity':2, 'brand':'Outdoor'},
])

promos_table = pd.DataFrame.from_dict([
    {'start_date':'2021-11-01', 'end_date':'2021-11-25',
    'brand':'ANY', 'rebate_per_unit':3},
    {'start_date':'2021-11-25', 'end_date':'2021-11-26',
    'brand':'Outdoor', 'rebate_per_unit':5},
])

sales_volume_table['date'] = pd.to_datetime(sales_volume_table['date'])
promos_table['start_date'] = pd.to_datetime(promos_table['start_date'])
promos_table['end_date'] = pd.to_datetime(promos_table['end_date'])
sales_volume_table
date quantity brand
0 2021-11-15 1 Outdoor
1 2021-11-20 2 Leisure
2 2021-11-25 3 Athletic
3 2021-11-26 2 Outdoor
promos_table
start_date end_date brand rebate_per_unit
0 2021-11-01 2021-11-25 ANY 3
1 2021-11-25 2021-11-26 Outdoor 5

Problem statement, culled from the blog post:

  1. The date in the left table was between two dates (a start and end date) in the second table

  2. …AND the values in two other columns matched each other, OR the column on the right table was equal to ‘ANY’ (aka a ‘wildcard’ value)

Translating this into SQL is easy:

SELECT *
FROM sales_volume_table, promos_table
WHERE (sales_volume_table.brand=promos_table.brand or promos_table.brand='ANY')
AND (start_date <= date AND date <= end_date)

Replicating this in Pandas is more involved but doable:

top = (sales_volume_table
       .conditional_join(
            promos_table,
            ('brand', 'brand', '=='),
            ('date', 'start_date', '>='),
            ('date', 'end_date', '<=')
          )
    )

bottom = (sales_volume_table
          .conditional_join(
               promos_table.query('brand == "ANY"'),
               ('date', 'start_date', '>='),
               ('date', 'end_date', '<=')
               )
     )

pd.concat([top, bottom])
left right
date quantity brand start_date end_date brand rebate_per_unit
0 2021-11-26 2 Outdoor 2021-11-25 2021-11-26 Outdoor 5
0 2021-11-15 1 Outdoor 2021-11-01 2021-11-25 ANY 3
1 2021-11-20 2 Leisure 2021-11-01 2021-11-25 ANY 3
2 2021-11-25 3 Athletic 2021-11-01 2021-11-25 ANY 3

Performance#

Why bother with conditional_join? Simple syntax, plus memory and speed efficiency. Let’s see the memory and speed benefits, compared to a cartesian join (the data used below is adapted from here):

%load_ext memory_profiler
from string import ascii_lowercase
import numpy as np
np.random.seed(1)
n = 2_000; k = 10_000

idx1 = np.random.randint(0, high = 100, size = n)
idx2 = np.random.randint(0, high = 100, size = n)

d1 = dict(x = np.random.choice(list(ascii_lowercase[:5]), size=n),
          start = np.minimum(idx1, idx2),
          end = np.maximum(idx1, idx2))

d2 = dict(x = np.random.choice(list(ascii_lowercase[:15]), size=k),
          pos1 = np.random.randint(low=60, high = 151, size=k))

d1 = pd.DataFrame(d1)
d2 = pd.DataFrame(d2)
d1.head()
x start end
0 c 37 61
1 d 12 16
2 b 72 79
3 c 9 74
4 d 75 76
d2.head()
x pos1
0 f 126
1 m 121
2 o 128
3 a 96
4 n 61
len(d1), len(d2)
(2000, 10000)
A = (d1
    .merge(d2, how = 'cross')
    .query("start <= pos1 <= end")
    .filter(['start', 'end', 'pos1'])
    )
A.shape
(2740898, 3)
A.head()
start end pos1
4 37 61 61
53 37 61 60
74 37 61 60
104 37 61 60
105 37 61 60
B = (d1
    .conditional_join(
        d2, 
        ('start', 'pos1', '<='), 
        ('end', 'pos1', '>='), 
        df_columns=['start', 'end'],
        right_columns='pos1',
        use_numba=False)
    )
B.shape
(2740898, 3)
B.head()
start end pos1
0 37 61 60
1 37 61 60
2 37 61 60
3 37 61 60
4 37 61 60
C = (d1
    .conditional_join(
        d2, 
        ('start', 'pos1', '<='), 
        ('end', 'pos1', '>='), 
        df_columns=['start', 'end'],
        right_columns='pos1',
        use_numba=True)
    )
C.shape
(2740898, 3)

Check that the outputs match:

cols = ['start', 'end', 'pos1']
A = A.sort_values(cols, ignore_index=True)
B = B.sort_values(cols, ignore_index=True)
C = C.sort_values(cols, ignore_index=True)
A.equals(B), A.equals(C), B.equals(C)
(True, True, True)

Let’s compare the speed:

%timeit d1.merge(d2, how = 'cross').query("start <= pos1 <= end")
1.3 s ± 95.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit d1.conditional_join(d2, ('start', 'pos1', '<='), ('end', 'pos1', '>='), use_numba=False)
52.6 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit d1.conditional_join(d2, ('start', 'pos1', '<='), ('end', 'pos1', '>='), use_numba=True)
67.5 ms ± 4.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Compare memory usage:

%memit d1.merge(d2, how = 'cross').query("start <= pos1 <= end")
peak memory: 3174.46 MiB, increment: 2459.45 MiB
%memit d1.conditional_join(d2, ('start', 'pos1', '<='), ('end', 'pos1', '>='), use_numba=False)
peak memory: 715.01 MiB, increment: 0.00 MiB
%memit d1.conditional_join(d2, ('start', 'pos1', '<='), ('end', 'pos1', '>='), use_numba=True)
peak memory: 715.01 MiB, increment: 0.00 MiB

Let’s run another performance test - the code below is a self join to find overlapping events in a 30,000-row DataFrame, adapted from DuckDB:

url = 'https://raw.githubusercontent.com/samukweku/data-wrangling-blog/master/notebooks/Data_files/results.csv'
events = pd.read_csv(url, parse_dates=['start', 'end']).iloc[:, 1:]
events.head()
id name audience start sponsor end
0 1 Event 1 1178 2022-11-19 10:00:00 Sponsor 2 2022-11-19 10:15:00
1 2 Event 2 1446 2015-09-27 15:00:00 Sponsor 11 2015-09-27 15:11:00
2 3 Event 3 2261 2019-11-12 18:00:00 Sponsor 10 2019-11-12 18:53:00
3 4 Event 4 1471 2019-12-24 22:00:00 Sponsor 6 2019-12-24 22:11:00
4 5 Event 5 2605 2028-06-20 12:00:00 Sponsor 8 2028-06-20 12:31:00
(events
.conditional_join(
    events,
    ('start', 'end', '<='),
    ('end', 'start', '>='),
    ('id', 'id', '!='),
    use_numba = False,
    df_columns = ['id', 'start', 'end'],
    right_columns = ['id', 'start', 'end'])
)
left right
id start end id start end
0 10 1993-11-27 12:00:00 1993-11-27 12:37:00 2345 1993-11-27 10:00:00 1993-11-27 12:00:00
1 15 1993-04-04 16:00:00 1993-04-04 18:00:00 11178 1993-04-04 17:00:00 1993-04-04 17:22:00
2 17 2030-10-25 07:00:00 2030-10-25 07:27:00 19605 2030-10-25 06:00:00 2030-10-25 08:00:00
3 26 2005-10-04 17:00:00 2005-10-04 17:18:00 8218 2005-10-04 17:00:00 2005-10-04 17:27:00
4 35 2024-05-02 15:00:00 2024-05-02 15:35:00 6916 2024-05-02 15:00:00 2024-05-02 15:36:00
... ... ... ... ... ... ...
3697 29966 2000-08-26 11:00:00 2000-08-26 13:00:00 29375 2000-08-26 13:00:00 2000-08-26 13:53:00
3698 29971 2018-05-18 04:00:00 2018-05-18 04:18:00 24173 2018-05-18 04:00:00 2018-05-18 04:36:00
3699 29978 1992-06-07 22:00:00 1992-06-07 22:23:00 981 1992-06-07 22:00:00 1992-06-07 22:30:00
3700 29984 2025-06-05 03:00:00 2025-06-05 03:17:00 19051 2025-06-05 01:00:00 2025-06-05 03:00:00
3701 29995 2016-09-04 14:00:00 2016-09-04 14:32:00 12296 2016-09-04 14:00:00 2016-09-04 14:50:00

3702 rows × 6 columns

The cartesian join takes a very long time - too long for us to test here. The focus for this test will just be on conditional_join.

%%timeit
(events
.conditional_join(
    events,
    ('start', 'end', '<='),
    ('end', 'start', '>='),
    ('id', 'id', '!='),
    use_numba = False,
    df_columns = ['id', 'start', 'end'],
    right_columns = ['id', 'start', 'end'])
)
23.5 ms ± 85 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
(events
.conditional_join(
    events,
    ('start', 'end', '<='),
    ('end', 'start', '>='),
    ('id', 'id', '!='),
    use_numba = True,
    df_columns = ['id', 'start', 'end'],
    right_columns = ['id', 'start', 'end'])
)
28.2 ms ± 2.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Let’s look at performance when an equi join is present:

from string import ascii_lowercase
np.random.seed(1)
n = 20_000; k = 50_000

idx1 = np.random.randint(0, high = 100, size = n)
idx2 = np.random.randint(0, high = 100, size = n)

d1 = dict(x = np.random.choice(list(ascii_lowercase[:5]), size=n),
          start = np.minimum(idx1, idx2),
          end = np.maximum(idx1, idx2))

d2 = dict(x = np.random.choice(list(ascii_lowercase[:15]), size=k),
          pos1 = np.random.randint(low=60, high = 151, size=k))

d1 = pd.DataFrame(d1)
d2 = pd.DataFrame(d2)
out = (d1
       .merge(d2, on = 'x')
       .query('start <= pos1 <= end')
       )
out.shape
(8886840, 4)
out = (d1
       .conditional_join(
            d2, 
            ('start', 'pos1', '<='), 
            ('end', 'pos1', '>='), 
            ('x', 'x', '=='), 
            use_numba=False)
    )
out.shape
(8886840, 5)

About 9 million rows are returned; let’s see how fast each function runs:

%%timeit
(d1
.merge(d2, on = 'x')
.query('start <= pos1 <= end')
)
3.06 s ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(d1
.conditional_join(
    d2, 
    ('start', 'pos1', '<='), 
    ('end', 'pos1', '>='), 
    ('x', 'x', '=='), 
    use_numba=False)
)
2.21 s ± 9.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%memit d1.merge(d2, on = 'x').query('start <= pos1 <= end')
peak memory: 5868.17 MiB, increment: 4803.89 MiB
%memit d1.conditional_join(d2, ('start', 'pos1', '<='), ('end', 'pos1', '>='), ('x', 'x', '=='), use_numba=False)
peak memory: 4137.93 MiB, increment: 3073.66 MiB

conditional_join is faster here, though this might not be the case all the time.

Depending on the distribution of data, using numba where an equi-join is present can give significant performance speedups:

from itertools import count
mapper = dict(zip(ascii_lowercase, count()))
d1['xx'] = d1.x.map(mapper)
d2['xx'] = d2.x.map(mapper)
d1.head()
x start end xx
0 d 37 82 3
1 c 2 12 2
2 b 46 72 1
3 b 7 9 1
4 c 39 75 2
d2.head()
x pos1 xx
0 c 143 2
1 i 85 8
2 k 121 10
3 m 68 12
4 o 97 14
%%timeit
(d1
.merge(d2, on = 'xx')
.query('start <= pos1 <= end')
)
4.72 s ± 22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(d1
.conditional_join(
    d2, 
    ('start', 'pos1', '<='), 
    ('end', 'pos1', '>='), 
    ('xx', 'xx', '=='), 
    use_numba=False)
)
2.25 s ± 8.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(d1
.conditional_join(
    d2, 
    ('start', 'pos1', '<='), 
    ('end', 'pos1', '>='), 
    ('xx', 'xx', '=='), 
    use_numba=True)
)
281 ms ± 5.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The above numba implementation depends on a number of conditions:

  • Only a single equi join is supported

  • The columns in the equi join should be either numeric or datetime data types

  • The columns in the equi join should be heavily duplicated - a small number of unique values spread over a large number of rows

Summary#

This blog post shows how to efficiently join on non-equi conditions, using conditional_join. If you are comfortable with SQL, you could get more performance with DuckDB, which allows querying Pandas DataFrames efficiently with SQL, supports non-equi joins as well, and has a performant implementation for range joins.
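
For instance, here is a rough sketch of the earlier east/west range join in DuckDB - assuming duckdb is installed (pip install duckdb); DuckDB can query in-scope pandas DataFrames by name:

import duckdb

duckdb.sql(
    """
    SELECT e.id, e.dur, e.rev, w.t_id, w."time", w.cost
    FROM east e, west w
    WHERE e.dur < w."time" AND e.rev > w.cost
    """
).df()

This should return the same single matching row as the conditional_join range-join example above.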