Read Multiple CSV Files into one Frame in Python#

Pandas#

Via read_csv#

import pandas as pd
from pathlib import Path

# create a Path instance and filter for only csv files
files = Path("./Data_files/multiple_csvs/").rglob("*.csv")

# read in all the csv files
all_csvs = [pd.read_csv(file) for file in files]

# lump into one table
all_csvs = pd.concat(all_csvs)

all_csvs
City Date Sales StoreID
0 Oakland 1/1/2014 $9.83 982
1 Oakland 1/1/2014 $28.18 983
2 Oakland 1/1/2014 $6.83 982
3 Oakland 1/1/2014 $43.90 982
4 Oakland 1/1/2014 $17.16 980
... ... ... ... ...
223593 Portland 2/3/2015 $41.03 1004
223594 Portland 2/3/2015 $162.54 1005
223595 Portland 2/3/2015 $38.38 1002
223596 Portland 2/3/2015 $96.40 1003
223597 Portland 2/3/2015 $8.38 1005

880349 rows × 4 columns

Via Pyjanitor’s read_csvs#

The above step can be simplified with pyjanitor’s read_csvs function:

# pip install pyjanitor
from janitor import read_csvs

files = Path("./Data_files/multiple_csvs/").rglob("*.csv")
df = read_csvs(files)
df
City Date Sales StoreID
0 Oakland 1/1/2014 $9.83 982
1 Oakland 1/1/2014 $28.18 983
2 Oakland 1/1/2014 $6.83 982
3 Oakland 1/1/2014 $43.90 982
4 Oakland 1/1/2014 $17.16 980
... ... ... ... ...
880344 Portland 2/3/2015 $41.03 1004
880345 Portland 2/3/2015 $162.54 1005
880346 Portland 2/3/2015 $38.38 1002
880347 Portland 2/3/2015 $96.40 1003
880348 Portland 2/3/2015 $8.38 1005

880349 rows × 4 columns

Via read_csv and the command line#

import subprocess
from io import StringIO

data = subprocess.run("awk '(NR==1) || (FNR>1)' ./Data_files/multiple_csvs/*.csv", 
                      shell=True, 
                      capture_output=True,
                      text=True).stdout

df = pd.read_csv(StringIO(data))

df
City Date Sales StoreID
0 Oakland 1/1/2014 $9.83 982
1 Oakland 1/1/2014 $28.18 983
2 Oakland 1/1/2014 $6.83 982
3 Oakland 1/1/2014 $43.90 982
4 Oakland 1/1/2014 $17.16 980
... ... ... ... ...
880344 Tacoma 12/31/2015 $33.68 1523
880345 Tacoma 12/31/2015 $215.98 1521
880346 Tacoma 12/31/2015 $236.86 1521
880347 Tacoma 12/31/2015 $33.02 1522
880348 Tacoma 12/31/2015 $11.32 1523

880349 rows × 4 columns

Via Pyjanitor’s read_commandline#

Reading via the command line can be simplified via pyjanitor’s read_commandline function:

# pip install pyjanitor
from janitor import read_commandline

cmd = "awk '(NR==1) || (FNR>1)' ./Data_files/multiple_csvs/*.csv"
df = read_commandline(cmd)
df
City Date Sales StoreID
0 Oakland 1/1/2014 $9.83 982
1 Oakland 1/1/2014 $28.18 983
2 Oakland 1/1/2014 $6.83 982
3 Oakland 1/1/2014 $43.90 982
4 Oakland 1/1/2014 $17.16 980
... ... ... ... ...
880344 Tacoma 12/31/2015 $33.68 1523
880345 Tacoma 12/31/2015 $215.98 1521
880346 Tacoma 12/31/2015 $236.86 1521
880347 Tacoma 12/31/2015 $33.02 1522
880348 Tacoma 12/31/2015 $11.32 1523

880349 rows × 4 columns

datatable#

Via iread#

from datatable import fread, iread, rbind

files = Path("./Data_files/multiple_csvs/").rglob("*.csv")

# iread returns an iterator of all files
all_csvs = iread(tuple(files))

# combine into one table with rbind
all_csvs = rbind(all_csvs)

all_csvs
/home/sam/mambaforge/envs/jupyterbook/lib/python3.9/site-packages/datatable/__init__.py:181: DeprecationWarning: Importing display from IPython.core.display is deprecated since IPython 7.14, please import from IPython display
  init_styles()
/home/sam/mambaforge/envs/jupyterbook/lib/python3.9/site-packages/datatable/__init__.py:181: DeprecationWarning: Importing update_display from IPython.core.display is deprecated since IPython 7.14, please import from IPython display
  init_styles()
CityDateSalesStoreID
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0Oakland1/1/2014$9.83982
1Oakland1/1/2014$28.18983
2Oakland1/1/2014$6.83982
3Oakland1/1/2014$43.90982
4Oakland1/1/2014$17.16980
5Oakland1/1/2014$14.29982
6Oakland1/1/2014$6.48982
7Oakland1/1/2014$232.13982
8Oakland1/1/2014$209.28981
9Oakland1/1/2014$12.11981
10Oakland1/1/2014$107.96982
11Oakland1/1/2014$47.64980
12Oakland1/1/2014$15.44981
13Oakland1/1/2014$52.72982
14Oakland1/1/2014$3.25981
880,344Portland2/3/2015$41.031004
880,345Portland2/3/2015$162.541005
880,346Portland2/3/2015$38.381002
880,347Portland2/3/2015$96.401003
880,348Portland2/3/2015$8.381005

Via fread#

files = Path("./Data_files/multiple_csvs/").rglob("*.csv")

all_csvs = [fread(filename) for filename in files]

# combine into one frame
rbind(all_csvs)
CityDateSalesStoreID
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0Oakland1/1/2014$9.83982
1Oakland1/1/2014$28.18983
2Oakland1/1/2014$6.83982
3Oakland1/1/2014$43.90982
4Oakland1/1/2014$17.16980
5Oakland1/1/2014$14.29982
6Oakland1/1/2014$6.48982
7Oakland1/1/2014$232.13982
8Oakland1/1/2014$209.28981
9Oakland1/1/2014$12.11981
10Oakland1/1/2014$107.96982
11Oakland1/1/2014$47.64980
12Oakland1/1/2014$15.44981
13Oakland1/1/2014$52.72982
14Oakland1/1/2014$3.25981
880,344Portland2/3/2015$41.031004
880,345Portland2/3/2015$162.541005
880,346Portland2/3/2015$38.381002
880,347Portland2/3/2015$96.401003
880,348Portland2/3/2015$8.381005

Via fread and the command line#

# less verbose than the Pandas option
cmd = "awk '(NR==1) || (FNR>1)' ./Data_files/multiple_csvs/*.csv"
all_csvs = fread(cmd = cmd)
all_csvs
CityDateSalesStoreID
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0Oakland1/1/2014$9.83982
1Oakland1/1/2014$28.18983
2Oakland1/1/2014$6.83982
3Oakland1/1/2014$43.90982
4Oakland1/1/2014$17.16980
5Oakland1/1/2014$14.29982
6Oakland1/1/2014$6.48982
7Oakland1/1/2014$232.13982
8Oakland1/1/2014$209.28981
9Oakland1/1/2014$12.11981
10Oakland1/1/2014$107.96982
11Oakland1/1/2014$47.64980
12Oakland1/1/2014$15.44981
13Oakland1/1/2014$52.72982
14Oakland1/1/2014$3.25981
880,344Tacoma12/31/2015$33.681523
880,345Tacoma12/31/2015$215.981521
880,346Tacoma12/31/2015$236.861521
880,347Tacoma12/31/2015$33.021522
880,348Tacoma12/31/2015$11.321523

Resources Used#

datatable - fread

datatable - iread

pandas - read_csv

python - subprocess

python - pathlib

StackOverflow - read csv files via the command line