Read Multiple CSV Files into one Frame in Python#
Link to Source data#
Pandas#
Via read_csv#
import pandas as pd
from pathlib import Path
# create a Path instance and filter for only csv files
files = Path("./Data_files/multiple_csvs/").rglob("*.csv")
# read in all the csv files
all_csvs = [pd.read_csv(file) for file in files]
# lump into one table
all_csvs = pd.concat(all_csvs)
all_csvs
City | Date | Sales | StoreID | |
---|---|---|---|---|
0 | Tacoma | 1/1/2014 | $17.30 | 1522 |
1 | Tacoma | 1/1/2014 | $15.77 | 1520 |
2 | Tacoma | 1/1/2014 | $151.36 | 1520 |
3 | Tacoma | 1/1/2014 | $168.47 | 1522 |
4 | Tacoma | 1/1/2014 | $150.71 | 1520 |
... | ... | ... | ... | ... |
223593 | Portland | 2/3/2015 | $41.03 | 1004 |
223594 | Portland | 2/3/2015 | $162.54 | 1005 |
223595 | Portland | 2/3/2015 | $38.38 | 1002 |
223596 | Portland | 2/3/2015 | $96.40 | 1003 |
223597 | Portland | 2/3/2015 | $8.38 | 1005 |
880349 rows × 4 columns
Via Pyjanitor’s read_csvs#
The above step can be simplified with pyjanitor’s read_csvs function:
# pip install pyjanitor
from janitor import read_csvs
files = Path("./Data_files/multiple_csvs/").rglob("*.csv")
df = read_csvs(files)
df
City | Date | Sales | StoreID | |
---|---|---|---|---|
0 | Tacoma | 1/1/2014 | $17.30 | 1522 |
1 | Tacoma | 1/1/2014 | $15.77 | 1520 |
2 | Tacoma | 1/1/2014 | $151.36 | 1520 |
3 | Tacoma | 1/1/2014 | $168.47 | 1522 |
4 | Tacoma | 1/1/2014 | $150.71 | 1520 |
... | ... | ... | ... | ... |
880344 | Portland | 2/3/2015 | $41.03 | 1004 |
880345 | Portland | 2/3/2015 | $162.54 | 1005 |
880346 | Portland | 2/3/2015 | $38.38 | 1002 |
880347 | Portland | 2/3/2015 | $96.40 | 1003 |
880348 | Portland | 2/3/2015 | $8.38 | 1005 |
880349 rows × 4 columns
Via read_csv and the command line#
import subprocess
from io import StringIO
data = subprocess.run("awk '(NR==1) || (FNR>1)' ./Data_files/multiple_csvs/*.csv",
shell=True,
capture_output=True,
text=True).stdout
df = pd.read_csv(StringIO(data))
df
City | Date | Sales | StoreID | |
---|---|---|---|---|
0 | Oakland | 1/1/2014 | $9.83 | 982 |
1 | Oakland | 1/1/2014 | $28.18 | 983 |
2 | Oakland | 1/1/2014 | $6.83 | 982 |
3 | Oakland | 1/1/2014 | $43.90 | 982 |
4 | Oakland | 1/1/2014 | $17.16 | 980 |
... | ... | ... | ... | ... |
880344 | Tacoma | 12/31/2015 | $33.68 | 1523 |
880345 | Tacoma | 12/31/2015 | $215.98 | 1521 |
880346 | Tacoma | 12/31/2015 | $236.86 | 1521 |
880347 | Tacoma | 12/31/2015 | $33.02 | 1522 |
880348 | Tacoma | 12/31/2015 | $11.32 | 1523 |
880349 rows × 4 columns
Via Pyjanitor’s read_commandline#
Reading via the command line can be simplified via pyjanitor’s read_commandline function:
# pip install pyjanitor
from janitor import read_commandline
cmd = "awk '(NR==1) || (FNR>1)' ./Data_files/multiple_csvs/*.csv"
df = read_commandline(cmd)
df
City | Date | Sales | StoreID | |
---|---|---|---|---|
0 | Oakland | 1/1/2014 | $9.83 | 982 |
1 | Oakland | 1/1/2014 | $28.18 | 983 |
2 | Oakland | 1/1/2014 | $6.83 | 982 |
3 | Oakland | 1/1/2014 | $43.90 | 982 |
4 | Oakland | 1/1/2014 | $17.16 | 980 |
... | ... | ... | ... | ... |
880344 | Tacoma | 12/31/2015 | $33.68 | 1523 |
880345 | Tacoma | 12/31/2015 | $215.98 | 1521 |
880346 | Tacoma | 12/31/2015 | $236.86 | 1521 |
880347 | Tacoma | 12/31/2015 | $33.02 | 1522 |
880348 | Tacoma | 12/31/2015 | $11.32 | 1523 |
880349 rows × 4 columns
datatable#
Via iread#
from datatable import fread, iread, rbind
files = Path("./Data_files/multiple_csvs/").rglob("*.csv")
# iread returns an iterator of all files
all_csvs = iread(tuple(files))
# combine into one table with rbind
all_csvs = rbind(all_csvs)
all_csvs
/Users/samuel.oranyeli/mambaforge/envs/blogger/lib/python3.10/site-packages/datatable/__init__.py:181: DeprecationWarning: Importing display from IPython.core.display is deprecated since IPython 7.14, please import from IPython display
init_styles()
/Users/samuel.oranyeli/mambaforge/envs/blogger/lib/python3.10/site-packages/datatable/__init__.py:181: DeprecationWarning: Importing update_display from IPython.core.display is deprecated since IPython 7.14, please import from IPython display
init_styles()
City | Date | Sales | StoreID | |
---|---|---|---|---|
▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪ | |
0 | Tacoma | 1/1/2014 | $17.30 | 1522 |
1 | Tacoma | 1/1/2014 | $15.77 | 1520 |
2 | Tacoma | 1/1/2014 | $151.36 | 1520 |
3 | Tacoma | 1/1/2014 | $168.47 | 1522 |
4 | Tacoma | 1/1/2014 | $150.71 | 1520 |
5 | Tacoma | 1/1/2014 | $22.86 | 1523 |
6 | Tacoma | 1/1/2014 | $18.66 | 1521 |
7 | Tacoma | 1/1/2014 | $39.76 | 1520 |
8 | Tacoma | 1/1/2014 | $132.75 | 1523 |
9 | Tacoma | 1/1/2014 | $39.58 | 1521 |
10 | Tacoma | 1/1/2014 | $159.14 | 1523 |
11 | Tacoma | 1/1/2014 | $33.29 | 1522 |
12 | Tacoma | 1/1/2014 | $38.40 | 1522 |
13 | Tacoma | 1/1/2014 | $33.46 | 1520 |
14 | Tacoma | 1/1/2014 | $23.85 | 1522 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
880,344 | Portland | 2/3/2015 | $41.03 | 1004 |
880,345 | Portland | 2/3/2015 | $162.54 | 1005 |
880,346 | Portland | 2/3/2015 | $38.38 | 1002 |
880,347 | Portland | 2/3/2015 | $96.40 | 1003 |
880,348 | Portland | 2/3/2015 | $8.38 | 1005 |
Via fread#
files = Path("./Data_files/multiple_csvs/").rglob("*.csv")
all_csvs = [fread(filename) for filename in files]
# combine into one frame
rbind(all_csvs)
City | Date | Sales | StoreID | |
---|---|---|---|---|
▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪ | |
0 | Tacoma | 1/1/2014 | $17.30 | 1522 |
1 | Tacoma | 1/1/2014 | $15.77 | 1520 |
2 | Tacoma | 1/1/2014 | $151.36 | 1520 |
3 | Tacoma | 1/1/2014 | $168.47 | 1522 |
4 | Tacoma | 1/1/2014 | $150.71 | 1520 |
5 | Tacoma | 1/1/2014 | $22.86 | 1523 |
6 | Tacoma | 1/1/2014 | $18.66 | 1521 |
7 | Tacoma | 1/1/2014 | $39.76 | 1520 |
8 | Tacoma | 1/1/2014 | $132.75 | 1523 |
9 | Tacoma | 1/1/2014 | $39.58 | 1521 |
10 | Tacoma | 1/1/2014 | $159.14 | 1523 |
11 | Tacoma | 1/1/2014 | $33.29 | 1522 |
12 | Tacoma | 1/1/2014 | $38.40 | 1522 |
13 | Tacoma | 1/1/2014 | $33.46 | 1520 |
14 | Tacoma | 1/1/2014 | $23.85 | 1522 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
880,344 | Portland | 2/3/2015 | $41.03 | 1004 |
880,345 | Portland | 2/3/2015 | $162.54 | 1005 |
880,346 | Portland | 2/3/2015 | $38.38 | 1002 |
880,347 | Portland | 2/3/2015 | $96.40 | 1003 |
880,348 | Portland | 2/3/2015 | $8.38 | 1005 |
Via fread and the command line#
# less verbose than the Pandas option
cmd = "awk '(NR==1) || (FNR>1)' ./Data_files/multiple_csvs/*.csv"
all_csvs = fread(cmd = cmd)
all_csvs
City | Date | Sales | StoreID | |
---|---|---|---|---|
▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪ | |
0 | Oakland | 1/1/2014 | $9.83 | 982 |
1 | Oakland | 1/1/2014 | $28.18 | 983 |
2 | Oakland | 1/1/2014 | $6.83 | 982 |
3 | Oakland | 1/1/2014 | $43.90 | 982 |
4 | Oakland | 1/1/2014 | $17.16 | 980 |
5 | Oakland | 1/1/2014 | $14.29 | 982 |
6 | Oakland | 1/1/2014 | $6.48 | 982 |
7 | Oakland | 1/1/2014 | $232.13 | 982 |
8 | Oakland | 1/1/2014 | $209.28 | 981 |
9 | Oakland | 1/1/2014 | $12.11 | 981 |
10 | Oakland | 1/1/2014 | $107.96 | 982 |
11 | Oakland | 1/1/2014 | $47.64 | 980 |
12 | Oakland | 1/1/2014 | $15.44 | 981 |
13 | Oakland | 1/1/2014 | $52.72 | 982 |
14 | Oakland | 1/1/2014 | $3.25 | 981 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
880,344 | Tacoma | 12/31/2015 | $33.68 | 1523 |
880,345 | Tacoma | 12/31/2015 | $215.98 | 1521 |
880,346 | Tacoma | 12/31/2015 | $236.86 | 1521 |
880,347 | Tacoma | 12/31/2015 | $33.02 | 1522 |
880,348 | Tacoma | 12/31/2015 | $11.32 | 1523 |
Polars#
import polars as pl
files = Path("./Data_files/multiple_csvs/*.csv")
pl.read_csv(files)
shape: (880_349, 4)
City | Date | Sales | StoreID |
---|---|---|---|
str | str | str | i64 |
"Oakland" | "1/1/2014" | "$9.83 " | 982 |
"Oakland" | "1/1/2014" | "$28.18 " | 983 |
"Oakland" | "1/1/2014" | "$6.83 " | 982 |
"Oakland" | "1/1/2014" | "$43.90 " | 982 |
"Oakland" | "1/1/2014" | "$17.16 " | 980 |
… | … | … | … |
"Tacoma" | "12/31/2015" | "$33.68 " | 1523 |
"Tacoma" | "12/31/2015" | "$215.98 " | 1521 |
"Tacoma" | "12/31/2015" | "$236.86 " | 1521 |
"Tacoma" | "12/31/2015" | "$33.02 " | 1522 |
"Tacoma" | "12/31/2015" | "$11.32 " | 1523 |
Resources Used#
datatable - fread
datatable - iread
pandas - read_csv
polars - read_csv
python - subprocess
python - pathlib
StackOverflow - read csv files via the command line
Comments#