Read Multiple CSV Files into one Frame in Python

Read Multiple CSV Files into one Frame in Python#

Link to Source data #

Pandas#

Via read_csv#

import pandas as pd
from pathlib import Path

# create a Path instance and filter for only csv files
files = Path("./Data_files/multiple_csvs/").rglob("*.csv")

# read in all the csv files
all_csvs = [pd.read_csv(file) for file in files]

# lump into one table
all_csvs = pd.concat(all_csvs)

all_csvs

	City	Date	Sales	StoreID
0	Tacoma	1/1/2014	$17.30	1522
1	Tacoma	1/1/2014	$15.77	1520
2	Tacoma	1/1/2014	$151.36	1520
3	Tacoma	1/1/2014	$168.47	1522
4	Tacoma	1/1/2014	$150.71	1520
...	...	...	...	...
223593	Portland	2/3/2015	$41.03	1004
223594	Portland	2/3/2015	$162.54	1005
223595	Portland	2/3/2015	$38.38	1002
223596	Portland	2/3/2015	$96.40	1003
223597	Portland	2/3/2015	$8.38	1005

880349 rows × 4 columns

Via Pyjanitor’s read_csvs#

The above step can be simplified with pyjanitor’s read_csvs function:

# pip install pyjanitor
from janitor import read_csvs

files = Path("./Data_files/multiple_csvs/").rglob("*.csv")
df = read_csvs(files)
df

	City	Date	Sales	StoreID
0	Tacoma	1/1/2014	$17.30	1522
1	Tacoma	1/1/2014	$15.77	1520
2	Tacoma	1/1/2014	$151.36	1520
3	Tacoma	1/1/2014	$168.47	1522
4	Tacoma	1/1/2014	$150.71	1520
...	...	...	...	...
880344	Portland	2/3/2015	$41.03	1004
880345	Portland	2/3/2015	$162.54	1005
880346	Portland	2/3/2015	$38.38	1002
880347	Portland	2/3/2015	$96.40	1003
880348	Portland	2/3/2015	$8.38	1005

880349 rows × 4 columns

Via read_csv and the command line#

import subprocess
from io import StringIO

data = subprocess.run("awk '(NR==1) || (FNR>1)' ./Data_files/multiple_csvs/*.csv", 
                      shell=True, 
                      capture_output=True,
                      text=True).stdout

df = pd.read_csv(StringIO(data))

df

	City	Date	Sales	StoreID
0	Oakland	1/1/2014	$9.83	982
1	Oakland	1/1/2014	$28.18	983
2	Oakland	1/1/2014	$6.83	982
3	Oakland	1/1/2014	$43.90	982
4	Oakland	1/1/2014	$17.16	980
...	...	...	...	...
880344	Tacoma	12/31/2015	$33.68	1523
880345	Tacoma	12/31/2015	$215.98	1521
880346	Tacoma	12/31/2015	$236.86	1521
880347	Tacoma	12/31/2015	$33.02	1522
880348	Tacoma	12/31/2015	$11.32	1523

880349 rows × 4 columns

Via Pyjanitor’s read_commandline#

Reading via the command line can be simplified via pyjanitor’s read_commandline function:

# pip install pyjanitor
from janitor import read_commandline

cmd = "awk '(NR==1) || (FNR>1)' ./Data_files/multiple_csvs/*.csv"
df = read_commandline(cmd)
df

	City	Date	Sales	StoreID
0	Oakland	1/1/2014	$9.83	982
1	Oakland	1/1/2014	$28.18	983
2	Oakland	1/1/2014	$6.83	982
3	Oakland	1/1/2014	$43.90	982
4	Oakland	1/1/2014	$17.16	980
...	...	...	...	...
880344	Tacoma	12/31/2015	$33.68	1523
880345	Tacoma	12/31/2015	$215.98	1521
880346	Tacoma	12/31/2015	$236.86	1521
880347	Tacoma	12/31/2015	$33.02	1522
880348	Tacoma	12/31/2015	$11.32	1523

880349 rows × 4 columns

datatable#

Via iread#

from datatable import fread, iread, rbind

files = Path("./Data_files/multiple_csvs/").rglob("*.csv")

# iread returns an iterator of all files
all_csvs = iread(tuple(files))

# combine into one table with rbind
all_csvs = rbind(all_csvs)

all_csvs

/Users/samuel.oranyeli/mambaforge/envs/blogger/lib/python3.10/site-packages/datatable/__init__.py:181: DeprecationWarning: Importing display from IPython.core.display is deprecated since IPython 7.14, please import from IPython display
  init_styles()
/Users/samuel.oranyeli/mambaforge/envs/blogger/lib/python3.10/site-packages/datatable/__init__.py:181: DeprecationWarning: Importing update_display from IPython.core.display is deprecated since IPython 7.14, please import from IPython display
  init_styles()

	City	Date	Sales	StoreID
	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪
0	Tacoma	1/1/2014	$17.30	1522
1	Tacoma	1/1/2014	$15.77	1520
2	Tacoma	1/1/2014	$151.36	1520
3	Tacoma	1/1/2014	$168.47	1522
4	Tacoma	1/1/2014	$150.71	1520
5	Tacoma	1/1/2014	$22.86	1523
6	Tacoma	1/1/2014	$18.66	1521
7	Tacoma	1/1/2014	$39.76	1520
8	Tacoma	1/1/2014	$132.75	1523
9	Tacoma	1/1/2014	$39.58	1521
10	Tacoma	1/1/2014	$159.14	1523
11	Tacoma	1/1/2014	$33.29	1522
12	Tacoma	1/1/2014	$38.40	1522
13	Tacoma	1/1/2014	$33.46	1520
14	Tacoma	1/1/2014	$23.85	1522
⋮	⋮	⋮	⋮	⋮
880,344	Portland	2/3/2015	$41.03	1004
880,345	Portland	2/3/2015	$162.54	1005
880,346	Portland	2/3/2015	$38.38	1002
880,347	Portland	2/3/2015	$96.40	1003
880,348	Portland	2/3/2015	$8.38	1005

Via fread#

files = Path("./Data_files/multiple_csvs/").rglob("*.csv")

all_csvs = [fread(filename) for filename in files]

# combine into one frame
rbind(all_csvs)

	City	Date	Sales	StoreID
	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪
0	Tacoma	1/1/2014	$17.30	1522
1	Tacoma	1/1/2014	$15.77	1520
2	Tacoma	1/1/2014	$151.36	1520
3	Tacoma	1/1/2014	$168.47	1522
4	Tacoma	1/1/2014	$150.71	1520
5	Tacoma	1/1/2014	$22.86	1523
6	Tacoma	1/1/2014	$18.66	1521
7	Tacoma	1/1/2014	$39.76	1520
8	Tacoma	1/1/2014	$132.75	1523
9	Tacoma	1/1/2014	$39.58	1521
10	Tacoma	1/1/2014	$159.14	1523
11	Tacoma	1/1/2014	$33.29	1522
12	Tacoma	1/1/2014	$38.40	1522
13	Tacoma	1/1/2014	$33.46	1520
14	Tacoma	1/1/2014	$23.85	1522
⋮	⋮	⋮	⋮	⋮
880,344	Portland	2/3/2015	$41.03	1004
880,345	Portland	2/3/2015	$162.54	1005
880,346	Portland	2/3/2015	$38.38	1002
880,347	Portland	2/3/2015	$96.40	1003
880,348	Portland	2/3/2015	$8.38	1005

Via fread and the command line#

# less verbose than the Pandas option
cmd = "awk '(NR==1) || (FNR>1)' ./Data_files/multiple_csvs/*.csv"
all_csvs = fread(cmd = cmd)
all_csvs

	City	Date	Sales	StoreID
	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪
0	Oakland	1/1/2014	$9.83	982
1	Oakland	1/1/2014	$28.18	983
2	Oakland	1/1/2014	$6.83	982
3	Oakland	1/1/2014	$43.90	982
4	Oakland	1/1/2014	$17.16	980
5	Oakland	1/1/2014	$14.29	982
6	Oakland	1/1/2014	$6.48	982
7	Oakland	1/1/2014	$232.13	982
8	Oakland	1/1/2014	$209.28	981
9	Oakland	1/1/2014	$12.11	981
10	Oakland	1/1/2014	$107.96	982
11	Oakland	1/1/2014	$47.64	980
12	Oakland	1/1/2014	$15.44	981
13	Oakland	1/1/2014	$52.72	982
14	Oakland	1/1/2014	$3.25	981
⋮	⋮	⋮	⋮	⋮
880,344	Tacoma	12/31/2015	$33.68	1523
880,345	Tacoma	12/31/2015	$215.98	1521
880,346	Tacoma	12/31/2015	$236.86	1521
880,347	Tacoma	12/31/2015	$33.02	1522
880,348	Tacoma	12/31/2015	$11.32	1523

Polars#

import polars as pl
files = Path("./Data_files/multiple_csvs/*.csv")
pl.read_csv(files)

shape: (880_349, 4)

City	Date	Sales	StoreID
str	str	str	i64
"Oakland"	"1/1/2014"	"$9.83 "	982
"Oakland"	"1/1/2014"	"$28.18 "	983
"Oakland"	"1/1/2014"	"$6.83 "	982
"Oakland"	"1/1/2014"	"$43.90 "	982
"Oakland"	"1/1/2014"	"$17.16 "	980
…	…	…	…
"Tacoma"	"12/31/2015"	"$33.68 "	1523
"Tacoma"	"12/31/2015"	"$215.98 "	1521
"Tacoma"	"12/31/2015"	"$236.86 "	1521
"Tacoma"	"12/31/2015"	"$33.02 "	1522
"Tacoma"	"12/31/2015"	"$11.32 "	1523

Resources Used#

datatable - fread

datatable - iread

pandas - read_csv

polars - read_csv

python - subprocess

python - pathlib

StackOverflow - read csv files via the command line

Read Multiple CSV Files into one Frame in Python

Contents

Read Multiple CSV Files into one Frame in Python#

Link to Source data #

Pandas#

Via read_csv#

Via Pyjanitor’s read_csvs#

Via read_csv and the command line#

Via Pyjanitor’s read_commandline#

datatable#

Via iread#

Via fread#

Via fread and the command line#

Polars#

Resources Used#

Comments#

Read Multiple CSV Files into one Frame in Python

Contents

Read Multiple CSV Files into one Frame in Python#

Link to Source data#

Pandas#

Via read_csv#

Via Pyjanitor’s read_csvs#

Via read_csv and the command line#

Via Pyjanitor’s read_commandline#

datatable#

Via iread#

Via fread#

Via fread and the command line#

Polars#

Resources Used#

Comments#

Link to Source data #