{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Read Multiple CSV Files into one Frame in Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### [Link to Source data](https://people.highline.edu/mgirvin/excelisfun.htm)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Pandas**" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-10-31T01:26:23.474681Z", "iopub.status.busy": "2020-10-31T01:26:23.474464Z", "iopub.status.idle": "2020-10-31T01:26:23.477682Z", "shell.execute_reply": "2020-10-31T01:26:23.477047Z", "shell.execute_reply.started": "2020-10-31T01:26:23.474658Z" } }, "source": [ "### Via read_csv" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CityDateSalesStoreID
0Tacoma1/1/2014$17.301522
1Tacoma1/1/2014$15.771520
2Tacoma1/1/2014$151.361520
3Tacoma1/1/2014$168.471522
4Tacoma1/1/2014$150.711520
...............
223593Portland2/3/2015$41.031004
223594Portland2/3/2015$162.541005
223595Portland2/3/2015$38.381002
223596Portland2/3/2015$96.401003
223597Portland2/3/2015$8.381005
\n", "

880349 rows × 4 columns

\n", "
" ], "text/plain": [ " City Date Sales StoreID\n", "0 Tacoma 1/1/2014 $17.30 1522\n", "1 Tacoma 1/1/2014 $15.77 1520\n", "2 Tacoma 1/1/2014 $151.36 1520\n", "3 Tacoma 1/1/2014 $168.47 1522\n", "4 Tacoma 1/1/2014 $150.71 1520\n", "... ... ... ... ...\n", "223593 Portland 2/3/2015 $41.03 1004\n", "223594 Portland 2/3/2015 $162.54 1005\n", "223595 Portland 2/3/2015 $38.38 1002\n", "223596 Portland 2/3/2015 $96.40 1003\n", "223597 Portland 2/3/2015 $8.38 1005\n", "\n", "[880349 rows x 4 columns]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "from pathlib import Path\n", "\n", "# create a Path instance and filter for only csv files\n", "files = Path(\"./Data_files/multiple_csvs/\").rglob(\"*.csv\")\n", "\n", "# read in all the csv files\n", "all_csvs = [pd.read_csv(file) for file in files]\n", "\n", "# lump into one table\n", "all_csvs = pd.concat(all_csvs)\n", "\n", "all_csvs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Via Pyjanitor's read_csvs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above step can be simplified with [pyjanitor's](https://pyjanitor-devs.github.io/pyjanitor/) [read_csvs](https://pyjanitor-devs.github.io/pyjanitor/api/io/#janitor.io.read_csvs) function:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CityDateSalesStoreID
0Tacoma1/1/2014$17.301522
1Tacoma1/1/2014$15.771520
2Tacoma1/1/2014$151.361520
3Tacoma1/1/2014$168.471522
4Tacoma1/1/2014$150.711520
...............
880344Portland2/3/2015$41.031004
880345Portland2/3/2015$162.541005
880346Portland2/3/2015$38.381002
880347Portland2/3/2015$96.401003
880348Portland2/3/2015$8.381005
\n", "

880349 rows × 4 columns

\n", "
" ], "text/plain": [ " City Date Sales StoreID\n", "0 Tacoma 1/1/2014 $17.30 1522\n", "1 Tacoma 1/1/2014 $15.77 1520\n", "2 Tacoma 1/1/2014 $151.36 1520\n", "3 Tacoma 1/1/2014 $168.47 1522\n", "4 Tacoma 1/1/2014 $150.71 1520\n", "... ... ... ... ...\n", "880344 Portland 2/3/2015 $41.03 1004\n", "880345 Portland 2/3/2015 $162.54 1005\n", "880346 Portland 2/3/2015 $38.38 1002\n", "880347 Portland 2/3/2015 $96.40 1003\n", "880348 Portland 2/3/2015 $8.38 1005\n", "\n", "[880349 rows x 4 columns]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# pip install pyjanitor\n", "from janitor import read_csvs\n", "\n", "files = Path(\"./Data_files/multiple_csvs/\").rglob(\"*.csv\")\n", "df = read_csvs(files)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Via read_csv and the command line" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CityDateSalesStoreID
0Oakland1/1/2014$9.83982
1Oakland1/1/2014$28.18983
2Oakland1/1/2014$6.83982
3Oakland1/1/2014$43.90982
4Oakland1/1/2014$17.16980
...............
880344Tacoma12/31/2015$33.681523
880345Tacoma12/31/2015$215.981521
880346Tacoma12/31/2015$236.861521
880347Tacoma12/31/2015$33.021522
880348Tacoma12/31/2015$11.321523
\n", "

880349 rows × 4 columns

\n", "
" ], "text/plain": [ " City Date Sales StoreID\n", "0 Oakland 1/1/2014 $9.83 982\n", "1 Oakland 1/1/2014 $28.18 983\n", "2 Oakland 1/1/2014 $6.83 982\n", "3 Oakland 1/1/2014 $43.90 982\n", "4 Oakland 1/1/2014 $17.16 980\n", "... ... ... ... ...\n", "880344 Tacoma 12/31/2015 $33.68 1523\n", "880345 Tacoma 12/31/2015 $215.98 1521\n", "880346 Tacoma 12/31/2015 $236.86 1521\n", "880347 Tacoma 12/31/2015 $33.02 1522\n", "880348 Tacoma 12/31/2015 $11.32 1523\n", "\n", "[880349 rows x 4 columns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import subprocess\n", "from io import StringIO\n", "\n", "data = subprocess.run(\"awk '(NR==1) || (FNR>1)' ./Data_files/multiple_csvs/*.csv\", \n", " shell=True, \n", " capture_output=True,\n", " text=True).stdout\n", "\n", "df = pd.read_csv(StringIO(data))\n", "\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Via Pyjanitor's read_commandline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reading via the command line can be simplified via [pyjanitor's](https://pyjanitor-devs.github.io/pyjanitor/) [read_commandline](https://pyjanitor-devs.github.io/pyjanitor/api/io/#janitor.io.read_commandline) function:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CityDateSalesStoreID
0Oakland1/1/2014$9.83982
1Oakland1/1/2014$28.18983
2Oakland1/1/2014$6.83982
3Oakland1/1/2014$43.90982
4Oakland1/1/2014$17.16980
...............
880344Tacoma12/31/2015$33.681523
880345Tacoma12/31/2015$215.981521
880346Tacoma12/31/2015$236.861521
880347Tacoma12/31/2015$33.021522
880348Tacoma12/31/2015$11.321523
\n", "

880349 rows × 4 columns

\n", "
" ], "text/plain": [ " City Date Sales StoreID\n", "0 Oakland 1/1/2014 $9.83 982\n", "1 Oakland 1/1/2014 $28.18 983\n", "2 Oakland 1/1/2014 $6.83 982\n", "3 Oakland 1/1/2014 $43.90 982\n", "4 Oakland 1/1/2014 $17.16 980\n", "... ... ... ... ...\n", "880344 Tacoma 12/31/2015 $33.68 1523\n", "880345 Tacoma 12/31/2015 $215.98 1521\n", "880346 Tacoma 12/31/2015 $236.86 1521\n", "880347 Tacoma 12/31/2015 $33.02 1522\n", "880348 Tacoma 12/31/2015 $11.32 1523\n", "\n", "[880349 rows x 4 columns]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# pip install pyjanitor\n", "from janitor import read_commandline\n", "\n", "cmd = \"awk '(NR==1) || (FNR>1)' ./Data_files/multiple_csvs/*.csv\"\n", "df = read_commandline(cmd)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **datatable**" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-10-31T01:38:57.851388Z", "iopub.status.busy": "2020-10-31T01:38:57.851145Z", "iopub.status.idle": "2020-10-31T01:38:57.854014Z", "shell.execute_reply": "2020-10-31T01:38:57.853407Z", "shell.execute_reply.started": "2020-10-31T01:38:57.851357Z" } }, "source": [ "### Via iread" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CityDateSalesStoreID
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0Tacoma1/1/2014$17.301522
1Tacoma1/1/2014$15.771520
2Tacoma1/1/2014$151.361520
3Tacoma1/1/2014$168.471522
4Tacoma1/1/2014$150.711520
5Tacoma1/1/2014$22.861523
6Tacoma1/1/2014$18.661521
7Tacoma1/1/2014$39.761520
8Tacoma1/1/2014$132.751523
9Tacoma1/1/2014$39.581521
10Tacoma1/1/2014$159.141523
11Tacoma1/1/2014$33.291522
12Tacoma1/1/2014$38.401522
13Tacoma1/1/2014$33.461520
14Tacoma1/1/2014$23.851522
880,344Portland2/3/2015$41.031004
880,345Portland2/3/2015$162.541005
880,346Portland2/3/2015$38.381002
880,347Portland2/3/2015$96.401003
880,348Portland2/3/2015$8.381005
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from datatable import fread, iread, rbind\n", "\n", "files = Path(\"./Data_files/multiple_csvs/\").rglob(\"*.csv\")\n", "\n", "# iread returns an iterator of all files\n", "all_csvs = iread(tuple(files))\n", "\n", "# combine into one table with rbind\n", "all_csvs = rbind(all_csvs)\n", "\n", "all_csvs" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-10-31T01:39:15.743149Z", "iopub.status.busy": "2020-10-31T01:39:15.742716Z", "iopub.status.idle": "2020-10-31T01:39:15.747343Z", "shell.execute_reply": "2020-10-31T01:39:15.746321Z", "shell.execute_reply.started": "2020-10-31T01:39:15.743100Z" } }, "source": [ "### Via fread" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CityDateSalesStoreID
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0Tacoma1/1/2014$17.301522
1Tacoma1/1/2014$15.771520
2Tacoma1/1/2014$151.361520
3Tacoma1/1/2014$168.471522
4Tacoma1/1/2014$150.711520
5Tacoma1/1/2014$22.861523
6Tacoma1/1/2014$18.661521
7Tacoma1/1/2014$39.761520
8Tacoma1/1/2014$132.751523
9Tacoma1/1/2014$39.581521
10Tacoma1/1/2014$159.141523
11Tacoma1/1/2014$33.291522
12Tacoma1/1/2014$38.401522
13Tacoma1/1/2014$33.461520
14Tacoma1/1/2014$23.851522
880,344Portland2/3/2015$41.031004
880,345Portland2/3/2015$162.541005
880,346Portland2/3/2015$38.381002
880,347Portland2/3/2015$96.401003
880,348Portland2/3/2015$8.381005
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "files = Path(\"./Data_files/multiple_csvs/\").rglob(\"*.csv\")\n", "\n", "all_csvs = [fread(filename) for filename in files]\n", "\n", "# combine into one frame\n", "rbind(all_csvs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Via fread and the command line" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CityDateSalesStoreID
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0Oakland1/1/2014$9.83982
1Oakland1/1/2014$28.18983
2Oakland1/1/2014$6.83982
3Oakland1/1/2014$43.90982
4Oakland1/1/2014$17.16980
5Oakland1/1/2014$14.29982
6Oakland1/1/2014$6.48982
7Oakland1/1/2014$232.13982
8Oakland1/1/2014$209.28981
9Oakland1/1/2014$12.11981
10Oakland1/1/2014$107.96982
11Oakland1/1/2014$47.64980
12Oakland1/1/2014$15.44981
13Oakland1/1/2014$52.72982
14Oakland1/1/2014$3.25981
880,344Tacoma12/31/2015$33.681523
880,345Tacoma12/31/2015$215.981521
880,346Tacoma12/31/2015$236.861521
880,347Tacoma12/31/2015$33.021522
880,348Tacoma12/31/2015$11.321523
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# less verbose than the Pandas option\n", "cmd = \"awk '(NR==1) || (FNR>1)' ./Data_files/multiple_csvs/*.csv\"\n", "all_csvs = fread(cmd = cmd)\n", "all_csvs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Polars**" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (880_349, 4)
CityDateSalesStoreID
strstrstri64
"Oakland""1/1/2014""$9.83 "982
"Oakland""1/1/2014""$28.18 "983
"Oakland""1/1/2014""$6.83 "982
"Oakland""1/1/2014""$43.90 "982
"Oakland""1/1/2014""$17.16 "980
"Tacoma""12/31/2015""$33.68 "1523
"Tacoma""12/31/2015""$215.98 "1521
"Tacoma""12/31/2015""$236.86 "1521
"Tacoma""12/31/2015""$33.02 "1522
"Tacoma""12/31/2015""$11.32 "1523
" ], "text/plain": [ "shape: (880_349, 4)\n", "┌─────────┬────────────┬──────────┬─────────┐\n", "│ City ┆ Date ┆ Sales ┆ StoreID │\n", "│ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ i64 │\n", "╞═════════╪════════════╪══════════╪═════════╡\n", "│ Oakland ┆ 1/1/2014 ┆ $9.83 ┆ 982 │\n", "│ Oakland ┆ 1/1/2014 ┆ $28.18 ┆ 983 │\n", "│ Oakland ┆ 1/1/2014 ┆ $6.83 ┆ 982 │\n", "│ Oakland ┆ 1/1/2014 ┆ $43.90 ┆ 982 │\n", "│ Oakland ┆ 1/1/2014 ┆ $17.16 ┆ 980 │\n", "│ … ┆ … ┆ … ┆ … │\n", "│ Tacoma ┆ 12/31/2015 ┆ $33.68 ┆ 1523 │\n", "│ Tacoma ┆ 12/31/2015 ┆ $215.98 ┆ 1521 │\n", "│ Tacoma ┆ 12/31/2015 ┆ $236.86 ┆ 1521 │\n", "│ Tacoma ┆ 12/31/2015 ┆ $33.02 ┆ 1522 │\n", "│ Tacoma ┆ 12/31/2015 ┆ $11.32 ┆ 1523 │\n", "└─────────┴────────────┴──────────┴─────────┘" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import polars as pl\n", "files = Path(\"./Data_files/multiple_csvs/*.csv\")\n", "pl.read_csv(files)" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-10-31T08:59:52.887765Z", "iopub.status.busy": "2020-10-31T08:59:52.887501Z", "iopub.status.idle": "2020-10-31T08:59:52.893435Z", "shell.execute_reply": "2020-10-31T08:59:52.892724Z", "shell.execute_reply.started": "2020-10-31T08:59:52.887738Z" } }, "source": [ "## **Resources Used**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "datatable - [fread](https://datatable.readthedocs.io/en/latest/api/dt/fread.html)\n", "\n", "datatable - [iread](https://datatable.readthedocs.io/en/latest/api/dt/iread.html)\n", "\n", "pandas - [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n", "\n", "polars - [read_csv](https://docs.pola.rs/user-guide/io/multiple/#dealing-with-multiple-files)\n", "\n", "python - [subprocess](https://docs.python.org/3/library/subprocess.html)\n", "\n", "python - [pathlib](https://docs.python.org/3/library/pathlib.html)\n", "\n", "StackOverflow - [read csv files via the command line](https://stackoverflow.com/a/58131427/7175713)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comments\n", "" ] } ], "metadata": { "kernelspec": { "display_name": "blogger", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 4 }