{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Extract DataFrame from Compressed Data into Pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Compressed Data Extraction in Pandas**"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-06-26T01:27:05.586391Z",
"iopub.status.busy": "2020-06-26T01:27:05.586173Z",
"iopub.status.idle": "2020-06-26T01:27:05.592158Z",
"shell.execute_reply": "2020-06-26T01:27:05.591344Z",
"shell.execute_reply.started": "2020-06-26T01:27:05.586367Z"
}
},
"source": [
"You have data in compressed form (``zip``, ``7z``, ...); how do you read the data into [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html)? This blog post will explore the various options possible. The focus will be on the ``zip`` format, as it is widely used and the methods presented here can be adapted for other compression formats. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### __Read Zip Folder that has only one File__\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) can easily read a zip folder that contains only one file:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sepal_length | \n",
" sepal_width | \n",
" petal_length | \n",
" petal_width | \n",
" species | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 5.1 | \n",
" 3.5 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 1 | \n",
" 4.9 | \n",
" 3.0 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 2 | \n",
" 4.7 | \n",
" 3.2 | \n",
" 1.3 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 3 | \n",
" 4.6 | \n",
" 3.1 | \n",
" 1.5 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 4 | \n",
" 5.0 | \n",
" 3.6 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sepal_length sepal_width petal_length petal_width species\n",
"0 5.1 3.5 1.4 0.2 setosa\n",
"1 4.9 3.0 1.4 0.2 setosa\n",
"2 4.7 3.2 1.3 0.2 setosa\n",
"3 4.6 3.1 1.5 0.2 setosa\n",
"4 5.0 3.6 1.4 0.2 setosa"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.read_csv(\"Data_files/iris.zip\")\n",
"# or : pd.read_csv(\"Data_files/iris.zip\",\n",
"# compression = \"zip\")\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:**\n",
"- Note that [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) successfully inferred the compression type. It pays sometimes though, to be explicit and supply the ``compression`` argument."
]
},
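{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same pattern extends to the other compression formats Pandas supports (``gzip``, ``bz2``, ``xz``). A minimal sketch; the filenames below are placeholders, not files from this post:\n",
"\n",
"```python\n",
"# compression is inferred from the file extension,\n",
"# or can be supplied explicitly\n",
"df_gz = pd.read_csv(\"Data_files/data.csv.gz\")\n",
"df_bz2 = pd.read_csv(\"Data_files/data.csv.bz2\", compression=\"bz2\")\n",
"```"
]
},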
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### __Read a File from Multiple Files in Zip Folder__\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this scenario, We want just the ``setosa.csv`` file. [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) cannot directly read data from a zip folder if there are multiple files; to solve this, we will use the [zipfile](https://docs.python.org/3/library/zipfile.html) module within Python. The [zipfile](https://docs.python.org/3/library/zipfile.html) module offers two routes for reading in zip data : ``ZipFile`` and ``Path`` classes."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Option 1 - Via ``ZipFile`` :"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sepal_length | \n",
" sepal_width | \n",
" petal_length | \n",
" petal_width | \n",
" species | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 5.1 | \n",
" 3.5 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 1 | \n",
" 4.9 | \n",
" 3.0 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 2 | \n",
" 4.7 | \n",
" 3.2 | \n",
" 1.3 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 3 | \n",
" 4.6 | \n",
" 3.1 | \n",
" 1.5 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 4 | \n",
" 5.0 | \n",
" 3.6 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sepal_length sepal_width petal_length petal_width species\n",
"0 5.1 3.5 1.4 0.2 setosa\n",
"1 4.9 3.0 1.4 0.2 setosa\n",
"2 4.7 3.2 1.3 0.2 setosa\n",
"3 4.6 3.1 1.5 0.2 setosa\n",
"4 5.0 3.6 1.4 0.2 setosa"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from zipfile import ZipFile, Path\n",
"\n",
"# pass in the specific file name \n",
"# to the open method\n",
"with ZipFile(\"Data_files/iris_zipped.zip\") as myzip:\n",
" data = myzip.open(\"iris_zipped/setosa.csv\")\n",
"\n",
"#Now, we can read in the data\n",
"df = pd.read_csv(data)\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:**\n",
"- The context manager ensures the connection is closed after reading in the data.\n",
"- The complete path to the specific file is passed (``iris_zipped/setosa.csv``)."
]
},
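{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you do not know the member names beforehand, you can inspect the archive first; ``namelist`` returns the paths of all members:\n",
"\n",
"```python\n",
"with ZipFile(\"Data_files/iris_zipped.zip\") as myzip:\n",
"    # list the members of the archive\n",
"    print(myzip.namelist())\n",
"```"
]
},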
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Option 2 - Via ``Path``:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sepal_length | \n",
" sepal_width | \n",
" petal_length | \n",
" petal_width | \n",
" species | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 5.1 | \n",
" 3.5 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 1 | \n",
" 4.9 | \n",
" 3.0 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 2 | \n",
" 4.7 | \n",
" 3.2 | \n",
" 1.3 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 3 | \n",
" 4.6 | \n",
" 3.1 | \n",
" 1.5 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 4 | \n",
" 5.0 | \n",
" 3.6 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sepal_length sepal_width petal_length petal_width species\n",
"0 5.1 3.5 1.4 0.2 setosa\n",
"1 4.9 3.0 1.4 0.2 setosa\n",
"2 4.7 3.2 1.3 0.2 setosa\n",
"3 4.6 3.1 1.5 0.2 setosa\n",
"4 5.0 3.6 1.4 0.2 setosa"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from io import StringIO\n",
"\n",
"# the 'at' argument accepts the path to the file\n",
"# within the zip archive\n",
"# this is convenient when you know the filename\n",
"zipped = Path(\"Data_files/iris_zipped.zip\", at=\"iris_zipped/setosa.csv\")\n",
"\n",
"# read into dataframe, with the '.read_text' method\n",
"df = pd.read_csv(StringIO(zipped.read_text()))\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### __Read a Specific File from Multiple Files in Zip Folder based on its Extension__\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-07-10T07:44:58.459096Z",
"iopub.status.busy": "2020-07-10T07:44:58.458681Z",
"iopub.status.idle": "2020-07-10T07:44:58.468336Z",
"shell.execute_reply": "2020-07-10T07:44:58.467250Z",
"shell.execute_reply.started": "2020-07-10T07:44:58.459050Z"
}
},
"source": [
"Here, the zip folder contains various file types (``xlsx``, ``csv``, ``json``, ...), and we want only the ``.csv`` file. Let's also assume that we do not know beforehand what the filename is; we do however know that there is just one ``.csv`` file in the zip folder. We can pick out the specific file with the [zipfile](https://docs.python.org/3/library/zipfile.html) module:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Option 1 - Via ``ZipFile`` :"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sepal_length | \n",
" sepal_width | \n",
" petal_length | \n",
" petal_width | \n",
" species | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 5.1 | \n",
" 3.5 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 1 | \n",
" 4.9 | \n",
" 3.0 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 2 | \n",
" 4.7 | \n",
" 3.2 | \n",
" 1.3 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 3 | \n",
" 4.6 | \n",
" 3.1 | \n",
" 1.5 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 4 | \n",
" 5.0 | \n",
" 3.6 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sepal_length sepal_width petal_length petal_width species\n",
"0 5.1 3.5 1.4 0.2 setosa\n",
"1 4.9 3.0 1.4 0.2 setosa\n",
"2 4.7 3.2 1.3 0.2 setosa\n",
"3 4.6 3.1 1.5 0.2 setosa\n",
"4 5.0 3.6 1.4 0.2 setosa"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#fnmatch helps to filter for specific file types\n",
"import fnmatch\n",
"\n",
"#read data from the csv file\n",
"with ZipFile(\"Data_files/zipped.zip\") as zipped_files:\n",
" \n",
" #get list of files in zip\n",
" file_list = zipped_files.namelist()\n",
" \n",
" #use fnmatch.filter to get the csv file\n",
" csv_file = fnmatch.filter(file_list, \"*.csv\")\n",
" \n",
" #get the csv data\n",
" data = zipped_files.open(*csv_file)\n",
"\n",
"#read into dataframe\n",
"df = pd.read_csv(data)\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Option 2 - Via ``Path``:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sepal_length | \n",
" sepal_width | \n",
" petal_length | \n",
" petal_width | \n",
" species | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 5.1 | \n",
" 3.5 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 1 | \n",
" 4.9 | \n",
" 3.0 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 2 | \n",
" 4.7 | \n",
" 3.2 | \n",
" 1.3 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 3 | \n",
" 4.6 | \n",
" 3.1 | \n",
" 1.5 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 4 | \n",
" 5.0 | \n",
" 3.6 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sepal_length sepal_width petal_length petal_width species\n",
"0 5.1 3.5 1.4 0.2 setosa\n",
"1 4.9 3.0 1.4 0.2 setosa\n",
"2 4.7 3.2 1.3 0.2 setosa\n",
"3 4.6 3.1 1.5 0.2 setosa\n",
"4 5.0 3.6 1.4 0.2 setosa"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"zipped = Path(\"Data_files/zipped.zip\")\n",
"\n",
"#filter for csv file\n",
"csv_file = fnmatch.filter(zipped.root.namelist(), \"*.csv\")\n",
"\n",
"#append filepath to zipped, and read into dataframe\n",
"data = zipped.joinpath(*csv_file).read_text()\n",
"\n",
"df = pd.read_csv(StringIO(data))\n",
"\n",
"df.head()"
]
},
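{
"cell_type": "markdown",
"metadata": {},
"source": [
"The zip folder in this section also contains binary formats like ``xlsx``; those need ``read_bytes`` and ``BytesIO`` instead of ``read_text`` and ``StringIO``. A sketch, assuming the archive holds exactly one ``.xlsx`` file and that an Excel engine such as ``openpyxl`` is installed:\n",
"\n",
"```python\n",
"from io import BytesIO\n",
"\n",
"zipped = Path(\"Data_files/zipped.zip\")\n",
"\n",
"# filter for the xlsx file\n",
"xlsx_file = fnmatch.filter(zipped.root.namelist(), \"*.xlsx\")\n",
"\n",
"# xlsx is a binary format, so read bytes, not text\n",
"df_xlsx = pd.read_excel(BytesIO(zipped.joinpath(*xlsx_file).read_bytes()))\n",
"```"
]
},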
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### __Read Multiple Files with specific extension in Zip Folder into a Single Dataframe__\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this scenario, we want to read in only the ``csv`` files and combine them into one dataframe - [zipfile](https://docs.python.org/3/library/zipfile.html) module to the rescue:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Option 1 - Via ``ZipFile``:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sepal_length | \n",
" sepal_width | \n",
" petal_length | \n",
" petal_width | \n",
" species | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 5.1 | \n",
" 3.5 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 1 | \n",
" 4.9 | \n",
" 3.0 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 2 | \n",
" 4.7 | \n",
" 3.2 | \n",
" 1.3 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 3 | \n",
" 4.6 | \n",
" 3.1 | \n",
" 1.5 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 4 | \n",
" 5.0 | \n",
" 3.6 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sepal_length sepal_width petal_length petal_width species\n",
"0 5.1 3.5 1.4 0.2 setosa\n",
"1 4.9 3.0 1.4 0.2 setosa\n",
"2 4.7 3.2 1.3 0.2 setosa\n",
"3 4.6 3.1 1.5 0.2 setosa\n",
"4 5.0 3.6 1.4 0.2 setosa"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"with ZipFile(\"Data_files/zip_various.zip\") as zipfiles:\n",
"\n",
" file_list = zipfiles.namelist()\n",
" \n",
" #get only the csv files\n",
" csv_files = fnmatch.filter(file_list, \"*.csv\")\n",
" \n",
" #iterate with a list comprehension to get the individual dataframes\n",
" data = [pd.read_csv(zipfiles.open(file_name)) for file_name in csv_files]\n",
"\n",
"#combine into one dataframe\n",
"df = pd.concat(data)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Option 2 - Via ``Path``:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sepal_length | \n",
" sepal_width | \n",
" petal_length | \n",
" petal_width | \n",
" species | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 5.1 | \n",
" 3.5 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 1 | \n",
" 4.9 | \n",
" 3.0 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 2 | \n",
" 4.7 | \n",
" 3.2 | \n",
" 1.3 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 3 | \n",
" 4.6 | \n",
" 3.1 | \n",
" 1.5 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 4 | \n",
" 5.0 | \n",
" 3.6 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sepal_length sepal_width petal_length petal_width species\n",
"0 5.1 3.5 1.4 0.2 setosa\n",
"1 4.9 3.0 1.4 0.2 setosa\n",
"2 4.7 3.2 1.3 0.2 setosa\n",
"3 4.6 3.1 1.5 0.2 setosa\n",
"4 5.0 3.6 1.4 0.2 setosa"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"zipped = Path(\"Data_files/zip_various.zip\")\n",
"\n",
"csv_file = fnmatch.filter(zipped.root.namelist(), \"*.csv\")\n",
"\n",
"data = (zipped.joinpath(file_name).read_text() for file_name in csv_file)\n",
"\n",
"data = (pd.read_csv(StringIO(content)) for content in data)\n",
"\n",
"df = pd.concat(data)\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### __Read all the Files in Zip Folder into a Single Dataframe__\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Option 1 - Via ``ZipFile``:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sepal_length | \n",
" sepal_width | \n",
" petal_length | \n",
" petal_width | \n",
" species | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 5.1 | \n",
" 3.5 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 1 | \n",
" 4.9 | \n",
" 3.0 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 2 | \n",
" 4.7 | \n",
" 3.2 | \n",
" 1.3 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 3 | \n",
" 4.6 | \n",
" 3.1 | \n",
" 1.5 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 4 | \n",
" 5.0 | \n",
" 3.6 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sepal_length sepal_width petal_length petal_width species\n",
"0 5.1 3.5 1.4 0.2 setosa\n",
"1 4.9 3.0 1.4 0.2 setosa\n",
"2 4.7 3.2 1.3 0.2 setosa\n",
"3 4.6 3.1 1.5 0.2 setosa\n",
"4 5.0 3.6 1.4 0.2 setosa"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"with ZipFile(\"Data_files/iris_zipped.zip\") as zipfiles:\n",
" #the first entry is the zipfile name\n",
" #we'll skip it\n",
" filelist = zipfiles.namelist()[1:]\n",
" \n",
" data = [pd.read_csv(zipfiles.open(file_name)) for file_name in filelist]\n",
"\n",
"df = pd.concat(data)\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Option 2 - Via ``Path``:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sepal_length | \n",
" sepal_width | \n",
" petal_length | \n",
" petal_width | \n",
" species | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 5.1 | \n",
" 3.5 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 1 | \n",
" 4.9 | \n",
" 3.0 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 2 | \n",
" 4.7 | \n",
" 3.2 | \n",
" 1.3 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 3 | \n",
" 4.6 | \n",
" 3.1 | \n",
" 1.5 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 4 | \n",
" 5.0 | \n",
" 3.6 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sepal_length sepal_width petal_length petal_width species\n",
"0 5.1 3.5 1.4 0.2 setosa\n",
"1 4.9 3.0 1.4 0.2 setosa\n",
"2 4.7 3.2 1.3 0.2 setosa\n",
"3 4.6 3.1 1.5 0.2 setosa\n",
"4 5.0 3.6 1.4 0.2 setosa"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"zipped = Path(\"Data_files/iris_zipped.zip\")\n",
"\n",
"data = (zipped.joinpath(file_name).read_text() \n",
" for file_name in zipped.root.namelist()[1:])\n",
"\n",
"data = (pd.read_csv(StringIO(content)) for content in data)\n",
"\n",
"df = pd.concat(data)\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### __Read Compressed Data through the Command Line__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This process is especially useful when dealing with large files that [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) may struggle with. Preprocess the data, before reading it into Pandas. The command line tools are quite powerful and allow for some efficient data processing."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sepal_length | \n",
" sepal_width | \n",
" petal_length | \n",
" petal_width | \n",
" species | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 5.1 | \n",
" 3.5 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 1 | \n",
" 4.9 | \n",
" 3.0 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 2 | \n",
" 4.7 | \n",
" 3.2 | \n",
" 1.3 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 3 | \n",
" 4.6 | \n",
" 3.1 | \n",
" 1.5 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 4 | \n",
" 5.0 | \n",
" 3.6 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sepal_length sepal_width petal_length petal_width species\n",
"0 5.1 3.5 1.4 0.2 setosa\n",
"1 4.9 3.0 1.4 0.2 setosa\n",
"2 4.7 3.2 1.3 0.2 setosa\n",
"3 4.6 3.1 1.5 0.2 setosa\n",
"4 5.0 3.6 1.4 0.2 setosa"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import subprocess\n",
"\n",
"#read a zip with only one file\n",
"data = subprocess.run(\"unzip -p Data_files/iris.zip\", \n",
" shell = True, \n",
" capture_output = True, \n",
" text = True).stdout\n",
"#read in data to pandas\n",
"df = pd.read_csv(StringIO(data))\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:**\n",
"- The [unzip](https://linux.die.net/man/1/unzip) tool on the command line does the hardwork here."
]
},
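{
"cell_type": "markdown",
"metadata": {},
"source": [
"A similar one-liner works for other formats; for example, ``gzip -dc`` (or ``zcat``) streams a gzipped file to stdout. A sketch with a placeholder filename:\n",
"\n",
"```python\n",
"# decompress a gzipped csv to stdout and read it in\n",
"data = subprocess.run(\"gzip -dc Data_files/data.csv.gz\",\n",
"                      shell=True,\n",
"                      capture_output=True,\n",
"                      text=True).stdout\n",
"\n",
"df = pd.read_csv(StringIO(data))\n",
"```"
]
},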
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sepal_length | \n",
" sepal_width | \n",
" petal_length | \n",
" petal_width | \n",
" species | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 6.3 | \n",
" 3.3 | \n",
" 6.0 | \n",
" 2.5 | \n",
" virginica | \n",
"
\n",
" \n",
" 1 | \n",
" 5.8 | \n",
" 2.7 | \n",
" 5.1 | \n",
" 1.9 | \n",
" virginica | \n",
"
\n",
" \n",
" 2 | \n",
" 7.1 | \n",
" 3.0 | \n",
" 5.9 | \n",
" 2.1 | \n",
" virginica | \n",
"
\n",
" \n",
" 3 | \n",
" 6.3 | \n",
" 2.9 | \n",
" 5.6 | \n",
" 1.8 | \n",
" virginica | \n",
"
\n",
" \n",
" 4 | \n",
" 6.5 | \n",
" 3.0 | \n",
" 5.8 | \n",
" 2.2 | \n",
" virginica | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sepal_length sepal_width petal_length petal_width species\n",
"0 6.3 3.3 6.0 2.5 virginica\n",
"1 5.8 2.7 5.1 1.9 virginica\n",
"2 7.1 3.0 5.9 2.1 virginica\n",
"3 6.3 2.9 5.6 1.8 virginica\n",
"4 6.5 3.0 5.8 2.2 virginica"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# filter for only rows where species == virginica\n",
"# this comes in handy when dealing with large files \n",
"# that Pandas may struggle with\n",
"data = subprocess.run(\"\"\"\n",
" unzip -p Data_files/iris.zip | awk 'BEGIN {FS = \",\"}; \n",
" {if($5==\"virginica\" || NR == 1) {print}}'\n",
" \"\"\", \n",
" shell = True, \n",
" capture_output = True, \n",
" text = True).stdout\n",
"\n",
"df = pd.read_csv(StringIO(data))\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:**\n",
"- The [AWK](https://www.gnu.org/software/gawk/manual/html_node/Getting-Started.html#Getting-Started) language is used here to filter for the rows and also retain the headers."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading via the command line can be simplified with [pyjanitor's](https://pyjanitor-devs.github.io/pyjanitor/) [read_commandline](https://pyjanitor-devs.github.io/pyjanitor/api/io/#janitor.io.read_commandline) function:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sepal_length | \n",
" sepal_width | \n",
" petal_length | \n",
" petal_width | \n",
" species | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 5.1 | \n",
" 3.5 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 1 | \n",
" 4.9 | \n",
" 3.0 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 2 | \n",
" 4.7 | \n",
" 3.2 | \n",
" 1.3 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 3 | \n",
" 4.6 | \n",
" 3.1 | \n",
" 1.5 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 4 | \n",
" 5.0 | \n",
" 3.6 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sepal_length sepal_width petal_length petal_width species\n",
"0 5.1 3.5 1.4 0.2 setosa\n",
"1 4.9 3.0 1.4 0.2 setosa\n",
"2 4.7 3.2 1.3 0.2 setosa\n",
"3 4.6 3.1 1.5 0.2 setosa\n",
"4 5.0 3.6 1.4 0.2 setosa"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# pip install pyjanitor\n",
"from janitor import read_commandline\n",
"\n",
"#read a zip with only one file\n",
"cmd = \"unzip -p Data_files/iris.zip\"\n",
"df = read_commandline(cmd = cmd)\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sepal_length | \n",
" sepal_width | \n",
" petal_length | \n",
" petal_width | \n",
" species | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 6.3 | \n",
" 3.3 | \n",
" 6.0 | \n",
" 2.5 | \n",
" virginica | \n",
"
\n",
" \n",
" 1 | \n",
" 5.8 | \n",
" 2.7 | \n",
" 5.1 | \n",
" 1.9 | \n",
" virginica | \n",
"
\n",
" \n",
" 2 | \n",
" 7.1 | \n",
" 3.0 | \n",
" 5.9 | \n",
" 2.1 | \n",
" virginica | \n",
"
\n",
" \n",
" 3 | \n",
" 6.3 | \n",
" 2.9 | \n",
" 5.6 | \n",
" 1.8 | \n",
" virginica | \n",
"
\n",
" \n",
" 4 | \n",
" 6.5 | \n",
" 3.0 | \n",
" 5.8 | \n",
" 2.2 | \n",
" virginica | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sepal_length sepal_width petal_length petal_width species\n",
"0 6.3 3.3 6.0 2.5 virginica\n",
"1 5.8 2.7 5.1 1.9 virginica\n",
"2 7.1 3.0 5.9 2.1 virginica\n",
"3 6.3 2.9 5.6 1.8 virginica\n",
"4 6.5 3.0 5.8 2.2 virginica"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cmd = \"unzip -p Data_files/iris.zip |\"\n",
"cmd += \"\"\"awk 'BEGIN {FS = \",\"}; \"\"\"\n",
"cmd += \"\"\"{if($5==\"virginica\" || NR == 1) {print}}'\"\"\"\n",
"df = read_commandline(cmd = cmd)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-07-30T10:57:18.265994Z",
"iopub.status.busy": "2020-07-30T10:57:18.265751Z",
"iopub.status.idle": "2020-07-30T10:57:18.283104Z",
"shell.execute_reply": "2020-07-30T10:57:18.280684Z",
"shell.execute_reply.started": "2020-07-30T10:57:18.265970Z"
}
},
"source": [
"### __Summary__\n",
"\n",
"This blog post showed various ways to extract data from compressed files. Personally, I feel extraction using the ``ZipFile`` function is simpler and cleaner than the ``Path`` function. Also, we saw the power of the command line and how useful it can be. In a future blog post, I will explore the command line in more detail."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Comments\n",
""
]
}
],
"metadata": {
"jupytext": {
"formats": "ipynb,md"
},
"kernelspec": {
"display_name": "Python 3.9.10 ('pyjanitor')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.10"
},
"vscode": {
"interpreter": {
"hash": "f6988fdc349cebf087982060ba0d703f44fccde5c0343683aa63b0b79fe3c156"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}