{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Extract DataFrame from Compressed Data into Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Compressed Data Extraction in Pandas**" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-06-26T01:27:05.586391Z", "iopub.status.busy": "2020-06-26T01:27:05.586173Z", "iopub.status.idle": "2020-06-26T01:27:05.592158Z", "shell.execute_reply": "2020-06-26T01:27:05.591344Z", "shell.execute_reply.started": "2020-06-26T01:27:05.586367Z" } }, "source": [ "You have data in compressed form (``zip``, ``7z``, ...); how do you read the data into [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html)? This blog post will explore the various options possible. The focus will be on the ``zip`` format, as it is widely used and the methods presented here can be adapted for other compression formats. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### __Read Zip Folder that has only one File__\n", "\n", "![zip-single-file.png](Images/zip_single_file.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) can easily read a zip folder that contains only one file:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
05.13.51.40.2setosa
14.93.01.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
45.03.61.40.2setosa
\n", "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width species\n", "0 5.1 3.5 1.4 0.2 setosa\n", "1 4.9 3.0 1.4 0.2 setosa\n", "2 4.7 3.2 1.3 0.2 setosa\n", "3 4.6 3.1 1.5 0.2 setosa\n", "4 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv(\"Data_files/iris.zip\")\n", "# or : pd.read_csv(\"Data_files/iris.zip\",\n", "# compression = \"zip\")\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:**\n", "- Note that [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) successfully inferred the compression type. It pays sometimes though, to be explicit and supply the ``compression`` argument." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### __Read a File from Multiple Files in Zip Folder__\n", "\n", "![zip-single-file-from-multiple-files.png](Images/zip_single_from_multiple_files.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this scenario, We want just the ``setosa.csv`` file. [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) cannot directly read data from a zip folder if there are multiple files; to solve this, we will use the [zipfile](https://docs.python.org/3/library/zipfile.html) module within Python. The [zipfile](https://docs.python.org/3/library/zipfile.html) module offers two routes for reading in zip data : ``ZipFile`` and ``Path`` classes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Option 1 - Via ``ZipFile`` :" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
05.13.51.40.2setosa
14.93.01.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
45.03.61.40.2setosa
\n", "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width species\n", "0 5.1 3.5 1.4 0.2 setosa\n", "1 4.9 3.0 1.4 0.2 setosa\n", "2 4.7 3.2 1.3 0.2 setosa\n", "3 4.6 3.1 1.5 0.2 setosa\n", "4 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from zipfile import ZipFile, Path\n", "\n", "# pass in the specific file name \n", "# to the open method\n", "with ZipFile(\"Data_files/iris_zipped.zip\") as myzip:\n", " data = myzip.open(\"iris_zipped/setosa.csv\")\n", "\n", "#Now, we can read in the data\n", "df = pd.read_csv(data)\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:**\n", "- The context manager ensures the connection is closed after reading in the data.\n", "- The complete path to the specific file is passed (``iris_zipped/setosa.csv``)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Option 2 - Via ``Path``:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
05.13.51.40.2setosa
14.93.01.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
45.03.61.40.2setosa
\n", "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width species\n", "0 5.1 3.5 1.4 0.2 setosa\n", "1 4.9 3.0 1.4 0.2 setosa\n", "2 4.7 3.2 1.3 0.2 setosa\n", "3 4.6 3.1 1.5 0.2 setosa\n", "4 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from io import StringIO\n", "\n", "# the 'at' argument accepts the path to the file\n", "# within the zip archive\n", "# this is convenient when you know the filename\n", "zipped = Path(\"Data_files/iris_zipped.zip\", at=\"iris_zipped/setosa.csv\")\n", "\n", "# read into dataframe, with the '.read_text' method\n", "df = pd.read_csv(StringIO(zipped.read_text()))\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### __Read a Specific File from Multiple Files in Zip Folder based on its Extension__\n", "\n", "![zip-specific-file-ext.png](Images/zip_specific_file_extension.png)" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-07-10T07:44:58.459096Z", "iopub.status.busy": "2020-07-10T07:44:58.458681Z", "iopub.status.idle": "2020-07-10T07:44:58.468336Z", "shell.execute_reply": "2020-07-10T07:44:58.467250Z", "shell.execute_reply.started": "2020-07-10T07:44:58.459050Z" } }, "source": [ "Here, the zip folder contains various file types (``xlsx``, ``csv``, ``json``, ...), and we want only the ``.csv`` file. Let's also assume that we do not know beforehand what the filename is; we do however know that there is just one ``.csv`` file in the zip folder. We can pick out the specific file with the [zipfile](https://docs.python.org/3/library/zipfile.html) module:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Option 1 - Via ``ZipFile`` :" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
05.13.51.40.2setosa
14.93.01.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
45.03.61.40.2setosa
\n", "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width species\n", "0 5.1 3.5 1.4 0.2 setosa\n", "1 4.9 3.0 1.4 0.2 setosa\n", "2 4.7 3.2 1.3 0.2 setosa\n", "3 4.6 3.1 1.5 0.2 setosa\n", "4 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#fnmatch helps to filter for specific file types\n", "import fnmatch\n", "\n", "#read data from the csv file\n", "with ZipFile(\"Data_files/zipped.zip\") as zipped_files:\n", " \n", " #get list of files in zip\n", " file_list = zipped_files.namelist()\n", " \n", " #use fnmatch.filter to get the csv file\n", " csv_file = fnmatch.filter(file_list, \"*.csv\")\n", " \n", " #get the csv data\n", " data = zipped_files.open(*csv_file)\n", "\n", "#read into dataframe\n", "df = pd.read_csv(data)\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Option 2 - Via ``Path``:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
05.13.51.40.2setosa
14.93.01.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
45.03.61.40.2setosa
\n", "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width species\n", "0 5.1 3.5 1.4 0.2 setosa\n", "1 4.9 3.0 1.4 0.2 setosa\n", "2 4.7 3.2 1.3 0.2 setosa\n", "3 4.6 3.1 1.5 0.2 setosa\n", "4 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "zipped = Path(\"Data_files/zipped.zip\")\n", "\n", "#filter for csv file\n", "csv_file = fnmatch.filter(zipped.root.namelist(), \"*.csv\")\n", "\n", "#append filepath to zipped, and read into dataframe\n", "data = zipped.joinpath(*csv_file).read_text()\n", "\n", "df = pd.read_csv(StringIO(data))\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### __Read Multiple Files with specific extension in Zip Folder into a Single Dataframe__\n", "\n", "![zip-multiple-files-concat.png](Images/zip_multiple_files_concat.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this scenario, we want to read in only the ``csv`` files and combine them into one dataframe - [zipfile](https://docs.python.org/3/library/zipfile.html) module to the rescue:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Option 1 - Via ``ZipFile``:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
05.13.51.40.2setosa
14.93.01.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
45.03.61.40.2setosa
\n", "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width species\n", "0 5.1 3.5 1.4 0.2 setosa\n", "1 4.9 3.0 1.4 0.2 setosa\n", "2 4.7 3.2 1.3 0.2 setosa\n", "3 4.6 3.1 1.5 0.2 setosa\n", "4 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with ZipFile(\"Data_files/zip_various.zip\") as zipfiles:\n", "\n", " file_list = zipfiles.namelist()\n", " \n", " #get only the csv files\n", " csv_files = fnmatch.filter(file_list, \"*.csv\")\n", " \n", " #iterate with a list comprehension to get the individual dataframes\n", " data = [pd.read_csv(zipfiles.open(file_name)) for file_name in csv_files]\n", "\n", "#combine into one dataframe\n", "df = pd.concat(data)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Option 2 - Via ``Path``:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
05.13.51.40.2setosa
14.93.01.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
45.03.61.40.2setosa
\n", "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width species\n", "0 5.1 3.5 1.4 0.2 setosa\n", "1 4.9 3.0 1.4 0.2 setosa\n", "2 4.7 3.2 1.3 0.2 setosa\n", "3 4.6 3.1 1.5 0.2 setosa\n", "4 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "zipped = Path(\"Data_files/zip_various.zip\")\n", "\n", "csv_file = fnmatch.filter(zipped.root.namelist(), \"*.csv\")\n", "\n", "data = (zipped.joinpath(file_name).read_text() for file_name in csv_file)\n", "\n", "data = (pd.read_csv(StringIO(content)) for content in data)\n", "\n", "df = pd.concat(data)\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### __Read all the Files in Zip Folder into a Single Dataframe__\n", "\n", "![zip-single-file-from-multiple-files.png](Images/zip_single_from_multiple_files.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Option 1 - Via ``ZipFile``:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
05.13.51.40.2setosa
14.93.01.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
45.03.61.40.2setosa
\n", "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width species\n", "0 5.1 3.5 1.4 0.2 setosa\n", "1 4.9 3.0 1.4 0.2 setosa\n", "2 4.7 3.2 1.3 0.2 setosa\n", "3 4.6 3.1 1.5 0.2 setosa\n", "4 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with ZipFile(\"Data_files/iris_zipped.zip\") as zipfiles:\n", " #the first entry is the zipfile name\n", " #we'll skip it\n", " filelist = zipfiles.namelist()[1:]\n", " \n", " data = [pd.read_csv(zipfiles.open(file_name)) for file_name in filelist]\n", "\n", "df = pd.concat(data)\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Option 2 - Via ``Path``:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
05.13.51.40.2setosa
14.93.01.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
45.03.61.40.2setosa
\n", "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width species\n", "0 5.1 3.5 1.4 0.2 setosa\n", "1 4.9 3.0 1.4 0.2 setosa\n", "2 4.7 3.2 1.3 0.2 setosa\n", "3 4.6 3.1 1.5 0.2 setosa\n", "4 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "zipped = Path(\"Data_files/iris_zipped.zip\")\n", "\n", "data = (zipped.joinpath(file_name).read_text() \n", " for file_name in zipped.root.namelist()[1:])\n", "\n", "data = (pd.read_csv(StringIO(content)) for content in data)\n", "\n", "df = pd.concat(data)\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### __Read Compressed Data through the Command Line__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This process is especially useful when dealing with large files that [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) may struggle with. Preprocess the data, before reading it into Pandas. The command line tools are quite powerful and allow for some efficient data processing." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
05.13.51.40.2setosa
14.93.01.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
45.03.61.40.2setosa
\n", "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width species\n", "0 5.1 3.5 1.4 0.2 setosa\n", "1 4.9 3.0 1.4 0.2 setosa\n", "2 4.7 3.2 1.3 0.2 setosa\n", "3 4.6 3.1 1.5 0.2 setosa\n", "4 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import subprocess\n", "\n", "#read a zip with only one file\n", "data = subprocess.run(\"unzip -p Data_files/iris.zip\", \n", " shell = True, \n", " capture_output = True, \n", " text = True).stdout\n", "#read in data to pandas\n", "df = pd.read_csv(StringIO(data))\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:**\n", "- The [unzip](https://linux.die.net/man/1/unzip) tool on the command line does the hardwork here." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
06.33.36.02.5virginica
15.82.75.11.9virginica
27.13.05.92.1virginica
36.32.95.61.8virginica
46.53.05.82.2virginica
\n", "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width species\n", "0 6.3 3.3 6.0 2.5 virginica\n", "1 5.8 2.7 5.1 1.9 virginica\n", "2 7.1 3.0 5.9 2.1 virginica\n", "3 6.3 2.9 5.6 1.8 virginica\n", "4 6.5 3.0 5.8 2.2 virginica" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# filter for only rows where species == virginica\n", "# this comes in handy when dealing with large files \n", "# that Pandas may struggle with\n", "data = subprocess.run(\"\"\"\n", " unzip -p Data_files/iris.zip | awk 'BEGIN {FS = \",\"}; \n", " {if($5==\"virginica\" || NR == 1) {print}}'\n", " \"\"\", \n", " shell = True, \n", " capture_output = True, \n", " text = True).stdout\n", "\n", "df = pd.read_csv(StringIO(data))\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:**\n", "- The [AWK](https://www.gnu.org/software/gawk/manual/html_node/Getting-Started.html#Getting-Started) language is used here to filter for the rows and also retain the headers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reading via the command line can be simplified with [pyjanitor's](https://pyjanitor-devs.github.io/pyjanitor/) [read_commandline](https://pyjanitor-devs.github.io/pyjanitor/api/io/#janitor.io.read_commandline) function:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
05.13.51.40.2setosa
14.93.01.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
45.03.61.40.2setosa
\n", "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width species\n", "0 5.1 3.5 1.4 0.2 setosa\n", "1 4.9 3.0 1.4 0.2 setosa\n", "2 4.7 3.2 1.3 0.2 setosa\n", "3 4.6 3.1 1.5 0.2 setosa\n", "4 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# pip install pyjanitor\n", "from janitor import read_commandline\n", "\n", "#read a zip with only one file\n", "cmd = \"unzip -p Data_files/iris.zip\"\n", "df = read_commandline(cmd = cmd)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
06.33.36.02.5virginica
15.82.75.11.9virginica
27.13.05.92.1virginica
36.32.95.61.8virginica
46.53.05.82.2virginica
\n", "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width species\n", "0 6.3 3.3 6.0 2.5 virginica\n", "1 5.8 2.7 5.1 1.9 virginica\n", "2 7.1 3.0 5.9 2.1 virginica\n", "3 6.3 2.9 5.6 1.8 virginica\n", "4 6.5 3.0 5.8 2.2 virginica" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cmd = \"unzip -p Data_files/iris.zip |\"\n", "cmd += \"\"\"awk 'BEGIN {FS = \",\"}; \"\"\"\n", "cmd += \"\"\"{if($5==\"virginica\" || NR == 1) {print}}'\"\"\"\n", "df = read_commandline(cmd = cmd)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-07-30T10:57:18.265994Z", "iopub.status.busy": "2020-07-30T10:57:18.265751Z", "iopub.status.idle": "2020-07-30T10:57:18.283104Z", "shell.execute_reply": "2020-07-30T10:57:18.280684Z", "shell.execute_reply.started": "2020-07-30T10:57:18.265970Z" } }, "source": [ "### __Summary__\n", "\n", "This blog post showed various ways to extract data from compressed files. Personally, I feel extraction using the ``ZipFile`` function is simpler and cleaner than the ``Path`` function. Also, we saw the power of the command line and how useful it can be. In a future blog post, I will explore the command line in more detail." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comments\n", "" ] } ], "metadata": { "jupytext": { "formats": "ipynb,md" }, "kernelspec": { "display_name": "Python 3.9.10 ('pyjanitor')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.10" }, "vscode": { "interpreter": { "hash": "f6988fdc349cebf087982060ba0d703f44fccde5c0343683aa63b0b79fe3c156" } } }, "nbformat": 4, "nbformat_minor": 4 }