{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Extract DataFrame from Compressed Data into Python Datatable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Compressed Data Extraction in Python Datatable**" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-06-26T01:27:05.586391Z", "iopub.status.busy": "2020-06-26T01:27:05.586173Z", "iopub.status.idle": "2020-06-26T01:27:05.592158Z", "shell.execute_reply": "2020-06-26T01:27:05.591344Z", "shell.execute_reply.started": "2020-06-26T01:27:05.586367Z" } }, "source": [ "In a previous [blog post](Extract-DataFrame-from-Compressed-Data-into-Pandas.ipynb), I discussed the various ways to extract data from compressed files into [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html). This blog post explores those options, this time with [Datatable](https://datatable.readthedocs.io/en/latest/?badge=latest). This package is a reimplementation in Python of [Rdatatable](https://rdatatable.gitlab.io/data.table/), aiming to replicate its speed and concise syntax. Have a look at my other [blog post](Selecting-and-Grouping-Data-with-Python-Datatable.ipynb) that explores the basics of [Datatable](https://datatable.readthedocs.io/en/latest/?badge=latest).\n", "Most of the examples will focus on the ``zip`` format, as it is widely used and the methods presented here can be adapted for other compression formats. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### __Read Zip Folder that has only one File__\n", "\n", "![zip-single-file.png](Images/zip_single_file.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We simply pass the zipfile : " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.13.51.40.2setosa
14.931.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
453.61.40.2setosa
55.43.91.70.4setosa
64.63.41.40.3setosa
753.41.50.2setosa
84.42.91.40.2setosa
94.93.11.50.1setosa
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from datatable import dt, fread, iread\n", "\n", "df = fread(\"Data_files/iris.zip\")\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-07-30T10:38:07.718461Z", "iopub.status.busy": "2020-07-30T10:38:07.718233Z", "iopub.status.idle": "2020-07-30T10:38:07.722533Z", "shell.execute_reply": "2020-07-30T10:38:07.721728Z", "shell.execute_reply.started": "2020-07-30T10:38:07.718439Z" } }, "source": [ "**Notes: **\n", "- [fread](https://datatable.readthedocs.io/en/latest/api/dt/fread.html#) and [iread](https://datatable.readthedocs.io/en/latest/api/dt/iread.html#) are file reading features of [Datatable](https://datatable.readthedocs.io/en/latest/?badge=latest), similar to [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) [read_csv](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### __Read a File from Multiple Files in Zip Folder__\n", "\n", "![zip-single-file-from-multiple-files.png](Images/zip_single_from_multiple_files.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this scenario, We want just the ``setosa.csv`` file. [Datatable](https://datatable.readthedocs.io/en/latest/?badge=latest) can read this with ease - simply pass the filepath to [fread](https://datatable.readthedocs.io/en/latest/api/dt/fread.html#) -> ``archive_name.zip/path_to_file``:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.13.51.40.2setosa
14.931.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
453.61.40.2setosa
55.43.91.70.4setosa
64.63.41.40.3setosa
753.41.50.2setosa
84.42.91.40.2setosa
94.93.11.50.1setosa
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = fread(\"Data_files/iris_zipped.zip/iris_zipped/setosa.csv\")\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is easier than working with [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html), where we had to import the [zipfile](https://docs.python.org/3/library/zipfile.html) to do this." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### __Read a Specific File from Multiple Files in Zip Folder based on its Extension__\n", "\n", "![zip-specific-file-ext.png](Images/zip_specific_file_extension.png)" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-07-10T07:44:58.459096Z", "iopub.status.busy": "2020-07-10T07:44:58.458681Z", "iopub.status.idle": "2020-07-10T07:44:58.468336Z", "shell.execute_reply": "2020-07-10T07:44:58.467250Z", "shell.execute_reply.started": "2020-07-10T07:44:58.459050Z" } }, "source": [ "Here, the zip folder contains various file types (``xlsx``, ``csv``, ``json``, ...), and we want only the ``.csv`` file. Let's also assume that we do not know beforehand what the filename is; we do however know that there is just one ``.csv`` file in the zip folder. There are two options : " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Option 1 : We can pick out the specific file with the [iread](https://datatable.readthedocs.io/en/latest/api/dt/iread.html#) function. [iread](https://datatable.readthedocs.io/en/latest/api/dt/iread.html#) returns a lazy iterator of all the data frames in the zip folder; as such we can filter for the specific file we are interested in :" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "IOWarning: Could not read Data_files/zipped.zip/zipped/iris.json: Too few fields on line 151: expected 6 but found only 5 (with sep=','). Set fill=True to ignore this error. << {\"sepalLength\": 5.9, \"sepalWidth\": 3.0, \"petalLength\": 5.1, \"petalWidth\": 1.8, \"species\": \"virginica\"}>>\n" ] }, { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.13.51.40.2setosa
14.931.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
453.61.40.2setosa
55.43.91.70.4setosa
64.63.41.40.3setosa
753.41.50.2setosa
84.42.91.40.2setosa
94.93.11.50.1setosa
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "for file in iread(\"Data_files/zipped.zip\"):\n", " # every file has a source attribute\n", " # that describes the file,\n", " # including the file type and name\n", " if file.source.endswith(\".csv\"):\n", " df = file\n", " \n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:**\n", "- The ``.json`` format is not supported by [Datatable](https://datatable.readthedocs.io/en/latest/?badge=latest)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Option 2 : Use the ``cmd`` argument. This passes the processing to the command line. The command line is a powerful tool that can help with a lot of data processing. No need for the [iread](https://datatable.readthedocs.io/en/latest/api/dt/iread.html#) function, just let the command line do the work." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.13.51.40.2setosa
14.931.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
453.61.40.2setosa
55.43.91.70.4setosa
64.63.41.40.3setosa
753.41.50.2setosa
84.42.91.40.2setosa
94.93.11.50.1setosa
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = fread(cmd = \"\"\"unzip -p Data_files/zipped.zip '*.csv' \"\"\")\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note :**\n", "- In the code above, we used the [unzip](https://linux.die.net/man/1/unzip) tool to filter for the ``csv`` file and extract the data. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### __Read Multiple Files with specific extension in Zip Folder into a Single Dataframe__\n", "\n", "![zip-multiple-files-concat.png](Images/zip_multiple_files_concat.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this scenario, we want to read in only the ``csv`` files and combine them into one dataframe - We will explore the [iread](https://datatable.readthedocs.io/en/latest/api/dt/iread.html#) option and the command line option:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Option 1 - [iread](https://datatable.readthedocs.io/en/latest/api/dt/iread.html#) :" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.13.51.40.2setosa
14.931.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
453.61.40.2setosa
55.43.91.70.4setosa
64.63.41.40.3setosa
753.41.50.2setosa
84.42.91.40.2setosa
94.93.11.50.1setosa
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "csv_files = [file for file in iread(\"Data_files/zip_various.zip\")\n", " if file.source.endswith(\".csv\")]\n", "\n", "# combine the dataframes into one with rbind \n", "# rbind is similar to pd.concat(*frames, axis=0) :\n", "df = dt.rbind(csv_files)\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Option 2 - Command Line :" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.13.51.40.2setosa
14.931.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
453.61.40.2setosa
55.43.91.70.4setosa
64.63.41.40.3setosa
753.41.50.2setosa
84.42.91.40.2setosa
94.93.11.50.1setosa
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = fread(cmd = \"\"\"\n", " unzip -p Data_files/zip_various.zip '*.csv' | \n", " awk 'BEGIN {FS = \",\"}; {if ($1!=\"sepal_length\" || NR==1) {print}}' \n", " \"\"\")\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:**\n", "- The [AWK](https://www.gnu.org/software/gawk/manual/html_node/Getting-Started.html#Getting-Started) language is used here to split the columns on the `,` delimiter, and to get the header for the rows." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### __Read all the Files in Zip Folder into a Single Dataframe__\n", "\n", "![zip-single-file-from-multiple-files.png](Images/zip_single_from_multiple_files.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Option 1 - [iread](https://datatable.readthedocs.io/en/latest/api/dt/iread.html#) :" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.13.51.40.2setosa
14.931.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
453.61.40.2setosa
55.43.91.70.4setosa
64.63.41.40.3setosa
753.41.50.2setosa
84.42.91.40.2setosa
94.93.11.50.1setosa
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#get a sequence of the files and combine with rbind\n", "df = dt.rbind(tuple(iread(\"Data_files/iris_zipped.zip\")))\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Option 2 - Command Line :" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.13.51.40.2setosa
14.931.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
453.61.40.2setosa
55.43.91.70.4setosa
64.63.41.40.3setosa
753.41.50.2setosa
84.42.91.40.2setosa
94.93.11.50.1setosa
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = fread(cmd = \"\"\"\n", " unzip -p Data_files/iris_zipped.zip | \n", " awk 'BEGIN {FS = \",\"}; {if ($1!=\"sepal_length\" || NR==1) {print}}' \n", " \"\"\")\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### __What about other Formats?__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The major compression archive formats supported by ``fread`` are : ``.tar``, ``.gz``, ``.zip``, ``.gz2``, and ``.tgz``. So the above steps will work for these formats. But there are other compresssion formats, including ``.7z``. How do we read that? The command line!" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-07-24T08:16:19.932765Z", "iopub.status.busy": "2020-07-24T08:16:19.932331Z", "iopub.status.idle": "2020-07-24T08:16:19.939527Z", "shell.execute_reply": "2020-07-24T08:16:19.938383Z", "shell.execute_reply.started": "2020-07-24T08:16:19.932715Z" } }, "source": [ "- Extract archive with a single file :\n", "\n", "![7z-single-file.png](Images/7z_single.png)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.13.51.40.2setosa
14.931.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
453.61.40.2setosa
55.43.91.70.4setosa
64.63.41.40.3setosa
753.41.50.2setosa
84.42.91.40.2setosa
94.93.11.50.1setosa
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#ensure 7z is already installed on your system\n", "df = fread(cmd = \"7z e -so Data_files/iris_single_7z.7z\")\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Extract specific file :\n", "\n", "![7z_multiple.png](Images/7z_multiple.png)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
06.33.362.5virginica
15.82.75.11.9virginica
27.135.92.1virginica
36.32.95.61.8virginica
46.535.82.2virginica
57.636.62.1virginica
64.92.54.51.7virginica
77.32.96.31.8virginica
86.72.55.81.8virginica
97.23.66.12.5virginica
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = fread(cmd = \"\"\" 7z e -so Data_files/iris_7z.7z \"virginica.csv\" -r \"\"\")\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-07-24T08:30:06.571549Z", "iopub.status.busy": "2020-07-24T08:30:06.571227Z", "iopub.status.idle": "2020-07-24T08:30:06.580578Z", "shell.execute_reply": "2020-07-24T08:30:06.579815Z", "shell.execute_reply.started": "2020-07-24T08:30:06.571511Z" } }, "source": [ "### __Summary__\n", "\n", "[Datatable](https://datatable.readthedocs.io/en/latest/?badge=latest) offers convenient features in reading data from compressed file formats, including the ability to read in data via the command line. Compared to [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html), [Datatable](https://datatable.readthedocs.io/en/latest/?badge=latest) offers more flexibility and power when dealing with compressed file formats." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comments\n", "" ] } ], "metadata": { "jupytext": { "formats": "ipynb,md" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 4 }