{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Extract DataFrame from Compressed Data into Python Datatable"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Compressed Data Extraction in Python Datatable**"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-06-26T01:27:05.586391Z",
"iopub.status.busy": "2020-06-26T01:27:05.586173Z",
"iopub.status.idle": "2020-06-26T01:27:05.592158Z",
"shell.execute_reply": "2020-06-26T01:27:05.591344Z",
"shell.execute_reply.started": "2020-06-26T01:27:05.586367Z"
}
},
"source": [
"In a previous [blog post](Extract-DataFrame-from-Compressed-Data-into-Pandas.ipynb), I discussed the various ways to extract data from compressed files into [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html). This blog post explores those options, this time with [Datatable](https://datatable.readthedocs.io/en/latest/?badge=latest). This package is a reimplementation in Python of [Rdatatable](https://rdatatable.gitlab.io/data.table/), aiming to replicate its speed and concise syntax. Have a look at my other [blog post](Selecting-and-Grouping-Data-with-Python-Datatable.ipynb) that explores the basics of [Datatable](https://datatable.readthedocs.io/en/latest/?badge=latest).\n",
"Most of the examples will focus on the ``zip`` format, as it is widely used and the methods presented here can be adapted for other compression formats. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### __Read Zip Folder that has only one File__\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "We simply pass the zip file to ``fread``:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "   | sepal_length  sepal_width  petal_length  petal_width  species\n",
       "-- + ------------  -----------  ------------  -----------  -------\n",
       " 0 |          5.1          3.5           1.4          0.2  setosa \n",
       " 1 |          4.9          3             1.4          0.2  setosa \n",
       " 2 |          4.7          3.2           1.3          0.2  setosa \n",
       " 3 |          4.6          3.1           1.5          0.2  setosa \n",
       " 4 |          5            3.6           1.4          0.2  setosa \n",
       " 5 |          5.4          3.9           1.7          0.4  setosa \n",
       " 6 |          4.6          3.4           1.4          0.3  setosa \n",
       " 7 |          5            3.4           1.5          0.2  setosa \n",
       " 8 |          4.4          2.9           1.4          0.2  setosa \n",
       " 9 |          4.9          3.1           1.5          0.1  setosa \n",
       "\n",
       "[10 rows x 5 columns]"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
"source": [
"from datatable import dt, fread, iread\n",
"\n",
"df = fread(\"Data_files/iris.zip\")\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-07-30T10:38:07.718461Z",
"iopub.status.busy": "2020-07-30T10:38:07.718233Z",
"iopub.status.idle": "2020-07-30T10:38:07.722533Z",
"shell.execute_reply": "2020-07-30T10:38:07.721728Z",
"shell.execute_reply.started": "2020-07-30T10:38:07.718439Z"
}
},
"source": [
    "**Note:**\n",
    "- [fread](https://datatable.readthedocs.io/en/latest/api/dt/fread.html#) and [iread](https://datatable.readthedocs.io/en/latest/api/dt/iread.html#) are [Datatable](https://datatable.readthedocs.io/en/latest/?badge=latest)'s file-reading functions, similar to [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html)' [read_csv](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### __Read a File from Multiple Files in Zip Folder__\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "In this scenario, we want just the ``setosa.csv`` file. [Datatable](https://datatable.readthedocs.io/en/latest/?badge=latest) handles this with ease: simply pass the path of the file inside the archive to [fread](https://datatable.readthedocs.io/en/latest/api/dt/fread.html#), in the form ``archive_name.zip/path_to_file``:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "   | sepal_length  sepal_width  petal_length  petal_width  species\n",
       "-- + ------------  -----------  ------------  -----------  -------\n",
       " 0 |          5.1          3.5           1.4          0.2  setosa \n",
       " 1 |          4.9          3             1.4          0.2  setosa \n",
       " 2 |          4.7          3.2           1.3          0.2  setosa \n",
       " 3 |          4.6          3.1           1.5          0.2  setosa \n",
       " 4 |          5            3.6           1.4          0.2  setosa \n",
       " 5 |          5.4          3.9           1.7          0.4  setosa \n",
       " 6 |          4.6          3.4           1.4          0.3  setosa \n",
       " 7 |          5            3.4           1.5          0.2  setosa \n",
       " 8 |          4.4          2.9           1.4          0.2  setosa \n",
       " 9 |          4.9          3.1           1.5          0.1  setosa \n",
       "\n",
       "[10 rows x 5 columns]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
"source": [
"df = fread(\"Data_files/iris_zipped.zip/iris_zipped/setosa.csv\")\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "This is easier than in [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html), where we had to use the [zipfile](https://docs.python.org/3/library/zipfile.html) module to achieve the same result."
]
},
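  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For comparison, here is a minimal sketch of the [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) approach (assuming the same archive layout as above):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import zipfile\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "# open the archive, then open the member file and hand it to read_csv\n",
    "with zipfile.ZipFile(\"Data_files/iris_zipped.zip\") as zf:\n",
    "    with zf.open(\"iris_zipped/setosa.csv\") as f:\n",
    "        pdf = pd.read_csv(f)"
   ]
  },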
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### __Read a Specific File from Multiple Files in Zip Folder based on its Extension__\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-07-10T07:44:58.459096Z",
"iopub.status.busy": "2020-07-10T07:44:58.458681Z",
"iopub.status.idle": "2020-07-10T07:44:58.468336Z",
"shell.execute_reply": "2020-07-10T07:44:58.467250Z",
"shell.execute_reply.started": "2020-07-10T07:44:58.459050Z"
}
},
"source": [
    "Here, the zip folder contains various file types (``xlsx``, ``csv``, ``json``, ...), and we want only the ``.csv`` file. Let's also assume that we do not know beforehand what the filename is; we do, however, know that there is just one ``.csv`` file in the zip folder. There are two options:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Option 1 : We can pick out the specific file with the [iread](https://datatable.readthedocs.io/en/latest/api/dt/iread.html#) function. [iread](https://datatable.readthedocs.io/en/latest/api/dt/iread.html#) returns a lazy iterator over the data frames in the zip folder, so we can filter for the file we are interested in:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "IOWarning: Could not read Data_files/zipped.zip/zipped/iris.json: Too few fields on line 151: expected 6 but found only 5 (with sep=','). Set fill=True to ignore this error. << {\"sepalLength\": 5.9, \"sepalWidth\": 3.0, \"petalLength\": 5.1, \"petalWidth\": 1.8, \"species\": \"virginica\"}>>\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "   | sepal_length  sepal_width  petal_length  petal_width  species\n",
       "-- + ------------  -----------  ------------  -----------  -------\n",
       " 0 |          5.1          3.5           1.4          0.2  setosa \n",
       " 1 |          4.9          3             1.4          0.2  setosa \n",
       " 2 |          4.7          3.2           1.3          0.2  setosa \n",
       " 3 |          4.6          3.1           1.5          0.2  setosa \n",
       " 4 |          5            3.6           1.4          0.2  setosa \n",
       " 5 |          5.4          3.9           1.7          0.4  setosa \n",
       " 6 |          4.6          3.4           1.4          0.3  setosa \n",
       " 7 |          5            3.4           1.5          0.2  setosa \n",
       " 8 |          4.4          2.9           1.4          0.2  setosa \n",
       " 9 |          4.9          3.1           1.5          0.1  setosa \n",
       "\n",
       "[10 rows x 5 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
"source": [
"for file in iread(\"Data_files/zipped.zip\"):\n",
" # every file has a source attribute\n",
" # that describes the file,\n",
" # including the file type and name\n",
" if file.source.endswith(\".csv\"):\n",
" df = file\n",
" \n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:**\n",
"- The ``.json`` format is not supported by [Datatable](https://datatable.readthedocs.io/en/latest/?badge=latest)."
]
},
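  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a workaround, here is a sketch using only the standard library (the member path is taken from the warning above): load the ``.json`` member with [zipfile](https://docs.python.org/3/library/zipfile.html) and [json](https://docs.python.org/3/library/json.html), reshape the records into a dictionary of lists, and hand that to ``dt.Frame``:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import zipfile\n",
    "\n",
    "# read the raw records out of the archive\n",
    "with zipfile.ZipFile(\"Data_files/zipped.zip\") as zf:\n",
    "    with zf.open(\"zipped/iris.json\") as f:\n",
    "        records = json.load(f)\n",
    "\n",
    "# list of records -> dict of columns, which dt.Frame accepts\n",
    "json_df = dt.Frame({key: [row[key] for row in records] for key in records[0]})"
   ]
  },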
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Option 2 : Use the ``cmd`` argument, which delegates the processing to the command line. The command line is a powerful tool that can help with a lot of data processing. There is no need for the [iread](https://datatable.readthedocs.io/en/latest/api/dt/iread.html#) function; the shell command does the work."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "   | sepal_length  sepal_width  petal_length  petal_width  species\n",
       "-- + ------------  -----------  ------------  -----------  -------\n",
       " 0 |          5.1          3.5           1.4          0.2  setosa \n",
       " 1 |          4.9          3             1.4          0.2  setosa \n",
       " 2 |          4.7          3.2           1.3          0.2  setosa \n",
       " 3 |          4.6          3.1           1.5          0.2  setosa \n",
       " 4 |          5            3.6           1.4          0.2  setosa \n",
       " 5 |          5.4          3.9           1.7          0.4  setosa \n",
       " 6 |          4.6          3.4           1.4          0.3  setosa \n",
       " 7 |          5            3.4           1.5          0.2  setosa \n",
       " 8 |          4.4          2.9           1.4          0.2  setosa \n",
       " 9 |          4.9          3.1           1.5          0.1  setosa \n",
       "\n",
       "[10 rows x 5 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
"source": [
"df = fread(cmd = \"\"\"unzip -p Data_files/zipped.zip '*.csv' \"\"\")\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "**Note:**\n",
    "- In the code above, we used the [unzip](https://linux.die.net/man/1/unzip) tool to filter for the ``csv`` file and stream its contents to ``fread``."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### __Read Multiple Files with specific extension in Zip Folder into a Single Dataframe__\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "In this scenario, we want to read in only the ``csv`` files and combine them into one dataframe. We will explore both the [iread](https://datatable.readthedocs.io/en/latest/api/dt/iread.html#) option and the command-line option:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Option 1 - [iread](https://datatable.readthedocs.io/en/latest/api/dt/iread.html#) :"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "   | sepal_length  sepal_width  petal_length  petal_width  species\n",
       "-- + ------------  -----------  ------------  -----------  -------\n",
       " 0 |          5.1          3.5           1.4          0.2  setosa \n",
       " 1 |          4.9          3             1.4          0.2  setosa \n",
       " 2 |          4.7          3.2           1.3          0.2  setosa \n",
       " 3 |          4.6          3.1           1.5          0.2  setosa \n",
       " 4 |          5            3.6           1.4          0.2  setosa \n",
       " 5 |          5.4          3.9           1.7          0.4  setosa \n",
       " 6 |          4.6          3.4           1.4          0.3  setosa \n",
       " 7 |          5            3.4           1.5          0.2  setosa \n",
       " 8 |          4.4          2.9           1.4          0.2  setosa \n",
       " 9 |          4.9          3.1           1.5          0.1  setosa \n",
       "\n",
       "[10 rows x 5 columns]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
"source": [
"csv_files = [file for file in iread(\"Data_files/zip_various.zip\")\n",
" if file.source.endswith(\".csv\")]\n",
"\n",
"# combine the dataframes into one with rbind \n",
    "# rbind is similar to pd.concat(frames, axis=0):\n",
"df = dt.rbind(csv_files)\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Option 2 - Command Line :"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "   | sepal_length  sepal_width  petal_length  petal_width  species\n",
       "-- + ------------  -----------  ------------  -----------  -------\n",
       " 0 |          5.1          3.5           1.4          0.2  setosa \n",
       " 1 |          4.9          3             1.4          0.2  setosa \n",
       " 2 |          4.7          3.2           1.3          0.2  setosa \n",
       " 3 |          4.6          3.1           1.5          0.2  setosa \n",
       " 4 |          5            3.6           1.4          0.2  setosa \n",
       " 5 |          5.4          3.9           1.7          0.4  setosa \n",
       " 6 |          4.6          3.4           1.4          0.3  setosa \n",
       " 7 |          5            3.4           1.5          0.2  setosa \n",
       " 8 |          4.4          2.9           1.4          0.2  setosa \n",
       " 9 |          4.9          3.1           1.5          0.1  setosa \n",
       "\n",
       "[10 rows x 5 columns]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
"source": [
"df = fread(cmd = \"\"\"\n",
" unzip -p Data_files/zip_various.zip '*.csv' | \n",
" awk 'BEGIN {FS = \",\"}; {if ($1!=\"sepal_length\" || NR==1) {print}}' \n",
" \"\"\")\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:**\n",
    "- The [AWK](https://www.gnu.org/software/gawk/manual/html_node/Getting-Started.html#Getting-Started) command sets the field separator to `,` and keeps only the first header row (``NR==1``), dropping the repeated ``sepal_length`` header lines from the other files."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### __Read all the Files in Zip Folder into a Single Dataframe__\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Option 1 - [iread](https://datatable.readthedocs.io/en/latest/api/dt/iread.html#) :"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "   | sepal_length  sepal_width  petal_length  petal_width  species\n",
       "-- + ------------  -----------  ------------  -----------  -------\n",
       " 0 |          5.1          3.5           1.4          0.2  setosa \n",
       " 1 |          4.9          3             1.4          0.2  setosa \n",
       " 2 |          4.7          3.2           1.3          0.2  setosa \n",
       " 3 |          4.6          3.1           1.5          0.2  setosa \n",
       " 4 |          5            3.6           1.4          0.2  setosa \n",
       " 5 |          5.4          3.9           1.7          0.4  setosa \n",
       " 6 |          4.6          3.4           1.4          0.3  setosa \n",
       " 7 |          5            3.4           1.5          0.2  setosa \n",
       " 8 |          4.4          2.9           1.4          0.2  setosa \n",
       " 9 |          4.9          3.1           1.5          0.1  setosa \n",
       "\n",
       "[10 rows x 5 columns]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
"source": [
    "# get a sequence of all the frames and combine them with rbind\n",
"df = dt.rbind(tuple(iread(\"Data_files/iris_zipped.zip\")))\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Option 2 - Command Line :"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "   | sepal_length  sepal_width  petal_length  petal_width  species\n",
       "-- + ------------  -----------  ------------  -----------  -------\n",
       " 0 |          5.1          3.5           1.4          0.2  setosa \n",
       " 1 |          4.9          3             1.4          0.2  setosa \n",
       " 2 |          4.7          3.2           1.3          0.2  setosa \n",
       " 3 |          4.6          3.1           1.5          0.2  setosa \n",
       " 4 |          5            3.6           1.4          0.2  setosa \n",
       " 5 |          5.4          3.9           1.7          0.4  setosa \n",
       " 6 |          4.6          3.4           1.4          0.3  setosa \n",
       " 7 |          5            3.4           1.5          0.2  setosa \n",
       " 8 |          4.4          2.9           1.4          0.2  setosa \n",
       " 9 |          4.9          3.1           1.5          0.1  setosa \n",
       "\n",
       "[10 rows x 5 columns]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
"source": [
"df = fread(cmd = \"\"\"\n",
" unzip -p Data_files/iris_zipped.zip | \n",
" awk 'BEGIN {FS = \",\"}; {if ($1!=\"sepal_length\" || NR==1) {print}}' \n",
" \"\"\")\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### __What about other Formats?__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "The major compressed formats supported by ``fread`` are ``.tar``, ``.gz``, ``.zip``, ``.bz2``, and ``.tgz``, so the steps above will work for these formats. But there are other compression formats, including ``.7z``. How do we read those? The command line!"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-07-24T08:16:19.932765Z",
"iopub.status.busy": "2020-07-24T08:16:19.932331Z",
"iopub.status.idle": "2020-07-24T08:16:19.939527Z",
"shell.execute_reply": "2020-07-24T08:16:19.938383Z",
"shell.execute_reply.started": "2020-07-24T08:16:19.932715Z"
}
},
"source": [
"- Extract archive with a single file :\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "   | sepal_length  sepal_width  petal_length  petal_width  species\n",
       "-- + ------------  -----------  ------------  -----------  -------\n",
       " 0 |          5.1          3.5           1.4          0.2  setosa \n",
       " 1 |          4.9          3             1.4          0.2  setosa \n",
       " 2 |          4.7          3.2           1.3          0.2  setosa \n",
       " 3 |          4.6          3.1           1.5          0.2  setosa \n",
       " 4 |          5            3.6           1.4          0.2  setosa \n",
       " 5 |          5.4          3.9           1.7          0.4  setosa \n",
       " 6 |          4.6          3.4           1.4          0.3  setosa \n",
       " 7 |          5            3.4           1.5          0.2  setosa \n",
       " 8 |          4.4          2.9           1.4          0.2  setosa \n",
       " 9 |          4.9          3.1           1.5          0.1  setosa \n",
       "\n",
       "[10 rows x 5 columns]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
"source": [
    "# ensure 7z is installed on your system\n",
"df = fread(cmd = \"7z e -so Data_files/iris_single_7z.7z\")\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Extract specific file :\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "   | sepal_length  sepal_width  petal_length  petal_width  species  \n",
       "-- + ------------  -----------  ------------  -----------  ---------\n",
       " 0 |          6.3          3.3           6            2.5  virginica\n",
       " 1 |          5.8          2.7           5.1          1.9  virginica\n",
       " 2 |          7.1          3             5.9          2.1  virginica\n",
       " 3 |          6.3          2.9           5.6          1.8  virginica\n",
       " 4 |          6.5          3             5.8          2.2  virginica\n",
       " 5 |          7.6          3             6.6          2.1  virginica\n",
       " 6 |          4.9          2.5           4.5          1.7  virginica\n",
       " 7 |          7.3          2.9           6.3          1.8  virginica\n",
       " 8 |          6.7          2.5           5.8          1.8  virginica\n",
       " 9 |          7.2          3.6           6.1          2.5  virginica\n",
       "\n",
       "[10 rows x 5 columns]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
"source": [
"df = fread(cmd = \"\"\" 7z e -so Data_files/iris_7z.7z \"virginica.csv\" -r \"\"\")\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-07-24T08:30:06.571549Z",
"iopub.status.busy": "2020-07-24T08:30:06.571227Z",
"iopub.status.idle": "2020-07-24T08:30:06.580578Z",
"shell.execute_reply": "2020-07-24T08:30:06.579815Z",
"shell.execute_reply.started": "2020-07-24T08:30:06.571511Z"
}
},
"source": [
"### __Summary__\n",
"\n",
    "[Datatable](https://datatable.readthedocs.io/en/latest/?badge=latest) offers convenient features for reading data from compressed file formats, including the ability to read data via the command line. Compared to [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html), [Datatable](https://datatable.readthedocs.io/en/latest/?badge=latest) offers more flexibility and power when dealing with compressed files."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Comments\n",
""
]
}
],
"metadata": {
"jupytext": {
"formats": "ipynb,md"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.18"
}
},
"nbformat": 4,
"nbformat_minor": 4
}