|
7 | 7 | "source": [
|
8 | 8 | "# How to Build Time Series Applications in CrateDB\n",
|
9 | 9 | "\n",
|
10 |
| - "This notebook guides you through an example of how to import and work with\n", |
| 10 | + "This notebook guides you through an example of how to batch import \n", |
11 | 11 | "time series data in CrateDB. It uses Dask to import data into CrateDB.\n",
|
12 | 12 | "Dask is a framework to parallelize operations on pandas Dataframes.\n",
|
13 | 13 | "\n",
|
| 14 | + "## Important Note\n", |
| 15 | + "If you are running this notebook on a (free) Google Colab environment, you \n", |
| 16 | + "might not see the parallelized execution of Dask operations due to constrained\n", |
| 17 | + "CPU availability.\n", |
| 18 | + "\n", |
| 19 | + "We therefore recommend running this notebook either locally or on an environment\n", |
| 20 | + "that provides sufficient CPU capacity to demonstrate the parallel execution of\n", |
| 21 | + "dataframe operations as well as write operations to CrateDB.\n", |
| 22 | + "\n", |
14 | 23 | "## Dataset\n",
|
15 | 24 | "This notebook uses a daily weather data set provided on kaggle.com. This dataset\n",
|
16 | 25 | "offers a collection of **daily weather readings from major cities around the\n",
|
|
57 | 66 | },
|
58 | 67 | "outputs": [],
|
59 | 68 | "source": [
|
60 |
| - "#!pip install dask pandas==2.0.0 'sqlalchemy[crate]'" |
| 69 | + "!pip install dask 'pandas==2.0.0' 'crate[sqlalchemy]' 'cratedb-toolkit==0.0.10' 'pueblo>=0.0.7' kaggle" |
61 | 70 | ]
|
62 | 71 | },
|
63 | 72 | {
|
|
75 | 84 | "- Countries (countries.csv)\n",
|
76 | 85 | "\n",
|
77 | 86 | "The subsequent code cell acquires the dataset directly from kaggle.com.\n",
|
| 87 | + "To import the data automatically, you need to create a (free)\n", |
| 88 | + "API key in your kaggle.com user settings.\n", |
| 89 | + "\n", |
78 | 90 | "To properly configure the notebook to use corresponding credentials\n",
|
79 | 91 | "after signing up on Kaggle, define the `KAGGLE_USERNAME` and\n",
|
80 | 92 | "`KAGGLE_KEY` environment variables. Alternatively, put them into the\n",
|
|
85 | 97 | " \"key\": \"2b1dac2af55caaf1f34df76236fada4a\"\n",
|
86 | 98 | "}\n",
|
87 | 99 | "```\n",
|
| 100 | + "\n", |
88 | 101 | "Another variant is to acquire the dataset files manually, and extract\n",
|
89 | 102 | "them into a folder called `DOWNLOAD`. In this case, you can deactivate\n",
|
90 | 103 | "those two lines of code, in order to skip automatic dataset acquisition."
|
91 | 104 | ]
|
92 | 105 | },
|
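To make the credential options above concrete, a minimal sketch of providing the Kaggle API credentials via environment variables before dataset acquisition. The values shown are placeholders, not working credentials.

```python
import os

# Placeholder credentials; replace with the username and API key from
# your kaggle.com account settings. setdefault() keeps any values that
# are already configured in the environment.
os.environ.setdefault("KAGGLE_USERNAME", "your-kaggle-username")
os.environ.setdefault("KAGGLE_KEY", "your-kaggle-api-key")
```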
93 | 106 | {
|
94 | 107 | "cell_type": "code",
|
95 |
| - "execution_count": null, |
96 |
| - "outputs": [], |
| 108 | + "execution_count": 3, |
| 109 | + "id": "8fcc014a", |
| 110 | + "metadata": {}, |
| 111 | + "outputs": [ |
| 112 | + { |
| 113 | + "name": "stdout", |
| 114 | + "output_type": "stream", |
| 115 | + "text": [ |
| 116 | + "Dataset URL: https://www.kaggle.com/datasets/guillemservera/global-daily-climate-data\n" |
| 117 | + ] |
| 118 | + } |
| 119 | + ], |
97 | 120 | "source": [
|
| 121 | + "from pueblo.util.environ import getenvpass\n", |
98 | 122 | "from cratedb_toolkit.datasets import load_dataset\n",
|
99 | 123 | "\n",
|
| 124 | + "# Uncomment and execute the following lines to get prompted for your Kaggle username and key\n", |
| 125 | + "# getenvpass(\"KAGGLE_USERNAME\", prompt=\"Kaggle.com User Name:\")\n", |
| 126 | + "# getenvpass(\"KAGGLE_KEY\", prompt=\"Kaggle.com Key:\")\n", |
| 127 | + "\n", |
100 | 128 | "dataset = load_dataset(\"kaggle://guillemservera/global-daily-climate-data/daily_weather.parquet\")\n",
|
101 | 129 | "dataset.acquire()"
|
102 |
| - ], |
103 |
| - "metadata": { |
104 |
| - "collapsed": false |
105 |
| - } |
| 130 | + ] |
106 | 131 | },
|
107 | 132 | {
|
108 | 133 | "cell_type": "code",
|
109 |
| - "execution_count": 88, |
| 134 | + "execution_count": 6, |
| 135 | + "id": "d9e2916d", |
| 136 | + "metadata": {}, |
110 | 137 | "outputs": [],
|
111 | 138 | "source": [
|
112 | 139 | "from dask import dataframe as dd\n",
|
113 | 140 | "from dask.diagnostics import ProgressBar\n",
|
114 | 141 | "\n",
|
| 142 | + "# Use the Dask multiprocessing scheduler\n", |
| 143 | + "import dask.multiprocessing\n", |
| 144 | + "dask.config.set(scheduler=dask.multiprocessing.get)\n", |
| 145 | + "\n", |
115 | 146 | "# Show a progress bar for dask activities\n",
|
116 | 147 | "pbar = ProgressBar()\n",
|
117 | 148 | "pbar.register()"
|
118 |
| - ], |
119 |
| - "metadata": { |
120 |
| - "collapsed": false |
121 |
| - } |
| 149 | + ] |
122 | 150 | },
|
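A note on the scheduler configuration in the cell above: recent Dask versions also accept string aliases for the built-in schedulers, which avoids importing `dask.multiprocessing` directly. A minimal sketch, assuming a current Dask release:

```python
import dask

# "processes" selects the multiprocessing scheduler; "threads" and
# "synchronous" are the other built-in aliases. Subsequent compute()
# calls will then fan work out across worker processes.
dask.config.set(scheduler="processes")
print(dask.config.get("scheduler"))  # → processes
```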
123 | 151 | {
|
124 | 152 | "cell_type": "code",
|
125 |
| - "execution_count": 56, |
| 153 | + "execution_count": 9, |
126 | 154 | "id": "a506f7c9",
|
127 | 155 | "metadata": {},
|
128 | 156 | "outputs": [
|
129 | 157 | {
|
130 | 158 | "name": "stdout",
|
131 | 159 | "output_type": "stream",
|
132 | 160 | "text": [
|
133 |
| - "[########################################] | 100% Completed | 6.26 ss\n", |
134 |
| - "[########################################] | 100% Completed | 6.37 s\n", |
135 |
| - "[########################################] | 100% Completed | 6.47 s\n", |
136 |
| - "[########################################] | 100% Completed | 6.47 s\n", |
| 161 | + "[########################################] | 100% Completed | 127.49 s\n", |
| 162 | + "[########################################] | 100% Completed | 127.49 s\n", |
137 | 163 | "<class 'dask.dataframe.core.DataFrame'>\n",
|
138 | 164 | "Index: 27635763 entries, 0 to 24220\n",
|
139 | 165 | "Data columns (total 14 columns):\n",
|
|
155 | 181 | "13 sunshine_total_min 1021461 non-null float64\n",
|
156 | 182 | "dtypes: category(3), datetime64[ns](1), float64(10)\n",
|
157 | 183 | "memory usage: 2.6 GB\n",
|
158 |
| - "[########################################] | 100% Completed | 5.37 ss\n", |
159 |
| - "[########################################] | 100% Completed | 5.48 s\n", |
160 |
| - "[########################################] | 100% Completed | 5.58 s\n", |
161 |
| - "[########################################] | 100% Completed | 5.68 s\n" |
| 184 | + "[########################################] | 100% Completed | 4.82 ss\n", |
| 185 | + "[########################################] | 100% Completed | 4.89 s\n" |
162 | 186 | ]
|
163 | 187 | },
|
164 | 188 | {
|
|
311 | 335 | "4 NaN NaN NaN "
|
312 | 336 | ]
|
313 | 337 | },
|
314 |
| - "execution_count": 56, |
| 338 | + "execution_count": 9, |
315 | 339 | "metadata": {},
|
316 | 340 | "output_type": "execute_result"
|
317 | 341 | }
|
|
490 | 514 | },
|
491 | 515 | {
|
492 | 516 | "cell_type": "markdown",
|
| 517 | + "id": "ea1dfadc", |
| 518 | + "metadata": {}, |
493 | 519 | "source": [
|
494 | 520 | "### Connect to CrateDB\n",
|
495 | 521 | "\n",
|
496 | 522 | "This code uses SQLAlchemy to connect to CrateDB."
|
497 |
| - ], |
498 |
| - "metadata": { |
499 |
| - "collapsed": false |
500 |
| - } |
| 523 | + ] |
501 | 524 | },
|
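As a sketch of what a CrateDB connection via SQLAlchemy looks like: the `crate://localhost:4200` URL is an assumption for a locally running instance on the default HTTP port; CrateDB Cloud deployments use a different host and credentials.

```python
import sqlalchemy as sa

# Connection URL for the CrateDB SQLAlchemy dialect, assuming a local
# instance on the default HTTP port; adjust host, port, and credentials
# for your setup.
dburi = "crate://localhost:4200"
url = sa.engine.make_url(dburi)

# create_engine() requires the CrateDB dialect package installed earlier
# in this notebook:
# engine = sa.create_engine(dburi)
print(url.drivername)  # → crate
```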
502 | 525 | {
|
503 | 526 | "cell_type": "code",
|
|