Commit 43adcf2

ckurze authored and amotl committed
Finalization in wording
1 parent e387a5f commit 43adcf2

1 file changed: +50 -27 lines changed

topic/timeseries/dask-weather-data-import.ipynb

Lines changed: 50 additions & 27 deletions
@@ -7,10 +7,19 @@
 "source": [
 "# How to Build Time Series Applications in CrateDB\n",
 "\n",
-"This notebook guides you through an example of how to import and work with\n",
+"This notebook guides you through an example of how to batch import \n",
 "time series data in CrateDB. It uses Dask to import data into CrateDB.\n",
 "Dask is a framework to parallelize operations on pandas Dataframes.\n",
 "\n",
+"## Important Note\n",
+"If you are running this notebook on a (free) Google Colab environment, you \n",
+"might not see the parallelized execution of Dask operations due to constrained\n",
+"CPU availability.\n",
+"\n",
+"We therefore recommend to run this notebook either locally or on an environment\n",
+"that provides sufficient CPU capacity to demonstrate the parallel execution of\n",
+"dataframe operations as well as write operations to CrateDB.\n",
+"\n",
 "## Dataset\n",
 "This notebook uses a daily weather data set provided on kaggle.com. This dataset\n",
 "offers a collection of **daily weather readings from major cities around the\n",
@@ -57,7 +66,7 @@
 },
 "outputs": [],
 "source": [
-"#!pip install dask pandas==2.0.0 'sqlalchemy[crate]'"
+"!pip install dask 'pandas==2.0.0' 'crate[sqlalchemy]' 'cratedb-toolkit==0.0.10' 'pueblo>=0.0.7' kaggle"
 ]
 },
 {
@@ -75,6 +84,9 @@
 "- Countries (countries.csv)\n",
 "\n",
 "The subsequent code cell acquires the dataset directly from kaggle.com.\n",
+"In order to import the data automatically, you need to create a (free)\n",
+"API key in your kaggle.com user settings. \n",
+"\n",
 "To properly configure the notebook to use corresponding credentials\n",
 "after signing up on Kaggle, define the `KAGGLE_USERNAME` and\n",
 "`KAGGLE_KEY` environment variables. Alternatively, put them into the\n",
@@ -85,55 +97,69 @@
 " \"key\": \"2b1dac2af55caaf1f34df76236fada4a\"\n",
 "}\n",
 "```\n",
+"\n",
 "Another variant is to acquire the dataset files manually, and extract\n",
 "them into a folder called `DOWNLOAD`. In this case, you can deactivate\n",
 "those two lines of code, in order to skip automatic dataset acquisition."
 ]
 },
 {
 "cell_type": "code",
-"execution_count": null,
-"outputs": [],
+"execution_count": 3,
+"id": "8fcc014a",
+"metadata": {},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"Dataset URL: https://www.kaggle.com/datasets/guillemservera/global-daily-climate-data\n"
+]
+}
+],
 "source": [
+"from pueblo.util.environ import getenvpass\n",
 "from cratedb_toolkit.datasets import load_dataset\n",
 "\n",
+"# Uncomment and execute the following lines to get prompted for kaggle user name and key\n",
+"# getenvpass(\"KAGGLE_USERNAME\", prompt=\"Kaggle.com User Name:\")\n",
+"# getenvpass(\"KAGGLE_KEY\", prompt=\"Kaggle.com Key:\")\n",
+"\n",
 "dataset = load_dataset(\"kaggle://guillemservera/global-daily-climate-data/daily_weather.parquet\")\n",
 "dataset.acquire()"
-],
-"metadata": {
-"collapsed": false
-}
+]
 },
 {
 "cell_type": "code",
-"execution_count": 88,
+"execution_count": 6,
+"id": "d9e2916d",
+"metadata": {},
 "outputs": [],
 "source": [
 "from dask import dataframe as dd\n",
 "from dask.diagnostics import ProgressBar\n",
 "\n",
+"# Use multiprocessing of dask\n",
+"import dask.multiprocessing\n",
+"dask.config.set(scheduler=dask.multiprocessing.get)\n",
+"\n",
 "# Show a progress bar for dask activities\n",
 "pbar = ProgressBar()\n",
 "pbar.register()"
-],
-"metadata": {
-"collapsed": false
-}
+]
 },
 {
 "cell_type": "code",
-"execution_count": 56,
+"execution_count": 9,
 "id": "a506f7c9",
 "metadata": {},
 "outputs": [
 {
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"[########################################] | 100% Completed | 6.26 ss\n",
-"[########################################] | 100% Completed | 6.37 s\n",
-"[########################################] | 100% Completed | 6.47 s\n",
-"[########################################] | 100% Completed | 6.47 s\n",
+"[########################################] | 100% Completed | 127.49 s\n",
+"[########################################] | 100% Completed | 127.49 s\n",
 "<class 'dask.dataframe.core.DataFrame'>\n",
 "Index: 27635763 entries, 0 to 24220\n",
 "Data columns (total 14 columns):\n",
@@ -155,10 +181,8 @@
 "13 sunshine_total_min 1021461 non-null float64\n",
 "dtypes: category(3), datetime64[ns](1), float64(10)\n",
 "memory usage: 2.6 GB\n",
-"[########################################] | 100% Completed | 5.37 ss\n",
-"[########################################] | 100% Completed | 5.48 s\n",
-"[########################################] | 100% Completed | 5.58 s\n",
-"[########################################] | 100% Completed | 5.68 s\n"
+"[########################################] | 100% Completed | 4.82 ss\n",
+"[########################################] | 100% Completed | 4.89 s\n"
 ]
 },
 {
@@ -311,7 +335,7 @@
 "4 NaN NaN NaN "
 ]
 },
-"execution_count": 56,
+"execution_count": 9,
 "metadata": {},
 "output_type": "execute_result"
 }
@@ -490,14 +514,13 @@
 },
 {
 "cell_type": "markdown",
+"id": "ea1dfadc",
+"metadata": {},
 "source": [
 "### Connect to CrateDB\n",
 "\n",
 "This code uses SQLAlchemy to connect to CrateDB."
-],
-"metadata": {
-"collapsed": false
-}
+]
 },
 {
 "cell_type": "code",

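The notebook text in this diff describes supplying Kaggle credentials through the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, optionally prompting for them with `getenvpass`. As a rough illustration of that pattern, the sketch below uses only the Python standard library. The helper name `ensure_env_secret` and its `reader` parameter are hypothetical and are not part of pueblo or cratedb-toolkit. Skipping the prompt when the variable is already set is an assumption about the desired behavior, not a statement about how `getenvpass` is implemented.

```python
import getpass
import os


def ensure_env_secret(name, prompt, reader=getpass.getpass):
    """Return the value of environment variable `name`.

    Hypothetical helper: if the variable is unset or empty, ask for it
    via `reader` (a hidden-input prompt by default) and store the answer
    in os.environ, so that code started later in the same process can
    read it from there.
    """
    value = os.environ.get(name)
    if not value:
        value = reader(prompt)
        os.environ[name] = value
    return value
```

In a notebook cell, this could be called as `ensure_env_secret("KAGGLE_USERNAME", "Kaggle.com User Name:")` and `ensure_env_secret("KAGGLE_KEY", "Kaggle.com Key:")` before the dataset is acquired. The `reader` parameter exists only so the prompt can be replaced in non-interactive runs.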