Commit 2b27cfb

Merge pull request #203 from comet-ml/add-more-ray-train-examples
Add Ray + PYL example
2 parents 502b032 + 10101b6 commit 2b27cfb

2 files changed

+288
-0
lines changed

.github/workflows/test-examples.yml

Lines changed: 1 addition & 0 deletions
@@ -35,6 +35,7 @@ jobs:
  - integrations/model-training/pytorch/notebooks/Histogram_Logging_Pytorch.ipynb
  - integrations/model-training/ray-train/notebooks/Comet_with_ray_train_huggingface_transformers.ipynb
  - integrations/model-training/ray-train/notebooks/Comet_with_ray_train_keras.ipynb
+ - integrations/model-training/ray-train/notebooks/Comet_with_ray_train_pytorch_lightning.ipynb
  - integrations/model-training/ray-train/notebooks/Comet_with_ray_train_xgboost.ipynb
  - integrations/model-training/tensorflow/notebooks/Comet_and_Tensorflow.ipynb
  - integrations/model-training/transformers/notebooks/Comet_with_Hugging_Face_Trainer.ipynb
Lines changed: 287 additions & 0 deletions
@@ -0,0 +1,287 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img src=\"https://cdn.comet.ml/img/notebook_logo.png\">"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "[Comet](https://www.comet.com/site/products/ml-experiment-tracking/?utm_campaign=ray_train&utm_medium=colab) is an MLOps platform designed to help data scientists and teams build better models faster! Comet provides tooling to track, explain, manage, and monitor your models in a single place! It works with Jupyter notebooks and scripts and, most importantly, it's 100% free to get started!\n",
+    "\n",
+    "[Ray Train](https://docs.ray.io/en/latest/train/train.html) abstracts away the complexity of setting up a distributed training system.\n",
+    "\n",
+    "Instrument your runs with Comet to start managing experiments, create dataset versions, and track hyperparameters for faster and easier reproducibility and collaboration.\n",
+    "\n",
+    "[Find more information about our integration with Ray Train](https://www.comet.ml/docs/v2/integrations/ml-frameworks/ray/)\n",
+    "\n",
+    "Get a preview of what's to come. Check out a completed experiment created from this notebook [here](https://www.comet.com/examples/comet-example-ray-train-keras/99d169308c854be7ac222c995a2bfa26?experiment-tab=systemMetrics).\n",
+    "\n",
+    "This example is based on the [following Ray Train Lightning example](https://docs.ray.io/en/latest/train/getting-started-pytorch-lightning.html)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "ZYchV5RWwdv5"
+   },
+   "source": [
+    "# Install Dependencies"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "DJnmqphuY2eI"
+   },
+   "outputs": [],
+   "source": [
+    "%pip install \"comet_ml>=3.47.1\" \"ray[air]>=2.1.0\" \"lightning\" \"torchvision\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "crOcPHobwhGL"
+   },
+   "source": [
+    "# Initialize Comet"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "HNQRM0U3caiY"
+   },
+   "outputs": [],
+   "source": [
+    "import comet_ml\n",
+    "import comet_ml.integration.ray\n",
+    "\n",
+    "comet_ml.login()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "cgqwGSwtzVWD"
+   },
+   "source": [
+    "# Import Dependencies"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "e-5rRYaUw5AF"
+   },
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import tempfile\n",
+    "\n",
+    "import torch\n",
+    "from torch.utils.data import DataLoader\n",
+    "from torchvision.models import resnet18\n",
+    "from torchvision.datasets import FashionMNIST\n",
+    "from torchvision.transforms import ToTensor, Normalize, Compose\n",
+    "import lightning.pytorch as pl\n",
+    "\n",
+    "import ray.train.lightning\n",
+    "from ray.train.torch import TorchTrainer\n",
+    "from ray.train import ScalingConfig, RunConfig"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Prepare your model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Model, Loss, Optimizer\n",
+    "class ImageClassifier(pl.LightningModule):\n",
+    "    def __init__(self):\n",
+    "        super(ImageClassifier, self).__init__()\n",
+    "        self.model = resnet18(num_classes=10)\n",
+    "        self.model.conv1 = torch.nn.Conv2d(\n",
+    "            1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False\n",
+    "        )\n",
+    "        self.criterion = torch.nn.CrossEntropyLoss()\n",
+    "\n",
+    "    def forward(self, x):\n",
+    "        return self.model(x)\n",
+    "\n",
+    "    def training_step(self, batch, batch_idx):\n",
+    "        x, y = batch\n",
+    "        outputs = self.forward(x)\n",
+    "        loss = self.criterion(outputs, y)\n",
+    "        self.log(\"lightning_loss\", loss, on_step=True, prog_bar=True)\n",
+    "        return loss\n",
+    "\n",
+    "    def configure_optimizers(self):\n",
+    "        return torch.optim.Adam(self.model.parameters(), lr=0.001)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "TJuThf1TxP_G"
+   },
+   "source": [
+    "# Define your distributed training function\n",
+    "\n",
+    "This function is distributed to, and executed on, each worker."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def train_func(config):\n",
+    "    from comet_ml.integration.ray import comet_worker_logger\n",
+    "    from lightning.pytorch.loggers import CometLogger\n",
+    "\n",
+    "    with comet_worker_logger(config) as experiment:\n",
+    "        # Data\n",
+    "        transform = Compose([ToTensor(), Normalize((0.5,), (0.5,))])\n",
+    "        data_dir = os.path.join(tempfile.gettempdir(), \"data\")\n",
+    "        train_data = FashionMNIST(\n",
+    "            root=data_dir, train=True, download=True, transform=transform\n",
+    "        )\n",
+    "        train_dataloader = DataLoader(train_data, batch_size=128, shuffle=True)\n",
+    "\n",
+    "        # Training\n",
+    "        model = ImageClassifier()\n",
+    "\n",
+    "        comet_logger = CometLogger()\n",
+    "\n",
+    "        # Temporary workaround, can be removed once\n",
+    "        # https://github.com/Lightning-AI/pytorch-lightning/pull/20275 has\n",
+    "        # been merged and released\n",
+    "        comet_logger._experiment = experiment\n",
+    "\n",
+    "        # [1] Configure PyTorch Lightning Trainer.\n",
+    "        trainer = pl.Trainer(\n",
+    "            max_epochs=config[\"epochs\"],\n",
+    "            devices=\"auto\",\n",
+    "            accelerator=\"auto\",\n",
+    "            strategy=ray.train.lightning.RayDDPStrategy(),\n",
+    "            plugins=[ray.train.lightning.RayLightningEnvironment()],\n",
+    "            callbacks=[ray.train.lightning.RayTrainReportCallback()],\n",
+    "            logger=comet_logger,\n",
+    "            # [1a] Optionally, disable the default checkpointing behavior\n",
+    "            # in favor of the `RayTrainReportCallback` above.\n",
+    "            enable_checkpointing=False,\n",
+    "            log_every_n_steps=2,\n",
+    "        )\n",
+    "        trainer = ray.train.lightning.prepare_trainer(trainer)\n",
+    "        trainer.fit(model, train_dataloaders=train_dataloader)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Define the function that schedules the distributed job"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def train(num_workers: int = 2, use_gpu: bool = False, epochs=1):\n",
+    "    scaling_config = ScalingConfig(num_workers=num_workers, use_gpu=use_gpu)\n",
+    "    config = {\"use_gpu\": use_gpu, \"epochs\": epochs}\n",
+    "\n",
+    "    callback = comet_ml.integration.ray.CometTrainLoggerCallback(\n",
+    "        config, project_name=\"comet-example-ray-train-pytorch-lightning\"\n",
+    "    )\n",
+    "\n",
+    "    ray_trainer = TorchTrainer(\n",
+    "        train_func,\n",
+    "        scaling_config=scaling_config,\n",
+    "        train_loop_config=config,\n",
+    "        run_config=RunConfig(callbacks=[callback]),\n",
+    "    )\n",
+    "    result = ray_trainer.fit()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Train the model\n",
+    "\n",
+    "Ray waits indefinitely if we request more workers (`num_workers`) than the available resources allow. The code below ensures we never request more CPUs than are available locally."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ideal_num_workers = 2\n",
+    "\n",
+    "available_local_cpu_count = (os.cpu_count() or 1) - 1\n",
+    "num_workers = min(ideal_num_workers, available_local_cpu_count)\n",
+    "\n",
+    "if num_workers < 1:\n",
+    "    num_workers = 1\n",
+    "\n",
+    "train(num_workers, use_gpu=False, epochs=3)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "comet_ml.end()"
+   ]
+  }
+ ],
+ "metadata": {
+  "colab": {
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
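
Note on the "Train the model" cell: the worker-count clamp it describes can be sketched as a small standalone function, which makes the edge cases easier to check without starting Ray. This is only an illustration under stated assumptions: `pick_num_workers` is a hypothetical helper name that is not part of Ray, Comet, or this commit, and the fallback to one CPU when `os.cpu_count()` returns `None` is an assumption about how that case should be handled.

```python
import os
from typing import Optional


def pick_num_workers(ideal: int, cpu_count: Optional[int] = None) -> int:
    """Clamp a requested worker count to the CPUs available locally.

    Hypothetical helper (not a Ray or Comet API). It mirrors the notebook
    cell: subtract one from the local CPU count, take the smaller of that
    and the ideal count, and never go below one worker.
    """
    if cpu_count is None:
        # os.cpu_count() returns None when the count cannot be determined;
        # falling back to 1 is an assumption made for this sketch.
        cpu_count = os.cpu_count() or 1
    available = cpu_count - 1
    return max(1, min(ideal, available))


if __name__ == "__main__":
    # Uses the CPU count of the machine running the script.
    print(pick_num_workers(2))
```

Passing `cpu_count` explicitly keeps the function deterministic in tests, while the default falls back to the machine it runs on.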
