Tensor2Tensor supports running on Google Cloud Platform's TPUs, chips
specialized for ML training.

Models and hparams that are known to work on TPU:
* `transformer` with `transformer_tpu`
* `xception` with `xception_base`
* `resnet50` with `resnet_base`

To run on TPUs, you need to be part of the alpha program; if you're not, these
commands won't work for you currently, but access will expand soon, so get
excited for your future ML supercomputers in the cloud.

## Tutorial: Transformer En-De translation on TPU

Update `gcloud`: `gcloud components update`

Set your default zone to a TPU-enabled zone. TPU machines are only available in
certain zones for now.
```
gcloud alpha compute tpus create \
  ...
```

To see all TPU instances running: `gcloud alpha compute tpus list`. The
`TPU_IP` should be unique amongst the list and follow the format `10.240.i.2`.
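
A quick shell sanity check on that format (illustrative only; the `TPU_IP`
value here is a placeholder for the one `tpus list` actually reports):

```
# Hypothetical TPU_IP value; substitute the address from `tpus list`.
TPU_IP=10.240.1.2
case "$TPU_IP" in
  10.240.*.2) echo "TPU_IP looks right" ;;
  *)          echo "unexpected TPU_IP format" >&2 ;;
esac
```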

SSH in with port forwarding for TensorBoard
```
gcloud compute ssh $USER-vm -- -L 6006:localhost:6006
```

Now that you're on the cloud instance, install T2T:
```
pip install tensor2tensor --user
# If your python bin dir isn't already in your path
export PATH=$HOME/.local/bin:$PATH
```
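
To confirm the user-site bin directory actually made it onto `PATH` (a small
check, assuming the default `~/.local/bin` location used above):

```
# Re-export is harmless if already done; then verify PATH contains the dir.
export PATH="$HOME/.local/bin:$PATH"
case ":$PATH:" in
  *":$HOME/.local/bin:"*) echo "user bin dir is on PATH" ;;
  *) echo "add \$HOME/.local/bin to PATH" >&2 ;;
esac
```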

Generate data to GCS.
If you already have the data, use `gsutil cp` to copy it to GCS.
```
GCS_BUCKET=gs://my-bucket
DATA_DIR=$GCS_BUCKET/t2t/data/
t2t-datagen --problem=translate_ende_wmt8k --data_dir=$DATA_DIR
```

Set up some vars used below. `TPU_IP` and `DATA_DIR` should be the same as what
was used above. Note that the `DATA_DIR` and `OUT_DIR` must be GCS buckets.
```
TPU_IP=<IP of TPU machine>
DATA_DIR=$GCS_BUCKET/t2t/data/
OUT_DIR=$GCS_BUCKET/t2t/training/
TPU_MASTER=grpc://$TPU_IP:8470
```
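
Because local paths will not work here, a guard like the following can catch a
bad setting early (a sketch; the example paths mirror the ones above):

```
# Example values from above; both must be gs:// paths, not local dirs.
DATA_DIR=gs://my-bucket/t2t/data/
OUT_DIR=gs://my-bucket/t2t/training/
for d in "$DATA_DIR" "$OUT_DIR"; do
  case "$d" in
    gs://*) echo "ok: $d" ;;
    *)      echo "error: $d is not a GCS path" >&2 ;;
  esac
done
```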

Launch TensorBoard in the background:
```
tensorboard --logdir=$OUT_DIR > /tmp/tensorboard_logs.txt 2>&1 &
```

Train and evaluate.
```
t2t-trainer \
  --model=transformer \
  --hparams_set=transformer_tpu \
  --problems=translate_ende_wmt8k \
  --train_steps=10 \
  --eval_steps=10 \
  --local_eval_frequency=10 \
  --iterations_per_loop=10 \
  --master=$TPU_MASTER \
  --use_tpu=True \
  --data_dir=$DATA_DIR \
  --output_dir=$OUT_DIR
```
The above command will train for 10 steps, then evaluate for 10 steps. You can
(and should) increase the number of total training steps with the
`--train_steps` flag. Evaluation will happen every `--local_eval_frequency`
steps, each time for `--eval_steps` steps. When you increase the number of
training steps, also increase `--iterations_per_loop`, which controls how
frequently the TPU machine returns control to the host machine (1000 seems like
a fine number).
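
As a rough illustration of that trade-off (the step counts below are made up
for the example), the number of times the host regains control is just the
quotient of the two flags:

```
# Illustrative numbers; set the real values via --train_steps and
# --iterations_per_loop on the t2t-trainer command line.
TRAIN_STEPS=100000
ITERATIONS_PER_LOOP=1000
HOST_RETURNS=$((TRAIN_STEPS / ITERATIONS_PER_LOOP))
echo "host regains control $HOST_RETURNS times during training"
```

A larger `ITERATIONS_PER_LOOP` means fewer round-trips to the host, at the
cost of coarser-grained control.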
Back on your local machine, open your browser and navigate to `localhost:6006`
for TensorBoard.