Commit 1d9401d

Update README.md (#520)
1 parent a2484b3 commit 1d9401d

1 file changed: README.md (12 additions, 97 deletions)

@@ -8,14 +8,14 @@
 <br>
 </p>

-Generate text with distributed **Llama 2 (70B)**, **Stable Beluga 2**, **Falcon**, **Guanaco-65B** or **BLOOM-176B** and fine‑tune them for your own tasks &mdash; right from your desktop computer or Google Colab:
+Generate text with distributed **Llama 2** (70B), **Falcon** (40B+), **BLOOM** (176B) (or their derivatives), and fine‑tune them for your own tasks &mdash; right from your desktop computer or Google Colab:

 ```python
 from transformers import AutoTokenizer
 from petals import AutoDistributedModelForCausalLM

 # Choose any model available at https://health.petals.dev
-model_name = "petals-team/StableBeluga2"
+model_name = "petals-team/StableBeluga2" # This one is fine-tuned Llama 2 (70B)

 # Connect to a distributed network hosting model layers
 tokenizer = AutoTokenizer.from_pretrained(model_name)
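
The hunk above cuts the README's quick-start snippet off at line 21; the generation call and the decoded output only surface in the next hunk's header. Reconstructed from those visible fragments, the full snippet reads roughly as follows (the exact prompt and `max_new_tokens` value are assumptions, not part of this diff):

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# Choose any model available at https://health.petals.dev
model_name = "petals-team/StableBeluga2"  # fine-tuned Llama 2 (70B)

# Connect to a distributed network hosting model layers
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# Run inference as if the whole model were local
inputs = tokenizer("A cat sat", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))  # A cat sat on a mat...
```
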
@@ -31,9 +31,9 @@ print(tokenizer.decode(outputs[0])) # A cat sat on a mat...
 🚀 &nbsp;<b><a href="https://colab.research.google.com/drive/1uCphNY7gfAUkdDrTx21dZZwCOUDCMPw8?usp=sharing">Try now in Colab</a></b>
 </p>

-🦙 **Want to run Llama 2?** Request access to its weights at the ♾️ [Meta AI website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and 🤗 [Model Hub](https://huggingface.co/meta-llama/Llama-2-70b-hf), then run `huggingface-cli login` in the terminal before loading the model. Or just try it in our [chatbot app](https://chat.petals.dev).
+🔏 **Privacy.** Your data will be processed with the help of other people in the public swarm. Learn more about privacy [here](https://github.com/bigscience-workshop/petals/wiki/Security,-privacy,-and-AI-safety). For sensitive data, you can set up a [private swarm](https://github.com/bigscience-workshop/petals/wiki/Launch-your-own-swarm) among people you trust.

-🔏 **Privacy.** Your data will be processed by other people in the public swarm. Learn more about privacy [here](https://github.com/bigscience-workshop/petals/wiki/Security,-privacy,-and-AI-safety). For sensitive data, you can set up a [private swarm](https://github.com/bigscience-workshop/petals/wiki/Launch-your-own-swarm) among people you trust.
+🦙 **Want to run Llama 2?** Request access to its weights at the ♾️ [Meta AI website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and 🤗 [Model Hub](https://huggingface.co/meta-llama/Llama-2-70b-hf), then run `huggingface-cli login` in the terminal before loading the model. Or just try it in our [chatbot app](https://chat.petals.dev).

 💬 **Any questions?** Ping us in [our Discord](https://discord.gg/KdThf2bWVU)!
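
The Llama 2 note in this hunk tells users to run `huggingface-cli login` before loading gated weights. As a hedged aside (not part of the README), the same authentication can be done from Python via the `huggingface_hub` package's `login()` helper:

```python
from huggingface_hub import login

# Authenticate to the Hugging Face Hub so gated weights such as
# meta-llama/Llama-2-70b-hf can be fetched. The token string below is a
# placeholder; calling login() with no arguments prompts for it interactively.
login(token="hf_xxx")
```
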

@@ -81,9 +81,8 @@ python3 -m petals.cli.run_server petals-team/StableBeluga2

 ## How does it work?

-- Petals runs large language models like [Llama](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) and [BLOOM](https://huggingface.co/bigscience/bloom) **collaboratively** — you load a small part of the model, then join people serving the other parts to run inference or fine-tuning.
-- Single-batch inference runs at **up to 6 steps/sec** for **Llama 2** (70B) and &approx; 1 step/sec for BLOOM-176B. This is [up to 10x faster](https://github.com/bigscience-workshop/petals#benchmarks) than offloading, enough to build [chatbots](https://chat.petals.dev) and other interactive apps. Parallel inference reaches hundreds of tokens/sec.
-- Beyond classic language model APIs — you can employ any fine-tuning and sampling methods, execute custom paths through the model, or see its hidden states. You get the comforts of an API with the flexibility of PyTorch.
+- You load a small part of the model, then join a [network](https://health.petals.dev) of people serving the other parts. Single‑batch inference runs at up to **6 tokens/sec** for **Llama 2** (70B) and up to **4 tokens/sec** for **Falcon** (180B) — enough for [chatbots](https://chat.petals.dev) and interactive apps.
+- You can employ any fine-tuning and sampling methods, execute custom paths through the model, or see its hidden states. You get the comforts of an API with the flexibility of **PyTorch** and **🤗 Transformers**.

 <p align="center">
 <img src="https://i.imgur.com/RTYF3yW.png" width="800">
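
The rewritten bullets above promise the comforts of an API with the flexibility of PyTorch and 🤗 Transformers. As an illustrative sketch (continuing the reconstructed quick-start snippet earlier, and assuming the distributed model accepts the usual 🤗 Transformers `generate()` sampling arguments), custom sampling looks like ordinary Transformers code:

```python
# Sketch only: sampling settings passed through the familiar generate() API.
# tokenizer and model come from the quick-start sketch shown earlier.
inputs = tokenizer(
    "A chat between a curious human and an AI assistant.",
    return_tensors="pt",
)["input_ids"]
outputs = model.generate(
    inputs,
    max_new_tokens=32,
    do_sample=True,    # sample instead of greedy decoding
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0]))
```
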
@@ -113,99 +112,15 @@ Advanced guides:
 - Launch a private swarm: [guide](https://github.com/bigscience-workshop/petals/wiki/Launch-your-own-swarm)
 - Run a custom model: [guide](https://github.com/bigscience-workshop/petals/wiki/Run-a-custom-model-with-Petals)

-## Benchmarks
-
-The benchmarks below are for BLOOM-176B:
-
-<table align="center">
-<tr>
-<th colspan="2">Network</th>
-<th colspan="2">Single-batch inference<br>(steps/s)</th>
-<th colspan="2">Parallel forward<br>(tokens/s)</th>
-</tr>
-<tr>
-<th rowspan="2">Bandwidth</th>
-<th rowspan="2">Round-trip<br>latency</th>
-<th colspan="2">Sequence length</th>
-<th colspan="2">Batch size</th>
-</tr>
-<tr align="center">
-<td>128</td>
-<td>2048</td>
-<td>1</td>
-<td>64</td>
-</tr>
-<tr>
-<th colspan="6">Offloading, max. possible speed on 1x A100 <sup>1</sup></th>
-</tr>
-<tr align="center">
-<td>256 Gbit/s</td>
-<td></td>
-<td>0.18</td>
-<td>0.18</td>
-<td>2.7</td>
-<td>170.3</td>
-</tr>
-<tr align="center">
-<td>128 Gbit/s</td>
-<td></td>
-<td>0.09</td>
-<td>0.09</td>
-<td>2.4</td>
-<td>152.8</td>
-</tr>
-<tr>
-<th colspan="6">Petals on 14 heterogeneous servers across Europe and North America <sup>2</sup></th>
-</tr>
-<tr align="center">
-<td colspan="2">Real world</td>
-<td>0.83</td>
-<td>0.79</td>
-<td>32.6</td>
-<td>179.4</td>
-</tr>
-<tr>
-<th colspan="6">Petals on 3 servers, with one A100 each <sup>3</sup></th>
-</tr>
-<tr align="center">
-<td>1 Gbit/s</td>
-<td>&lt; 5 ms</td>
-<td>1.71</td>
-<td>1.54</td>
-<td>70.0</td>
-<td>253.6</td>
-</tr>
-<tr align="center">
-<td>100 Mbit/s</td>
-<td>&lt; 5 ms</td>
-<td>1.66</td>
-<td>1.49</td>
-<td>56.4</td>
-<td>182.0</td>
-</tr>
-<tr align="center">
-<td>100 Mbit/s</td>
-<td>100 ms</td>
-<td>1.23</td>
-<td>1.11</td>
-<td>19.7</td>
-<td>112.2</td>
-</tr>
-</table>
-
-<sup>1</sup> **An upper bound for offloading performance.** We base our offloading numbers on the best possible hardware setup for offloading: CPU RAM offloading via PCIe 4.0 with 16 PCIe lanes per GPU and PCIe switches for pairs of GPUs. We assume zero latency for the upper bound estimation. In 8-bit, the model uses 1 GB of memory per billion parameters. PCIe 4.0 with 16 lanes has a throughput of 256 Gbit/s, so offloading 176B parameters takes 5.5 seconds. The throughput is twice as slow (128 Gbit/s) if we have two GPUs behind the same PCIe switch.
-
-<sup>2</sup> **A real-world distributed setting** with 14 servers holding 2× RTX 3060, 4× 2080Ti, 2× 3090, 2× A4000, and 4× A5000 GPUs. These are personal servers and servers from university labs, spread across Europe and North America and connected to the Internet at speeds of 100–1000 Mbit/s. 4 servers operate from under firewalls.
-
-<sup>3</sup> **An optimistic setup** that requires least communication. The client nodes have 8 CPU cores and no GPU.
-
-We provide more evaluations and discuss these results in more detail in **Section 3.3** of our [paper](https://arxiv.org/pdf/2209.01188.pdf).
-
-## 🛠️ Contributing
+### Benchmarks
+
+Please see **Section 3.3** of our [paper](https://arxiv.org/pdf/2209.01188.pdf).
+
+### 🛠️ Contributing

 Please see our [FAQ](https://github.com/bigscience-workshop/petals/wiki/FAQ:-Frequently-asked-questions#contributing) on contributing.

-## 📜 Citation
+### 📜 Citation

 Alexander Borzunov, Dmitry Baranchuk, Tim Dettmers, Max Ryabinin, Younes Belkada, Artem Chumachenko, Pavel Samygin, and Colin Raffel.
 [Petals: Collaborative Inference and Fine-tuning of Large Models.](https://arxiv.org/abs/2209.01188)
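
One detail worth keeping from the removed benchmark section: footnote <sup>1</sup> derives its 5.5-second offloading time from 1 GB per billion parameters in 8-bit and a 256 Gbit/s PCIe 4.0 x16 link. The arithmetic checks out, and it also explains the 0.18 steps/s offloading row in the deleted table:

```python
# Back-of-the-envelope check of footnote 1 in the removed benchmarks table.
weights_gb = 176             # BLOOM-176B at ~1 GB per billion parameters in 8-bit
pcie_gbit_per_s = 256        # PCIe 4.0 with 16 lanes
seconds_per_step = weights_gb * 8 / pcie_gbit_per_s   # bits to move / link speed
print(seconds_per_step)      # 5.5 s per generation step, i.e. ~0.18 steps/s as tabulated
```
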
