* new card for mbart and mbart50
* removed comment BADGES
* Update mBart overview
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* fix typo (MBart to mBart)
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* maybe fix typo
* update typo and combine notes
* changed notes
* changed the example sentence
* fixed grammatical error and removed some lines from notes example
* missed one word
* removed documentation resources and added some lines of example code back in notes.
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

[mBART](https://huggingface.co/papers/2001.08210) is a multilingual machine translation model that pretrains the entire translation model (encoder-decoder), unlike previous methods that only focused on parts of the model, such as the encoder or the decoder. The model is trained on a denoising objective that reconstructs the corrupted text. This allows mBART to handle both the source language and the target text it translates into.

[mBART-50](https://huggingface.co/papers/2008.00401) is pretrained on an additional 25 languages.

You can find all the original mBART checkpoints under the [AI at Meta](https://huggingface.co/facebook?search_models=mbart) organization.

> [!TIP]
> Click on the mBART models in the right sidebar for more examples of applying mBART to different language tasks.

The example below demonstrates how to translate text with [`Pipeline`] or the [`AutoModel`] class.

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline
pipeline = pipeline(
    task="translation",
    model="facebook/mbart-large-50-many-to-many-mmt",
    device=0,
    torch_dtype=torch.float16,
    src_lang="en_XX",
    tgt_lang="fr_XX",
)
print(pipeline("UN Chief Says There Is No Military Solution in Syria"))
```
</hfoption>
<hfoption id="AutoModel">

```py
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

article_en = "UN Chief Says There Is No Military Solution in Syria"

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50-many-to-many-mmt", torch_dtype=torch.bfloat16, attn_implementation="sdpa", device_map="auto")
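# the rest of this example is a minimal sketch: load the tokenizer, encode the English
# article, and force French as the first generated token
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt", src_lang="en_XX")

inputs = tokenizer(article_en, return_tensors="pt").to(model.device)
generated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"])
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])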
```

</hfoption>
</hfoptions>

## Notes

- You can check the full list of language codes via `tokenizer.lang_code_to_id.keys()`.
- mBART requires a special language id token in the source and target text during training. The source text format is `X [eos, src_lang_code]` where `X` is the source text. The target text format is `[tgt_lang_code] X [eos]`. The `bos` token is never used. [`~PreTrainedTokenizerBase.__call__`] encodes the source text format passed as the first argument or with the `text` keyword, and the target text format passed with the `text_target` keyword.
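  A minimal sketch of the source-side format is shown below, assuming the `facebook/mbart-large-en-ro` checkpoint (any mBART checkpoint applies the same formatting).

  ```py
  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-en-ro", src_lang="en_XX", tgt_lang="ro_RO")
  encoded = tokenizer("UN Chief Says There Is No Military Solution in Syria")
  # the source sequence ends with eos followed by the source language code
  print(tokenizer.convert_ids_to_tokens(encoded["input_ids"])[-2:])
  # ['</s>', 'en_XX']
  ```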
- Set the `decoder_start_token_id` to the target language id for mBART.
  ```py
  import torch
  from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
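  # a minimal sketch of the remaining steps, assuming the facebook/mbart-large-en-ro
  # checkpoint for English-to-Romanian translation
  tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-en-ro", src_lang="en_XX")
  model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-en-ro", torch_dtype=torch.bfloat16)

  inputs = tokenizer("UN Chief Says There Is No Military Solution in Syria", return_tensors="pt")
  # decode starting from the Romanian language id
  generated = model.generate(**inputs, decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"])
  print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
  ```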
- mBART-50 has a different text format. The language id token is used as the prefix for the source and target text. The text format is `[lang_code] X [eos]` where `lang_code` is the source language id for the source text and the target language id for the target text. `X` is the source or target text respectively.
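  A small sketch of the mBART-50 prefix format, assuming the `facebook/mbart-large-50-many-to-many-mmt` checkpoint.

  ```py
  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt", src_lang="en_XX")
  tokens = tokenizer.convert_ids_to_tokens(tokenizer("UN Chief Says There Is No Military Solution in Syria")["input_ids"])
  # the language code is the prefix and eos closes the sequence
  print(tokens[0], tokens[-1])
  # en_XX </s>
  ```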
- Set the `eos_token_id` as the `decoder_start_token_id` for mBART-50. The target language id is used as the first generated token by passing `forced_bos_token_id` to [`~GenerationMixin.generate`].
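  A short sketch of translating Hindi to French with these settings, assuming the `facebook/mbart-large-50-many-to-many-mmt` checkpoint.

  ```py
  from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

  model_id = "facebook/mbart-large-50-many-to-many-mmt"
  tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="hi_IN")
  model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

  article_hi = "संयुक्त राष्ट्र के प्रमुख का कहना है कि सीरिया में कोई सैन्य समाधान नहीं है"
  inputs = tokenizer(article_hi, return_tensors="pt")
  # decode from eos and force French as the first generated token
  generated = model.generate(
      **inputs,
      decoder_start_token_id=tokenizer.eos_token_id,
      forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"],
  )
  print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
  ```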