
Updated Albert model Card #37753


Draft · wants to merge 13 commits into base: main
184 changes: 114 additions & 70 deletions docs/source/en/model_doc/albert.md


# ALBERT

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="TensorFlow" src="https://img.shields.io/badge/TensorFlow-FF6F00?style=flat&logo=tensorflow&logoColor=white">
<img alt="Flax" src="https://img.shields.io/badge/Flax-29a79b.svg?style=flat&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAC0AAAAtCAMAAAANxBKoAAAC7lBMVEUAAADg5vYHPVgAoJH+/v76+v39/f9JbLP///9+AIgAnY3///+mcqzt8fXy9fgkXa3Ax9709fr+///9/f8qXq49qp5AaLGMwrv8/P0eW60VWawxYq8yqJzG2dytt9Wyu9elzci519Lf3O3S2efY3OrY0+Xp7PT///////+dqNCexMc6Z7AGpJeGvbenstPZ5ejQ1OfJzOLa7ejh4+/r8fT29vpccbklWK8PVa0AS6ghW63O498vYa+lsdKz1NDRt9Kw1c672tbD3tnAxt7R6OHp5vDe7OrDyuDn6vLl6/EAQKak0MgATakkppo3ZK/Bz9y8w9yzu9jey97axdvHzeG21NHH4trTwthKZrVGZLSUSpuPQJiGAI+GAI8SWKydycLL4d7f2OTi1+S9xNzL0ePT6OLGzeEAo5U0qJw/aLEAo5JFa7JBabEAp5Y4qZ2QxLyKmsm3kL2xoMOehrRNb7RIbbOZgrGre68AUqwAqZqNN5aKJ5N/lMq+qsd8kMa4pcWzh7muhLMEV69juq2kbKqgUaOTR5uMMZWLLZSGAI5VAIdEAH+ovNDHuNCnxcy3qcaYx8K8msGplrx+wLahjbYdXrV6vbMvYK9DrZ8QrZ8tqJuFms+Sos6sw8ecy8RffsNVeMCvmb43aLltv7Q4Y7EZWK4QWa1gt6meZKUdr6GOAZVeA4xPAISyveLUwtivxtKTpNJ2jcqfvcltiMiwwcfAoMVxhL+Kx7xjdrqTe60tsaNQs6KaRKACrJ6UTZwkqpqTL5pkHY4AloSgsd2ptNXPvNOOncuxxsqFl8lmg8apt8FJcr9EbryGxLqlkrkrY7dRa7ZGZLQ5t6iXUZ6PPpgVpZeJCJFKAIGareTa0+KJod3H0deY2M+esM25usmYu8d2zsJOdcBVvrCLbqcAOaaHaKQAMaScWqKBXqCXMJ2RHpiLF5NmJZAdAHN2kta11dKu1M+DkcZLdb+Mcql3TppyRJdzQ5ZtNZNlIY+DF4+voCOQAAAAZ3RSTlMABAT+MEEJ/RH+/TP+Zlv+pUo6Ifz8+fco/fz6+evr39S9nJmOilQaF/7+/f38+smmoYp6b1T+/v7++vj189zU0tDJxsGzsrKSfv34+Pf27dDOysG9t6+n/vv6+vr59uzr1tG+tZ6Qg9Ym3QAABR5JREFUSMeNlVVUG1EQhpcuxEspXqS0SKEtxQp1d3d332STTRpIQhIISQgJhODu7lAoDoUCpe7u7u7+1puGpqnCPOyZvffbOXPm/PsP9JfQgyCC+tmTABTOcbxDz/heENS7/1F+9nhvkHePG0wNDLbGWwdXL+rbLWvpmZHXD8+gMfBjTh+aSe6Gnn7lwQIOTR0c8wfX3PWgv7avbdKwf/ZoBp1Gp/PvuvXW3vw5ib7emnTW4OR+3D4jB9vjNJ/7gNvfWWeH/TO/JyYrsiKCRjVEZA3UB+96kON+DxOQ/NLE8PE5iUYgIXjFnCOlxEQMaSGVxjg4gxOnEycGz8bptuNjVx08LscIgrzH3umcn+KKtiBIyvzOO2O99aAdR8cF19oZalnCtvREUw79tCd5sow1g1UKM6kXqUx4T8wsi3sTjJ3yzDmmhenLXLpo8u45eG5y4Vvbk6kkC4LLtJMowkSQxmk4ggVJEG+7c6QpHT8vvW9X7/o7+3ELmiJi2mEzZJiz8cT6TBlanBk70cB5GGIGC1gRDdZ00yADLW1FL6gqhtvNXNG5S9gdSrk4M1qu7JAsmYshzDS4peoMrU/gT7qQdqYGZaYhxZmVbGJAm/CS/HloWyhRUlknQ9KYcExTwS80d3VNOxUZJpITYyspl0LbhArhpZCD9cRWEQuhYkNGMHToQ/2Cs6swJlb39CsllxdXX6IUKh/H5jbnSsPKjgmoaFQ1f8wRLR0UnGE/RcDEjj2jXG1WVTwUs8+zxfcrVO+vSsuOpVKxCfYZiQ0/aPKuxQbQ8lIz+DClxC8u+snlcJ7Yr1z1JPqUH0V+GDXbOwAib931Y4Imaq0NTIXPXY+N5L18GJ37SVWu+hwXff8l72Ds9XuwYIBaXPq6Shm4l+Vl/5QiOlV+uTk6YR9PxKsI9xNJny31ygK1e+nIRC1N97EGkFPI+jCpiHe5PCEy7oWqWSwRrpOvhFzcbTWMbm3ZJAOn1rUKpYIt/lDhW/5RHHteeWFN60qo98YJuoq1nK3uW5AabyspC1BcIEpOhft+SZAShYoLSvnmSfnYADUERP5jJn2h5XtsgCRuhYQqAvwTwn33+YWEKUI72HX5AtfSAZDe8F2DtPPm77afhl0EkthzuCQU0BWApgQIH9+KB0JhopMM7bJrdTRoleM2JAVNMyPF+wdoaz+XJpGoVAQ7WXUkcV7gT3oUZyi/ISIJAVKhgNp+4b4veCFhYVJw4locdSjZCp9cPUhLF9EZ3KKzURepMEtCDPP3VcWFx4UIiZIklIpFNfHpdEafIF2aRmOcrUmjohbT2WUllbmRvgfbythbQO3222fpDJoufaQPncYYuqoGtUEsCJZL6/3PR5b4syeSjZMQG/T2maGANlXT2v8S4AULWaUkCxfLyW8iW4kdka+nEMjxpL2NCwsYNBp+Q61PF43zyDg9Bm9+3NNySn78jMZUUkumqE4Gp7JmFOdP1vc8PpRrzj9+wPinCy8K1PiJ4aYbnTYpCCbDkBSbzhu2QJ1Gd82t8jI8TH51+OzvXoWbnXUOBkNW+0mWFwGcGOUVpU81/n3TOHb5oMt2FgYGjzau0Nif0Ss7Q3XB33hjjQHjHA5E5aOyIQc8CBrLdQSs3j92VG+3nNEjbkbdbBr9zm04ruvw37vh0QKOdeGIkckc80fX3KH/h7PT4BOjgCty8VZ5ux1MoO5Cf5naca2LAsEgehI+drX8o/0Nu+W0m6K/I9gGPd/dfx/EN/wN62AhsBWuAAAAAElFTkSuQmCC
">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
ALBERT was proposed in [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://huggingface.co/papers/1909.11942) by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. It was created to address the problems that come with scaling up BERT: GPU/TPU memory limitations, longer training times, and unexpected model degradation. ALBERT uses two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT:

- **Factorized embedding parameterization:** The large vocabulary embedding matrix is decomposed into two smaller matrices, reducing memory consumption.
- **Cross-layer parameter sharing:** Instead of learning separate parameters for each transformer layer, ALBERT shares parameters across layers, further reducing the number of learnable weights.

ALBERT uses absolute position embeddings (like BERT), so inputs should be padded on the right rather than the left. Its embedding size is 128, compared to 768 in BERT, and it can process at most 512 tokens at a time.
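
The effect of the two techniques is easy to see concretely. The sketch below (assuming the `albert/albert-base-v2` checkpoint, with a 30,000-token vocabulary, embedding size 128, and hidden size 768) reads the shared layer group from the configuration and compares the embedding parameter count with and without factorization.

```py
from transformers import AlbertConfig

config = AlbertConfig.from_pretrained("albert/albert-base-v2")
print(config.num_hidden_layers, config.num_hidden_groups)  # 12 layer passes share 1 parameter group

# Back-of-the-envelope embedding parameter counts
V, E, H = config.vocab_size, config.embedding_size, config.hidden_size  # 30000, 128, 768
print(f"Unfactorized embedding (V x H):       {V * H:,}")          # 23,040,000
print(f"Factorized embedding (V x E + E x H): {V * E + E * H:,}")  # 3,938,304
```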

You can find all the original ALBERT checkpoints under the [ALBERT release](https://huggingface.co/collections/google/albert-release-64ff65ba18830fabea2f2cec) collection.

> [!TIP]
> Click on the ALBERT models in the right sidebar for more examples of how to apply ALBERT to different natural language processing (NLP) tasks.

The example below demonstrates how to predict the `[MASK]` token with [`Pipeline`], [`AutoModel`], and from the command line.

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

# Initialize the fill-mask pipeline
albert_fill_mask = pipeline(
    task="fill-mask",
    model="albert/albert-base-v2",
    device=0 if torch.cuda.is_available() else -1
)

# Masked prompt (use [MASK] token)
prompt = "Plants create energy through a process known as [MASK]."
results = albert_fill_mask(prompt, top_k=5) # Get top 5 predictions

for result in results:
    print(f"Prediction: {result['token_str']} | Score: {result['score']:.4f}")
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load ALBERT (v2) and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("albert/albert-base-v2")
model = AutoModelForMaskedLM.from_pretrained(
    "albert/albert-base-v2",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto"
)

# Masked language modeling prompt
prompt = "Plants create energy through a process known as [MASK]."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Predict the masked token
with torch.no_grad():
    outputs = model(**inputs)
    mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
    predictions = outputs.logits[0, mask_token_index]  # Get logits for the [MASK] token

# Decode top predictions (k=5)
top_k = torch.topk(predictions, k=5).indices.tolist()
for token_id in top_k[0]:
    print(f"Prediction: {tokenizer.decode([token_id])}")
```
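
Note that `device_map="auto"` relies on the [Accelerate](https://huggingface.co/docs/accelerate) library to place the model automatically; if it is not installed, drop the argument and move the model to your device manually.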

</hfoption>
<hfoption id="transformers-cli">

```bash
echo -e "Plants create energy through a process known as [MASK]." | transformers-cli run --task fill-mask --model albert/albert-base-v2 --device 0
```

</hfoption>

</hfoptions>
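
ALBERT supports PyTorch's scaled dot-product attention (SDPA). SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you can also set `attn_implementation="sdpa"` in `from_pretrained()` to request it explicitly. For the best speedups, load the model in half precision (e.g. `torch.float16` or `torch.bfloat16`). The snippet below is a minimal sketch; the checkpoint and dtype are only examples.

```py
import torch
from transformers import AlbertModel

model = AlbertModel.from_pretrained(
    "albert/albert-base-v2",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
)
```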

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](https://huggingface.co/docs/transformers/main/en/quantization/overview) overview for more available quantization backends.

The example below uses [torch.quantization](https://pytorch.org/docs/stable/generated/torch.ao.quantization.quantize_dynamic.html) to dynamically quantize only the weights to int8. Quantization is applied only to the `Linear` layers; the weights are converted ahead of time while activations are quantized on the fly at inference, so no calibration data is needed.

```py
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

# Load the model and tokenizer (the v1 checkpoint is used here)
model = AutoModelForMaskedLM.from_pretrained("albert/albert-base-v1")
tokenizer = AutoTokenizer.from_pretrained("albert/albert-base-v1")

# Quantize the weights to int8 with PyTorch dynamic quantization
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # quantize only the linear layers
    dtype=torch.qint8
)

# Compare memory footprints before and after quantization
print(f"Size before: {model.get_memory_footprint()/1e6:.1f}MB")
print(f"Size after: {quantized_model.get_memory_footprint()/1e6:.1f}MB")

# Usage example
inputs = tokenizer("Albert Einstein was born in [MASK].", return_tensors="pt")
with torch.no_grad():
    outputs = quantized_model(**inputs)
print(tokenizer.decode(outputs.logits[0].argmax(-1)))
```

> ALBERT is not supported by `AttentionMaskVisualizer` because it uses bidirectional self-attention rather than causal attention, so it does not implement a `_update_causal_mask` method.

This model was contributed by [lysandre](https://huggingface.co/lysandre). The JAX version of this model was contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/google-research/ALBERT).
## Notes

- All tokens attend to all other tokens, as in BERT (bidirectional self-attention).
- ALBERT supports a maximum sequence length of 512 tokens.
- ALBERT cannot be used for autoregressive generation (unlike GPT).
- ALBERT uses absolute position embeddings, so inputs should be right-padded (pad tokens added at the end, not the beginning).
- ALBERT uses `token_type_ids` just like BERT, so indicate which token belongs to which segment (sentence A or sentence B) for tasks such as question answering or sentence-pair classification (see the example after this list).
- ALBERT is pretrained with sentence order prediction (SOP) instead of next sentence prediction (NSP), so fine-tuned models may behave slightly differently from BERT when modeling inter-sentence relationships.
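
The snippet below illustrates the segment IDs for a sentence pair (a minimal sketch; the question/answer pair is arbitrary):

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert/albert-base-v2")

# token_type_ids marks segment A (0) and segment B (1)
encoded = tokenizer("Where was Albert Einstein born?", "Albert Einstein was born in Ulm.", return_tensors="pt")
print(encoded["token_type_ids"])
```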

## Resources
