From 0dfb72c81fbf137ed04d5a779f60b1b27275e4bd Mon Sep 17 00:00:00 2001 From: souvikchand Date: Thu, 24 Apr 2025 17:56:27 +0530 Subject: [PATCH 01/13] Updated Albert model Card --- docs/source/en/model_doc/albert.md | 184 ++++++++++++++++++----------- 1 file changed, 114 insertions(+), 70 deletions(-) diff --git a/docs/source/en/model_doc/albert.md b/docs/source/en/model_doc/albert.md index 21cd57675e53..48cceff635a2 100644 --- a/docs/source/en/model_doc/albert.md +++ b/docs/source/en/model_doc/albert.md @@ -14,100 +14,144 @@ rendered properly in your Markdown viewer. --> +
+
+ PyTorch + TensorFlow + Flax + SDPA +
+
+ # ALBERT -
-PyTorch -TensorFlow -Flax -SDPA -
+[Albert](https://huggingface.co/papers/1909.11942) -## Overview +The ALBERT model was proposed in [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://huggingface.co/papers/1909.11942) by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. -The ALBERT model was proposed in [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942) by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, -Radu Soricut. It presents two parameter-reduction techniques to lower memory consumption and increase the training -speed of BERT: +ALBERT was created to address problems like -- GPU/TPU memory limitations, longer training times, and unexpected model degradation in BERT. ALBERT uses two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT: -- Splitting the embedding matrix into two smaller matrices. -- Using repeating layers split among groups. +- **Factorized embedding parameterization:** The large vocabulary embedding matrix is decomposed into two smaller matrices, reducing memory consumption. +- **Cross-layer parameter sharing:** Instead of learning separate parameters for each transformer layer, ALBERT shares parameters across layers, further reducing the number of learnable weights. -The abstract from the paper is the following: +ALBERT uses absolute position embeddings (like BERT) so padding is applied at right. Size of embeddings is 128 While BERT uses 768. ALBERT can processes maximum 512 token at a time. -*Increasing model size when pretraining natural language representations often results in improved performance on -downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations, -longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction -techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows -that our proposed methods lead to models that scale much better compared to the original BERT. We also use a -self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks -with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and -SQuAD benchmarks while having fewer parameters compared to BERT-large.* +You can find all the original ALBERT checkpoints [HERE](https://huggingface.co/collections/google/albert-release-64ff65ba18830fabea2f2cec) -This model was contributed by [lysandre](https://huggingface.co/lysandre). This model jax version was contributed by -[kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/google-research/ALBERT). +> [!TIP] +> Click on the ALBERT models in the right sidebar for more examples of how to apply ALBERT to different natural language processing (NLP) tasks. -## Usage tips +The example below demonstrates how to generate text based with [`Pipeline`], [`AutoModel`] class or from command line. -- ALBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather - than the left. 
-- ALBERT uses repeating layers which results in a small memory footprint, however the computational cost remains - similar to a BERT-like architecture with the same number of hidden layers as it has to iterate through the same - number of (repeating) layers. -- Embedding size E is different from hidden size H justified because the embeddings are context independent (one embedding vector represents one token), whereas hidden states are context dependent (one hidden state represents a sequence of tokens) so it's more logical to have H >> E. Also, the embedding matrix is large since it's V x E (V being the vocab size). If E < H, it has less parameters. -- Layers are split in groups that share parameters (to save memory). -Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and B (that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have been swapped or not. + +=2.1.1` when an implementation is available, but you may also set -`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used. +# Masked prompt (use [MASK] token) +prompt = "Plants create energy through a process known as [MASK]." +results = albert_fill_mask(prompt, top_k=5) # Get top 5 predictions +for result in results: + print(f"Prediction: {result['token_str']} | Score: {result['score']:.4f}") ``` -from transformers import AlbertModel -model = AlbertModel.from_pretrained("albert/albert-base-v1", torch_dtype=torch.float16, attn_implementation="sdpa") -... + + + + +```py +import torch +from transformers import AutoModelForMaskedLM, AutoTokenizer + +# Load ALBERT (v2) and its tokenizer +tokenizer = AutoTokenizer.from_pretrained("albert/albert-base-v2") +model = AutoModelForMaskedLM.from_pretrained( + "albert/albert-base-v2", + torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32, + device_map="auto" +) + +# Masked language modeling prompt +prompt = "Plants create energy through a process known as [MASK]." +inputs = tokenizer(prompt, return_tensors="pt").to(model.device) + +# Predict the masked token +with torch.no_grad(): + outputs = model(**inputs) + mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1] + predictions = outputs.logits[0, mask_token_index] # Get logits for [MASK] token + +# Decode top predictions (k=5) +top_k = torch.topk(predictions, k=5).indices.tolist() +for token_id in top_k[0]: + print(f"Prediction: {tokenizer.decode([token_id])}") +``` + + + + +```bash +transformers-cli mask \ + --model albert-base-v2 \ + --text "The capital of France is [MASK]." \ + --device cuda \ # Optional + --torch_dtype float16 ``` -For the best speedups, we recommend loading the model in half-precision (e.g. `torch.float16` or `torch.bfloat16`). + -On a local benchmark (GeForce RTX 2060-8GB, PyTorch 2.3.1, OS Ubuntu 20.04) with `float16`, we saw the -following speedups during training and inference. + -#### Training for 100 iterations +Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](https://huggingface.co/docs/transformers/main/en/quantization/overview) overview for more available quantization backends. 
-|batch_size|seq_len|Time per batch (eager - s)| Time per batch (sdpa - s)| Speedup (%)| Eager peak mem (MB)| sdpa peak mem (MB)| Mem saving (%)| -|----------|-------|--------------------------|--------------------------|------------|--------------------|-------------------|---------------| -|2 |256 |0.028 |0.024 |14.388 |358.411 |321.088 |11.624 | -|2 |512 |0.049 |0.041 |17.681 |753.458 |602.660 |25.022 | -|4 |256 |0.044 |0.039 |12.246 |679.534 |602.660 |12.756 | -|4 |512 |0.090 |0.076 |18.472 |1434.820 |1134.140 |26.512 | -|8 |256 |0.081 |0.072 |12.664 |1283.825 |1134.140 |13.198 | -|8 |512 |0.170 |0.143 |18.957 |2820.398 |2219.695 |27.062 | +The example below uses [torch.quantization](https://pytorch.org/docs/stable/generated/torch.ao.quantization.quantize_dynamic.html) to only quantize the weights to int8. In te example quantization was applied only on Linear layers -#### Inference with 50 batches -|batch_size|seq_len|Per token latency eager (ms)|Per token latency SDPA (ms)|Speedup (%) |Mem eager (MB)|Mem BT (MB)|Mem saved (%)| -|----------|-------|----------------------------|---------------------------|------------|--------------|-----------|-------------| -|4 |128 |0.083 |0.071 |16.967 |48.319 |48.45 |-0.268 | -|4 |256 |0.148 |0.127 |16.37 |63.4 |63.922 |-0.817 | -|4 |512 |0.31 |0.247 |25.473 |110.092 |94.343 |16.693 | -|8 |128 |0.137 |0.124 |11.102 |63.4 |63.66 |-0.409 | -|8 |256 |0.271 |0.231 |17.271 |91.202 |92.246 |-1.132 | -|8 |512 |0.602 |0.48 |25.47 |186.159 |152.564 |22.021 | -|16 |128 |0.252 |0.224 |12.506 |91.202 |91.722 |-0.567 | -|16 |256 |0.526 |0.448 |17.604 |148.378 |150.467 |-1.388 | -|16 |512 |1.203 |0.96 |25.365 |338.293 |271.102 |24.784 | +```py +from transformers import AutoModelForMaskedLM, AutoTokenizer +import torch + +# Load model ---loaded v1 version as it was answering good +model = AutoModelForMaskedLM.from_pretrained("albert/albert-base-v1") +tokenizer = AutoTokenizer.from_pretrained("albert/albert-base-v1") + +# Quantize the model (PyTorch native) +quantized_model = torch.quantization.quantize_dynamic( + model, + {torch.nn.Linear}, # Quantize only linear layers + dtype=torch.qint8 +) + +# Verify +print(f"Size before: {model.get_memory_footprint()/1e6:.1f}MB") +print(f"Size after: {quantized_model.get_memory_footprint()/1e6:.1f}MB") + +# Usage example +inputs = tokenizer("Albert Einstein was born in [MASK].", return_tensors="pt") +with torch.no_grad(): + outputs = quantized_model(**inputs) + print(tokenizer.decode(outputs.logits[0].argmax(-1))) +``` + +> ALBERT is not compatible with `AttentionMaskVisualizer` as it uses masked self-attention rather than causal attention. So don't have `_update_causal_mask` method. -This model was contributed by [lysandre](https://huggingface.co/lysandre). This model jax version was contributed by -[kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/google-research/ALBERT). +## Notes +- All tokens attend to all others (like BERT). +- ALBERT supports a maximum sequence length of 512 tokens. +- Cannot be used for autoregressive generation (unlike GPT) +- ALBERT requires absolute positional embeddings, and it expects right-padding (i.e., pad tokens should be added at the end, not the beginning). +- ALBERT uses token_type_ids, just like BERT. So you should indicate which token belongs to which segment (e.g., sentence A vs. sentence B) when doing tasks like question answering or sentence-pair classification. 
+- ALBERT uses a different pretraining objective called Sentence Order Prediction (SOP) instead of Next Sentence Prediction (NSP), so fine-tuned models might behave slightly differently from BERT when modeling inter-sentence relationships. ## Resources From 1f2a8095dd3c6553a553bebfc520c355249ce051 Mon Sep 17 00:00:00 2001 From: souvikchand <96312748+souvikchand@users.noreply.github.com> Date: Fri, 25 Apr 2025 01:21:23 +0530 Subject: [PATCH 02/13] Update docs/source/en/model_doc/albert.md added the quotes in Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/albert.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/albert.md b/docs/source/en/model_doc/albert.md index 48cceff635a2..aafd6ac36331 100644 --- a/docs/source/en/model_doc/albert.md +++ b/docs/source/en/model_doc/albert.md @@ -44,7 +44,7 @@ You can find all the original ALBERT checkpoints [HERE](https://huggingface.co/c The example below demonstrates how to generate text based with [`Pipeline`], [`AutoModel`] class or from command line. - ```py import torch From be83fe959cfb553878277142831cb102522a10a1 Mon Sep 17 00:00:00 2001 From: souvikchand <96312748+souvikchand@users.noreply.github.com> Date: Fri, 25 Apr 2025 01:43:05 +0530 Subject: [PATCH 03/13] Update docs/source/en/model_doc/albert.md updated checkpoints Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/albert.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/albert.md b/docs/source/en/model_doc/albert.md index aafd6ac36331..3d5bf30a448b 100644 --- a/docs/source/en/model_doc/albert.md +++ b/docs/source/en/model_doc/albert.md @@ -36,7 +36,7 @@ ALBERT was created to address problems like -- GPU/TPU memory limitations, longe ALBERT uses absolute position embeddings (like BERT) so padding is applied at right. Size of embeddings is 128 While BERT uses 768. ALBERT can processes maximum 512 token at a time. -You can find all the original ALBERT checkpoints [HERE](https://huggingface.co/collections/google/albert-release-64ff65ba18830fabea2f2cec) +You can find all the original ALBERT checkpoints under the [ALBERT community](https://huggingface.co/albert) organization. > [!TIP] > Click on the ALBERT models in the right sidebar for more examples of how to apply ALBERT to different natural language processing (NLP) tasks. From 077ac58575131b5de3365dab53d7081dd4c85101 Mon Sep 17 00:00:00 2001 From: souvikchand <96312748+souvikchand@users.noreply.github.com> Date: Fri, 25 Apr 2025 01:45:13 +0530 Subject: [PATCH 04/13] Update docs/source/en/model_doc/albert.md changed !Tips description Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/albert.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/albert.md b/docs/source/en/model_doc/albert.md index 3d5bf30a448b..9618ecb2a64f 100644 --- a/docs/source/en/model_doc/albert.md +++ b/docs/source/en/model_doc/albert.md @@ -39,7 +39,7 @@ ALBERT uses absolute position embeddings (like BERT) so padding is applied at ri You can find all the original ALBERT checkpoints under the [ALBERT community](https://huggingface.co/albert) organization. > [!TIP] -> Click on the ALBERT models in the right sidebar for more examples of how to apply ALBERT to different natural language processing (NLP) tasks. 
+> Click on the ALBERT models in the right sidebar for more examples of how to apply ALBERT to different language tasks. The example below demonstrates how to generate text based with [`Pipeline`], [`AutoModel`] class or from command line. From dd024e3340a9cf3f74836695859a7cfb9ff432f7 Mon Sep 17 00:00:00 2001 From: souvikchand <96312748+souvikchand@users.noreply.github.com> Date: Fri, 25 Apr 2025 01:46:51 +0530 Subject: [PATCH 05/13] Update docs/source/en/model_doc/albert.md updated text Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/albert.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/albert.md b/docs/source/en/model_doc/albert.md index 9618ecb2a64f..99f0a5a1d0bf 100644 --- a/docs/source/en/model_doc/albert.md +++ b/docs/source/en/model_doc/albert.md @@ -41,7 +41,7 @@ You can find all the original ALBERT checkpoints under the [ALBERT community](ht > [!TIP] > Click on the ALBERT models in the right sidebar for more examples of how to apply ALBERT to different language tasks. -The example below demonstrates how to generate text based with [`Pipeline`], [`AutoModel`] class or from command line. +The example below demonstrates how to predict the `[MASK]` token with [`Pipeline`], [`AutoModel`], and from the command line. From 21e3c7aa88b4b5e39c86113a96fc5cf35e29460a Mon Sep 17 00:00:00 2001 From: souvikchand <96312748+souvikchand@users.noreply.github.com> Date: Fri, 25 Apr 2025 01:52:18 +0530 Subject: [PATCH 06/13] Update docs/source/en/model_doc/albert.md updated transformer-cli implementation Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/albert.md | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/docs/source/en/model_doc/albert.md b/docs/source/en/model_doc/albert.md index 99f0a5a1d0bf..386848c42a37 100644 --- a/docs/source/en/model_doc/albert.md +++ b/docs/source/en/model_doc/albert.md @@ -100,11 +100,7 @@ for token_id in top_k[0]: ```bash -transformers-cli mask \ - --model albert-base-v2 \ - --text "The capital of France is [MASK]." \ - --device cuda \ # Optional - --torch_dtype float16 +echo -e "Plants create [MASK] through a process known as photosynthesis." | transformers-cli run --task fill-mask --model albert-base-v2 --device 0 ``` From 69baa2971901fc1c745020f6867222b36058bba5 Mon Sep 17 00:00:00 2001 From: souvikchand <96312748+souvikchand@users.noreply.github.com> Date: Fri, 25 Apr 2025 01:58:25 +0530 Subject: [PATCH 07/13] Update docs/source/en/model_doc/albert.md changed text Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/albert.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/albert.md b/docs/source/en/model_doc/albert.md index 386848c42a37..b7909965e1a2 100644 --- a/docs/source/en/model_doc/albert.md +++ b/docs/source/en/model_doc/albert.md @@ -25,7 +25,7 @@ rendered properly in your Markdown viewer. # ALBERT -[Albert](https://huggingface.co/papers/1909.11942) +[ALBERT](https://huggingface.co/papers/1909.11942) is designed to address memory limitations of scaling and training of [BERT](./bert). It adds two parameter reduction techniques. The first, factorized embedding parametrization, splits the larger vocabulary embedding matrix into two smaller matrices so you can grow the hidden size without adding a lot more parameters. 
The second, cross-layer parameter sharing, allows layers to share parameters which keeps the number of learnable parameters lower.

 The ALBERT model was proposed in [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://huggingface.co/papers/1909.11942) by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.

From 7ba1110083bb16158124fe01c7b80702b1b59af0 Mon Sep 17 00:00:00 2001
From: souvikchand <96312748+souvikchand@users.noreply.github.com>
Date: Fri, 25 Apr 2025 02:00:18 +0530
Subject: [PATCH 08/13] Update docs/source/en/model_doc/albert.md

removed repeated description

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/albert.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/docs/source/en/model_doc/albert.md b/docs/source/en/model_doc/albert.md
index b7909965e1a2..06f2eed73e25 100644
--- a/docs/source/en/model_doc/albert.md
+++ b/docs/source/en/model_doc/albert.md
@@ -27,7 +27,6 @@ rendered properly in your Markdown viewer.

 [ALBERT](https://huggingface.co/papers/1909.11942) is designed to address memory limitations of scaling and training of [BERT](./bert). It adds two parameter reduction techniques. The first, factorized embedding parametrization, splits the larger vocabulary embedding matrix into two smaller matrices so you can grow the hidden size without adding a lot more parameters. The second, cross-layer parameter sharing, allows layers to share parameters which keeps the number of learnable parameters lower.

-The ALBERT model was proposed in [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://huggingface.co/papers/1909.11942) by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.

From 155b733538548ca1e8cbd67aafed3ca24c0120a8 Mon Sep 17 00:00:00 2001
From: souvikchand <96312748+souvikchand@users.noreply.github.com>
Date: Fri, 25 Apr 2025 02:20:15 +0530
Subject: [PATCH 09/13] Update albert.md

removed lines
---
 docs/source/en/model_doc/albert.md | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/docs/source/en/model_doc/albert.md b/docs/source/en/model_doc/albert.md
index 06f2eed73e25..a61d4e6bd92d 100644
--- a/docs/source/en/model_doc/albert.md
+++ b/docs/source/en/model_doc/albert.md
@@ -28,13 +28,6 @@ rendered properly in your Markdown viewer.

 [ALBERT](https://huggingface.co/papers/1909.11942) is designed to address memory limitations of scaling and training of [BERT](./bert). It adds two parameter reduction techniques. The first, factorized embedding parametrization, splits the larger vocabulary embedding matrix into two smaller matrices so you can grow the hidden size without adding a lot more parameters. The second, cross-layer parameter sharing, allows layers to share parameters which keeps the number of learnable parameters lower.

-ALBERT was created to address problems like -- GPU/TPU memory limitations, longer training times, and unexpected model degradation in BERT.
ALBERT uses two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT: - -- **Factorized embedding parameterization:** The large vocabulary embedding matrix is decomposed into two smaller matrices, reducing memory consumption. -- **Cross-layer parameter sharing:** Instead of learning separate parameters for each transformer layer, ALBERT shares parameters across layers, further reducing the number of learnable weights. - -ALBERT uses absolute position embeddings (like BERT) so padding is applied at right. Size of embeddings is 128 While BERT uses 768. ALBERT can processes maximum 512 token at a time. - You can find all the original ALBERT checkpoints under the [ALBERT community](https://huggingface.co/albert) organization. > [!TIP] @@ -147,6 +140,7 @@ with torch.no_grad(): - ALBERT requires absolute positional embeddings, and it expects right-padding (i.e., pad tokens should be added at the end, not the beginning). - ALBERT uses token_type_ids, just like BERT. So you should indicate which token belongs to which segment (e.g., sentence A vs. sentence B) when doing tasks like question answering or sentence-pair classification. - ALBERT uses a different pretraining objective called Sentence Order Prediction (SOP) instead of Next Sentence Prediction (NSP), so fine-tuned models might behave slightly differently from BERT when modeling inter-sentence relationships. +- ALBERT uses absolute position embeddings (like BERT) so padding is applied at right. Size of embeddings is 128 While BERT uses 768. ALBERT can processes maximum 512 token at a time. ## Resources From b1420c5f28a90c175fef46b819b507a93fe408d5 Mon Sep 17 00:00:00 2001 From: souvikchand <96312748+souvikchand@users.noreply.github.com> Date: Fri, 25 Apr 2025 20:42:40 +0530 Subject: [PATCH 10/13] Update albert.md updated pipeline code --- docs/source/en/model_doc/albert.md | 14 ++++---------- 1 file changed, 4 insertions(+), 10 deletions(-) diff --git a/docs/source/en/model_doc/albert.md b/docs/source/en/model_doc/albert.md index a61d4e6bd92d..b0574f30d982 100644 --- a/docs/source/en/model_doc/albert.md +++ b/docs/source/en/model_doc/albert.md @@ -42,19 +42,13 @@ The example below demonstrates how to predict the `[MASK]` token with [`Pipeline import torch from transformers import pipeline -#Initialize fill-mask pipeline -albert_fill_mask = pipeline( +pipeline = pipeline( task="fill-mask", model="albert-base-v2", - device=0 if torch.cuda.is_available() else -1 + torch_dtype=torch.float16, + device=0 ) - -# Masked prompt (use [MASK] token) -prompt = "Plants create energy through a process known as [MASK]." 
-results = albert_fill_mask(prompt, top_k=5) # Get top 5 predictions - -for result in results: - print(f"Prediction: {result['token_str']} | Score: {result['score']:.4f}") +pipeline("Plants create [MASK] through a process known as photosynthesis.", top_k=5) ``` From df6376e9c125ad9c120a6bf538d4287e023b4b2b Mon Sep 17 00:00:00 2001 From: souvikchand <96312748+souvikchand@users.noreply.github.com> Date: Sat, 26 Apr 2025 12:15:09 +0530 Subject: [PATCH 11/13] Update albert.md updated auto model code, removed quantization as model size is not large, removed the attention visualizer part --- docs/source/en/model_doc/albert.md | 43 +++--------------------------- 1 file changed, 4 insertions(+), 39 deletions(-) diff --git a/docs/source/en/model_doc/albert.md b/docs/source/en/model_doc/albert.md index b0574f30d982..336637d6166a 100644 --- a/docs/source/en/model_doc/albert.md +++ b/docs/source/en/model_doc/albert.md @@ -58,25 +58,22 @@ pipeline("Plants create [MASK] through a process known as photosynthesis.", top_ import torch from transformers import AutoModelForMaskedLM, AutoTokenizer -# Load ALBERT (v2) and its tokenizer tokenizer = AutoTokenizer.from_pretrained("albert/albert-base-v2") model = AutoModelForMaskedLM.from_pretrained( "albert/albert-base-v2", - torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32, + torch_dtype=torch.float16, + attn_implementation="sdpa", device_map="auto" ) -# Masked language modeling prompt prompt = "Plants create energy through a process known as [MASK]." -inputs = tokenizer(prompt, return_tensors="pt").to(model.device) +inputs = tokenizer(prompt, return_tensors="pt").to(model.device) -# Predict the masked token with torch.no_grad(): outputs = model(**inputs) mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1] - predictions = outputs.logits[0, mask_token_index] # Get logits for [MASK] token + predictions = outputs.logits[0, mask_token_index] -# Decode top predictions (k=5) top_k = torch.topk(predictions, k=5).indices.tolist() for token_id in top_k[0]: print(f"Prediction: {tokenizer.decode([token_id])}") @@ -93,38 +90,6 @@ echo -e "Plants create [MASK] through a process known as photosynthesis." | tran -Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](https://huggingface.co/docs/transformers/main/en/quantization/overview) overview for more available quantization backends. - -The example below uses [torch.quantization](https://pytorch.org/docs/stable/generated/torch.ao.quantization.quantize_dynamic.html) to only quantize the weights to int8. 
In te example quantization was applied only on Linear layers
-
-
-```py
-from transformers import AutoModelForMaskedLM, AutoTokenizer
-import torch
-
-# Load model ---loaded v1 version as it was answering good
-model = AutoModelForMaskedLM.from_pretrained("albert/albert-base-v1")
-tokenizer = AutoTokenizer.from_pretrained("albert/albert-base-v1")
-
-# Quantize the model (PyTorch native)
-quantized_model = torch.quantization.quantize_dynamic(
-    model,
-    {torch.nn.Linear},  # Quantize only linear layers
-    dtype=torch.qint8
-)
-
-# Verify
-print(f"Size before: {model.get_memory_footprint()/1e6:.1f}MB")
-print(f"Size after: {quantized_model.get_memory_footprint()/1e6:.1f}MB")
-
-# Usage example
-inputs = tokenizer("Albert Einstein was born in [MASK].", return_tensors="pt")
-with torch.no_grad():
-    outputs = quantized_model(**inputs)
-    print(tokenizer.decode(outputs.logits[0].argmax(-1)))
-```
-
-> ALBERT is not compatible with `AttentionMaskVisualizer` as it uses masked self-attention rather than causal attention. So don't have `_update_causal_mask` method.

 ## Notes

From 354697691e759dced4764b4a759d233c6ae49dbf Mon Sep 17 00:00:00 2001
From: souvikchand <96312748+souvikchand@users.noreply.github.com>
Date: Sat, 26 Apr 2025 12:21:39 +0530
Subject: [PATCH 12/13] Update docs/source/en/model_doc/albert.md

updated notes

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/albert.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/albert.md b/docs/source/en/model_doc/albert.md
index 336637d6166a..6cc142d5f40e 100644
--- a/docs/source/en/model_doc/albert.md
+++ b/docs/source/en/model_doc/albert.md
@@ -93,7 +93,8 @@ echo -e "Plants create [MASK] through a process known as photosynthesis." | tran

 ## Notes

-- All tokens attend to all others (like BERT).
+- Inputs should be padded on the right because ALBERT, like BERT, uses absolute position embeddings.
+- The embedding size `E` is different from the hidden size `H` because the embeddings are context independent (one embedding vector represents one token) and the hidden states are context dependent (one hidden state represents a sequence of tokens). The embedding matrix is also large because its size is `V x E`, where `V` is the vocabulary size. As a result, it's more logical if `H >> E`. If `E < H`, the model has fewer parameters.
 - ALBERT supports a maximum sequence length of 512 tokens.
 - Cannot be used for autoregressive generation (unlike GPT)
 - ALBERT requires absolute positional embeddings, and it expects right-padding (i.e., pad tokens should be added at the end, not the beginning).

From d36e6384654f8afa9e6fc5dd1ae1372cd38f26b5 Mon Sep 17 00:00:00 2001
From: souvikchand <96312748+souvikchand@users.noreply.github.com>
Date: Sat, 26 Apr 2025 12:36:55 +0530
Subject: [PATCH 13/13] Update albert.md

reduced a repeating point in notes
---
 docs/source/en/model_doc/albert.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/albert.md b/docs/source/en/model_doc/albert.md
index 6cc142d5f40e..fd593f15d173 100644
--- a/docs/source/en/model_doc/albert.md
+++ b/docs/source/en/model_doc/albert.md
@@ -100,7 +100,7 @@ echo -e "Plants create [MASK] through a process known as photosynthesis." | tran
 - ALBERT requires absolute positional embeddings, and it expects right-padding (i.e., pad tokens should be added at the end, not the beginning).
 - ALBERT uses token_type_ids, just like BERT.
So you should indicate which token belongs to which segment (e.g., sentence A vs. sentence B) when doing tasks like question answering or sentence-pair classification. - ALBERT uses a different pretraining objective called Sentence Order Prediction (SOP) instead of Next Sentence Prediction (NSP), so fine-tuned models might behave slightly differently from BERT when modeling inter-sentence relationships. -- ALBERT uses absolute position embeddings (like BERT) so padding is applied at right. Size of embeddings is 128 While BERT uses 768. ALBERT can processes maximum 512 token at a time. + ## Resources
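To make the factorized-embedding note above concrete, here is a minimal back-of-the-envelope sketch. The 30,000-token vocabulary and hidden size `H = 768` are illustrative assumptions based on the base-sized configurations; only the embedding size of 128 is stated in the card itself.

```py
# Rough embedding-parameter count, single-matrix (BERT-style) vs. factorized (ALBERT-style).
# Illustrative numbers: V (vocab) = 30,000, E (embedding size) = 128, H (hidden size) = 768.
V, E, H = 30_000, 128, 768

single_matrix = V * H          # one V x H embedding matrix
factorized = V * E + E * H     # a V x E embedding matrix plus an E x H projection

print(f"V x H embedding:          {single_matrix:,} parameters")  # 23,040,000
print(f"Factorized V x E + E x H: {factorized:,} parameters")     # 3,938,304
```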