14 changes: 14 additions & 0 deletions .gitattributes
@@ -0,0 +1,14 @@
# Set the default behavior, in case people don't have core.autocrlf set.
* text=auto

# Explicitly declare text files you want to always be normalized and converted
# to native line endings on checkout.
*.c text
*.h text

# Declare files that will always have CRLF line endings on checkout.
*.sln text eol=crlf

# Denote all files that are truly binary and should not be modified.
*.png binary
*.jpg binary
@@ -5,7 +5,7 @@
- Precision: The proportion of positive predictions that are actually positive.
- Recall: The proportion of actual positives that are correctly predicted.
- F1 score: A harmonic mean of precision and recall.
- Please refer to the section on [Evaluation Metrics for the Classification Problem](https://github.com/ghimiresunil/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/tree/articles/Articles/Evaluation%20Metrics/Classification).
- Please refer to the section on [Evaluation Metrics for the Classification Problem](https://github.com/jzsmoreno/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/tree/articles/Articles/Evaluation%20Metrics/Classification).

- **Generative Language Models**
- [Perplexity](https://en.wikipedia.org/wiki/Perplexity): A measure of how well a language model predicts a sequence of words.
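
A quick illustration of the metrics above; the labels, predictions, logits, and vocabulary size below are made-up placeholders rather than outputs of a real model (assumes `scikit-learn` and `torch` are installed).

```python
import torch
import torch.nn.functional as F
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy binary classification results.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall

# Perplexity of a language model is exp(mean cross-entropy over predicted tokens).
logits = torch.randn(10, 32000)           # (num_tokens, vocab_size), stand-in for model output
targets = torch.randint(0, 32000, (10,))  # next-token ids
perplexity = torch.exp(F.cross_entropy(logits, targets))
print(perplexity.item())
```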
2 changes: 1 addition & 1 deletion Articles/Interview Preparation/Generative Models.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
## 1. What is the difference between generative and discriminative models?
Answer: Generative models, such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs), learn the underlying data distribution (the joint distribution $p(x, y)$, or just $p(x)$ when there are no labels) and can therefore generate new data samples. Discriminative models, on the other hand, learn the conditional distribution $p(y \mid x)$, or simply a decision boundary, and focus on distinguishing between different classes or categories within the data.

![Difference between generative and discriminative models](https://github.com/ghimiresunil/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/assets/40186859/741f02a4-de87-4150-ba8f-b3b5a7760098)
![Difference between generative and discriminative models](https://github.com/jzsmoreno/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/assets/40186859/741f02a4-de87-4150-ba8f-b3b5a7760098)
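
As a minimal, hypothetical illustration of the distinction (using simple scikit-learn classifiers rather than VAEs/GANs): a generative classifier models how the features are distributed within each class, while a discriminative classifier only models the decision boundary between classes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

generative = GaussianNB().fit(X, y)              # models p(x | y) and p(y)
discriminative = LogisticRegression().fit(X, y)  # models p(y | x) directly

print(generative.theta_)     # per-class feature means: part of a learned data distribution
print(discriminative.coef_)  # decision-boundary weights: no model of how x is distributed
```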
2 changes: 1 addition & 1 deletion Articles/Interview Preparation/Large Language Models.md
@@ -18,7 +18,7 @@ Pre-trained language models have revolutionized NLP by providing a robust founda

## 2. What are the primary distinctions between models such as GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers)?

![bert_vs_gpt](https://github.com/ghimiresunil/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/assets/40186859/f5632f46-0986-4cbf-9e47-98e4b7274679)
![bert_vs_gpt](https://github.com/jzsmoreno/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/assets/40186859/f5632f46-0986-4cbf-9e47-98e4b7274679)

Answer: GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers) are two foundational architectures in NLP (Natural Language Processing), each with its own distinctive approach and capabilities. Although both are built on the Transformer architecture, they are designed for different objectives and operate in different ways.

2 changes: 1 addition & 1 deletion Articles/Interview Preparation/README.md
@@ -2,4 +2,4 @@

| Topic | Questions|
| ------ | :-----: |
| Generative Models | [🔗](https://github.com/ghimiresunil/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/tree/main/Articles/Interview%20Preparation)|
| Generative Models | [🔗](https://github.com/jzsmoreno/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/tree/main/Articles/Interview%20Preparation)|
@@ -6,7 +6,7 @@
- Loop tiling
- Operator fusion
- Quantization
![Inference Optimizations](https://github.com/ghimiresunil/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/assets/40186859/ccd14769-1652-410b-8862-ffce67e8dde6)
![Inference Optimizations](https://github.com/jzsmoreno/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/assets/40186859/ccd14769-1652-410b-8862-ffce67e8dde6)
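
Of the optimizations listed above, quantization is the simplest to try in stock PyTorch. Below is a minimal sketch of post-training dynamic quantization; the toy model and shapes are illustrative assumptions, not taken from this article.

```python
import torch
import torch.nn as nn

# Toy model standing in for a real network.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Post-training dynamic quantization: Linear weights are stored as int8,
# activations are quantized on the fly at inference time (CPU execution).
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface as the original model, smaller weights
```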

# On-Device Privacy
| Aspect | Description |
4 changes: 2 additions & 2 deletions Articles/Model Compression/Knowledge Distillation/README.md
@@ -2,7 +2,7 @@
- Knowledge distillation is a technique for transferring knowledge from a large model (teacher) to a smaller model (student), resulting in smaller and more efficient models. [Hinton et al., 2015](https://arxiv.org/abs/1503.02531)
- "Knowledge distillation is a process of transferring knowledge from a large model (teacher) to a smaller model (student). The student model can learn to produce similar output responses (response-based distillation), reproduce similar intermediate layers (feature-based distillation), or reproduce the interaction between layers (relation-based distillation)." [aiedge.io](https://newsletter.theaiedge.io/)
- The image below, which is sourced from [AiEdge.io](https://newsletter.theaiedge.io/), does an excellent job of visualizing the concept of knowledge distillation.
![knowledge_distilation](https://github.com/ghimiresunil/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/assets/40186859/205a960c-c1ce-4d18-9c3a-feab2df12f45)
![knowledge_distilation](https://github.com/jzsmoreno/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/assets/40186859/205a960c-c1ce-4d18-9c3a-feab2df12f45)
- Knowledge distillation is a technique that allows us to deploy large deep learning models in production by training a smaller model (student) to mimic the performance of a larger model (teacher).
> The key idea of knowledge distillation is to train the student on the teacher's output probability distribution (soft targets), rather than only on the hard labels the teacher was trained with.
- During a standard training process, the teacher model learns to discriminate between many classes by maximizing the probability of the correct label. This side effect, where the model assigns smaller probabilities to other classes, can give us valuable insights into how the model generalizes. For example, an image of a cat is more likely to be mistaken for a tiger than a chair, even though the probability of both mistakes is low. We can use this knowledge to train a student model that is more accurate and robust.
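
A minimal sketch of a response-based distillation loss along the lines of Hinton et al. (2015): the student matches the teacher's temperature-softened output distribution while also fitting the hard labels. The temperature, mixing weight, and toy logits below are illustrative choices, not a prescribed recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target KL term (teacher -> student) plus the usual hard-label CE term."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        soft_targets,
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps gradient magnitudes comparable to the hard-label term
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy usage with random logits standing in for real teacher/student outputs.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
distillation_loss(student_logits, teacher_logits, labels).backward()
```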
@@ -44,4 +44,4 @@ where, $F(x_i)$ = probability distribution over the labels created by passing ex
- A teacher model can be used to transfer knowledge to a student model. The teacher model is first trained on a large set of labeled data. Then, it is used to generate soft labels for a smaller set of unlabeled data. These soft labels can then be used to train the student model. This approach allows the student model to learn from the knowledge of the teacher model, even though it is not trained on as much data.
- [Parthasarathi and Strom (2019)](https://arxiv.org/pdf/1904.01624.pdf) used a two-step approach to train an acoustic model for speech recognition. First, they trained a powerful teacher model on a small set of annotated data. This teacher model was then used to label a much larger set of unannotated data. Finally, they trained a leaner, more efficient student model on the combined set of human-annotated and teacher-labeled data (a minimal sketch of this recipe follows the figure below).

![Distillation As Semi-supervised Learning](https://github.com/ghimiresunil/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/assets/40186859/32d6f8a2-e9aa-4fa0-b485-7a1f1e648320)
![Distillation As Semi-supervised Learning](https://github.com/jzsmoreno/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/assets/40186859/32d6f8a2-e9aa-4fa0-b485-7a1f1e648320)
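
A minimal sketch of the two-step recipe described above, with toy stand-ins for the trained teacher, the smaller student, and the unannotated batch:

```python
import torch
import torch.nn.functional as F

teacher = torch.nn.Linear(128, 10)  # stand-in for an already-trained teacher
student = torch.nn.Linear(128, 10)  # stand-in for a leaner student model
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

unlabeled_x = torch.randn(64, 128)  # stand-in for a large unannotated batch

# Step 1: the teacher produces soft labels for the unannotated data.
with torch.no_grad():
    soft_labels = F.softmax(teacher(unlabeled_x), dim=-1)

# Step 2: the student is trained to match the teacher's soft labels
# (in practice, mixed with batches of the original human-annotated data).
loss = F.kl_div(F.log_softmax(student(unlabeled_x), dim=-1), soft_labels, reduction="batchmean")
loss.backward()
optimizer.step()
```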
12 changes: 6 additions & 6 deletions Articles/Model Compression/Mixed Precision Training/README.md
@@ -37,7 +37,7 @@
- Using smaller floating point numbers can lead to rounding errors that are large enough to cause underflow. This is a problem because many gradient update values during backpropagation are very small but not zero. Rounding errors can accumulate during backpropagation, turning these values into zeroes or NaNs. This can lead to inaccurate gradient updates and prevent the network from converging.
- The authors of "[Mixed Precision Training](https://arxiv.org/pdf/1710.03740.pdf)" found that using `fp16` "half-precision" floating point numbers for all computations loses information, because `fp16` cannot represent gradient updates smaller than $2^{-24}$. This information loss can hurt model accuracy: around 5% of all gradient updates made by their example network were smaller than this threshold.

![mixed_precision](https://github.com/ghimiresunil/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/assets/40186859/1316fdde-2bbc-49f9-99fd-aac4a462cbf6)
![mixed_precision](https://github.com/jzsmoreno/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/assets/40186859/1316fdde-2bbc-49f9-99fd-aac4a462cbf6)
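
This underflow threshold is easy to verify directly (a quick sanity check, not taken from the paper):

```python
import torch

# 2^-24 is the smallest positive value fp16 can represent (a subnormal);
# anything meaningfully smaller rounds to zero and the gradient update is lost.
print(torch.tensor(2.0 ** -24, dtype=torch.float16))  # ~5.96e-08, still representable
print(torch.tensor(2.0 ** -25, dtype=torch.float16))  # rounds to 0.0
```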

- Mixed precision training is a technique that uses `fp16` to speed up model training without sacrificing accuracy. It does this by combining three different techniques:
- Maintain two copies of the weights matrix:
@@ -55,7 +55,7 @@
### How Tensor Cores Actually Works
- Mixed precision training (an `fp16` matrix is half the size of a `fp32` one) can reduce the memory requirements for deep learning models, but it can only speed up training if the GPU has special hardware support for half-precision operations. Tensor cores in recent NVIDIA GPUs provide this support, and can significantly speed up mixed precision training.
- Tensor cores are specialized processing units designed to perform a single operation very quickly: multiplying two 4x4 matrices of floating-point numbers in half precision (`fp16`) and adding the result to a third 4x4 matrix of floating-point numbers in either half precision or single precision (`fp32`). This operation is called a "fused multiply add".
![How Tensor Cores Actually Works](https://github.com/ghimiresunil/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/assets/40186859/c4958145-628a-4e22-9f06-61544eb02c81)
![How Tensor Cores Actually Works](https://github.com/jzsmoreno/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/assets/40186859/c4958145-628a-4e22-9f06-61544eb02c81)
- Because they accelerate half-precision matrix multiplication, tensor cores are ideal for speeding up backpropagation, the computationally intensive, matrix-heavy process used to train neural networks.

> Note: Tensor cores are only useful for accelerating matrix multiplication operations if the input matrices are in half precision. If you are training a neural network on a GPU with tensor cores and not using mixed precision training, you are wasting the potential of the GPU because the tensor cores will not be used.
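
A minimal sketch of the kind of operation tensor cores accelerate, an `fp16` matrix multiply-accumulate. Whether tensor cores are actually engaged depends on the GPU and the cuBLAS/cuDNN kernels selected, so this only illustrates the shapes and dtypes involved.

```python
import torch

if torch.cuda.is_available():
    a = torch.randn(4, 4, device="cuda", dtype=torch.float16)
    b = torch.randn(4, 4, device="cuda", dtype=torch.float16)
    c = torch.randn(4, 4, device="cuda", dtype=torch.float16)

    # The "fused multiply add" pattern: D = A @ B + C on half-precision inputs.
    # On GPUs with tensor cores, this GEMM can be accumulated internally in fp32.
    d = torch.addmm(c, a, b)
    print(d.dtype)  # torch.float16
```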
@@ -132,10 +132,10 @@ torch.cuda.amp.GradScaler(
### `autocast` Context Manager
- The `torch.cuda.amp.autocast` context manager is a powerful tool for improving the performance of PyTorch models. It automatically casts operations to fp16, which can significantly speed up training without sacrificing accuracy. However, not all operations are safe to run in fp16, so it is important to check the amp [module documentation](https://pytorch.org/docs/master/amp.html#autocast-op-reference) for a list of supported operations.
- The list of operations that autocast can cast to fp16 is dominated by matrix multiplications and convolutions; linear layers are also supported.
![image](https://github.com/ghimiresunil/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/assets/40186859/ff9bff6e-2a18-4359-ba58-89fa8aee2ee0)
![image](https://github.com/jzsmoreno/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/assets/40186859/ff9bff6e-2a18-4359-ba58-89fa8aee2ee0)

- The operations listed above are safe to use in `FP16`, and they have up-casting rules to ensure that they are not affected by a mixture of `FP16` and `FP32` inputs. These operations include two other fundamental linear algebraic operations: matrix/vector dot products and vector cross products.
![image](https://github.com/ghimiresunil/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/assets/40186859/7bbd5347-b609-4a22-9d35-30a71ef383b5)
![image](https://github.com/jzsmoreno/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/assets/40186859/7bbd5347-b609-4a22-9d35-30a71ef383b5)

- The following operations are not safe to use in `FP16`: logarithms, exponents, trigonometric functions, normal functions, discrete functions, and large sums. These operations must be performed in `FP32` to avoid errors.
- Convolutional layers are the most likely layers to benefit from autocasting, as they rely on safe FP16 operations. Activation functions, on the other hand, may not benefit as much from autocasting, as they often use unsafe FP16 operations.
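
Putting `autocast` and `GradScaler` together, a typical mixed precision training step looks roughly like the following; the model, optimizer, and data here are placeholders.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(128, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(32, 128, device=device)
y = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=(device == "cuda")):
    # Matmuls and convolutions inside this block run in fp16 where it is safe.
    loss = torch.nn.functional.cross_entropy(model(x), y)

scaler.scale(loss).backward()  # scale the loss so small fp16 gradients don't underflow
scaler.step(optimizer)         # unscales gradients; skips the step if they contain inf/NaN
scaler.update()                # adjusts the scale factor for the next iteration
```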
@@ -169,7 +169,7 @@ with torch.cuda.amp.autocast():
| BERT | Natural language processing [transformer](https://jalammar.github.io/illustrated-transformer/) model ([bert-base-uncased](https://huggingface.co/bert-base-uncased)) | [Twitter Sentiment Extraction](https://www.kaggle.com/c/tweet-sentiment-extraction) competition on Kaggle | [🔗](https://github.com/spellml/tweet-sentiment-extraction) |

The results:
![result](https://github.com/ghimiresunil/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/assets/40186859/60086032-8772-4e0b-9102-7f319217ffce)
![result](https://github.com/jzsmoreno/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/assets/40186859/60086032-8772-4e0b-9102-7f319217ffce)

- Observations from the results:
- Mixed precision training does not provide any benefits for the feedforward network because it is too small.
@@ -183,7 +183,7 @@ The results:
- PyTorch reserves GPU memory at the start of training to protect the training script from other processes that may try to use up too much memory and cause it to crash.
- Enabling mixed precision training can free up GPU memory, which can allow you to train larger models or use larger batch sizes.
- Both UNet and BERT benefited from mixed precision training, but UNet benefited more. The reason for this is not clear to me, as PyTorch memory allocation behavior is not well-understood.
![result_memory](https://github.com/ghimiresunil/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/assets/40186859/06155d01-2f4f-4be6-8fe0-088bcbe59483)
![result_memory](https://github.com/jzsmoreno/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/assets/40186859/06155d01-2f4f-4be6-8fe0-088bcbe59483)
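
To see this effect in your own runs, PyTorch exposes the caching allocator's statistics directly (a small diagnostic, not part of the benchmarks above):

```python
import torch

if torch.cuda.is_available():
    print(torch.cuda.memory_allocated() / 2**20, "MiB currently allocated by tensors")
    print(torch.cuda.memory_reserved() / 2**20, "MiB reserved by the caching allocator")
    print(torch.cuda.max_memory_allocated() / 2**20, "MiB peak allocation so far")
```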

# Conclusion
- The [PyTorch official website](https://pytorch.org/tutorials/#model-optimization) has tutorials that can help you get started with quantizing your models in PyTorch.
2 changes: 1 addition & 1 deletion Articles/Model Compression/Pruning/README.md
@@ -20,7 +20,7 @@
- Structured pruning, a dynamic research field lacking a clear API, involves selecting a metric to assess the significance of each neuron. Subsequently, neurons with lower information content can be pruned, with potentially useful metrics encompassing the [Shapley value](https://christophm.github.io/interpretable-ml-book/shapley.html), a Taylor approximation measuring a neuron's impact on loss sensitivity, or even random selection. Notably, the [TorchPruner](https://github.com/marcoancona/TorchPruner) library automatically incorporates some of these metrics for `nn.Linear` and convolution modules, while the [Torch-Pruning](https://github.com/vainf/torch-pruning) library offers support for additional operations. Among the notable earlier contributions, one involves filter pruning in convnets using the L1 norm of filter weights.
- Unstructured pruning is a technique for reducing the size of a neural network by zeroing out weights with small magnitudes. It can be done during or after training, and the target sparsity can be adjusted to achieve the desired balance between model size and accuracy. However, [there is some confusion](https://arxiv.org/abs/2003.03033) in this area, so it is important to consult the documentation for [TensorFlow](https://www.tensorflow.org/model_optimization/guide/pruning/) and [PyTorch](https://pytorch.org/tutorials/intermediate/pruning_tutorial.html) before using unstructured pruning.
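
A minimal sketch of unstructured magnitude pruning with PyTorch's built-in utilities; the layer and the 90% sparsity target are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 256)

# Zero out the 90% of weights with the smallest L1 magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.9)
print(float((layer.weight == 0).float().mean()))  # ~0.9 sparsity

# Make the pruning permanent (drops the mask and the weight_orig reparametrization).
prune.remove(layer, "weight")
```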

## Fine Tuning | [What is Fine Tuning](https://github.com/ghimiresunil/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/tree/main/Articles/Training/Fine%20Tuning%20Models)
## Fine Tuning | [What is Fine Tuning](https://github.com/jzsmoreno/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing/tree/main/Articles/Training/Fine%20Tuning%20Models)
- After pruning a neural network, it is [standard practice](https://arxiv.org/pdf/2003.02389.pdf) to retrain the network. The best method is to reset the learning rate to its original value and start training from scratch. Optionally, you can also reset the weights of the unpruned parts of the network to their values earlier in training. This is essentially training the lottery ticket subnetwork that we have identified.
- For example: let's say we have a neural network with 1000 weights. We use pruning to remove 90% of the weights, leaving us with 100 weights. We then retrain the network with a reset learning rate and the weights from the earlier training. This helps the lottery ticket subnetwork to learn how to perform the task at hand more effectively.
