[Inactive] Changes for basic LLaDA style diffusion masking support #238

Closed

gopeshh wants to merge 3 commits into main from gopeshh/masked_diffusion

Collaborator

gopeshh commented Apr 21, 2025

✨ Description

Cleaned up the code a bit:

Added a Diffusion config object, as we discussed.
Removed noise schedules for v1.
Moved the loss calculation to head.py (I noticed the language modelling loss is computed there).
Moved bidirectional attention to preprocessing.py, since the attention mask appears to be computed there.

This is still a WIP, but feel free to leave comments and suggestions.

These changes address the feedback in #208 (comment).


changes for basic LLaDA style diffusion masking support

db28a11

gopeshh requested a review from tscholak

April 21, 2025 12:12


tests for masking and MLM loss

3d44671

PierreAndreNoel reviewed

PierreAndreNoel left a comment

This is just quick feedback, as I am very busy with other things. Please remind me to come back here next week and I'll dig in deeper.

fast_llm/layers/transformer/preprocessing.py

    
    t = torch.rand(batch_size, device=device)
    p_mask = (1 - diffusion_config.epsilon) * t + diffusion_config.epsilon

PierreAndreNoel Apr 22, 2025

Some questions/thoughts (I am just browsing quickly, and I am not looking at the paper right now):

Why is the lower bound epsilon and the upper bound max_mask_prob?
My gut tells me you never want the mask probability to be exactly 1, for the same kind of reasons you don't want it to be exactly 0.
This approach using torch.min puts a discrete probability mass on p_mask being exactly max_mask_prob.
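
To make the last point concrete, here is a minimal, framework-free sketch (plain Python rather than the PR's torch code). The values of epsilon and max_mask_prob are illustrative assumptions, not taken from the PR, and rescaled_p_mask is a hypothetical alternative that the PR does not implement. It compares the hard clamp with a linear rescaling of t onto [epsilon, max_mask_prob):

```python
import random


def clamped_p_mask(t, epsilon, max_mask_prob):
    """Mirror of the hunk under review: map t to [epsilon, 1], then hard-clamp."""
    return min((1 - epsilon) * t + epsilon, max_mask_prob)


def rescaled_p_mask(t, epsilon, max_mask_prob):
    """Hypothetical alternative: map t in [0, 1) linearly onto [epsilon, max_mask_prob)."""
    return (max_mask_prob - epsilon) * t + epsilon


rng = random.Random(0)
epsilon, max_mask_prob = 1e-3, 0.9  # illustrative values only
ts = [rng.random() for _ in range(10_000)]

# Count how many sampled timesteps end up exactly at the upper bound.
clamped_hits = sum(clamped_p_mask(t, epsilon, max_mask_prob) == max_mask_prob for t in ts)
rescaled_hits = sum(rescaled_p_mask(t, epsilon, max_mask_prob) >= max_mask_prob for t in ts)

print(f"clamped:  {clamped_hits / len(ts):.3f} of samples sit exactly at max_mask_prob")
print(f"rescaled: {rescaled_hits / len(ts):.3f} of samples reach max_mask_prob")
```

With these illustrative values, roughly 10% of the sampled timesteps land exactly on max_mask_prob under clamping, while the rescaled version keeps p_mask continuously distributed below the bound.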

Collaborator Author

gopeshh May 6, 2025

Are you saying this because many timesteps could end up with the masking level set to exactly max_mask_prob? Are you suggesting some soft clipping instead of a hard upper bound?

fast_llm/layers/transformer/preprocessing.py

    
    masked_indices = torch.rand((batch_size, seq_len), device=device) < p_mask
    if diffusion_config.pad_prob > 0:

PierreAndreNoel Apr 22, 2025

Meta: I can't comment on padding right now; it will have to wait until next week, as I need to re-read the paper more carefully (our own work doesn't do padding).

Contributor

nitsanluke May 8, 2025

Is this to include variable-length sequences for 1% of the data?

Collaborator Author

gopeshh May 13, 2025

Yeah exactly!

fast_llm/layers/transformer/preprocessing.py

    
    p_mask = torch.min(p_mask, torch.tensor(diffusion_config.max_mask_prob))
    p_mask = p_mask[:, None].expand(-1, seq_len)
    masked_indices = torch.rand((batch_size, seq_len), device=device) < p_mask

PierreAndreNoel Apr 22, 2025

Assuming True means "masked".

fast_llm/layers/transformer/preprocessing.py

    
        attention_mask = torch.ones((batch_size, 1, seq_len, seq_len), device=device, dtype=torch.bool)
    else:
        # Causal attention
        attention_mask = torch.ones((batch_size, 1, seq_len, seq_len), device=device, dtype=torch.bool).tril_()

PierreAndreNoel Apr 22, 2025

My understanding is that you never want such a triangular causal attention, as this would give a strictly worse model than an autoregressive model.

Suppose that, at inference, tokens are unmasked in the order (4, 2, 3, 0, 1). Token 4 is unmasked first, but this triangular matrix prevents all other tokens from ever "seeing" it.

The closest case that makes sense to me would be to permute the rows and columns of the triangular matrix using (4, 2, 3, 0, 1), so that token 2 can see token 4, token 3 can see tokens 2 and 4, etc.
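
As a rough illustration of that permutation, here is a small sketch in plain Python (not the PR's torch code). The function name permuted_causal_mask is hypothetical, and the sketch assumes each token may also attend to itself, as with the diagonal kept by tril_():

```python
def permuted_causal_mask(unmask_order):
    """Return mask[i][j] == True when token i may attend to token j.

    Token i sees token j if j was unmasked no later than i, which is the
    lower-triangular mask with rows and columns permuted by unmask_order.
    """
    rank = [0] * len(unmask_order)
    for step, token in enumerate(unmask_order):
        rank[token] = step  # step at which this token position is unmasked
    n = len(rank)
    return [[rank[j] <= rank[i] for j in range(n)] for i in range(n)]


mask = permuted_causal_mask((4, 2, 3, 0, 1))
for row in mask:
    print("".join("1" if visible else "0" for visible in row))
# Prints:
# 10111
# 11111
# 00101
# 00111
# 00001
```

Row 2 shows token 2 attending to itself and token 4, and row 3 shows token 3 attending to tokens 2, 3 and 4, matching the description above. Token 4, unmasked first, sees only itself.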

Collaborator Author

gopeshh May 6, 2025

Yes, permuting the rows and columns makes sense, so that we preserve the order in which the tokens were unmasked. I will update this.
I guess this idea is similar to this paper? https://arxiv.org/abs/1906.08237

fast_llm/layers/transformer/preprocessing.py

    
    kwargs['masked_indices'] = masked_indices
    kwargs['p_mask'] = p_mask
    if self._config.diffusion.bidirectional_attention:

PierreAndreNoel Apr 22, 2025

You may want a string instead of a boolean, as there are many possible attention choices (e.g., blocks) that may come up. Also see the next comment below.
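
One possible shape for such a string-valued option, sketched with the standard library only. The names AttentionType and DiffusionAttentionConfig and the listed choices are hypothetical and are not Fast-LLM's actual config classes:

```python
import enum
from dataclasses import dataclass


class AttentionType(str, enum.Enum):
    """Hypothetical set of attention choices; more can be added later."""

    causal = "causal"
    bidirectional = "bidirectional"
    block = "block"


@dataclass
class DiffusionAttentionConfig:
    """Hypothetical stand-in for the diffusion config's attention setting."""

    attention_type: AttentionType = AttentionType.bidirectional

    def __post_init__(self):
        # Accept plain strings (e.g. from a YAML file) and reject unknown values.
        self.attention_type = AttentionType(self.attention_type)


config = DiffusionAttentionConfig("block")
print(config.attention_type is AttentionType.block)
```

Compared with a boolean, an unknown value such as "diagonal" raises a ValueError at construction time, and adding a new attention pattern does not require a new flag.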

Collaborator Author

gopeshh May 6, 2025

I agree, will change this!

nitsanluke reviewed

fast_llm/layers/language_model/head.py

    
    masked_p = p_mask[masked_indices]
    # Compute MLM loss
    loss, grad = cross_entropy_forward_backward(

Contributor

nitsanluke May 8, 2025

Is the loss already divided by masked_p? https://github.com/ML-GSAI/SMDM/blob/583aa4716d17728dbb825aec6c24a121164d616a/pretrain/train_mdm.py#L274
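
For context, a minimal sketch of what dividing by masked_p could look like, written in plain Python on nested lists rather than tensors. It assumes the per-token cross-entropy is already available and that the weighted sum is normalized by the total number of tokens (batch size times sequence length). The function name weighted_masked_loss is hypothetical, and this is not the PR's implementation:

```python
def weighted_masked_loss(token_losses, masked, p_mask):
    """Average of loss / p_mask over masked positions, normalized by all tokens.

    token_losses, masked and p_mask are nested lists shaped [batch][seq_len].
    Unmasked positions contribute nothing to the numerator.
    """
    total = 0.0
    n_tokens = 0
    for losses_row, masked_row, p_row in zip(token_losses, masked, p_mask):
        for loss, is_masked, p in zip(losses_row, masked_row, p_row):
            n_tokens += 1
            if is_masked:
                total += loss / p  # rarer masking levels get a larger weight
    return total / n_tokens


token_losses = [[2.0, 1.0], [4.0, 3.0]]
masked = [[True, False], [True, True]]
p_mask = [[0.5, 0.5], [1.0, 1.0]]
print(weighted_masked_loss(token_losses, masked, p_mask))  # prints 2.75
```

In this toy batch the first sequence was masked with probability 0.5, so its single masked token counts double: (2.0 / 0.5 + 4.0 + 3.0) / 4 = 2.75.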

nitsanluke mentioned this pull request

Converter for Llama based Masked Diffusion Models (Based on Dream) #263

Merged

25 tasks


config changes

1391d47

Collaborator

jlamypoirier commented Jun 13, 2025

Is this still relevant?

jlamypoirier changed the title ~~Changes for basic LLaDA style diffusion masking support~~ [Inactive] Changes for basic LLaDA style diffusion masking support

gopeshh closed this

gopeshh deleted the gopeshh/masked_diffusion branch

June 22, 2025 12:25
