
How to bring my own tokenizer and set the vocabulary size accordingly when training a model with loaded weights #2087


Description

@zvimarko

Hi,

I am totally new to the LLM world and, as a first step, I am trying to train a plain LLM on a few text files I have.

I ran SentencePiece (BPE algorithm) on my few text files, which yields a tokenizer. I inspected the token table and it has just 65 rows, meaning the vocabulary size is 65.
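
For context, this is roughly how I produced the tokenizer with the sentencepiece Python package (the file names are placeholders):

```python
import sentencepiece as spm

# Train a BPE tokenizer on a couple of local text files (placeholder names).
# Note: very small vocab_size values can fail if the training text contains
# more unique characters than the requested vocabulary can hold.
spm.SentencePieceTrainer.train(
    input="file1.txt,file2.txt",   # comma-separated list of training files
    model_prefix="my_tokenizer",   # writes my_tokenizer.model / my_tokenizer.vocab
    vocab_size=65,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="my_tokenizer.model")
print(sp.vocab_size())            # -> 65
print(sp.encode("hello world"))   # token IDs under the new vocabulary
```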

My question: how do you continue training a model with your own tokenizer, whose vocabulary size is different from (smaller than) the one the model was originally trained with?

Which Python files or config settings should I modify?

Generally speaking, at first glance I thought I would follow these steps (a rough sketch of what I mean is shown after the list):

  1. Load the model.
  2. Replace the layer(s) that correspond to the vocabulary size (I do not yet know which ones).
  3. Load the pretrained weights into the model; hopefully I would get a warning that some weights were not assigned, because I decreased the vocabulary size (to just 65 in my case).
  4. Continue training.
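
Here is a minimal, self-contained sketch of what I have in mind, assuming a GPT-style PyTorch model where the vocabulary size appears in exactly two places: the token embedding and the output head. All names here (TinyGPT, tok_emb, out_head) are illustrative stand-ins, not from any specific codebase:

```python
import torch.nn as nn

# Minimal GPT-style stand-in; a real model would have attention blocks, etc.
class TinyGPT(nn.Module):
    def __init__(self, vocab_size, emb_dim=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.body = nn.Linear(emb_dim, emb_dim)   # placeholder for transformer blocks
        self.out_head = nn.Linear(emb_dim, vocab_size, bias=False)

    def forward(self, idx):
        x = self.body(self.tok_emb(idx))
        return self.out_head(x)

OLD_VOCAB, NEW_VOCAB = 50257, 65  # pretrained vocab vs. custom tokenizer

pretrained = TinyGPT(OLD_VOCAB)
state_dict = pretrained.state_dict()   # stands in for a loaded checkpoint

model = TinyGPT(NEW_VOCAB)

# Drop the vocabulary-sized tensors first: load_state_dict raises on shape
# mismatches even with strict=False, so they cannot simply be "skipped".
for key in ("tok_emb.weight", "out_head.weight"):
    state_dict.pop(key, None)

missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("Randomly re-initialized:", missing)   # the two vocab-sized layers
```

The two vocabulary-sized layers then start from random initialization and are learned during continued training, while the rest of the pretrained weights carry over; the `missing` list printed at the end corresponds to the "warning about unassigned weights" I expected in step 3.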

I would be grateful to anyone who can guide me on how to do this and share their knowledge.
If there are any relevant links or videos, please share those, too.

Thanks a lot
