
How to bring my own tokenizer and set the vocabulary size accordingly when training a model with loaded weights #2087


Description

@zvimarko

Hi,

I am totally new to the LLM world and, as a first step, I am trying to train a plain LLM on a few text files I have.

I ran SentencePiece (BPE algorithm) on my few text files, which yields a tokenizer. I inspected the token table and it has just 65 rows, meaning the vocabulary size is 65.
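
For context, this is roughly how I produced the tokenizer with the sentencepiece Python package (the file names are placeholders):

```python
import sentencepiece as spm

# Train a BPE tokenizer on a couple of local text files (placeholder names).
# Note: very small vocab_size values can fail if the training text contains
# more unique characters than the requested vocabulary can hold.
spm.SentencePieceTrainer.train(
    input="file1.txt,file2.txt",   # comma-separated list of training files
    model_prefix="my_tokenizer",   # writes my_tokenizer.model / my_tokenizer.vocab
    vocab_size=65,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="my_tokenizer.model")
print(sp.vocab_size())            # -> 65
print(sp.encode("hello world"))   # token IDs under the new vocabulary
```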

My question: how do you continue training a model with your own tokenizer, whose vocabulary size is different from (smaller than) the one the model was originally trained with?

Which Python files or config settings should I modify?

Generally speaking, at first glance I thought I would follow these steps (a rough sketch of what I mean is shown after the list):

  1. Load the model.
  2. Replace the layer(s) that correspond to the vocabulary size (I do not yet know which ones).
  3. Load the pretrained weights into the model; hopefully I would get a warning that some weights were not assigned, because I decreased the vocabulary size (to just 65 in my case).
  4. Continue training.
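
Here is a minimal, self-contained sketch of what I have in mind, assuming a GPT-style PyTorch model where the vocabulary size appears in exactly two places: the token embedding and the output head. All names here (TinyGPT, tok_emb, out_head) are illustrative stand-ins, not from any specific codebase:

```python
import torch.nn as nn

# Minimal GPT-style stand-in; a real model would have attention blocks, etc.
class TinyGPT(nn.Module):
    def __init__(self, vocab_size, emb_dim=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.body = nn.Linear(emb_dim, emb_dim)   # placeholder for transformer blocks
        self.out_head = nn.Linear(emb_dim, vocab_size, bias=False)

    def forward(self, idx):
        x = self.body(self.tok_emb(idx))
        return self.out_head(x)

OLD_VOCAB, NEW_VOCAB = 50257, 65  # pretrained vocab vs. custom tokenizer

pretrained = TinyGPT(OLD_VOCAB)
state_dict = pretrained.state_dict()   # stands in for a loaded checkpoint

model = TinyGPT(NEW_VOCAB)

# Drop the vocabulary-sized tensors first: load_state_dict raises on shape
# mismatches even with strict=False, so they cannot simply be "skipped".
for key in ("tok_emb.weight", "out_head.weight"):
    state_dict.pop(key, None)

missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("Randomly re-initialized:", missing)   # the two vocab-sized layers
```

The two vocabulary-sized layers then start from random initialization and are learned during continued training, while the rest of the pretrained weights carry over; the `missing` list printed at the end corresponds to the "warning about unassigned weights" I expected in step 3.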

I would be grateful to anyone who can guide me on how to do this and share their knowledge.
If there are any relevant links or videos, please share those, too.

Thanks a lot
