I'm totally new to the LLM world, and as a first step I'm trying to train a regular LLM on a few text files I have.
I ran SentencePiece (BPE algorithm) on my few text files, which yields a tokenizer. I looked at the vocabulary table and it has just 65 rows, meaning the vocabulary size is 65.
My question: how do you continue training a model with your own tokenizer, whose vocabulary size is different from (smaller than) the one the model was trained with?
Which Python files or config files should I modify?
Generally speaking, at first glance I thought I would follow these steps:
Load the model.
Change the layer (I don't know which one yet) that corresponds to the vocabulary size.
Load the weights into the model - hopefully I'll get a warning that some weights were not assigned, since I decreased the vocabulary size (to just 65 in my case).
Continue training.
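If you happen to be using Hugging Face `transformers`, steps 2 and 3 above are handled by `resize_token_embeddings`, which swaps the input embedding matrix (and a tied output head) for one with the new vocabulary size. Below is a sketch with a tiny GPT-2 built from a config so it runs without downloading weights; in practice you would start from `GPT2LMHeadModel.from_pretrained("gpt2")` or your own checkpoint. The config values are illustrative, not from the original post.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Hypothetical tiny model built from scratch; for a real pretrained model
# you would call GPT2LMHeadModel.from_pretrained("gpt2") instead.
config = GPT2Config(vocab_size=50257, n_embd=64, n_layer=2, n_head=2)
model = GPT2LMHeadModel(config)

# Shrink the token embedding (and the tied LM head) to the new vocabulary.
model.resize_token_embeddings(65)

print(model.get_input_embeddings().weight.shape)  # (65, n_embd)
```

One caveat: shrinking keeps the rows for the first 65 *old* token IDs, which won't correspond to your new tokenizer's IDs, so those embeddings are essentially meaningless for your vocabulary until you continue training (step 4).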
Thanks to everybody who can guide me on how to do this and share their knowledge.
If there are any links or videos, please share those too.