-
Notifications
You must be signed in to change notification settings - Fork 40
FEATURE: add context and llm controls to researcher, fix username filter #1401
Conversation
Adds context length controls to researcher (max tokens per post and batch) Allow picking LLM for researcher Fix bug where unicode usernames were not working Fix documentation of OR logic
|
|
||
| # Create test content with long text to test token truncation | ||
| topic = Fabricate(:topic, category: category, tags: [tag_research]) | ||
| long_content = "zz " * 100 # This will exceed our token limit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100 is abit of a magic number here. I wonder if we can generate the long content here based on a constant so that updating the constant in the future doesn't break this test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is fine, cause later on we check for it only showing up 48 times... it is a bit magic and may break if we swap tokenizers, but overall is pretty safe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually is this limit based on https://github.com/discourse/discourse-ai/pull/1401/files#diff-a4d7be1d2a5b82a80a1a4693975f7ac231a577ced032e4d941f332292fb56339R47? If so, then I think this is fine.
| register_filter(/\Ausernames?:(.+)\z/i) do |relation, username, filter| | ||
| user_ids = User.where(username_lower: username.split(",").map(&:downcase)).pluck(:id) | ||
| if user_ids.empty? | ||
| relation.where("1 = 0") | ||
| else | ||
| relation.where("1 = 0") # No results if user doesn't exist | ||
| relation.where("posts.user_id IN (?)", user_ids) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can simplify this to the following and let the postgres planner optimize the query.
register_filter(/\Ausernames?:(.+)\z/i) do |relation, username, filter|
relation.where("posts.user_id IN (?)", User.where(username_lower: username.split(",").map(&:downcase)).select(:id))
end
Also slightly safer as we are not plucking the User#ids into memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started off with that and it failed, so I ended up simplifying, I think the odds of getting a problem here memory wise is very low given llm is streaming in usernames very unlikely the list will get too long.
Adds context length controls to researcher (max tokens per post and batch)
Allow picking LLM for researcher
Fix bug where unicode usernames were not working
Fix documentation of OR logic