Clarification on Filtering Molecules by Score Without Bucket Interference #287
Replies: 3 comments 1 reply
REINVENT writes out all generated molecules to a CSV file whose prefix you set with
Hi Halx,

Thank you so much for your response. I used this filter command to post-process my molecules: "(head -n 1 CH1_stage2_1.csv && awk -F',' '$4 > 0.5 {print $0}' CH1_stage2_1.csv | tail -n +2 | sort -t',' -k5,5 -u -k4,4nr) > CH1_stage2_1_filtered.csv". The goal is to retain molecules scoring > 0.5 and to remove duplicates while keeping the highest-scoring SMILES. For this CSV the command outputs 20 molecules. I then tested another RL run with a different batch size using the same command, and it also output 20 molecules.

My hypothesis is that the larger the batch size, the more "good" molecules (score > 0.5) will be collected. This prompted me to look further. When I opened staged2_learning.toml in vim, I saw that the bucket size is 20, so I suspect that is why my post-processing step always outputs 20 molecules.

stage2_run.toml:
run_type = "staged_learning"
[parameters]
[learning_strategy]
Use a stricter, diversity-pushing DF here
[diversity_filter]
[[stage]]
[stage.scoring]

Update: I revised my filtering command (see below) and got 60 molecules scoring > 0.5. I would appreciate it if you could verify whether a difference between the two filtering commands explains the two different molecule counts.
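The filtering goal described above (keep molecules scoring > 0.5, drop duplicate SMILES, keep the highest-scoring copy) can also be written out explicitly. The following is only a minimal sketch using the Python standard library. It assumes the CSV has header columns named `Score` and `SMILES`, which is a guess based on the fields 4 and 5 used in the command. The function names `best_per_smiles` and `filter_csv` are hypothetical helpers, not part of REINVENT.

```python
import csv


def best_per_smiles(rows, threshold=0.5, score_col="Score", smiles_col="SMILES"):
    """Keep rows scoring above `threshold`; for duplicate SMILES keep the best score.

    `rows` is any iterable of dicts, e.g. a csv.DictReader.
    The column names are assumptions and may differ in your CSV.
    """
    best = {}
    for row in rows:
        try:
            score = float(row[score_col])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric score
        if score <= threshold:
            continue
        smiles = row[smiles_col]
        # Replace the stored row only if this copy scores strictly higher.
        if smiles not in best or score > float(best[smiles][score_col]):
            best[smiles] = row
    # Highest score first, like the `-k4,4nr` ordering in the shell command.
    return sorted(best.values(), key=lambda r: float(r[score_col]), reverse=True)


def filter_csv(in_path, out_path, **kwargs):
    """Read `in_path`, filter with best_per_smiles, write `out_path`; return row count."""
    with open(in_path, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames or []
        kept = best_per_smiles(reader, **kwargs)
    with open(out_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(kept)
    return len(kept)
```

A call such as `filter_csv("CH1_stage2_1.csv", "CH1_stage2_1_filtered.csv")` would then return the number of unique molecules above the threshold. One difference from the shell version may be worth checking: with `sort -u` and two keys, lines count as duplicates only when both keys compare equal, so the same SMILES with two different scores would be kept twice, whereas this sketch deduplicates on SMILES alone.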
Hi Halx, could you please take a look and let me know? I greatly appreciate your guidance and support!
Hi! I’ve been testing REINVENT with different batch sizes (e.g., 10, 100, 150, 175, 200) and noticed that after filtering for molecules with scores > 0.5, the number of retained molecules remains constant (around 20), which seems tied to my bucket size (set to 20).
Could you suggest a proper way to extract all molecules scoring above a threshold (e.g., 0.5) without being constrained by the diversity filter or bucket size? My goal is to evaluate the model’s learning progress across all generated samples, not just the top molecule per bucket.
Is there a recommended configuration parameter (e.g., in the diversity_filter or scoring_function) or a specific stage in the pipeline where I can capture these “raw” outputs before the bucket-level pruning?
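To make the "learning progress across all generated samples" idea concrete, one option is to compute, for each RL step, the fraction of sampled molecules scoring above the threshold. The sketch below is only an illustration under assumptions: it expects rows with columns named `step` and `Score`, which may not match the actual CSV header, and `fraction_above_per_step` is a hypothetical helper rather than a REINVENT function.

```python
from collections import defaultdict


def fraction_above_per_step(rows, threshold=0.5, step_col="step", score_col="Score"):
    """Return {step: fraction of molecules with score > threshold}.

    `rows` is any iterable of dicts, e.g. a csv.DictReader over the output CSV.
    The column names are assumptions and may need adjusting.
    """
    counts = defaultdict(lambda: [0, 0])  # step -> [n_above, n_total]
    for row in rows:
        try:
            step = int(row[step_col])
            score = float(row[score_col])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or malformed values
        counts[step][1] += 1
        if score > threshold:
            counts[step][0] += 1
    return {step: above / total for step, (above, total) in sorted(counts.items())}
```

Because this counts every row rather than unique molecules, the result should scale with batch size only through sampling noise, which makes runs with different batch sizes easier to compare.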
Thank you for your time and guidance!
Thanks,
Khanh