Clarification on Filtering Molecules by Score Without Bucket Interference #287

vivianzheng-st · 2025-10-05T00:32:01Z

Hi! I’ve been testing REINVENT with different batch sizes (e.g., 10, 100, 150, 175, 200) and noticed that after filtering for molecules with scores > 0.5, the number of retained molecules remains constant (around 20), which seems tied to my bucket size (set to 20).

Could you suggest a proper way to extract all molecules scoring above a threshold (e.g., 0.5) without being constrained by the diversity filter or bucket size? My goal is to evaluate the model’s learning progress across all generated samples, not just the top molecule per bucket.

Is there a recommended configuration parameter (e.g., in the diversity_filter or scoring_function) or a specific stage in the pipeline where I can capture these “raw” outputs before the bucket-level pruning?

Thank you for your time and guidance!

Thanks,
Khanh

halx · 2025-10-06T05:52:18Z

Maintainer

REINVENT writes all generated molecules to a CSV file whose prefix you set with summary_csv_prefix in the [parameters] section. I can't see why filtering on that file would yield the low number you report. If that is the case, you would need to provide much more detail.
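As a rough illustration of filtering that CSV by score, here is a minimal Python sketch using only the standard library. The column name "Score" and the tiny inline sample are assumptions made for illustration; check the header line of your own summary CSV and adjust the names accordingly.

```python
import csv
import io


def rows_above_threshold(csv_file, score_column="Score", threshold=0.5):
    """Return every row whose score is strictly greater than the threshold.

    The column name is an assumption; pass the name found in your CSV header.
    """
    kept = []
    for row in csv.DictReader(csv_file):
        try:
            score = float(row[score_column])
        except (TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric score
        if score > threshold:
            kept.append(row)
    return kept


# Made-up sample data standing in for the real summary CSV.
sample = io.StringIO(
    "Agent,Prior,Target,Score,SMILES\n"
    "-20.1,-22.3,-10.0,0.62,CCO\n"
    "-21.0,-23.0,-11.0,0.31,CCN\n"
    "-19.5,-21.0,-9.0,0.75,c1ccccc1\n"
    "-25.0,-26.0,-12.0,,CCC\n"
)

for row in rows_above_threshold(sample):
    print(row["SMILES"], row["Score"])
```

For a real file, open it with `open(path, newline="")` and pass the file object in place of `sample`. No deduplication happens here, so every sampled row above the threshold is kept.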


vivianzheng-st · 2025-10-06T15:19:46Z

Author

Hi Halx,

Thank you so much for your response. I used the filter command "(head -n 1 CH1_stage2_1.csv && awk -F',' '$4 > 0.5 {print $0}' CH1_stage2_1.csv | tail -n +2 | sort -t',' -k5,5 -u -k4,4nr) > CH1_stage2_1_filtered.csv" to post-process my molecules: I want to retain molecules scoring > 0.5 and remove duplicates while keeping the highest-scoring entry for each SMILES. For this CSV it outputs 20 molecules. When I tested another RL run with a different batch size and the same command, it also output 20 molecules. My hypothesis was that the larger the batch size, the more "good" molecules (> 0.5) would be collected. This prompted me to look further, and when I opened staged2_learning.toml in vim I saw that the bucket size is 20, so I suspect that is the reason my post-processing step always outputs 20 molecules.

stage2_run.toml

run_type = "staged_learning"
device = "cpu"
tb_logdir = "tb_logs_CH1_stage2"

[parameters]
summary_csv_prefix = "CH1_stage2"
prior_file = "libinvent_transformer_pubchem.prior"
agent_file = "stage1.chkpt"
smiles_file = "scaffolds4attpoint.smi"
batch_size = 100
unique_sequences = true
randomize_smiles = true

[learning_strategy]
type = "dap"
sigma = 128
rate = 0.0001

# Use a stricter, diversity-pushing DF here

[diversity_filter]
type = "ScaffoldSimilarity" # or IdenticalTopologicalScaffold
bucket_size = 20
minscore = 0.50
minsimilarity = 0.4
penalty_multiplier = 0.5

[[stage]]
chkpt_file = "stage2.chkpt"
termination = "simple"
#min_steps = 50
max_steps = 200
max_score = 0.65

[stage.scoring]
type = "geometric_mean"
filename = "stage2_scoring.toml" # your Stage-2 scorefile
filetype = "toml"

CH1_stage2_1.csv

Update: I revised my filtering command (see below) and got 60 molecules scoring > 0.5. I would appreciate it if you could check whether there is a discrepancy between the two filtering commands that explains the two different molecule counts.
(
head -n 1 scoringCH1stage2.csv
awk -F',' 'NR>1 && NF>5 && $4 > 0.5 && $5 != "" {print $0}' scoringCH1stage2.csv \
| sort -t',' -k4,4nr \
| sort -t',' -k5,5 -u
) > scoringCH1stage2_filtered.csv

1 reply

halx Oct 9, 2025
Maintainer

I would suggest that you find a more robust way to work with CSV files. While there are ways to do this sensibly with awk, it would still mean writing a lot of boilerplate code. Programming languages like Python or R are much more suitable for such tasks.
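To make the Python suggestion concrete, the following is a minimal sketch of the post-processing described earlier in the thread: keep molecules scoring above 0.5, drop duplicate SMILES while keeping the highest-scoring row, and sort by score. It uses the standard csv module, which handles quoted fields that contain commas, whereas splitting with awk -F',' does not. The column names "Score" and "SMILES" and the sample rows are assumptions for illustration, not taken from the attached file.

```python
import csv
import io


def best_row_per_smiles(csv_file, score_column="Score",
                        smiles_column="SMILES", threshold=0.5):
    """Keep the highest-scoring row per SMILES among rows above the threshold.

    Column names are assumptions; adjust them to match your CSV header.
    SMILES strings are compared verbatim, without canonicalization.
    """
    best = {}
    for row in csv.DictReader(csv_file):
        smiles = row.get(smiles_column)
        try:
            score = float(row[score_column])
        except (TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric score
        if not smiles or score <= threshold:
            continue
        if smiles not in best or score > float(best[smiles][score_column]):
            best[smiles] = row
    # Highest score first, mirroring the intent of `sort -k4,4nr`.
    return sorted(best.values(),
                  key=lambda r: float(r[score_column]), reverse=True)


# Made-up sample data; the second field contains a comma inside quotes.
sample = io.StringIO(
    'Score,SMILES,Note\n'
    '0.60,CCO,"first, lower score"\n'
    '0.80,CCO,second\n'
    '0.70,CCN,kept\n'
    '0.40,CCC,below threshold\n'
)

for row in best_row_per_smiles(sample):
    print(row["SMILES"], row["Score"])
```

Counting the rows above the threshold before and after deduplication, for each run, would show whether the constant count of 20 comes from the data or from the shell pipeline.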

vivianzheng-st · 2025-10-08T00:02:36Z

Author

Hi Halx, can you please take a look and let me know? I highly appreciate your guidance and support!

