Skip to content

Conversation

speakinside
Copy link

The SMARTS expressions with commas in the parameters like "FilterSmartsInclusionList" will be wrongly split right now, since all the commas are split in the current code.
As far as I know, the comma in SMARTS expressions only appears in paired square brackets.
Therefore, I added a new function to prevent splitting the commas in SMARTS expressions based on the above fact.
Hope it can be useful.

@sneumann
Copy link
Member

Hi, cool, thanks for the contribution, I enabled the test run.
It would be great to also have a unit test for that, could you conjure one and add it to the PR ?
Yours, Steffen

@speakinside
Copy link
Author

Sure, one test case added

@sneumann
Copy link
Member

Hi, I can confirm that things compile, the test fails with current master and succeeds with your patch.
Before merging I need to understand if possibly the comma splitting might originally have been a feature, in that you could provide multiple, comma-separated smarts, which would then be AND (or OR ?!) combined when going through the candidates. I can't remember and I am not sure @c-ruttkies is still reading the issues.
Yours, Steffen

@schymane
Copy link
Collaborator

I was wondering the same when I saw this...

@speakinside
Copy link
Author

I found the problem when I used the "FilterSmartsInclusionList" parameter to pre-filter the candidates.

the value I used for this parameter was

[$(P(=O)(-O)(-O)-O-[$(c1c(-[CH3])cccc1),$(c1cc(-[CH3])ccc1),$(c1ccc(-[CH3])cc1)]),$(P(=O)(-O)(-O-[CH3])-O-c1ccccc1)], P(=O)(-O-[#6])(-O-[#6])-O-[#6]

So the right split would be [$(P(=O)(-O)(-O)-O-[$(c1c(-[CH3])cccc1),$(c1cc(-[CH3])ccc1),$(c1ccc(-[CH3])cc1)]),$(P(=O)(-O)(-O-[CH3])-O-c1ccccc1)] and P(=O)(-O-[#6])(-O-[#6])-O-[#6].

Any other comma split would result in incorrect SMARTS expressions, causing errors in the constructor function of PreProcessingCandidateSmartsExclusionFilter, and that was the case in the original code.

Also, I suspected the "FilterSmartsExclusionList" parameter would face the same issue.

So, I doubt splitting all commas would be a feature.

@schymane
Copy link
Collaborator

OK so part of the problem is that (as you mentioned) commas are valid separators within SMARTS expressions contained within square brackets (as are semi-colons), but the expression you provided seems to require a comma and space combination as a separator for examples that are not in square brackets (which is two separators, not one). Is the space mandatory in the fix?
https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html

Are you able to provide a candidate list/query parameters that you used together with this SMARTS expression, to test this for ourselves?

@speakinside
Copy link
Author

No, the space is not a part of the separator. The original code and mine both trim the substring.

Here is a parameter file I used

PeakListString = 54.3690071105957_27058.451171875;55.05453109741211_74278.6484375;57.06821823120117_48136.6015625;57.06988525390625_948328.5625;59.04935836791992_32604.375;61.01066207885742_28063.423828125;62.989768981933594_17906.287109375;65.03873443603516_30178.607421875;67.05453491210938_52008.12890625;69.07018280029297_61650.3203125;73.04679870605469_230502.609375;77.03880310058594_61446.36328125;79.02117919921875_25791.615234375;79.05416107177734_79356.390625;81.07023620605469_57066.27734375;83.08582305908203_29242.76953125;85.06497955322266_15559.923828125;87.04420471191406_21584.623046875;87.36715698242188_13201.529296875;89.05963897705078_116576.34375;91.05439758300781_280626.96875;91.05768585205078_307414.46875;91.06072235107422_19726.9765625;93.07002258300781_38176.28515625;95.0490951538086_606452.4375;95.08551025390625_62714.35546875;97.10147857666016_17058.701171875;103.05426025390625_15972.8193359375;104.06221771240234_34833.0078125;105.04485321044922_72184.390625;105.07003784179688_65940.0078125;105.07392120361328_64197.87890625;107.0494155883789_62170.5546875;107.08567810058594_18448.951171875;109.10155487060547_15934.921875;115.05462646484375_18037.431640625;117.06576538085938_16135.26171875;117.0698471069336_81131.0625;118.53775787353516_14241.517578125;119.08584594726562_53468.73046875;121.1013412475586_27691.732421875;131.0859375_22171.34375;132.093505859375_32895.22265625;133.0863037109375_73942.46875;135.11685180664062_26869.81640625;136.857177734375_14529.3310546875;141.0098114013672_70642.734375;147.11672973632812_64703.1875;149.04483032226562_76752.3515625;153.07058715820312_27529.05078125;167.05543518066406_109288.6640625;169.0595245361328_12721.1005859375;177.11293029785156_22520.68359375;189.03167724609375_41526.91796875;202.07769775390625_485224.84375;203.0855255126953_182898.609375;210.18572998046875_22085.40625;248.34156799316406_12575.2294921875;265.0437316894531_31593.9765625;265.047607421875_21540.63671875;265.0625_1474505.875;266.0652770996094_15369.767578125;291.11663818359375_15457.03125;299.0623474121094_15320.0947265625;309.27880859375_233498.59375;310.281982421875_28852.89453125;314.92999267578125_18791.0703125;321.1249694824219_2690016.5;322.128173828125_32523.484375;330.9615783691406_17219.33984375;344.97607421875_65474.859375;377.18756103515625_4416049.5;378.1905517578125_87258.2578125;400.98455810546875_208835.609375;401.9857482910156_27875.16796875;402.9632263183594_68886.21875;418.9955139160156_554556.3125;419.03192138671875_25242.572265625;419.99481201171875_90851.1875;433.010986328125_31458.71484375;433.2500915527344_2850659.0;434.2517395019531_69297.953125;454.3619079589844_14023.400390625;459.02337646484375_16652.669921875;489.0550537109375_119971.4609375;489.3118591308594_679491.75;489.40399169921875_22947.23046875;489.4471435546875_65965.1875;490.0562744140625_37209.22265625
MetFragPeakListReader = de.ipbhalle.metfraglib.peaklistreader.FilteredStringTandemMassPeakListReader
MetFragDatabaseType = PubChem
NeutralPrecursorMolecularFormula = C29H45O4P
IonizedPrecursorMass = 489.313446044922
FragmentPeakMatchAbsoluteMassDeviation = 0.001
FragmentPeakMatchRelativeMassDeviation = 5
PrecursorIonMode = 1
IsPositiveIonMode = True
MetFragScoreTypes = FragmenterScore
MetFragScoreWeights = 1
MetFragCandidateWriter = CSV
SampleName = 11665_[C12]29[H1]45[O16]4[P31]_489.313446044922
ResultsPath = compute_dir
MaximumTreeDepth = 2
FilterSmartsInclusionList = [$(P(=O)(-O)(-O)-O-[$(c1c(-[CH3])cccc1),$(c1cc(-[CH3])ccc1),$(c1ccc(-[CH3])cc1)]),$(P(=O)(-O)(-O-[CH3])-O-c1ccccc1)],P(=O)(-O-[#6])(-O-[#6])-O-[#6]
MetFragPreProcessingCandidateFilter = UnconnectedCompoundFilter, SmartsSubstructureInclusionFilter
MetFragPostProcessingCandidateFilter = InChIKeyFilter
NumberThreads = 2

@sneumann
Copy link
Member

What irks me is that we have that nowhere documented, maybe I could ask for a piece of documentation and an example ? I tried to figure where best to put that, and frankly only found FilterSmartsInclusionList
in
https://github.com/ipb-halle/MetFrag/blob/978eeb259bd4ad7680899f6c6b3358089d9c712d/projects/metfragr/index.md?plain=1#L138

@speakinside
Copy link
Author

I primarily use the command line tool in my research.

Based on my understanding, the FilterSmartsInclusionList and FilterSmartsExclusionList parameters define the SMARTS patterns used for substructure filtering when the SmartsSubstructureInclusionFilter and SmartsSubstructureExclusionFilter options are specified in the MetFragPreProcessingCandidateFilter parameter. These lists determine which molecular substructures should be either included or excluded during candidate filtering.

Regarding your example, would the parameter list I provided in my previous comment meet your requirements?

No, the space is not a part of the separator. The original code and mine both trim the substring.

Here is a parameter file I used

PeakListString = 54.3690071105957_27058.451171875;55.05453109741211_74278.6484375;57.06821823120117_48136.6015625;57.06988525390625_948328.5625;59.04935836791992_32604.375;61.01066207885742_28063.423828125;62.989768981933594_17906.287109375;65.03873443603516_30178.607421875;67.05453491210938_52008.12890625;69.07018280029297_61650.3203125;73.04679870605469_230502.609375;77.03880310058594_61446.36328125;79.02117919921875_25791.615234375;79.05416107177734_79356.390625;81.07023620605469_57066.27734375;83.08582305908203_29242.76953125;85.06497955322266_15559.923828125;87.04420471191406_21584.623046875;87.36715698242188_13201.529296875;89.05963897705078_116576.34375;91.05439758300781_280626.96875;91.05768585205078_307414.46875;91.06072235107422_19726.9765625;93.07002258300781_38176.28515625;95.0490951538086_606452.4375;95.08551025390625_62714.35546875;97.10147857666016_17058.701171875;103.05426025390625_15972.8193359375;104.06221771240234_34833.0078125;105.04485321044922_72184.390625;105.07003784179688_65940.0078125;105.07392120361328_64197.87890625;107.0494155883789_62170.5546875;107.08567810058594_18448.951171875;109.10155487060547_15934.921875;115.05462646484375_18037.431640625;117.06576538085938_16135.26171875;117.0698471069336_81131.0625;118.53775787353516_14241.517578125;119.08584594726562_53468.73046875;121.1013412475586_27691.732421875;131.0859375_22171.34375;132.093505859375_32895.22265625;133.0863037109375_73942.46875;135.11685180664062_26869.81640625;136.857177734375_14529.3310546875;141.0098114013672_70642.734375;147.11672973632812_64703.1875;149.04483032226562_76752.3515625;153.07058715820312_27529.05078125;167.05543518066406_109288.6640625;169.0595245361328_12721.1005859375;177.11293029785156_22520.68359375;189.03167724609375_41526.91796875;202.07769775390625_485224.84375;203.0855255126953_182898.609375;210.18572998046875_22085.40625;248.34156799316406_12575.2294921875;265.0437316894531_31593.9765625;265.047607421875_21540.63671875;265.0625_1474505.875;266.0652770996094_15369.767578125;291.11663818359375_15457.03125;299.0623474121094_15320.0947265625;309.27880859375_233498.59375;310.281982421875_28852.89453125;314.92999267578125_18791.0703125;321.1249694824219_2690016.5;322.128173828125_32523.484375;330.9615783691406_17219.33984375;344.97607421875_65474.859375;377.18756103515625_4416049.5;378.1905517578125_87258.2578125;400.98455810546875_208835.609375;401.9857482910156_27875.16796875;402.9632263183594_68886.21875;418.9955139160156_554556.3125;419.03192138671875_25242.572265625;419.99481201171875_90851.1875;433.010986328125_31458.71484375;433.2500915527344_2850659.0;434.2517395019531_69297.953125;454.3619079589844_14023.400390625;459.02337646484375_16652.669921875;489.0550537109375_119971.4609375;489.3118591308594_679491.75;489.40399169921875_22947.23046875;489.4471435546875_65965.1875;490.0562744140625_37209.22265625
MetFragPeakListReader = de.ipbhalle.metfraglib.peaklistreader.FilteredStringTandemMassPeakListReader
MetFragDatabaseType = PubChem
NeutralPrecursorMolecularFormula = C29H45O4P
IonizedPrecursorMass = 489.313446044922
FragmentPeakMatchAbsoluteMassDeviation = 0.001
FragmentPeakMatchRelativeMassDeviation = 5
PrecursorIonMode = 1
IsPositiveIonMode = True
MetFragScoreTypes = FragmenterScore
MetFragScoreWeights = 1
MetFragCandidateWriter = CSV
SampleName = 11665_[C12]29[H1]45[O16]4[P31]_489.313446044922
ResultsPath = compute_dir
MaximumTreeDepth = 2
FilterSmartsInclusionList = [$(P(=O)(-O)(-O)-O-[$(c1c(-[CH3])cccc1),$(c1cc(-[CH3])ccc1),$(c1ccc(-[CH3])cc1)]),$(P(=O)(-O)(-O-[CH3])-O-c1ccccc1)],P(=O)(-O-[#6])(-O-[#6])-O-[#6]
MetFragPreProcessingCandidateFilter = UnconnectedCompoundFilter, SmartsSubstructureInclusionFilter
MetFragPostProcessingCandidateFilter = InChIKeyFilter
NumberThreads = 2

Additionally, I've implemented some code updates that now properly handle commas between bond primitives in SMARTS notation (e.g., C=,-C or C@,!-C) without splitting them.

@speakinside
Copy link
Author

I see some tests have failed.
However, the test failure is not caused by my code but the change of PubChem API #212.

It should be easy to fix. Should I fix it in my pull request? I worry it may be irrelevant to my pull request.

@sneumann
Copy link
Member

Hi, thanks for checking, and a separate PR with that fix would be highly appreciated,
so we can merge that first and then re-visit this one. Yours, Steffen

…rtsInclusionList" or "FilterSmartsExclusionList" parameters.
…erSmartsInclusionList" or "FilterSmartsExclusionList" parameters.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants