Java Solutions for Cheminformatics

Power Search: how to tune search for efficiency and performance
June 2006 UGM

Tune for efficiency
Get the results you want:
- How to formulate the query?
- Which search type?
- Which search parameters?
- Using non-structural conditions, such as predictions?
- Non-chemical data?

Tune for performance
Getting results faster:
- Why are fingerprints important?
- What is in a chemical hashed fingerprint?
- What are the fingerprint parameters and how to set them?
- Why use additional structural keys?
- What are good structural keys?

Atom-by-atom searching tips and tricks
Searching out of a database:
- How does it work?
- Can I have all matches?
- How can I tweak it? (Are there any options?)

Efficiency: How to formulate the query?
- Query atom and bond types
- Query properties: new keyboard shortcuts in Marvin: .<prop>
- Explicit hydrogens
- R-groups; R-logic
- SMARTS atoms and bonds
References: JChem Query Guide; Structural Searching using ChemAxon Tools

Which search type?
- Substructure
- Superstructure
- Exact fragment: the query must match a full fragment (same effect as giving every query atom the s* query property)
- Exact: same size = no other fragments
- Perfect: for duplicate checking

Examples: Which search type?
- Main difference between Substructure and Superstructure: [Figure: query and the structures retrieved by each search type]
- Main difference between Exact fragment and Exact: [Figure: query and the structures retrieved by each search type]
- Main difference between Exact and Perfect: [Figure: query and the structures retrieved by each search type]

Which search parameters?
Selection of parameters
- Double bond stereo matching: All/Marked/None
- Exact…: Isotope, Charge, Radical, Stereo, QueryAtom, Bond
- Ignore…: Isotope*, Charge*, Radical*, Valence*, Stereo
- Vague bond options*: rings with ambiguous aromaticity, treat bonds as “or aromatic”, ignore all bond types
- Tautomer search*
* New from JChem 3.2; released in alpha

Vague bond options explained
- Level 1: handling of 5-membered ambiguous aromatic rings (will be the default as a result of UGM discussions)
- Level 2: treat all query ring bonds as “or aromatic”
- Level 3: treat all query bonds as “or aromatic”
- Level 4: ignore all bond types

Vague bond level 1: 5-membered ambiguous aromatic rings
[Figure: “visually expected” hits, all obtained at vague bond level 1 (default from JChem 3.2)]
[Figure: “visually expected” hits that are not found without the vague bond option]

Vague bond level 2: all query ring bonds “or aromatic”
[Figure: “visually expected” hits]
[Figure: “visually expected” hits that are not found without the vague bond option]

Vague bond level 3: all query bonds “or aromatic”
[Figure: hits at vague bond level 3 compared with hits without the vague bond option]

Vague bond level 4: ignore all bond types
[Figure: hits at vague bond level 4 compared with hits without the vague bond option]

Efficiency: Using conditions
Chemical conditions on molecules? Use a Chemical Terms filter.
Example:
(mass() <= 500) && (logP() <= 5) && (donorCount() <= 5) && (acceptorCount() <= 10);
References: Chemical Terms Language Reference; Chemical Terms built-in functions

Chemical Terms example
Search context functions to access search objects:

Name              Description
mol(), target()   target (DB) molecule; the default molecule object
query()           search query
hit(), h()        returned hit array (indexing is 0-based!)

Examples:
1. ringcount(target()) == ringcount(query()) + 2
   (the target molecule has 2 more rings than the query; works with all queries)
2.
   apka(hit(7)) > 10
   [Figure: query structure]
   (the matching nitrogen has an acidic pKa greater than 10)

Efficiency: Using conditions
Non-chemical data?
- Filterquery parameters: see the JChem Base workshop by Szilárd Doránt
- Cartridge: filterquery + join operations: see the Cartridge workshop by Péter Kovács

Performance: Why are fingerprints important?
Two-stage method of searching:
1. Fingerprint screening: fast and effective, ~0.1 s for 3M molecules. A large proportion of non-hits is filtered out, leaving the result candidates.
2. Atom-by-atom search: slower, but exact. Applied to the result candidates, it yields the results.
[Diagram labels: Query; Structure Cache; JChem Table (molecules, fingerprints); Fingerprint screening; Result candidates; Atom-by-atom search; Results]

What is in a chemical hashed fingerprint?
Fingerprints are fixed-length bit strings:
1011100000001010000000000000011100111000111000000001110000001000
How are the bits set?
1. All of the following patterns are enumerated:
   - Linear paths in the molecule up to a length limit (a fingerprint parameter)
   - Rings up to a size (14)
2. Several hash codes are created for each pattern, and the corresponding bits are set to 1.

Example:
[Figure: patterns in the molecule (note: all substructures!)]
The hashing function uses atom type and bond type information.
1011100000001010000000000000011100111000111000000001110000001000
Bit clashes are allowed.

How does fingerprint screening work?
A superstructure contains all patterns of its substructure. Therefore, the superstructure's fingerprint must contain the substructure's fingerprint.
Example:
Query fingerprint:     1010100000
Target fingerprint 1:  1011101000   OK: every query bit is set
Target fingerprint 2:  1010001000   Missing pattern! The query cannot be a substructure of this target.

Fingerprint darkness
Darkness: % of 1s in the fingerprint
“Black” fingerprint: 11111111111111111111111111111111
“White” fingerprint: 00000000000000000000000000000000
A darker fingerprint increases the probability of bit clashes.
Bit clash = more patterns per bit = less effective screening

What are the fingerprint parameters and how to set them?
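To make these parameters concrete, here is a minimal Java sketch of a toy hashed fingerprint. It is not JChem's hashing scheme: the class `ToyFingerprint`, its methods, and the pattern strings are hypothetical, and patterns are plain strings hashed with `String.hashCode()`. It only illustrates how fingerprint length and bits per pattern enter the picture, how screening reduces to a bit-subset test, and how darkness is measured.

```java
import java.util.BitSet;
import java.util.List;

public class ToyFingerprint {

    /** Hashes each pattern string to bitsPerPattern positions in a bit set of the given length. */
    static BitSet fingerprint(List<String> patterns, int length, int bitsPerPattern) {
        BitSet fp = new BitSet(length);
        for (String pattern : patterns) {
            for (int k = 0; k < bitsPerPattern; k++) {
                // Salting with k yields several hash codes per pattern (illustrative hashing only).
                fp.set(Math.floorMod((pattern + "#" + k).hashCode(), length));
            }
        }
        return fp;
    }

    /** Parses a bit string such as "1010100000" (leftmost character is bit 0). */
    static BitSet fromBits(String bits) {
        BitSet fp = new BitSet(bits.length());
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') fp.set(i);
        }
        return fp;
    }

    /** Screening: a target can only contain the query if it has every bit the query has. */
    static boolean passesScreen(BitSet query, BitSet target) {
        BitSet missing = (BitSet) query.clone();
        missing.andNot(target); // bits set in the query but not in the target
        return missing.isEmpty();
    }

    /** Darkness: fraction of bits set to 1. */
    static double darkness(BitSet fp, int length) {
        return (double) fp.cardinality() / length;
    }

    public static void main(String[] args) {
        BitSet query = fromBits("1010100000");
        System.out.println(passesScreen(query, fromBits("1011101000"))); // true: stays a candidate
        System.out.println(passesScreen(query, fromBits("1010001000"))); // false: missing pattern

        // Hypothetical pattern strings; darkness tends to drop as the length grows.
        List<String> patterns = List.of("C-C", "C-O", "C=O", "C-C-O", "C-C=O");
        for (int length : new int[] {16, 64, 512}) {
            BitSet fp = fingerprint(patterns, length, 2);
            System.out.printf("length=%d darkness=%.2f%n", length, darkness(fp, length));
        }
    }
}
```

A target that passes the screen is only a candidate: because bit clashes are allowed, the exact atom-by-atom search still has to confirm it.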
- Fingerprint length
- Maximum pattern length
- Number of bits per pattern (may decrease the probability of a full clash)
  1011001100110011

Fingerprint optimization goals (for a table):
- Maximize fingerprint information AND
- Minimize bit clashes, that is, keep fingerprints from getting too dark

Optimal fingerprint darkness for a table:
- Average is ~40%
- Maximum is less than 80%

Effect of changing the fingerprint parameters:

Fingerprint parameter    Parameter change   Information content   Darkness
Fingerprint length       ↑                  -                     ↓
                         ↓                  -                     ↑
Max. pattern length      ↑                  ↑                     ↑
                         ↓                  ↓                     ↓
# of bits per pattern    ↑                  ↑                     ↑
                         ↓                  ↓                     ↓
Optimization target:                        high                  40%

Which queries have poor fingerprints?
Queries with indefinite atom/bond types:
- Query atoms: A, Q, L, !L
- Query bonds: any, single/aromatic, single/double
- H atoms are excluded from fingerprints
But “fingerprint mining” is done for SMARTS atoms:
- [CX4,CH3,CD2] == [C;X4,H3,D2]: carbon is recognized.
- [$(CCC(=O)N)]: the inner substructure of a recursive SMARTS is used (if it is non-negated and not part of an "or").

How to define additional keys?
Additional structural keys: you may define your own substructure patterns (queries) to improve screening.
1011100000001010000000000000011100111000111000000001110000101000
[Figure labels: chemical hashed fingerprint; structural keys (1 bit per pattern)]
Advantages achieved:
- No bit clash: each pattern has a reserved bit
- If query == pattern: instant result
- Improved screening (speed), if the patterns introduce more information

Which are good structural keys?
The substructure criterion must hold in all search settings:
For all M1 ⊂ M2: key matches M1 ⇒ key matches M2
Safe query features in structural keys:
- Atom, bond types
- Query atoms not matching hydrogens: A, Q, L
- a, A, u, R query properties
- Recursive SMARTS
- SMARTS atoms not containing negations and non-allowed query features

How to define additional keys?
Examples of good functional keys:
[Figure: example key structures]
Wrong functional keys:
- [Figure: key structure] Wrong when the ignore isotopes option is on
- [Figure: key structure] Wrong when the ignore charge option is on
- [Figure: key structure, with query Q and target T] Wrong

Searching out of a database
How does atom-by-atom searching work? (chemaxon.sss.MolSearch)
Ullmann algorithm:
1. A boolean matrix stores the possible query atom - target atom matchings
2. Initial matching: local properties
3. Refining (rarefying) based on neighbours
4. Backtrack to find matchings: one matching per row
Example:
[Figure: query and target structures with the initial, refined, and first-match boolean matrices; rows are the 2 query atoms (0-1), columns are the 6 target atoms (0-5), and T marks a possible matching]
Reference: J. R. Ullmann: An algorithm for subgraph isomorphism. J. ACM, 23(1):31-42, 1976.

Is it exhaustive?
- Backtracking may stop at the first hit: isMatching()
- Or iterate through all matchings: findFirst(), findNext(), findAll()
- Order sensitive option for overlapping hits:
  [Figure: query and hits ...]

How can I tweak it?
- Search parameters: all search types and parameters mentioned earlier!
- If you know part of the hit: addMatch(queryAtom, targetAtom)
  Example: [Figure: query and target] With the known atom pair added, the search produces only 1 hit instead of 3!

Future Developments
• Markush structure handling in JChem Base / Cartridge / Marvin
• Further flexibility options: more sophisticated charge matching
• Reaction center search
• Mixture/Component searching
• Coordinate (3D) search
• Further query atom types (Metal, AnyH, etc.)

Summary
• Sophisticated search parameters
• For speed, fingerprints are essential and are tunable
• Atom-by-atom searching from the API (outside the database) allows flexible matching too

Further reading
• Structural Search
• JChem Base
• JChem Cartridge
• Standardizer
• JChem Query Guide
• JChem online demo
• Chemical Terms Language
• Chemical Terms Functions

Find out more
• Product descriptions & links
• Forum
• Presentations and posters
• Download
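
Supplementary sketch: atom-by-atom matching
The four Ullmann steps listed under "How does atom-by-atom searching work?" can be sketched in plain Java. This is a simplified illustration and not the chemaxon.sss.MolSearch implementation: the class `UllmannSketch` and all of its methods are hypothetical, molecules are reduced to atom symbols plus an adjacency matrix, and bond types, hydrogens and query features are ignored.

```java
import java.util.ArrayList;
import java.util.List;

public class UllmannSketch {
    final String[] qLabels, tLabels; // atom symbols of query and target
    final boolean[][] qAdj, tAdj;    // adjacency matrices (bond types ignored)

    UllmannSketch(String[] qLabels, boolean[][] qAdj, String[] tLabels, boolean[][] tAdj) {
        this.qLabels = qLabels; this.qAdj = qAdj; this.tLabels = tLabels; this.tAdj = tAdj;
    }

    static boolean[][] adjacency(int n, int[][] bonds) {
        boolean[][] adj = new boolean[n][n];
        for (int[] b : bonds) { adj[b[0]][b[1]] = true; adj[b[1]][b[0]] = true; }
        return adj;
    }

    static int degree(boolean[][] adj, int atom) {
        int d = 0;
        for (boolean bonded : adj[atom]) if (bonded) d++;
        return d;
    }

    /** Steps 1-2: boolean matrix of possible matchings from local properties (symbol, degree). */
    boolean[][] initialMatrix() {
        boolean[][] c = new boolean[qLabels.length][tLabels.length];
        for (int i = 0; i < qLabels.length; i++)
            for (int j = 0; j < tLabels.length; j++)
                c[i][j] = qLabels[i].equals(tLabels[j]) && degree(qAdj, i) <= degree(tAdj, j);
        return c;
    }

    /** Step 3: drop (i, j) when some neighbour of query atom i has no candidate next to target atom j. */
    void refine(boolean[][] c) {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int i = 0; i < c.length; i++)
                for (int j = 0; j < c[i].length; j++)
                    if (c[i][j] && !neighboursSupported(c, i, j)) { c[i][j] = false; changed = true; }
        }
    }

    private boolean neighboursSupported(boolean[][] c, int i, int j) {
        for (int x = 0; x < qAdj.length; x++) {
            if (!qAdj[i][x]) continue;
            boolean found = false;
            for (int y = 0; y < tAdj.length && !found; y++) found = tAdj[j][y] && c[x][y];
            if (!found) return false;
        }
        return true;
    }

    /** Step 4: backtracking, one target atom per query atom; optionally stop at the first hit. */
    List<int[]> findAll(boolean firstOnly) {
        boolean[][] c = initialMatrix();
        refine(c);
        List<int[]> hits = new ArrayList<>();
        backtrack(c, 0, new int[qLabels.length], new boolean[tLabels.length], hits, firstOnly);
        return hits;
    }

    private void backtrack(boolean[][] c, int i, int[] map, boolean[] used,
                           List<int[]> hits, boolean firstOnly) {
        if (firstOnly && !hits.isEmpty()) return;
        if (i == map.length) { hits.add(map.clone()); return; }
        for (int j = 0; j < used.length; j++) {
            if (!c[i][j] || used[j] || !bondsPreserved(i, j, map)) continue;
            map[i] = j; used[j] = true;
            backtrack(c, i + 1, map, used, hits, firstOnly);
            used[j] = false;
        }
    }

    /** Every bond from query atom i to an already mapped query atom must exist in the target. */
    private boolean bondsPreserved(int i, int j, int[] map) {
        for (int x = 0; x < i; x++) if (qAdj[i][x] && !tAdj[j][map[x]]) return false;
        return true;
    }

    public static void main(String[] args) {
        // Query C-O against a C-C-O chain (hydrogens omitted).
        UllmannSketch s = new UllmannSketch(
            new String[] {"C", "O"}, adjacency(2, new int[][] {{0, 1}}),
            new String[] {"C", "C", "O"}, adjacency(3, new int[][] {{0, 1}, {1, 2}}));
        for (int[] hit : s.findAll(false)) System.out.println(java.util.Arrays.toString(hit)); // [1, 2]
    }
}
```

In this sketch, symmetric mappings count as separate hits: a C-C query on a C-C-C chain yields four atom mappings, which is the kind of overlap that an order sensitive option has to decide about. Passing `true` to `findAll` mimics stopping at the first hit.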