Java Solutions for Cheminformatics

Power Search: how to tune search for efficiency and performance
June 2006 UGM

Tune for efficiency: get the results you want
- How to formulate the query?
- Which search type?
- Which search parameters?
- Using non-structural conditions, such as predictions?
- Non-chemical data?

Tune for performance: get results faster
- Why are fingerprints important?
- What is in a chemical hashed fingerprint?
- What are the fingerprint parameters and how to set them?
- Why use additional structural keys?
- What are good structural keys?

Atom-by-atom searching tips and tricks
Searching outside a database
- How does it work?
- Can I have all matches?
- How can I tweak it? (Are there any options?)

Efficiency: How to formulate the query?
- Query atom and bond types
- Query properties: new keyboard shortcuts in Marvin: .<prop>
- Explicit hydrogens
- R-groups; R-logic
- SMARTS atoms and bonds
References:
JChem Query Guide: http://www.chemaxon.com/jchem/doc/user/Query.html
Structural Searching using ChemAxon Tools: http://www.chemaxon.com/conf/Structural_Search.ppt

Which search type?
- Substructure
- Superstructure
- Exact fragment: the query must match a full fragment (same effect as giving every query atom the s* query property)
- Exact: same size, i.e. no other fragments
- Perfect: for duplicate checking
Examples: http://www.chemaxon.com/jchem/doc/user/QueryMatchExamples.html

Which search type? Main differences (figures: query and retrieved structures for each pair)
- Substructure vs. Superstructure
- Exact fragment vs. Exact
- Exact vs. Perfect

Which search parameters?
Selection of parameters
- Double bond stereo matching: All/Marked/None
- Exact…: Isotope, Charge, Radical, Stereo, QueryAtom, Bond
- Ignore…: Isotope*, Charge*, Radical*, Valence*, Stereo
- Vague bond options*: ambiguous aromatic rings, treat bonds as “or aromatic”, ignore all bond types
- Tautomer search*
* New from JChem 3.2; released in alpha

Vague bond options explained
- Level 1: handling of 5-membered ambiguous aromatic rings (will be the default as a result of UGM discussions)
- Level 2: treat all query ring bonds as “or aromatic”
- Level 3: treat all query bonds as “or aromatic”
- Level 4: ignore all bond types

Vague bond level 1: 5-membered ambiguous aromatic rings
(figure: “visually expected” hits, all obtained at vague bond level 1, the default from JChem 3.2; some of them are not found without the vague bond option)

Vague bond level 2: all query ring bonds “or aromatic”
(figure: “visually expected” hits; some of them are not found without the vague bond option)

Vague bond level 3: all query bonds “or aromatic”
(figure: hits at vague bond level 3 compared with hits without the vague bond option)

Vague bond level 4: ignore all bond types
(figure: hits at vague bond level 4 compared with hits without the vague bond option)

Efficiency: Using conditions
Chemical conditions on molecules? Use a Chemical Terms filter.
Example:
(mass() <= 500) && (logP() <= 5) && (donorCount() <= 5) && (acceptorCount() <= 10);
Chemical Terms Language Reference: http://www.chemaxon.com/jchem/doc/user/EvaluatorLanguage.html
Chemical Terms built-in functions: http://www.chemaxon.com/jchem/doc/user/EvaluatorTables.html

Chemical Terms example
Search context functions to access search objects:
Name              Description
mol(), target()   target (DB) molecule; the default molecule object
query()           search query
hit(), h()        returned hit array (indexing is 0-based!)
Examples:
1.
ringcount(target()) == ringcount(query()) + 2
   (the target molecule has 2 more rings than the query; works with all queries)
2. apka(hit(7)) > 10
   (the matching nitrogen has an acidic pKa greater than 10; the query structure is shown as a figure)

Efficiency: Using conditions
Non-chemical data?
- Filterquery parameters -> see the JChem Base workshop by Szilárd Doránt
- Cartridge: filterquery + join operations -> see the Cartridge workshop by Péter Kovács

Performance: Why are fingerprints important?
Two-stage method of searching:
1. Fingerprint screening: fast and effective, ~0.1 s for 3M molecules; a large proportion of non-hits is filtered out
2. Atom-by-atom search: slower, but exact
Flow: Query -> Fingerprint screening -> Result candidates -> Atom-by-atom search -> Results
Data: Structure Cache / JChem Table
• Molecules
• Fingerprints

What is in a chemical hashed fingerprint?
Fingerprints are fixed-length bit strings:
1011100000001010000000000000011100111000111000000001110000001000
How are the bits set?
1. All of the following patterns are enumerated:
   - Linear paths in the molecule up to a length limit (fingerprint parameter)
   - Rings up to a size limit (14)
2. Several hash codes are created for each pattern, and the corresponding bits are set to 1.
Reference: http://www.chemaxon.com/jchem/doc/user/fingerprint.html

What is in a chemical hashed fingerprint? Example
(figure: patterns in the molecule; note that all substructures are included)
The hashing function uses atom type and bond type information.
1011100000001010000000000000011100111000111000000001110000001000
Bit clashes are allowed.

How does fingerprint screening work?
A superstructure contains all patterns of its substructure. Therefore, the superstructure's fingerprint must contain every bit of the substructure's fingerprint.
Example:
Query fingerprint:    1010100000    1010100000
Target fingerprint:   1011101000    1010001000
Result:               OK            Missing pattern! The query cannot be a substructure of this target.
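The screening rule above amounts to a bitwise subset test. The following is a minimal sketch of that test using the two example fingerprints from the slide; the class and method names (`FingerprintScreen`, `parse`, `passesScreen`) are hypothetical and not part of the JChem API, and the sketch says nothing about how JChem actually stores or compares fingerprints.

```java
import java.util.BitSet;

// Illustrative only: a toy fingerprint screen, not JChem code.
class FingerprintScreen {

    /** Parses a string of '0'/'1' characters into a BitSet (index 0 = leftmost character). */
    static BitSet parse(String bits) {
        BitSet fp = new BitSet(bits.length());
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') {
                fp.set(i);
            }
        }
        return fp;
    }

    /**
     * Returns true if every bit set in the query fingerprint is also set in the
     * target fingerprint, i.e. the target remains a candidate for atom-by-atom search.
     */
    static boolean passesScreen(BitSet query, BitSet target) {
        BitSet missing = (BitSet) query.clone();
        missing.andNot(target);      // bits required by the query but absent from the target
        return missing.isEmpty();
    }

    public static void main(String[] args) {
        BitSet query = parse("1010100000");
        System.out.println(passesScreen(query, parse("1011101000"))); // prints true
        System.out.println(passesScreen(query, parse("1010001000"))); // prints false
    }
}
```

A target that passes the screen is only a candidate: because bit clashes are allowed, the atom-by-atom stage still has to confirm the match.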
Fingerprint darkness
Darkness: percentage of 1s in the fingerprint
“Black” fingerprint: 11111111111111111111111111111111
“White” fingerprint: 00000000000000000000000000000000
A darker fingerprint increases the probability of bit clashes.
Bit clash = more patterns per bit = less effective screening

What are the fingerprint parameters and how to set them?
- Fingerprint length
- Maximum pattern length
- Number of bits per pattern (may decrease the probability of a full clash)
  1011001100110011
Fingerprint optimization goals (for a table):
- Maximize fingerprint information, AND
- Minimize bit clashes, i.e. avoid fingerprints that are too dark
Optimal fingerprint darkness for a table:
- Average is ~40%
- Maximum is less than 80%

What are the fingerprint parameters and how to set them?
Effect of changing the fingerprint parameters:
Fingerprint parameter    Parameter change   Information content   Darkness
Fingerprint length       ↑                  -                     ↓
                         ↓                  -                     ↑
Max. pattern length      ↑                  ↑                     ↑
                         ↓                  ↓                     ↓
# of bits per pattern    ↑                  ↑                     ↑
                         ↓                  ↓                     ↓
Optimization target (goal):                 high                  40%

Which queries have poor fingerprints?
Queries with indefinite atom/bond types:
- Query atoms: A, Q, L, !L
- Query bonds: any, single/aromatic, single/double
- H atoms are excluded from fingerprints
But “fingerprint mining” is done for SMARTS atoms:
- [CX4,CH3,CD2] == [C;X4,H3,D2]: carbon is recognized.
- [$(CCC(=O)N)]: the inner substructure of a recursive SMARTS is used (non-negated, non-or).

How to define additional keys?
Additional structural keys: you may define your own substructure patterns (queries) to improve screening:
1011100000001010000000000000011100111000111000000001110000101000
(chemical hashed fingerprint, followed by the structural keys: 1 bit per pattern)
Advantages:
- No bit clash: each pattern has a reserved bit
- If query == pattern: instant result
- Improved screening (speed) if the patterns introduce more information

How to define additional keys? Which are good structural keys?
The substructure criterion must hold in all search settings:
For all M1 ⊂ M2: key matches M1 ⇒ key matches M2
Safe query features in structural keys:
- Atom and bond types
- Query atoms not matching hydrogens: A, Q, L
- a, A, u, R query properties
- Recursive SMARTS
- SMARTS atoms not containing negations or non-allowed query features

How to define additional keys?
Examples of good functional keys: (structure figures)
Wrong functional keys: (structure figures)
- Wrong when the ignore isotopes option is on
- Wrong when the ignore charge option is on
- Wrong for the query (Q) and target (T) pair shown in the figure

Searching outside a database
How does atom-by-atom searching work? (chemaxon.sss.MolSearch)
Ullmann algorithm:
1. A boolean matrix stores the possible query atom - target atom matchings
2. Initial matching: local properties
3. Refining (rarefying) based on neighbours
4. Backtracking to find matchings: one matching per row
Example (figure: query and target structures, with the matrix of 2 query atoms × 6 target atoms shown as initial matrix, refined matrix and first match)
First match:
Target atoms   [ 0 1 2 3 4 5 ]
Query atom 0   [ . . . . . T ]
Query atom 1   [ . T . . . . ]
Reference: J. R. Ullmann: An algorithm for subgraph isomorphism. J. ACM, 23(1):31-42, 1976.

Is it exhaustive?
- Backtracking may stop at the first hit: isMatching()
- Or iterate through all matchings: findFirst(), findNext(), findAll()
- Order-sensitive option for overlapping hits (figure: query and hits)

How can I tweak it?
Search parameters: all search types and parameters mentioned earlier!
If you know part of the hit: addMatch(queryAtom, targetAtom)
Example (figure: query and target): only produces 1 hit instead of 3!

Future Developments
• Markush structure handling in JChem Base / Cartridge / Marvin
• Further flexibility options: more sophisticated charge matching
• Reaction center search
• Mixture/Component searching
• Coordinate (3D) search
• Further query atom types (Metal, AnyH, etc.)
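As a rough illustration of the four Ullmann steps listed in the atom-by-atom search slide (candidate matrix, initial matching, refinement, backtracking), here is a simplified sketch on plain adjacency matrices. It is not the chemaxon.sss.MolSearch implementation: all names (`UllmannSketch`, `bonds`, `findAll`, and so on) are hypothetical, "local properties" are reduced to element symbol and degree, bond types are ignored, and refinement runs once before backtracking rather than at every level as in Ullmann's original algorithm.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a simplified Ullmann-style substructure matcher, not JChem code.
class UllmannSketch {

    /** Builds a symmetric adjacency matrix for n atoms from bonded atom index pairs. */
    static boolean[][] bonds(int n, int[][] pairs) {
        boolean[][] adj = new boolean[n][n];
        for (int[] p : pairs) {
            adj[p[0]][p[1]] = true;
            adj[p[1]][p[0]] = true;
        }
        return adj;
    }

    static int degree(boolean[] row) {
        int d = 0;
        for (boolean bonded : row) if (bonded) d++;
        return d;
    }

    /** Steps 1-2: m[i][j] is true if query atom i may match target atom j by local properties. */
    static boolean[][] initialMatrix(String[] q, boolean[][] qAdj, String[] t, boolean[][] tAdj) {
        boolean[][] m = new boolean[q.length][t.length];
        for (int i = 0; i < q.length; i++)
            for (int j = 0; j < t.length; j++)
                m[i][j] = q[i].equals(t[j]) && degree(qAdj[i]) <= degree(tAdj[j]);
        return m;
    }

    /** Step 3: clear m[i][j] while some neighbour of i has no candidate among the neighbours of j. */
    static void refine(boolean[][] m, boolean[][] qAdj, boolean[][] tAdj) {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int i = 0; i < m.length; i++)
                for (int j = 0; j < m[i].length; j++)
                    if (m[i][j] && !supported(m, qAdj, tAdj, i, j)) {
                        m[i][j] = false;
                        changed = true;
                    }
        }
    }

    static boolean supported(boolean[][] m, boolean[][] qAdj, boolean[][] tAdj, int i, int j) {
        for (int x = 0; x < qAdj.length; x++) {
            if (!qAdj[i][x]) continue;
            boolean found = false;
            for (int y = 0; y < tAdj.length && !found; y++) found = tAdj[j][y] && m[x][y];
            if (!found) return false;
        }
        return true;
    }

    /** Step 4: backtracking; each result maps query atom index -> target atom index. */
    static List<int[]> findAll(String[] q, boolean[][] qAdj, String[] t, boolean[][] tAdj) {
        boolean[][] m = initialMatrix(q, qAdj, t, tAdj);
        refine(m, qAdj, tAdj);
        List<int[]> matches = new ArrayList<>();
        backtrack(m, qAdj, tAdj, 0, new int[q.length], new boolean[t.length], matches);
        return matches;
    }

    static void backtrack(boolean[][] m, boolean[][] qAdj, boolean[][] tAdj,
                          int row, int[] map, boolean[] used, List<int[]> out) {
        if (row == m.length) { out.add(map.clone()); return; }
        for (int j = 0; j < used.length; j++) {
            if (!m[row][j] || used[j]) continue;
            boolean ok = true;   // bonds to already mapped query atoms must exist in the target
            for (int prev = 0; prev < row && ok; prev++)
                if (qAdj[row][prev] && !tAdj[j][map[prev]]) ok = false;
            if (!ok) continue;
            map[row] = j;
            used[j] = true;
            backtrack(m, qAdj, tAdj, row + 1, map, used, out);
            used[j] = false;
        }
    }

    public static void main(String[] args) {
        // Query C-O against the heavy atoms of ethanol, C0-C1-O2.
        List<int[]> hits = findAll(new String[] {"C", "O"}, bonds(2, new int[][] {{0, 1}}),
                new String[] {"C", "C", "O"}, bonds(3, new int[][] {{0, 1}, {1, 2}}));
        System.out.println(hits.size()); // prints 1 (C -> atom 1, O -> atom 2)
    }
}
```

Stopping after the first element of the result would correspond to the "first hit" behaviour, while collecting the full list corresponds to iterating through all matchings; a C-C query on a three-carbon chain, for instance, yields four atom mappings here because each bond is matched in both directions.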
Summary
• Sophisticated search parameters
• For speed, fingerprints are essential and are tunable
• Atom-by-atom searching from API (outside the database) allows flexible matching too

Further reading
Structural Search: http://www.chemaxon.com/Structural_Search.ppt
JChem Base: http://www.chemaxon.com/JChem_Base.ppt
JChem Cartridge: http://www.chemaxon.com/JChem_Cartridge.ppt
Standardizer: http://www.chemaxon.com/Standardizer.ppt
JChem Query Guide: http://www.chemaxon.com/jchem/doc/user/Query.html
JChem online demo: http://www.jchem.com/examples/jsp1_x/index.jsp
Chemical Terms Language: http://www.chemaxon.com/jchem/doc/user/EvaluatorLanguage.html
Chemical Terms Functions: http://www.chemaxon.com/jchem/doc/user/EvaluatorTables.html

Find out more
• Product descriptions & links: www.chemaxon.com/products.html
• Forum: www.chemaxon.com/forum
• Presentations and posters: www.chemaxon.com/conf
• Download: http://www.chemaxon.com/jchem/licensefrset.html