
Java Solutions for Cheminformatics
Power Search: how to tune search
for efficiency and performance
June 2006 UGM
Tune for efficiency
Get the results you want:
How to formulate the Query?
Which search type?
Which search parameters?
Using non-structural conditions, like predictions,
etc?
Non-chemical data?
Tune for performance
Getting results faster
Why are fingerprints important?
What is in a chemical hashed fingerprint?
What are the fingerprint parameters and how to set
them?
Why use additional structural keys?
What are good structural keys?
Atom-by-atom searching tips and tricks
Searching out of a database
How does it work?
Can I have all matches?
How can I tweak it? (Are there any options?)
Efficiency: How to formulate the Query?
- Query atom and bond types
- Query properties –
new keyboard shortcuts in Marvin: .<prop>
- Explicit Hydrogens
- R-groups; R-logic
- SMARTS atoms and bonds
References:
JChem Query Guide http://www.chemaxon.com/jchem/doc/user/Query.html
Structural Searching using ChemAxon Tools http://www.chemaxon.com/conf/Structural_Search.ppt
Which search type?
– Substructure
– Superstructure
– Exact fragment: query must match a full fragment
(same effect as if all atoms had received the s* query property)
– Exact: same size = no other fragments
– Perfect: for duplicate checking
Examples:
http://www.chemaxon.com/jchem/doc/user/QueryMatchExamples.html
Which search type?
Main difference: Substructure, Superstructure:
[Figure: a query and the structures retrieved by Substructure and by Superstructure search]
Which search type?
Main difference: Exact fragment, Exact:
[Figure: a query and the structures retrieved by Exact fragment and by Exact search]
Which search type?
Main difference: Exact, Perfect:
[Figure: a query and the structures retrieved by Exact and by Perfect search]
Which search parameters?
Selection of parameters
- Double bond stereo matching: All/Marked/None
- Exact…
Isotope, Charge, Radical, Stereo, QueryAtom, Bond
- Ignore…
Isotope*, Charge*, Radical*, Valence*, Stereo
- Vague bond options*:
ambiguous aromaticity rings, treat bonds “or aromatic”,
ignore all bond types
- Tautomer search*
* New: from JChem 3.2; released in alpha
Vague bond options explained
- Vague bond options:
- Level 1: handling of 5-membered ambiguous aromatic
rings – will be default as result of UGM discussions.
- Level 2: treat all query ring bonds “or aromatic”,
- Level 3: treat all query bonds “or aromatic”,
- Level 4: ignore all bond types
Vague bond options explained
Vague bond level 1: 5-membered ambiguous aromatic rings
“Visually expected” hits: (all obtained at vague bond level 1 - default from
JChem 3.2)
Vague bond options explained
Some “visually expected” hits are not found without vague bond
option:
Vague bond options explained
Vague bond level 2: all query ring bonds: “or aromatic”
“Visually expected” hits:
Some “visually expected” hits are not found without vague bond option:
Vague bond options explained
Vague bond level 3: all query bonds: “or aromatic”
[Figure: hits at vague bond level 3 compared with hits without the vague bond option]
Vague bond options explained
Vague bond level 4: ignore all bond types
[Figure: hits at vague bond level 4 compared with hits without the vague bond option]
Efficiency: Using conditions
Chemical conditions of molecules?
Chemical Terms filter
Example:
(mass() <= 500) &&
(logP() <= 5) &&
(donorCount() <= 5) &&
(acceptorCount() <= 10);
Chemical Terms Language Reference:
http://www.chemaxon.com/jchem/doc/user/EvaluatorLanguage.html
Chemical Terms built-in functions:
http://www.chemaxon.com/jchem/doc/user/EvaluatorTables.html
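To make the logic of the example filter concrete, here is a minimal Java sketch of the same four conditions applied to hits after a structure search. It does not use the Chemical Terms evaluator or any ChemAxon class: `Props` is a hypothetical holder for property values assumed to be computed elsewhere, and the numbers in `main` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java analogue of the four-condition filter; not the Chemical Terms engine. */
final class ConditionFilterSketch {

    /** Hypothetical holder for precomputed property values (not a ChemAxon class). */
    static final class Props {
        final String id;
        final double mass;
        final double logP;
        final int donorCount;
        final int acceptorCount;

        Props(String id, double mass, double logP, int donorCount, int acceptorCount) {
            this.id = id;
            this.mass = mass;
            this.logP = logP;
            this.donorCount = donorCount;
            this.acceptorCount = acceptorCount;
        }
    }

    /** Same conjunction as the filter expression: all four conditions must hold. */
    static boolean passes(Props p) {
        return p.mass <= 500
                && p.logP <= 5
                && p.donorCount <= 5
                && p.acceptorCount <= 10;
    }

    /** Keeps the ids of the hits that satisfy the filter, in input order. */
    static List<String> filter(List<Props> hits) {
        List<String> kept = new ArrayList<>();
        for (Props p : hits) {
            if (passes(p)) {
                kept.add(p.id);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative values only, not measured or predicted data.
        List<Props> hits = Arrays.asList(
                new Props("hitA", 350.0, 2.1, 2, 5),
                new Props("hitB", 612.0, 6.3, 1, 4));
        System.out.println(filter(hits)); // [hitA]
    }
}
```

In JChem itself the conditions are written as a Chemical Terms expression, as in the example above; the sketch only shows that a hit is kept when every condition holds.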
Chemical Terms example
Search context functions to access search objects:
Name              Description
mol(), target()   target (DB) molecule – default molecule object
query()           search query
hit(), h()        returned hit array (indexing 0-based!)
Examples:
1. ringcount(target()) == ringcount(query()) + 2
(the target molecule has 2 more rings than the query; works with all queries)
2. apka(hit(7)) > 10
Query: [figure]
(the matching Nitrogen has an acidic pKa greater than 10)
Efficiency: Using conditions
Non-chemical data?
Filterquery parameters
-> See JChem Base workshop
by Szilárd Doránt
Cartridge: filterquery +
join operations
-> See Cartridge workshop
by Péter Kovács
Performance: Why are fingerprints
important?
Two-stage method of searching:
JChem Table -> Structure Cache (molecules, fingerprints)
Query
-> Stage 1: Fingerprint screening
   Fast and effective: ~0.1 s for 3M molecules
   Large proportion of non-hits filtered
-> Result candidates
-> Stage 2: Atom-by-atom search
   Slower, but exact method
-> Results
What is in a chemical hashed fingerprint?
Fingerprints are fixed-length bit strings:
1011100000001010000000000000011100111000111000000001110000001000
How are the bits set?
1. All of the following patterns are enumerated:
- Linear paths in the molecule up to a limit (fingerprint parameter)
- Rings up to a size (14)
2. Several hash codes are created for each pattern, and the corresponding bits are set to 1.
Reference: http://www.chemaxon.com/jchem/doc/user/fingerprint.html
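The two steps above can be sketched in a few lines of Java. This is an illustration under simplifying assumptions, not the JChem implementation: the molecule is a hypothetical array of element symbols plus `{atomA, atomB, order}` bond triples, only linear paths are enumerated (ring patterns are left out), and the hash is simply `String.hashCode()` of the path text with a counter appended.

```java
import java.util.BitSet;

/** Illustrative path-based hashed fingerprint; not the JChem implementation. */
final class HashedFingerprintSketch {
    private final String[] atoms;      // element symbols, e.g. "C", "O"
    private final int[][] bonds;       // {atomA, atomB, order 1..3}
    private final int length;          // fingerprint length in bits
    private final int maxBonds;        // maximum path length, counted in bonds
    private final int bitsPerPattern;  // number of bits set for each pattern
    private BitSet bits;
    private boolean[] visited;

    HashedFingerprintSketch(String[] atoms, int[][] bonds,
                            int length, int maxBonds, int bitsPerPattern) {
        this.atoms = atoms;
        this.bonds = bonds;
        this.length = length;
        this.maxBonds = maxBonds;
        this.bitsPerPattern = bitsPerPattern;
    }

    /** Enumerates every linear path from every atom and hashes it into the bit set. */
    BitSet fingerprint() {
        bits = new BitSet(length);
        visited = new boolean[atoms.length];
        for (int start = 0; start < atoms.length; start++) {
            visited[start] = true;
            extend(start, atoms[start], atoms[start], 0);
            visited[start] = false;
        }
        return bits;
    }

    private void extend(int last, String fwd, String bwd, int bondCount) {
        // A path and its reverse are the same pattern: keep the smaller text.
        String pattern = fwd.compareTo(bwd) <= 0 ? fwd : bwd;
        for (int i = 0; i < bitsPerPattern; i++) {
            bits.set(Math.floorMod((pattern + ":" + i).hashCode(), length));
        }
        if (bondCount == maxBonds) {
            return;
        }
        for (int[] b : bonds) {
            int next = b[0] == last ? b[1] : (b[1] == last ? b[0] : -1);
            if (next < 0 || visited[next]) {
                continue;
            }
            char bond = "-=#".charAt(b[2] - 1); // single, double, triple
            visited[next] = true;
            extend(next, fwd + bond + atoms[next], atoms[next] + bond + bwd, bondCount + 1);
            visited[next] = false;
        }
    }

    public static void main(String[] args) {
        // Heavy atoms of acetic acid: C-C(=O)-O
        String[] atoms = {"C", "C", "O", "O"};
        int[][] bonds = {{0, 1, 1}, {1, 2, 2}, {1, 3, 1}};
        BitSet fp = new HashedFingerprintSketch(atoms, bonds, 64, 3, 2).fingerprint();
        System.out.println(fp.cardinality() + " of 64 bits set: " + fp);
    }
}
```

Because every path of a substructure is also a path of any structure containing it, a fingerprint built this way for the substructure can only set bits that are also set for the superstructure. Different patterns may still hash to the same bit.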
What is in a chemical hashed fingerprint?
Example:
Patterns in the molecule (Note – all substructures!):
Hashing function
uses atom type and bond type info
1011100000001010000000000000011100111000111000000001110000001000
Bit clash allowed
How does fingerprint screening work?
Superstructure contains all patterns of substructure.
Therefore, superstructure fingerprint must contain substructure’s fingerprint.
Example:
                      Case 1        Case 2
Query fingerprint:    1010100000    1010100000
Target fingerprint:   1011101000    1010001000
Result:               OK            Missing pattern! Cannot be substructure!
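The screening test in the example above is a bitwise containment check: the query passes only if every bit set in the query fingerprint is also set in the target fingerprint. A minimal sketch with `java.util.BitSet` follows; the class and method names are illustrative and are not part of the JChem API.

```java
import java.util.BitSet;

/** Illustrative fingerprint screening: is the query fingerprint contained in the target's? */
final class FingerprintScreenSketch {

    /** Parses a string of '0' and '1' characters; character i becomes bit i. */
    static BitSet parse(String bitString) {
        BitSet bits = new BitSet(bitString.length());
        for (int i = 0; i < bitString.length(); i++) {
            if (bitString.charAt(i) == '1') {
                bits.set(i);
            }
        }
        return bits;
    }

    /** True if no query bit is missing from the target, so the target remains a candidate. */
    static boolean passesScreen(BitSet query, BitSet target) {
        BitSet missing = (BitSet) query.clone();
        missing.andNot(target); // keep only query bits that the target lacks
        return missing.isEmpty();
    }

    public static void main(String[] args) {
        BitSet query = parse("1010100000");
        System.out.println(passesScreen(query, parse("1011101000"))); // true
        System.out.println(passesScreen(query, parse("1010001000"))); // false
    }
}
```

A target that passes is only a candidate: because of bit clashes it must still be confirmed by atom-by-atom search. A target that fails can be discarded without further work.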
Fingerprint darkness
Darkness: percentage of 1s in the fingerprint:
“Black” fingerprint:
11111111111111111111111111111111
“White” fingerprint:
00000000000000000000000000000000
A darker fingerprint increases the probability of bit clash
Bit clash = more patterns per bit = less effective screening
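Darkness as defined above is simply the number of 1 bits divided by the fingerprint length. The following sketch computes it for a single fingerprint and reports the average and maximum over a set of fingerprints, which are the two figures one would monitor for a table. The names are illustrative, and the input is assumed to be non-empty strings of '0' and '1' characters.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative darkness statistics for fingerprints given as '0'/'1' strings. */
final class DarknessSketch {

    /** Fraction of bits set to 1, between 0.0 (white) and 1.0 (black). */
    static double darkness(String bits) {
        long ones = bits.chars().filter(c -> c == '1').count();
        return (double) ones / bits.length();
    }

    /** Mean darkness over all fingerprints of a table. */
    static double averageDarkness(List<String> fingerprints) {
        double sum = 0.0;
        for (String fp : fingerprints) {
            sum += darkness(fp);
        }
        return sum / fingerprints.size();
    }

    /** Darkness of the darkest fingerprint in the table. */
    static double maxDarkness(List<String> fingerprints) {
        double max = 0.0;
        for (String fp : fingerprints) {
            max = Math.max(max, darkness(fp));
        }
        return max;
    }

    public static void main(String[] args) {
        List<String> table = Arrays.asList("1011101000", "1010100000");
        System.out.printf("average = %.2f, maximum = %.2f%n",
                averageDarkness(table), maxDarkness(table));
    }
}
```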
What are the fingerprint parameters and how
to set them?
- Fingerprint length
- Maximum pattern length
- Number of bits per pattern
1011001100110011
(may decrease full clash probability)
Fingerprint optimization goals (for a table):
- Maximize fingerprint information AND
- Minimize bit clashes == not too dark fingerprints
Optimal fingerprint darkness for a table:
- Average is ~40%
- Maximum is less than 80%
What are the fingerprint parameters and how
to set them?
Effect of fingerprint parameters change:
                                             Goal change
Fingerprint parameter    Parameter change    Information content    Darkness
Fingerprint length       ↑                   -                      ↓
                         ↓                   -                      ↑
Max. pattern length      ↑                   ↑                      ↑
                         ↓                   ↓                      ↓
# of bits per pattern    ↑                   ↑                      ↑
                         ↓                   ↓                      ↓
Optimization target:                         high                   40%
Which queries have poor fingerprints?
Queries without definite atom/bond types:
- Query atoms: A, Q, L, !L
- Query bonds: any, single/aromatic, single/double
- H atoms are excluded from fingerprints
But “fingerprint mining” is done for SMARTS atoms:
- [CX4,CH3,CD2] == [C;X4,H3,D2]: Carbon is recognized.
- [$(CCC(=O)N)]: Inner substructure of recursive SMARTS is used. (Non-negated, non-or.)
How to define additional keys?
Additional structural keys: You may define your own
substructure patterns (queries) to improve screening:
1011100000001010000000000000011100111000111000000001110000101000
Chemical hashed fingerprint
Structural keys
(1 bit each pattern)
Advantages achieved:
– No bit clash – reserved bit for pattern
– If query == pattern: instant result
– Improved screening (speed), if patterns introduce
more info
How to define additional keys?
Which are good structural keys?
The substructure criterion must hold in all search settings:
For all M1 ⊂ M2: key matches M1 ⇒ key matches M2
Safe query features in structural keys:
– Atom, bond types
– Query atoms not matching Hydrogens: A, Q, L
– a, A, u, R query properties
– Recursive SMARTS
– SMARTS atoms not containing negations and non-allowed query features
How to define additional keys?
Examples of good functional keys: [figure]
Wrong functional keys: [figure]
- Wrong when ignore isotopes option is on
- Wrong when ignore charge option is on
- Wrong (Q: [figure] T: [figure])
Searching out of a database
How does atom-by-atom searching work? (chemaxon.sss.MolSearch)
Ullmann algorithm:
1. A boolean matrix stores query atom - target atom possible matchings
2. Initial matching: local properties
3. Refining (rarefying) based on neighbours
4. Backtrack to find matchings – one matching per row
Example:
[Figure: a query and a target structure with three boolean matrices (rows: query atoms, columns: target atoms 0-5, T = possible matching): the initial matrix, the refined matrix, and the first match with one T per row]
Reference: J. R. Ullmann: An algorithm for subgraph isomorphism. J. ACM, 23(1):31-42, 1976.
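The four steps of the algorithm can be sketched compactly in Java. This is a simplified illustration, not the code of chemaxon.sss.MolSearch: molecules are hypothetical arrays of element symbols with `{atomA, atomB}` bond pairs, bond orders and all query features are ignored, and the only local properties used for the initial matrix are the element symbol and the number of neighbours.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified Ullmann-style substructure matcher; bond orders are ignored. */
final class UllmannSketch {
    private final String[] qAtoms, tAtoms;
    private final boolean[][] qAdj, tAdj;

    UllmannSketch(String[] qAtoms, int[][] qBonds, String[] tAtoms, int[][] tBonds) {
        this.qAtoms = qAtoms;
        this.tAtoms = tAtoms;
        this.qAdj = adjacency(qAtoms.length, qBonds);
        this.tAdj = adjacency(tAtoms.length, tBonds);
    }

    private static boolean[][] adjacency(int n, int[][] bonds) {
        boolean[][] adj = new boolean[n][n];
        for (int[] b : bonds) { adj[b[0]][b[1]] = true; adj[b[1]][b[0]] = true; }
        return adj;
    }

    private static int degree(boolean[] row) {
        int d = 0;
        for (boolean bonded : row) if (bonded) d++;
        return d;
    }

    /** Steps 1-2: m[q][t] is true if target atom t is locally compatible with query atom q. */
    boolean[][] initialMatrix() {
        boolean[][] m = new boolean[qAtoms.length][tAtoms.length];
        for (int q = 0; q < qAtoms.length; q++)
            for (int t = 0; t < tAtoms.length; t++)
                m[q][t] = qAtoms[q].equals(tAtoms[t]) && degree(qAdj[q]) <= degree(tAdj[t]);
        return m;
    }

    /** Step 3: clear m[q][t] when some neighbour of q has no candidate among t's neighbours. */
    void refine(boolean[][] m) {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int q = 0; q < qAtoms.length; q++)
                for (int t = 0; t < tAtoms.length; t++)
                    if (m[q][t] && !neighboursSupported(m, q, t)) { m[q][t] = false; changed = true; }
        }
    }

    private boolean neighboursSupported(boolean[][] m, int q, int t) {
        for (int qn = 0; qn < qAtoms.length; qn++) {
            if (!qAdj[q][qn]) continue;
            boolean found = false;
            for (int tn = 0; tn < tAtoms.length && !found; tn++) found = tAdj[t][tn] && m[qn][tn];
            if (!found) return false;
        }
        return true;
    }

    /** Step 4: backtracking; each result maps query atom index to target atom index. */
    List<int[]> findMatches(int limit) {
        boolean[][] m = initialMatrix();
        refine(m);
        List<int[]> out = new ArrayList<>();
        backtrack(m, 0, new int[qAtoms.length], new boolean[tAtoms.length], limit, out);
        return out;
    }

    private void backtrack(boolean[][] m, int q, int[] map, boolean[] used, int limit, List<int[]> out) {
        if (out.size() >= limit) return;                       // enough matches found
        if (q == map.length) { out.add(map.clone()); return; } // one target atom chosen per row
        for (int t = 0; t < used.length; t++) {
            if (!m[q][t] || used[t] || !bondsPreserved(q, t, map)) continue;
            map[q] = t; used[t] = true;
            backtrack(m, q + 1, map, used, limit, out);
            used[t] = false;
        }
    }

    private boolean bondsPreserved(int q, int t, int[] map) {
        for (int p = 0; p < q; p++) if (qAdj[q][p] && !tAdj[t][map[p]]) return false;
        return true;
    }

    public static void main(String[] args) {
        // Query C-O against target O-C-C-O (heavy atoms only).
        UllmannSketch s = new UllmannSketch(new String[]{"C", "O"}, new int[][]{{0, 1}},
                new String[]{"O", "C", "C", "O"}, new int[][]{{0, 1}, {1, 2}, {2, 3}});
        System.out.println("matches: " + s.findMatches(Integer.MAX_VALUE).size());
    }
}
```

Calling `findMatches(1)` stops the backtracking at the first hit, while a large limit enumerates all matchings, which mirrors the choice between a yes/no answer and an exhaustive search.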
Is it exhaustive?
Backtrack may stop at the first hit:
isMatching()
Or iterate through all matchings:
findFirst(), findNext(), findAll()
Order sensitive option for overlapping hits:
Query:
Hits:
...
How can I tweak it?
Search parameters:
All search types and parameters mentioned earlier!
If you know part of the hit:
addMatch(queryAtom, targetAtom)
Example:
[Figure: query and target structures with one query atom fixed to a target atom]
Only produces 1 hit instead of 3!
Future Developments
• Markush structure handling in JChem Base / Cartridge / Marvin
• Further flexibility options: more sophisticated charge matching
• Reaction center search
• Mixture/Component searching
• Coordinate (3D) search
• Further query atom types (Metal, AnyH, etc.)
Summary
• Sophisticated search parameters
• For speed, fingerprints are essential and are
tunable
• Atom-by-atom searching from API (outside the
database) allows flexible matching too
Further reading
Structural Search
http://www.chemaxon.com/Structural_Search.ppt
JChem Base
http://www.chemaxon.com/JChem_Base.ppt
JChem Cartridge
http://www.chemaxon.com/JChem_Cartridge.ppt
Standardizer
http://www.chemaxon.com/Standardizer.ppt
JChem Query Guide
http://www.chemaxon.com/jchem/doc/user/Query.html
JChem online demo
http://www.jchem.com/examples/jsp1_x/index.jsp
Chemical Terms Language
http://www.chemaxon.com/jchem/doc/user/EvaluatorLanguage.html
Chemical Terms Functions
http://www.chemaxon.com/jchem/doc/user/EvaluatorTables.html
Find out more
• Product descriptions & links
www.chemaxon.com/products.html
• Forum
www.chemaxon.com/forum
• Presentations and posters
www.chemaxon.com/conf
• Download
http://www.chemaxon.com/jchem/licensefrset.html