Molecular Similarity





Chemical Structure Similarity





Chemical Graph Matching Using Spinifex





What is a MOS?






A Simple Example






An Example with Unconnected Fragments






Graph Similarity Getting More Complicated





How Spinifex Works


  1. Read input SDFs using Python OELib/OEChem - delete hydrogens, determine aromaticity, atom types, bond types, ring properties

  2. Compute Smallest Set of Smallest Rings (SSSR) using Figueras approach

  3. Identify redundant edge matchings for rings

  4. Identify redundant edge matchings for hydrocarbon chains

  5. Identify triangle-trinode inequalities

  6. Create edge graphs for input molecules

  7. Build correspondence graph, handling 5 to 7

  8. Find maximum clique in correspondence graph

  9. Compute similarity coefficient or create SDF solutions from input molecules

  10. Clustering

  11. Visualise results




Redundant Edges





Redundant Edges (cont.)





How to Build an Edge Graph


from Raymond et. al, JCICS, 2002



How to Build an Edge Graph (cont.)


from C.K. Chen, University of Hawaii




Triangle-Trinode Inequalities


Both cyclopropyl and isobutyl have identical edge graphs but we don't want them to match




Edge and Vertex Weights





Clique Detection Algorithms





Similarity Metric







from Raymond et. al, JCICS, 2002

    ( NumVert(MOS) + NumEdge(MOS) )**2
    _________________________________________________

    (NumVert(G)+NumEdge(G))*(NumVert(H)+NumEdge(H))





Similarity Metric with Penalty Factor


A penalty factor can be used to reduce the similarity metric if the arrangement of the three largest fragments in the MOS are not the same for a pair of molecules:
     Type1:
         A
         / \
         B---C

     Type 2:

        A-B-C

     Type 3:

        B-A-C

     Type 4:

        A-C-B



Also account for two molecules that share a large common fragment:

        AAAAAAA-B  considered similar to  AAAAAAA-CCCCC



It is possible to regenerate similarity metrics using alternative values for the penalty factors without re-computing chemical similarities.


Spinifex Command Line Options


spinifex.py - solve maximum common substructure problems

SYNOPSIS:

    spinifex.py [OPTIONS] sdfile1 [sdfile2]

    Default mode compares first mol in sdfile1 with first mol in sdfile2
     and writes sdf of the MOS from atoms and coords from sdfile1

OPTIONS:

    -a          write both sets of atom numbers of the MOS
    -b          write both sets of bond numbers of the MOS
    -c          run all vs all comparisons.  See Note 1
    -d          disable atom typing by sybyl types and H counting.  See Note 2
    -e          write both sets of atom numbers of the largest fragment
    -f 'int'    use graph with reduced features.  See Note 3
    -g          graphically display highlighted MOS for both structures
    -h          write this message
    -H          write this message and additional Notes
    -m          compare first mol in sdfile1 with all mols in sdfile2
    -n          write sdf of largest fragment from the maximum clique
    -o          write sdf of the MOS from atoms and coords from sdfile2
    -q          overlay molecules by RMS and create a transformed sdfile2
    -q -g       overlay molecules by RMS and graphically display results
    -r 'int'    reset timeout. See Note 4
    -s 'int'    write similarity metric. See Note 5
    -S 'a,b,c'  penalty factors for similarity metric 2. See Note 5
    -t          write timings (use only with -m and -n options)
    -u          show timings and any similarity metric penalty for each comparison
    -v          verbose output to stderr




Parallel Spinifex





Clustering





Clustering (cont.)




A hierarchical tree can be cut into clusters at a number of similarity levels and individual clusters can be viewed using Spinifex utilities and Rasmol





Clustering Data







Clustering Results







Clustering Results (cont.)