# Systematic benchmark of substructure search in molecular graphs - From Ullmann to VF2

- Hans-Christian Ehrlich^{1} and
- Matthias Rarey^{1}

**4**:13

**DOI: **10.1186/1758-2946-4-13

© Ehrlich and Rarey; licensee Chemistry Central Ltd. 2012

**Received: **17 February 2012

**Accepted: **27 April 2012

**Published: **31 July 2012

## Abstract

### Background

Searching for substructures in molecules is one of the most elementary tasks in cheminformatics and is nowadays part of virtually every cheminformatics software. The underlying algorithms, used over several decades, are designed for the application to general graphs. Little effort has been spent on characterizing their performance on molecular graphs. Therefore, it is not clear how current substructure search algorithms behave on such special graphs. One of the main reasons why such an evaluation was not performed in the past was the absence of appropriate data sets.

### Results

In this paper, we present a systematic evaluation of Ullmann’s and the VF2 subgraph isomorphism algorithms on molecular data. The benchmark set consists of a collection of 1235 SMARTS substructure expressions and selected molecules from the ZINC database. The benchmark evaluates substructure search times for complete database scans as well as individual substructure-molecule pairs. In detail, we focus on the influence of substructure formulation and size, the impact of molecule size, and the ability of both algorithms to be used on multiple cores.

### Conclusions

The results show a clear superiority of the VF2 algorithm in all test scenarios. In general, both algorithms solve most instances in less than one millisecond, which we consider to be acceptable. Still, in direct comparison, the VF2 is most often several-fold faster than Ullmann’s algorithm. Additionally, Ullmann’s algorithm shows a surprising number of run time outliers.

### Keywords

Substructure search; Subgraph isomorphism; Algorithm; Benchmark; SMARTS; Chemical pattern search

## Background

Today’s drug discovery faces a constantly growing number of commercially available or synthetically accessible compounds maintained in large databases [1, 2]. In order to efficiently search such databases, computational search strategies comprising various search criteria have been developed over more than four decades [3–14]. Search criteria range from retrieving the one exact compound over selecting compounds via substructure features to the application of various similarity measures. In the following, we focus on methods that test compounds for the presence of certain functional groups or substructures.

Modeling molecular structures as labeled graphs has a long tradition and gives the basis for modern cheminformatics methods. A graph-based representation is chemically intuitive and forms a solid theoretical foundation for computer-aided processing. Furthermore, graphs allow the substructure search problem to be solved by graph isomorphism techniques, i.e., searching molecules for substructures is equivalent to testing two labeled graphs for subgraph isomorphism. The subgraph isomorphism problem is well studied [15–17] and one of the oldest and most applied algorithms [18–22] was introduced by Ullmann in 1976 [7]. Over the years that followed, only a few subgraph isomorphism methods were introduced [11, 16, 23], the most recent being the VF2 algorithm [12].

Until now, each comparison of (sub-)graph isomorphism algorithms [16, 17] only employs synthetic graph data. The data is most often constructed to show the algorithms’ behavior on medium to large graphs. Therefore, it is unclear how these algorithms behave on rather small graphs like molecular data. To our knowledge, no subgraph isomorphism comparison directly addresses the problem of searching chemical substructures in molecules. One of the main reasons why such a benchmark was not performed in the past was the lack of suitable and publicly available benchmark data sets.

This article describes several such data sets and discusses the differences between the Ullmann and the VF2 subgraph isomorphism algorithms applied to substructures and molecules. In the following, we introduce the graph theoretical concepts, summarize the two algorithms of interest, introduce different benchmark data sets and compare the algorithms’ performance in various molecular modeling scenarios.

## Preliminaries

For almost 150 years, chemists have used chemical and structural formulas to represent molecules. A structural formula is closely related to the mathematical concepts of graphs which makes graph theory and algorithms directly applicable in cheminformatics.

### Graph theoretical background

A *graph* *G*=(*V*,*E*) is defined by a set of nodes *V* and a set of connecting edges *E*. The edges of an *undirected* graph have no fixed orientation, and if labels are assigned to nodes or edges the graph is denoted as *labeled*. If a path from each node to every other node exists, the graph is called *connected*. In the following, all graphs are labeled, undirected and connected except when stated otherwise.

#### Subgraph isomorphism

Two graphs *G*_{1}=(*V*_{1},*E*_{1}) and *G*_{2}=(*V*_{2},*E*_{2}) are *isomorphic* if a bijective projection between nodes *V*_{1} and nodes *V*_{2} exists such that two nodes from *V*_{1} are connected by an edge from *E*_{1} if and only if their image nodes in *V*_{2} are connected by an edge from *E*_{2}. An *induced subgraph* of a graph *G*=(*V*,*E*) is defined as a graph *G*′=(*V*′,*E*′) whose nodes *V*′ are a subset of *V* and whose edges *E*′ are all possible edges from *E* that connect two nodes in *V*′. An *induced subgraph isomorphism* between a query graph *G*_{1} and a target graph *G*_{2} exists if *G*_{1} is isomorphic to an induced subgraph of *G*_{2}, i.e., the query graph *G*_{1} is a subgraph of the target graph *G*_{2}.

The problem of finding an isomorphic induced subgraph is believed to be a problem for which no efficient solution exists, i.e., it belongs to the class of NP-complete problems [5, 24]. Therefore, every known subgraph isomorphism algorithm shows exponential worst-case run times with respect to the input graph size.
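To give a feeling for the size of the search space, the following minimal sketch counts the injective node-to-atom assignments that a naive enumeration would have to consider. The function name `injective_mappings` is an illustrative assumption and not part of the benchmark; label and edge constraints, which prune most of these assignments in practice, are deliberately ignored.

```python
import math

def injective_mappings(n_query_nodes, n_target_atoms):
    """Number of ways to assign distinct target atoms to the query nodes.

    This is the number of partial permutations m! / (m - n)!, an upper
    bound on the leaves of a naive backtracking search tree.
    """
    return math.perm(n_target_atoms, n_query_nodes)

# The bound grows quickly with the query size for a fixed target size.
for n in (2, 4, 6):
    print(n, injective_mappings(n, 70))
```

The count only bounds a naive enumeration. Both algorithms discussed here prune the tree using labels and topology, which is why typical run times on molecules stay far below this bound.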

#### Molecular graphs

A *molecular graph* is given by nodes and edges that represent atoms and bonds, respectively. Often nodes and edges are labeled with atom and bond properties. Obviously, molecular graphs are undirected. The number of edges connecting each node is limited by the number of covalent bonds an atom can form. Therefore, the number of edges in a molecular graph linearly depends on the number of nodes.

Molecules are equal or isomorphic if their molecular graphs are isomorphic and the labels of the atoms and bonds are equal to the labels of their mapped atoms and bonds respectively. When two molecules differ in size, one can be a substructure of the other, i.e., a subgraph isomorphism between the two molecules exists. The small number of atoms and the linear atom degree allow for a fast subgraph isomorphism test on molecules.

#### Substructure graphs

A *substructure graph* can be a molecule fragment, e.g., a functional group, or a more generalized construct. For example, a single halogen node might represent a fluorine, chlorine, bromine or iodine atom. The same applies to edges, e.g., an edge is either a single or a double bond. In the following, we will use substructure graphs with such general labels. Figure 1 shows an example.

Substructure graphs are compared with molecules to detect subgraph isomorphisms. The goal is to determine the presence or location of a functional group or a specific molecular structure. Nodes and edges are mapped to atoms and bonds in accordance with their labels. Since edges are explicitly assigned to bonds, the detected isomorphic subgraph might not be induced, i.e., non-circular substructures can be mapped to circular molecule parts.
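The two ideas above, generalized node labels and non-induced matching, can be illustrated with a deliberately naive brute-force sketch. All names (`NODE_LABELS`, `node_matches`, `find_mapping`) and the label table are assumptions made for illustration; bond labels are ignored, and this is not one of the two benchmarked algorithms.

```python
from itertools import permutations

# Hypothetical generalized node labels: each maps to the elements it accepts.
NODE_LABELS = {
    "X": {"F", "Cl", "Br", "I"},  # generic halogen node
    "C": {"C"},
    "*": None,                    # wildcard: any element
}

def node_matches(node_label, atom_element):
    """True if the atom's element is allowed by the query node label."""
    allowed = NODE_LABELS[node_label]
    return allowed is None or atom_element in allowed

def find_mapping(query_nodes, query_edges, atoms, bonds):
    """Brute-force, non-induced substructure test.

    query_nodes: list of node labels; query_edges: list of (i, j) tuples;
    atoms: list of element symbols; bonds: set of frozenset({a, b}).
    Returns a tuple that maps query node index -> atom index, or None.
    """
    for candidate in permutations(range(len(atoms)), len(query_nodes)):
        labels_ok = all(node_matches(query_nodes[i], atoms[a])
                        for i, a in enumerate(candidate))
        # Only query edges are checked; extra bonds between mapped atoms
        # are allowed, so the matched subgraph need not be induced.
        if labels_ok and all(frozenset({candidate[i], candidate[j]}) in bonds
                             for i, j in query_edges):
            return candidate
    return None

# A three-node chain (non-circular) maps onto a three-membered ring.
ring_bonds = {frozenset({0, 1}), frozenset({1, 2}), frozenset({0, 2})}
print(find_mapping(["C", "C", "C"], [(0, 1), (1, 2)], ["C", "C", "C"], ring_bonds))
```

Because only the query edges are tested, the chain query succeeds on the ring even though the ring-closing bond has no counterpart in the query.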

For a clear differentiation, we will use the terms atoms and bonds for molecular target graphs and nodes and edges for query substructure graphs.

### Algorithm 1

## Substructure pattern languages

A substructure graph can be formulated by using a substructure pattern language like SMILES Arbitrary Target Specification (SMARTS) [25], Sybyl Line Notation (SLN) [26] or Wiswesser Line Notation (WLN) [27]. All languages define a substructure graph in a textual line notation similar to a molecule’s chemical formula. They allow the definition of a substructure’s topology and node and bond properties, including logical alternatives. SMARTS even provides the opportunity to specify additional information like a chemical environment. In this study, all substructures are formulated as SMARTS expressions.

## Methods

The Ullmann and the VF2 algorithms are two algorithms that solve the subgraph isomorphism problem. Applied to substructure and molecular graphs, they can be used to detect substructures in molecules. Both algorithms calculate an exact solution, i.e., the exact substructure must be present, and their application is not restricted to a special class of graphs, i.e., is not limited to molecular graphs.

### Ullmann algorithm

The Ullmann algorithm uses an *n*×*m* matrix *M* of boolean values, where *n* is the number of substructure nodes and *m* the number of molecule atoms. An entry at position (*i*,*j*) marks the compatibility of labels for substructure node *i* and molecule atom *j*. Additionally, it uses a boolean vector *f* of length *m* marking mapped atoms. Algorithm 1 and Algorithm 2 show Ullmann’s match and refinement procedure. Figure 2 illustrates one step of the algorithm.

### Algorithm 2

The refinement is the crucial step of the algorithm. It evaluates the neighborhood of every possible node-atom mapping. For a valid mapping, every neighbor node must have a compatible atom as illustrated in Figure 3. Otherwise, the mapping is invalid, which is marked by setting the corresponding matrix entry to zero. The evaluation takes place for every possible mapping downstream of the current row and is repeated until all remaining mappings are valid.

Although the refinement procedure is the key to an efficient reduction of the search space, it does not take full advantage of topological constraints. For example, in the case of a small substructure and a large molecule, it evaluates entries topologically too far away from already mapped node-atom pairs.
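The match and refinement procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation benchmarked in this study: graphs are given as label lists plus adjacency lists, bond labels are ignored, the refinement is applied to the whole matrix rather than only to rows downstream of the current one, and the names `refine` and `ullmann` are hypothetical.

```python
def refine(M, q_adj, t_adj):
    """Clear every entry M[i][j] for which some neighbor of node i has no
    compatible neighbor of atom j; repeat until nothing changes.
    Returns False as soon as a row has no candidate atom left."""
    changed = True
    while changed:
        changed = False
        for i, row in enumerate(M):
            for j, ok in enumerate(row):
                if ok and not all(any(M[x][y] for y in t_adj[j])
                                  for x in q_adj[i]):
                    row[j] = False
                    changed = True
            if not any(row):
                return False
    return True

def ullmann(q_labels, q_adj, t_labels, t_adj, compatible=lambda a, b: a == b):
    """Return a list mapping each query node to an atom index, or None."""
    n, m = len(q_labels), len(t_labels)
    # Compatibility matrix: label match plus a simple degree check.
    M = [[compatible(q_labels[i], t_labels[j]) and len(q_adj[i]) <= len(t_adj[j])
          for j in range(m)] for i in range(n)]
    f = [False] * m          # f[j] marks atoms that are already mapped
    mapping = [None] * n

    def match(i, M):
        if i == n:
            return True
        for j in range(m):
            if M[i][j] and not f[j]:
                M2 = [row[:] for row in M]           # work on a copy
                M2[i] = [k == j for k in range(m)]   # fix node i -> atom j
                if refine(M2, q_adj, t_adj):
                    f[j], mapping[i] = True, j
                    if match(i + 1, M2):
                        return True
                    f[j] = False                     # backtrack
        return False

    if not refine(M, q_adj, t_adj):
        return None
    return mapping if match(0, M) else None

# Query C-O against a C-C-O chain (atom indices 0, 1, 2).
print(ullmann(["C", "O"], [[1], [0]], ["C", "C", "O"], [[1], [0, 2], [1]]))
```

Note that rows are processed strictly in index order, independent of the query topology, which is the behavior the text identifies as a weakness for small queries in large molecules.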

### VF2 algorithm

The VF2 algorithm operates on a state *s*, which is composed of a partial solution *M*(*s*) and adjacency sets *T*_{1}(*s*) and *T*_{2}(*s*). A pair (*n*,*m*)∈*M*(*s*) represents an atom-node mapping of the partial solution. *M*_{1}(*s*) and *M*_{2}(*s*) describe the atoms and nodes, respectively, that belong to the partial solution. *T*_{1}(*s*) and *T*_{2}(*s*) hold atoms and nodes adjacent to atoms in *M*_{1}(*s*) and nodes in *M*_{2}(*s*), respectively. The algorithm modifies the state *s* in two steps. From the sets *T*_{1}(*s*) and *T*_{2}(*s*), it creates a candidate set *P*(*s*) of atom-node pairs with compatible labels. Then, it explores every candidate (*n*,*m*)∈*P*(*s*) that fulfills the feasibility rules *F*_{syn} or backtracks if *P*(*s*) is empty. Figure 4 graphically depicts one step of the algorithm.

*F*_{syn}(*s*,*n*,*m*) (Equation 1) describes the feasibility of candidates (*n*,*m*) in state *s*. It is composed of two terms, *R*_{adj} (Equation 2) and *R*_{inout} (Equation 3). The first feasibility rule *R*_{adj} guarantees that each atom *n*′ and node *m*′ adjacent (*Adj*) to the atom *n* and node *m* of a candidate pair (*n*,*m*) are mapped to each other in the partial solution, (*n*′,*m*′)∈*M*(*s*). The second rule *R*_{inout} performs a 1-look-ahead in the search procedure based on the nodes’ cardinality (*Card*) and allows an early pruning of the search tree. Figure 5 and Figure 6 give an illustration of the feasibility rules.

The problem of reaching the same state, i.e., the same partial solution *M*(*s*), via different paths is handled by imposing an arbitrary total order ≺ onto the subgraph nodes and processing only smallest feasible candidates with regard to that order. Therefore, feasible candidates (*n*_{i},*m*_{j}) in *P*(*s*) are not processed if *m*_{k}≺*m*_{j}∈*P*(*s*).

The main difference between the two algorithms is the way they account for the topology of the substructure. The Ullmann algorithm processes a compatibility matrix top-down. In every step it fixes one node-atom mapping and checks all other possible assignments for validity. Therefore, it processes substructure nodes in a non-topological, arbitrary order. In contrast, the VF2 iteratively adds node-atom pairs to a current solution and therefore directly explores the substructure’s topology.
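The topology-driven extension of a partial solution can be sketched as below. This is a minimal VF2-style illustration under several assumptions, not the benchmarked implementation: the query is treated as connected, bond labels are ignored, the total order is simply the node index, and the adjacency check and look-ahead only approximate the roles of *R*_{adj} and *R*_{inout} for non-induced matching. The names `vf2_like`, `frontier`, `feasible` and `extend` are hypothetical, and `i`/`j` denote a query node and a molecule atom.

```python
def vf2_like(q_labels, q_adj, t_labels, t_adj, compatible=lambda a, b: a == b):
    """Return a dict {query node: atom} for one match, or None."""
    n = len(q_labels)
    core_q, core_t = {}, {}  # partial solution: node -> atom and atom -> node

    def frontier(adj, core):
        # Unmapped vertices adjacent to the partial solution (the T sets).
        return {x for p in core for x in adj[p] if x not in core}

    def feasible(i, j):
        # Adjacency rule: every mapped neighbor of node i must be mapped
        # to a neighbor of atom j.
        for x in q_adj[i]:
            if x in core_q and core_q[x] not in t_adj[j]:
                return False
        # 1-look-ahead: atom j needs at least as many frontier neighbors
        # as node i, otherwise the branch cannot be completed.
        tq, tt = frontier(q_adj, core_q), frontier(t_adj, core_t)
        return sum(x in tq for x in q_adj[i]) <= sum(y in tt for y in t_adj[j])

    def extend():
        if len(core_q) == n:
            return True
        tq = frontier(q_adj, core_q)
        if tq:
            i = min(tq)                       # smallest node w.r.t. the order
            atoms = frontier(t_adj, core_t)   # only atoms next to the match
        else:
            i = min(x for x in range(n) if x not in core_q)
            atoms = [j for j in range(len(t_labels)) if j not in core_t]
        for j in sorted(atoms):
            if compatible(q_labels[i], t_labels[j]) and feasible(i, j):
                core_q[i], core_t[j] = j, i
                if extend():
                    return True
                del core_q[i], core_t[j]      # backtrack
        return False

    return dict(core_q) if extend() else None

# Query C-O against a C-C-O chain (atom indices 0, 1, 2).
print(vf2_like(["C", "O"], [[1], [0]], ["C", "C", "O"], [[1], [0, 2], [1]]))
```

In contrast to the matrix-based procedure, candidate atoms are drawn only from the neighborhood of already mapped atoms, so atoms far away from the partial solution are never inspected.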

## Algorithm 3

## Substructure pattern formulation for efficient computation

The formulation of substructure patterns is a tedious task. Most pattern languages are difficult to read and even more difficult to write, especially when defining isomeric or tautomeric structures. As a result, substructure formulations are focused on a correct chemical representation of a pattern. That formulation might be suboptimal for computational processing. Therefore, we present simple guidelines to optimize patterns for the search in molecules.

For an optimal formulation, the substructure must be in an order that allows an early processing of unusual nodes and edges, rare fragments and functional groups. Obviously, certain elements are more common than others. The same applies for substructure nodes that define a high number of atom properties or are part of an aromatic system. Unusual edges define aromatic bonds or those with a high bond order. Therefore, we write optimized substructures such that nodes with the rarest element, highest property specification and aromaticity as well as high order or aromatic bond definitions occur first. Additionally, we place substructure parts that are rather common or difficult to process at the end of the formulation. Nodes that specify generic atoms, hydrogen atoms, carbon atoms, and ring atoms are common. Chemical environments are difficult to process for most search algorithms, since they enforce an additional search step.

In the following, we perform every pattern reformulation by hand. Nevertheless, both algorithms are well suited for an automated optimization process. Ullmann’s algorithm processes substructure nodes according to their row numbers in the compatibility matrix. Since row numbers are assigned arbitrarily, they can be chosen to follow the order obtained by applying the given optimization rules. The VF2 uses an arbitrary node relation to obtain a total order. Therefore, the optimized order can be directly used.
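An automated version of the ordering rules could look roughly like the sketch below. The frequency table, the node representation and the names `ELEMENT_FREQUENCY`, `rarity_key` and `optimized_order` are assumptions made for illustration; in particular, the frequencies are placeholder values rather than measured ones, and the sketch ranks nodes without checking that the resulting order keeps the pattern connected.

```python
# Hypothetical relative element frequencies (placeholder values only);
# real values would have to be measured on the target database.
ELEMENT_FREQUENCY = {"C": 0.70, "N": 0.12, "O": 0.12,
                     "S": 0.03, "Cl": 0.02, "Br": 0.01}

def rarity_key(node):
    """Sort key for a query node: rare element first, then many specified
    properties, then aromatic nodes.

    node: dict with 'element' (str, or None for a generic atom),
    'n_properties' (int) and 'aromatic' (bool).
    """
    if node["element"] is None:
        freq = 1.0  # generic atoms match everything, so process them last
    else:
        freq = ELEMENT_FREQUENCY.get(node["element"], 0.0)  # unlisted = rare
    return (freq, -node["n_properties"], not node["aromatic"])

def optimized_order(nodes):
    """Indices of the query nodes, most selective node first."""
    return sorted(range(len(nodes)), key=lambda i: rarity_key(nodes[i]))

nodes = [
    {"element": "C", "n_properties": 0, "aromatic": False},
    {"element": None, "n_properties": 0, "aromatic": False},
    {"element": "Br", "n_properties": 0, "aromatic": False},
]
print(optimized_order(nodes))
```

The resulting index list could serve as the row order of the compatibility matrix or as the total node order, as described above.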

## Data sets

Both algorithms are tested in different application setups like complete database scans, substructure-based filter scenarios and individual substructure-molecule searches. The tests show the dependency of the algorithm run times on substructure formulation, substructure size and molecule size.

The data sets comprise 1336 SMARTS from the literature [28–37] and molecules from the ZINC lead-like and ZINC everything databases [1]. All data sets are provided in Additional file 1.

### Substructure search set

Molecule size is a crucial factor with respect to the algorithmic search time. To explore the influence of molecule size, we select a subset from the initial 1336 SMARTS. All duplicate expressions, expressions with errors, extensions and those that define isotopes or are disconnected are removed. The resulting set comprises 1235 SMARTS whose property overview is given in Additional file 2: Table S1. SMARTS allows the explicit formulation of hydrogen atoms and the definition of atom environments. When explicit hydrogen atoms are used, a search procedure must evaluate all hydrogen atoms, which roughly doubles the number of atoms to be evaluated. Atom environments induce an additional search step during the actual search procedure. In order to avoid misinterpretations of the results, we group the SMARTS patterns by the presence/absence of explicit hydrogens and recursive environments into individual sets. Additional file 2: Tables S2 – S19 give detailed statistics on SMARTS properties for every set.

**SMARTS, ZINC lead-like, ZINC everything test sets**

| | all SMARTS | | ZINC lead-like set | | ZINC everything set | |
|---|---|---|---|---|---|---|
| | no H nodes | H nodes | no H nodes | H nodes | no H nodes | H nodes |
| no recursion | 504 | 432 | 347 | 56 | 400 | 43 |
| recursion | 234 | 65 | 48 | 18 | 106 | 39 |

### Molecule search set

Substructure size is the second major factor regarding pattern matching time. A set to measure its impact is composed by randomly selecting molecules from ZINC lead-like containing all in all 80 different substructures of various sizes. The presence of so many substructures in a single molecule is rather rare, but selecting molecules with fewer patterns gives poor results. A selection was only possible for the set of SMARTS having no explicit hydrogen nodes and no recursive environments. The other three sets contain patterns of much higher complexity which are rarely present in one molecule or patterns that are designed to be complementary to each other, e.g., PAINS.

### PAINS substructure set

For a detailed case study, we choose 16 Pan Assay Interference Structures (PAINS) described by Baell et al. [38] as ‘filter family A’. The PAINS substructures should describe unspecific binders in protein-protein interaction assays. PAINS were originally given in SLN and converted to SMARTS by Rajarshi Guha using Cactvs [39]. The converted PAINS patterns include hydrogen atoms and recursive environments. The property distribution of the PAINS patterns is shown in Additional file 2: Table S20, and Additional file 3: Figures S2 – S5 depict each substructure.

### Worst-case test

Since highly symmetric molecules impose a challenge for substructure search algorithms, we test a phenyl-ring query against a fullerene target as a worst-case search scenario.

### Database subset

The database subset comprises the first 100,000 molecules from ZINC lead-like as of February 12th, 2011 and is designed to resemble a complete database. Its property distribution is similar to that of the full ZINC lead-like database as shown in Additional file 2: Table S25.

## Results and discussion

Search speed is measured on a single Intel(R) Xeon(R) CPU E5630 2.53GHz core. Each matching is repeated 400 times and average values are recorded. Average matching times are raw matching times excluding File I/O, molecule initialization and post-processing of search results, i.e., matching time only.
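The measurement protocol described above can be sketched as a small timing helper. The function name `average_match_time_ms` and the `match` callable are illustrative assumptions; the sketch only mirrors the idea of repeating a single matching 400 times and averaging, with the query and target prepared outside the timed region.

```python
import time

def average_match_time_ms(match, query, target, repeats=400):
    """Average wall-clock time of match(query, target) in milliseconds.

    Query and target are assumed to be fully initialized already, so file
    I/O, molecule setup and post-processing are not part of the timing.
    """
    start = time.perf_counter()
    for _ in range(repeats):
        match(query, target)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / repeats

# Usage with a stand-in matcher (substring test on plain strings).
print(average_match_time_ms(lambda q, t: q in t, "CO", "CCO"))
```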

We are aware of the fact that the evaluation is done with an example implementation of both algorithms that most likely has some room for optimization. Nevertheless, we believe that our results allow general conclusions regarding the algorithms’ behavior on molecular data.

### Overall search speed

### Explicit vs. implicit hydrogens

### Recursion vs. no recursion

### Molecule size

### Subgraph size

### Worst-case test

As a worst-case substructure search scenario, we test a phenyl-ring query against a C70 fullerene target. The Ullmann algorithm finds the first occurrence in 51.11 ms and all matches in 106.94 ms. The VF2 is about 130 times faster when it solves the problem for the first occurrence (0.39 ms) and about 5 times faster when searching for all matches (21.67 ms). Clearly, the phenyl-fullerene example is not the worst case when considering SMARTS substructures. Substructures with explicit hydrogen nodes or recursive atom environments yield much higher run times. Nevertheless, the phenyl-fullerene experiment gives good guidance on how the Ullmann and VF2 algorithms behave on highly symmetrical structures.

### Complete database search

Often substructure search algorithms are used in database search scenarios in which a database is scanned for all molecules that contain a given query structure. Even though most database search systems are able to eliminate a large number of molecules from the actual subgraph isomorphism search using screening techniques [10, 22, 41, 43–46], a remarkable number of molecules might remain. The following test simulates a sequential subgraph isomorphism test over a large set of molecules. We search all 1235 patterns from the Substructure Search Set against the Database Subset and measure the complete time to identify all molecules which contain such a substructure. Since the majority of the first 100,000 molecules of the ZINC lead-like database do not contain a given pattern, the search time is dominated by the algorithm’s ability to quickly identify the non-occurrence of a substructure in a molecule. A good screening method would of course enrich the molecules submitted to the isomorphism test with molecules containing the substructure of interest. Nevertheless, testing both algorithms for the ability to quickly detect molecules without a given pattern will reveal further insights into the algorithmic behavior. This test is only performed once, as minor changes in run time do not affect the order of magnitude.

… 10^{2}s, while the Ullmann completes 3.73% below 10s, 54.24% below 10^{2}s, 92.36% below 10^{3}s (16.6 min), 98.76% below 10^{4}s (2.78 h) and 99.85% below 10^{5}s (27.78 h). All in all, in rare instances a database search system that uses the Ullmann algorithm might need over a day to give results for a single query, even though most of the molecules might be eliminated from the subgraph isomorphism test.

### Parallelization scaling

The subgraph isomorphism problem is nearly perfectly suited for parallel computing when matching one query structure against many target structures. One simple but effective solution is a parallelization by data separation of the target structures. An alternative is an algorithm-level parallelization based on the algorithms’ recursion. Since most substructure searches are below 1 ms and most molecules consist of fewer than 100 atoms, a parallelization of one substructure against one target search is most likely not as efficient as searching in parallel on the data level. The situation might change when searching large query substructures against large target structures, e.g., searching for motifs in proteins.

In order to evaluate the efficiency of data-level parallelization, we test both algorithms with the same data separation strategy on the PAINS Substructure Set against the complete ZINC lead-like database on different numbers of CPU cores. The target structures are split into equal blocks such that each core gets the query structures and the same number of molecules. The measurement on one core is performed in sequential and parallel mode so that the computational overhead for parallelization becomes directly visible. Detailed tables on the matching times and scaling factors on different numbers of cores can be found in Additional file 2: Tables S26 – S27.

Both algorithms show good scaling behavior on all instances. On 8 cores the search times are decreased by an average factor of 5.6 for the VF2 and 6.92 for Ullmann’s algorithm. The overall slightly better scaling of the Ullmann algorithm can be explained by the longer matching times. Longer matching times reduce the parallelization overhead relative to the matching time.
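The data separation strategy and the reported scaling factors can be illustrated with the following sketch. It is an assumption-laden illustration rather than the setup used in this study: the names `split_blocks`, `parallel_hits` and `efficiency` are hypothetical, `matches` stands in for any substructure test, and a thread pool is used only to keep the example self-contained (CPU-bound pure-Python matching would need processes to benefit from multiple cores).

```python
from concurrent.futures import ThreadPoolExecutor

def split_blocks(items, n_blocks):
    """Split items into n_blocks contiguous blocks of (nearly) equal size."""
    size, rest = divmod(len(items), n_blocks)
    blocks, start = [], 0
    for b in range(n_blocks):
        end = start + size + (1 if b < rest else 0)
        blocks.append(items[start:end])
        start = end
    return blocks

def parallel_hits(matches, queries, molecules, n_workers):
    """Search every query in every molecule, one molecule block per worker."""
    def scan(block):
        return [(q, mol) for mol in block for q in queries if matches(q, mol)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(scan, split_blocks(molecules, n_workers))
        return [hit for part in parts for hit in part]

def efficiency(speedup, n_cores):
    """Parallel efficiency: achieved speedup divided by the ideal speedup."""
    return speedup / n_cores

# Stand-in matcher: substring test on SMILES-like strings.
print(parallel_hits(lambda q, mol: q in mol, ["Cl"], ["CCl", "CCO", "ClCCl"], 2))
# Efficiencies for the reported 8-core factors of 5.6 (VF2) and 6.92 (Ullmann).
print(efficiency(5.6, 8), efficiency(6.92, 8))
```

With the reported factors, the efficiency is roughly 0.70 for the VF2 and roughly 0.87 for Ullmann’s algorithm, consistent with the explanation that longer matching times make the fixed parallelization overhead relatively smaller.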

### SMARTS pattern case studies

To explore the possibility of reducing search time by rearranging the subgraph formulation, we created three different formulations for each substructure of the PAINS Substructure Set: the *original* substructure formulation as created by Cactvs, an *optimized* version obtained by applying the reformulation guidelines described in the “Substructure pattern formulation for efficient computation” section, and an *anti-optimized* version obtained by applying the rules in reverse. All three formulations are searched against the complete ZINC lead-like database.

**Optimization run time examples**

| | Ull. time [s] | Ull. speedup | VF2 time [s] | VF2 speedup | matches |
|---|---|---|---|---|---|
| original | 157139.71 | 1.00 | 170.42 | 1.00 | 11699 |
| optimized | 157027.63 | 1.00 | 168.56 | 1.01 | 11699 |
| anti-opt. | 154195.33 | 1.02 | 2664.49 | -15.64 | 11699 |
| **PAINS 12** | | | | | |
| original | 3119.04 | 1.00 | 1698.42 | 1.00 | 9056 |
| optimized | 2142.41 | 1.46 | 142.28 | 11.94 | 9056 |
| anti-opt. | 3077.34 | 1.01 | 1675.40 | 1.01 | 9056 |
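The speedup columns appear to follow a simple convention that can be sketched as below: the time of the original formulation divided by the time of the variant, with a slowdown reported as a negative factor. This convention is inferred from the tabulated numbers and is an assumption, as is the function name `signed_speedup`; recomputing from the rounded times reproduces the tabulated factors only to within rounding.

```python
def signed_speedup(original_s, variant_s):
    """Speedup of a variant relative to the original formulation.

    Returns original/variant if the variant is at least as fast, and the
    negated slowdown factor variant/original otherwise.
    """
    if variant_s <= original_s:
        return original_s / variant_s
    return -variant_s / original_s

# VF2 times for PAINS 12: original vs. optimized formulation.
print(round(signed_speedup(1698.42, 142.28), 2))
# VF2 times of the first example: original vs. anti-optimized formulation.
print(signed_speedup(170.42, 2664.49))
```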

### Ullmann faster than VF2

**Ullmann faster than VF2 without optimization examples**

| SMARTS | Ullmann time [ms] | VF2 time [ms] |
|---|---|---|
| `[#6]C(=[O,SX2])[CX4]C(=[O,SX2])[#6]` | 0.868 | 0.948 |
| `[O,SX2]=C([#6])[CX4]C(=[O,SX2])[#6]` | 0.654 | 0.271 |
| `[#6]C(=[O,SX2])C(=[O,SX2])[#6]` | 0.938 | 1.046 |
| `[O,SX2]=C([#6])C(=[O,SX2])[#6]` | 0.668 | 0.203 |
| `[a]~*~*-[CH3]` | 0.479 | 0.601 |
| `[CH3]-*~*~[a]` | 0.209 | 0.074 |
| `[C](=O)([C,c,O,S])[C,c,O,S]` | 0.400 | 0.558 |
| `O=[C]([C,c,O,S])[C,c,O,S]` | 0.403 | 0.144 |
| `[CD3H0,R](=[SD1H0])([ND2H1,R])([ND2H1,R])` | 0.251 | 0.510 |
| `[SD1H0]=[CD3H0,R]([ND2H1,R])([ND2H1,R])` | 0.242 | 0.076 |
| `[nD3H0,R](~[OD1H0])(a)a` | 0.290 | 0.435 |
| `[OD1H0]~[nD3H0,R](a)a` | 0.290 | 0.091 |
| `[R](-*(-*))~*~*~*~[a]` | 2.082 | 2.774 |
| `[a]~*~*~*~[R](-*(-*))` | 1.764 | 0.906 |
| `c([OH])c([OH])c([OH])` | 0.581 | 0.708 |
| `[OH]cc([OH])c([OH])` | 0.581 | 0.274 |
| `c1([OH])c(O[CH3])cccc1` | 0.805 | 0.947 |
| `[OH]c1c(O[CH3])cccc1` | 0.797 | 0.169 |
| `c1([OH])ccc(O[CH3])cc1` | 0.74 | 0.922 |
| `[OH]c1ccc(O[CH3])cc1` | 0.734 | 0.193 |

## Conclusions

We presented, to our knowledge, the first comparison between the Ullmann and VF2 subgraph isomorphism algorithms on molecular data and the first data set to perform such a benchmark. Using SMARTS as molecular substructure language, we explored the influence of substructure and molecular size as well as the usage of explicit hydrogen nodes and recursive environment specification on the matching time. Both algorithms were additionally tested for use in complete database scans and for their suitability for data-based parallelization. Additionally, we presented an optimization strategy to reduce matching times by substructure pattern reformulation.

In conclusion, the VF2 algorithm outperforms the Ullmann algorithm in all test cases when supplied with a favorable substructure formulation and seems to be more robust in terms of run time outliers. Even though the VF2 is generally faster, both algorithms perform most single substructure-molecule searches in times below one millisecond, which seems acceptable for most cheminformatics applications. Nevertheless, we recommend using the VF2 algorithm for molecular substructure searching in cheminformatics software because it shows a general run time superiority of about one order of magnitude.

The syntactic formulation of a substructure in terms of arrangement might be a critical point for the underlying subgraph isomorphism algorithm. Our experiments show that the VF2 algorithm is sensitive to the substructure’s formulation while the Ullmann algorithm is not. Therefore, other subgraph isomorphism algorithms might show the same sensitivity and need to be experimentally tested.

Fortunately, the subgraph reformulation rules shown here do not have to be applied by hand. The VF2 algorithm is based on a precalculated node order which can be manipulated following the reformulation rules. Due to the sensitivity of the VF2 algorithm to node rearrangements, the algorithm has further room for optimization.

## Declarations

### Acknowledgements

Many thanks to Angela M. Henzler for revising the manuscript, Karen Schomburg for the help on collecting the SMARTS expressions and Sascha Urbaczek, J. Robert Fischer, Adrian Kolodzik, Tobias Lippert, and Matthias Hilbig for their work on the molecule software components.

## Authors’ Affiliations

## References

- Irwin J, Shoichet B: ZINC–a free database of commercially available compounds for virtual screening. J Chem Inf Model. 2005, 45: 177-182. 10.1021/ci049714+.View ArticleGoogle Scholar
- Bolton EE, Wang Y, Thiessen PA, Bryant SH: Chapter 12 PubChem: Integrated Platform of Small Molecules and Biological Activities. Annual Reports in Computational Chemistry Volume 4, Volume 4 of, Annual Reports in Computational Chemistry. Edited by: Wheeler RA, Spellmeyer DC. 2008, Elsevier, 217-241. [http://www.sciencedirect.com/science/article/pii/S1574140008000121]View ArticleGoogle Scholar
- Sussenguth EH: A graph-theoretic algorithm for matching chemical Structures. J Chem Documentation. 1965, 5: 36-43. 10.1021/c160016a007. [http://pubs.acs.org/doi/abs/10.1021/c160016a007]View ArticleGoogle Scholar
- Figueras J: Substructure search by set reduction. J Chem Documentation. 1972, 12 (4): 237-244. 10.1021/c160047a010. [http://pubs.acs.org/doi/abs/10.1021/c160047a010]View ArticleGoogle Scholar
- Read RC, Corneil DG: The graph isomorphism disease. J Graph Theory. 1977, 1 (4): 339-363. 10.1002/jgt.3190010410. [http://dx.doi.org/10.1002/jgt.3190010410]View ArticleGoogle Scholar
- Gati G: Further annotated bibliography on the isomorphism disease. J Graph Theory. 1979, 3 (2): 95-109. 10.1002/jgt.3190030202. [http://dx.doi.org/10.1002/jgt.3190030202]View ArticleGoogle Scholar
- Ullmann JR: An algorithm for subgraph isomorphism. J Assoc Comput Mach. 1976, 23: 31-42. 10.1145/321921.321925.View ArticleGoogle Scholar
- Attias R: DARC substructure search system: a new approach to chemical information. J Chem Inf Comput Sci. 1983, 23 (3): 102-108. 10.1021/ci00039a003. [http://pubs.acs.org/doi/abs/10.1021/ci00039a003]View ArticleGoogle Scholar
- Heyman J, Karasinskia E, Giles P: CAS information services for medicinal chemists. Drug Inf J. 1982, 16 (4): 185-190.Google Scholar
- Willett P, Barnard JM, Downs GM: Chemical similarity searching. J Chem Inf Model. 1998, 38 (6): 983-996. 10.1021/ci9800211. [http://dx.doi.org/10.1021/ci9800211]View ArticleGoogle Scholar
- Cordella L, Foggia P, Sansone C, Vento M: Performance evaluation of the VF graph matching algorithm. Image Analysis and Processing, 1999. Proceedings. International Conference on. 1999, 1172-1177.Google Scholar
- Cordella LP, Foggia P, Sansone C, Vento M: A (sub)graph isomorphism algorithm for matching large graphs. IEEE Trans Pattern Anal Machine Intelligence. 2004, 26 (10): 1367-1372. 10.1109/TPAMI.2004.75.View ArticleGoogle Scholar
- Yan X, Yu PS, Han J: Proceedings of the 2005 ACM SIGMOD international conference on, Management of data, SIGMOD ’05. 2005, New York, NY, USA: ACM, 766-777. [http://doi.acm.org/10.1145/1066157.1066244]View ArticleGoogle Scholar
- Golovin A, Henrick K: Chemical substructure search in SQL. J Chem Inf Model. 2009, 49: 22-27. 10.1021/ci8003013.
- Willett P, Wilson T, Reddaway SF: Atom-by-atom searching using massive parallelism. Implementation of the Ullmann subgraph isomorphism algorithm on the distributed array processor. J Chem Inf Comput Sci. 1991, 31 (2): 225-233. 10.1021/ci00002a008. [http://pubs.acs.org/doi/abs/10.1021/ci00002a008]
- Messmer BT: Efficient Graph Matching Algorithms. 1995
- Foggia P, Sansone C, Vento M: A performance comparison of five algorithms for graph isomorphism. Proc of the 3rd IAPR TC-15 Workshop on Graph-based Representations in Pattern Recognition. 2001, 188-199.
- Brint AT, Willett P: Algorithms For the Identification of 3-dimensional Maximal Common Substructures. J Chem Inf Comput Sci. 1987, 27 (4): 152-158. 10.1021/ci00056a002.
- Downs GM, Lynch MF, Willett P, Manson GA, Wilson GA: Transputer implementations of chemical substructure searching algorithms. Tetrahedron Comput Methodology. 1988, 1 (3): 207-217. 10.1016/0898-5529(88)90026-7. [http://dx.doi.org/10.1016/0898-5529(88)90026-7]
- Barnard JM: Substructure searching methods: old and new. J Chem Inf Comput Sci. 1993, 33 (4): 532-538. 10.1021/ci00014a001. [http://pubs.acs.org/doi/abs/10.1021/ci00014a001]
- Oprea TI: Chemoinformatics in drug discovery. 2005, Weinheim: Wiley-VCH, 76–79. chap. 4.4.2.1.
- Agrafiotis DK, Lobanov VS, Shemanarev M, Rassokhin DN, Izrailev S, Jaeger EP, Alex S, Farnum M: Efficient Substructure Searching of Large Chemical Libraries: The ABCD Chemical Cartridge. J Chem Inf Model. 2011, 51: 3113-3130. 10.1021/ci200413e. [http://pubs.acs.org/doi/abs/10.1021/ci200413e]
- Falkenhainer B, Forbus KD, Gentner D: The structure-mapping engine: algorithm and examples. Artif Intelligence. 1989, 41: 1-63. 10.1016/0004-3702(89)90077-5.
- Tarjan RE: Graph Algorithms in Chemical Computation. 1977, American Chemical Society, 1–20. chap. 2. [http://pubs.acs.org/doi/abs/10.1021/bk-1977-0046.ch001]
- Daylight Theory Manual, Daylight Chemical Information Systems Inc. 2011
- Ash S, Cline MA, Homer RW, Hurst T, Smith GB: SYBYL line notation (SLN): A versatile language for chemical structure representation. J Chem Inf Comput Sci. 1997, 37: 71-79. 10.1021/ci960109j.
- Koniver DA, Wiswesser WJ, Usdin E: Wiswesser line notation: simplified techniques for converting chemical structures to WLN. Science. 1972, 176 (4042): 1437-1439. 10.1126/science.176.4042.1437. [http://dx.doi.org/10.1126/science.176.4042.1437]
- Hann M, Hudson B, Lewell X, Lifely R, Miller L, Ramsden N: Strategic pooling of compounds for high-throughput screening. J Chem Inf Comput Sci. 1999, 39 (5): 897-902. 10.1021/ci990423o. [http://pubs.acs.org/doi/abs/10.1021/ci990423o]
- Walters W, Murcko MA: Prediction of ‘drug-likeness’. Adv Drug Delivery Rev. 2002, 54 (3): 255-271. 10.1016/S0169-409X(02)00003-0. [http://www.sciencedirect.com/science/article/pii/S0169409X02000030]. [Computational Methods for the Prediction of ADME and Toxicity]
- Abolmaali SFB, Wegner JK, Zell A: The compressed feature matrix - a fast method for feature based substructure search. J Mol Model. 2003, 9: 235-241. 10.1007/s00894-003-0126-0. [http://dx.doi.org/10.1007/s00894-003-0126-0]
- Olah M, Bologa C, Oprea TI: An automated PLS search for biologically relevant QSAR descriptors. J Comput Aided Mol Des. 2004, 18: 437-449. 10.1007/s10822-004-4060-8. [http://dx.doi.org/10.1007/s10822-004-4060-8]
- Maass P, Schulz-Gasch T, Stahl M, Rarey M: Recore: a fast and versatile method for scaffold hopping based on small molecule crystal structure conformations. J Chem Inf Model. 2007, 47 (2): 390-399. 10.1021/ci060094h. [http://pubs.acs.org/doi/abs/10.1021/ci060094h]. [PMID: 17305328]
- Degen J, Wegscheid-Gerlach C, Zaliani A, Rarey M: On the art of compiling and using ‘drug-like’ chemical fragment spaces. ChemMedChem. 2008, 3: 1503-1507.
- Ahmed HEA, Vogt M, Bajorath J: Design and evaluation of bonded atom pair descriptors. J Chem Inf Model. 2010, 50: 487-499. 10.1021/ci900512g.
- Daylight SMARTS examples; Daylight Chemical Information Systems, Inc. http://www.daylight.com/dayhtml_tutorials/languages/smarts/smarts_examples.html
- Agrafiotis DK, Gibbs AC, Zhu F, Izrailev S, Martin E: Conformational sampling of bioactive molecules: a comparative study. J Chem Inf Model. 2007, 47 (3): 1067-1086. 10.1021/ci6005454. [http://pubs.acs.org/doi/abs/10.1021/ci6005454]. [PMID: 17411028]
- Enoch SJ, Madden JC, Cronin MTD: Identification of mechanisms of toxic action for skin sensitisation using a SMARTS pattern based approach. SAR QSAR Environ Res. 2008, 19 (5-6): 555-578. 10.1080/10629360802348985. [http://dx.doi.org/10.1080/10629360802348985]
- Baell JB, Holloway GA: New substructure filters for removal of Pan Assay Interference Compounds (PAINS) from screening libraries and for their exclusion in Bioassays. J Med Chem. 2010, 53 (7): 2719-2740. 10.1021/jm901137j. [http://pubs.acs.org/doi/abs/10.1021/jm901137j]. [PMID: 20131845]
- Ihlenfeldt WD, Takahashi Y, Abe H, Sasaki S-I: Computation and management of chemical properties in CACTVS: An extensible networked approach toward modularity and compatibility. J Chem Inf Comput Sci. 1994, 34: 109-116. 10.1021/ci00017a013.
- Xu J: GMA: a generic match algorithm for structural homomorphism, isomorphism, and maximal common substructure match and its applications. J Chem Inf Comput Sci. 1996, 36: 25-34. 10.1021/ci950061u. [http://pubs.acs.org/doi/abs/10.1021/ci950061u]
- Gasteiger J, Engel T (Eds): Chemoinformatics: A Textbook. 2003, Wiley-VCH, 1st edition. [http://www.worldcat.org/isbn/3527306811]
- Schomburg K, Ehrlich HC, Stierand K, Rarey M: From structure diagrams to visual chemical patterns. J Chem Inf Model. 2010, 50 (9): 1529-1535. 10.1021/ci100209a. [http://dx.doi.org/10.1021/ci100209a]
- Ozawa K, Yasuda T, Fujita S: Substructure search with tree-structured data. J Chem Inf Comput Sci. 1997, 37 (4): 688-695. 10.1021/ci960378+. [http://pubs.acs.org/doi/abs/10.1021/ci960378%2B]
- Rughooputh SDDV, Rughooputh HCS: Neural network based chemical structure indexing. J Chem Inf Comput Sci. 2001, 41 (3): 713-717. 10.1021/ci000394d. [http://pubs.acs.org/doi/abs/10.1021/ci000394d]
- Miller MA: Chemical database techniques in drug discovery. Nat Rev Drug Discov. 2002, 1 (3): 220-227. 10.1038/nrd745. [http://dx.doi.org/10.1038/nrd745]
- Jeliazkova N, Kochev N: AMBIT-SMARTS: efficient searching of chemical structures and fragments. Mol Informatics. 2011, 30 (8): 707-720. [http://dx.doi.org/10.1002/minf.201100028]

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.