# Efficient enumeration of monocyclic chemical graphs with given path frequencies

- Masaki Suzuki
^{1}, - Hiroshi Nagamochi
^{1}Email author and - Tatsuya Akutsu
^{2}Email author

**6**:31

https://doi.org/10.1186/1758-2946-6-31

© Suzuki et al.; licensee Chemistry Central Ltd. 2014

**Received: **28 September 2013

**Accepted: **21 May 2014

**Published: **30 May 2014

## Abstract

### Background

The enumeration of chemical graphs (molecular graphs) satisfying given constraints is one of the fundamental problems in chemoinformatics and bioinformatics because it leads to a variety of useful applications including structure determination and development of novel chemical compounds.

### Results

We consider the problem of enumerating chemical graphs with monocyclic structure (a graph structure that contains exactly one cycle) from a given set of feature vectors, where a feature vector represents the frequency of the prescribed paths in a chemical compound to be constructed and the set is specified by a pair of upper and lower feature vectors. To enumerate all tree-like (acyclic) chemical graphs from a given set of feature vectors, Shimizu et al. and Suzuki et al. proposed efficient branch-and-bound algorithms based on a fast tree enumeration algorithm. In this study, we devise a novel method for extending these algorithms to enumeration of chemical graphs with monocyclic structure by designing a fast algorithm for testing uniqueness. The results of computational experiments reveal that the computational efficiency of the new algorithm is as good as those for enumeration of tree-like chemical compounds.

### Conclusions

We succeed in expanding the class of chemical graphs that are able to be enumerated efficiently.

## Keywords

## Introduction

The enumeration of chemical structures satisfying given constraints is an important topic in chemoinformatics [1–3]. Applications of the enumeration of chemical structures include structure determination using mass-spectrum and/or NMR-spectrum [4, 5], virtual exploration of the chemical universe [6, 7], reconstruction of molecular structures from their signatures [8, 9], and classification of chemical compounds [10]. The enumeration problem is also important for development of novel chemical compounds because virtual exploration of chemical universe and reconstruction of molecular structures from their signatures are considered to be important elementary technologies. The enumeration of chemical structures has a long history. Cayley [11] considered the enumeration of structural isomers of alkanes in the 19th century. The seminal work of Pólya on counting the number of isomers using group theory is also famous [12].

*monocyclic graph*is an undirected connected graph containing exactly one cycle (a graph is connected if there exists a path connecting every pair of vertices), and a constraint is given in the form of a set of

*feature vectors*(i.e., a set of descriptors). We assume that each feature vector specifies the number of occurrences of each labeled path of length up to a given constant

*K*, where a labeled path is an alternating sequence of atom names and bond types (see Figure 1 for an example of a feature vector). We also assume that a set of feature vectors is given by specifying the minimum and maximum numbers of occurrences of each labeled path. We develop an efficient algorithm for this enumeration problem by extending existing algorithms [13, 14] for enumerating tree-like chemical structures (i.e., chemical structures without cycles). In this extension, some novel concepts are introduced and rigorous mathematical analysis is performed in orer to guarantee the correctness of the algorithm.

In order to verify the computational efficiency of our proposed algorithm, we perform computational experiments using a set of some chemical compounds from the KEGG LIGAND database [15]. The results suggest that the proposed algorithm enumerates chemical structures having monocyclic graph structures as nearly efficiently as tree-like chemical graphs have been enumerated.

The rest of this paper is organized as follows. First, we review some mathematical definitions and give a formal definition of the enumeration problem for chemical structures with monocyclic graph structures. Next, we review background and related work. Then, we present the algorithm and the results of computational experiments. Finally, we conclude with future work. Mathematical proofs, pseudocodes for the algorithm, and some details on computational experiments are given in Additional file 1.

## Preliminaries and problem formulation

*K*. In the example of Figure 2, we consider paths of lengths 0 and 1, where a path of length 0 corresponds to a single atom and a path of length 1 corresponds to a bond including its endpoint atoms. For example, the columns O, N, and C of feature vector

*f*

_{1}(

*G*) mean that each target structure must contain exactly one oxygen, two nitrogen, and three carbon atoms, respectively. The columns N1O, N1C, and C2C mean that each target structure must contain exactly one single bond connecting N and O, two single bonds connecting C and N, and one double bond connecting C and C. It should be noted that one single bond connecting N and O is counted by both O1N and N1O. Then, the chemical structure

*G*is consistent with

*f*

_{1}(

*G*). However, another chemical structure may be consistent with a given feature vector. For example, the feature vector remains the same even if the double bond (along with the branching carbon atom) is moved into the backbone chain. Therefore, it is desirable to enumerate all chemical structures consistent with a given feature vector and the valence condition (specified by val(…)). On the other hand, there may not exist any consistent chemical structure if

*K*is large; thus it may not be appropriate to uniquely specify a feature vector. Therefore, we assume in our target problem that upper and lower bounds of the number of occurrences of each labeled path are given as shown in Figure 3.

*k-augmented tree*if the number of adjacent vertex pairs (i.e., vertex pairs connected by edges) minus the number of vertices is

*k*−1 (hence a multi-tree is a 0-augmented tree). That is, a

*k*-augmented tree is a graph obtained by adding edges to

*k*different pairs of nonadjacent vertices in a multi-tree (see Figure 4). The problem considered in this paper is to enumerate all 1-augmented trees satisfying specified upper and lower bound conditions on feature vectors. In the following, we provide the mathematical definition of the problem. We assume that readers have some familiarity with basic concepts in graph theory. For those who are not familiar with graph theory, we suggest referring to an appropriate textbook (e.g., [16]). Readers not interested in mathematical details can skip this part.

A graph is defined to be an ordered pair (*V*,*E*) of a finite set *V* of vertices and a finite set *E* of edges, where an edge is an unordered pair of distinct vertices (thus no self-loop exists), where an edge with two end-vertices *u* and *v* is denoted by *uv*. A graph is called a *multi-graph* when *E* is not necessarily composed of distinct pairs of vertices (thus multiple edges are allowed in a multi-graph and *E* is no longer a set, but a multi-set), and is called a *simple graph* if no multiple edges are allowed. The multiplicity (the number of multiple edges) between two vertices *u* and *v* is denoted by *m*(*u*,*v*). An edge in a multi-graph *G* is called *simple* if its multiplicity in *G* is one. We denote the vertex set and edge set of a graph *G* by *V*(*G*) and *E*(*G*), respectively. For a vertex *v* in a multi-graph *G*, let deg(*v*;*G*) denote the number of edges incident to vertex *v* (i.e., degree of *v*). In this paper, a cycle is a closed path with a length at least three (two edges with the same endvertices are not treated as a cycle), and a connected multi-graph (resp., simple graph) with no cycle is a *multi-tree* (resp., *simple tree*). A *k-augmented tree* is a connected multi-graph such that the number of adjacent vertex pairs (i.e., vertex pairs connected by edges) minus the number of vertices is *k*−1. For two vertices *u* and *v* in a multi-graph *G*, let *G*+*u* *v* denote the multi-graph obtained by adding a new edge *uv* to *G*; when *u* *v*∈*E*(*G*), let *G*−*u* *v* denote the multi-graph obtained by removing *uv* from *G*. Let ${\mathbb{Z}}_{+}$ denote the set of nonnegative integers, and let *Σ* be a set of colors, which correspond to chemical elements such as H, O and C. Let each color *c*∈*Σ* be associated with a *valence*$\text{val}(c)\in {\mathbb{Z}}_{+}$. A multi-graph *G* is said to be *Σ*−*colored* if each vertex *v* has a color *c*(*v*)∈*Σ*. Chemical compounds can be viewed as *Σ*-*colored*, self-loopless connected multi-graphs, where vertices and colors represent atoms and elements, respectively.

Let $d\in {\mathbb{Z}}_{+}$ be a prescribed integer, which corresponds to the maximum multiplicity of chemical graphs, and *Σ*^{k,d} denote the set of all alternating sequences (*c*_{0},*m*_{1},*c*_{1},…,*m*_{
k
},*c*_{
k
}) consisting of colors *c*_{0},*c*_{1},…,*c*_{
k
}∈*Σ* and *m*_{1},*m*_{2},…,*m*_{
k
}∈{1,2,…,*d*}. We denote the union of *Σ*^{0,d}, *Σ*^{1,d}, *Σ*^{2,d},…,*Σ*^{k,d} by *Σ*^{≤k,d}. Let ${\mathcal{F}}_{k}(\Sigma ,d)$ be the set of all mappings *g* from *Σ*^{≤k,d} to ${\mathbb{Z}}_{+}$, i.e., ${\mathcal{F}}_{k}(\Sigma ,d)=\{g:{\Sigma}^{\le k,d}\to {\mathbb{Z}}_{+}\}$.

For a path *P*=(*v*_{0},*m*_{1},*v*_{1},…,*m*_{
k
},*v*_{
k
}) such that *V*(*P*)={*v*_{0},*v*_{1},…,*v*_{
k
}}, *E*(*P*)={*v*_{0}*v*_{1},*v*_{1}*v*_{2},…,*v*_{k−1}*v*_{
k
}}, and *m*_{
i
}=*m*(*v*_{i−1},*v*_{
i
}) is the multiplicity of edge *v*_{i−1}*v*_{
i
}, the length of *P* is defined to be *k*=|*V*(*P*)|−1, and the color sequence *c*(*P*) of *P* is defined to be the sequence *c*(*P*)=(*c*(*v*_{0}),*m*_{1},*c*(*v*_{1}),…,*m*_{
k
},*c*(*v*_{
k
}))∈*Σ*^{k,d}.

Given a multi-graph *G* and a sequence *t*∈*Σ*^{k,d} for some *k*, let *o* *c* *c*(*t*,*G*) denote the number of paths *P* in *G* such that *c*(*P*)=*t*. For an integer $K\in {\mathbb{Z}}_{+}$, *the feature vector* *f*_{
K
}(*G*) of level *K* in *G* is defined to be the |*Σ*^{≤K,d}|-dimensional vector *f*_{
K
}(*G*) whose value at each entry *t*∈*Σ*^{≤K,d} is given by *f*_{
K
}(*G*)[ *t*]=*o* *c* *c*(*t*,*G*). In this paper, we treat hydrogen-suppressed chemical graphs with carbon C, nitrogen N or oxygen O, which are represented by *Σ*-colored multi-graphs *G* with color set *Σ*={O, N, C }. Figure 2 illustrates an example of *Σ*-colored multi-graph *G* that represents a hydrogen-suppressed chemical graph and its feature vector *f*_{1}(*G*).

Note that in hydrogen-suppressed chemical graph *G*, deg(*v*;*G*)<val(*c*(*v*)) may hold for some vertex *v*∈*V*(*G*). Let us define the *residue degree* res(*v*) of a vertex *v* to be val(*c*(*v*))−deg(*v*;*G*). In a multi-graph *G*, we interpret res(*v*) of a vertex *v* as the number of hydrogen atoms attached to the vertex *v* (in our proposed procedure, we also interpret res(*v*) as the number of new edges/bonds that can be attached to *v* when *G* is being constructed by adding more edges).

For a vector $g\in {\mathcal{F}}_{K}(\Sigma ,d)$ of level *K*≥1, a multi-graph *G* with *f*_{
K
}(*G*)=*g* is a multi-graph such that the occurrence of each path *t*=(*c*_{0},*m*_{1},*c*_{1},…,*m*_{
p
},*c*_{
p
}) in *G* with length of at most *K* is completely specified by *g*[ *t*], in particular *V*(*G*)={*t*∣*g*[ *t*]≥1,*t*∈*Σ*^{0,d}=*Σ*} (i.e., *G* has exactly *g*[ *t*] vertices of color *t*), *E*(*G*)={*t*∣*g*[ *t*]≥1,*t*=(*c*,*m*,*c*^{′})∈*Σ*^{1,d}} (i.e., *G* has exactly *g*[ (*c*,*m*,*c*^{′})] edges of multiplicity *m* that join a vertex of color *c* and a vertex of color *c*^{′}).

For two vectors ${g}_{L},{g}_{U}\in {\mathcal{F}}_{K}(\Sigma ,d)$ and an integer *k*≥0, let ${\mathcal{G}}_{k}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U})$ denote the set of all *Σ*-colored *k*-augmented trees *G* such that *g*_{
L
}≤*f*_{
K
}(*G*)≤*g*_{
U
} (i.e., *f*_{
K
}(*G*)=*g*^{′} for some *g* with *g*_{
L
}≤*g*^{′}≤*g*_{
U
}) and deg(*v*;*G*)≤val(*c*(*v*)), *v*∈*V*(*G*).

Our problem is to enumerate all *k*-augmented trees *G* on a given set of atoms each of which is consistent with one of the feature vectors between the lower and upper vectors ${g}_{U},{g}_{L}\in {\mathcal{F}}_{K}(\Sigma ,d)$, such that *g*_{
L
}≤*g*_{
U
} (where *g*_{
L
}[ *t*]=*g*_{
U
}[ *t*] for all *t*∈*Σ*^{0,d} since the vertex set is fixed for all *G*).

In what follows, we fix a color set *Σ* and an upper bound *d* on multiplicity. We define the problem of enumerating *k*-augmented trees as follows.

**Enumerating chemical graphs with given upper and lower path frequency (EULF)**

Given a maximum path length $K\in {\mathbb{Z}}_{+}$ and feature vectors ${g}_{U},{g}_{L}\in {\mathcal{F}}_{K}(\Sigma ,d)$ such that *g*_{
L
}[ *t*]=*g*_{
U
}[ *t*] for all *t*∈*Σ*^{0,d}, enumerate all multi-graphs $G\in {\mathcal{G}}_{k}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U})$.

Figure 3 illustrates an example of an input of EULF with upper and lower feature vectors *g*_{
L
} and *g*_{
U
} and part of its output, multi-trees ${G}_{1},{G}_{2}\in {\mathcal{G}}_{0}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U})$ and a 1-augmented tree ${G}_{3}\in {\mathcal{G}}_{1}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U})$.

For *k*=0, we have developed an efficient algorithm for EULF [13, 14]. The purpose of this work is to describe an algorithm for EULF with *k*=1. We assume that the maximum valence is 4 and mainly enumerate a 1-augmented tree such that the cycle contains an edge of multiplicity one (a single bond), since otherwise a 1-augmented tree is a single cycle consisting of edges of multiplicity two, which can be separately handled as a special rare case.

## Background

As mentioned in Introduction, enumeration of chemical structures has a long history and many studies have been done. In the field of machine learning, a similar problem, which is called the *preimage problem*, has been studied [17, 18]. In this problem, a desired object is computed as a feature vector in a feature space, and then the feature vector is mapped back to the input space, where this mapped back object is called a preimage. The definition of the feature vectors based on the frequency of labeled paths [19, 20] or small fragments [10, 21] has been widely used. Akutsu and Fukagawa [22] formulated the graph preimage problem as the problem of inferring graphs from the frequency of paths of labeled vertices and proved that the problem is computationally intractable (NP-hard) even for planar graphs with bounded degrees [22]. Nagamochi [23] proved that a graph determined by the frequency of paths with length one can be found in polynomial time if any exists.

The preimage problem has also been studied in the field of chemoinformatics as a part of inverse QSAR/QSPR (quantitative structure-activity relationship/quantitative structure-property relationship) studies. Indeed, the problem is essentially the same as reconstruction and/or enumeration of molecules from their descriptors in inverse QSAR/QSPR [8, 9, 24, 25], where the descriptors correspond to feature vectors in the preimage problem. Wong and Burkowski developed a practical preimage based method and demonstrated that it actually generated the structure of a new drug candidate [26]. For enumeration of molecules from descriptors, useful tools such as MOLGEN have been developed [27]. However, they are not very efficient if large structures are to be enumerated because many of them treat general graph structures (under the valence constraint).

It might be possible to develop significantly faster algorithms for the preimage problem if we restrict the class of target chemical structures and employ recent techniques for enumeration of graph structures. Fujiwara *et al.*[28] studied enumeration of tree-like chemical graphs that satisfy a given feature vector which specifies frequency of paths of up to a prescribed length *K* in a chemical compound to be constructed. They proposed a branch-and-bound algorithm that consists of a branching procedure based on the tree enumeration algorithm by Nakano and Uno [29, 30] and bounding operations designed by properties on path frequency and atom-atom bonds. They showed by means of computational experiments on enumeration of alkane isomers that their algorithm works at least as efficiently as the fastest algorithm while using much less memory space.

To reduce the size of the search space, Ishida *et al*. [31] have introduced a new bounding operation, called the *detachment-cut*, based on the result of Nagamochi [23]. In this problem formulation, it is required that the path frequency of a chemical structure is exactly the same as the specified one. However, there does not exist such a structure in many cases because a mapping between chemical structures and feature vectors is not surjective and thus there are many vectors in a feature space that do not have preimages. To seek solutions effectively in a relaxed constraint, Shimizu *et al*. [13] recently introduced a problem of enumerating tree-like hydrogen-suppressed chemical graphs that satisfy one of a given set of feature vectors which is specified by a pair of upper and lower feature vectors. They proposed a branch-and-bound algorithm for the problem, called 1-Phase algorithm, and afterward Suzuki *et al*. [14] proposed a more efficient and effective algorithm, called 2-Phase algorithm. Implementations of these algorithms [13, 14] for enumerating tree-like hydrogen-suppressed/hydrogen-retained chemical graphs with given upper and lower bounds on path frequencies are available on a web server ( http://sunflower.kuicr.kyoto-u.ac.jp/tools/enumol2/).

As shown by Nakano and Uno [29, 30], the class of trees admits a nice scheme for computer representation of their structures (called “left-heavy trees”) which enables us to generate trees significantly faster (in constant time per tree) without executing any explicit test on the uniqueness of structure representations of temporarily generated labeled graphs. Development of algorithms for enumerating chemical graphs with a “non-tree structure” is thereby a challenging task if we still wish to attain high computational efficiency as we have achieved for enumeration of tree-like chemical graphs, because no such effective representation scheme is known for general graphs. It should be noted that although polynomial-time algorithms have been developed for equivalence test and unique representation form problems for bounded degree graphs [32, 33] and chemical compounds [34], they are not directly applicable to efficient enumeration of chemical graphs.

In the NCI database ( http://cactus.nci.nih.gov/ncidb2.2/), the ratio of the number of chemical compounds with *k*-augmented tree structures to that of all registered chemical compounds is approximately 9*%*, 22*%*, 28*%*, 20*%*, and 11*%* for *k*=0,1,2,3, and 4, respectively. This implies that we have been able to treat only 9*%* of all of chemical compounds with high computational efficiency. As the first step toward efficient enumeration of non-tree chemical graphs, we consider the problem of hydrogen-suppressed chemical graphs with 1-augmented tree (monocyclic) structure. If we can solve this problem, we can treat 31*%* (=9*%*+22*%*) of chemical compounds. Although no effective representation scheme is known even to 1-augmented trees, we can create a tree by removing one edge in the unique cycle in a 1-augmented tree (two multiple edges with the same endvertices is not called a cycle in this paper). Additionally, 2-Phase algorithm [14], which enumerates tree-like hydrogen-suppressed chemical graphs, can be used without any major modification to enumerate such trees *T*=*G*−*e* with one edge deficit from 1-augmented trees *G* to be constructed. Thus the main task is to efficiently test the uniqueness of generated labeled 1-augmented trees. To design such a procedure, we use a well-reflected definition of a parent 0-augmented tree *T*=*G*−*e* of a 1-augmented tree *G*. As a result, we can combine the new procedure with 2-Phase algorithm to obtain an algorithm for enumerating hydrogen-suppressed chemical graphs with 1-augmented tree structure from upper and lower bounds on feature vectors.

## Method

Our proposed algorithm is based on existing algorithms to enumerate colored trees [29, 30] and colored multi-trees [13, 14, 28, 31]. The basic strategy of our algorithm is to generate a multi-tree first and then extend it to a 1-augmented tree by adding an edge. In enumeration algorithms, it is important not to miss any possible structures and not to duplicate identical structures. In order to efficiently cope with these conditions, the concept of the family tree has been widely employed in various enumeration algorithms. To define a family tree for graphs, we need to define a parent-child relationship between graph structures so that a parent structure is uniquely determined from a child structure, where each child structure is obtained by adding a vertex or an edge to its parent structure. Because extension of a multi-tree to a 1-augmented tree is the core part of our proposed algorithm, we need to provide a proper definition of the parent-child relationship between a multi-tree and a 1-augmented tree. As will be shown later, there may exist multiple possible ways of having a parent structure. How to define the unique parent of a given 1-augmented tree is one of the novel points of our proposed algorithm. Another important issue on generating 1-augmented trees is not to generate identical 1-augmented trees from the same multi-tree. As will be shown later, there is a case in which additions of different edges result in identical structures. How to efficiently prevent this kind of duplicate generation of identical structures is the other novel point of our proposed algorithm. In the following, we give a detailed description of the algorithm including these novel points. Again, readers not interested in mathematical details can skip this part.

### Overview of a new algorithm for 1-Augmented trees

Let ${\mathcal{G}}_{0}^{\prime}$ be the set of 0-augmented trees (multi-trees) *T*=*G*−*e* obtained from each 1-augmented tree $G\in {\mathcal{G}}_{1}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U})$ by removing a simple edge *e* in the unique cycle of *G*.

*g*

_{ L }as follows: For each

*t*∈

*Σ*

^{k,d}with

*k*≥2, let ${g}_{L}^{\prime}[\phantom{\rule{0.3em}{0ex}}t]=0$; and for each

*t*∈

*Σ*

^{1,d}, let

Thus our first task is to generate all multi-trees $T\in {\mathcal{G}}_{0}(\phantom{\rule{0.3em}{0ex}}{g}_{L}^{\prime},{g}_{U})$ by using fast conventional algorithms such as 2-Phase algorithm.

Our next task is to generate 1-augmented trees *G* from each multi-tree $T\in {\mathcal{G}}_{0}(\phantom{\rule{0.3em}{0ex}}{g}_{L}^{\prime},{g}_{U})$ such that no 1-augmented tree in $G\in {\mathcal{G}}_{1}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U})$ will be duplicated during the entire enumeration over all $T\in {\mathcal{G}}_{0}(\phantom{\rule{0.3em}{0ex}}{g}_{L}^{\prime},{g}_{U})$. To attain this objective without storing all generated 1-augmented trees for a comparison with a newly generated 1-augmented tree, we define a mapping $\pi :{\mathcal{G}}_{1}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U})\to {\mathcal{G}}_{0}(\phantom{\rule{0.3em}{0ex}}{g}_{L}^{\prime},{g}_{U})$; the multi-tree *T*=*π*(*G*) for a 1-augmented tree *G* is called the *parent* of *G*. For a multi-tree *T*, a 1-augmented tree *G* with *π*(*G*)=*T* is called a *child* of *T* (possibly *T* has more than one child), and is called a *feasible child* of *T* if $G\in {\mathcal{G}}_{1}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U})$. Note that any of definition of such a mapping will suffice as long as *π*(*G*) is determined only by the information of an “unlabeled graph” *G* (i.e., topological structure) except for a possible difference in computational efficiency to avoid duplication of solutions.

In the following, we show the 2-Phase algorithm, present details of our definition of parents *π* and design an efficient procedure for generating all children *G* from a given multi-tree $T\in {\mathcal{G}}_{0}(\phantom{\rule{0.3em}{0ex}}{g}_{L}^{\prime},{g}_{U})$.

### Summary of 2-Phase algorithm for 0-Augmented tree

*g*

_{ U }and ${g}_{L}^{\prime}$ and simplified feature vectors

*g*

_{ u }and ${g}_{L}^{\prime}$.

From the given multi-tree $T\in {\mathcal{G}}_{0}(\phantom{\rule{0.3em}{0ex}}{g}_{L}^{\prime},{g}_{U})$, our efficient procedure generates all 1-augmented trees in ${\mathcal{G}}_{1}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U})$.

### Parent-child relationship

In this section, to avoid duplication of a 1-augmented tree during the entire enumeration over all $T\in {\mathcal{G}}_{0}(\phantom{\rule{0.3em}{0ex}}{g}_{L}^{\prime},{g}_{U})$, we introduce a parent-child relationship between a 0-augmented tree and a 1-augmented tree.

#### Signature of rooted multi-trees

To define the parent *π*(*G*) of a 1-augmented tree *G* using only topological structure, we first introduce the concepts of “canonical form” and “signature” for a class of multi-graphs.

We fix the total order of colors in *Σ* arbitrarily, e.g., O <N <C, and regard each color *c*∈*Σ* as a small integer in ${\mathbb{Z}}_{+}$. We define the *lexicographical order* among sequences with elements in $\Sigma \cup {\mathbb{Z}}_{+}$ as follows. A sequence *A*=(*a*_{1},*a*_{2},…,*a*_{
p
}) is lexicographically smaller than a sequence *B*=(*b*_{1},*b*_{2},…,*b*_{
q
}) (denoted by *A*<*B*) if and only if there is an index *k* such that (i) *a*_{
i
}=*b*_{
i
}(1≤*i*≤*k*); and (ii) *a*_{k+1}<*b*_{k+1}(*k*+1≤ min{*p*,*q*}) or *k*=*p*<*q*; otherwise *A*=*B*, i.e., *p*=*q* and *a*_{
i
}=*b*_{
i
}(1≤*i*≤*p*), or *B*<*A*. Let *A*≤*B* denote *A*<*B* or *A*=*B*.

A multi-graph is called *labeled* if each vertex has a unique name or an index such as *v*_{0},*v*_{1},…,*v*_{n−1}, and we usually record a multi-graph as labeled in our computer. Hence, testing isomorphism of two multi-graphs is to find labels for these “unlabeled graphs” such that the two labeled graphs completely match each other including the adjacency between every two vertices. For a class
of multi-graphs, if we have a way of choosing a label for each multi-graph $G\in \mathcal{G}$ that is unique up to automorphisms of *G*, then we can test the isomorphism of two graphs directly with their labels. Such a labeling for *G* is called the *canonical form* of *G*. Once such a canonical form is obtained, we can easily encode each multi-graph $G\in \mathcal{G}$ into a code *σ*(*G*) (which is an integer or a sequence of integers/colors), called the *signature* of *G*, such that two multi-graphs $G,{G}^{\prime}\in \mathcal{G}$ are isomorphic if and only if *σ*(*G*)=*σ*(*G*^{′}). Without loss of generality we assume a total order over $\{\sigma (G)\mid G\in \mathcal{G}\}$ by introducing, if necessary, a total order over all colors and a lexicographical (total) order over all sequences of integers and colors.

A *rooted* graph is a multi-graph in which a vertex is designated as the root, and two rooted graphs are isomorphic if there is an isomorphism that maps their roots onto each other.

Any tree *T* has either a vertex or a pair of adjacent vertices removal of which leaves no component with at least |*V*(*T*)|/2 vertices [35], where the former is called the centroid and the latter is called the bicentroid.

In a rooted multi-tree *T*, the parent vertex of a non-root vertex *v* is denoted by *p*(*v*) and the *depth* of a vertex *v* is denoted by depth(*v*), where the depth of a vertex is its distance to the root. For a vertex *v* in *T*, let *T*_{
v
} denote the subtree induced from *T* by all descendants of *v* including *v*. For an edge *e*=*u* *v* in *T* (where *u*=*p*(*v*)), let *T*_{
e
} (*T*_{
u
v
}) denote the subtree of *T* that consists of *T*_{
v
} and *u*=*p*(*v*) joined by edge *e*=*u* *v*.

*T*is given by an “ordered tree”

*τ*of it (i.e., determination of a total order among children of each vertex). Let dfs(

*τ*) denote the total order of vertices in

*τ*visited by the depth-first-search order according to the order for children in

*τ*. For example, Figure 6 illustrates three ordered trees

*τ*

_{1},

*τ*

_{2}and

*τ*

_{3}, which are obtained from the same multi-tree

*T*rooted at the centroid, where the number beside each vertex

*v*indicates dfs(

*v*).

We let *δ*(*τ*) denote the alternating sequence (*c*_{0},*d*_{0},*c*_{1},*d*_{1},…,*c*_{n−1},*d*_{n−1}) such that *c*_{
i
} and *d*_{
i
} denote the color and depth, respectively, of the *i*-th vertex *v*_{
i
} in dfs(*τ*), and *M*(*τ*) denote the sequence (*m*_{1},*m*_{2},…,*m*_{n−1}) of the multiplicity *m*_{
i
}=*m*(*v*_{
i
},*p*(*v*_{
i
})) of the edge joining the *i*-th vertex and its parent *p*(*v*_{
i
}) in *T*. For a vertex *v*, let dfs(*v*) denote the labeling number of *v* in dfs(*τ*). For example, Figure 6 illustrates *δ*(*τ*_{
i
}) of ordered trees *τ*_{
i
}, *i*=1,2,3 and *M*(*τ*_{2}) and *M*(*τ*_{3}).

A *left-heavy* tree of a rooted multi-tree *T* is an ordered tree *τ* that has the maximum code *δ*(*τ*) among all ordered trees of *T* (hence a left-heavy tree *τ* is a canonical form and *δ*(*τ*) is a signature of it when we ignore the multiplicity of rooted multi-trees). We define the canonical form of a rooted multi-tree *T* to be the left-heavy tree *τ* that has the maximum code *M*(*τ*) among all left-heavy trees of *T*, and let *σ*(*T*) denote a signature of *T* (a code of the canonical form *τ* such as (*δ*(*τ*),*M*(*τ*))). For example, in Figure 6, *τ*_{2} and *τ*_{3} are left-heavy trees of *T*, since they have lexicographically maximum sequences *δ*(*τ*_{2})=*δ*(*τ*_{3}) among all ordered trees *τ* of the rooted multi-tree *T*, and *τ*_{3} is the canonical form of *T* and (*δ*(*τ*_{3})=(C,0,C,1,N,2,C,1,N,2,N,1,O,2),*M*(*τ*_{3})=(2,1,1,2,1,1)) is the signature of *T* since it is a left-heavy tree with the lexicographically maximum *M*(*τ*_{3}) among all left-heavy trees *τ* of *T*.

Using the canonical form for rooted multi-trees, we can define a canonical form for “unrooted” multi-trees *T* by regarding them as trees rooted at the centroid or bicentroid.

#### Defining parents *π*

*π*for 1-augmented trees (note that there is no root for any 1-augmented tree). Let

*G*be a 1-augmented tree with a unique cycle

*C*of length

*p*which by our assumption contains at least one simple edge. Then there are

*p*possible choices

*G*−

*e*

_{ i },

*i*=1,2,…,

*p*, for the parent of

*G*. We introduce a rule to choose one of them based only on the topological information on

*G*and

*C*. For each vertex

*v*in

*C*, let

*N*(

*v*) denote the set of vertices in

*V*−

*V*(

*C*) adjacent to

*v*. Removing an edge

*vw*with

*v*∈

*V*(

*C*) and

*w*∈

*N*(

*v*) leaves a multi-tree containing

*w*, which we denote by

*T*

_{ w }. For each vertex

*v*in

*C*, we encode all multi-trees

*T*

_{ w },

*w*∈

*N*(

*v*) into a signature

*σ*

^{∗}(

*v*) using the signature

*σ*for rooted multi-trees; we set

such that $\sigma ({T}_{{w}_{1}})\ge \sigma ({T}_{{w}_{2}})\ge \cdots \ge \sigma ({T}_{{w}_{h}})$ holds for *N*(*v*)={*w*_{1},*w*_{2},…,*w*_{
h
}}. Note that two vertices *v* and *v*^{′} in *C* have the same color and an identical set of subtrees in *N*(*v*) and *N*(*v*^{′}) if and only if *σ*^{∗}(*v*)=*σ*^{∗}(*v*^{′}). For each simple edge *e*=*u* *v* in *C*, we define a code *c*^{∗}(*e*) as follows. We encode the unique path *u*_{1} (=*u*),*u*_{2},…,*u*_{
h
} (=*v*) from *u* to *v* along *C* into *σ*^{∗}(*u*,*v*)=(*σ*^{∗}(*u*_{1}),*m*_{1},*σ*^{∗}(*u*_{2}),*m*_{2},…,*m*_{h−1},*σ*^{∗}(*u*_{
h
})), where *m*_{
i
}=*m*(*u*_{
i
},*u*_{i+1}). Symmetrically, we define *σ*^{∗}(*v*,*u*) = (*σ*^{∗}(*u*_{
h
}),*m*_{h−1},*σ*^{∗}(*u*_{h−1}),*m*_{h−2},…,*m*_{1},*σ*^{∗}(*u*_{1})). The code *c*^{∗}(*e*) is defined to be lexicographically the maximum one between two sequences *σ*^{∗}(*u*,*v*) and *σ*^{∗}(*v*,*u*). Furthermore, let *E*^{∗}(*C*) be the set of simple edges *e*^{∗} in *C* such that *c*^{∗}(*e*^{∗}) is lexicographically maximum among *c*^{∗}(*e*) for all simple edges *e* in *C*.

We call an edge *vw* with *v*∈*V*(*C*) and *w*∈*N*(*v*) a *heavy edge* if *T*_{
w
} has at least |*V*(*G*)|/2 vertices. We distinguish two cases to define parent *π*.

- Case 1.There is no heavy edge around
*C*: For an arbitrary edge*e*∈*E*^{∗}(*C*), we define*π*(*G*) to be*G*−*e*(note that when |*E*^{∗}(*C*)|≥2,*G*is symmetric around*C*and*G*−*e*and*G*−*e*^{′}will be isomorphic for any two edges*e*,*e*^{′}∈*E*^{∗}(*C*)). Figure 7 illustrates how the parent*π*(*G*) of a 1-augmented tree*G*in Case 1 is determined on these signatures*σ*(*u*,*v*) and*σ*(*v*,*u*) of simple edges*uv*in the cycle*C*. - Case 2.
There is a heavy edge

*v*^{∗}*w*^{∗}: Note that no other edge can be a heavy edge. Let*e*_{1}and*e*_{2}be the two edges in*C*that are adjacent to*v*^{∗}, where at least one of them is a simple edge since deg(*v*)≤4. If exactly one of them, e.g.,*e*_{1}is a simple edge, then we define*π*(*G*) to be*G*−*e*_{1}. When*e*_{1}and*e*_{2}are simple edges, we choose one of them as follows. We first ignore all trees*T*_{ w }with*w*∈*N*(*v*^{∗}), which are symmetric at the vertex*v*^{∗}commonly shared by*e*_{1}and*e*_{2}and hence useless to construct a signature for distinguishing*e*_{1}and*e*_{2}. Without using*T*_{ w }with*w*∈*N*(*v*^{∗}), we construct the code*c*^{∗}(*e*_{1}) and*c*^{∗}(*e*_{2}). Finally we choose any edge*e*_{ i }such that*c*^{∗}(*e*_{ i }) is lexicographically maximum between*c*^{∗}(*e*_{1}) and*c*^{∗}(*e*_{2}), and define*π*(*G*) to be*G*−*e*_{ i }.

#### Generating children

Recall that our algorithm for enumerating 1-augmented trees consists of two major stages: the first stage enumerates all multi-trees $T\in {\mathcal{G}}_{0}(\phantom{\rule{0.3em}{0ex}}{g}_{L}^{\prime},{g}_{U})$ by 2-Phase algorithm, and the second stage generates all feasible children *G* for each $T\in {\mathcal{G}}_{0}(\phantom{\rule{0.3em}{0ex}}{g}_{L}^{\prime},{g}_{U})$, i.e., 1-augmented trees $G\in {\mathcal{G}}_{1}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U})$ with *π*(*G*)=*T*. This section describes a procedure for generating all children *G*=*T*+*e* of a given multi-tree *T* by adding a new edge *e*.

For simplicity, we consider the case where a given multi-tree *T* has the centroid (the case where it is rooted at the bicentroid can be treated with a minor technical modification). In the following, we assume that a given multi-tree *T* is represented as its canonical form (a left-heavy tree) *τ* rooted at its centroid, and that its sequences *δ*(*τ*)=(*c*_{1},*d*_{1},…,*c*_{
n
},*d*_{
n
}) and *M*(*τ*)=(*m*_{2},*m*_{3},…,*m*_{
n
}) over the labeling dfs(*τ*) have been already computed after the first stage (2-Phase algorithm can deliver not only solutions *T* but also *τ* and these sequences together).

It should be noted that the canonical form of left-heavy trees enjoys the following recursive structure. For any vertex *v* in *T*, the subtree *T*_{
v
} of *T* rooted at *v* induces an ordered tree *τ*_{
v
} from the left-heavy tree *τ* and *τ*_{
v
} is again the canonical form of *T*_{
v
}, since dfs(*τ*_{
v
}) is a subsequence of dfs(*τ*) with consecutive vertices and its ordered pair (*δ*(*τ*_{
v
}),*M*(*τ*_{
v
})) is also lexicographically maximized over all ordered trees of *T*_{
v
}.

#### Testing generated 1-Augmented trees

*τ*of a multi-tree

*T*, we add a new edge

*xy*for two nonadjacent vertices

*x*,

*y*∈

*V*(

*T*)(dfs(

*x*)<dfs(

*y*)) to obtain

*G*=

*T*+

*x*

*y*. Let

*C*denote the cycle created in

*G*. We check the following condition to test whether

*T*is the parent of

*G*or not.

- Case I.
*C*contains the root (centroid) of*T*:- (A)
*σ*^{∗}(*x**y*) is lexicographically maximum among*σ*^{∗}(*e*) for all simple edges*e*in*C*.

- (A)
- Case II.Otherwise:
- (B)
*x*is the ancestor of*y*; - (C)
*σ*^{∗}(*x**y*)≥*σ*^{∗}(*e*) if the edge*e*incident to*x*in*C*is simple.

- (B)

Then, we have the following lemma, where the proof is given in S1.1 (of the Additional file 1).

**Lemma** **1**.

For a multi-tree *T* and two nonadjacent vertices *x*,*y*∈*V*(*T*), testing whether *T*=*π*(*T*+*x* *y*) can be done by checking the above condition in *O*(|*V*(*C*)|^{2}) time.

### Avoiding duplication of children

*T*can be generated by the definition of the parent-child relationship between multi-tree and 1-augmented tree. However, if

*T*has two isomorphic subtrees

*T*

_{ u }and

*T*

_{ v }then we would have two children

*T*+

*x*

*y*and

*T*+

*x*

^{′}

*y*

^{′}, which are isomorphic to each other. To avoid such duplication, we test if

*T*+

*x*

*y*is isomorphic to

*T*+

*x*

^{′}

*y*

^{′}for some other vertices

*x*

^{′}and

*y*

^{′}when we add an edge

*xy*to

*T*. In fact, we do not try to find other such pair

*x*

^{′}and

*y*

^{′}explicitly. Instead we introduce a rule that we do not generate

*T*+

*x*

*y*by any edge

*xy*that has such an “isomorphic” vertex pair

*x*

^{′}and

*y*

^{′}on the left hand side of

*x*and

*y*in

*T*. To detect this situation efficiently, we first compute data on each vertex

*v*in

*T*that indicates whether the left hand side of

*v*contains another vertex

*u*such that its subtree

*T*

_{p(u)u}is isomorphic to

*T*

_{p(v)v}. Using such data, we show that given a vertex pair

*x*and

*y*whether there is an isomorphic pair

*x*

^{′}and

*y*

^{′}in the left hand side can be checked in constant time. In other words, we show an

*O*(

*n*

^{2}) time algorithm extracting all the leftmost side vertex pairs from

*T*.Among all isomorphic vertex pairs, we call the leftmost one an “admissible pair”, where “isomorphic” means here that the connection of vertices in each pair results in an isomorphic tree. Figure 8 shows an example of a case in which additions of different edges result in identical structures and the admissible pair.

In this section, we show the validity that we only need to add an edge between each admissible pair to avoid duplication and omission of 1-augmented trees generated from one multi-tree *T*. Finally, for a multi-tree, we provide an efficient algorithm extracting all vertex pairs to generate all children of the multi-tree.

#### Admissible pairs

We write *T*+*u* *v*∼*T*+*u*^{′}*v*^{′} if and only if *T*+*u* *v* and *T*+*u*^{′}*v*^{′} are isomorphic. For a tree *T*, let *c*_{
T
} denote its centroid, which is either a vertex (unicentroid) or an edge (bicentroid). Let *T* be a left-heavy tree rooted at its centroid *c*_{
T
}. When *c*_{
T
} is a bicentroid *r* *r*^{′}, *r* and *r*^{′} will be the vertices that have no parent in the parent-child relationship in *T*. We shall now introduce “rooted-isomorphism” among 1-augmented trees obtained from *T* by adding a new edge. We regard a 1-augmented tree *G*=*T*+*u* *v* obtained by adding new edge *uv* between two nonadjacent vertices *u*,*v*∈*V*(*T*) as a graph rooted at *c*_{
T
}. When *c*_{
T
} is a vertex *r*, we say that two 1-augmented trees *G*=*T*+*u* *v* and *G*^{′}=*T*+*u*^{′}*v*^{′} are *rooted-isomorphic* if they admit an isomorphism *ψ* such that *c*_{
T
} in *G*=*T*+*u* *v* corresponds to *c*_{
T
} in *G*^{′}=*T*+*u*^{′}*v*^{′} (i.e., *ψ*(*r*)=*r* when *c*_{
T
} is a vertex *r*, and {*ψ*(*r*),*ψ*(*r*^{′})}={*r*,*r*^{′}} when *c*_{
T
} is an edge *r* *r*^{′}). We write $T+\mathit{\text{uv}}\underset{r}{\approx}T+{u}^{\prime}{v}^{\prime}$ if and only if *T*+*u* *v* and *T*+*u*^{′}*v*^{′} are rooted-isomorphic with root *r*. Then, the following theorem holds, where the proof is given in Additional file 1: S1.2.

**Theorem** **2**.

Let *T* be a left-heavy tree rooted at its centroid *c*_{
T
} and {*u*,*v*},{*u*^{′},*v*^{′}}⊆*V*(*T*) be two pairs of nonadjacent vertices. If *T*+*u* *v*∼*T*+*u*^{′}*v*^{′} then $T+\mathit{\text{uv}}\underset{r}{\approx}T+{u}^{\prime}{v}^{\prime}$.

Theorem 2 tells us that two 1-augmented trees *G*=*T*+*u* *v* and *G*^{′}=*T*+*u*^{′}*v*^{′} are isomorphic if and only if they are rooted-isomorphic (i.e., *c*_{
T
} in *G* corresponds to *c*_{
T
} in *G*^{′} in the isomorphism *ψ*, where possibly *ψ*(*r*)=*r*^{′} and *ψ*(*r*^{′})=*r* when *c*_{
T
}=*r* *r*^{′}).

Now we consider how to generate a set ${\mathcal{G}}_{T}$ of 1-augmented trees *T*+*u* *v* such that the 1-augmented tree *T*+*u* *v* for any pair of nonadjacent vertices *u*,*v*∈*V*(*T*) is isomorphic to exactly one 1-augmented tree *G* in the set ${\mathcal{G}}_{T}$. By Theorem 2, we only need to check the rooted-isomorphism among 1-augmented trees *T*+*u* *v* for all pairs of nonadjacent vertices *u*,*v*∈*V*(*T*). Based on this, we can modify a given tree *T* with bicentroid *c*_{
T
}=*r* *r*^{′} into a tree *T*^{′} with unicentroid *r*^{∗} by inserting a new vertex on the edge *r* *r*^{′}. Since this does not change the rooted-isomorphism among 1-augmented trees *T*^{′}+*u* *v* or the left-heaviness of *T*, we assume in the following that a given tree *T* has a unicentroid *c*_{
T
}=*r*.

*T*be a left-heavy tree. We shall introduce some terminology. Let

*x*be a non-root vertex

*x*in

*T*. Denote by left(

*x*) the immediate left sibling of a non-root vertex

*x*(if any). We define data copy as follows.

Let *u* and *v* be two vertices in *T*. We denote by *P*(*u*,*v*) the unique path in *T* that connects *u* and *v*, where *P*(*u*,*v*)=*P*(*v*,*u*). Let lca(*u*,*v*) denote the least common ancestor of *u* and *v*, i.e., the highest vertex in *P*(*u*,*v*) (where we define lca(*u*,*v*) to be the edge *c*_{
T
}=*r* *r*^{′} when *T* is rooted at the bicentroid *c*_{
T
}=*r* *r*^{′}, and *u*∈*V*(*T*_{
r
}) and $v\in V({T}_{{r}^{\prime}})$). When dfs(*u*)<dfs(*v*), we define the *greatest uncommon ancestor* gua of *u* and *v*as follows:

● Let gua(*u*,*v*) denote the child of lca(*u*,*v*) that is closest to *u* in *T*, where gua(*u*,*v*) is an ancestor of *u* (including *u* itself) if lca(*u*,*v*)≠*u*;

● Let gua(*v*,*u*) denote the child of lca(*u*,*v*) that is an ancestor of *v* (including *v* itself), where gua(*u*,*v*)=gua(*v*,*u*) if lca(*u*,*v*)=*u*.

*u*,

*v*∈

*V*(

*T*) with dfs(

*u*)<dfs(

*v*)

*admissible*if it satisfies the following conditions (see Figure 9 for conditions (1) and (2) and Figure 10for condition (3)):

- (1)
copy(

*w*) = 0 for all vertices*w*∈*V*(*P*(lca(*u*,*v*),*r*))−{*r*}; - (2)
copy(

*w*)=0 for all vertices*w*∈*V*(*P*(*u*,gua(*u*,*v*)))∪*V*(*P*(*v*,gua(*v*,*u*)))−{lca(*u*,*v*),gua(*v*,*u*)}; - (3) if copy(gua(
*v*,*u*))=1 then- (i)
gua(

*u*,*v*)=left(gua(*v*,*u*)) (hence*u*≠lca(*u*,*v*)); and - (ii)
For the copy $\xfb$ of vertex

*u*in*T*_{gua(v,u)}, it holds $\text{dfs}(v)\ge \text{dfs}(\xfb)$ (where $\text{dfs}(\xfb)=\text{dfs}(u)+|V({T}_{\text{gua}(u,v)})|$).

- (i)

Note that (3)-(i) implies that copy(gua(*v*,*u*)) in (2) needs to be 0 when lca(*u*,*v*)=*u*.

The next lemma indicates that we only need to add an edge between each admissible pair to avoid duplication of 1-augmented trees, where the proof is given in Additional file 1: S1.3.

**Lemma** **3**.

For a left-heavy tree *T* rooted at its unicentroid *c*_{
T
}=*r*, let ${\mathcal{G}}_{T}=\{T+\mathit{\text{uv}}\mid $ admissible pairs *u*,*v*∈*V*(*T*)}. Then the 1-augmented tree *T*+*u* *v* for any pair of nonadjacent vertices *u*,*v*∈*V*(*T*) is isomorphic to exactly one 1-augmented tree *G* in ${\mathcal{G}}_{T}$.

#### Algorithm

In this section, we describe an algorithm of the second stage to generate all children of a given multi-tree *T* without duplication and omission, and show the computational complexity of the second stage. To generate all children of *T*, we first find all admissible pairs (*u*,*v*) for *T* and test whether *T* is the parent *π*(*T*+*u* *v*) of *T*+*u* *v* or not. Notice that a straightforward method would take *O*(*n*) time to check whether a pair (*u*,*v*) is admissible or not. Since there are at most _{
n
}*C*_{2} vertex pairs in a multi-tree *T*, finding all admissible pairs for *T* may take *O*(*n*^{3}) time. That is, from Lemma 1, we may need *O*(*n*^{3}|*V*(*C*)|^{2})=*O*(*n*^{5}) time to generate all children of *T*.

*O*(

*n*

^{4})-time algorithm to generate all children of a given multi-tree

*T*. For this, we find only a subset of all admissible pairs, called the set of “candidate” pairs defined as follows (see also Figure 11). We see that no pair (

*x*,

*y*) generates a child

*T*+

*x*

*y*of

*T*if a heavy edge is created in

*T*+

*x*

*y*and

*x*is not an ancestor of

*y*, since such (

*x*,

*y*) does not satisfy any of Cases I and II for generating children of

*T*. Hence we are not interested in storing such pairs (

*x*,

*y*), and call an admissible pair (

*x*,

*y*) a

*candidate*pair when (i) no heavy edge is created in

*T*+

*x*

*y*; or (ii)

*x*is an ancestor of

*y*, where (i) (resp., (ii)) is a necessary condition of Case I (resp., Case II). By definition, every candidate vertex pair (

*x*,

*y*) is admissible, whereas any admissible pair (

*x*,

*y*) such that

*T*+

*x*

*y*is a child of

*T*is always a candidate pair. Therefore, to generate all children of

*T*, we do not need to find all admissible pairs and only have to extract all candidate pairs.

To facilitate this, we examine all vertex pairs (*u*,*v*) (dfs(*u*)<dfs(*v*)) in *T* in a lexicographical order with respect to (dfs(*u*),dfs(*v*)), i.e., we choose each vertex *v*_{
i
} from *v*_{0} to *v*_{n−1} as *u* and then choose each vertex *v*_{
j
} from *v*_{i+1} to *v*_{n−1} as *v*. We call the lexicographical order over vertex pairs a *dfs order*. For each of the generated vertex pairs, we check whether it is a candidate pair or not. Finally, for each candidate pair (*u*,*v*), we test whether *T*+*u* *v* is a child of *T* in Case I or II.

We can find all candidate pairs for a multi-tree *T* in *O*(*n*^{2}) time in total as stated below. The proof is given in Additional file 1: S1.4.

**Lemma** **4**.

For a left-heavy multi-tree *T*, all candidate pairs of *T* can be found in *O*(*n*^{2}) time.

Finally, by Lemma 1 and Lemma 4, we can generate all children of a multi-tree *T* in *O*(*n*^{4}) time, as stated in the next lemma.

**Lemma** **5**.

Given a left-heavy multi-tree *T*, all children of *T* can be generated in *O*(*n*^{4}) time.

*Proof*.

For a left-heavy multi-tree *T*, we can find all candidate pairs in *O*(*n*^{2}) time by Lemma 4. For each candidate pair (*u*,*v*), we can test whether *T*=*π*(*T*+*u* *v*) or not in *O*(|*V*(*C*)|^{2}) time by Lemma 1. Thus, for a left-heavy multi-tree *T*, all children of *T* can be generated in *O*(*n*^{2}|*V*(*C*)|^{2})=*O*(*n*^{4}) time. □

## Experimental and results

- (I)
First we select four chemical compounds “C00062,” “C03343,” “C03690,” and “C07178” as chemical graphs with 0-augmented tree (acyclic) structure and four chemical compounds “C00095,” “C00270,” “C00645,” and “C00837” as chemical graphs with 1-augmented tree structure (see Additional file 1: Figure S21 for illustrations of these chemical graphs), wherein each benzene ring in chemical compounds “C03343,” “C03690,” and “C07178” is regarded as a virtual atom b of valence 6. These compounds are heuristically selected based on the following criteria: (i) each compound is a 0-augmented tree or 1-augmented tree (except benzene ri ngs), (ii) each compound consists of C,O,H (or, C,O,N,H) atoms, (iii) compounds are not very similar to each other, and (iv) compounds have varying sizes but are not too large.

The virtual atom b is treated as one atom so that we discard all possible regioisomers of benzene. Thus in our experiment, we consider the cycles not caused by benzenes but by other substructures in these 1-augmented trees. We remark that an efficient algorithm has been developed for generating all possible regioisomers of a given 0-augmented tree structure with virtual atoms b by Li *et al*. [36], and an implementation of the algorithm is available on a web server ( http://sunflower.kuicr.kyoto-u.ac.jp/tools/enumol2/).

To generate problem instances from each of the selected chemical graphs, we define $w\in {\mathbb{Z}}_{+}$ to be a *width* between upper and lower feature vectors. From the feature vector *g*=*f*_{
K
}(*G*) of a chemical graph *G* at level *K*, we construct two feature vectors *g*_{
U
} and *g*_{
L
} of width *w* as follows. For each entry *t* with *g*[ *t*]≥1, let *g*_{
U
}[ *t*]=*g*[ *t*]+*w* and *g*_{
L
}[ *t*]= max{*g*[ *t*]−*w*,0}; and for each entry *t* with *g*[ *t*]=0, let *g*_{
U
}[ *t*]=*g*_{
L
}[ *t*]=0. See Additional file 1: Figure S23 (resp., Additional file 1: Figure S24) for the lower and upper feature vectors *g*_{
L
} and *g*_{
U
} with *K*=1 and *w*=1 (resp., *K*=2 and *w*=1) created from C00062.

To examine the computational efficiency, we compare the time per output multi-tree/1-augmented tree by our algorithm and by 2-Phase algorithm [14]. Our algorithm enumerates not only the 1-augmented trees in ${\mathcal{G}}_{1}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U})$ but also the multi-trees in ${\mathcal{G}}_{0}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U})$. Therefore, if time per output graph of our algorithm is close to that of 2-Phase algorithm, then we can enumerate 1-augmented trees as fast as 2-Phase algorithm enumerates multi-trees.

*K*with fixed

*w*=1, where the meanings of columns are as follows.

**Comparison of 2-Phase algorithm and our algorithm in chemical graphs (I)**

Entry | 2-Phase Algorithm | Our Algorithm | ||||||||
---|---|---|---|---|---|---|---|---|---|---|

Formula | n | K | Tree | Time | Time/graph | Tree | Cycle | Time | Time/graph | Ratio |

1 | 388,192 | 0.288 | 741.9E-9 | 388,192 | 1,786,467 | 14.056 | 6.5E-6 | 8.7 | ||

2 | 614 | 0.021 | 34.2E-6 | 614 | 229 | 0.134 | 159.0E-6 | 4.6 | ||

C00062 | 3 | 95 | 0.020 | 210.5E-6 | 95 | 0 | 0.039 | 410.5E-6 | 2.0 | |

C | 12 | 4 | 1 | 0.008 | 8.0E-3 | 1 | 0 | 0.018 | 18.0E-3 | 2.3 |

5 | 1 | 0.005 | 5.0E-3 | 1 | 0 | 0.006 | 6.0E-3 | 1.2 | ||

6 | 1 | 0.006 | 6.0E-3 | 1 | 0 | 0.006 | 6.0E-3 | 1.0 | ||

7 | 1 | 0.004 | 4.0E-3 | 1 | 0 | 0.004 | 4.0E-3 | 1.0 | ||

1 | 1,708 | 0.007 | 4.1E-6 | 1,708 | 12,626 | 0.050 | 3.5E-6 | 0.9 | ||

2 | 50 | 0.004 | 80.0E-6 | 50 | 1,085 | 0.043 | 37.9E-6 | 0.5 | ||

C00095 | 3 | 0 | 0.046 | – | 0 | 286 | 0.046 | 160.8E-6 | – | |

C | 12 | 4 | 0 | 0.006 | – | 0 | 19 | 0.035 | 1.8E-3 | – |

5 | 0 | 0.004 | – | 0 | 7 | 0.030 | 4.3E-3 | – | ||

6 | 0 | 0.004 | – | 0 | 5 | 0.028 | 5.6E-3 | – | ||

7 | 0 | 0.004 | – | 0 | 5 | 0.013 | 2.6E-3 | – | ||

1 | 5,446,987 | 7.690 | 1.4E-6 | 5,446,987 | 31,395,098 | 217.681 | 5.9E-6 | 4.2 | ||

2 | 373 | 0.022 | 59.0E-6 | 373 | 71 | 0.171 | 385.1E-6 | 6.5 | ||

C03343 | 3 | 187 | 0.023 | 123.0E-6 | 187 | 25 | 0.071 | 334.9E-6 | 2.7 | |

C | 15 | 4 | 101 | 0.022 | 217.8E-6 | 101 | 9 | 0.042 | 381.8E-6 | 1.8 |

5 | 51 | 0.022 | 431.4E-6 | 51 | 6 | 0.059 | 1.0E-3 | 2.4 | ||

6 | 43 | 0.009 | 209.3E-6 | 43 | 0 | 0.036 | 837.2E-6 | 4.0 | ||

7 | 28 | 0.013 | 464.3E-6 | 28 | 0 | 0.020 | 714.3E-6 | 1.5 | ||

1 | 2,926,382 | 2.878 | 983.5E-9 | 2,926,382 | 23,965,432 | 146.669 | 5.5E-6 | 5.5 | ||

2 | 41,468 | 1.035 | 25.0E-6 | 41,468 | 213,820 | 37.792 | 148.0E-6 | 5.9 | ||

C00645 | 3 | 491 | 0.562 | 1.1E-3 | 491 | 4,482 | 6.281 | 1.3E-3 | 1.1 | |

C | 15 | 4 | 0 | 0.523 | – | 0 | 73 | 4.168 | 57.1E-3 | – |

5 | 0 | 0.374 | – | 0 | 5 | 2.320 | 464.0E-3 | – | ||

6 | 0 | 0.135 | – | 0 | 3 | 0.430 | 143.3E-3 | – | ||

7 | 0 | 0.121 | – | 0 | 1 | 0.328 | 328.0E-3 | – | ||

1 | 167,172,180 | 238.554 | 1.4E-6 | ≥ 1,594,520 | ≥ 33,962,677 | T. O. | 50.7E-6 | 35.5 | ||

2 | 210 | 1.232 | 5.9E-3 | 210 | 641 | 1.888 | 2.2E-3 | 0.4 | ||

C00837 | 3 | 0 | 0.853 | – | 0 | 4 | 1.206 | 301.5E-3 | – | |

C | 18 | 4 | 0 | 0.445 | – | 0 | 2 | 0.523 | 261.5E-3 | – |

5 | 0 | 0.389 | – | 0 | 1 | 0.596 | 596.0E-3 | – | ||

6 | 0 | 0.298 | – | 0 | 1 | 0.367 | 367.0E-3 | – | ||

7 | 0 | 0.285 | – | 0 | 1 | 0.395 | 395.0E-3 | – | ||

1 | 62,234,720 | 321.155 | 5.2E-6 | ≥ 4,812,773 | ≥ 40,426,928 | T. O. | 39.8E-6 | 7.7 | ||

2 | 884 | 0.310 | 350.7E-6 | 884 | 180 | 1.824 | 1.7E-3 | 4.9 | ||

C07178 | 3 | 22 | 0.026 | 1.2E-3 | 22 | 4 | 0.099 | 3.8E-3 | 3.2 | |

C | 18 | 4 | 1 | 0.004 | 4.0E-3 | 1 | 0 | 0.005 | 5.0E-3 | 1.3 |

5 | 1 | 0.004 | 4.0E-3 | 1 | 0 | 0.005 | 5.0E-3 | 1.3 | ||

6 | 1 | 0.004 | 4.0E-3 | 1 | 0 | 0.005 | 5.0E-3 | 1.3 | ||

7 | 1 | 0.005 | 5.0E-3 | 1 | 0 | 0.005 | 5.0E-3 | 1.0 | ||

1 | ≥ 1,208,446,991 | T. O. | 1.5E-6 | ≥ 7,009,856 | ≥ 47,008 | T. O. | 255.1E-6 | 171.2 | ||

2 | 27,312,856 | 965.131 | 35.3E-6 | ≥ 337,989 | ≥ 3,593,865 | T. O. | 458.1E-6 | 13.0 | ||

C00270 | 3 | 156,073 | 391.611 | 2.5E-3 | ≥ 1,546 | ≥ 234,187 | T. O. | 7.6E-3 | 3.0 | |

C | 21 | 4 | 0 | 299.393 | – | ≥ 0 | ≥ 165 | T. O. | 10.9E+0 | – |

5 | 0 | 208.268 | – | ≥ 0 | ≥ 7 | T. O. | 257.1E+0 | – | ||

6 | 0 | 19.361 | – | 0 | 2 | 85.109 | 42.6E+0 | – | ||

7 | 0 | 9.720 | – | 0 | 2 | 30.301 | 15.2E+0 | – | ||

1 | ≥ 664,049,939 | T. O. | 2.7E-6 | ≥ 4,621,297 | ≥ 33,216,732 | T. O. | 47.6E-6 | 17.5 | ||

2 | 164,885 | 34.357 | 208.4E-6 | 164,885 | 1,425 | 213.810 | 1.3E-3 | 6.2 | ||

C03690 | 3 | 32,995 | 15.978 | 484.3E-6 | 32,995 | 179 | 66.612 | 2.0E-3 | 4.1 | |

C | 23 | 4 | 3,884 | 2.265 | 583.2E-6 | 3,884 | 17 | 5.383 | 1.4E-3 | 2.4 |

5 | 1,237 | 1.490 | 1.2E-3 | 1,237 | 13 | 3.466 | 2.8E-3 | 2.3 | ||

6 | 559 | 0.773 | 1.4E-3 | 559 | 0 | 1.554 | 2.8E-3 | 2.0 | ||

7 | 177 | 0.445 | 2.5E-3 | 177 | 0 | 0.617 | 3.5E-3 | 1.4 |

- (1)
C00062, C00095, C00270 C00645, C00837, C03343, C07178, and C03690 are the chemical compounds in the KEGG LIGAND database, respectively;

- (2)
in Table 1, the width for constructing upper and lower feature vectors is 1;

- (3)
*n*is the number of vertices without hydrogen atoms in an instance preprocessed by replacing each benzene ring with a new atom with six valences; - (4)
*w*is the width for constructing upper and lower feature vectors; - (5)
*K*is the level of given feature vectors; - (6)
“time (s)” is the CPU time in seconds;

- (7)
T.O. means the “time over” (the time limit is set to be 1,800 seconds);

- (8)
“time/graph” is the time per enumerating one graph;

- (9)
“tree” is the number of all possible solutions of tree-like chemical graph in the time limit;

- (10)
“cycle” is the number of all possible solutions of 1-tree chemical graph in the time limit;

- (11)
“ratio” is a number such that “time/graph” of our algorithm is divided by that of 2-Phase algorithm;

- (12)
for any real numbers x and y, let

*x*E*y*denote*x*×10^{ y }.

It is to be noted that in some instances, the number of enumerated trees by 2-Phase algorithm and that of our algorithm are different because of the time limit. Hence, the “tree” and “cycle” columns show the number of incomplete solutions in instances whose “time” column is “T.O.”. However, this is not a critical issue because we mainly want to know the “time per graph” and its “ratio” between 2-Phase algorithm and our algorithm. We can make use of them as beneficial results from “tree,” “cycle,” and “time” columns even if they are incomplete and “T.O.”.

We find that almost all instances solved within the time limit by 2-Phase algorithm are also solved by our algorithm within the time limit. Moreover, the “ratio” of instances is less than 10 except 4 out of 38 cases, and that of many instances is less than 5. This means that the time per output by our algorithm is close to that by 2-Phase algorithm. Therefore, we have demonstrated that our algorithm maintains the high computational efficiency of 2-Phase algorithm even if *K* changes. Note that our algorithm does not output any 1-augmented trees in ${\mathcal{G}}_{1}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U})$ in “C00062,” “C03343,” “C07178,” and “C03690” when *K* is large. This is because the instances are acyclic chemical compounds: 1-augmented trees become less able to satisfy the feature vector constraint as *K* increases and only multi-trees can satisfy the feature vector constraint.

*w*with fixed

*K*=3. Just like with Table 1, almost all instances solved within the time limit by 2-Phase algorithm are also solved by our algorithm within the time limit. The “ratio” of instances is less than 10 except 7 out of 48 cases, and that of many instances is less than 5. In particular, with respect to “C00095,” “C00645,” and “C00837,” which have 1-augmented tree structure, the “ratio” is less than 1 or close to 1. This implies that our algorithm can enumerate 1-augmented trees and multi-trees faster than 2-Phase algorithm enumerates multi-trees. These results mean that the time per output by our algorithm is close to that by 2-Phase algorithm. Therefore, we have demonstrated that our algorithm maintains the high computational efficiency of 2-Phase algorithm even if

*w*changes.

**Comparison of varying width**
w
**in chemical graphs (I)**

Entry | 2-Phase Algorithm | Our Algorithm | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|

Formula | n | w | K | Tree | Time | Time/graph | Tree | Cycle | Time | Time/graph | Ratio |

1 | 3 | 95 | 0.020 | 210.5E-6 | 95 | 0 | 0.039 | 410.5E-6 | 2.0 | ||

2 | 3 | 862 | 0.020 | 23.2E-6 | 862 | 100 | 0.130 | 135.1E-6 | 5.8 | ||

C00062 | 3 | 3 | 2,531 | 0.030 | 11.8E-6 | 2,531 | 894 | 0.213 | 62.2E-6 | 5.3 | |

C | 12 | 4 | 3 | 3,611 | 0.044 | 12.2E-6 | 3,611 | 2,737 | 0.254 | 40.0E-6 | 3.3 |

5 | 3 | 4,438 | 0.044 | 9.9E-6 | 4,438 | 5,454 | 0.265 | 26.8E-6 | 2.7 | ||

50 | 3 | 5,138 | 0.045 | 8.8E-6 | 5,138 | 12,044 | 0.388 | 22.6E-6 | 2.6 | ||

1 | 3 | 0 | 0.005 | – | 0 | 286 | 0.046 | 160.8E-6 | – | ||

2 | 3 | 40 | 0.065 | 1.6E-3 | 40 | 1,569 | 0.065 | 40.4E-6 | 0.0 | ||

C00095 | 3 | 3 | 280 | 0.009 | 32.1E-6 | 280 | 4,899 | 0.089 | 17.2E-6 | 0.5 | |

C | 12 | 4 | 3 | 855 | 0.010 | 11.7E-6 | 855 | 8,273 | 0.148 | 16.2E-6 | 1.4 |

5 | 3 | 1,502 | 0.011 | 7.3E-6 | 1,502 | 12,085 | 0.177 | 13.0E-6 | 1.8 | ||

50 | 3 | 4,608 | 0.012 | 2.6E-6 | 4,608 | 23,686 | 0.186 | 6.6E-6 | 2.5 | ||

1 | 3 | 187 | 0.023 | 123.0E-6 | 187 | 25 | 0.071 | 334.9E-6 | 2.7 | ||

2 | 3 | 2,077 | 0.091 | 43.8E-6 | 2,077 | 1,251 | 0.880 | 264.4E-6 | 6.0 | ||

C03343 | 3 | 3 | 5,345 | 0.201 | 37.6E-6 | 5,345 | 5,746 | 3.134 | 282.6E-6 | 7.5 | |

C | 15 | 4 | 3 | 10,391 | 0.346 | 33.3E-6 | 10,391 | 16,912 | 4.041 | 148.0E-6 | 4.4 |

5 | 3 | 14,531 | 0.482 | 33.2E-6 | 14,531 | 33,064 | 5.887 | 123.7E-6 | 3.7 | ||

50 | 3 | 19,819 | 0.655 | 33.0E-6 | 19,819 | 94,725 | 7.833 | 68.4E-6 | 2.1 | ||

1 | 3 | 491 | 0.562 | 1.1E-3 | 491 | 4,482 | 6.281 | 1.3E-3 | 1.1 | ||

2 | 3 | 7,846 | 1.122 | 143.0E-6 | 7,846 | 76,261 | 17.199 | 204.5E-6 | 1.4 | ||

C00645 | 3 | 3 | 151,227 | 2.420 | 16.0E-6 | 151,227 | 716,216 | 39.476 | 45.5E-6 | 2.8 | |

C | 15 | 4 | 3 | 216,507 | 2.946 | 13.6E-6 | 216,507 | 1,270,462 | 33.842 | 22.8E-6 | 1.7 |

5 | 3 | 272,898 | 3.405 | 12.5E-6 | 272,898 | 1,757,010 | 40.323 | 19.9E-6 | 1.6 | ||

50 | 3 | 355,958 | 3.985 | 11.2E-6 | 355,958 | 2,625,154 | 43.002 | 14.4E-6 | 1.3 | ||

1 | 3 | 0 | 0.853 | – | 0 | 4 | 1.206 | 301.5E-3 | – | ||

2 | 3 | 389 | 1.569 | 4.0E-3 | 389 | 660 | 2.496 | 2.4E-3 | 0.6 | ||

C00837 | 3 | 3 | 2,510 | 1.999 | 796.4E-6 | 2,510 | 3,173 | 3.367 | 592.5E-6 | 0.7 | |

C | 18 | 4 | 3 | 8,544 | 2.314 | 270.8E-6 | 8,544 | 12,834 | 3.994 | 186.8E-6 | 0.7 |

5 | 3 | 13,796 | 2.465 | 178.7E-6 | 13,796 | 27,186 | 4.841 | 118.1E-6 | 0.7 | ||

50 | 3 | 24,313 | 2.683 | 110.4E-6 | 24,313 | 94,089 | 5.540 | 46.8E-6 | 0.4 | ||

1 | 3 | 22 | 0.026 | 1.2E-3 | 22 | 4 | 0.099 | 3.8E-3 | 3.2 | ||

2 | 3 | 386 | 0.058 | 150.3E-6 | 386 | 261 | 0.426 | 658.4E-6 | 4.4 | ||

C07178 | 3 | 3 | 2,376 | 0.089 | 37.5E-6 | 2,376 | 1,288 | 0.735 | 200.6E-6 | 5.4 | |

C | 18 | 4 | 3 | 4,092 | 0.102 | 24.9E-6 | 4,092 | 2,240 | 0.863 | 136.3E-6 | 5.5 |

5 | 3 | 4,629 | 0.109 | 23.5E-6 | 4,629 | 2,385 | 1.284 | 183.1E-6 | 7.8 | ||

50 | 3 | 5,103 | 0.115 | 22.5E-6 | 5,103 | 2,603 | 0.980 | 127.2E-6 | 5.6 | ||

1 | 3 | 156,073 | 391.611 | 2.5E-3 | ≥ 1,546 | ≥ 234,187 | T. O. | 7.6E-3 | 3.0 | ||

2 | 3 | 12,515,364 | 1331.770 | 106.4E-6 | ≥ 93,244 | ≥ 1,107,707 | T. O. | 1.5E-3 | 14.1 | ||

C00270 | 3 | 3 | ≥ 88,182,895 | T. O. | 20.4E-6 | ≥ 4,350,635 | ≥ 1,135,547 | T. O. | 328.3E-6 | 16.1 | |

C | 21 | 4 | 3 | ≥ 134,281,382 | T. O. | 13.4E-6 | ≥ 5,230,515 | ≥ 2,086,287 | T. O. | 246.0E-6 | 18.4 |

5 | 3 | ≥ 169,965,948 | T. O. | 10.6E-6 | ≥ 5,213,010 | ≥ 2,383,696 | T. O. | 237.1E-6 | 22.4 | ||

50 | 3 | ≥ 254,637,067 | T. O. | 7.1E-6 | ≥ 7,025,893 | ≥ 12,785,700 | T. O. | 90.9E-6 | 12.9 | ||

1 | 3 | 32,995 | 15.978 | 484.3E-6 | 32,995 | 179 | 66.612 | 2.0E-3 | 4.1 | ||

2 | 3 | 2,472,133 | 149.048 | 60.3E-6 | ≥ 1,763,123 | ≥ 702,493 | T. O. | 730.0E-6 | 12.1 | ||

C03690 | 3 | 3 | 13,120,833 | 509.010 | 38.8E-6 | ≥ 2,416,279 | ≥ 2,028,470 | T. O. | 405.0E-6 | 10.4 | |

C | 23 | 4 | 3 | 43,379,162 | 1289.269 | 29.7E-6 | ≥ 1,815,035 | ≥ 4,297,658 | T. O. | 294.5E-6 | 9.9 |

5 | 3 | ≥ 80,447,027 | T. O. | 22.4E-6 | ≥ 2,828,014 | ≥ 10,431,681 | T. O. | 135.7E-6 | 6.1 | ||

50 | 3 | ≥ 111,576,848 | T. O. | 16.1E-6 | ≥ 3,580,958 | ≥ 55,327,406 | T. O. | 30.6E-6 | 1.9 |

*n*, where

*n*is the size of an instance. Note that

*n*is the number of vertices without hydrogen atoms in an instance preprocessed by replacing each benzene ring with a new atom with six valences. We notice that there is no large difference in the “ratio” between all cases in spite of the fact that the instance size of C03690 is almost twice as large as that of C00062. This implies that our algorithm maintains the high computational efficiency of 2-Phase algorithm even if the instance size becomes large.

- (II)
Next we select four chemical compounds, prostaglandin (D08040), allobarbital (D00332), gabapentin (D00555), and histamine (D00079) as chemical graphs with 1-augmented tree structure (see Additional file 1: Figure S22 for illustrations of these chemical graphs), all of which are existing drug compounds. We conducted the same experiment as we did for (I): Table 3 shows the result of the comparison of 2-Phase algorithm and our algorithm for varying

*K*with fixed*w*=1; Table 4 shows the result of the comparison of 2-Phase algorithm and our algorithm for varying*w*with fixed*K*=3. In this experiment, we observe that there still is no large difference in the “ratio” between all cases except for the instance of D00079.

**Comparison of 2-Phase algorithm and our algorithm in chemical graphs (II)**

Entry | 2-Phase Algorithm | Our Algorithm | ||||||||
---|---|---|---|---|---|---|---|---|---|---|

Formula | n | K | Tree | Time | Time/graph | Tree | Cycle | Time | Time/graph | Ratio |

1 | 2,609 | 0.006 | 2.3E-6 | 2,609 | 11,263 | 0.036 | 2.6E-6 | 1.1 | ||

2 | 193 | 0.007 | 36.3E-6 | 193 | 165 | 0.015 | 41.9E-6 | 1.2 | ||

D08040 | 3 | 14 | 0.009 | 642.9E-6 | 14 | 5 | 0.014 | 736.8E-6 | 1.1 | |

C | 8 | 4 | 9 | 0.005 | 555.6E-6 | 9 | 2 | 0.006 | 545.5E-6 | 1.0 |

5 | 4 | 0.004 | 1.0E-3 | 4 | 1 | 0.005 | 1.0E-3 | 1.0 | ||

6 | 4 | 0.004 | 1.0E-3 | 4 | 1 | 0.005 | 1.0E-3 | 1.0 | ||

7 | 1 | 0.005 | 1.3E-3 | 4 | 1 | 0.006 | 1.2E-3 | 0.9 | ||

1 | 17,470 | 0.127 | 7.2E-6 | 17,470 | 264,326 | 1.446 | 5.1E-6 | 0.7 | ||

2 | 1,183 | 0.023 | 19.4E-6 | 1,183 | 16,233 | 0.294 | 17.9E-6 | 0.9 | ||

D00332 | 3 | 30 | 0.017 | 566.7E-6 | 30 | 1,318 | 0.170 | 126.1E-6 | 0.2 | |

C | 12 | 4 | 0 | 0.025 | – | 0 | 292 | 0.112 | 383.6E-6 | – |

5 | 0 | 0.034 | – | 0 | 41 | 0.090 | 2.2E-3 | – | ||

6 | 0 | 0.021 | – | 0 | 12 | 0.050 | 4.1E-3 | – | ||

7 | 0 | 0.016 | – | 0 | 8 | 0.037 | 4.6E-3 | – | ||

1 | 54,072,616 | 200.647 | 3.7E-6 | ≥11,645,178 | ≥110,673,601 | T.O. | 14.7E-6 | 4.0 | ||

2 | 68,253 | 5.458 | 80.0E-6 | 68,253 | 321,853 | 130.500 | 334.2E-6 | 4.2 | ||

D00555 | 3 | 4,590 | 4.115 | 896.5E-6 | 4,590 | 6,511 | 70.821 | 6.4E-3 | 7.1 | |

C | 16 | 4 | 91 | 2.702 | 29.7E-3 | 91 | 278 | 28.632 | 77.5E-3 | 2.6 |

5 | 0 | 1.438 | – | 0 | 38 | 15.129 | 397.3E-3 | – | ||

6 | 0 | 0.694 | – | 0 | 5 | 6.882 | 1.4E+0 | – | ||

7 | 0 | 0.211 | – | 0 | 1 | 0.663 | 663.0E-3 | – | ||

1 | ≥10,280 | T.O. | 175.1E-3 | ≥1,432 | ≥883,812 | T.O. | 2.0E-3 | 0.0 | ||

2 | ≥19,587,838 | T.O. | 91.9E-6 | ≥0 | ≥0 | T.O. | – | – | ||

D00079 | 3 | ≥1,134,806 | T.O. | 1.6E-3 | ≥21,048 | ≥0 | T.O. | 85.5E-3 | 53.4 | |

C | 25 | 4 | ≥17,852 | T.O. | 100.8E-3 | ≥0 | ≥0 | T.O. | – | – |

5 | ≥23 | T.O. | 78.3E+0 | ≥0 | ≥0 | T.O. | – | – | ||

6 | ≥0 | T.O. | – | ≥0 | ≥0 | T.O. | – | – | ||

7 | ≥0 | T.O. | – | ≥0 | ≥0 | T.O. | – | – |

**Comparison of varying width**
w
**in chemical graphs (II)**

Entry | 2-Phase Algorithm | Our Algorithm | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|

Formula | n | w | K | Tree | Time | Time/graph | Tree | Cycle | Time | Time/graph | Ratio |

1 | 3 | 14 | 0.009 | 643.9E-6 | 14 | 5 | 0.014 | 736.8E-6 | 1.1 | ||

2 | 3 | 49 | 0.009 | 183.7E-6 | 49 | 14 | 0.017 | 269.8E-6 | 1.5 | ||

D08040 | 3 | 3 | 58 | 0.009 | 155.2E-6 | 58 | 21 | 0.017 | 215.1E-6 | 1.4 | |

C | 8 | 4 | 3 | 60 | 0.009 | 150.0E-6 | 60 | 25 | 0.017 | 200.0E-6 | 1.3 |

5 | 3 | 60 | 0.009 | 150.0E-6 | 60 | 26 | 0.017 | 197.6E-6 | 1.3 | ||

50 | 3 | 61 | 0.003 | 49.2E-6 | 61 | 28 | 0.017 | 191.0E-6 | 3.9 | ||

1 | 3 | 30 | 0.017 | 566.7E-6 | 30 | 1,318 | 0.170 | 126.1E-6 | 0.2 | ||

2 | 3 | 313 | 0.024 | 76.7E-6 | 313 | 8,822 | 0.266 | 29.1E-6 | 0.4 | ||

D00332 | 3 | 3 | 1,327 | 0.024 | 76.7E-6 | 1,327 | 18,010 | 0.285 | 14.7E-6 | 0.8 | |

C | 12 | 4 | 3 | 2,239 | 0.025 | 11.2E-6 | 2,239 | 24,550 | 0.293 | 10.9E-6 | 1.0 |

5 | 3 | 4,197 | 0.025 | 6.0E-6 | 4,197 | 30,122 | 0.297 | 8.7E-6 | 1.5 | ||

50 | 3 | 6,656 | 0.025 | 3.8E-6 | 6,656 | 34,145 | 0.309 | 7.6E-6 | 2.0 | ||

1 | 3 | 4,590 | 4.115 | 896.5E-6 | 4,590 | 6,511 | 70.806 | 6.4E-3 | 3.8 | ||

2 | 3 | 76,901 | 10.466 | 136.1E-6 | 76,901 | 186,971 | 221.353 | 838.7E-6 | 6.2 | ||

D00555 | 3 | 3 | 221,492 | 14.952 | 67.5E-6 | 221,492 | 770,625 | 317.488 | 320.0E-6 | 4.7 | |

C | 16 | 4 | 3 | 348,335 | 16.381 | 47.0E-3 | 348,335 | 1,307,167 | 347.379 | 209.8E-6 | 4.5 |

5 | 3 | 458,635 | 16.837 | 36.7E-3 | 458,635 | 1,976,544 | 357.252 | 146.7E-6 | 4.0 | ||

50 | 3 | 556,272 | 17.090 | 30.7E-3 | 556,272 | 3,544,713 | 363.743 | 88.7E-6 | 2.9 | ||

1 | 3 | ≥1,134,806 | T.O. | 1.6E-3 | ≥21,048 | ≥0 | T.O. | 85.5E-3 | 53.4 | ||

2 | 3 | ≥3,917,059 | T.O. | 459.5E-6 | ≥0 | ≥0 | T.O. | – | – | ||

D00079 | 3 | 3 | ≥86,360 | T.O. | 20.8E-3 | ≥28,187 | ≥0 | T.O. | 64.6E-3 | 3.1 | |

C | 25 | 4 | 3 | ≥1,469,428 | T.O. | 1.2E-3 | ≥61,929 | ≥229 | T.O. | 29.0E-3 | 24.2 |

5 | 3 | ≥5,118,134 | T.O. | 351.7E-6 | ≥19,900 | ≥1,726 | T.O. | 83.2E-3 | 236.6 | ||

50 | 3 | ≥216,008,008 | T.O. | 8.3E-6 | ≥0 | ≥0 | T.O. | – | – |

## Discussions and conclusions

We considered the problem of enumerating all chemical graphs of 1-augmented tree structure from a given set of path-frequency based feature vectors specified by upper and lower feature vectors, and proposed a new exact algorithm by extending 2-Phase algorithm [14]. The experimental results reveal that the computational efficiency of the new algorithm remains high, considering the hardness of treating 1-augmented trees compared with 0-augmented trees.

One of our future works is to introduce new bounding operations for 1-augmented trees in 2-Phase algorithm and our procedure for creating a cycle. Additionally, it is important to extend the proposed algorithm for enumeration of *k*-augmented trees with *k*≥2 because we can cover 59*%*, 79*%*, and 90*%* of chemical compounds by 2-augmented trees, 3-augmented trees, and 4-augmented trees, respectively. In this paper, we used the assumption that chemical graphs we treat contain only atoms with valence at most 4 (except benzene rings) in order to define the parent of a 1-augmented tree *G* as a 0-augmented tree *T* that is obtained by removing an edge corresponding to a single bond in *G*. However, it is not difficult to extend our enumeration algorithm for chemical graphs possibly with atoms with valence more than 4 just by modifying the definition so that the parent of a 1-augmented tree *G* is allowed to be a 0-augmented tree *T* obtained by removing an edge that corresponds to a double or triple bond in *G*. Although benzene rings have already been treated as virtual atoms of valence 6, regioisomers are ignored in the proposed algorithm. As mentioned in “Experimental and results” section, an efficient algorithm for generating all possible regioisomers of a given 0-augmented tree structure with virtual atoms b has been developed [36]. Therefore, combination of the proposed algorithm with that algorithm is left as future work as well as further extensions for including atoms with valence more than 4 and furan and more general structures.

Although we do not aim to develop enumeration algorithms that are directly applicable to drug design, this is a future target of our research. In order to apply enumeration algorithms to drug design, considering features based on the path frequency is far from sufficient. Factors such as hydrogen bond donors, hydrogen bond acceptors, positive charges, negative charges, and hydrophobic centers should be taken into account. In addition, the binding site information of the target molecule and geometric information such as the occurrence of rotatable bonds should be reflected. In order to include these factors in enumeration algorithms, we should develop efficient methods that can relate chemical graphs with such physico-chemical and geometric factors. However, such a development is not an easy task even for one type of factor and thus is long-term future work.

## Declarations

### Acknowledgements

This work was supported in part by Grant-in-Aid for Scientific Research (KAKENHI) no. 22240009 from JSPS, Japan.

## Authors’ Affiliations

## References

- Bytautas L, Klein DJ: Chemical combinatorics for alkane-isomer enumeration and more. J Chem Inf Comput Sci. 1998, 38: 1063-1078. 10.1021/ci980095c.View ArticleGoogle Scholar
- Bytautas L, Klein DJ: Formula periodic table for acyclic hydrocarbon isomer classescombinatorially averaged graph invariants. Phys Chem Chem Phys. 1999, 1: 5565-5572. 10.1039/a906137a.View ArticleGoogle Scholar
- Bytautas L, Klein DJ: Isomer combinatorics for acyclic conjugated polyenes: enumeration and beyond. Theoret Chem Acc. 1999, 101: 371-387. 10.1007/s002140050455.View ArticleGoogle Scholar
- Buchanan BG, Feigenbaum EA: DENDRAL and Meta-DENDRAL: their applications dimension. Artif Intell. 1978, 11: 5-24. 10.1016/0004-3702(78)90010-3.View ArticleGoogle Scholar
- Funatsu K, Sasaki S: Recent advances in the automated structure elucidation system, CHEMICS. Utilization of two-dimensional NMR spectral information and development of peripheral functions for examination of candidates. J Chem Inf Comput Sci. 1996, 36: 190-204. 10.1021/ci950152r.View ArticleGoogle Scholar
- Fink T, Reymond JL: Virtual exploration of the chemical universe up to 11 atoms of C, N, O, F: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physicochemical properties, compound classes, and drug discovery. J Chem Inf Comput Sci. 2007, 47: 342-353. 10.1021/ci600423u.View ArticleGoogle Scholar
- Mauser H, Stahl M: Chemical fragment spaces for de novo design. J Chem Inf Comput Sci. 2007, 47: 318-324. 10.1021/ci6003652.View ArticleGoogle Scholar
- Faulon JL, Churchwell CJ: The signature molecular descriptor. 2. Enumerating molecules from their extended valence sequences. J Chem Inf Comput Sci. 2003, 43: 721-734. 10.1021/ci020346o.View ArticleGoogle Scholar
- Hall LH, Dailey ES: Design of molecules from quantitative structure-activity relationship models. 3. Role of higher order path counts: Path 3. J Chem Inf Comput Sci. 1993, 33: 598-603. 10.1021/ci00014a012.View ArticleGoogle Scholar
- Deshpande M, Kuramochi M, Wale N, Karypis G: Frequent substructure-based approaches for classifying chemical compounds. IEEE Trans Knowl Data Eng. 2005, 17: 1036-1050.View ArticleGoogle Scholar
- Cayley A: On the analytic forms called trees with applications to the theory of chemical combinations. Rep Br Assoc Adv Sci. 1875, 45: 257-305.Google Scholar
- Pólya G: Kombinatorische anzahlbestimmungen für gruppen, graphen, und chemische verbindungen. Acta Math. 1937, 68: 45-254.View ArticleGoogle Scholar
- Shimizu M, Nagamochi H, Akutsu T: Enumerating tree-like chemical graphs with given upper and lower bounds on path frequencies. BMC Bioinformatics. 2011, 12 (Suppl 14): 53-View ArticleGoogle Scholar
- Suzuki M, Nagamochi H, Akutsu T: A 2-phase algorithm for enumerating tree-like chemical graphs satisfying given upper and lower bounds. IPSJ SIG Technical Reports. BIO 2012, 2012, 28 (17): 1-8.Google Scholar
- Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M: KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 2010, 38: D355-D360,Google Scholar
- Harary F: Graph Theory. 1969, MA: Addison WesleyGoogle Scholar
- Bakir GH, Weston J, Schölkopf B: Learning to find pre-images. Adv Neural Inform Process Syst. 2003, 16: 449-456.Google Scholar
- Bakir GH, Zien A, Tsuda K: Learning to find graph pre-images. Lect Notes Comput Sci. 2004, 3175: 253-261. 10.1007/978-3-540-28649-3_31.View ArticleGoogle Scholar
- Kashima H, Tsuda K, Inokuchi A: Marginalized kernels between labeled graphs. Proceedings of the 20th International Conference Machine Learning. 2003, Palo Alto: AAAI Press, 321-328.Google Scholar
- Ueda N, Akutsu T, Perret JL, Vert JP, Mahé P: Graph kernels for molecular structure-activity relationship analysis with support vector machines. J Chem Inf Model. 2005, 45: 939-951. 10.1021/ci050039t.View ArticleGoogle Scholar
- Byvatov E, Fechner U, Sadowski J, Schneider G: Comparison of support vector machine and artificial neural network systems for drug/nondrug classification. J Chem Inf Comput Sci. 2003, 3: 1882-1889.View ArticleGoogle Scholar
- Akutsu T, Fukagawa D, Jansson J, Sadakane K: Inferring a graph from path frequency. Discrete Appl Math. 2012, 160: 1416-1428. 10.1016/j.dam.2012.02.002.View ArticleGoogle Scholar
- Nagamochi H: A detachment algorithm for inferring a graph from path frequency. Algorithmica. 2009, 53: 207-224. 10.1007/s00453-008-9184-0.View ArticleGoogle Scholar
- Kier L, Hall L, Frazer J: Design of molecules from quantitative structure-activity relationship models. 1. Information transfer between path and vertex degree counts. J Chem Inf Comput Sci. 1993, 33: 143-147. 10.1021/ci00011a021.View ArticleGoogle Scholar
- Miyao T, Arakawa M, Funatsu K: Exhaustive structure generation for inverse-QSPR/QSAR. Mol Inform. 2010, 29: 111-125. 10.1002/minf.200900038.View ArticleGoogle Scholar
- Wong WWL, Burkowski FJ: A constructive approach for discovering new drug leads: Using a kernel methodology for the inverse-QSAR problem. J Cheminf. 2009, 1: 4-10.1186/1758-2946-1-4.View ArticleGoogle Scholar
- Gugisch R, Kerber A, Laue R, Meringer M, Rucker C: History and progress of the generation of structural formulae in chemistry and its applications. MATCH Comm Math Comput Chem. 2007, 58: 239-280.Google Scholar
- Fujiwara H, Wang J, Zhao L, Nagamochi H, Akutsu T: Enumerating tree-like chemical graphs with given path frequency. J Chem Inf Model. 2008, 48: 1345-1357. 10.1021/ci700385a.View ArticleGoogle Scholar
- Nakano S, Uno T: Generating colored trees. Lect Notes Comput Sci. 2005, 3787: 249-260. 10.1007/11604686_22.View ArticleGoogle Scholar
- Nakano S, Uno T: Efficient generation of rooted trees. NII Technical Report. 2003, NII-2003-005E,Google Scholar
- Ishida Y, Zhao L, Nagamochi H, Akutsu T: Improved algorithms for enumerating tree-like chemical graphs with given path frequency. Genome Inform. 2008, 21: 53-64.Google Scholar
- Luks EM: Isomorphism of graphs of bounded valence can be tested in polynomial time. J Comput Syst Sci. 1982, 25: 42-65. 10.1016/0022-0000(82)90009-5.View ArticleGoogle Scholar
- Fürer M, Schnyder W, Specker E: Normal forms for trivalent graphs and graphs of bounded valence. Proceedings of 15th Annual ACM Symp Theory of Computing. 1983, NY: ACM Press, 161-170.Google Scholar
- Faulon J-L: Isomorphism, automorphism partitioning, and canonical labeling can be solved in polynomial-time for molecular graphs. J Chem Inf Comput Sci. 1998, 38: 432-444. 10.1021/ci9702914.View ArticleGoogle Scholar
- Jordan C: Sur les assemblages de lignes. J Reine Angew Math. 1869, 70: 185-190.View ArticleGoogle Scholar
- Li J: Enumerating benzene isomers of tree-like chemical graphs. Master’s Thesis. Department of Applied Mathematics and Physics, Graduate School of Informatics, Kyoto University 2014,Google Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

## Comments

View archived comments (1)