MORT: a powerful foundational library for computational biology and CADD
© Zhang et al.; licensee Chemistry Central Ltd. 2014
Received: 17 April 2014
Accepted: 23 June 2014
Published: 27 June 2014
A foundational library called MORT (Molecular Objects and Relevant Templates) for the development of new software packages and tools employed in computational biology and computer-aided drug design (CADD) is described here.
MORT contains several advantages compared with the other libraries. Firstly, MORT written in C++ natively supports the paradigm of object-oriented design, and thus it can be understood and extended easily. Secondly, MORT employs the relational model to represent a molecule, and it is more convenient and flexible than the traditional hierarchical model employed by many other libraries. Thirdly, a lot of functions have been included in this library, and a molecule can be manipulated easily at different levels. For example, it can parse a variety of popular molecular formats (MOL/SDF, MOL2, PDB/ENT, SMILES/SMARTS, etc.), create the topology and coordinate files for the simulations supported by AMBER, calculate the energy of a specific molecule based on the AMBER force fields, etc.
We believe that MORT can be used as a foundational library for programmers to develop new programs and applications for computational biology and CADD. Source code of MORT is available at http://cadd.suda.edu.cn/MORT/index.htm.
KeywordsRelational model MORT AMBER Antechamber Foundational library CADD
Molecular modeling techniques have been widely used in the fields of chemistry, biology, drug design, and materials science for studying molecular systems ranging from small molecules to large biological molecules and even material assemblies. A lot of molecular simulation and visualization tools or packages have been developed [1–9]. Amber  is extensively used for the simulation of biomolecules, and XLEAP is its graphical user interface. XLEAP is written in C but trying to use the object-oriented programming (OOP) paradigm that is not natively supported by C. Therefore, the code of XLEAP is awkward and extremely hard to be extended. Consequently, a lot of functionalities that should have been incorporated into XLEAP were either implemented as separate programs, such as ADDLES2 for Locally Enhanced Sampling, and antechamber for the calculations of partial charges and the assignment of atom types, or merged into molecular dynamics (MD) calculation stage as extra points. The study we discuss here is trying to put this situation to an end by developing a new foundational library that is called Molecular Objects and Relevant Templates (MORT).
MORT is written in C++ that is a native OOP language, and a relational model and other well-designed patterns have been applied to this library, which makes it very flexible. Furthermore, many commands/functions supported by XLEAP and AmberTools have been merged into this library . Based on MORT, it will be very easy for programmers or readers with interests to develop new applications for computational biology and CADD.
Data structure and basic features of MORT
The relational model of MORT
An advantage of MORT is that it employs the relational model rather than the traditional hierarchical model to store all the information of a molecule. In the hierarchical model, a molecule owns some residues, and a residue owns some atoms. Therefore, a molecule does not own atoms directly. In the relational model, a molecule owns residues and atoms directly, while a residue does not own atoms, and they just have relations between each other.
The hierarchical model is used by XLEAP and some other software packages (for example, NAB ). But it has the following disadvantages. Firstly, the hierarchical model is inconvenient for iterating over the atoms in a molecule. In the hierarchical model, to iterate over atoms, users need to iterate over all residues first. Therefore, counting atom number is not a constant time operation but proportional to the number of residues. To overcome this problem, some kind of cache may be used to store the atom number or pointers to the atoms inside a molecule. This solves the efficiency problem but brings up a consistency problem, i.e. the cache needs to be updated when an atom is deleted or inserted. Secondly, the hierarchical model has problems to handle bonds, because in the hierarchical model, residues own atoms, and it is natural that bonds should be owned by residues too. A problem may always exist for inter-residue bonds, whose ownership would be a dilemma for the hierarchical model. Similar things happen to other components, such as angles, torsions and improper torsions. Thirdly, the hierarchical model is hard to be extended. For example, adding a new object strand is a very useful concept for a DNA system, because a DNA molecule has two strands and each strand has several residues. In the hierarchical model, we need to add an extra layer between molecule and residue, but in practice, all code that involves the iterations on atoms and residues needs to be changed to iterate on strands first, which in turn means massive changes to the existing code.
However, all these problems do not exist or can be easily fixed for the relational model employed in MORT. Firstly, the problem of iterating on atoms does not exist because a molecule owns atoms directly, and then users can perform the iteration on a molecule directly disregarding residues. Secondly, the relational model does not have any problem dealing with bonds, angles and torsions, because in the relational model a molecule directly owns bonds, angles (if any) and torsions (if any). Finally, compared with the hierarchical model, the relational model is extremely easy to be extended. Taking the strand as an example, to introduce a strand by employing the relational model, users do not need to change the old code. Strand can only appear in DNA-related program and all users need to do is to make sure that the relations between strand and residues have been created correctly.
Composition of a molecule
The descriptions of some important classes of variables
a class contains molecular objects and their relationships
a class used to store the information of molecular objects
a class used to store the relationships between molecular objects
a class contains lots of molecules
a base class of the molecular objects
a class used to handle atoms
a class used to handle bonds
a class used to handle angles
a class used to handle dihedral angles
a class used to handle residues
A component physically contains all the properties of a kind of molecular objects. For example, the atom component of a molecule contains all the properties of atoms. More specifically, the properties of the same type are stored sequentially in an array, the atomic numbers of all atoms are stored in an integer array, while the names of all atoms are stored in a string array. However, the properties whose type is numerical vector are handled differently, and they are not stored in an array of numerical vectors (which can be considered as a 2D array of floating numbers) but in a 1D array of floating numbers. The component is implemented in this way to achieve the best space efficiency.
An adjacency records the relations between two components. For each pair of connected molecular objects, there is a record in the adjacency. More specifically, it uses a 2D array to store connections.
Assessing properties of MOs
As has been mentioned before, class morf_t is created for the modification, creation or deletion of objects. In order to achieve the above goals, some functions are necessary to access the properties of MOs, and therefore several member functions have been designed as follows:
void set_x( const string& pname, const value_type& v );
value_type get_x( const string& pname ) const;
bool get_x( const string& pname, value_type& v );
The character ‘x’ in the names of these functions can be any of the following five characters: ‘i’ (for integer), ‘d’ (for double precision), ‘s’ (for string), ‘v’ (for numeric vector) and ‘a’ (for any other data type), while the value_type can be int, double, string, numvec and boost::any depending on the ‘x’. In order to accelerate the process, string can be replaced by HASH values, and these functions can be transformed into the following ones:
void set_x( const long long& pid, const value_type& v );
value_type get_x( const long long& pid ) const;
bool get_x( const long long& pid, value_type& v );
Iterating on MOs
Two methods can be used to iterate on the MOs of a molecule: MOITER (molecular object iterator, represented by class iter_T) or MORANGE (molecular object range, represented by class range_T).
MOITER is a random access iterator, and it has the following member functions:
morf_t const& operator*() const;
morf_t & operator- > ();
morf_t const& operator- > () const;
iter_t& operator + =(ptrdiff_t pdif);
iter_t& operator- = (ptrdiff_t pdif);
With all these member functions implemented, MOITER works just like a pointer to class morf_t. The following two member functions of class molecule_t return the starting and ending iterators of a certain type of MO:
iter_t xxxx_begin ();
Here, xxxx could be any of the eight 4 character ID of a molecular object type.
Another way to enumerate MOs is to use MORANGE (molecular object range, represented by class range_T). Class range_T has the following member functions:
morf_t operator(int id) const;
morf_t at(int id) const;
MORANGE in a sense works just like an array of MOs. Class molecule_t has the functions xxxxs() that return the MORANGE of a certain type of MOs, while class morf_t has similar member functions related_xxxxs() that returns the MORANGE of related MOs.
Basic functions in MORT
OBJFUN: a lot of functions are defined in this directory, and they can be used to modify objects. For example, the function fixbond can be used to fix the bond order of a molecule based on several rules, such as hard rule, length rule and conjugation rule, as illustrated in our previous work ; the function addHs can be used to add the missing hydrogen atoms of a molecule (bond information is required); the function create can be used to create atoms, bonds, residues, angles, etc.; the functions transform and translate can be used to transform/move a molecule according to one matrix/vector; the function rotate can be used to rotate a molecule; the function center can be used to determine the geometrical center of a molecule or residue. A lot of other functions are not mentioned here, and the descriptions of the important functions can be found in Additional file 1. Apart from being used alone, reasonable combinations of these functions may be more helpful to users. For example, the combination of the functions fixbond, addHs and setpchg can be used to add the missing information of a molecule, which is necessary for the calculation of the energy.
PDBENT and TRIPOS: functions in these two directories are primarily used to handle the files in PDB  and MOL2  formats. The most important functions are read_pdb/read_mol2 and write_pdb/write_mol2, which act as the controller of the data input/output stream in MORT. Each of these functions is composed of several sections, and in each section there is one function to parse the corresponding information. For example, read_pdb is used to get the molecule from a PDB formatted file. While parsing the file, it can recognize the first four letters of each line as its identity to determine which section it belongs to, and then the corresponding functions will be invoked to get the information from this line. Take "ATOM" for example: when "ATOM" has been recognized, the function read_atom will be used to parse the atom’s information (including its name, coordinate and type). And then an atom object will be created to store all the information.
SMARTS: functions in this directory can handle two kinds of information: SMILES and SMARTS . SMILES is the acronym of Simplified Molecular Input Line Entry Specification that has been widely used as a general-purpose chemical nomenclature and data exchange format, and SMARTS is the straightforward extensions of SMILES. The SMILES/SMARTS of a molecule is stored as a character string, which is the input parameter of the function read_smiles/read_smarts. These functions can extract the topology information of a molecule from the input string, and then store them into the database as a molecule. The whole process is not very complicated, and it will handle one character in each iteration. First, it will judge if the characters are in bracket "" or not, because the characters in or out of a bracket have different meanings. For example, "C" in "xxCxx" means that it is a normal carbon atom, but "C" in "[xxCxx]" may represent a carbon isotope. After the judgment, the corresponding functions will be used to analyze the character: parse_charge can be used to get the charge information from "+" and "-"; parse_ring can be used to recognize the digit outside a bracket as a starting/ending point of a ring; parse_weight can be used to parse the element’s weight that is listed in a bracket; parse_alpha can be used to interpret the element’s type, etc. Finally, all the information will be recorded into a molecule.
CAPBOX: functions in this directory are mainly designed to add the solvent environment to an object. Four functions were designed to build different kinds of solvation environments: solvatecap can be used to add a solvate cap around the specified position of a solute; solvatebox can be used to add a solvate box to the solute in a cuboid way; solvateoct can be used to add the solvate box to a solute in a truncated octahedron way, which can reduce the number of the added solvent molecules; solvateshl can be used to add a solvate shell around the whole solute. The algorithms of these four functions are very similar. The process of solvateshl is illustrated in Figure 5 as an example.
Apart from the functions mentioned above, the function addions can be used to add positive/negative ions to the whole molecule.
ATMASK: sometimes, users may want to get the selected partition of a molecule, such as the atoms within 5Å from the 4th atom of a molecule. To achieve this goal, the function mask_atom was designed. This function can be employed to the specified atoms and residues. The argument of this function is a little complicated, and it is composed of some figures and symbols. For example, ":1-10" means to get the residues 1 to 10 from a molecule; "@4 < @5" means to return all the atoms within 5Å from the atom 4.
Following is the steps to interpret "@4 < @5". The first "@" indicates that this is something about atoms ("@" represents atoms and ":" stands for residues). The functions of class atom_node will then be invoked, and the following letters or numbers will be parsed as a MO in a specified molecule (here "4" represents the 4th atom). The symbol "<" means to get the atoms/residues in a certain distance, and then it will invoke the function parse of class dist_node to interpret the next section. The previous MO will be regarded as a core. The second "@" indicates that the queried objects are atoms, and "5" equals to the threshold of the distance. Finally, the function match checks each atom in the molecule that satisfies the condition, and returns the qualified ones.
FORMAT: functions defined here can be used to handle different kinds of formatted files. For example, the function read_sdf can be used to parse the molecule from a MOL/SDF formatted file, and the function load_mdb can be used to load the molecules from a database file, which contains a lot of molecules with the MOL2, MOL/SDF or OFF format. Once a molecule has been loaded, it will be stored in a database (represented by class database_t) with its name as the identity.
- (1).ENEFRC: functions in this directory are energy-related. The function eval_bond, eval_angl or eval_tors can be used to calculate the energy of bonds, angles or torsions. The non-bond energy can be calculated in two ways. The function get_dir can be used to calculate the energy based on periodic boundary condition (PBC) by using Ewald summation [15, 16]. And the other function nonbond_egb can be used to calculate the polar contribution of desolvation for a non-periodic model by using the Generalized Born (GB) model  based on the following equations:(1)(2)(3)(4)
AMBFMT: In this part of MORT, some functions were designed to read the AMBER and AMOEBA force fields and save the properties of a molecule into topology files. For example, write_amber_prmtop can be used to generate two files: topology (such as charge, bond, angle, etc.) and coordinate files (space information), which can be used as the input files for MD simulations. read_frc/read_amoeba_frc can be used to get the AMBER/AMOEBA force field parameters from an AMBER/AMOEBA parameter file and then store them into molecules.
For the better use of MORT, several commands have been defined in directory "plugin", which makes it very convenient to develop one serviceable application with just a few or sometimes just one command for the developers.
PLUGIN: commands in this directory can be regarded as the MORT’s interface, and users can call these commands to accomplish many kinds of tasks with less effort than using functions. They can help the developers do more work in less time.
In each command class, apart from its constructor and destructor, one function is necessary:
virtual bool exec();
This function is a virtual function inherited from class command_i, which is the base class of all the command classes. exec performs variously in command_i’s child classes, reflecting the polymorphism of C++ language. Once one command class is declared, exec should be used to execute the corresponding commands at the end.
A lot of commands have be defined, for example, source can be used to interpret the files displayed in directory "dat/cmd", which contains the commands used for loading the force field parameters for the preparation of other operations (so source is usually the first thing that users need to do before executing any command); merge can be used to put the objects listed in its arguments together; solvate can be used to add the solvents around a molecule; moloper can be used to add the missing information to a molecule (mentioned in directory OBJFUN part), etc. These commands are all defined in their own classes, and users need to declare the class first before doing any operation. The details of the important functions and classes in MORT and the related information of important commands can be found in Additional file 1.
Results and discussion
MORT can be used as a foundational library to develop new programs or software packages for computational biology and CADD. Usually, based on MORT, only a few codes are needed to solve a problem. For example, if users want to calculate the energy of a protein, the energy-related functions can be used conveniently. If the information of a molecule is incomplete, fixbond, addHs and other functions can work together to add the missing information.
Moreover, many commands in AmberTools have been integrated into MORT. For example, setpchg can be used to call the standalone program in Antechamber  to assign atomic partial charges and parmchk can be used to call Parmchk to add missing force field parameters. With all these commands implemented, it is easy to create the topology file for a protein-inhibitor complex inside MORT without calling any outside programs.
Availability and requirements
MORT is available at http://cadd.suda.edu.cn/MORT/index.htm, and it can be compiled into a static library on Linux platform. It’s written in C++ and the boost library (version 1.46.1 or newer) is needed.
This study was supported by the National Science Foundation of China (21173156), the National Basic Research Program of China (973 program, 2012CB932600), and the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).
- Andrio P, Fenollosa C, Cicin-Sain D, Orozco M, Gelpí JL: MDWeb and MDMoby: an integrated web-based platform for molecular dynamics simulations. Bioinformatics. 2012, 28 (9): 1278-1279. 10.1093/bioinformatics/bts139.View ArticleGoogle Scholar
- Cao DS, Xu QS, Hu QN, Liang YZ: ChemoPy: freely available python package for computational biology and chemoinformatics. Bioinformatics. 2013, 29 (8): 1092-1094. 10.1093/bioinformatics/btt105.View ArticleGoogle Scholar
- Humphrey W, Dalke A, Schulten K: VMD: visual molecular dynamics. J Mol Graph. 1996, 14 (1): 33-38. 10.1016/0263-7855(96)00018-5.View ArticleGoogle Scholar
- Morris GM, Huey R, Lindstrom W, Sanner MF, Belew RK, Goodsell DS, Olson AJ: AutoDock4 and AutoDockTools4: Automated docking with selective receptor flexibility. J Comput Chem. 2009, 30 (16): 2785-2791. 10.1002/jcc.21256.View ArticleGoogle Scholar
- Pronk S, Páll S, Schulz R, Larsson P, Bjelkmar P, Apostolov R, Shirts MR, Smith JC, Kasson PM, van der Spoel D: GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit. Bioinformatics. 2013, 29 (7): 845-854. 10.1093/bioinformatics/btt055.View ArticleGoogle Scholar
- Case D, Darden T, Cheatham T, Simmerling C, Wang J, Duke R, Luo R, Walker R, Zhang W, Merz K: AMBER 12. 2012, San Francisco: University of CaliforniaGoogle Scholar
- Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B: Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009, 25 (11): 1422-1423. 10.1093/bioinformatics/btp163.View ArticleGoogle Scholar
- Holland RC, Down TA, Pocock M, Prlić A, Huen D, James K, Foisy S, Dräger A, Yates A, Heuer M: BioJava: an open-source framework for bioinformatics. Bioinformatics. 2008, 24 (18): 2096-2097. 10.1093/bioinformatics/btn397.View ArticleGoogle Scholar
- Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H: The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002, 12 (10): 1611-1618. 10.1101/gr.361602.View ArticleGoogle Scholar
- Macke TJ, Case DA: Modeling unusual nucleic acid structures. ACS Symp Ser Am Chem Socn. 1998, 1998: 379-394.Google Scholar
- Zhang Q, Zhang W, Li Y, Wang J, Zhang L, Hou T: A rule-based algorithm for automatic bond type perception. J Cheminform. 2012, 4 (1): 1-10. 10.1186/1758-2946-4-1.View ArticleGoogle Scholar
- Callaway J, Cummings M, Deroski B, Esposito P, Forman A, Langdon P, Libeson M, McCarthy J, Sikora J, Xue D: Protein Data Bank contents guide: Atomic coordinate entry format description. Brookhaven Natl Lab. 1996Google Scholar
- Tripos L: Tripos Mol2 File Format. 2007, St. Louis, MO: TriposGoogle Scholar
- Weininger D: SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci. 1988, 28 (1): 31-36. 10.1021/ci00057a005.View ArticleGoogle Scholar
- Ewald PP: Die Berechnung optischer und elektrostatischer Gitterpotentiale. Ann Phys. 1921, 369 (3): 253-287. 10.1002/andp.19213690304.View ArticleGoogle Scholar
- Still WC, Tempczyk A, Hawley RC, Hendrickson T: Semianalytical treatment of solvation for molecular mechanics and dynamics. J Am Chem Soc. 1990, 112 (16): 6127-6129. 10.1021/ja00172a038.View ArticleGoogle Scholar
- Wang J, Wang W, Kollman PA, Case DA: Automatic atom type and bond type perception in molecular mechanical calculations. J Mol Graph Model. 2006, 25 (2): 247-260. 10.1016/j.jmgm.2005.12.005.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.