Template-based combinatorial enumeration of virtual compound libraries for lipids
© Sud et al.; licensee Chemistry Central Ltd. 2012
Received: 30 July 2012
Accepted: 20 September 2012
Published: 25 September 2012
A variety of software packages are available for the combinatorial enumeration of virtual libraries for small molecules, starting from specifications of core scaffolds with attachment points and lists of R-groups as SMILES or SD files. Although SD files include atomic coordinates for core scaffolds and R-groups, it is not possible to control the 2-dimensional (2D) layout of the enumerated structures generated for virtual compound libraries because different packages generate different 2D representations for the same structure. We have developed a software package called LipidMapsTools for the template-based combinatorial enumeration of virtual compound libraries for lipids. Virtual libraries are enumerated for the specified lipid abbreviations using matching lists of pre-defined templates and chain abbreviations, instead of core scaffolds and lists of R-groups provided by the user. 2D structures of the enumerated lipids are drawn in a specific and consistent fashion adhering to the framework for representing lipid structures proposed by the LIPID MAPS consortium. LipidMapsTools is lightweight, relatively fast and has no external dependencies. It is an open source package and freely available under the terms of the modified BSD license.
The combinatorial virtual library enumeration methodology is routinely used during the early stages of the small molecule drug discovery cycle. Virtual compound libraries containing a large number of molecules are generated and ranked based on various calculated or predicted characteristics such as physicochemical properties, activity, specificity and solubility. A set of top ranked compounds is selected and synthesized or acquired for further investigation using experimental techniques [1–7]. A variety of software packages are available for the combinatorial enumeration of virtual compound libraries. These tools fall into three broad categories: open source or freely available packages [8–12]; commercially available packages [13–21]; proprietary software packages implemented for internal use on top of custom or commercial software libraries [22–25]. Although implementation details might differ, all virtual library enumeration packages deploy a similar general strategy to generate virtual compound libraries. A core scaffold along with attachment points for R-groups is specified and lists of R-groups are provided by the user. Options to incorporate linkers between the core scaffold and R-groups are also available in some packages. The core scaffold, R-groups and linkers are specified either as SMILES [26, 27] or SD files. All possible structures are enumerated by the combinatorial attachment of R-groups to the core scaffold along with the placement of any linkers between them, and a virtual compound library is generated as a SMILES or SD file. The 2D structure representations generated for the compounds in virtual libraries are rather arbitrary. Although input SD files contain 2D atomic coordinate information for core scaffolds and R-groups, to the best of our knowledge it is not possible in any available software package to specify the exact orientation of R-groups around scaffolds for the structures enumerated for virtual libraries.
Different software packages end up generating completely different orientations of R-groups around scaffolds due to different internal strategies deployed for their optimal placement in the enumerated structures. Consequently, 2D structure layouts for the enumerated structures are not always consistent across software packages.
We have developed a software package called LipidMapsTools for the combinatorial enumeration of virtual compound libraries for lipids. Virtual libraries are enumerated for the specified lipid abbreviations using matching lists of pre-defined templates and chain abbreviations, instead of core scaffolds, linkers and lists of R-groups provided by the user as SMILES or SD files. 2D structures of the enumerated lipids are drawn in a specific fashion; their representation is consistent and adheres to the framework for representing lipid structures proposed by the LIPID MAPS consortium [29, 30]. The structure data for the enumerated virtual library is written to an SD file along with additional ontological information such as abbreviation, systematic name, category, main class and sub class. LipidMapsTools is capable of generating large virtual compound libraries for lipids with minimal input from the user.
The three lipid categories of glycerolipids (GL), glycerophospholipids (GP) and sphingolipids (SP), along with the cardiolipins (CL), a lipid class under GP, have a fixed backbone with chains and head groups attached at specific attachment points on the backbone. These characteristics make these types of lipids amenable to the template-based combinatorial enumeration of virtual compound libraries, using the pre-defined lists of most likely chains and templates containing the appropriate head groups.
The lipid abbreviation format (Figure 2) consists of the specifications for chains and head groups. The individual chain specifications are delimited by a forward slash (/). For glycerolipids, three different lipid abbreviation formats are used: MG(sn1Chain/0:0/0:0), DG(sn1Chain/sn2Chain/0:0) and TG(sn1Chain/sn2Chain/sn3Chain). MG, DG and TG refer to monoradylglycerols, diradylglycerols and triradylglycerols. Representative examples of the lipid abbreviation format for GL are: MG(16:0/0:0/0:0), DG(18:1(9Z)/16:0/0:0) and TG(16:0/16:0/18:1(11E)). These abbreviations correspond to 1-hexadecanoyl-sn-glycerol, 1-(9Z-octadecenoyl)-2-hexadecanoyl-sn-glycerol and 1,2-dihexadecanoyl-3-(11E-octadecenoyl)-sn-glycerol respectively.
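The abbreviation format described above can be split mechanically into a head group and its chains. The following Python sketch is only an illustration and is not taken from LipidMapsTools, which is written in Perl; the names `Chain`, `split_chains`, `parse_chain` and `parse_abbreviation` are hypothetical. It assumes chains of the form length:doubleBonds with an optional parenthesized geometry list, and it does not handle substituents or wild cards.

```python
import re
from typing import List, NamedTuple, Tuple


class Chain(NamedTuple):
    length: int                 # number of carbon atoms
    double_bonds: int           # number of double bonds
    geometry: Tuple[str, ...]   # e.g. ("9Z", "12Z"); empty if not given


# length:double_bonds, optionally followed by one "(...)" geometry group
_CHAIN_RE = re.compile(r"^(\d+):(\d+)(?:\(([^()]*)\))?$")


def split_chains(body: str) -> List[str]:
    """Split on '/' only outside parentheses."""
    parts, depth, current = [], 0, []
    for ch in body:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "/" and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_chain(text: str) -> Chain:
    match = _CHAIN_RE.match(text)
    if match is None:
        raise ValueError(f"unsupported chain abbreviation: {text!r}")
    length, bonds = int(match.group(1)), int(match.group(2))
    geometry = tuple(match.group(3).split(",")) if match.group(3) else ()
    # Assumption of this sketch: geometry is either omitted or given
    # once per double bond.
    if geometry and len(geometry) != bonds:
        raise ValueError(f"geometry does not match double bond count: {text!r}")
    return Chain(length, bonds, geometry)


def parse_abbreviation(abbrev: str) -> Tuple[str, List[Chain]]:
    """Return (head group or class prefix, list of parsed chains)."""
    head, _, rest = abbrev.partition("(")
    if not head or not rest.endswith(")"):
        raise ValueError(f"unsupported lipid abbreviation: {abbrev!r}")
    return head, [parse_chain(c) for c in split_chains(rest[:-1])]


head, chains = parse_abbreviation("DG(18:1(9Z)/16:0/0:0)")
```

For the diradylglycerol example from the text, `head` is the class prefix and `chains` holds one entry per sn position, with 0:0 marking the unoccupied sn3 position.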
The glycerophospholipid abbreviation format consists of the specifications for a head group and two sn chains: Headgroup(sn1Chain/sn2Chain). The cardiolipin abbreviation format is similar to that of the glycerophospholipids, with additional specifications for an extra set of sn chains: CL(1'-[sn1Chain/sn2Chain],3'-[sn1Chain/sn2Chain]). It has two sets of sn1 and sn2 chains at two different glycerol backbones attached to two phospho groups that are further connected to another glycerol backbone at the 1' and 3' positions; no head group is specified. Representative examples of lipid abbreviations for GP are: PC(16:0/20:4(5Z,8Z,11Z,14Z)) and PE(16:0/18:1(9Z)). These abbreviations correspond to 1-hexadecanoyl-2-(5Z,8Z,11Z,14Z-eicosatetraenoyl)-sn-glycero-3-phosphocholine and 1-hexadecanoyl-2-(9Z-octadecenoyl)-sn-glycero-3-phosphoethanolamine. The cardiolipin abbreviation, CL(1'-[16:0/18:1(11Z)],3'-[16:0/18:1(11Z)]), corresponds to 1',3'-Bis-[1-hexadecanoyl-2-(11Z-octadecenoyl)-sn-glycero-3-phospho]-sn-glycerol.
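Because the cardiolipin format nests two bracketed chain pairs, it needs slightly different handling from the other abbreviations. The sketch below is a hypothetical Python illustration, not the package's own parser; `parse_cardiolipin` is an invented name, chains are kept as plain strings, and chain abbreviations that themselves contain square brackets (for example stereo-labelled substituents) are deliberately not supported.

```python
import re

# CL(1'-[sn1/sn2],3'-[sn1/sn2]) with no brackets inside the chain pairs
_CL_RE = re.compile(r"^CL\(1'-\[([^\[\]]+)\],3'-\[([^\[\]]+)\]\)$")


def parse_cardiolipin(abbrev: str) -> dict:
    """Map the 1' and 3' positions to their sn1 and sn2 chain strings."""
    match = _CL_RE.match(abbrev)
    if match is None:
        raise ValueError(f"unsupported cardiolipin abbreviation: {abbrev!r}")
    result = {}
    for position, pair in (("1'", match.group(1)), ("3'", match.group(2))):
        chains = pair.split("/")
        if len(chains) != 2:
            raise ValueError(f"expected sn1/sn2 at position {position}: {pair!r}")
        result[position] = {"sn1": chains[0], "sn2": chains[1]}
    return result


parsed = parse_cardiolipin("CL(1'-[16:0/18:1(11Z)],3'-[16:0/18:1(11Z)])")
```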
The sphingolipid abbreviation format includes the specifications of a long chain base and an N-acyl chain on the ceramide backbone along with the specification of a head group: Headgroup(LongChainBase/NAcylChain). One of the three letters d, t or m must precede the chain length specifier of the long chain base; the format of the rest of the long chain base and N-acyl chain abbreviation is similar to the format of chain abbreviations for other lipid categories. The letters t and m are used to represent 4R-hydroxy and 3-keto groups at positions 4 and 3 respectively in the long chain base. Representative examples of the lipid abbreviation format for SP are: Cer(d18:1(4E)/14:0), Cer(t18:0/18:2(9Z,12Z)) and Cer(m14:0/16:1(9Z)). These abbreviations correspond to N-(tetradecanoyl)-sphing-4-enine, N-(9Z,12Z-octadecadienoyl)-4R-hydroxy-sphinganine and N-(9Z-hexadecenoyl)-3-keto-tetradecasphinganine.
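The extra prefix letter on the long chain base is the main difference from the other categories. As a minimal sketch, assuming only the format rules stated above, the following hypothetical Python functions (`parse_long_chain_base` and `parse_sphingolipid` are invented names, not part of the package) separate the prefix from the chain specification and leave the N-acyl chain as a string.

```python
import re

# Prefix letter, then length:double_bonds and an optional geometry group
_LCB_RE = re.compile(r"^([dtm])(\d+):(\d+)(?:\(([^()]*)\))?$")

# Meanings of t and m are taken from the text; d is left undescribed here.
LCB_PREFIX_NOTE = {
    "d": "",
    "t": "4R-hydroxy group at position 4",
    "m": "3-keto group at position 3",
}


def parse_long_chain_base(text: str) -> dict:
    match = _LCB_RE.match(text)
    if match is None:
        raise ValueError(f"unsupported long chain base: {text!r}")
    prefix, length, bonds, geometry = match.groups()
    return {
        "prefix": prefix,
        "note": LCB_PREFIX_NOTE[prefix],
        "length": int(length),
        "double_bonds": int(bonds),
        "geometry": geometry.split(",") if geometry else [],
    }


def parse_sphingolipid(abbrev: str) -> dict:
    """Split Headgroup(LongChainBase/NAcylChain) into its three parts."""
    head, _, rest = abbrev.partition("(")
    if not head or not rest.endswith(")"):
        raise ValueError(f"unsupported sphingolipid abbreviation: {abbrev!r}")
    # The long chain base ends at the first '/', since its geometry
    # group contains no slash.
    lcb_text, sep, n_acyl = rest[:-1].partition("/")
    if not sep:
        raise ValueError(f"missing N-acyl chain in {abbrev!r}")
    return {
        "head_group": head,
        "long_chain_base": parse_long_chain_base(lcb_text),
        "n_acyl": n_acyl,
    }


cer = parse_sphingolipid("Cer(d18:1(4E)/14:0)")
```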
Representative examples of chains available for sn positions during the combinatorial enumeration of virtual compound libraries
Representative examples of the commands for generating virtual compound libraries for different lipid categories
Enumerate all possible glycerolipid structures
Enumerate all possible glycerolipid structures containing 2 double bonds in sn1 chains
Enumerate all possible diradylglycerol structures containing 2 double bonds in sn1 chains and a specific double bond in sn2 chain
Enumerate all possible glycerophospholipid structures
GPStrGen.pl "*(*- > 10 < 20:*/*+ > 16 < 24:*)"
Enumerate all possible glycerophospholipid structures containing sn1 chains with odd chain length > 10 and < 20, and sn2 chains with even chain length > 16 and < 24
Enumerate all possible glycerophospholipid structures containing phosphocholine (PC) headgroup and a specific sn1 chain
Enumerate all possible cardiolipin structures
Enumerate all possible cardiolipin structures containing a specific sn1 chain at 1' position
Enumerate all possible sphingolipid structures
Enumerate all possible sphingolipid structures without any N-acyl
Enumerate all possible sphingolipid structures containing odd long chain base lengths
SPStrGen.pl "*(*+ > 18:*/0:0)"
Enumerate all possible sphingolipid structures containing long chain bases with even chain length > 18 and no N-acyl chain
In addition to the complete enumeration of all possible structures for various lipid categories using wild card characters for chain lengths, number of double bonds along with their geometries, and head groups, the LipidMapsTools software package allows the generation of subsets of these virtual libraries corresponding to specific chain lengths, number of double bonds with specific double bond geometry, and head groups. For example, the lipid abbreviation "CL(1'-[* > 17 < 21:*/* > 17 < 21:*],3'-[* > 17 < 21:*/* > 17 < 21:*])" generates a subset of the virtual compound library for the cardiolipins containing sn chain lengths greater than 17 and less than 21.
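The chain length part of a wild card specification can be read as a filter over candidate chain lengths. The following Python sketch illustrates that reading; it is not the package's implementation. The meaning of `*+` (even lengths) and `*-` (odd lengths) is inferred from the command descriptions in the text, and both the function name `matching_lengths` and the candidate range passed to it are hypothetical.

```python
import re

# "*" with optional parity marker and optional "> N" / "< M" bounds
_SPEC_RE = re.compile(r"^\*([+-]?)\s*(?:>\s*(\d+))?\s*(?:<\s*(\d+))?$")


def matching_lengths(spec, candidates):
    """Return the candidate chain lengths selected by a wild card spec."""
    match = _SPEC_RE.match(spec.strip())
    if match is None:
        raise ValueError(f"unsupported chain length specification: {spec!r}")
    parity, lower, upper = match.groups()
    selected = []
    for n in candidates:
        if parity == "+" and n % 2 != 0:
            continue  # "+" keeps even lengths only
        if parity == "-" and n % 2 == 0:
            continue  # "-" keeps odd lengths only
        if lower is not None and n <= int(lower):
            continue  # strict lower bound
        if upper is not None and n >= int(upper):
            continue  # strict upper bound
        selected.append(n)
    return selected


# Hypothetical stand-in for a pre-defined list of chain lengths
candidates = range(2, 40)
cl_subset = matching_lengths("* > 17 < 21", candidates)
```

Under these assumptions, the cardiolipin example keeps only chain lengths strictly between the two bounds, and a bare `*` keeps every candidate.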
The templates for lipid categories are stored in each category-specific module as MDLMOL strings corresponding to template structure data, along with a mapping of each template ID to additional information such as atom numbers and atomic coordinates for attachment points, number of chain carbons in the template, head group name, lipid category, main class and sub class. No external MDLMOL data files are needed.
LipidMapsTools provides command line scripts for the combinatorial enumeration of virtual compound libraries for lipids from the specified lipid abbreviations. Virtual compound libraries are generated by the combinatorial enumeration of most likely chains around the specific templates, with chain lengths varying from 2 to 39 and containing a specific number of double bonds with their geometry. Some radyl chains corresponding to alkyl and 1Z-alkenyl chains instead of acyl chains are skipped from sn2 and sn3 positions during the combinatorial enumeration, wherever they are not permitted by the LIPID MAPS classification scheme for lipids. For example, the LIPID MAPS classification scheme does not contain any sub classes for alkyl and 1Z-alkenyl chains at sn2 and sn3 positions for the glycerolipids, and the structures corresponding to these chains at sn2 and sn3 positions are not generated.
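The enumeration step described above amounts to a Cartesian product over the chains allowed at each position. The Python sketch below shows that idea for triradylglycerols only; it is an illustration under stated assumptions rather than the package's algorithm. The tiny chain list is hypothetical, and marking alkyl and 1Z-alkenyl chains with the prefixes `O-` and `P-` is an assumption of this sketch about notation.

```python
from itertools import product

# Hypothetical, deliberately tiny chain list
CHAINS = ["16:0", "18:1(9Z)", "O-16:0", "P-18:0"]


def is_ether_chain(chain: str) -> bool:
    """True for chains marked as alkyl (O-) or 1Z-alkenyl (P-)."""
    return chain.startswith(("O-", "P-"))


def enumerate_tg(chains):
    """Enumerate TG abbreviations, allowing ether chains at sn1 only."""
    acyl_only = [c for c in chains if not is_ether_chain(c)]
    return [
        f"TG({sn1}/{sn2}/{sn3})"
        for sn1, sn2, sn3 in product(chains, acyl_only, acyl_only)
    ]


library = enumerate_tg(CHAINS)
```

With four chains at sn1 and two permitted chains at each of sn2 and sn3, the sketch yields 4 × 2 × 2 abbreviations, which shows how quickly library size grows with the length of the chain lists.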
Virtual compound libraries containing all possible structures for GL, GP, CL and SP are generated using the commands GLStrGen.pl "*(*/*/*)", GPStrGen.pl "*(*/*)", CLStrGen.pl "CL(1'-[*/*],3'-[*/*])" and SPStrGen.pl "*(*/*)" respectively. These command line scripts generate SD files containing the 2D structure data for all the enumerated structures along with additional information such as abbreviation, systematic name, chain length and double bond count, lipid category, main class and sub class. Subsets of the complete virtual libraries containing specific chain lengths and head groups are generated by specifying them explicitly in the lipid abbreviations. For example, the command GPStrGen.pl "PC(*:*/*:*)" generates a subset of the GP virtual library containing all possible structures with the phosphocholine (PC) head group. An SP virtual library containing structures with the sphingomyelin (SM) head group and with long chain bases and N-acyl chains of length greater than 15 and less than 21 is generated by the following command: SPStrGen.pl "SM(* > 15 < 21:*/* > 15 < 21:*)". The complete lists of available head groups for GP and SP are shown in Additional file 1: Table S7 and Table S8.
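Since the additional information is written as data items in the SD file, it can be read back without a cheminformatics toolkit. The following Python sketch relies only on the general SD file layout (data headers of the form `> <Name>`, values terminated by a blank line, records terminated by `$$$$`). The function name `read_sd_data_fields` and the field names `Abbrev` and `Category` in the sample record are hypothetical and may differ from those written by the scripts.

```python
def read_sd_data_fields(sd_text: str) -> list:
    """Return one dict of data fields per record in an SD file string."""
    records = []
    for block in sd_text.split("$$$$"):
        if not block.strip():
            continue
        fields, name, values = {}, None, []
        for line in block.splitlines():
            if line.startswith(">"):
                # Data header such as "> <Abbrev>"
                start, end = line.find("<"), line.rfind(">")
                name = line[start + 1:end] if 0 <= start < end else None
                values = []
            elif name is not None:
                if line.strip() == "":
                    fields[name] = "\n".join(values)  # blank line ends the item
                    name = None
                else:
                    values.append(line)
        if name is not None:
            fields[name] = "\n".join(values)
        records.append(fields)
    return records


# Hypothetical record with an empty structure block
sample = """PC(16:0/18:1(9Z))
  hypothetical

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <Abbrev>
PC(16:0/18:1(9Z))

> <Category>
Glycerophospholipids

$$$$
"""
records = read_sd_data_fields(sample)
```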
Command line format along with some relevant statistics for enumerating virtual compound libraries of lipids corresponding to various lipid categories
1 h 29 m 10 s
16 m 33 s
190 h 18 m 36 s
CLStrGen.pl "CL(1'-[* > 17 < 21:*/* > 17 < 21:*],3'-[* > 17 < 21:*/* > 17 < 21:*])"
7 h 28 m 12 s
1 m 32 s
The current focus of LipidMapsTools for the combinatorial enumeration of virtual compound libraries is on mammalian lipids. The pre-defined lists of radyl chains and long chain bases contain the specifications for chain lengths with degrees of unsaturation that are most likely to occur in mammalian lipids. Although these pre-defined lists are quite comprehensive, it is impossible to cover all scenarios, not only in terms of novel mammalian lipids but also in terms of different types of radyl chains or long chain bases that may be present in non-mammalian species such as plants, insects, bacteria, fungi and marine organisms. LipidMapsTools is designed to allow the addition of new radyl chains and long chain bases in a relatively straightforward manner. The core Perl modules ChainAbbrev.pm and SPChainAbbrev.pm contain the pre-defined lists of radyl chains and long chain bases respectively. After the pre-defined lists in the appropriate modules have been updated, the newly added radyl chains or long chain bases are available for the enumeration of virtual compound libraries through the command line scripts.
In addition to the combinatorial enumeration of virtual compound libraries for lipids from the specified abbreviations containing wild card characters, LipidMapsTools is capable of generating specific lipid structures from specific lipid abbreviations. For example, the command GPStrGen.pl "PC(16:0/20:4(5Z,8Z,10E,14Z)(12OH[S]))" generates an SD file containing structure and ontological data for one glycerophospholipid corresponding to 1-hexadecanoyl-2-(12S-hydroxy-5Z,8Z,10E,14Z-eicosatetraenoyl)-sn-glycero-3-phosphocholine.
LipidMapsTools also provides the capability to generate individual lipid structures containing arbitrary specifications for radyl chains or long chain bases which are not present in the pre-defined lists available in the package, without requiring any customization by the user. This functionality is available in the scripts provided with the LipidMapsTools package through a command line option. For lipid abbreviations containing arbitrary specifications, the structure generation methodology used in the LipidMapsTools package skips the step that confirms the presence of the specified radyl chains or long chain bases in the pre-defined lists of most likely chain lengths, and proceeds to generate the structure as long as the format of the specified abbreviation is valid. For example, the command SPStrGen.pl --ChainAbbrevMode Arbitrary "SM(d30:4(4E,8E,12E,16E)/34:4(5Z,8Z,11Z,14Z)(16OH[R]))" parses the arbitrary specifications for the long chain base and N-acyl chain not present in the pre-defined lists available in the package, generates the appropriate sphingomyelin structure and writes it out to an SD file. This functionality facilitates the generation of individual structures for both mammalian and non-mammalian species containing radyl chains or long chain bases currently not present in the LipidMapsTools package. The capability to generate individual structures from a specific lipid abbreviation is quite useful for on-the-fly structure generation for populating databases and for online structure display.
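The difference between the two behaviours described above is only whether a membership check follows the format check. The Python sketch below illustrates that control flow; it is not the package's code. The function name `accept_chain`, the three-entry chain set and the mode name `MostLikely` are assumptions of this sketch; only the mode name `Arbitrary` appears in the text.

```python
import re

# length:double_bonds followed by any number of "(...)" groups,
# e.g. a geometry list and a substituent such as (16OH[R])
_CHAIN_FORMAT = re.compile(r"^\d+:\d+(\([^()]*\))*$")

# Hypothetical stand-in for a pre-defined list of most likely chains
MOST_LIKELY_CHAINS = {"16:0", "18:0", "18:1(9Z)"}


def accept_chain(chain: str, mode: str = "MostLikely") -> bool:
    """Decide whether a chain abbreviation is accepted in the given mode."""
    if _CHAIN_FORMAT.match(chain) is None:
        return False  # format must be valid in every mode
    if mode == "Arbitrary":
        return True   # skip the pre-defined list lookup
    if mode == "MostLikely":
        return chain in MOST_LIKELY_CHAINS
    raise ValueError(f"unknown mode: {mode!r}")


n_acyl = "34:4(5Z,8Z,11Z,14Z)(16OH[R])"
default_result = accept_chain(n_acyl)
arbitrary_result = accept_chain(n_acyl, mode="Arbitrary")
```

In this sketch the N-acyl chain from the sphingomyelin example is rejected by the list lookup but accepted once the lookup is skipped, while a malformed string is rejected in both modes.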
The LipidMapsTools software package has been developed for the template-based combinatorial enumeration of virtual compound libraries for lipids. A set of command line scripts is provided to enumerate all possible structures corresponding to the specified lipid abbreviations without any additional input requirements from the user. It is relatively straightforward to generate subsets of complete virtual libraries by explicit specifications of chains and head groups in the lipid abbreviations. 2D structures of the enumerated lipids are drawn in a specific fashion; their representation is consistent and adheres to the framework for representing lipid structures proposed by the LIPID MAPS consortium. The customization and enhancement of existing functionality, along with the development of new functionality, is facilitated by the modular nature of the software architecture. LipidMapsTools is under continuous development and we anticipate the addition of new templates along with radyl chains and long chain bases for both mammalian and non-mammalian lipid species in future versions of the package.
· Project name: lipidmapstools
· Project home page: http://www.lipidmaps.org/downloads/
· Operating system(s): Platform independent
· Programming language: Perl
· Other requirements: None
· License: Modified BSD License
· Any restrictions to use by non-academics: None
Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number U54 GM069338. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. We thank the reviewers for their suggestions to improve the quality of the manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.