- Research article
- Open Access
PubChem chemical structure standardization
© The Author(s) 2018
- Received: 18 April 2018
- Accepted: 1 August 2018
- Published: 10 August 2018
PubChem is a chemical information repository, consisting of three primary databases: Substance, Compound, and BioAssay. When individual data contributors submit chemical substance descriptions to Substance, the unique chemical structures are extracted and stored into Compound through an automated process called structure standardization. The present study describes the PubChem standardization approaches and analyzes them for their success rates, reasons that cause structures to be rejected, and modifications applied to structures during the standardization process. Furthermore, the PubChem standardization is compared to the structure normalization of the IUPAC International Chemical Identifier (InChI) software, as manifested by conversion of the InChI back into a chemical structure.
The observed rejection rate for substances processed by PubChem standardization was 0.36%, which is predominantly attributed to structures with invalid atom valences that cannot be readily corrected without additional information from contributors. Of all structures that pass standardization, 44% are modified in the process, reducing the count of unique structures from 53,574,724 in substance to 45,808,881 in compound as identified by de-aromatized canonical isomeric SMILES. Even though the processing time is very low on average (only 0.4% of structures have individual standardization time above 0.1 s), total standardization time is completely dominated by edge cases: 90% of the time to standardize all structures in PubChem substance is spent on the 2.05% of structures with the highest individual standardization time. It is worth noting that 60% of the structures obtained from PubChem structure standardization are not identical to the chemical structure resulting from the InChI (primarily due to preferences for a different tautomeric form).
Several machine-readable molecule representations have been developed. Among the most popular are line notations [9–17], systematic IUPAC names [18–20], connection table files, and reaction data files [21–24]. The level of detail in these representations varies, especially with respect to the specification of hydrogen atoms and the configuration of stereocenters. Conversion between different structure representations is prone to information loss and errors [25, 26]. The perception of structures from three-dimensional (3-D) atom coordinates is an additional source of structural errors [27–30]. Erroneous (interpretation of) structures are a major problem, as it has been shown that even small errors in structure representations can lead to significant loss of predictive ability of computer models, affecting downstream computation in cheminformatics.
Conversely, ‘aromatic’ moieties in structures can be represented in Kekulé form using alternating single and double bonds [63, 64]. Several algorithms for the enumeration of Kekulé structures of conjugated systems have been reported in the literature [65–70]. Kekulé forms of a molecule (as opposed to the aromatic representation) may be necessary when computing descriptors or properties about a chemical structure or to remove ambiguity in aromaticity interpretation. Yet, methods attempting to generate a single representative Kekulé form (a process referred to as ‘kekulization’) are either heuristics (i.e., may not find a Kekulé representation even though it exists) or remain arbitrary (i.e., non-canonical) in the resulting structure [71, 72]. To the best of our knowledge, no method has been described that is dedicated to the generation of a representative canonical Kekulé form. This issue compounds the lack of a standard definition of aromaticity, because aromaticity is typically perceived from a Kekulé structure. On the other hand, given a structure with ‘aromatic’ (instead of single and double) bonds, the underlying (canonical) Kekulé structure is not obvious. Consequently, kekulization approaches should be able to deal with the various existing aromaticity definitions and compensate for their intrinsic differences, without generating cases where conjugation is broken (e.g., *–C=C=C–* or *=C–C–C=* instead of *–C=C–C=*) or where a different count of double bonds occurs due to differences in handling exo-cyclic heteroatoms. Lastly, aromaticity approaches should be coupled closely with tautomer handling approaches, as choice of tautomeric form may directly affect aromaticity, depending on the aromaticity model employed.
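To make the distinction between heuristic and exhaustive kekulization concrete, the following minimal sketch treats kekulization as a perfect-matching search with backtracking: every atom that needs a double bond must receive exactly one. It is an illustration only and not the PubChem algorithm. The function name `kekulize` and the input convention are assumptions; in particular, the caller is assumed to have already excluded atoms that do not take a double bond (e.g., pyrrole-type nitrogen), and the fixed visiting order makes the result deterministic for a given atom numbering, not canonical.

```python
from typing import Dict, List, Optional, Set, Tuple


def kekulize(aromatic_atoms: List[int],
             aromatic_bonds: List[Tuple[int, int]]) -> Optional[Set[Tuple[int, int]]]:
    """Return the bonds to draw as double bonds, or None if no Kekulé form exists.

    Every atom in aromatic_atoms must end up in exactly one double bond,
    i.e. we search for a perfect matching on the aromatic subgraph.
    """
    neighbors: Dict[int, List[int]] = {a: [] for a in aromatic_atoms}
    for a, b in aromatic_bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    for a in neighbors:
        neighbors[a].sort()  # fixed visiting order gives a deterministic result

    order = sorted(aromatic_atoms)
    matched: Dict[int, int] = {}

    def solve(i: int) -> bool:
        # Skip atoms that already received a double bond.
        while i < len(order) and order[i] in matched:
            i += 1
        if i == len(order):
            return True
        atom = order[i]
        for nbr in neighbors[atom]:
            if nbr not in matched:
                matched[atom] = nbr
                matched[nbr] = atom
                if solve(i + 1):
                    return True
                # Undo the choice and try the next neighbor (backtracking).
                del matched[atom]
                del matched[nbr]
        return False

    if not solve(0):
        return None
    return {(a, b) for a, b in matched.items() if a < b}


# Six-membered ring (benzene-like) and five-membered ring in which all atoms
# would need a double bond (no Kekulé form exists for an odd atom count).
ring6 = [(i, (i + 1) % 6) for i in range(6)]
ring5 = [(i, (i + 1) % 5) for i in range(5)]
benzene_double_bonds = kekulize(list(range(6)), ring6)
odd_ring_result = kekulize(list(range(5)), ring5)
```

Because the search backtracks, it finds a Kekulé form whenever one exists, at the cost of exponential worst-case time; a greedy single pass would be faster but could fail on structures that do have a valid form.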
Chemical structure standardization is of utmost importance to compensate for the diverse (and potentially ambiguous) nature of chemical structure representation and interpretation, while identifying and correcting (or rejecting) erroneous structures, to ensure proper interpretation of chemical content by a given data system. Yet, guidelines or performance measures for this purpose remain scarce [53, 73, 74]. With increasing size and popularity of public chemical information resources this issue becomes even more important as the ready ability to download, normalize, and share millions of chemical structures increases the potential for rapid and broad spread of errors [75–77]. Once erroneous structures are shared, errors in these copies may not be easily recognized or corrected, especially if the chemical structure is deemed valid and the original data content provenance is lost. This is not a minor problem, as the percentage of affected erroneous structures has been estimated to be between 0.1 and 8% [31, 78–80].
Verify element, which evaluates the validity of specified element and isotopic information.
Verify hydrogen, which performs adjustments to implicit hydrogen counts, as necessary.
Verify functional groups, which puts diverse functional group representations into a preferred form.
Verify valence, which evaluates connectivity and charge information per atom using a dictionary of allowed valences.
Standardize annotations, which removes perceived PubChem-specific bond type annotations.
Standardize valence bond form, which generates a canonical tautomer representation of the structure.
Standardize aromaticity, which determines a canonical Kekulé structure.
Standardize stereochemistry, which evaluates available information about stereocenters and attempts a canonical configuration.
Standardize explicit hydrogens, which converts implicit hydrogen counts to explicit hydrogen atoms in the molecular graph.
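The sequential nature of these steps, where a structure is either passed on (possibly modified) or rejected at a specific step, can be sketched as a simple pipeline. This is a toy illustration under stated assumptions: the `Rejected` exception, the dictionary-based structure, the two stand-in step functions, and the small `ALLOWED_VALENCES` table are all hypothetical and do not reflect the PubChem implementation or its valence dictionary.

```python
from typing import Callable, Dict, List, Optional, Tuple


class Rejected(Exception):
    """Raised by a step when a structure cannot be processed further."""


Structure = Dict[str, list]
Step = Tuple[str, Callable[[Structure], Structure]]

# Toy valence dictionary (hypothetical): atomic number -> allowed valences.
ALLOWED_VALENCES = {1: {1}, 6: {4}, 7: {3}, 8: {2}}


def verify_element(s: Structure) -> Structure:
    # Each atom is an (atomic_number, valence) pair in this toy model.
    if any(z < 1 or z > 118 for z, _ in s["atoms"]):
        raise Rejected("invalid atomic number")
    return s


def verify_valence(s: Structure) -> Structure:
    for z, valence in s["atoms"]:
        if z in ALLOWED_VALENCES and valence not in ALLOWED_VALENCES[z]:
            raise Rejected("valence not in dictionary")
    return s


STEPS: List[Step] = [
    ("Verify element", verify_element),
    ("Verify valence", verify_valence),
]


def run_pipeline(structure: Structure,
                 steps: List[Step]) -> Tuple[Optional[Structure], Optional[str]]:
    """Apply steps in order; return (result, None) or (None, rejecting step name)."""
    current = dict(structure)
    for name, func in steps:
        try:
            current = func(current)
        except Rejected:
            return None, name
    return current, None


water = {"atoms": [(8, 2), (1, 1), (1, 1)]}
bad_element = {"atoms": [(0, 0)]}
pentavalent_carbon = {"atoms": [(6, 5)]}
```

Recording the name of the rejecting step is what makes per-step rejection statistics possible.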
Standardization success rates
Table: Standardization rejection rates (number of substances rejected per standardization step, with rows including Verify functional groups, Standardize valence bond, and Standardize explicit hydrogens, followed by the number of successfully standardized substances).
A total of 10,243 substances are rejected during the determination of a canonical tautomer in the Standardize Valence Bond step. The reasons for this can be very simple, as shown in Fig. 9c for SID 235635. A final structure sanity check tests for identical charge types on adjacent atoms and rejects structures that test positive. One can look at these as edge cases, whereby the structural representation becomes corrupted in some way. With such diverse structural content, while such cases are potentially fixable (e.g., by means of adjusting hydrogen count or removal of a charge), it usually is a sign of some other molecule corruption or oddity that should be rejected for later manual inspection.
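The final sanity check described here, testing for charges of the same type on adjacent atoms, can be illustrated with a few lines of code. This is a sketch only; the function name and the list-based inputs are assumptions, and "identical charge types" is interpreted as formal charges of the same sign.

```python
from typing import List, Tuple


def has_adjacent_like_charges(charges: List[int],
                              bonds: List[Tuple[int, int]]) -> bool:
    """Return True if any bond connects two atoms whose charges share a sign."""
    for a, b in bonds:
        # The product is positive only for (+, +) or (-, -) pairs.
        if charges[a] * charges[b] > 0:
            return True
    return False
```

A zwitterionic pair such as (+1, −1) passes this check, whereas two bonded cationic or two bonded anionic atoms would lead to rejection.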
During the conversion of implicit hydrogen atom counts to explicit hydrogen atoms (in the Standardize Explicit Hydrogens step), 486 substances are rejected. In most cases the affected structures are oligonucleotides. The addition of explicit hydrogen atoms to the molecule can result in those structures exceeding the current PubChem atom/bond limit of 999 (while not a technical limit, it is a ‘line in the sand’ defining a ‘small molecule’ project scope that may be changed in the future given the increasing number of therapeutic, chemically-modified biopolymers). This restriction mimics the limits of the MDL V2000 MOL file format for chemical structures. Exemplary substances are SID 596521 (a hammerhead ribozyme) and SID 596662 (Ampligen with Amphotericin B).
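The effect of hydrogen expansion on the size limit can be sketched as follows: each implicit hydrogen that becomes explicit adds one atom and one bond. The function name and arguments are hypothetical, and applying the limit of 999 to atoms and bonds separately is an assumption based on the "atom/bond limit" wording above.

```python
from typing import List

MAX_ATOMS_OR_BONDS = 999  # limit quoted in the text


def exceeds_limit_after_expansion(n_atoms: int,
                                  n_bonds: int,
                                  implicit_h_counts: List[int]) -> bool:
    """Return True if making all implicit hydrogens explicit breaks the limit."""
    added = sum(implicit_h_counts)
    # Every new hydrogen atom comes with exactly one new single bond.
    return (n_atoms + added > MAX_ATOMS_OR_BONDS
            or n_bonds + added > MAX_ATOMS_OR_BONDS)
```

This illustrates why large, hydrogen-rich structures such as oligonucleotides can pass all earlier steps and still be rejected at the very end.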
A structure failing standardization is not necessarily a shortcoming of the standardization approach. In most cases, the rejection of a chemical structure indicates that it does not comply with known/common chemical configurations. Without additional information indicating the original intent of the scientist, the chemical substances cannot be readily normalized and, consequently, are not mapped to a compound. Conflicting or ambiguous chemical structure drawing conventions add a barrier to the creation of normalization rules, as what may correct in one case may corrupt in another.
We monitored structure modifications during standardization by comparing de-aromatized canonical isomeric SMILES generated before and after each standardization step, as described in the “Standardization modification tracking” subsection of the “Methods” section. We did not include data obtained from structures that eventually were rejected during standardization.
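The tracking scheme can be sketched as comparing a structure key before and after every step. In the sketch below, `key` stands in for the generation of a de-aromatized canonical isomeric SMILES, which would require a cheminformatics toolkit; the function name, the step tuples, and the placeholder string "structures" are assumptions made purely for illustration.

```python
from typing import Callable, List, Tuple, TypeVar

S = TypeVar("S")


def track_modifications(structure: S,
                        steps: List[Tuple[str, Callable[[S], S]]],
                        key: Callable[[S], str]) -> Tuple[S, List[str]]:
    """Run all steps and return the final structure plus the names of the
    steps whose output key differs from their input key."""
    modified: List[str] = []
    before = key(structure)
    for name, func in steps:
        structure = func(structure)
        after = key(structure)
        if after != before:
            modified.append(name)
        before = after
    return structure, modified


# Placeholder "structures" are plain strings; the steps are toy functions.
toy_steps = [
    ("step A", lambda s: s),                    # leaves the structure as is
    ("step B", lambda s: s.replace("x", "y")),  # changes it
    ("step C", lambda s: s),
]
final, changed_in = track_modifications("axb", toy_steps, key=str)
exclusively_modified = changed_in[0] if len(changed_in) == 1 else None
```

A substance counts as "exclusively modified" by a step when that step is the only entry in the returned list.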
Table: Standardization modification rates, reporting exclusively modified substances per standardization step (rows include Verify functional groups, Standardize valence bond, and Standardize explicit hydrogens) for four column groups of 102,177,263; 52,082; 2,064,089; and 104,293,434 successfully standardized substances.
In the Verify Hydrogens step, the (implicit) hydrogen atom counts in 297,283 substances (0.3% of successfully standardized substances and 0.6% of modified substances during standardization) were adjusted to obtain chemically-valid structures. No inorganic substance was modified in this step.
The Standardize Valence Bond step performs the identification of a canonical tautomer. Consequently, the resonance form may be altered in this step and then again in a later, separate canonicalization. In addition, this step can change bond orders as well as alter hydrogen counts and formal charges. A total of 37,722,187 substances were affected by this step (36.2% of standardized substances, 81.3% of modified substances). The remaining 63.8% of standardized substances were not altered in this step, meaning that they either did not exhibit tautomerism or were already in the preferred tautomeric form selected by the PubChem standardization procedure. Therefore, the detected change in 36% of substances may be considered a “lower bound” for the fraction of chemical structures that show some form of tautomerism. This is noteworthy as it is greater than the results obtained in some earlier studies (0.5%, 26%, 30%).
To get a more accurate estimate for the fraction of structures subject to tautomerism, a more detailed analysis was performed by keeping track of the number of tautomers that were generated for every covalently-connected component in every substance (there can be multiple components per substance; only components with two or more non-hydrogen atoms were considered, as smaller components skip this standardization step). Of the 104,293,434 standardized substances, 66,053,812 contained at least one component for which more than one tautomer was generated and evaluated during the valence bond canonicalization step. This means that 63.3% of Substance records show some form of tautomerism, but this number does not consider the redundancy in the Substance database. When multiple substances with the same fully-standardized structure (identified by comparing their de-aromatized canonical isomeric SMILES) are counted only once, 28,417,846 of 45,808,881 unique standardization results (62%) generated more than one tautomer during standardization. This result is comparable to that of the study by Sitzmann et al., who estimated that more than 67% of chemical structures are affected by tautomerism.
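The difference between the record-level and the unique-structure-level fraction can be made explicit with a small sketch. The function name and the `(key, n_tautomers)` record layout are assumptions; the key plays the role of the de-aromatized canonical isomeric SMILES of the standardized structure. The two percentages at the end are recomputed from the counts reported above.

```python
from typing import Dict, Iterable, Tuple


def tautomerism_fractions(records: Iterable[Tuple[str, int]]) -> Tuple[float, float]:
    """Return (fraction of records, fraction of unique keys) for which more
    than one tautomer was generated."""
    records = list(records)
    per_record = sum(1 for _, n in records if n > 1) / len(records)
    unique: Dict[str, bool] = {}
    for key, n in records:
        # A unique structure counts once, however many records map to it.
        unique[key] = unique.get(key, False) or n > 1
    per_unique = sum(unique.values()) / len(unique)
    return per_record, per_unique


# Toy data: three redundant records of structure "A", one record of "B".
toy = [("A", 3), ("A", 3), ("A", 3), ("B", 1)]
toy_fractions = tautomerism_fractions(toy)

# Percentages recomputed from the counts reported in the text.
substance_level = round(100 * 66_053_812 / 104_293_434, 1)
unique_level = round(100 * 28_417_846 / 45_808_881, 1)
```

In the toy data, redundancy inflates the record-level fraction (0.75) relative to the unique-structure fraction (0.5), which is the effect the deduplicated count is meant to remove.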
After the identification of a canonical valence bond form, a canonical resonance structure is determined in the step Standardize Aromaticity. In 38,750,144 cases, we detected the generation of an alternate Kekulé structure. In this step, aromaticity is perceived and annotated in 96,003,930 substances (92.1% of all successfully standardized substances), indicating that this fraction of structures in Substance has ‘aromatic’ structural elements in the employed perception model. Of the 45,808,881 unique structures after standardization, 41,614,562 (90.8%) contain aromatic systems under the perception model employed in this study.
The Standardize Stereochemistry step modified 18,211,483 substances. In 18,067,088 cases, stereo annotation was added to substances that did not have any prior to this standardization step (e.g., to annotate unspecified stereocenters). In 28,327 cases, existing stereochemistry annotation was modified (e.g., placing the stereo wedge on a different bond). In 116,068 substances, annotated stereochemistry was identified as being incorrect and removed [e.g., non-stereogenic Cahn–Ingold–Prelog (CIP)-type centers]. In 6,082,156 substances, existing annotation of stereochemistry was not changed. In total, after this step, 24,177,571 substances had annotated stereochemistry.
The Standardize Explicit Hydrogens step affected 6,770 substances (0.006% of successfully standardized substances, and 0.015% of modified substances). Here, changes in the de-aromatized canonical isomeric SMILES, which we used for the detection of modifications, can be the result of two effects. First, the standard valence model gets re-applied to the structures, prior to the conversion of implicit hydrogen atom counts to explicit atoms. Second, hydrogen atoms adjacent to chiral atoms are represented as explicit ‘[H]’ in the SMILES strings.
A modification rate of 44.5% in successfully standardized substances (44.3% for organic, 3.2% for inorganic, 51.6% for mixed substances) indicates that almost half of all deposited structures in PubChem are modified by algorithms to provide a consistent structure representation. The standardized structures are used to determine structure equivalency to create unique entries in PubChem Compound and map the original substances (using their SIDs) to the corresponding CIDs. It is important to note that contributed substances are kept in their original state, allowing PubChem standardization rules to be changed as a function of time and re-applied to the original content. This is especially important to keep the original intent and to avoid corruption of structural content that sometimes occurs with coding errors or methodology shortcomings.
Standardization time statistics
To show which standardization steps consume the most time, all individual per-step standardization times were normalized to the total standardization time of the particular substance. The resulting average percentages are presented in Fig. 14c. The steps Verify Element, Verify Valence and Standardize Annotations perform no modifications of the molecular graph (instead, they filter out ‘bad’ chemical structures). Consequently, they consume the least amount of time with averages of 0.1%, 3.1% and 0.4% of the time that is used per substance, respectively. The Verify Hydrogen step involves the conversion of non-special (e.g., non-isotopic and without stereo-wedge or formal charge), explicit hydrogen atoms into implicit hydrogen. On average, this step consumes 5.9% of the standardization time per structure. The Verify Functional Groups step comprises the repeated matching of substructure queries against the molecular graph. Detecting subgraph isomorphisms is an inherently complex problem, but due to the small size of substructure queries, the complexity does not fully manifest and the average fraction of per substance standardization time is 5.2%. Most of the standardization time is spent on valence bond canonicalization (in the Standardize Valence Bond Form step), with 44.0% of the per substance standardization time. The major computation expense is due, in part, to the approach used. It is not just focused on generation of a canonical tautomer. Rather, it performs a canonic walk through (potentially) many possible tautomeric forms and uses a tautomer scoring function to provide the “best” tautomer representation, as described in the “Methods” section.
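The normalization described here, where each step time is divided by the total time of its own substance before averaging, can be sketched as follows. The function name and the dictionary layout of the timing data are assumptions for illustration.

```python
from typing import Dict, List


def mean_step_percentages(timings: List[Dict[str, float]]) -> Dict[str, float]:
    """Average, over substances, of each step's share of that substance's
    total standardization time, expressed in percent."""
    sums: Dict[str, float] = {}
    counted = 0
    for per_step in timings:
        total = sum(per_step.values())
        if total == 0:
            continue  # no measurable time for this substance
        counted += 1
        for step, seconds in per_step.items():
            sums[step] = sums.get(step, 0.0) + seconds / total
    return {step: 100.0 * value / counted for step, value in sums.items()}


# Two toy substances with two steps each (times in seconds).
toy_timings = [{"a": 1.0, "b": 3.0}, {"a": 1.0, "b": 1.0}]
toy_percentages = mean_step_percentages(toy_timings)
```

Note that this per-substance average weights every substance equally, so it can differ markedly from each step's share of the summed time over all substances, which is dominated by the slowest structures.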
Just like the generation of a canonical tautomer, the standardization of aromaticity is a global operation on the molecular graph. Consequently, it is more time consuming than the initial local checks of substructure representations and accounts for 15.5% of the per substance standardization time on average. The standardization of stereochemistry relies on the computation of atomic symmetry classes, which is an iterative procedure on the entire molecular graph. On average, it takes 17.2% of the per substance standardization time. The Standardize Explicit Hydrogens step consumes 8.6% of per substance standardization time, a comparable amount of time to its inverse, Verify Hydrogen.
In general, the described standardization workflow and its implementation are rather efficient. Only 0.4% of cases take longer than 0.1 s to be individually processed. Yet, those comparatively few cases are responsible for the highest fraction of total standardization time. Steps that involve only atom-wise checks and manipulations are faster than global operations on the molecular graph. Valence bond canonicalization is the most time-consuming step and is a good target for further optimization.
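The concentration of total time in a small number of slow cases can be quantified with a short calculation. The sketch below is an illustration with hypothetical names and toy numbers; it is not the analysis code used in this study.

```python
import math
from typing import List


def share_of_time_in_slowest(times: List[float], fraction: float) -> float:
    """Return the share of total time spent on the slowest `fraction` of items."""
    ordered = sorted(times, reverse=True)
    # Always include at least one item, rounding the cut-off upwards.
    k = max(1, math.ceil(fraction * len(ordered)))
    return sum(ordered[:k]) / sum(ordered)


# Toy data: three fast structures and one very slow edge case (seconds).
toy_times = [1.0, 1.0, 1.0, 97.0]
toy_share = share_of_time_in_slowest(toy_times, 0.25)
```

In the toy data, the slowest quarter of the structures accounts for 97% of the total time, mirroring on a small scale the skew reported for the full Substance database.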
Unique structure analysis
Comparison to InChI-derived structure
141 failing substances do not pass the initial check of element specifications during PubChem standardization due to invalid isotope specifications. InChI describes the given isotope as delta value to the most common isotope in the ‘/i’ layer. In this process, it seems to accept isotope specifications that are rejected by PubChem standardization (this was verified using the InChI executables: For a wide range of isotopes rejected by PubChem, the difference to the most common isotope is still encoded in the InChI. In the case of very high differences to the most common isotope, isotope specification is omitted in the generated InChI).
In most cases (364,946 substances), those substances fail the PubChem valence check (Additional file 5).
In 10,243 cases, substances are rejected in PubChem standardization after valence bond canonicalization for identical charges on adjacent atoms or invalid valences.
The PubChem standardization protocol rejects 65 substances due to the limit of 999 explicit atoms.
With the increasing popularity of InChI as a chemical representation, some cheminformatics software packages provide the functionality to convert InChI strings into chemical structures. One may wonder how different PubChem-standardized and InChI-derived structures are [here, the InChI-derived structures refer to the structures generated from standard InChI strings using the GetStructFromINCHI() function in the InChI API library]. Therefore, the PubChem-standardized and InChI-derived structures of the 104,293,426 substances that passed both procedures (see Fig. 17) were compared with each other by using the de-aromatized canonical isomeric SMILES strings converted from them. This approach can be likened to Kekulization of an aromatic SMILES. Differences between PubChem-standardized and InChI-derived structures can manifest in two ways: disagreement on which structures are the same, and preference for a structural form. However, a thorough analysis is complicated by the fact that the conversion of a standard InChI string into a chemical structure can be problematic, yielding a structure with a different charge or tautomeric state or, especially in the case of metals, missing bonds found in the original structure. As a result, this subsequent analysis helps to identify differences between the PubChem-standardized structure and InChI-derived chemical structure.
Table: Modification frequencies in PubChem standardization applied to standard InChI-derived chemical structures. Columns: exclusively modified substances and first modified substances. Rows include Verify functional groups, Standardize valence bond form, and Standardize explicit hydrogens.
Differences noted during the Standardize Aromaticity step are rooted in the respective approaches used for the generation of a Kekulé structure. Quoting from the InChI technical manual, “the conversion of aromatic bonds to alternating single and double bonds is done through radical cancellation”. This means that each aromatic atom is initially represented as a radical. Electrons from neighboring radicals are combined into an additional (pi) bond between them if permitted by their valence. Just as in the related PubChem approach, the outcome of this procedure depends on the (canonical) processing order of atoms. This order, and consequently the resulting Kekulé structure, cannot be expected to be equivalent between both approaches. However, as the input structures are already valid Kekulé structures without aromaticity perceived and annotated, the InChI-derived structure does not result in any changes of single and double bond patterns and the outcomes of PubChem standardization applied to the originally deposited structure and the InChI-derived structure are identical.
The comparison of PubChem-standardized and InChI-derived structures revealed conceptual differences between the approaches employed to generate them. Identified differences arise from diverging valence models, conventions for the representation of functional groups, tautomeric preference and the definition of stereocenters. In the case of valence bond canonicalization, the approaches are conceptually different. Whereas PubChem standardization aims at identifying a preferred tautomer in a canonic walk using a scoring function, InChI normalization creates a single representation that covers multiple tautomeric states by considering a tautomeric region, which consists of a group of skeletal atoms that share mobile hydrogen atoms involved in tautomerism. The considerable number of unequal InChI-derived/PubChem-standardized structures (60.47% of substances passing both clean-up procedures) shows that those differences in opinion have major impact on the representation of chemical structures. This is especially important considering the increasing prevalence and use of InChI, not only as a chemical descriptor, but also to represent chemical structures (i.e., InChI-derived chemical structure), a use case for which it was never intended.
The data presented in this study show that the PubChem structure standardization is an effective and (in general) efficient method that accounts for various sources of molecular diversity and weeds out most improper structures. Its rejection rate for erroneous structures is higher than that of InChI normalization, especially with respect to isotope specifications. The low average processing time (only 0.4% of all substances have an individual standardization time above 0.1 s) and the parallelizability of the problem (embarrassingly parallel) make it suitable for automated compound registration. Yet, the total amount of time necessary to standardize the complete Substance database is dominated by a minority of structures that can be traced to difficulties and inconsistencies in chemical representation when handling organo-metallic complexes (e.g., resulting in negative charges on carbon atoms). A more detailed analysis revealed the generation of a canonical tautomer as the most time-consuming step. The normalization approach used (first developed in 2004 and with periodic major updates between 2005 and 2008) is “ripe” for further optimization, modernization, and improvement.
The representation of chemical structures used in PubChem (after standardization) overcomes problems inherent with chemical information formats. Most prominently, the definition of non-standard bond types (i.e., ionic, complex, and dative bonds) from deposited covalent single bonds remedies their influence on atom valences, ring counts and topological complexity. In this way, PubChem already exceeds what has been recently proposed for the further development of structure file formats. The representation of a stereogenic double bond with undefined cis/trans configuration as a crossed double bond is not recommended by IUPAC, but it is our opinion that this representation facilitates better understanding of the stereo-configuration of a chemical structure (or lack thereof). It reduces the risk of accidentally creating ‘not acceptable’ configurations when using the IUPAC recommended ‘wavy’ bond type. Standardized structures in Compound are made publicly available with explicit hydrogen atoms, eliminating valence ambiguities caused by different implicit-hydrogen valence models.
The comparison to InChI (v1.0.4) normalization and InChI-derived chemical structures revealed discrepancies in tautomeric preference and the definitions of stereocenters. PubChem standardization aims at generating a canonical tautomer with preferred structural properties to enhance its human interpretation. The stereocenter differences could be remedied by an expansion of the stereocenter definitions in PubChem [102–104]. It could also be the basis for further exchange and debate about standards in chemical information, even though the structure standardization problem has not yet found recognition as a grand challenge in cheminformatics or as a hindering factor in computer-assisted drug discovery.
With a large pre-existing corpus of structures (tens of millions) complying with diverging approaches, human inspection and curation of structures seems not feasible. Even though ‘RoboChemistry’ is in part responsible for creating the “wasteland” of chemical structures we are dealing with today, automated systems are the only viable option for this task—but they need to be configured, validated, and used with care. The existing standardization system in PubChem faces new challenges every time a new depositor submits data, as the deposition might include chemical representations not seen previously. Any modification to the system must be carefully validated (much like a doctor treating a patient with a promise to “first, do no harm”), with minor changes possibly affecting many thousands of structures. In PubChem, the separation of deposited structures (Substance) and standardized structures (Compound) facilitates the evaluation of alterations to the system, making the creation of a better cleanup and normalization ‘robot’ possible, while keeping provenance clear. As a community, chemical information needs to make progress towards improved digital standards in chemical file formats and chemical structure representation.
Prior to standardization, a major obstacle in cheminformatics must be addressed: different standards for representing hydrogen atoms. They are typically represented in three ways: (1) as explicit atoms; (2) as a numeric property of atoms; or (3) as implied atoms (e.g., carbon is always tetravalent, with hydrogen being assumed for any valence not already used). In the last case, the implicit hydrogen count of a non-hydrogen atom is determined by a standard value in a valence model. These hydrogen counts are typically based on atomic number, formal charge, and the number and the order of incident bonds. Unfortunately, standard valences can vary between valence models or change for a valence model as a function of time. (For example, in 2017, the default valences for the CTAB/MOL/SDF file format were changed.) Depending on the source of structural information, PubChem deals with all three representations of hydrogen atoms. Consequently, a pre-processing step is performed to unify hydrogen representations. For each atom, implicit hydrogen counts are determined and set according to a simplistic valence model by invoking the function OEAssignMDLHydrogens in the OpenEye OEChem C++ toolkit. This model assumes that bond orders and formal charges on atoms are correct and adds implicit hydrogen atoms using the available information. This is used as a simple starting point and adjusted in subsequent steps.
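How a valence model turns atomic number, formal charge, and bond orders into an implicit hydrogen count can be illustrated with a toy model. The sketch below is not the MDL valence model and not the behavior of OEAssignMDLHydrogens; the `TOY_VALENCES` table, the function name, and the "smallest allowed valence that fits" rule are assumptions chosen for illustration.

```python
from typing import Dict, Tuple

# Toy valence table (hypothetical): (element symbol, formal charge) -> allowed valences.
TOY_VALENCES: Dict[Tuple[str, int], Tuple[int, ...]] = {
    ("C", 0): (4,),
    ("N", 0): (3,),
    ("N", 1): (4,),
    ("O", 0): (2,),
    ("O", -1): (1,),
    ("S", 0): (2, 4, 6),
    ("Cl", 0): (1,),
}


def implicit_h_count(symbol: str, bond_order_sum: int, formal_charge: int = 0) -> int:
    """Fill up to the smallest allowed valence that accommodates the existing bonds.

    Atoms that are not covered by the table receive no implicit hydrogens.
    """
    for valence in TOY_VALENCES.get((symbol, formal_charge), ()):
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0
```

Because the result depends entirely on the table, two toolkits with different tables can assign different hydrogen counts to the same drawn structure, which is the ambiguity that the pre-processing step is meant to remove.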
Ionic bonds are set in cases where the ionic character of a bond clearly outweighs the covalent part [i.e., when an alkali metal or alkaline earth metal is bonded to an organic element (see Fig. 28)].
Complex bonds are used to describe coordination complexes. They occur mostly in interactions of organic elements to transition metals and are also used to represent metal–metal bonds. Prominent examples for this bond type are the bonds to central iron and magnesium ions in hemoglobin and chlorophyll, respectively.
In a dative bond (also known as a dipolar bond), an electron pair is shared between interacting partners, making one the donor and the other one the acceptor. Compared to a covalent bond, where every bonding partner contributes an electron, this bond type has higher polarity, and is weaker and longer. Dative bonds are annotated in PubChem without placing charges on the bonding partners.
In the following subsections, we describe the structure verification and normalization processes performed during PubChem standardization. The verification process consists of atom-based validity checks and modifications. In this way, it is ensured that only structures consisting of valid and reasonably configured atoms are considered in the subsequent normalization process.
This step evaluates the validity of provided element and isotope information. First, the atomic number of each atom in the structure is checked for its validity. Second, it is determined whether the provided isotope is known and valid. An internal knowledgebase from NUBASE2012 of allowed isotopes is applied. Isotopes are restricted to include only those with a half-life longer than 1 ms (isotopes with shorter half-lives can exist in the Substance database but are excluded from the compound database).
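The two checks of this step can be sketched with a small lookup table. The table below is a tiny, hand-written stand-in and is not taken from NUBASE2012; the half-life values are approximate and serve only to illustrate the 1 ms threshold. The function and constant names are hypothetical.

```python
from typing import Dict, Optional, Tuple

MIN_HALF_LIFE_S = 1e-3  # 1 ms threshold quoted in the text
STABLE = float("inf")

# (atomic number, mass number) -> approximate half-life in seconds (illustrative).
TOY_ISOTOPES: Dict[Tuple[int, int], float] = {
    (1, 1): STABLE,
    (1, 2): STABLE,
    (1, 3): 3.9e8,    # tritium, roughly 12.3 years
    (1, 4): 1e-22,    # hydrogen-4, order of magnitude only
    (6, 12): STABLE,
    (6, 13): STABLE,
    (6, 14): 1.8e11,  # carbon-14, roughly 5,700 years
}


def verify_element(atomic_number: int, mass_number: Optional[int] = None) -> bool:
    """Return True if the element is valid and the isotope, when given, is
    known and has a half-life longer than the threshold."""
    if not 1 <= atomic_number <= 118:
        return False
    if mass_number is None:
        return True  # no isotope specified, nothing further to check
    half_life = TOY_ISOTOPES.get((atomic_number, mass_number))
    return half_life is not None and half_life > MIN_HALF_LIFE_S
```

Unknown isotopes are rejected in this sketch, so the outcome depends directly on how complete the knowledgebase is.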
The verification of hydrogen atoms aims at generating a representation of the provided chemical structure that only uses implicit hydrogen atoms (as far as possible). For this purpose, explicit hydrogen atoms are converted to implicit ones by incrementing hydrogen counts of the connected atom (count increments by 1 for every deleted explicit hydrogen atom). Excluded from this conversion are hydrogen atoms in H2, H∙ radicals, and H+ or H− ions. Furthermore, the hydrogen atom to be deleted must be connected to an organic atom with a single covalent bond, must not have a charge or be isotopically labelled, and must not be incident to an annotated stereo ‘wedge’ bond. If any of those criteria are not met, the explicit hydrogen atom is not removed and the implicit hydrogen atom count of its adjacent atom is not incremented.
Arsenic, phosphorus, and nitrogen atoms with a valence of 5 get assigned a formal charge of +1 and their implicit hydrogen count is decreased by 1, thus reducing the valence by one.
Selenium or sulfur atoms with a valence of 6 or 4 get assigned a formal charge of −1 and their implicit hydrogen count is decreased by 1, thus reducing the valence by one.
Iodine, bromine, or chlorine atoms with a valence of 7, 5 or 3 get assigned a formal charge of −1 and their implicit hydrogen count is decreased by 1, thus reducing the valence by one.
On non-organic atoms (see Fig. 28), the implicit hydrogen count is set to a default value of 0, thus preventing implicit hydrides. (e.g., ‘Li’ does not become ‘LiH’).
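The criteria for converting explicit hydrogen atoms into implicit ones, as described for this step, can be sketched on a minimal molecular graph. The `Atom` and `Bond` classes, the function name, and the `ORGANIC` element set are assumptions (the actual set of organic elements is defined in Fig. 28); the charge adjustments listed above are not part of this sketch.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

# Assumed set of "organic" elements; the definitive list is given in Fig. 28.
ORGANIC = {"B", "C", "N", "O", "P", "S", "F", "Cl", "Br", "I"}


@dataclass
class Atom:
    symbol: str
    charge: int = 0
    isotope: int = 0      # 0 means no isotope label
    implicit_h: int = 0


@dataclass
class Bond:
    a: int
    b: int
    order: int = 1
    kind: str = "covalent"
    wedge: bool = False


def collapse_explicit_hydrogens(atoms: List[Atom],
                                bonds: List[Bond]) -> Tuple[List[Atom], List[Bond]]:
    """Fold removable explicit hydrogens into the implicit count of their neighbor."""
    new_atoms = [replace(atom) for atom in atoms]  # copies, inputs stay untouched
    degree = [0] * len(atoms)
    for bond in bonds:
        degree[bond.a] += 1
        degree[bond.b] += 1
    removable = set()
    for bond in bonds:
        for h, other in ((bond.a, bond.b), (bond.b, bond.a)):
            atom = atoms[h]
            if (atom.symbol == "H" and atom.charge == 0 and atom.isotope == 0
                    and degree[h] == 1
                    and bond.order == 1 and bond.kind == "covalent"
                    and not bond.wedge
                    and atoms[other].symbol in ORGANIC):
                removable.add(h)
                new_atoms[other].implicit_h += 1
    keep = [i for i in range(len(atoms)) if i not in removable]
    index = {old: new for new, old in enumerate(keep)}
    kept_bonds = [replace(b, a=index[b.a], b=index[b.b])
                  for b in bonds if b.a not in removable and b.b not in removable]
    return [new_atoms[i] for i in keep], kept_bonds


# Methane drawn with three plain hydrogens and one deuterium label.
methane_d1 = [Atom("C"), Atom("H"), Atom("H"), Atom("H"), Atom("H", isotope=2)]
methane_bonds = [Bond(0, i) for i in range(1, 5)]
out_atoms, out_bonds = collapse_explicit_hydrogens(methane_d1, methane_bonds)
```

In this sketch, molecular hydrogen (H bonded to H) and isolated hydrogen atoms or ions are left unchanged, because the neighbor is not an organic atom or no bond exists.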
Verify functional groups
Oxides and analogous cases for carbon
The first group of standardization rules handles the standardization of various oxides and analogous cases for carbon (Fig. 30a–d). Hydrogen and charge preferences are set and valences are adjusted. Many zwitter-ionic bonds are converted to double bonds (which in some cases is an overly aggressive normalization that prevents some known forms of stereochemistry).
Ionic bonds are set to indicate interactions between charged atoms as appropriate. Nonetheless, the involved atoms keep their charges (Fig. 30e–g). If ionic bonding partners are not connected by a bond, an ionic bond is defined (Fig. 30e). A prerequisite is that the two ionic bonding partners are the only matches for their respective type. If, for example, two Na+ ions and one Cl− ion are present, it cannot be decided which of the Na+ ions is involved in the bond, and no ionic bond is set. If the ionic bonding partners are connected by a covalent (single) bond, this bond is replaced by an ionic bond and charges are adapted as necessary (Fig. 30f, g). The conversion of a covalent into an ionic bond also applies to the charged variants of this scenario. The alterations in charge are incremental in this case and not hard coded as + 1/− 1 and + 2/− 2, respectively (this is an area where more aggressive normalization than currently performed may be warranted, given the combinatoric ways of drawing various equivalent salt forms).
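The uniqueness prerequisite for unconnected ionic partners can be sketched as below. The function name and the representation of atoms as (symbol, charge) pairs are hypothetical choices for this illustration.

```python
def pair_for_ionic_bond(atoms, cation="Na", anion="Cl"):
    """atoms: list of (symbol, charge) tuples.

    Return the index pair (cation_index, anion_index) if each partner is the
    only match of its type, otherwise None (no ionic bond is set).
    """
    cations = [i for i, (sym, q) in enumerate(atoms) if sym == cation and q > 0]
    anions = [i for i, (sym, q) in enumerate(atoms) if sym == anion and q < 0]
    if len(cations) == 1 and len(anions) == 1:
        return cations[0], anions[0]
    return None


unambiguous = pair_for_ionic_bond([("Na", 1), ("Cl", -1)])
ambiguous = pair_for_ionic_bond([("Na", 1), ("Na", 1), ("Cl", -1)])
```

The first call yields a pair of atom indices; the second yields `None`, mirroring the two-sodium example in the text.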
The standardization of tri-valent oxygen handles cases where the coordinate bond between oxygen and boron is represented as covalent single bond (Fig. 30h). In those cases, the bond is replaced with a dative bond. The oxygen and boron atoms must be uncharged prior to this modification.
Three different cases exist for the standardization of tri-valent oxygen (Fig. 30i). The atom must be uncharged and terminal, connected only by a triple bond to another atom. If such an atom is connected to a carbon atom that is connected to a metal by a single bond or a terminal uncharged carbon atom (as in carbon monoxide), a charge of − 1 is placed on the carbon atom and the oxygen gets assigned charge + 1. In all other cases, the oxygen atom gets assigned charge + 1.
Transition metals and semiconductor elements
The simplest case for the processing of transition metals and semiconductor elements is when the atom is not connected to other atoms. If it has a charge present in the valence list (provided in Additional file 1), its processing terminates successfully. Otherwise the charge is set to 0 (varying transition metal charge schemes are in use, often with the transition metal charge serving to ensure a net neutral molecule rather than to express a known valid formal charge, which makes it difficult or nearly impossible to reliably infer the chemist’s original intent from the structure alone). In both cases, standardization proceeds with the next transition metal atom if there is one. If the transition metal atom is connected to other atoms, certain bonding scenarios remain unmodified (Fig. 31a–c). In other cases, covalent bonds will be replaced by complex bonds and the participating atoms’ charges and/or hydrogen counts will be adapted (Fig. 31d–e).
If the transition metal atom is connected to the adjacent atom by anything other than a single bond, or if the other atom does not belong to any of the organic, semiconductor, and metal element classes and is not boron, silicon, or selenium, it remains unchanged and standardization proceeds with the next neighboring atom.
If the neighboring atom is a positively charged nitrogen atom that engages in a pi bond, the charge of the transition metal atom is incremented by 1 and the charge of the nitrogen atom is set to 0 (Fig. 31d).
For standardization to proceed, the configuration of the connected atom must be in the valence list. If the adjacent atom is uncharged carbon in an aromatic 5- or 7-membered carbon-only ring or uncharged nitrogen in an aromatic 5-membered nitrogen-containing ring, its charge is set to − 1 and that of the transition metal atom is increased by + 1 (as illustrated in Fig. 31e). This accounts, for example, for situations encountered in porphyrin systems.
The same happens if the adjacent atom is uncharged carbon or nitrogen (both not in a ring) that is connected to an oxygen atom by a double bond: The adjacent atom gets assigned charge − 1 and that of the transition metal is incremented by + 1 (Fig. 31f). Uncharged carbon, uncharged nitrogen and uncharged sulfur (except for the case of tetra-valent sulfur with one hydrogen atom) get assigned a negative charge as well, and the charge of the transition metal is incremented by 1. In the mentioned special case of sulfur, the hydrogen atom is removed. In the case the neighboring atom is a nitrogen with charge + 1, its charge is set to 0. If the charge of the neighboring atom has not been changed by any of those rules, the number of implicit hydrogens on the adjacent atom is incremented by 1.
Finally, the covalent single bond between the transition metal atom and its neighbor is replaced by a complex bond.
Seven cases of penta-valent nitrogen are differentiated (Fig. 32a–g). If a penta-valent nitrogen is connected to a terminal nitrogen atom by a triple bond and to another carbon, nitrogen, or oxygen atom by a double bond (e.g., the azide functional group), the triple bond is decreased to a double bond by charge separation; the terminal nitrogen gets assigned a charge of − 1 and the former penta-valent one gets a charge of + 1 (Fig. 32a). If the penta-valent nitrogen is connected to a terminal oxygen or sulfur by a double bond as well as a tetra-valent carbon by a triple bond, the double bond is decreased to a single bond by charge separation; the terminal oxygen or sulfur gets assigned a charge of − 1 and the former penta-valent nitrogen gets a charge of + 1 (Fig. 32b). The nitro group as well as nitrate have their own standardized form with charge separated single bonds (Fig. 32c). If a N=O group is attached to a penta-valent nitrogen connected to three carbon atoms by single bonds, the double bond to nitrogen is replaced by a single bond, placing a positive charge on the nitrogen and a negative charge on the terminal oxygen (Fig. 32d). If one of the adjacent atoms to a penta-valent nitrogen with five single-bonded connections in total is oxygen (or sulfur) that is single-bonded to C, N, P or S, the N–O (or N–S) bond is replaced by an ionic bond, placing a positive charge on the nitrogen and a negative charge on the oxygen (or sulfur) (Fig. 32e). The same processing is applied if a halogen (F, Cl, Br, I) atom is connected to a penta-valent nitrogen with five single-bonded connections (Fig. 32f) or with three single-bonded and one double-bonded connections (Fig. 32g).
After the penta-valent nitrogen cases, tetra-valent cases are processed. As a simple rule, if a tetra-valent nitrogen has a zero charge and at least one implicit hydrogen, the charge is considered the more reliable information and the implicit hydrogen count is decreased by 1. Otherwise, the charge on the nitrogen is increased by 1. The additional cases are like those for penta-valent nitrogen. If the nitrogen with four connections (all single-bonded, Fig. 32h) or three connections (two single-bonded and one double-bonded, Fig. 32i) is single-bonded to a halogen, the nitrogen-halogen single bond is replaced by an ionic bond, placing a positive charge on nitrogen and a negative charge on the halogen (Fig. 32h, i). If a tetra-valent nitrogen is connected to a penta-valent boron atom by a double bond, this bond is replaced by a dative bond (Fig. 32j). An uncharged tetra-valent nitrogen atom explicitly connected to carbon or nitrogen atoms by four single bonds (Fig. 32k) or by two single bonds and one double bond (Fig. 32l) gets assigned a charge of + 1. If a nitro group is represented with a charged tetra-valent nitrogen and a single-bonded hydroxyl group (and thus could not be fixed using the rules for penta-valent nitrogen), the hydroxyl group is deprotonated (Fig. 32m).
PubChem stores bond annotations as properties. These are used to control customized bond visualization, for example, for PubChem-specific non-standard bond types. These annotations can be provided by PubChem data contributors during substance submission. They are converted to covalent bonds during pre-processing and re-perceived. To prevent them from influencing subsequent steps, they are removed at this point during standardization processing.
Standardize valence bond form
This step generates a canonical preferred tautomer of a structure, considering protons and charges as mobile elements. For this purpose, the various covalently-connected components of a deposited substance are treated separately and a canonical tautomer is generated for each one of them. If a component has fewer than two connected atoms, its processing is skipped. Before the actual valence bond canonicalization, the structure is checked against a hand-curated ‘blacklist’ of structures that spent too much time in this step in the past without yielding a better tautomer (vide infra). If the component is on this blacklist (65 structures, provided as canonical SMILES in Additional file 2), it skips valence bond canonicalization. The component is then checked against a second list of structures subject to limited processing (1746 structures provided in Additional file 3). The maximum number of generated tautomers per connected component is 250,000 in the unlimited case. In the limited case, this number is reduced to 2500 to reduce processing time at the expense of a less-extensive canonical walk through valence-bond forms.
Explicit hydrogen atoms are made implicit with the same exceptions as described in Verify Hydrogen. Certain charges are identified in the component that should not be modified during the valence bond canonicalization (for example, these are charged atoms in annotated complex or ionic bonds, terminal N− as in [N−]=[N+]=*, and the N+ and O− as in a nitro group). These are immobilized on the respective atoms; later, generated tautomers that do not possess the identical pattern of those charges are rejected. This is the case for charged atoms involved in complex bonds (possibly) generated in a previous step, and negative charges around certain nitrogen configurations: (1) if a positively charged and tetra-valent nitrogen with an explicit degree of two is connected to a terminal negatively charged nitrogen by a double bond, the negative charge on the terminal nitrogen is kept in place (e.g., azide group); (2) if a positively charged and tetra-valent nitrogen with an explicit degree of three is connected to a terminal oxygen (or sulfur) atom with charge − 1 and another oxygen (or sulfur) atom by a double bond, the negative charge on the terminal oxygen (or sulfur) atom is kept in place (e.g., nitro group). During the optimization, tautomerization of methyl and methylene groups is not considered, due to an extensive expansion of memory and computational cost. (Improved normalization covering acidic hydrogen atoms on carbon is warranted but not performed, as there are many cases of sp2-hybridized carbon atoms that could also be readily represented in an sp3-hybridized form, especially in keto-enol cases. In some cases, the opposite is true, especially for some heterocycles where the presence of sp3-hybridized carbon prevents aromaticity from being identified.)
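The selection of the enumeration limit for a connected component can be summarized in a short sketch. The function name and its arguments are hypothetical; the two limits are the values stated above, and the lists are passed in as plain sets of canonical SMILES strings.

```python
from typing import Optional

UNLIMITED_MAX_TAUTOMERS = 250_000
LIMITED_MAX_TAUTOMERS = 2_500


def tautomer_budget(smiles: str, atom_count: int, blacklist: set, limited: set) -> Optional[int]:
    """Return the maximum number of tautomers to generate for a component,
    or None when valence bond canonicalization is skipped entirely."""
    if atom_count < 2:
        return None                      # nothing to tautomerize
    if smiles in blacklist:
        return None                      # known to be too expensive
    if smiles in limited:
        return LIMITED_MAX_TAUTOMERS
    return UNLIMITED_MAX_TAUTOMERS


# Placeholder list entries, not taken from the additional files.
blacklist = {"BLACKLISTED"}
limited = {"LIMITED"}
```

Checking the blacklist before the limited list follows the order in which the two checks are described.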
negatively and positively charged carbon atoms both are present in the structure;
a combination of negatively charged nitrogen, oxygen, sulfur, phosphorus or carbon and positively charged nitrogen, oxygen, sulfur, phosphorus or carbon are present in the structure;
the structure has any number of charged carbon atoms (positive or negative);
at least one positively charged oxygen atom is present in the structure;
at least one negatively charged nitrogen atom is present in the structure;
at least one positively charged nitrogen or negatively charged oxygen atom is present in the structure;
any other case.
fewer positively charged carbon atoms;
fewer negatively charged carbon atoms;
fewer negatively charged phosphorus atoms;
fewer positively charged sulfur atoms;
fewer positively charged oxygen atoms;
fewer negatively charged nitrogen atoms;
more positively charged nitrogen atoms;
more negatively charged oxygen atoms;
more negatively charged sulfur atoms;
more positively charged phosphorus atoms;
more zwitter-ionic cases of [N+]–[O−] with a tetra-valent nitrogen or [N−]=[N+]=*;
more hydrogen atoms on carbon;
fewer hydrogen atoms on oxygen;
fewer hydrogen atoms on sulfur;
more aromatic atoms (here, the OEAroModelMDL is used because it has stronger emphasis on cyclic systems and ignores exocyclic bonds, which is preferred in this case);
fewer hydrogen atoms on nitrogen;
fewer hydrogen atoms on phosphorus;
fewer hydrogen atoms on atoms in rings;
fewer C=C double bonds.
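One way to read the preference list above is as a lexicographic ranking. The sketch below rests on that assumption, namely that the criteria are applied in the order given and each one only breaks ties left by the previous ones; it covers just the ten charge-related criteria. The feature labels such as `"N-"` are hypothetical names for per-structure counts that would come from a cheminformatics toolkit.

```python
# (feature, sign): sign +1 means "fewer is preferred", -1 means "more is preferred".
# Only the charge-related criteria are included in this sketch.
CRITERIA = [
    ("C+", 1), ("C-", 1), ("P-", 1), ("S+", 1), ("O+", 1), ("N-", 1),
    ("N+", -1), ("O-", -1), ("S-", -1), ("P+", -1),
]


def sort_key(features: dict) -> tuple:
    """Smaller tuples rank better; missing features count as zero."""
    return tuple(sign * features.get(name, 0) for name, sign in CRITERIA)


def preferred(candidates: list) -> dict:
    """Pick the best candidate under the assumed lexicographic ordering."""
    return min(candidates, key=sort_key)


# Two hypothetical tautomers that differ in where the negative charge sits.
charge_on_nitrogen = {"N-": 1}
charge_on_oxygen = {"O-": 1}
best = preferred([charge_on_nitrogen, charge_on_oxygen])
```

Here the form with the charge on oxygen wins, since it has fewer negatively charged nitrogen atoms and more negatively charged oxygen atoms.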
The generated structure (with possibly multiple connected components) is subjected to a valence check as described in Verify Valence. If the generation of a canonical tautomer yielded at least one atom with a configuration not in the valence list, the substance fails this standardization step, and consequently standardization. In addition to that, a sanity check of local atom neighborhoods is performed. If a situation was created where a charged atom is adjacent to an atom with the identical charge type, the structure fails this standardization step.
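The neighborhood sanity check can be sketched as a scan over all bonds. The representation (a dictionary of formal charges and a list of atom-index pairs) is a hypothetical simplification, and "identical charge type" is interpreted here as charges of the same sign.

```python
def has_adjacent_like_charges(charges: dict, bonds: list) -> bool:
    """charges: atom index -> formal charge; bonds: list of (a, b) index pairs.

    Return True if any bond joins two atoms whose charges have the same sign.
    """
    for a, b in bonds:
        if charges[a] * charges[b] > 0:
            return True
    return False


fails = has_adjacent_like_charges({0: 1, 1: 1, 2: 0}, [(0, 1), (1, 2)])
passes = has_adjacent_like_charges({0: 1, 1: -1, 2: 0}, [(0, 1), (1, 2)])
```

A structure for which the function returns `True` would fail this standardization step.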
If the processing time for one of the components was above 5 min and the iteration limit was 250,000, the structure is flagged as a candidate to be put on the list for limited tautomer enumeration. If the limit was already set to 2500 and 5 min elapsed in this standardization step, it is flagged as a candidate to be put on the blacklist (such lists are periodically updated in source code).
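The flagging rule reduces to a small decision function. The function name and the returned labels are hypothetical, and a strict "more than 5 min" comparison is assumed for both cases.

```python
from typing import Optional

TIME_LIMIT_S = 5 * 60  # 5 minutes


def flag_component(elapsed_s: float, iteration_limit: int) -> Optional[str]:
    """Return which curated list the component is a candidate for, if any."""
    if elapsed_s <= TIME_LIMIT_S:
        return None
    if iteration_limit == 250_000:
        return "limited-enumeration candidate"
    if iteration_limit == 2_500:
        return "blacklist candidate"
    return None
```

A component therefore moves at most one level per observation: from unlimited to limited enumeration, and only from limited enumeration to the blacklist.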
Stereo configuration of tetrahedral atoms
As a measure of priority, the PubChem standardization protocols employ symmetry classes as implemented in the OpenEye OEChem toolkit. This concept is similar to atom classes in Morgan’s relaxation algorithm [109, 110]. Stereocenters can be easily identified using this concept. If a tetrahedral atom has four adjacent atoms that belong to different symmetry groups, it is chiral. If atoms incident to a stereogenic double bond have adjacent atoms of unequal symmetry groups, that double bond is a stereogenic center. We assign symmetry classes using the function OEPerceiveSymmetry in the OpenEye OEChem C++ toolkit. Explicit hydrogen atoms get assigned their own symmetry class of ‘0’ (lowest priority).
In the PubChem structure standardization protocols, stereochemistry standardization relies mostly on routines from the OpenEye OEChem C++ toolkit. If 3-D structural information is provided, stereo information is perceived using the function OE3DToInternalStereo. It configures the tetrahedral chirality around atomic centers and the E/Z configuration around double bonds based on 3-D atom coordinates, provided they are not set to ‘any’ stereo. If the structure has no atomic coordinates at all (e.g., it was submitted as SMILES string), 2-D coordinates for the structure are generated using the function OEDepictCoordinates in the OpenEye OEDepict C++ toolkit that assigns a set of 2-D coordinates to each explicit atom. If tetrahedral atoms in this structure have a defined parity but incident bonds are not annotated as bold or hashed wedges, the parity is used to set this annotation accordingly using the function OEMDLPerceiveBondStereo. In all cases of atom-coordinate dimensionality, if tetrahedral atoms are defined only by provided bond annotations (wedge and hashed bonds) the parity is set using the function OEMDLStereoFromBondStereo.
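Once symmetry classes are available, the two identification rules amount to counting distinct classes. The sketch below takes the classes as plain integers supplied by the caller (it does not call OEPerceiveSymmetry), and the function names are hypothetical; hydrogens are represented by class 0 as described above.

```python
def is_tetrahedral_stereocenter(neighbor_classes: list) -> bool:
    """True if the atom has four neighbors in four different symmetry classes."""
    return len(neighbor_classes) == 4 and len(set(neighbor_classes)) == 4


def is_stereogenic_double_bond(left_classes: list, right_classes: list) -> bool:
    """Each argument holds the symmetry classes of the two substituents on one
    end of the double bond; both ends need two unequal classes."""
    return len(set(left_classes)) == 2 and len(set(right_classes)) == 2


# Hypothetical class values: 0 is a hydrogen, other integers are heavy atoms.
chiral = is_tetrahedral_stereocenter([0, 3, 5, 7])
not_chiral = is_tetrahedral_stereocenter([0, 0, 5, 7])
```

The second example fails because two substituents (both hydrogens) share a symmetry class.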
Phosphorus atoms that are not tri-valent and tri-coordinated or penta-valent and tetra-coordinated are non-chiral. The same is true if more than one adjacent atom is of type OH, O−, =O, SH, S− or =S, as those may be subject to mesomeric effects (cases of S=P–OH and O=P–SH can be chiral, whereas O=P–OH and S=P–SH cases are achiral).
Sulfur atoms that are hexa-valent, tetra-coordinated and adjacent via a single bond to carbon with implicit hydrogen atoms or charge, or are incident to a bond that is not a single or a double bond, are non-chiral; the same is true in tetra-valent and tertiary cases if more than one adjacent atom is of type OH, O−, =O, SH, S− or =S.
If an atom is not phosphorus or sulfur, it must be tetra-valent and tetra-coordinated to be considered for chirality tests. Otherwise, it is non-chiral.
Stereo configuration of double bonds
Double bonds considered to exhibit geometric stereoisomerism are non-aromatic double bonds with a connectivity of three for each incident atom. If the bond is in a ring, the smallest ring it is in must be at least of size eight (atoms). If either side has two adjacent (or implicit) hydrogen atoms, it is configured as ‘undefined’. If one of the atoms incident to the double bond is nitrogen, this atom must meet two conditions for further investigating stereochemistry. It is not allowed to have an adjacent atom that is: (1) a hydrogen atom (or an implicit hydrogen atom); or (2) a carbon atom that is adjacent to carbon, hydrogen (or has implicit hydrogen atoms) or incident to a single bond (except for that to the nitrogen atom). Otherwise the double bond is configured as ‘undefined’ (note that structures that do not meet the two conditions may be subject to mesomeric effects).
If the above-mentioned conditions are met, the atoms adjacent to those incident to the double bond are investigated for their symmetry classes. There must be atoms of two different symmetry classes on each side of the double bond, taking implicit hydrogen atoms into account. The bond parity is defined as E or Z with respect to the pair of adjacent atoms with the highest symmetry class on each side of the double bond. The bond in question is checked for an annotated parity by passing those two atoms to the GetStereo function of the double bond. If no bond parity was defined (GetStereo returned ‘undefined’) and the atom coordinates were not automatically generated in a prior step, the atom coordinates are used to determine the E/Z configuration. If the two defining atoms are on the same (opposite) side of the double bond, it is defined as Z (E). The IUPAC recommendation for undefined stereochemistry around a double bond is to draw the single bond as an extension of the double bond in question, with an angle of 180° between the two. This guideline is implemented with a tolerance of 10°; higher deviations result in the automated perception as E or Z from atom coordinates. If the bond was originally annotated as ‘undefined’, this annotation takes priority over the determined parity and the bond remains annotated as undefined (accounting for cases where the 2-D coordinates were only chosen for visualization, not for bond stereo configuration).
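The coordinate-based perception can be illustrated with plain 2-D geometry. This is a sketch under stated assumptions rather than the toolkit routine: the function names are hypothetical, the inputs are the coordinates of the two double-bond atoms and of the highest-ranked substituent on each end, and the 10° tolerance is applied as an inclusive bound.

```python
import math

LINEAR_TOLERANCE_DEG = 10.0


def _angle_deg(center, p, q):
    """Angle p-center-q in degrees."""
    v1 = (p[0] - center[0], p[1] - center[1])
    v2 = (q[0] - center[0], q[1] - center[1])
    cos_t = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def _side(c1, c2, p):
    """Sign tells on which side of the line c1 -> c2 the point p lies."""
    return (c2[0] - c1[0]) * (p[1] - c1[1]) - (c2[1] - c1[1]) * (p[0] - c1[0])


def double_bond_parity(sub1, c1, c2, sub2):
    """sub1 is attached to c1, sub2 to c2; c1=c2 is the double bond."""
    # A substituent drawn as a (near) linear extension marks undefined stereo.
    if abs(180.0 - _angle_deg(c1, sub1, c2)) <= LINEAR_TOLERANCE_DEG:
        return "undefined"
    if abs(180.0 - _angle_deg(c2, sub2, c1)) <= LINEAR_TOLERANCE_DEG:
        return "undefined"
    product = _side(c1, c2, sub1) * _side(c1, c2, sub2)
    if product == 0:
        return "undefined"  # degenerate drawing
    return "Z" if product > 0 else "E"


same_side = double_bond_parity((-0.5, 0.866), (0, 0), (1, 0), (1.5, 0.866))
opposite = double_bond_parity((-0.5, 0.866), (0, 0), (1, 0), (1.5, -0.866))
linear = double_bond_parity((-1.0, 0.05), (0, 0), (1, 0), (1.5, 0.866))
```

As described in the text, such a routine would only be consulted when no parity is annotated and the coordinates were not generated automatically.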
Standardize explicit hydrogens
All standardized structures in PubChem Compound are available in SDF as well as PubChem-specific ASN.1 or XML format, with explicitly specified hydrogen atoms. So far, the described standardization worked on structures with implicit hydrogen atom counts. In this last step of the standardization, those counts are converted to explicit hydrogen atoms, connected by a single bond to the parent atom.
Only atoms with one or more attached hydrogens are processed in this step, consistent with the definition of an implicit hydrogen count of 0 on all other atoms in the step Verify Hydrogen. On each processed atom, the implicit hydrogen counts are set using the function OEAssignMDLHydrogens in the OpenEye OEChem C++ Toolkit. The underlying model assumes that the atomic number and formal charge are set to their correct values, which was taken care of in the previous standardization steps. In the case of radicals, hydrogen counts are lower by the number of unpaired valence electrons. The correct position of explicit hydrogen atoms is not determined in this step. This is taken care of separately in the generation of 2-D or 3-D coordinates. Neither the atom count nor the bond count of the resulting structure may exceed 999, the upper limit of what is supported by the MDL V2000 MOL file format. Otherwise the structure fails this standardization step. While not a technical limit of PubChem, this cutoff was a convenient choice to place a limit on what is considered a ‘small’ molecule, and may be changed in the future.
Unique identifier mapping
The final mapping from substances to entries in PubChem Compound is made based on CACTVS structural hash codes calculated for the standardized structures [111–113]. If the hash code of a standardized structure is not present in Compound, a new entry with a new compound identifier (CID) is created. If a CID with an identical hash code already exists, the substance identifier (SID) of the substance the standardized structure was generated from is associated with this CID and listed as related substance.
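The mapping logic can be sketched with a small in-memory registry. The class and its attributes are hypothetical, and a SHA-256 digest of a structure string is used only as a stand-in for the CACTVS structural hash code.

```python
import hashlib


class CompoundRegistry:
    """Toy model of the SID-to-CID mapping; not the PubChem implementation."""

    def __init__(self):
        self._cid_by_hash = {}
        self.sids_by_cid = {}
        self._next_cid = 1

    def register(self, sid: int, standardized_structure: str) -> int:
        # Stand-in for the CACTVS hash code of the standardized structure.
        key = hashlib.sha256(standardized_structure.encode("utf-8")).hexdigest()
        cid = self._cid_by_hash.get(key)
        if cid is None:
            # Unknown hash code: create a new compound entry.
            cid = self._next_cid
            self._next_cid += 1
            self._cid_by_hash[key] = cid
            self.sids_by_cid[cid] = []
        # Known or new, the substance is listed as related to this compound.
        self.sids_by_cid[cid].append(sid)
        return cid


registry = CompoundRegistry()
cid_a = registry.register(101, "CCO")
cid_b = registry.register(102, "CCO")
cid_c = registry.register(103, "CC")
```

Substances 101 and 102 end up under the same compound identifier, while substance 103 receives a new one.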
Standardization modification tracking
For this study, we generated a canonical isomeric SMILES (canonical SMILES with stereo information) before and after each step of the standardization procedure using the function OECreateIsoSmiString in the OpenEye OEChem C++ toolkit. This way it is possible to detect structural modifications in every step. Isomeric SMILES were generated from de-aromatized structures: prior to string generation, all perceived and annotated aromaticity flags were removed using the function OEClearAromaticFlags in the OpenEye OEChem C++ Toolkit.
An alternative structure representation for this purpose would have been the IUPAC International Chemical Identifier (InChI) [11–13]. Yet, it does not have an advantage over SMILES in this use case. During the generation of standard InChIs, an InChI-specific structure normalization is performed that would obfuscate modifications resulting from PubChem standardization. InChIs can be configured to be ‘non-standard’ and describe a structure ‘as-is’, essentially making them equivalent to SMILES for our purposes. In this case, there would have been no benefit in choosing InChI, and doing so may have created confusion. We also chose SMILES so we could resort to functionalities readily available within the OpenEye Scientific Software Inc. C++ toolkits [89–92], avoiding unnecessary conversion between toolkits or other changes that might alter subsequent analysis.
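The before/after comparison can be sketched generically. In this illustration each step is a plain function on a SMILES string, and the two toy steps are hypothetical string manipulations; the study itself operated on molecule objects and generated the strings with OECreateIsoSmiString.

```python
def track_modifications(smiles: str, steps: list):
    """steps: list of (name, function) pairs, each function mapping a SMILES
    string to a (possibly identical) SMILES string.

    Return the final string and the names of the steps that changed it.
    """
    modified_by = []
    current = smiles
    for name, step in steps:
        after = step(current)
        if after != current:
            modified_by.append(name)
        current = after
    return current, modified_by


# Toy stand-ins for standardization steps, for illustration only.
toy_steps = [
    ("strip whitespace", str.strip),
    ("charge-separate nitro", lambda s: s.replace("N(=O)=O", "[N+](=O)[O-]")),
]
final, changed_in = track_modifications("CN(=O)=O", toy_steps)
```

For this input only the second toy step is recorded as having modified the structure.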
It is important to note that non-standard bonds used by PubChem are ignored when computing a SMILES. This will make some structures appear to be identical that are not if their nonstandard bonding is different or when compared to structures devoid of such bonds.
Standardization time statistics
We monitored the elapsed time per standardization step and the total standardization time per substance using the CStopWatch class in the NCBI C++ toolkit. Time was measured as wall time on a mixed-use heterogeneous compute cluster. It may not accurately reflect the actual time spent when a server is overloaded or when servers with different processor speeds are used. With that said, it does give a relative speed on modern hardware.
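Per-step and total wall-time measurement of this kind can be sketched as follows. Python's `time.perf_counter` is used here as a stand-in for the CStopWatch class, and the function name and step representation are hypothetical.

```python
import time


def time_steps(value, steps: list):
    """Run each (name, function) step in order and record wall time per step.

    Return the final value, a dict of seconds per step, and the total seconds.
    """
    per_step = {}
    start_total = time.perf_counter()
    for name, step in steps:
        start = time.perf_counter()
        value = step(value)
        per_step[name] = time.perf_counter() - start
    total = time.perf_counter() - start_total
    return value, per_step, total


result, per_step, total = time_steps(" cco ", [("strip", str.strip), ("upper", str.upper)])
```

As with the measurements in the study, such wall times depend on machine load and hardware and are best read as relative values.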
Unique structure analysis
The purpose of the described PubChem standardization protocols is the identification of erroneous structures and the compensation for various aspects of chemical structures that lead to multiple valid representations of effectively the same molecular species. Consequently, the number of unique structures in a before/after comparison is expected to be less than the number of processed structures. To determine this degree of structure merging, we compared the numbers of unique structures before and after standardization using their representation as de-aromatized canonical isomeric SMILES. This approach has high structural sensitivity, as it allows distinguishing between stereoisomers as well as different Kekulé structures. Comparison was limited to structures that could be successfully standardized using the PubChem standardization protocols.
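The degree of structure merging follows directly from counting distinct strings before and after standardization. The function name is hypothetical and the SMILES strings are illustrative toy data, not records from the study.

```python
def merge_statistics(before: list, after: list):
    """Return (unique before, unique after, fractional reduction)."""
    unique_before = len(set(before))
    unique_after = len(set(after))
    return unique_before, unique_after, 1 - unique_after / unique_before


# Two drawings of the same nitro compound collapse into one representation.
before = ["CN(=O)=O", "C[N+](=O)[O-]", "CCO"]
after = ["C[N+](=O)[O-]", "C[N+](=O)[O-]", "CCO"]
n_before, n_after, reduction = merge_statistics(before, after)
```

In this toy set, three distinct input representations reduce to two, a reduction of one third.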
Comparison to InChI structure normalization
Structure normalization is an integral part of the generation of the IUPAC International Chemical Identifier (InChI) [11–13]. The PubChem standardization approach described here was developed independently of InChI and prior to the widespread use of InChI. As a first step in the comparison of PubChem standardization and the InChI normalization, we compared the numbers of unique structures after standardization, identified by their de-aromatized canonical SMILES, to those of unique standard InChIs generated from the original structures. For this purpose, standard InChIs were generated using the InChI VC++ projects provided by the InChI Trust. The comparison was limited to the 104,669,789 substances that have complete, non-auto-generated structures. We kept track of differences in standardization/normalization success for both methods. For further analysis, the generated InChIs were converted back to structures (InChI-derived structure) and represented by de-aromatized canonical isomeric SMILES as well. InChI was never designed to be a file format, and its use as one is not recommended. However, it seemed important to check whether an InChI normalized structure followed by conversion back to a chemical structure would yield the same PubChem-standardized structure to identify caveats/issues.
Results and statistics presented in this study were generated from a local copy of the PubChem Substance ASN.1 files available from the PubChem FTP repository, accessed on January 14th 2013. At that time, PubChem contained 116,641,122 substance records with a maximum substance identifier (SID) 144,075,000. The PubChem structure standardization service is accessible as a public resource under https://pubchem.ncbi.nlm.nih.gov/standardize/, and via programmatic interfaces.
VDH devised and performed the analysis of the standardization protocols and comparison to InChI, generated and analyzed the data. The manuscript was initially drafted by VDH. SK re-analyzed the data and elaborated the manuscript. EEB developed the standardization protocols. All authors commented on the manuscript and read and approved the final version.
We thank Roger Sayle (NextMove Software) for his insight in aromaticity definitions used in cheminformatics, the differences between perception models, and a concise characterization of the term ‘RoboChemistry’, as well as Paul Thiessen (NCBI–NLM–NIH) for providing an interface to the InChI library made available by the InChI Trust. We also thank Wolf D. Ihlenfeldt (Xemistry), Igor Pletnev (Lomonosov Moscow State University) and OpenEye Scientific Software (in general) for numerous discussions over the years that contributed to this work. We also thank the anonymous reviewers for their valuable remarks.
The authors declare that they have no competing interests.
Availability of data and materials
The data that support the findings of this study are included in this published article and its additional files. All substance records analyzed in this study can be downloaded in bulk from the PubChem FTP site (ftp://ftp.ncbi.nlm.nih.gov/pubchem/).
Ethics approval and consent to participate
This research was supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health, US Department of Health and Human Services.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Brown FK (1998) Chapter 35—chemoinformatics: what is it and how does it impact drug discovery. In: James AB (ed) Annual reports in medicinal chemistry, vol 33. Academic, New York, pp 375–384. https://doi.org/10.1016/S0065-7743(08)61100-8
- Hann M, Green R (1999) Chemoinformatics—a new name for an old problem? Curr Opin Chem Biol 3(4):379–383. https://doi.org/10.1016/s1367-5931(99)80057-x
- Gasteiger J (2006) Chemoinformatics: a new field with a long tradition. Anal Bioanal Chem 384(1):57–64. https://doi.org/10.1007/s00216-005-0065-y
- Engel T (2006) Basic overview of chemoinformatics. J Chem Inf Model 46(6):2267–2277. https://doi.org/10.1021/ci600234z
- Varnek A, Baskin II (2011) Chemoinformatics as a theoretical chemistry discipline. Mol Inform 30(1):20–32. https://doi.org/10.1002/minf.201000100
- Vogt M, Bajorath J (2012) Chemoinformatics: a view of the field and current trends in method development. Bioorg Med Chem 20(18):5317–5323. https://doi.org/10.1016/j.bmc.2012.03.030
- Brecher J (2008) Graphical representation standards for chemical structure diagrams. Pure Appl Chem 80(2):277–410. https://doi.org/10.1351/pac200880020277
- Food and Drug Administration Substance Registration System Standard Operation Procedure Substance Definition Manual. https://www.fda.gov/downloads/ForIndustry/DataStandards/SubstanceRegistrationSystem-UniqueIngredientIdentifierUNII/ucm127743.pdf. Accessed 13 Aug 2016
- Weininger D (1988) Smiles, a chemical language and information-system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 28(1):31–36. https://doi.org/10.1021/ci00057a005
- Weininger D, Weininger A, Weininger JL (1989) Smiles. 2. Algorithm for generation of unique smiles notation. J Chem Inf Comput Sci 29(2):97–101. https://doi.org/10.1021/ci00062a008
- McNaught A (2006) The IUPAC international chemical identifier: InChI—a new standard for molecular informatics. Chem Int 28:12–14
- Heller SR, McNaught AD (2009) The IUPAC international chemical identifier. Chem Int 31:7–9
- Stein SE, Heller SR, Tchekhovskoi DV, Pletnev IV IUPAC International Chemical Identifier (InChI), InChI version 1, software version 1.04 (2011), Technical Manual. http://www.inchi-trust.org/fileadmin/user_upload/software/inchi-v1.04/InChI_TechMan.pdf. Accessed 13 Aug 2016
- Ash S, Cline MA, Homer RW, Hurst T, Smith GB (1997) SYBYL line notation (SLN): a versatile language for chemical structure representation. J Chem Inf Comput Sci 37(1):71–79. https://doi.org/10.1021/ci960109j
- Homer RW, Swanson J, Jilek RJ, Hurst T, Clark RD (2008) SYBYL line notation (SLN): a single notation to represent chemical structures, queries, reactions, and virtual libraries. J Chem Inf Model 48(12):2294–2307. https://doi.org/10.1021/ci7004687
- Gakh AA, Burnett MN (2001) Modular chemical descriptor language (MCDL): composition, connectivity, and supplementary modules. J Chem Inf Comput Sci 41(6):1494–1499. https://doi.org/10.1021/ci000108y
- Gakh AA, Burnett MN, Trepalin SV, Yarkov AV (2011) Modular chemical descriptor language (MCDL): stereochemical modules. J Cheminform 3:5. https://doi.org/10.1186/1758-2946-3-5
- Panico R, Powell WH, Richter JC (1993) A guide to IUPAC nomenclature of organic compounds recommendations 1993. Blackwell Science, Oxford
- Favre HA, Hellwich K-H, Moss GP, Powell WH, Traynham JG (1999) Corrections to a guide to IUPAC nomenclature of organic compounds (IUPAC recommendations 1993). Pure Appl Chem 71(7):1328–1330
- Leigh GJ, Favre HA, Metanomski WV (1998) Principles of organic nomenclature. Blackwell Science, Oxford
- Dalby A, Nourse JG, Hounshell WD, Gushurst AKI, Grier DL, Leland BA, Laufer J (1992) Description of several chemical-structure file formats used by computer-programs developed at molecular design limited. J Chem Inf Comput Sci 32(3):244–255. https://doi.org/10.1021/ci00007a012
- Accelrys CTFile Formats. http://accelrys.com/products/informatics/cheminformatics/ctfile-formats/no-fee.php. Accessed 13 Aug 2016
- TRIPOS Mol2 File Format. http://tripos.com/data/support/mol2.pdf
- Warr WA (2011) Representation of chemical structures. Wiley Interdiscip Rev Comput Mol Sci 1(4):557–579. https://doi.org/10.1002/wcms.36
- Urbaczek S, Kolodzik A, Fischer JR, Lippert T, Heuser S, Groth I, Schuz-Gasch T, Rarey M (2011) NAOMI: on the almost trivial task of reading molecules from different file formats. J Chem Inf Model 51(12):3199–3207. https://doi.org/10.1021/ci200324e
- Akhondi SA, Kors JA, Muresan S (2012) Consistency of systematic chemical identifiers within and between small-molecule databases. J Cheminform 4:35. https://doi.org/10.1186/1758-2946-4-35
- Meng EC, Lewis RA (1991) Determination of molecular topology and atomic hybridization states from heavy-atom coordinates. J Comput Chem 12(7):891–898. https://doi.org/10.1002/jcc.540120716
- Baber JC, Hodgkin EE (1992) Automatic assignment of chemical connectivity to organic-molecules in the Cambridge structural database. J Chem Inf Comput Sci 32(5):401–406. https://doi.org/10.1021/ci00009a001
- Hendlich M, Rippmann F, Barnickel G (1997) BALI: automatic assignment of bond and atom types for protein ligands in the Brookhaven Protein Databank. J Chem Inf Comput Sci 37(4):774–778. https://doi.org/10.1021/ci9603487
- Urbaczek S, Kolodzik A, Groth I, Heuser S, Rarey M (2013) Reading PDB: perception of molecules from 3D atomic coordinates. J Chem Inf Model 53(1):76–87. https://doi.org/10.1021/ci300358c
- Young D, Martin T, Venkatapathy R, Harten P (2008) Are the chemical structures in your QSAR correct? QSAR Comb Sci 27(11–12):1337–1345. https://doi.org/10.1002/qsar.200810084
- Sayle RA (2010) So you think you understand tautomerism? J Comput Aided Mol Des 24(6–7):485–496. https://doi.org/10.1007/s10822-010-9329-5 View ArticlePubMedGoogle Scholar
- Katritzky AR, Hall CD, El-Dien B, El-Gendy M, Draghici B (2010) Tautomerism in drug discovery. J Comput Aided Mol Des 24(6–7):475–484. https://doi.org/10.1007/s10822-010-9359-z View ArticlePubMedGoogle Scholar
- Ferrari E, Saladini M, Pignedoli F, Spagnolo F, Benassi R (2011) Solvent effect on keto-enol tautomerism in a new beta-diketone: a comparison between experimental data and different theoretical approaches. New J Chem 35(12):2840–2847. https://doi.org/10.1039/c1nj20576e View ArticleGoogle Scholar
- Balabin RM (2009) Tautomeric equilibrium and hydrogen shifts in tetrazole and triazoles: focal-point analysis and ab initio limit. J Chem Phys 131(15):8. https://doi.org/10.1063/1.3249968 View ArticleGoogle Scholar
- Elguero J, Marzin C, Katritzky AR, Linda P (1976) The tautomerism of heterocycles. Advances in heterocyclic chemistry. Academic, New YorkGoogle Scholar
- Scior T, Bender A, Tresadern G, Medina-Franco JL, Martinez-Mayorga K, Langer T, Cuanalo-Contreras K, Agrafiotis DK (2012) Recognizing pitfalls in virtual screening: a critical review. J Chem Inf Model 52(4):867–881. https://doi.org/10.1021/ci200528d View ArticlePubMedGoogle Scholar
- Sitzmann M, Ihlenfeldt WD, Nicklaus MC (2010) Tautomerism in large databases. J Comput Aided Mol Des 24(6–7):521–551. https://doi.org/10.1007/s10822-010-9346-4 View ArticlePubMedPubMed CentralGoogle Scholar
- Pospisil P, Ballmer P, Scapozza L, Folkers G (2003) Tautomerism in computer-aided drug design. J Recept Signal Transduct Res 23(4):361–371. https://doi.org/10.1081/rrs-120026975 View ArticlePubMedGoogle Scholar
- Oellien F, Cramer J, Beyer C, Ihlenfeldt WD, Selzer PM (2006) The impact of tautomer forms on pharmacophore-based virtual screening. J Chem Inf Model 46(6):2342–2354. https://doi.org/10.1021/ci060109b View ArticlePubMedGoogle Scholar
- Todorov NP, Monthoux PH, Alberts IL (2006) The influence of variations of ligand protonation and tautomerism on protein-ligand recognition and binding energy landscape. J Chem Inf Model 46(3):1134–1142. https://doi.org/10.1021/ci050071n View ArticlePubMedGoogle Scholar
- Kalliokoski T, Salo HS, Lahtela-Kakkonen M, Poso A (2009) The effect of ligand-based tautomer and protomer prediction on structure-based virtual screening. J Chem Inf Model 49(12):2742–2748. https://doi.org/10.1021/ci900364w View ArticlePubMedGoogle Scholar
- Muchmore SW, Debe DA, Metz JT, Brown SP, Martin YC, Hajduk PJ (2008) Application of belief theory to similarity data fusion for use in analog searching and lead hopping. J Chem Inf Model 48(5):941–948. https://doi.org/10.1021/ci7004498 View ArticlePubMedGoogle Scholar
- Duarte HA, Carvalho S, Paniago EB, Simas AM (1999) Importance of tautomers in the chemical behavior of tetracyclines. J Pharm Sci 88(1):111–120. https://doi.org/10.1021/js980181r View ArticlePubMedGoogle Scholar
- Jang YH, Goddard WA, Noyes KT, Sowers LC, Hwang S, Chung DS (2002) First principles calculations of the tautomers and pK(a) values of 8-oxoguanine: implications for mutagenicity and repair. Chem Res Toxicol 15(8):1023–1035. https://doi.org/10.1021/tx010146r View ArticlePubMedGoogle Scholar
- Hastings J, Magka D, Batchelor C, Duan L, Stevens R, Ennis M, Steinbeck C (2012) Structure-based classification and ontology in chemistry. J Cheminform 4:8. https://doi.org/10.1186/1758-2946-4-8 View ArticlePubMedPubMed CentralGoogle Scholar
- Bobach C, Bohme T, Laube U, Puschel A, Weber L (2012) Automated compound classification using a chemical ontology. J Cheminform 4:40. https://doi.org/10.1186/1758-2946-4-40 View ArticlePubMedPubMed CentralGoogle Scholar
- Trepalin SV, Skorenko AV, Balakin KV, Nasonov AF, Lang SA, Ivashchenko AA, Savchuk NP (2003) Advanced exact structure searching in large databases of chemical compounds. J Chem Inf Comput Sci 43(3):852–860. https://doi.org/10.1021/ci025582d View ArticlePubMedGoogle Scholar
- Martin YC (2009) Let’s not forget tautomers. J Comput Aided Mol Des 23(10):693–704. https://doi.org/10.1007/s10822-009-9303-2 View ArticlePubMedPubMed CentralGoogle Scholar
- Milletti F, Storchi L, Sforna G, Cross S, Cruciani G (2009) Tautomer enumeration and stability prediction for virtual screening on large chemical databases. J Chem Inf Model 49(1):68–75. https://doi.org/10.1021/ci800340j View ArticlePubMedGoogle Scholar
- Greenwood JR, Calkins D, Sullivan AP, Shelley JC (2010) Towards the comprehensive, rapid, and accurate prediction of the favorable tautomeric states of drug-like molecules in aqueous solution. J Comput Aided Mol Des 24(6–7):591–604. https://doi.org/10.1007/s10822-010-9349-1 View ArticlePubMedGoogle Scholar
- Urbaczek S, Kolodzik A, Rarey M (2014) The valence state combination model: a generic framework for handling tautomers and protonation states. J Chem Inf Model 54(3):756–766. https://doi.org/10.1021/ci400724v View ArticlePubMedGoogle Scholar
- Gobbi A, Lee ML (2012) Handling of tautomerism and stereochemistry in compound registration. J Chem Inf Model 52(2):285–292. https://doi.org/10.1021/ci200330x View ArticlePubMedGoogle Scholar
- Warr WA (2010) Tautomerism in chemical information management systems. J Comput Aided Mol Des 24(6–7):497–520. https://doi.org/10.1007/s10822-010-9338-4 View ArticlePubMedGoogle Scholar
- Schleyer PV, Jiao HJ (1996) What is aromaticity? Pure Appl Chem 68(2):209–218View ArticleGoogle Scholar
- Lloyd D (1996) What is aromaticity? J Chem Inf Comput Sci 36(3):442–447. https://doi.org/10.1021/ci950158g View ArticleGoogle Scholar
- Cyranski MK, Krygowski TM, Katritzky AR, Schleyer PV (2002) To what extent can aromaticity be defined uniquely? J Org Chem 67(4):1333–1338. https://doi.org/10.1021/jo016255s View ArticlePubMedGoogle Scholar
- Randic M (2003) Aromaticity of polycyclic conjugated hydrocarbons. Chem Rev 103(9):3449–3605. https://doi.org/10.1021/cr9903656 View ArticlePubMedGoogle Scholar
- Stanger A (2009) What is… aromaticity: a critique of the concept of aromaticity-can it really be defined? Chem Commun 15:1939–1947. https://doi.org/10.1039/b816811c View ArticleGoogle Scholar
- Hückel E (1931) Quantentheoretische Beiträge zum Benzolproblem I. Die Elektronenkonfiguration des Benzols und verwandter Verbindungen. Z Phys 70:204–286View ArticleGoogle Scholar
- Hückel E (1932) Quantentheoretische Beiträge zum Benzolproblem II. Quantentheorie der induzierten Polaritäten. Z Phys 72:310–337View ArticleGoogle Scholar
- Aromaticity Perception. https://docs.eyesopen.com/toolkits/cpp/oechemtk/aromaticity.html. Accessed 23 July 2018
- Kekulé A (1865) Sur la constitution des substances aromatiques. Bull Soc Chim Paris 3:98–110Google Scholar
- Kekulé A (1866) Untersuchungen über aromatische Verbindungen. Justus Liebigs Ann Chem 137:129–196View ArticleGoogle Scholar
- Herndon WC (1973) Enumeration of resonance structures. Tetrahedron 29(1):3–12. https://doi.org/10.1016/s0040-4020(01)99369-x View ArticleGoogle Scholar
- Randic M (1976) Enumeration of the Kekule structures in conjugated hydrocarbons. J Chem Soc Faraday Trans 72:232–243. https://doi.org/10.1039/F29767200232 View ArticleGoogle Scholar
- Blazic BDJ, Trinajstic N (1982) Computer-aided enumeration and generation of the kekule structures in conjugated hydrocarbons. Comput Chem 6(3):121–132. https://doi.org/10.1016/0097-8485(82)80005-3 View ArticleGoogle Scholar
- Gutman I, Cyvin SJ (1987) A new method for the enumeration of kekule structures. Chem Phys Lett 136(2):137–140. https://doi.org/10.1016/0009-2614(87)80431-1 View ArticleGoogle Scholar
- Cai F, Shao HQ, Liu CG, Jiang YS (2005) An alternative strategy for count and storage of Kekule and longer range resonance valence bond structures. J Chem Inf Model 45(2):371–378. https://doi.org/10.1021/ci049770a View ArticlePubMedGoogle Scholar
- Rashid Z, Van Lenthe JH (2011) Generation of kekule valence structures and the corresponding valence bond wave function. J Comput Chem 32(4):696–708. https://doi.org/10.1002/jcc.21655 View ArticlePubMedGoogle Scholar
- Kearsley SK (1993) A quick robust method for assigning a kekule structure. Comput Chem 17(1):1–10. https://doi.org/10.1016/0097-8485(93)80022-6 View ArticleGoogle Scholar
- Hansen P, Zheng ML (1995) Assigning a kekule structure to a conjugated molecule. Comput Chem 19(1):21–26. https://doi.org/10.1016/0097-8485(94)00035-d View ArticleGoogle Scholar
- Blessington B (1995) A serious problem with computer-processing of stereochemistry in chemical-structure files—the need for standardization. Chirality 7(5):337–341. https://doi.org/10.1002/chir.530070505 View ArticleGoogle Scholar
- Martin E, Monge A, Duret JA, Gualandi F, Peitsch MC, Pospisil P (2012) Building an R&D chemical registration system. J Cheminform 4:11. https://doi.org/10.1186/1758-2946-4-11 View ArticlePubMedPubMed CentralGoogle Scholar
- Fourches D, Muratov E, Tropsha A (2010) Trust, but verify: on the importance of chemical structure curation in cheminformatics and QSAR modeling research. J Chem Inf Model 50(7):1189–1204. https://doi.org/10.1021/ci100176x View ArticlePubMedPubMed CentralGoogle Scholar
- Clark RD, Waldman M (2012) Lions and tigers and bears, oh my! three barriers to progress in computer-aided molecular design. J Comput Aided Mol Des 26(1):29–34. https://doi.org/10.1007/s10822-011-9504-3 View ArticlePubMedGoogle Scholar
- Egorova KS, Toukach PV (2012) Critical analysis of CCSD data quality. J Chem Inf Model 52(11):2812–2814. https://doi.org/10.1021/ci3002815 View ArticlePubMedGoogle Scholar
- Oprea T, Olah M, Ostopovici L, Rad R, Mracec M (2003) On the propagation of errors in the QSAR literature. In: Ford M, Livingstone D, Dearden J, Waterbeemd H (eds) EuroQSAR 2002 designing drugs and crop protectants: processes, problems and solutions, 2003rd edn. Blackwell, New York, pp 314–315Google Scholar
- Olah M, Mracec M, Ostopovici L, Rad R, Bora A, Hadaruga N, Olah I, Banda M, Simon Z, Mracec M, Oprea TI (2005) WOMBAT: world of molecular bioactivity. In: Chemoinformatics in drug discovery. Wiley-VCH Verlag GmbH & Co. KGaA, pp 221–239. https://doi.org/10.1002/3527603743.ch9
- Tiikkainen P, Bellis L, Light Y, Franke L (2013) Estimating error rates in bioactivity databases. J Chem Inf Model 53(10):2499–2505. https://doi.org/10.1021/ci400099q View ArticlePubMedGoogle Scholar
- Kim S, Thiessen PA, Bolton EE, Chen J, Fu G, Gindulyte A, Han LY, He JE, He SQ, Shoemaker BA, Wang JY, Yu B, Zhang J, Bryant SH (2016) PubChem substance and compound databases. Nucl Acids Res 44(D1):D1202–D1213. https://doi.org/10.1093/nar/gkv951 View ArticlePubMedGoogle Scholar
- Kim S (2016) Getting the most out of PubChem for virtual screening. Expert Opin Drug Discov 11(9):843–855. https://doi.org/10.1080/17460441.2016.1216967 View ArticlePubMedPubMed CentralGoogle Scholar
- Wang YL, Bryant SH, Cheng TJ, Wang JY, Gindulyte A, Shoemaker BA, Thiessen PA, He SQ, Zhang J (2017) PubChem BioAssay: 2017 update. Nucl Acids Res 45(D1):D955–D963. https://doi.org/10.1093/nar/gkw1118 View ArticlePubMedGoogle Scholar
- McEntyre J, Lipman D (2001) PubMed: bridging the information gap. Can Med Assoc J 164(9):1317–1319Google Scholar
- PubMed. http://www.ncbi.nlm.nih.gov/pubmed
- Bolton EE, Chen J, Kim S, Han LY, He SQ, Shi WY, Simonyan V, Sun Y, Thiessen PA, Wang JY, Yu B, Zhang J, Bryant SH (2011) PubChem3D: a new resource for scientists. J Cheminform 3:32. https://doi.org/10.1186/1758-2946-3-32 View ArticlePubMedPubMed CentralGoogle Scholar
- Bolton EE, Kim S, Bryant SH (2011) PubChem3D: conformer generation. J Cheminform 3:4. https://doi.org/10.1186/1758-2946-3-4 View ArticlePubMedPubMed CentralGoogle Scholar
- Kim S, Bolton EE, Bryant SH (2013) PubChem3D: conformer ensemble accuracy. J Cheminform 5:1. https://doi.org/10.1186/1758-2946-5-1 View ArticlePubMedPubMed CentralGoogle Scholar
- OpenEye OEChem C++ Toolkit, version 1.9.0; OpenEye Scientific Software Inc., Santa Fe, NM. http://www.eyesopen.com/oechem-tk
- OpenEye Quacpac C++ Toolkit, version 1.9.0; OpenEye Scientific Software Inc., Santa Fe, NM. http://www.eyesopen.com/quacpac-tk
- OpenEye OEDepict C++ Toolkit, version 1.9.0; OpenEye Scientific Software Inc., Santa Fe, NM. http://www.eyesopen.com/oedepict-tk
- OpenEye Lexichem C++ Toolkit, version 1.9.0; OpenEye Scientific Software Inc., Santa Fe, NMGoogle Scholar
- Warr WA (2011) Some trends in chem(o)informatics. In: Bajorath J (ed) Chemoinformatics and computational chemical biology, vol 672. Methods in molecular biology. Humana Press Inc., Totowa, pp 1–37. https://doi.org/10.1007/978-1-60761-839-3_1 View ArticleGoogle Scholar
- Fanton M, Floris M, Cristiani A, Olla S, Medda R, Sabbadin D, Bulfone A, Moro S (2013) MMsDusty: an alternative InChI-based tool to minimize chemical redundancy. Mol Inform 32(8):681–684. https://doi.org/10.1002/minf.201300061 View ArticlePubMedGoogle Scholar
- Rogers FB (1963) Medical subject heading. Bull Med Libr Assoc 51:114–116PubMedPubMed CentralGoogle Scholar
- Audi G, Bersillon O, Blachot J, Wapstra AH (2003) The NUBASE evaluation of nuclear and decay properties. Nucl Phys A 729(1):3–128. https://doi.org/10.1016/j.nuclphysa.2003.11.001 View ArticleGoogle Scholar
- Wiberg N (2007) Natürliche Nuklide. In: Lehrbuch der Anorganischen Chemie, 102. Auflage. De Gruyter, Berlin, p 2001Google Scholar
- Ehrlich HC, Rarey M (2012) Systematic benchmark of substructure search in molecular graphs—From Ullmann to VF2. J Cheminform 4:13. https://doi.org/10.1186/1758-2946-4-13 View ArticlePubMedPubMed CentralGoogle Scholar
- O’Boyle NM (2012) Towards a universal SMILES representation—a standard method to generate canonical smiles based on the InChI. J Cheminform 4:22. https://doi.org/10.1186/1758-2946-4-22 View ArticlePubMedPubMed CentralGoogle Scholar
- Clark AM (2011) Accurate specification of molecular structures: the case for zero-order bonds and explicit hydrogen counting. J Chem Inf Model 51(12):3149–3157. https://doi.org/10.1021/ci200488k View ArticlePubMedGoogle Scholar
- Brecher J (2006) Graphical representation of stereochemical configuration—(IUPAC recommendations 2006). Pure Appl Chem 78(10):1897–1970. https://doi.org/10.1351/pac200678101897 View ArticleGoogle Scholar
- Razinger M, Balasubramanian K, Perdih M, Munk ME (1993) Stereoisomer generation in computer-enhanced structure elucidation. J Chem Inf Comput Sci 33(6):812–825. https://doi.org/10.1021/ci00016a003 View ArticlePubMedGoogle Scholar
- Perdih M, Razinger M (1994) Stereochemistry and sequence rules—a proposal for modification of Cahn–Ingold–Prelog system. Tetrahedron Asymmetry 5(5):835–861. https://doi.org/10.1016/s0957-4166(00)86237-0 View ArticleGoogle Scholar
- Cieplak T, Wisniewski JL (2001) A new effective algorithm for the unambiguous identification of the stereochemical characteristics of compounds during their registration in databases. Molecules 6(11):915–926. https://doi.org/10.3390/61100915 View ArticleGoogle Scholar
- Wild DJ (2009) Grand challenges for cheminformatics. J Cheminform 1:1. https://doi.org/10.1186/1758-2946-1-1 View ArticlePubMedPubMed CentralGoogle Scholar
- Schneider G (2010) Virtual screening: an endless staircase? Nat Rev Drug Discov 9(4):273–276. https://doi.org/10.1038/nrd3139 View ArticlePubMedGoogle Scholar
- Cahn RS, Ingold C, Prelog V (1966) Specification of molecular chirality. Angew Chem Int Ed Engl 5(4):385–415. https://doi.org/10.1002/anie.196603851 View ArticleGoogle Scholar
- Ertl P (2010) Molecular structure input on the web. J Cheminform 2:1. https://doi.org/10.1186/1758-2946-2-1 View ArticlePubMedPubMed CentralGoogle Scholar
- Morgan HL (1965) The generation of a unique machine description for chemical structures—a technique developed at chemical abstracts service. J Chem Doc 5(2):107–113. https://doi.org/10.1021/c160017a018 View ArticleGoogle Scholar
- Figueras J (1993) Morgan revisited. J Chem Inf Comput Sci 33(5):717–718. https://doi.org/10.1021/ci00015a009 View ArticleGoogle Scholar
- Ihlenfeldt WD, Takahashi Y, Abe H, Sasaki S (1994) Computation and management of chemical-properties in CACTVS—an extensible networked approach toward modularity and compatibility. J Chem Inf Comput Sci 34(1):109–116. https://doi.org/10.1021/ci00017a013 View ArticleGoogle Scholar
- Ihlenfeldt WD, Gasteiger J (1994) Hash codes for the identification and classification of molecular-structure elements. J Comput Chem 15(8):793–813. https://doi.org/10.1002/jcc.540150802 View ArticleGoogle Scholar
- CACTVS Chemoinformatics Toolkit version 3.365, Xemistry GmbH, Lahntal, Germany. http://www.xemistry.com
- NCBI C++ Toolkit. http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/
- InChI Trust, InChI software version 1.04 for Standard and Non-Standard InChI/InChIKey. http://www.inchi-trust.org/fileadmin/user_upload/software/inchi-v1.04/INCHI-1-API.ZIP
- PubChem FTP. ftp://ftp.ncbi.nlm.nih.gov/pubchem/
- Kim S, Thiessen PA, Bolton EE, Bryant SH (2015) PUG-SOAP and PUG-REST: web services for programmatic access to chemical information in PubChem. Nucl Acids Res 43(W1):W605–W611. https://doi.org/10.1093/nar/gkv396 View ArticlePubMedGoogle Scholar