Line notations are linear representations of chemical structures that encode the connection table and (usually) the stereochemistry of a molecule as a line of text. They are widely used for storing, representing, communicating and checking the identity of chemical structures. Their popularity derives from one or more of the following: they encode the chemical structure in a compact form; they may be human-readable and/or human-writable; they are easily entered into software (for example, by copying and pasting into a text entry box on a website or a dialog box in a GUI, or entered into a spreadsheet cell); they may be canonical (that is, provide a unique representation for a particular molecule), in which case they may easily be used to check identity, search databases or even search the web.
Line notations typically do not allow the incorporation of additional information beyond the connection table and its associated chemistry (with the notable exception of SYBYL Line Notation [2, 3]). Nevertheless, even where the underlying data is stored in a 2D or 3D file format, as in several web databases (for example, PubChem), linear representations of the data are usually provided for convenience. Apart from IUPAC nomenclature, the two most widely used line notations, and the focus of the current work, are the SMILES (Simplified Molecular Input Line Entry System) string developed by Weininger and Daylight Chemical Information Systems, and IUPAC’s InChI (International Chemical Identifier) representation [8, 9]. Others include SLN (SYBYL Line Notation), ROSDAL (from the Beilstein Institute), WLN (Wiswesser Line Notation), MCDL (Modular Chemical Descriptor Language [12, 13]), the InChIKey (a hashed representation of the InChI) and more [1, 14–20].
The SMILES format is the most popular line notation in use today. Created by David Weininger in 1986 at the US Environmental Research Laboratory (USEPA), and further developed at the company he co-founded, Daylight Chemical Information Systems, the SMILES format is particularly attractive as it is easily learnt, is both human-readable and -writable, and encodes stereochemistry in an intuitive way. Since no formal specification of the SMILES format was ever published and there are several ambiguities that have led to differences in implementation, in 2007 Craig James (eMolecules, Inc., and formerly of Daylight) initiated a community approach to develop a specification for SMILES, the OpenSMILES specification. The SMILES format is not without some drawbacks: it is focused on molecules whose bonds fit the 2-electron valence model, it handles a limited array of stereochemistry types, and as yet there is no standard for handling aromaticity. However, perhaps the greatest limitation of the SMILES format is that there is no standard way to generate a canonical representation. While Weininger et al. did publish a canonicalisation procedure (CANGEN) for SMILES, the procedure did not include a treatment of stereochemistry, one of the most difficult aspects of the problem. Daylight subsequently provided a commercial product to generate canonical SMILES, but as the algorithm was proprietary, other commercial and open-source software developers created their own algorithms for generating canonical SMILES, all of which differ from each other and none of which has been published.
In 1999, the need for a community standard for a canonical linear representation led to a proposal by Steve Heller and Steve Stein at the National Institute of Standards and Technology (NIST) in the US for a new representation, the InChI (International Chemical Identifier), which was subsequently developed as an IUPAC initiative in collaboration with NIST. The first version of the InChI was released in 2005, and in 2009 the InChI Trust was formed to oversee its development. The goal of the InChI is to provide a canonical representation that can be used to link information from different databases on the same molecules. To do this, the InChI algorithm combines a normalisation procedure, a canonicalisation algorithm, and a layered structure that helps identify isomers.
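The layered structure can be illustrated with a few lines of string handling. The following is a minimal sketch, not part of the InChI software: the function name `split_inchi_layers` is hypothetical, and the code assumes a well-formed InChI in which the formula is the only layer without a lowercase prefix letter and no prefix occurs twice (repeated layers, as found in some non-standard InChIs, would overwrite each other).

```python
from typing import Dict


def split_inchi_layers(inchi: str) -> Dict[str, str]:
    """Split an InChI string into its layers, keyed by layer prefix.

    Hypothetical helper for illustration; it does not validate the InChI.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string: %r" % inchi)
    parts = inchi[len(prefix):].split("/")
    layers = {"version": parts[0]}
    for part in parts[1:]:
        if part[:1].islower():
            # Prefixed layer, e.g. "c1-2-3" -> key "c", value "1-2-3".
            layers[part[0]] = part[1:]
        else:
            # The formula layer carries no prefix letter.
            layers["formula"] = part
    return layers


# Ethanol and dimethyl ether are constitutional isomers: the formula
# layers agree, while the connection ("c") layers differ.
ethanol = split_inchi_layers("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
ether = split_inchi_layers("InChI=1S/C2H6O/c1-3-2/h1-2H3")
print(ethanol["formula"] == ether["formula"])  # same formula layer
print(ethanol["c"], "vs", ether["c"])          # different connectivity
```

Comparing InChIs layer by layer in this way is what allows isomers of a given type to be recognised: the first layer at which two strings diverge indicates the kind of difference between the structures.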
This work describes a method to generate canonical SMILES using canonical labels from the InChI. While other canonicalisation methods have been developed that take stereochemistry into account (for example, Koichi et al. as well as all of the (unpublished) methods used by the various cheminformatics toolkits), only the InChI is suitable for the development of a standard canonical SMILES string that can easily be supported by many different software libraries, as there exists only a single implementation, the code for which is freely available under an Open Source license. This implementation has been incorporated into several Open Source cheminformatics libraries (Open Babel, the Chemistry Development Kit, RDKit, Chemkit and Indigo) as well as proprietary software from several companies (for example, ACD/ChemSketch, CACTVS, JChem, and planned for OEChem). This means that all of these programs can generate the same InChI as the official InChI software, and thus they have the capability to generate the same canonical SMILES.
Here I describe Universal SMILES and Inchified SMILES, easily-implemented methods that use the canonical labels provided by the InChI to generate canonical SMILES. The term Universal is used as the method can be universally adopted by any software with access to the InChI library or executable, without the need for any changes to the InChI software. These SMILES strings do not use any extensions to the SMILES standard, and so are completely interchangeable with the existing SMILES strings used by many databases. The advantage of replacing existing SMILES strings with Universal SMILES or Inchified SMILES is that the ease of use and readability of SMILES strings is enhanced by the indexing and linking ability associated with a canonical representation such as the InChI.
This is the first description of a method to generate canonical SMILES that takes stereochemistry into account. Apart from yaInChI (a modification of the InChI by Cho et al.), it is also the first time that the canonical labels from the InChI have been used to generate an alternative canonical representation, although in fact the idea itself has previously been proposed by Murray-Rust (on the Open Babel mailing list, February 2005). However, there are other studies that share the idea of exploiting information contained in the InChI for purposes other than uniquely identifying a molecule. Thalheim et al. implemented a tautomer enumeration procedure based on information contained in the InChI. The InChI normalises all (supported) tautomers to the same representation, and stores the normalised information in the mobile hydrogen layer of the InChI, which describes how one or more hydrogens are shared between a set of heteroatoms. The authors extracted this layer, and developed an algorithm that generated all of the tautomers consistent with it. The layered structure of the InChI can also be exploited to identify isomers of various types; in a crystallographic study, Fábián and Brock used the enantiomer layer to identify a particular class of racemic crystal, kryptoracemates, where the enantiomers are not related by space-group symmetry.
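Both kinds of reuse can be sketched directly on the InChI string. The code below is an illustration only and is not the procedure of either cited study: the helper names are hypothetical, the regular expression assumes mobile hydrogen groups written in the form `(H,5,6)` or `(H2,1,2,3)` within the `/h` layer, and the enantiomer test assumes single-component standard InChIs in which mirror images differ only in the `/m` layer. The example strings are the standard InChIs usually quoted for L- and D-alanine and should be checked against the InChI software before being relied upon.

```python
import re
from typing import List, Tuple

# Matches one mobile hydrogen group, e.g. "(H,5,6)" or "(H2,1,2,3)".
MOBILE_H = re.compile(r"\((H\d*-?),([\d,]+)\)")


def mobile_hydrogen_groups(inchi: str) -> List[Tuple[str, List[int]]]:
    """Return (hydrogen count, atom labels) for each mobile-H group."""
    for part in inchi.split("/")[1:]:
        if part.startswith("h"):  # first hydrogen layer only
            return [(count, [int(a) for a in atoms.split(",")])
                    for count, atoms in MOBILE_H.findall(part)]
    return []


def without_layer(inchi: str, prefix: str) -> str:
    """Drop every layer that starts with the given prefix letter."""
    head, *rest = inchi.split("/")
    return "/".join([head] + [p for p in rest if not p.startswith(prefix)])


def are_enantiomers(inchi_a: str, inchi_b: str) -> bool:
    """True if the two InChIs differ, but only in the /m layer."""
    return (inchi_a != inchi_b
            and without_layer(inchi_a, "m") == without_layer(inchi_b, "m"))


l_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

# One hydrogen is shared between atoms 5 and 6 (the carboxylic oxygens).
print(mobile_hydrogen_groups(l_ala))
print(are_enantiomers(l_ala, d_ala))
```

A tautomer enumerator would go on to place the shared hydrogens on each permitted combination of the listed atoms; that step requires the full connection table and is not attempted here.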
I propose two approaches for the generation of a canonical SMILES based on the InChI, one of which includes the normalisation steps introduced by the InChI:
The Inchified SMILES can be considered as a canonical SMILES string that corresponds to the Standard InChI. All of the normalisations applied to the structure by the InChI are passed on to the Inchified SMILES, and there is a one-to-one relationship between the two.
In contrast, the Universal SMILES retains the original structure (and tautomeric state) but uses the canonical labels from the InChI to create a canonical SMILES string. It can be considered a drop-in replacement for existing SMILES, with the added benefit of being a canonical representation.
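The underlying idea, that canonical atom labels fix the order in which a SMILES string is written, can be shown with a toy writer. This is a rough sketch under strong assumptions and is not the procedure given in the Methods section: the function `write_smiles` is hypothetical, it handles only acyclic molecules with single bonds and organic-subset atoms, and the labels are typed in by hand (here following the connection layer `c1-3(2)4` of propan-2-ol) rather than obtained from the InChI library.

```python
from typing import Dict, List, Optional, Tuple


def write_smiles(symbols: Dict[int, str],
                 bonds: List[Tuple[int, int]]) -> str:
    """Write a SMILES string by walking atoms in canonical-label order.

    `symbols` maps each canonical label to an element symbol and
    `bonds` lists bonded pairs of labels. Acyclic, single bonds only.
    """
    neighbours: Dict[int, List[int]] = {label: [] for label in symbols}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    def visit(atom: int, parent: Optional[int]) -> str:
        # Visit unvisited neighbours in increasing label order.
        children = sorted(n for n in neighbours[atom] if n != parent)
        text = symbols[atom]
        for i, child in enumerate(children):
            branch = visit(child, atom)
            # Every child except the last is written as a branch.
            text += branch if i == len(children) - 1 else "(" + branch + ")"
        return text

    # Start from the atom with the lowest canonical label.
    return visit(min(symbols), None)


# Propan-2-ol: carbons 1-3 and oxygen 4, with atom 3 at the centre.
atoms = {1: "C", 2: "C", 3: "C", 4: "O"}
print(write_smiles(atoms, [(1, 3), (2, 3), (3, 4)]))
print(write_smiles(atoms, [(3, 4), (3, 2), (1, 3)]))  # same string
```

Because the traversal depends only on the labels and not on the order in which atoms or bonds were supplied, any input describing the same molecule yields the same string, which is the property required of a canonical representation.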
The Methods section describes how to generate Inchified and Universal SMILES. The Results section covers how these approaches were tested by implementing them as part of the Open Babel cheminformatics toolkit. Additional comments on the implementation as well as the implications of a widely-available standard canonical form for SMILES are contained in the Discussion.