Machines first, humans second: on the importance of algorithmic interpretation of open chemistry data
© Clark et al.; licensee Springer. 2015
Received: 24 November 2014
Accepted: 23 February 2015
Published: 22 March 2015
The current rise in the use of open lab notebook techniques means that an increasing number of scientists make chemical information freely and openly available to the entire community as a series of micropublications that are released shortly after the conclusion of each experiment. We propose that this trend be accompanied by a thorough examination of data sharing priorities. We argue that the most significant immediate beneficiary of open data is in fact chemical algorithms, which are capable of absorbing vast quantities of data and using it to present concise insights to working chemists, on a scale that could not be achieved by traditional publication methods. Making this goal practically achievable will require a paradigm shift in the way individual scientists translate their data into digital form, since most contemporary methods of data entry are designed for presentation to humans rather than consumption by machine learning algorithms. We discuss some of the complex issues involved in fixing current methods, as well as some of the immediate benefits that can be gained when open data is published correctly using unambiguous machine readable formats.
Keywords: Cheminformatics; File formats; Open lab notebooks; Public data; Machine learning
The increasing availability of freely accessible data for chemical compounds and their associated properties and web links is driving a significant shift in the way research is carried out. The multitude of public databases [1-6], freely distributed vendor compound libraries (see endnote a) and directly shared lab notebooks make it possible for scientists to prospectively gather together a large knowledgebase. The data may be useful to test a hypothesis in the laboratory or to build computational models. Traditionally this process involved scouring the peer reviewed literature, either online through paywalls or physically within the walls of a library, and in some cases perusing privately collected data on the subject. The reuse of such data may require data licensing, and we have suggested some rules that could be helpful.
Despite the major shift that is under way, there is an important caveat: many of the hosts of online data do not necessarily give proper consideration to what may well be the most important consumer of their data, namely software algorithms, especially at a time when the ongoing development of the semantic web is hyperdependent on algorithms and mappings. A scientific publication is typically downloaded and perused by hundreds or perhaps thousands of humans, but the number of people who carefully study the data content, by examining the constituent chemical structures, physical properties, reaction schemes, spectral assignments, etc., is usually just a handful. The inherently low scalability of scientists’ time is in stark contrast with the ever increasing ability of software algorithms to assimilate vast quantities of data and deliver meaningful insights that could not have been observed by more traditional means. The ability of a well-designed informatics platform to productively use as much data as can be made available means that in principle every publicly available scientific data point that is relevant to a machine learning algorithm’s domain should be injected into the training set. Were this ideal state of affairs to be achieved, it would mean that every hard-won experimental result would have its chance to inform future experiments, rather than languishing in obscurity. Chemists would be able to benefit from all prior art within the field, and the quality of insights would improve over time as the volume of data increases and algorithms are improved.
While there have been many efforts to extract such data from the literature, there are major flaws with the methods used for extraction. The root cause is that the data entry is seldom done by the scientists who were responsible for the experiment: for the most part, machine readable data from the published literature is created by paid curators or algorithms designed to extract information from the intractable formats used by the primary literature and patents. A mistake made by a human curator, or an algorithmic extraction method, is unlikely to ever be verified by an expert familiar with the original experiment, which means that even if the provenance of the data is recorded (i.e. a citation to the original source), it is statistically improbable that it will be verified once it is incorporated into a database.
The reality of machine readable data in 2015 is that most collections of chemical structures and properties have been laundered through a number of data entry sources, few of which record the original pre-digital origin, and even fewer of which were created by scientists who are both connected with the research and have a personal vested interest in ensuring that the digitally represented version is correct. While scientists take great pains to ensure that graphical figures in their manuscripts are free of error, since it is a career-affecting embarrassment to publish incorrect data for cognition by other humans, there is no such community-enforced covenant for data that is intended for consumption by algorithms.
The disconnect between human- and machine-readable content also gets to the heart of the notion of scalability of scientific data. In most parts of the contemporary technology industry, software scalability refers to the ability to handle larger numbers of bytes, whether it be by ramping up database storage from gigabytes to terabytes to petabytes, or by serving millions of web page views per unit time. For the experimental sciences, the critical limitation is the evaporation of context. For example, a scientist who has been working on a project for a few weeks could have each experiment written down in shorthand notation in a paper notebook, and easily recall the remaining details from memory. After a year, shorthand notes and abbreviated sketches may be insufficient; once the lab notebooks start to pile up, other scientists start making use of the recorded processes, and eventually the original scientist moves on to another project or leaves the institution, an experimental record is seriously deficient without detailed explanation. It is all too often the case that there is insufficient context to recreate what was once institutional knowledge: the science is now effectively lost. This notion of scalability across time and personnel is a consistent entropic trend within experimental research groups, which is managed to some extent by executively curating the information that is deemed most worth preserving, and documenting it in more detail. This is formalised when preparing a manuscript for publication, or writing a thesis or research report. For releasing open data directly to the Internet, however, these mechanisms are stripped away: data that can be consumed in real time by a complete stranger on the other side of the world, or by a software algorithm, is completely dependent on whatever context was contained within the electronic document at the moment it was released. 
Scientific data with incomplete context can be corrected by an expert, who can infer missing information from personal knowledge, from the literature, or by conducting additional experiments in order to obtain the missing information. But these steps are the very definition of an unscalable process, and indeed this is the very problem that open data is attempting to solve.
Formats such as PDF files, HTML pages, word processor documents, and bitmapped or vector graphics are effectively dead formats, as far as machine interpretation of chemistry is concerned. There are efforts to extend these formats with chemistry enabled capabilities, an example being Chem4Word, but this has limited reach and capability relative to the overall needs for data access. While there has been significant success in many fields regarding the interpretation of human readable text, the obvious example being Internet search engines, the same cannot be said for chemical structures, which are a fundamental datatype in chemistry. Because chemical structures, and meta-groups of structures such as reaction schemes, are represented using opaque formats intended only for visual display, almost all published chemical information is essentially dark data. Noble efforts to extract this information by text mining of chemical names [12-15] or optical structure recognition [16-18] have resulted in an error rate that is so high that it is arguably making the data scarcity problem worse. Injecting such data into the overall knowledgebase without provenance degrades any effort to use this information. On the other hand, efforts to encourage scientists to publish quasi-formatted data, such as Excel spreadsheets, online collaborative documents, or comma separated text files with SMILES or InChI hash codes, are problematic. While these formats have a much higher degree of machine interpretability than those designed only for visual presentation, they are highly flawed due to a combination of incompleteness of description and high degrees of freedom, which join forces to ensure that such data sources are rarely meaningful to software without an expert scientist on hand to provide the missing context.
The thesis of this article is that chemical data in general, and freely available open data in particular, needs to undergo an inversion of priorities: whether explicitly or not, when scientists publish chemical information, their first and most important customer base is software algorithms, while their secondary audience is human beings familiar with the subject material. The justification for this ranking is quite simple: machines are difficult to please. They have no ability to acquire context, and whenever they are required to make a judgment call, they are only as good as the foresight of their programmer, who needs to have anticipated any possible form of ambiguity and preemptively designed a foolproof solution for resolving it. Since this is almost never completely the case for unsupervised algorithms, it is generally appropriate to assume that when handing over data with more than one possible interpretation, the algorithm will end up guessing which is correct, and frequently guessing wrong. And, to make matters worse, the results of these interpretation guesses are often stored in persistent form, which happens every time a format interconversion occurs, meaning that data that was initially flawed and incomplete becomes even more so as it is propagated. This is, in a nutshell, why most chemical data is inaccurate [19-23]. The solution to this problem is to bring the originating scientist directly into the loop, and ensure that they are involved in making sure that the data is meaningful to software, and by induction, therefore can also be made meaningful to other scientists. 
While much of the burden for this transformation will be dependent on greater awareness and training of the experimental scientists who create the data, the expectation of progress is only realistic if it can march in lock-step with improvement of the standard tools that chemists use for data entry, as well as improvements to the data submission standards mandated by those in charge of data collection (e.g. publishers, librarians, database curators, etc.). This parallels an increasing need, especially in academia and early career immersion, in routine procedures regarding structure representation and searching.
The idealised workflow for publishing open chemical data consists of four steps:
1. The scientist (who was directly involved with the research) enters the data, which typically includes structure diagrams, numbers, and other annotations;
2. The data is sent to an algorithm which attempts to parse the data, and in the event that any data has 0 or >1 possible interpretations, the problem is reported or warnings issued, and the data is rejected;
3. The scientist views the data as rendered according to the interpretation of the algorithm; once this is consistent with the scientist’s original assertion, it can be released openly, in its raw, machine-interpretable form;
4. A service can be conveniently invoked to turn the machine-friendly data into diagrams that can be viewed in a form that is most convenient to any scientist who wishes to view the data, and can be easily embedded within a common manuscript format.
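The parse-and-confirm part of this workflow can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `CANDIDATES` table, the `interpret` and `submit` functions, and the treatment of "Bz" as an ambiguous label are hypothetical and do not describe any existing service. The point is only that a value is accepted when it has exactly one interpretation, and that nothing is released until the scientist has confirmed the rendering.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical lookup table: each label maps to every expansion the parser knows.
# "Bz" is assumed here to be used inconsistently between research groups.
CANDIDATES = {
    "Ph": ["C6H5"],
    "Bz": ["C6H5CO", "C6H5CH2"],
}


@dataclass
class ParseResult:
    accepted: bool
    interpretation: Optional[str] = None
    problems: List[str] = field(default_factory=list)


def interpret(label: str) -> ParseResult:
    """Accept a label only if it has exactly one possible interpretation."""
    options = CANDIDATES.get(label, [])
    if len(options) == 1:
        return ParseResult(True, options[0])
    if not options:
        return ParseResult(False, problems=[f"{label!r}: no known interpretation"])
    return ParseResult(False, problems=[f"{label!r}: ambiguous, could be {options}"])


def submit(labels: List[str], confirm: Callable[[List[str]], bool]) -> dict:
    """Reject on any parsing problem, then release only what the scientist confirms."""
    results = [interpret(label) for label in labels]
    problems = [p for r in results for p in r.problems]
    if problems:
        return {"released": False, "problems": problems}
    rendered = [r.interpretation for r in results]
    # The scientist inspects the algorithm's interpretation before release.
    if not confirm(rendered):
        return {"released": False, "problems": ["scientist rejected the rendering"]}
    return {"released": True, "data": rendered}
```

In this sketch the `confirm` callback stands in for the scientist viewing the rendered interpretation; in a real tool this would be an interactive step rather than a function call.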
We are at this point particularly concerned with chemical structure representations, their composition within larger schemes such as reactions, and their association with measurement data such as physical properties. This approach can be extended to analytical data including, where feasible, validation checking between spectra and their associated compounds [24,25], or CIF checking. In the greater scheme of things, the amount of detail and nuance contained in any scientific experiment involves far more context than just molecules and corresponding data, but chemical structures and simple properties are a good place to start, since they are so fundamental: without representing these well, there is little hope for any of the remaining data. The status quo for data entry and representation of structures and properties leaves considerable room for improvement.
In the remainder of this article, we will discuss some of the existing services that are working towards this idealised workflow, some of the common pitfalls, and practical methods for working around them.
The second example, shown in Figure 2(c) and (d) for mobile and web results respectively, shows a structural representation of a drug material: aminophylline. It is not immediately obvious that there is anything wrong with the structure, given that the active ingredient is drawn correctly, and the adduct is present. However, the synonyms that have been imported for the record are quite explicit about the adduct ratio being 2:1, which makes the structure inconsistent with the primary name and all of the synonyms. It is not necessarily clear to most chemists whether the structure should be modified, or the name, or whether the distinction is important. In this case a resolution is quite likely to be obtained, because the compound in question is a well studied drug, and there is a fair chance that an expert with specific knowledge will encounter the datum and be able to provide an authoritative correction. However, there are tens of millions of compounds that are much more obscure, and any of them could have an accidental extra methylene, an incorrect chiral centre, or any number of errors that still encode molecules that are valid drawings. A structure that is valid, but represents the wrong molecule, will never be corrected by an algorithm, and will probably never be encountered by the handful of scientists who were involved in its use or synthesis. In the large majority of these cases, the mechanism by which the corrupted data was injected into the greater corpus of knowledge has created a permanent stain.
One of the major factors behind the increasing availability of chemical structure data is the business value that is associated with a vendor of chemicals making their wares as easy as possible to find. Making it simple for anyone to incorporate structures and product identifiers into a generic chemical searching service is a clear value proposition, and an additional encouraging feature of these data sources is that a company that is responsible for selling a physical package with a particular chemical compound has a high degree of liability, and is hence motivated to make a reasonable effort to correctly represent the data.
Unfortunately these primary sources are diluted by providers of large scale high throughput screening data, who prime their customers to expect a certain degree of noise, and the potentially poor state of the informatics component is just one of the many failure modes for experiments that are designed to use quantity to compensate for lack of quality. Perhaps more problematic is the large number of companies that collate vendor catalogues from many other sources, losing much of the provenance along the way, and introducing layers of errors that cannot be traced back to a single source. Many of these repackaged vendor libraries have been submitted to public databases such as PubChem and ChemSpider; many of the companies responsible are no longer contactable, or have no business value proposition for propagating error corrections. It should be noted that the hosts of the ChemSpider database developed more stringent acceptance criteria regarding vendor catalogues soon after the initial release of the database. Coupled with their pre-filtering efforts, and with direct feedback to the vendors themselves to encourage clean-up of their data, this has resulted in improved data quality not only in later depositions into ChemSpider but likely also for the community in general, although this is an ongoing effort.
From a data quality perspective, the most promising property of open lab notebooks is their directness. The term refers to a specific kind of electronic lab notebook that is made openly available to the scientific community shortly after the experiment writeup is complete, circumventing the usual lengthy publication cycle and any proprietary access restrictions. Typically a data unit, whether it be a reaction, a measurement, or a characterisation analysis, is prepared and released directly by the scientist who performed the experiment, and in some cases a second opinion is provided by a principal investigator or reviewer (albeit with a much faster turnaround time than for conventional publication, and also foregoing the requirement of novelty). This means that not only is it possible to find out the individual and organisation responsible for the contribution, but it also introduces the opportunity for the experimental scientist, whose knowledge at that point exceeds that of anyone else on this specific piece of data, to verify that the transmission of the data was carried out correctly.
A pertinent example is ChemSpider SyntheticPages (CSSP), which is an online “micro-publishing” site serving chemists interested in chemical syntheses. Chemists are encouraged to publish the details of their experiments in order to communicate their work. CSSP uses a template-based entry form and multimedia support including interactive display of various types of analytical data. The articles are reviewed by the CSSP editorial board, made up of university professors, and are then subject to further peer review through public commentary after publication. The pre-publication peer review is generally very fast (24–48 hours), and edits can be made even after publication, as CSSP is a hybrid publication-database. Each micro-publication includes a digital object identifier (DOI), making the CSSP contribution a citable object on a CV.
It is the scientist-to-Internet transmission step that we believe is in most need of attention. Most chemists working to produce new knowledge in experimental laboratories are not trained cheminformaticians, and have a strong tendency to follow currently accepted best practices for documenting their results. At the present time, this typically involves using off-the-shelf documentation software, such as the ubiquitous Microsoft Word and Excel, and software such as ChemDraw or ChemDoodle that is specially designed to help chemists create graphics for incorporation into such general purpose packages. Unfortunately the use of these software tools all too often makes correct machine interpretation of the data impossible: even in cases where data from drawing packages is available, the reality is that these tools are designed for creation of diagrams, not machine interpretable data, and there is no guidance as to which visual aids are completely and unambiguously meaningful to an algorithm. There are documented standards for visual representation [33,34], but there has been little effort to implement these for the purpose of lossless interconversion between presentation and informatics.
Much attention of late has been given to modern online collaborative tools, such as using Google Docs to coauthor and share content, and using electronic lab notebooks (ELNs) with a blog-like interface. While excellent for sharing data in real time, they do nothing to solve the problem of machine interpretability of chemical data. Freeform text and uploading of arbitrary supporting files give the maximum scope for scientists to describe their experiments, but this is also the worst case scenario for creating a fully automated script to gather diverse data into a single collection of relevant content in order to provide actionable intelligence. Some progress in terms of checking data formats is being made by the incorporation of chemistry specific components into the Labtrove platform.
The pitfalls of using line notation for structures also apply when using database references in lieu of structure, which is an option when all of the entries are known to already be in the database, e.g. using ChemSpider ID codes, but this is equally destructive of data. A public database will typically have one globally preferred structure representation which has been normalised and drawn in some preferred manner which is not necessarily most appropriate for the task at hand. Using external identifiers also introduces a slew of additional issues, e.g. if access to the Internet resource is interrupted, the data becomes unusable until it is restored. Also, records can be changed or deleted, and there is always the possibility that the database provider may one day cease to provide the service at all.
Besides issues with structures, there are many other reasons to avoid using an overly simplified text representation when properties are being included. For example, a comma delimited text file might contain a column headed “IC50”, in which each following row holds a number. It is pervasive practice to omit any further information such as units, errors, target, sample size, conditions, etc., which means that data stored in these files contains an enormous amount of implied context. If the data is being shared between two scientists working on the same project this may not be an issue, but if it is being uploaded into an aggregate dataset for purposes of machine learning or database reference, it is worse than useless, due to its lack of provenance and context.
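As a hedged illustration of what making the context explicit could look like, the following Python sketch refuses to accept a measurement unless units, target and source are stated. The `Measurement` class, the field names, the `from_csv_row` and `to_molar` helpers and the unit table are all hypothetical choices made for this example, not an established schema.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed conversion factors to mol/L, covering only the units used in this sketch.
TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}


@dataclass(frozen=True)
class Measurement:
    property_name: str                    # e.g. "IC50"
    value: float
    units: str                            # stated explicitly, never implied
    target: str                           # what was measured against
    source: str                           # provenance, e.g. a notebook page
    uncertainty: Optional[float] = None   # optional, in the same units as value


def from_csv_row(row: dict) -> Measurement:
    """Build a Measurement, rejecting rows that rely on implied context."""
    missing = [key for key in ("units", "target", "source") if not row.get(key)]
    if missing:
        raise ValueError(f"missing context: {', '.join(missing)}")
    if row["units"] not in TO_MOLAR:
        raise ValueError(f"unrecognised units: {row['units']}")
    return Measurement(row["property"], float(row["value"]),
                       row["units"], row["target"], row["source"])


def to_molar(m: Measurement) -> float:
    """Convert a concentration to mol/L so that records can be compared."""
    return m.value * TO_MOLAR[m.units]
```

A row containing only the heading "IC50" and a number would be rejected by `from_csv_row`, which is the intended behaviour: the number cannot be aggregated safely until the missing context has been supplied by someone who knows it.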
1. Can an algorithm correctly and unambiguously determine the molecular formula?
2. Is it possible for software to use the representation to create a diagram that reflects what the scientist originally drew?
There are surprisingly many chemical structure formats that are unable to guarantee that an algorithm can determine the correct molecular formula from the drawing. These shortcomings have mostly to do with implicit hydrogen atoms and inline abbreviations. The implicit hydrogen problem is a side effect of chemists’ shorthand, which works well in simple cases, but poorly defined valence rules for unusual bond types, and the absence of a common method for overriding the default formula, mean that many nontrivial molecules cannot be drawn using the most popular cheminformatics file formats, such as MDL Molfile. Abbreviations are also a persistent problem, since many structures are difficult to represent in a human-readable way without abbreviating certain groups, but since there is no universal repository for abbreviations, and many research groups invent their own overlapping sets for localised use, it is necessary to have a way to define these internally as part of the structure definition. Additional problems are introduced when using drawing software designed for diagram creation, which offers a large variety of drawing primitives that have no meaning at all (e.g. circles, symbols, free form text, etc.). These file formats are a superset of the collection of meaningful objects, and they cannot be used unless the operator has a strong understanding of which objects are valid and which are not. This information is not generally known to experimentalists, and not communicated by the drawing software.
The need to recreate a diagram with the original layout, orientation, wedge-bonds, resonance patterns and various other nuances rules out the use of any popular line-based formats that exclude atomic coordinates. There are many advocates of the use of SMILES or InChI codes for raw structure representation, mainly because they are convenient for storing in spreadsheets or text files, and have intrinsic canonical properties. Both of these features enable limited use of chemical data by general purpose software that has no cheminformatics capabilities, which is often a necessary evil for data manipulation. However, the amount of data destruction involved in converting a 2D sketch into a short canonical string is highly detrimental to data integrity. As long as chemically aware software is available, there are no advantages to using canonical strings to represent structures, since these can be derived on demand from the original representation, which creates a break-even-or-lose scenario: for this reason it should not be done unless there is no alternative (see endnote b).
Table: Atom and bond properties, and currently reserved extensions, used by the SketchEl molecule format

Atom core properties
  Element: An arbitrary string, which typically matches one of the symbols from the periodic table. If not an element, and there is no inline abbreviation for the atom, then the overall representation does not encode a molecule, but rather a template or query.
  Position: 2D layout positions, in quasi-Angstrom units, with the idealised bond length being 1.5.
  Charge: Formal atomic charge for the chemical species: must be an integer.
  Unpaired electrons: Number of unpaired electrons: a whole number. This is used to help calculate the valence, and is primarily relevant only for main block elements.
  Hydrogens: By default, implicit hydrogen atoms are calculated automatically for C, N, O, P and S, and zero for all other elements. Non default values allow the number of extra hydrogens to be specified explicitly, as 0 or more.
  Extra fields: An arbitrary list of strings associated with the atom, some of which have prefixes that are reserved (see below).

Bond core properties
  Atoms: The two connecting atoms for the bond.
  Order: Bond order: a whole number, which is typically one of 0, 1, 2, 3, 4 or 5. Values of 4 and 5 are extremely rare, while values of 0 are used extensively for bonding arrangements that do not follow the simple Lewis octet rule.
  Stereo type: Flat by default, but can also be inclined or declined (so-called wedge bonds) or non-stereospecific (usually drawn as squiggly lines).
  Extra fields: An arbitrary list of strings associated with the bond, some of which have prefixes that are reserved (see below).

Atom reserved extension properties
  Z-coordinate: Optional third dimension: the existence of z-coordinates implies that the molecule is not a flat 2D depiction but rather a 3D conformation.
  Isotope: Specific isotope enrichment, where the default value of 0 implies a natural isotope distribution.
  Mapping number: Integer mapping number associated with the atom. This can be used for any purpose, but is often for correlating atoms in a series or a reaction.
  Query: Query properties used to specify how to match a variety of atom types.
  Abbreviation: Inline abbreviation, containing a terminal substructure fragment that defines the entire molecular species that the placeholder atom represents. Can be recursive, i.e. the abbreviation can contain its own abbreviations.

Bond reserved extension properties
  Query: Query properties used to specify how to match a variety of bond types.
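The core atom and bond properties listed above are enough for software to derive a molecular formula without guessing. The Python sketch below models them as plain data classes and computes implicit hydrogens and a formula. It is not the SketchEl implementation: the class and function names are invented for this example, and both the default valence table and the charge adjustment rule are assumptions chosen to illustrate how automatic hydrogens for C, N, O, P and S might be calculated, with an explicit override for everything else.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed default valences for the five elements that receive automatic hydrogens.
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2, "P": 3, "S": 2}


@dataclass
class Atom:
    element: str
    x: float = 0.0
    y: float = 0.0
    charge: int = 0
    unpaired: int = 0
    hydrogens: Optional[int] = None   # None means "calculate automatically"
    extra: List[str] = field(default_factory=list)


@dataclass
class Bond:
    atom_from: int                    # 0-based index into the atom list
    atom_to: int
    order: int = 1
    stereo: str = "flat"
    extra: List[str] = field(default_factory=list)


def hydrogen_count(atoms: List[Atom], bonds: List[Bond], idx: int) -> int:
    """Explicit value if given, otherwise a simple valence-based estimate."""
    atom = atoms[idx]
    if atom.hydrogens is not None:
        return atom.hydrogens
    valence = DEFAULT_VALENCE.get(atom.element)
    if valence is None:
        return 0
    # Assumed rule: any charge on carbon removes a hydrogen, while a positive
    # charge on N, O, P or S adds one and a negative charge removes one.
    adjust = -abs(atom.charge) if atom.element == "C" else atom.charge
    used = sum(b.order for b in bonds if idx in (b.atom_from, b.atom_to))
    return max(0, valence + adjust - atom.unpaired - used)


def molecular_formula(atoms: List[Atom], bonds: List[Bond]) -> str:
    """Formula in Hill order: C, then H, then the remaining elements alphabetically."""
    counts = Counter(a.element for a in atoms)
    counts["H"] += sum(hydrogen_count(atoms, bonds, i) for i in range(len(atoms)))
    if counts["H"] == 0:
        del counts["H"]
    keys = sorted(counts)
    if "C" in counts:
        keys = ["C"] + [k for k in ("H",) if k in counts] + \
               [k for k in keys if k not in ("C", "H")]
    return "".join(k + (str(counts[k]) if counts[k] > 1 else "") for k in keys)
```

For example, acetic acid drawn as two carbons and two oxygens with one double bond gives C2H4O2, while a boron atom gets no automatic hydrogens and yields BH3 only if the count is specified explicitly. Inline abbreviations are not handled in this sketch.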
In spite of its minimalism, the format is also extensible in a way that is both forward- and backward-compatible. A number of properties are optional: these, properties that will be defined in the future, and custom properties that are not a part of the formal specification, are stored in a way that preserves the read/modify/write integrity for algorithms that do not care to implement them. This is in contrast to formats like MDL Molfile: if a software package writes a molecule that makes use of a property that is not part of the lowest common denominator subset that most implementations can handle, or defines its own extensions, the extra data will be either deleted or corrupted if it is submitted to a software algorithm that does not implement the property correctly.
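The read/modify/write guarantee can be sketched in a few lines of Python. The semicolon-delimited atom syntax and the "iso:" and "vendor:" prefixes below are invented for this example and are not the actual SketchEl syntax; the sketch only shows the principle that fields a program does not understand are carried through untouched rather than dropped.

```python
from typing import Dict, List


def parse_atom(line: str) -> Dict:
    """Parse a hypothetical 'element;charge;extra;extra...' record."""
    element, charge, *extra = line.split(";")
    # Extra fields are kept verbatim, whether or not this program understands them.
    return {"element": element, "charge": int(charge), "extra": extra}


def format_atom(atom: Dict) -> str:
    """Write the record back out, including every extra field that was read."""
    return ";".join([atom["element"], str(atom["charge"]), *atom["extra"]])


def set_charge(line: str, charge: int) -> str:
    """A modification made by software that knows nothing about the extras."""
    atom = parse_atom(line)
    atom["charge"] = charge
    return format_atom(atom)


def isotope_of(atom: Dict) -> int:
    """A reader that does implement one extension; 0 means natural abundance."""
    values: List[str] = [e[4:] for e in atom["extra"] if e.startswith("iso:")]
    return int(values[0]) if values else 0
```

Here `set_charge` changes only the core property, so a later reader that does implement the hypothetical isotope extension still finds its data intact.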
While the SketchEl format cannot intrinsically represent many higher orders of metadata, e.g. mixtures of compounds with different stereochemistry or constitutional isomers, these definitions are difficult to pinpoint with a single proto-structure in a way that is not ad hoc: a more rigorous approach is to define a higher layer of abstraction, which either enumerates the different molecular species explicitly, or in complex but regular cases such as Markush collections, defines its own enumeration formula.
Assembling collections of molecular structures brings a similar capability wish list: a format should be minimalistic, simple to define, easy to implement, and most importantly it should be forward and backward compatible, so any implementation of the specification can read, view, modify and write the data with reasonable assurance that important content will not be destroyed or corrupted. Since the most common use case scenario for multiple structures and data involves representation in a tabular format, roughly analogous to a spreadsheet with a molecular datatype, it makes sense to define the core data format in this way, whereby more exotic data arrangements are mapped onto the table with higher order metadata.
These characteristics are implemented by the datasheet XML format, which is also used by the SketchEl package, for editing collections of structures and data. At its core it is a very simple tabular format, where each column is strongly typed, and is one of molecule, string, integer, real number, or boolean. Molecules are embedded using the SketchEl molecule format. For many data collections, e.g. lists of molecular structures with their associated identifiers (name, link, database ID, etc.) and properties (activity, solubility, melting point, etc.), the core format is quite adequate.
For additional labelling, or for higher order organisation such as arranging multiple structures into reaction schemes, the datasheet header allows for extensions, which contain arbitrary content that is generally not shown to the end user. Extensions that follow a specific protocol are described as aspects. The principle of the aspect extension mechanism is that if a software application implements the aspect, it should provide additional capabilities, such as alternate viewing modes, specialised editors or additional classification information. An aspect is required to be as tolerant as possible of disruptive external modifications, e.g. if one of its necessary columns is deleted, it will recreate the column using default values. If the software application does not recognise the aspect, it should still be able to load the datasheet, present it to the user in its default tabular form, allow cells to be edited, rows to be added or deleted or moved, and in some cases modification of column names and types, without necessarily disrupting the higher order markup. For example, if an aspect defines the default units for a given column, loading the datasheet with an unaware editor and modifying the quantity values preserves the read, view, modify, save integrity, as long as the user is aware of what the numbers represent. If an aspect defines a chemical reaction, where a number of molecule columns are used to define the various components, it is possible to use a minimal editor to change some of the molecular structures, and still preserve the reaction definition.
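A rough Python sketch of these two ideas, strongly typed columns and an aspect that tolerates disruptive edits, is given below. It does not reproduce the datasheet XML format or any real aspect: the `DataSheet` class, the "example.Reaction" extension name, the column names and the `repair_reaction_aspect` function are all hypothetical, and molecules are assumed to be stored as text.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

# The five column types named in the text, mapped to Python types for checking.
COLUMN_TYPES = {"molecule": str, "string": str, "integer": int,
                "real": float, "boolean": bool}


@dataclass
class DataSheet:
    columns: Dict[str, str] = field(default_factory=dict)     # name -> type
    rows: List[Dict[str, Any]] = field(default_factory=list)
    extensions: Dict[str, str] = field(default_factory=dict)  # opaque header content

    def add_column(self, name: str, kind: str, default: Any = None) -> None:
        if kind not in COLUMN_TYPES:
            raise ValueError(f"unknown column type: {kind}")
        self.columns[name] = kind
        for row in self.rows:
            row[name] = default

    def delete_column(self, name: str) -> None:
        del self.columns[name]
        for row in self.rows:
            row.pop(name, None)

    def add_row(self) -> int:
        self.rows.append({name: None for name in self.columns})
        return len(self.rows) - 1

    def set_cell(self, row: int, name: str, value: Any) -> None:
        expected = COLUMN_TYPES[self.columns[name]]
        # Strict check: a string is never silently stored in a numeric column.
        if value is not None and type(value) is not expected:
            raise TypeError(f"{name} expects {self.columns[name]}")
        self.rows[row][name] = value


# Hypothetical columns that an illustrative reaction aspect depends on.
REACTION_COLUMNS = {"Reactant": "molecule", "Product": "molecule", "Yield": "real"}


def repair_reaction_aspect(sheet: DataSheet) -> List[str]:
    """Recreate, with default (empty) values, any required column that is missing."""
    recreated = []
    for name, kind in REACTION_COLUMNS.items():
        if name not in sheet.columns:
            sheet.add_column(name, kind)
            recreated.append(name)
    return recreated
```

An editor that is unaware of the aspect can call `delete_column` or `set_cell` freely; the extension content in the header is never touched, and software that does implement the aspect can call `repair_reaction_aspect` on loading to restore whatever it needs.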
Figure 6(c) and (d) both show renditions of aspects that encode chemical reactions. The Reaction aspect, shown in (c) rendered by the Mobile Molecular DataSheet app, combines a number of columns containing molecules, text and numbers into a single reaction step, which is ideally rendered as a single graphical object representing multiple components. The Experiment aspect, shown in (d) rendered by the Green Lab Notebook app, is a subclass of Reaction, which augments the definitions with additional information such as quantities and component roles, and also allows for multiple steps. While datasheets containing the Reaction or Experiment aspects can be viewed and edited by software that does not recognise the aspects, the meaning of the content is less clear to the viewer, since a spreadsheet-like representation is quite different from the output of the reaction component layout algorithms used by the software that implements these aspects.
Chemical data that is stored using the datasheet XML format, with embedded SketchEl molecules, conforms to a very well defined, easy to implement core specification for data purity. Both molecules and datasheets can be extended as necessary to describe more complex concepts, which are often necessary to ensure machine readability, but their core definitions are highly functional, and generally safe to edit without specific knowledge of higher order markup.
One of the valuable properties of open data is the close connection between the scientist and the content, which can be treated as an opportunity to solve some of the most pernicious data quality issues in chemistry. When it comes to aggregated collections of molecular structures, there are two main kinds of problems: structure representations that are demonstrably wrong in the absolute sense, and those which could be correct, but in the given context, the wrong chemical species is being described.
A significant amount of work has been invested in the former category, for example the ChemSpider Validation and Standardization Platform (CVSP) (also see: Karapetyan K, Williams AJ, Batchelor C, Sharpe D, Tkachenko V: The Chemical Validation and Standardization Platform (CVSP). Large-scale automated validation of chemical structure datasets, accepted for publication in Journal of Cheminformatics). This tool embodies chemical knowledge that can be used to search for a number of common structure mistakes, or representations that do not follow a protocol, such as covalently bound salts (Na-Cl) or pentavalent nitro groups. Many of these examples are common and easy to fix, but there are many more that cannot be corrected without knowledge of context. Furthermore, there is no way to ascertain whether a molecule is the right one when there are multiple reference points. For example, consider a data record that provides a bioactivity measurement for a molecule named “aspirin”, for which the structure given is salicylic acid: even a smart algorithm that is able to find out that aspirin is the acetylated form cannot know whether the data record provided the wrong structure or the wrong name, unless the provenance is somehow recorded. The questions of whether the molecule or the name should be trusted preferentially, and which source has precedence if there are conflicts within either of these, mean that each data collection needs a complex and elaborate policy for judging data quality. These challenges have directly influenced some of the approaches associated with the Open PHACTS semantic web project, where a “chemical lenses” approach has been utilised to focus the user in on various forms of the chemical.
Nonetheless, tools for ensuring that a molecule is both valid and standardised according to a set of rules are extremely valuable when incorporated into the editing workflow, i.e. at the source of entry when user intervention is still an option, rather than applied as automated scripts further downstream. Violations can be highlighted during the editing of a molecular structure, and flagged again if the user attempts to submit the entry to a database.
The adherence to hard rules for structure validity is often appropriate for processing large databases with preexisting quality issues, but a heavy-handed approach is not appropriate for original data entry. For example, when a chemist draws a pentavalent carbon atom, it is usually a mistake, and software that can call attention to this likely mistake as early as possible is beneficial. Nonetheless, there are real reasons for representing such species (e.g. carboranes), which raises an important point that is sometimes lost in software: the originating scientist is always the final arbiter. There is usually at that time nobody in the world who knows more about the particular unit of research being described, and the rule set designed for a particular software package is far down the list of contenders.
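A check that warns rather than rejects might look something like the sketch below. The valence table is deliberately tiny and is an assumption of this example, as are the function names; it is not the rule set of CVSP or of any other tool, and implicit hydrogens are ignored.

```python
from dataclasses import dataclass

# Usual maximum number of bonds for a few neutral elements. This tiny table is
# an assumption of the sketch, not a rule set from any real validation tool.
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1}


@dataclass
class Atom:
    element: str
    charge: int = 0


def valence_limit(atom):
    """Return the bond-order limit for an atom, or None if it is not covered."""
    limit = MAX_VALENCE.get(atom.element)
    if limit is not None and atom.element == "N" and atom.charge == 1:
        limit += 1                    # e.g. ammonium or charge-separated nitro
    return limit


def valence_warnings(atoms, bonds, acknowledged=frozenset()):
    """Return warnings for atoms whose summed bond order exceeds the limit.

    bonds is a list of (from_atom, to_atom, order) using 1-based atom numbers.
    Atoms listed in `acknowledged` were confirmed by the scientist and are
    never reported: the function only warns, it does not reject anything.
    """
    totals = [0] * len(atoms)
    for a, b, order in bonds:
        totals[a - 1] += order
        totals[b - 1] += order
    warnings = []
    for number, (atom, total) in enumerate(zip(atoms, totals), start=1):
        limit = valence_limit(atom)
        if limit is None or number in acknowledged:
            continue
        if total > limit:
            warnings.append(f"atom {number} ({atom.element}): valence {total} exceeds {limit}")
    return warnings


# A carbon atom drawn with five single bonds to hydrogen.
atoms = [Atom("C")] + [Atom("H") for _ in range(5)]
bonds = [(1, n, 1) for n in range(2, 7)]
print(valence_warnings(atoms, bonds))                     # one warning, for atom 1
print(valence_warnings(atoms, bonds, acknowledged={1}))   # no warnings
```

The `acknowledged` argument is how this sketch expresses the principle that the originating scientist has the last word: the software raises the question once, and an explicit confirmation silences it.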
While the second structure diagram in Figure 11 is much less appealing for purposes of manuscript preparation, this illustrates the primary argument of this article: imagine two algorithms, one designed to automatically convert from the human-friendly format (a) to the machine-friendly format (b), and the other to perform the opposite. If both algorithms have a comparably high but imperfect success rate for a given domain (e.g. 99%), it is overwhelmingly preferable to use the machine-friendly format for the primary repository, because of the asymmetry of consequences. When a structure drawn for humans is parsed incorrectly into a machine format and injected into a database, all too often the error goes unnoticed, and if the provenance is not retained, then the corrupted data will surely find its way into the body of scientific knowledge and continue on to befoul any and all data processing operations that it comes into contact with. If on the other hand the data is represented in a machine-friendly way, and algorithmically converted into a human-friendly graphical format as needed, the consequences of failure are minor. For high quality uses such as manuscript preparation, rare flaws will generally be noticed and can be corrected easily enough, since literature publications are carefully scrutinised by several reviewers prior to publication. Even if a sub-optimal drawing is published, as long as it is correct, the fallout is likely to be manageable. For low quality uses, like browsing search results from a database query, occasional representation of structures in a way that is correct but not aesthetically ideal is a small nuisance, compared to data corruption.
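The asymmetry can be made concrete with a few lines of arithmetic. The collection size below is purely illustrative; only the 99% rate comes from the example above.

```python
def expected_failures(records: int, success_rate: float) -> int:
    """Expected number of records that a converter handles incorrectly."""
    return round(records * (1.0 - success_rate))


RECORDS = 1_000_000      # illustrative collection size, not a figure from the text
SUCCESS_RATE = 0.99      # the example rate quoted above

# Human-friendly primary store: failures happen on the way INTO the database,
# so each one is a corrupted record that may go unnoticed.
silent_corruptions = expected_failures(RECORDS, SUCCESS_RATE)

# Machine-friendly primary store: failures happen on the way OUT, so each one
# is a poorly drawn but still correct picture in front of a human reader.
visible_flaws = expected_failures(RECORDS, SUCCESS_RATE)

print(silent_corruptions, visible_flaws)
```

The two counts are identical by construction. What differs is the kind of failure: one direction yields wrong chemistry stored as fact, the other yields correct chemistry that merely looks awkward.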
As well as such valence issues, a large category of data entry issues arises from the use of text. As a general rule, any text in a structure diagram that does not map to an element in the periodic table brings with it an additional burden for ensuring that its meaning is strictly defined. Free text, e.g. a label that says “chiral” or “cis/trans”, is clearly not applicable, but as mentioned earlier, abbreviations can be dealt with by ensuring that they are defined within the chemical structure - though not with free text labels such as “L = PPh3”. Other kinds of abbreviations, such as X, R, R1, R2, etc., serve as element placeholders and their presence implies that the representation is of a template, rather than a structure. It is important to ensure that these structures are never submitted to a database without some accompanying formula that specifies how the template should be converted into actual chemical species, but fortunately this is easy for data curation algorithms to detect, and reject due to missing information.
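Detecting these cases is indeed straightforward, as the following sketch suggests. It assumes that every atom label is available as a plain string and that abbreviations have not been given an inline definition; the function names are hypothetical.

```python
import re

# Symbols of the 118 named elements, period by period.
ELEMENTS = set("""
H He
Li Be B C N O F Ne
Na Mg Al Si P S Cl Ar
K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge As Se Br Kr
Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe
Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po At Rn
Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf Db Sg Bh Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og
""".split())

# Placeholder labels mentioned in the text: X, R, R1, R2, ...
PLACEHOLDER = re.compile(r"^(?:X|R\d*)$")


def classify_label(label: str) -> str:
    """Classify an atom label as 'element', 'placeholder' or 'undefined'."""
    if label in ELEMENTS:
        return "element"
    if PLACEHOLDER.match(label):
        return "placeholder"
    return "undefined"                # free text or an undefined abbreviation


def check_for_submission(labels):
    """Return (accepted, problems) for the atom labels of one structure."""
    problems = []
    for label in labels:
        kind = classify_label(label)
        if kind == "placeholder":
            problems.append(f"{label!r} is a placeholder: this is a template, not a structure")
        elif kind == "undefined":
            problems.append(f"{label!r} has no defined chemical meaning")
    return (not problems, problems)


accepted, problems = check_for_submission(["C", "C", "O", "R1", "chiral"])
for problem in problems:
    print(problem)
```

A real curation pipeline would also accept abbreviations that carry their own structural definition; this sketch treats any label outside the periodic table as unresolved.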
One of the best visual aids for educating scientists about what their structure diagrams actually mean to a machine algorithm is simply to display the computed molecular formula for particular fragments. This is a concept that is deeply ingrained for all chemists, regardless of their level of affinity for cheminformatics software. If the example in Figure 11(a) is reported as having a formula of C14H28Fe2[CO][OC], the chemist does not need to be convinced that there is a problem: this is not the chemical composition of the molecule, which means that knowingly submitting this representation to a database is tantamount to scientific fraud, and therefore something must be done about it.
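Computing a formula for each connected fragment needs very little machinery. The sketch below assumes that all atoms, including hydrogens, are explicit, and that bonds are given as pairs of 0-based atom indices; formulae are written in Hill order.

```python
from collections import Counter


def hill_formula(elements):
    """Molecular formula in Hill order: C, then H, then the rest alphabetically."""
    counts = Counter(elements)
    if "C" in counts:
        order = ["C"] + (["H"] if "H" in counts else [])
        order += sorted(e for e in counts if e not in ("C", "H"))
    else:
        order = sorted(counts)        # no carbon: strictly alphabetical
    return "".join(e + (str(counts[e]) if counts[e] > 1 else "") for e in order)


def fragment_formulas(elements, bonds):
    """Return one formula per connected fragment, in order of first atom."""
    parent = list(range(len(elements)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i, j in bonds:
        parent[find(i)] = find(j)
    groups = {}
    for i, element in enumerate(elements):
        groups.setdefault(find(i), []).append(element)
    return [hill_formula(group) for group in groups.values()]


# Sodium chloride drawn as two separate ions, and as a covalently bound pair.
print(fragment_formulas(["Na", "Cl"], []))
print(fragment_formulas(["Na", "Cl"], [(0, 1)]))
```

Showing the per-fragment result next to the drawing lets the chemist see at a glance whether the software has understood the same species that was intended.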
The most effective way to ensure that structures are represented accurately is to use data entry tools that operate on a fundamental datastructure, such as the SketchEl molecule format, or an enhanced variant of the industry standard MDL Molfile. Using graphical diagram drawing tools is problematic, because the functionality they provide is a superset of what is valid for cheminformatics purposes, and there are no algorithms that can transform an aesthetically styled structure into a valid machine readable equivalent with an acceptable success rate. In principle, though, it may be effective to create a plugin for such software to show the machine interpretation and structural formula breakdown, updated in real time, in order to ensure that users are aware of when their stylistic choices result in misleading content, but such tools are not currently available. This represents a potential unmet need.
Direct data hosting
There are many services that store user-provided chemical data in a fundamental cheminformatics format, including the aforementioned ChemSpider and PubChem databases. These services make use of elaborate content aggregation features, which involve a large amount of automated correction. For most organic structures that conform to simple Lewis octet rules this can be trusted to leave a well-drawn structure unmolested, but problems arise when leaving this domain.
The molsync.com site provides an example of a service that openly hosts chemical data in its purest form, and allows it to be consumed in a variety of different downstream formats, for either humans or machines. We describe some of the properties of this service, because it differs from most ad hoc Internet sharing facilities in that it provides interpretation and visualisation of raw chemical data. It demonstrates several key proof of concept features that should be standard for chemical data hosting, and can be incorporated into open lab notebook software.
Data can be uploaded using a simple REST-based API, after which point it is stored in a database and assigned an identifier. The data is typically uploaded as either a SketchEl molecule or datasheet XML document, but other related formats such as MDL Molfiles will be automatically converted. From the identifier, a URL can be constructed, which allows anyone with an Internet connection to open the page in a browser.
The outline of the browser presentation uses simple HTML and CSS. Individual molecular structures, and some of the higher order metadata specified by aspects that are implemented by the server - such as chemical reactions - are drawn using a high grade rendering layout algorithm and passed to the front end as vector drawing instructions, which means that the page can be rendered to any resolution, and also can be sent to a printer or converted into a PDF file without any loss of quality.
There are two main use cases for data conversion: migrating to a different cheminformatics format to be consumed by a specific software application, and the generation of graphics for presentation or publication purposes.
When raw data is stored in a rigorously minimalistic and unambiguous format, it is generally effective to convert this data into the lowest common denominator subset of a less rigorous format, with some potential for information loss, which may remain theoretical for a reasonable domain of use cases. For example, converting a structure into a V2000 MDL Molfile that is readable by the large majority of software that can parse the format can be expected to preserve all of the pertinent information in many molecule types. For nonorganic structures that cannot be properly represented with the V2000 format, or for structures that use inline abbreviations, the conversion cannot survive a round trip intact, and so the conversion is an irreversible downstream one. For information that is pertinent to the destination format, but does not exist in the core specification of a SketchEl molecule, the extensibility mechanism holds the door open for future improvement, in a backward and forward compatible way. For example, MDL Molfiles provide a number of capabilities for specifying chemical queries as atom and bond annotations. The SketchEl molecule format can optionally incorporate analogous extensions, and if the data hosting service is subsequently upgraded so that it can convert the overlapping subset of functionality to the MDL Molfile equivalent, then this capability can be introduced at any time. The operation is commutative to the extent that the definitions match.
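As an illustration of a lowest common denominator export, the sketch below writes a minimal V2000 Molfile from a list of atoms and bonds. It is not the converter used by any particular service: the function name and input layout are assumptions, and only elements, 2D coordinates, formal charges and bond orders are carried across.

```python
# Charge codes used in the atom block of a V2000 Molfile.
CHARGE_CODE = {0: 0, 3: 1, 2: 2, 1: 3, -1: 5, -2: 6, -3: 7}


def write_molfile(name, atoms, bonds):
    """Write a minimal V2000 Molfile.

    atoms: list of (element, x, y, charge); bonds: list of (from, to, order)
    with 1-based atom numbers. Anything else the source format may hold
    (abbreviations, extension fields) is simply not written by this sketch.
    """
    if len(atoms) > 999 or len(bonds) > 999:
        raise ValueError("V2000 counts are limited to three digits")
    lines = [name, "", ""]            # three header lines: title, program, comment
    lines.append(f"{len(atoms):3d}{len(bonds):3d}" + "  0" * 8 + "999 V2000")
    for element, x, y, charge in atoms:
        code = CHARGE_CODE.get(charge, 0)
        lines.append(f"{x:10.4f}{y:10.4f}{0.0:10.4f} {element:<3}{0:2d}{code:3d}" + "  0" * 10)
    for a, b, order in bonds:
        lines.append(f"{a:3d}{b:3d}{order:3d}{0:3d}")
    charged = [(n, atom[3]) for n, atom in enumerate(atoms, start=1) if atom[3] != 0]
    for start in range(0, len(charged), 8):      # at most 8 entries per line
        chunk = charged[start:start + 8]
        lines.append(f"M  CHG{len(chunk):3d}" + "".join(f"{n:4d}{c:4d}" for n, c in chunk))
    lines.append("M  END")
    return "\n".join(lines) + "\n"


# Water with explicit hydrogens; the coordinates are illustrative only.
water = write_molfile(
    "water",
    [("O", 0.0, 0.0, 0), ("H", 0.96, 0.0, 0), ("H", -0.24, 0.93, 0)],
    [(1, 2, 1), (1, 3, 1)],
)
print(water)
```

Because the writer silently drops whatever it has no field for, reading the output back cannot recover the original in the general case, which is why such an export is best treated as a one-way, downstream step.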
Similarly with collections being exported as MDL SDfiles, a significant amount of metadata is lost, particularly regarding the columns and types, and so it cannot always be assumed that an upstream conversion will preserve all of the original data. Other destination formats have more interesting caveats. For example, the Chemical Markup Language (CML) is for all practical purposes a superset of all possible chemical formats, since additional tags can be introduced by any writer without affecting validity, which passes the interpretation problem down the line: there is no guarantee that other software will understand the choice of properties, meaning that interoperability is very low.
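The loss of column types on SDfile export can be demonstrated with a short round trip. The helper names below are hypothetical, and the reader handles only the simple field layout that the writer produces.

```python
# An empty V2000 Molfile standing in for a real structure.
EMPTY_MOLFILE = "no structure\n\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n"


def to_sd_record(molfile: str, fields: dict) -> str:
    """Append data fields to a Molfile to form one SDfile record."""
    lines = [molfile.rstrip("\n")]
    for name, value in fields.items():
        lines.append(f">  <{name}>")
        lines.append(str(value))      # every value is flattened to text here
        lines.append("")
    lines.append("$$$$")
    return "\n".join(lines) + "\n"


def read_sd_fields(record: str) -> dict:
    """Read the data fields back; all values come back as strings."""
    lines = record.splitlines()
    i = lines.index("M  END") + 1 if "M  END" in lines else 0
    fields = {}
    while i < len(lines):
        line = lines[i]
        if line.startswith(">") and "<" in line:
            name = line[line.index("<") + 1:line.rindex(">")]
            i += 1
            value = []
            while i < len(lines) and lines[i] not in ("", "$$$$"):
                value.append(lines[i])
                i += 1
            fields[name] = "\n".join(value)
        else:
            i += 1
    return fields


# Placeholder values: an integer, a real number and a boolean.
original = {"ID": 42, "Solubility": 3.5, "Active": True}
restored = read_sd_fields(to_sd_record(EMPTY_MOLFILE, original))
print(restored)
```

The field names survive, but the reader has no way to tell that `ID` was an integer or `Active` a boolean: that knowledge has to be guessed or supplied from elsewhere.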
Converting a rigorous, minimalistic cheminformatics format into manuscript quality graphics is not a simple task. Because high level aesthetic style information has no place being stored in the core definition of a datastructure that is intended to describe the chemistry in a way that is understandable to machines, it means that the rendering process involves the creation of a lot of additional information, namely the positioning for each of the labels, bonds and various other annotations. While the loss of layout cues in the core datastructure is unfortunate in the case of structures that were originally imported from a drawing program that allowed the user to specify such preferences, it does mean that all structures are created equal as far as visualisation is concerned, as long as the 2D coordinates and wedge bonds for each of the non-virtual atoms are chosen to suit. Since many structures are partially or completely composed using algorithms, rather than being hand drawn, it is highly beneficial to be able to create high quality diagrams without additional user intervention. One alternative to insisting on algorithmic recreation of aesthetic properties is to store layout hints (e.g. atom colours, charge positions, etc.) as optional non-fundamental extension properties.
As with cheminformatics formats, there are a number of graphics formats to choose from, and the most appropriate of these varies depending on the destination. The most universally recognised format is the Portable Network Graphics (PNG) format, which is a bitmapped format. Until recently this was the only practical method for displaying custom graphics on a web page, but it has major limitations, e.g. the resolution has to be selected prior to generating the page, as well as a litany of other inconveniences. All too often manuscripts created with wordprocessing software incorporate bitmapped graphics, and these need to be generated at a much higher resolution than what is suitable for screen viewing. A document with screen-resolution bitmapped graphics appears shoddy when zoomed to a non-default resolution, and frequently almost illegible when printed or converted into a PDF file, which describes both of the primary use cases for manuscript preparation. Since molecular structures are inherently vector diagrams, being originally composed by the software using a small dictionary of shapes (lines, circles, curves, etc.), it is strongly preferable to represent the drawings in a vector format, which ensures that they can be rendered as perfectly as the device allows, whether it be a screen, a printer, or a print-ready file format like PDF. There are a number of vector graphics formats to choose from, and these include Scalable Vector Graphics (SVG), Encapsulated PostScript (EPS), and embedded graphics formats like DrawingML, which can be used to compose vector diagrams inside Microsoft Word, Excel or PowerPoint documents.
The reason for taking the approach of storing chemical data in the most rigorous cheminformatics format, and converting it on demand, is that functionality can be provided as it is needed. Because the data is stored in a format that is understandable to an algorithm at a fundamental level, it can be converted into any format that the service is currently capable of creating, taking into account the needs of any aspects that have currently been implemented by the service. Figure 12(b) shows the dialog that is presented when requesting the download of a datasheet with an embedded Experiment aspect. The list of available formats includes several informatics formats, and a number of different ways to render the content as graphics which can subsequently be used by other presentation packages, including Microsoft Word format with embedded vector diagrams. The combination of machine-readable raw data and a chemistry aware service has two clear advantages over storing pre-prepared files in several formats: additional output formats can be added at any time, and the user is also given the option to customise the output, e.g. by selecting the resolution and colour scheme for molecular graphics. This approach satisfies the needs of machines and humans.
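Generating a vector drawing on demand from stored coordinates can be sketched in a few lines. This is a deliberately crude rendering, nowhere near manuscript quality: labels are simply drawn on top of the bond ends, only single lines are drawn, and the function name and input layout are assumptions of the example.

```python
from xml.sax.saxutils import escape


def molecule_svg(atoms, bonds, scale=30.0, margin=10.0):
    """Render atoms and bonds as an SVG document string.

    atoms: list of (label, x, y) with y pointing up; bonds: list of (i, j)
    pairs of 0-based atom indices. Carbon atoms are left unlabelled.
    """
    xs = [x for _, x, _ in atoms]
    ys = [y for _, _, y in atoms]
    min_x, max_y = min(xs), max(ys)
    width = (max(xs) - min_x) * scale + 2 * margin
    height = (max_y - min(ys)) * scale + 2 * margin

    def to_page(x, y):
        # SVG has y pointing down, so the vertical axis is flipped
        return (x - min_x) * scale + margin, (max_y - y) * scale + margin

    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {width:.1f} {height:.1f}">']
    for i, j in bonds:
        x1, y1 = to_page(atoms[i][1], atoms[i][2])
        x2, y2 = to_page(atoms[j][1], atoms[j][2])
        parts.append(f'<line x1="{x1:.1f}" y1="{y1:.1f}" x2="{x2:.1f}" y2="{y2:.1f}" '
                     'stroke="black" stroke-width="1.5"/>')
    for label, x, y in atoms:
        if label != "C":
            cx, cy = to_page(x, y)
            parts.append(f'<text x="{cx:.1f}" y="{cy:.1f}" font-size="12" '
                         f'text-anchor="middle" dominant-baseline="central">{escape(label)}</text>')
    parts.append("</svg>")
    return "\n".join(parts)


# Heavy atoms of ethanol with illustrative coordinates.
svg = molecule_svg([("C", 0.0, 0.0), ("C", 1.0, 0.5), ("O", 2.0, 0.0)], [(0, 1), (1, 2)])
print(svg)
```

Because the output declares only a `viewBox` and no fixed pixel size, the same string can be scaled to a thumbnail or a poster without regenerating it, which is exactly the property that a bitmap lacks.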
The Internet provides a seemingly limitless menu of ways to share information across the globe, and most of them can be adapted to chemistry in some way, but other than approaches such as that taken by molsync.com, these seldom have the ability to form a strong association between the machine interpretable data and the human viewable rendering thereof. For example, a user can easily use Twitter to share a graphical picture of a molecular structure, but since this is just a bitmapped image, to a machine it is largely indistinguishable from a photograph of a kitten. The data only regains its full value if an individual human redraws the structure using a chemical drawing package (or attempts to parse it with a bitmap-to-structure conversion tool).
Sharing of machine interpretable data is leveraged from within the ODDT app, and it is easy to obtain it and incorporate it within a cheminformatics workflow. The data is acquired in its pure state, and there is no need to reenter it, because no information was lost during the transition. We have described how mobile technologies can be used for secure sharing of data prior to open sharing in ODDT. In addition we have shown how ODDT can be used to surface structure activity relationship (SAR) data from behind paywalls and raise awareness of specific topics [61,62]. Twitter is also a valuable tool for realtime microblogging from scientific conferences: there are an increasing number of scientists who routinely “live tweet” what they learn during conferences, and there is no reason why digitally accessible data cannot be incorporated into this stream.
Another novelty feature that the molsync.com service provides is the display of a molecular glyph, which is the equivalent of a chemical QR code: its role is equivalent to a URL, except that when it is printed out on paper, e.g. on a poster or a label, it is possible to use the Living Molecules mobile app to photograph the glyph. Once the payload is extracted, the app is able to go directly to the source of the data, and download it in its pure form, i.e. it is now loaded into the app itself, and from there it can be viewed, exported, re-shared or used in any other way that raw cheminformatics data can. We have shown how this glyph could be used practically to encode chemical ingredients in consumer products.
The increasing importance of data-intensive cheminformatics algorithms, the growing recognition of problems with existing data collections, and the rising prominence of open lab notebook data mean that the community has an opportunity to correct some of the persistent data quality problems that have plagued the field ever since large datasets began to be made publicly available on the Internet. Addressing these problems will require a significant amount of effort from all participants, starting with the creators of chemical software tools used for data entry. Alongside the improvement of available user-facing tools, an increased awareness is required on the part of the individual experimentalists who provide the raw data, and the cheminformaticians who build systems for collecting and assimilating it. Some of the data entry tools in current use can already be used to generate high quality machine readable data, but in many cases only if there is a significant educational push to ensure that scientists use them correctly, and this is unlikely to happen in isolation, unless the tools themselves are greatly improved. Software creators need to ensure that their products evolve to make it easier for chemists to operate them in a way that satisfies the requirements for presentation and digital interpretability.
The need to improve the quality of public data, which is growing in volume at a very fast pace, is an urgent action item for the cheminformatics community, but the introduction of open lab notebooks is an opportunity to make a profound change, because unlike most other sources, the data is produced by the scientists who conduct the experiments. This immediacy removes the most intractable problems with correct data representation. That being said, if we miss this opportunity to train scientists to produce machine readable data, or fail to deliver adequate tools to do so without an unreasonable amount of extra effort, we will end up in the unenviable position of having an ever increasing quantity of bad data.
Should we be successful in rising to this challenge, the outlook for cheminformatics is exciting, since this relatively young industry was incubated during a regime of scarce data, then came of age in an era of very noisy and low quality data. It is hard to know for sure how many of the common techniques in our industry provide chemical intelligence of middling quality, simply because the available training data is so poor, and requires so much effort to extract information from inappropriate data structures. As the available data simultaneously becomes more open, more abundant and of better quality, we can expect to see improvements to all kinds of chemical algorithms, and new use cases that were previously not viable due to data problems. We can also expect more democratisation of chemical data, since the combination of micropublications with digitally coherent content means that experimental results will often be published regardless of whether they are suitable for inclusion in a full length research article, and it also means that this data will actually be used. As long as the provenance of the data is retained, the data collation services that are exposed to any particular source can make their own decisions about level of trust. This is in contrast to the current situation, which more often than not can be described as blind.
The combination of these trends with use of publicly accessible social networks, such as Twitter, already has some proof of concept technology, such as the Open Drug Discovery Teams project. We anticipate that aggregation and evaluation of quality will become a highly active area of research unto itself, likely with a large crowd-sourcing component.
In this article we have concentrated primarily on chemical structures, since these are most urgently in need of attention in the field of cheminformatics, but there are numerous other kinds of metadata that can and should be incorporated into digital research publications. Allowing for different kinds of provenance is an important consideration, especially when integrating with the current open data options, e.g. whether a fact was directly provided as the result of an experiment carried out by a particular scientist, reentered from another source, text-mined from an earlier document, etc. For physical properties and activity determinations, it is useful to know more than just the units and standard errors: information about the experiment setup, calibration, the target organism, which measurement run the results were obtained from, etc., are all important. The emergence of standards for capturing this kind of high level metadata in a semantic form is an essential step toward enabling the construction of algorithms that can mine the Internet for available knowledge, and create robust models that are based on something other than noise.
In short, the solution to the problem of open notebook science data quality is to apply the same level of rigour to the machine readability of the data as would normally be applied to a printable manuscript. A published paper is not considered viable until it can be understood unambiguously by chemists, and so exported digital content should not be released until a machine algorithm can interpret it without loss or corruption of essential information. Accomplishing this goal begins with the improvement of software tools for data entry and use of the most rigorously complete and well defined data formats, and culminates in changes to the culture of data publication. This culture shift requires a recognition of the primacy of machine readability: database maintainers and journals must do their best to ensure that digital content makes sense (e.g. chemical structures can be resolved to a distinct molecular formula, properties have units, etc.). The experimentalists who submit this content must be provided with better tools for avoiding common mistakes (e.g. segregating sketcher tools for creating non-chemical objects like free text or circles), and have an increased awareness of the importance of doing so. In the event of errors in digital content, the traceability of open lab notebooks leads back to the experimentalist who created it, and it must be understood that releasing flawed digital content is as much of a scientific faux pas as publishing an incorrect or misleading figure.
As cheminformaticians, these issues are our domain: it is up to us to build the tools, and ensure that they are understood and used correctly by experimentalists, so that we can leverage the full potential of open science.
a. The number of sellers and resellers of chemical compounds who make their catalogs available to download in an accessible format, such as MDL SDfile, is large. Specific instances are not listed in this article for timeliness purposes, since additions and deletions are frequent.
b. It should be noted for completeness though that InChI does include an AuxInfo layer which can optionally encode the coordinates for a structure (http://www.inchi-trust.org/technical-faq/#11.1) but few are aware of this capability and it is rarely used.
We dedicate this article to Dr. Jean-Claude Bradley who has done more than anyone else in the field of chemistry to convince us of the benefits of open data and collaboration.
- Pence HE, Williams AJ. ChemSpider: An Online Chemical Information Resource. J Chem Educ. 2010;87:1123–4.
- Williams AJ. ChemSpider: Integrating Structure-Based Resources Distributed across the Internet. In: Belford RE, Moore JW, Pence HE, editors. Enhancing Learning with Online Resources, Social Networking, and Digital Libraries. Washington: American Chemical Society; 2010. doi:10.1021/bk-2010-1060.ch002.
- Williams AJ. Public Compound Databases – How ChemSpider changed the rules making molecules on the web free. In Collaborative Computational Technologies for the Life Sciences, Edited by Ekins S, Hupcey MAZ and Williams AJ.
- Hastings J, Chepelev L, Willighagen E, Adams N, Steinbeck C, Dumontier M. The chemical information ontology: provenance and disambiguation for chemical data on the biological semantic web. PLoS One. 2011;6:e25513.
- Li Q, Cheng T, Wang Y, Bryant SH. PubChem as a public resource for drug discovery. Drug Discov Today. 2010;15:1052–7.
- Irwin JJ, Sterling T, Mysinger MM, Bolstad ES, Coleman RG. ZINC: A Free Tool to Discover Chemistry for Biology. J Chem Inf Model. 2012;52:1757–68.
- Interview with Jean-Claude Bradley. The Impact of Open Notebook Science. 2014 [http://www.infotoday.com/IT/sep10/poynder.shtml]
- Harvey MJ, Mason NJ, Rzepa HS. Digital Data Repositories in Chemistry and Their Integration with Journals and Electronic Notebooks. J Chem Inf Model. 2014;54:2627–35.
- Williams AJ, Wilbanks J, Ekins S. Why open drug discovery needs four simple rules for licensing data and models. PLoS Comput Biol. 2012;8:e1002706.
- Attwood TK, Kell DB, McDermott P, Marsh J, Pettifer SR, Thorne D. Utopia documents: linking scholarly literature with research data. Bioinformatics. 2010;26:568–74.
- Chemistry Add-in for Word [http://research.microsoft.com/en-us/projects/chem4word] (accessed October 2014)
- Jessop DM, Adams SE, Willighagen EL, Hawizy L, Murray-Rust P. OSCAR4: a flexible architecture for chemical text-mining. J Cheminf. 2011;3:41.
- Hawizy L, Jessop DM, Adams N, Murray-Rust P. ChemicalTagger: A tool for semantic text-mining in chemistry. J Cheminf. 2011;3:17.
- Corbett P, Murray-Rust P. High-Throughput Identification of Chemistry in Life Science Texts. In: Berthold MR, Glen R, Fischer I, editors. Computational Life Sciences II. Heidelberg: Springer Berlin; 2006. p. 107–18.
- Cohen AM, Hersh WR. A survey of current work in biomedical text mining. Brief Bioinform. 2005;6:57–71.
- Filippov IV, Nicklaus MC. Optical Structure Recognition Software To Recover Chemical Information: OSRA, An Open Source Solution. J Chem Inf Model. 2009;49:740–3.
- Ibison P, Jacquot M, Kam F, Neville AG, Simpson RW, Tonnelier C, et al. Chemical literature data extraction: The CLiDE Project. J Chem Inf Comp Sci. 1993;33:338–34.
- Valko AT, Johnson AP. CLiDE Pro: The Latest Generation of CLiDE, a Tool for Optical Chemical Structure Recognition. J Chem Inf Model. 2009;49:780–7.
- Williams AJ, Ekins S. A quality alert and call for improved curation of public chemistry databases. Drug Discov Today. 2011;16:747–50.
- Williams AJ, Ekins S, Tkachenko V. Towards a gold standard: regarding quality in public domain chemistry databases and approaches to improving the situation. Drug Discov Today. 2012;17:685–701.View ArticleGoogle Scholar
- Clark AM: The real reason for junk chemical data [http://cheminf20.org/2011/05/17/the-real-reason-for-junk-chemical-data] (accessed October 2014).
- Fant A, Muratov E, Fourches D, Sharpe D, Williams AJ, Tropsha A: On the Accuracy of Chemical Structures Found on the Internet. ACS San Diego, March 2012: [http://www.slideshare.net/AntonyWilliams/on-the-accuracy-of-chemical-structures-found-on-the-internet] (accessed October 2014)
- Williams AJ, Ekins S, Tkachenko V: Mining public domain data as a basis for drug repurposing. ACS Philadelphia, August 2012 [http://www.slideshare.net/AntonyWilliams/mining-public-domain-data-as-a-basis-for-drug-repurposing] (accessed October 2014)
- Golotvin SS, Vodopianov E, Lefebvre BA, Williams AJ, Spitzer TD. Automated structure verification based on 1H NMR prediction. Magn Reson Chem. 2006;44:524.
- Golotvin SS, Vodopianov E, Pol R, Lefebvre BA, Williams AJ, Rutkowske RD, et al. Automated structure verification based on a combination of 1D 1H NMR and 2D 1H–13C HSQC spectra. Magn Reson Chem. 2007;45:803–13.
- checkCIF: [http://journals.iucr.org/services/cif/checkcif.html] (accessed October 2014).
- PubChem, ChemSpider and ChEBI are regularly cited internet resources, which can be accessed via the URLs [http://pubchem.ncbi.nlm.nih.gov], [http://chemspider.com] and [http://www.ebi.ac.uk/chebi] respectively (accessed October 2014).
- Antony J. Williams, private communication: [http://www.chemspider.com/feedbackcurated.aspx]
- Slide 56: [http://www.slideshare.net/AntonyWilliams/crowdsourcing-chemistry-for-the-community-5-years-of-experiences] (accessed October 2014)
- The mobile app is available without charge for both iOS- and Android-based mobile devices. AppStore and Google Play links can be found on the main ChemSpider page: [http://chemspider.com] (accessed October 2014)
- ChemSpider JSON API. [http://www.chemspider.com/JSON.ashx] (accessed October 2014).
- ChemSpider Synthetic Pages. [http://cssp.chemspider.com] (accessed October 2014)
- Brecher J. Graphical representation of stereochemical configuration (IUPAC Recommendations 2006). Pure Appl Chem. 2006;78:1897–970.
- Brecher J. Graphical representation standards for chemical structure diagrams (IUPAC Recommendations 2008). Pure Appl Chem. 2008;80:277–410.
- Coles SJ, Frey JG, Bird CL, Whitby RJ, Day AE. First steps towards semantic descriptions of electronic laboratory notebook records. J Cheminf. 2013;5:52.
- Day AE, Coles SJ, Bird CL, Frey JG, Whitby RJ, Tkachenko VE, et al. ChemTrove: Enabling a generic ELN to support Chemistry through the use of transferable plug-ins and online data sources. J Chem Inf Model, ASAP Article, doi:10.1021/ci5005948.
- Clark AM, Labute P, Santavy M. 2D Structure Depiction. J Chem Inf Model. 2006;46:1107–23.
- Clark AM. Detection and Assignment of Common Scaffolds in Project Databases of Lead Molecules. J Med Chem. 2009;52:469–83.
- Clark AM. 2D Depiction of Fragment Hierarchies. J Chem Inf Model. 2010;50:37–46.
- Clark AM. Accurate Specification of Molecular Structures: The Case for Zero-Order Bonds and Explicit Hydrogen Counting. J Chem Inf Model. 2011;51:3149–57.
- Bachrach SM. InChI: a user’s perspective. J Cheminf. 2012;4:344.
- SketchEl SourceForge Page [http://sketchel.sourceforge.net] (accessed October 2014)
- SketchEl molecule format definition: [http://molmatinf.com/fmtsketcher.html] (accessed October 2014)
- Green Lab Notebook app: [http://molmatinf.com/products.html#gln] (accessed October 2014).
- SAR Table app: [http://molmatinf.com/products.html#sartable] (accessed October 2014).
- Mobile Molecular DataSheet app: [http://molmatinf.com/products.html#mmds] (accessed October 2014).
- Karapetyan K, Tkachenko V, Batchelor C, Sharpe D, Williams AJ. The RSC chemical validation and standardization platform, a potential path to quality-conscious databases. ACS Spring Meeting, New Orleans, April 2013 [http://www.slideshare.net/AntonyWilliams/the-rsc-chemical-validation-and-standardization-platform-a-potential-path-to-qualityconscious-databases] (accessed October 2014).
- Williams AJ, Harland L, Groth P, Pettifer S, Chichester C, Willighagen EL. Open PHACTS: Semantic interoperability for drug discovery. Drug Discov Today. 2012;17:1188–98.
- Batchelor C, Brenninkmeijer CYA, Chichester C, Davies M, Digles D, Dunlop I, et al. Scientific Lenses to Support Multiple Views over Linked Chemistry Data. The Semantic Web – ISWC. 2014;8796:98–113.
- Linstrom PJ, Mallard WG. NIST Chemistry WebBook, NIST Standard Reference Database Number 69. Gaithersburg MD, 20899: National Institute of Standards and Technology; 2014 [http://webbook.nist.gov].
- Cotton FA, Wilkinson G, Gaus PL. Basic Inorganic Chemistry. 3rd ed. New York: John Wiley; 1995. ISBN 978-0-471-50532-7.
- Theys RD, Dudley ME, Hossain MM. Recent chemistry of the η5-cyclopentadienyl dicarbonyl iron anion. Coord Chem Rev. 2009;253:180–234.
- Hosted by Molecular Materials Informatics, Inc. http://molmatinf.com
- Clark AM. Rendering Molecular Sketches for Publication Quality Output. Mol Inf. 2013;32:291–301.
- Dalby A, Nourse JG, Hounshell WD, Gushurst AKI, Grier DL, Leland BA, et al. Description of several chemical structure file formats used by computer programs developed at Molecular Design Limited. J Chem Inf Comp Sci. 1992;32:244.
- Townsend JA, Murray-Rust P. CMLLite: a design philosophy for CML. J Cheminf. 2011;3:39.
- Rzepa HS, Murray-Rust P, Whitaker BJ. The Application of Chemical Multipurpose Internet Mail Extensions (Chemical MIME) Internet Standards to Electronic Mail and World Wide Web Information Exchange. J Chem Inf Comp Sci. 1998;38:976–82.
- Ekins S, Clark AM, Williams AJ. Open Drug Discovery Teams: A Chemistry Mobile App for Collaboration. Mol Inf. 2012;31:585–97.
- Ekins S, Clark AM: Secure sharing with mobile cheminformatics apps [http://figshare.com/articles/Secure_sharing_with_mobile_cheminformatics_apps/95654] (accessed October 2014)
- Ekins S, Clark AM: Using The Open Drug Discovery Teams (ODDT) Mobile App To Bring Molecules & SAR From Behind Journal Paywalls [http://figshare.com/articles/Using_The_Open_Drug_Discovery_Teams_%28ODDT%29_Mobile_App_To_Bring_Molecules_&_SAR_From_Behind_Journal_Paywalls/93007] (accessed October 2014)
- Ekins S, Clark AM, Wood J: Raising Awareness of the Rare Disease Sanfilippo Syndrome C Using The Open Drug Discovery Teams (ODDT) Mobile App [http://figshare.com/articles/Raising_Awareness_of_the_Rare_Disease_Sanfilippo_Syndrome_C_Using_The_Open_Drug_Discovery_Teams_ODDT_Mobile_App/156522] (accessed October 2014)
- Ekins S, Clark AM: The Open Drug Discovery Teams (ODDT) Mobile App For Green Chemistry [http://figshare.com/articles/The_Open_Drug_Discovery_Teams_%28ODDT%29_Mobile_App_For_Green_Chemistry/92858] (accessed October 2014)
- Ekins S, Perlstein E. Ten Simple Rules of Live Tweeting at Scientific Conferences. PLoS Comput Biol. 2014. doi:10.1371/journal.pcbi.1003789.
- Living Molecules app: [http://molmatinf.com/products.html#livingmolecles] (accessed October 2014)
- Ekins S, Clark AM. Living Molecules App to create Ingredients lists [http://figshare.com/articles/Living_Molecules_App_to_create_Ingredients_lists/712593] (accessed October 2014).
- Clark AM, Bunin BA, Litterman NK, Schürer SC, Visser U. Fast and accurate semantic annotation of bioassays exploiting a hybrid of machine learning and user confirmation. PeerJ 2014, 524 doi:10.7717/peerj.524.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.