The Open Spectral Database: an open platform for sharing and searching spectral data
© The Author(s) 2016
Received: 14 March 2016
Accepted: 4 October 2016
Published: 14 October 2016
A number of websites make spectral data available for download (typically as JCAMP-DX text files), and one (ChemSpider) also allows users to contribute spectral files. Because the data is spread across these sites, searching for and retrieving it can be time consuming, and the data can be difficult to reuse if it is compressed in the JCAMP-DX file. What is needed is a single resource that allows submission of JCAMP-DX files, export of the raw data in multiple formats, searching based on multiple chemical identifiers, and is open in terms of license and access. To address these issues, a new online resource called the Open Spectral Database (OSDB, http://osdb.info/) has been developed and is now available. Built using open source tools, using open code (hosted on GitHub), providing open data, and open to community input about design and functionality, the OSDB is available for anyone to submit spectral data, making it searchable and available to the scientific community. This paper details the concept and coding, internal architecture, export formats, Representational State Transfer (REST) Application Programming Interface (API) and options for submission of data.
The OSDB website went live in November 2015. Concurrently, the GitHub repository was made available at https://github.com/stuchalk/OSDB/, and is open for collaborators to join the project, submit issues, and contribute code.
The combination of a scripting environment (PHPStorm), a PHP Framework (CakePHP), a relational database (MySQL) and a code repository (GitHub) provides all the capabilities to easily develop REST based websites for ingestion, curation and exposure of open chemical data to the community at all levels. It is hoped this software stack (or equivalent ones in other scripting languages) will be leveraged to make more chemical data available for both humans and computers.
Keywords: Spectral data; REST API; Open science; Open data; JCAMP-DX; XML; Scientific data model
Tools to make research data freely available are vitally important to the open science movement. Such tools must play well with both humans and computers because of the importance of data import/export into other systems for analysis, verification, and data mining. One important data type in this area is instrumental spectra, used for identification and analysis in a variety of application areas. Many websites (e.g. NIST WebBook, ChemSpider, University of the West Indies (Chemistry)) contain spectral files in the current de facto data standard, the Joint Committee on Atomic and Molecular Physical Data - Data Exchange format (JCAMP-DX) [4–7], and this format can be exported from the majority of instrument software available today. However, the usefulness of spectral data in JCAMP-DX format is somewhat limited because the specification is over 30 years old and, if the data is saved using compression, it is difficult to transfer to other software. Providing a mechanism to convert legacy data in JCAMP-DX format is an important activity in and of itself, as the community needs spectral data for comparison/standardization in many different applications.
The website has been developed using open-source software (as far as possible) and open standards, and is openly available through the GitHub code repository. The website is built in the Representational State Transfer (REST) style and has a documented Application Programming Interface (API) for computer-based discovery and export.
The foundation of the OSDB website is the common Apache, MySQL, and PHP software stack, which can be installed on any computer system as LAMP (for Linux), WAMP (for Windows) or MAMP (for OS X/Mac). Coding was done using the PHPStorm Integrated Development Environment (IDE) (free for faculty and students), and scripts are written in PHP using the CakePHP object-oriented framework. Because this is standard open-source software, developers creating data websites can deploy on their own physical server, publish using one of a number of online hosting sites, or use a virtual machine.
Note that the function in Fig. 1 has a variable ($format) in the function arguments; when no value is passed, it defaults to an empty string. When $format is tested as being equal to an empty string, the $this->set command completes and the HTML file ‘index.ctp’ is rendered. However, if the URL “/systems/index/XML” is accessed, the same call is made except that the data is not sent to the view and is instead converted to an array, reformatted as XML ($this->Export->xml()) and passed to the browser. Note that the action ‘index’ must be included in the URL so that CakePHP does not try to run the action ‘XML’ (i.e. “/systems/XML” will cause an error).
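The format switch described here can be sketched outside CakePHP. The following Python fragment is illustrative only and is not the OSDB code: the function names (index, to_xml), the sample records and the HTML rendering are assumptions made for the example. It shows a single action that returns an HTML view when no format is given and an XML serialization of the same data when the format argument is "XML".

```python
import xml.etree.ElementTree as ET

# Hypothetical records standing in for rows of a 'systems' table.
SYSTEMS = [{"id": 1, "name": "ethanol"}, {"id": 2, "name": "benzene"}]


def to_xml(rows, root_tag="systems", row_tag="system"):
    """Serialize a list of flat dicts as an XML string."""
    root = ET.Element(root_tag)
    for row in rows:
        item = ET.SubElement(root, row_tag)
        for key, value in row.items():
            ET.SubElement(item, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


def index(fmt=""):
    """Return (content_type, body); an empty fmt selects the HTML view."""
    if fmt == "":
        # Default branch: hand the data to the (here trivial) HTML view.
        items = "".join(f"<li>{row['name']}</li>" for row in SYSTEMS)
        return "text/html", f"<ul>{items}</ul>"
    if fmt.upper() == "XML":
        # Export branch: same data, bypassing the view.
        return "application/xml", to_xml(SYSTEMS)
    raise ValueError(f"unsupported format: {fmt!r}")
```

Keeping the data retrieval identical in both branches means that a new export format only requires one more branch, not a new action.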
Spectral file format
Pseudo-digits for ASDF
1. ASCII digits
2. Positive SQZ digits
3. Negative SQZ digits
4. Positive DIF digits
5. Negative DIF digits
6. Positive DUP digits
Example of ASDF formats (only Y data points shown)
FIX form: (22 chars) 1 2 3 3 2 1 0 -1 -2 -3
PAC form: (19 chars) 1+2+3+3+2+1+0-1-2-3
SQZ form: (10 chars)
DIF form: (10 chars)
DIFDUP form: (7 chars)
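To make the relationship between these forms concrete, the following Python sketch decodes a string of Y values written in any of the ASDF forms. It is a minimal illustration, not the OSDB decoder. The pseudo-digit tables and the encoded test strings are assumptions based on the published JCAMP-DX specification as recalled here; they are not taken from this paper. The sketch also ignores the leading X value and the trailing Y check value that real DIF-encoded lines carry.

```python
import re

# Pseudo-digit tables (assumed from the JCAMP-DX specification).
SQZ = {c: i for i, c in enumerate("@ABCDEFGHI")}            # 0..9
SQZ.update({c: -(i + 1) for i, c in enumerate("abcdefghi")})  # -1..-9
DIF = {c: i for i, c in enumerate("%JKLMNOPQR")}            # 0..9
DIF.update({c: -(i + 1) for i, c in enumerate("jklmnopqr")})  # -1..-9
DUP = {c: i + 1 for i, c in enumerate("STUVWXYZs")}         # 1..9

# A token is a signed ASCII number, or a pseudo-digit plus trailing digits.
TOKEN = re.compile(r"[+-]?\d+|[@A-Ia-i%J-Rj-rS-Zs]\d*")


def _number(lead, rest):
    """Combine a signed leading digit with any trailing ASCII digits."""
    sign = -1 if lead < 0 else 1
    return sign * int(f"{abs(lead)}{rest}")


def decode_asdf(text):
    """Expand FIX, PAC, SQZ, DIF or DIFDUP text into a list of integers."""
    values = []
    last_diff = None  # difference applied by the previous token, if any
    for tok in TOKEN.findall(text):
        head, rest = tok[0], tok[1:]
        if head in SQZ:                      # absolute value
            values.append(_number(SQZ[head], rest))
            last_diff = None
        elif head in DIF:                    # difference from previous value
            last_diff = _number(DIF[head], rest)
            values.append(values[-1] + last_diff)
        elif head in DUP:                    # repeat the previous token
            count = int(f"{DUP[head]}{rest}")
            for _ in range(count - 1):
                values.append(values[-1] + (last_diff or 0))
        else:                                # plain (FIX or PAC) number
            values.append(int(tok))
            last_diff = None
    return values
```

Under these assumptions, decode_asdf("1 2 3 3 2 1 0 -1 -2 -3") and decode_asdf("1+2+3+3+2+1+0-1-2-3") return the same ten values, as do SQZ, DIF and DIFDUP encodings of that sequence.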
Clean: remove non-ASCII characters and extra spaces at the start/end of lines
Uncomment: remove (and save) comments (indicated by $$)
Get LDRs: detect LDRs in the file
Validate: check the LDRs to identify if the file is valid JCAMP-DX
Standardize: standardize data in certain LDR fields
Decompress: expand data in any of the ASDF formats and calculate the respective X values
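The first three steps, and the X value calculation from the last step, can be sketched as follows. This Python fragment is an illustration under stated assumptions, not the OSDB implementation. The function names are hypothetical, and the sample file lines are invented. The label normalization (ignoring case, spaces, dashes, slashes and underscores) and the evenly spaced X values derived from FIRSTX, LASTX and NPOINTS follow the JCAMP-DX conventions as understood here. Validation and standardization are omitted.

```python
import re


def clean(lines):
    """Drop non-ASCII characters and leading/trailing whitespace."""
    return [ln.encode("ascii", "ignore").decode("ascii").strip() for ln in lines]


def uncomment(lines):
    """Split each line at '$$'; return (content lines, saved comments)."""
    content, comments = [], []
    for ln in lines:
        text, sep, comment = ln.partition("$$")
        if sep:
            comments.append(comment.strip())
        if text.strip():
            content.append(text.strip())
    return content, comments


LDR = re.compile(r"^##([^=]*)=(.*)$")


def get_ldrs(lines):
    """Collect '##LABEL=value' records; other lines extend the last record."""
    ldrs, label = {}, None
    for ln in lines:
        match = LDR.match(ln)
        if match:
            # Normalize the label so 'JCAMP-DX' and 'jcamp dx' compare equal.
            label = re.sub(r"[\s\-/_]", "", match.group(1)).upper()
            ldrs[label] = match.group(2).strip()
        elif label is not None:
            ldrs[label] = (ldrs[label] + "\n" + ln).strip()
    return ldrs


def x_values(first_x, last_x, n_points):
    """Evenly spaced X values for data stored as Y values only."""
    step = (last_x - first_x) / (n_points - 1)
    return [first_x + i * step for i in range(n_points)]


# Invented sample lines (not a complete or valid JCAMP-DX file).
raw = [
    "  ##TITLE= Ethanol\u00a0 ",
    "##JCAMP-DX= 4.24 $$ written by a hypothetical exporter",
    "##FIRSTX= 400",
    "##LASTX= 404",
    "##NPOINTS= 5",
    "##END=",
]
content, comments = uncomment(clean(raw))
ldrs = get_ldrs(content)
xs = x_values(float(ldrs["FIRSTX"]), float(ldrs["LASTX"]), int(ldrs["NPOINTS"]))
```

Saving the comments separately, as the Uncomment step describes, keeps information that some exporters place after $$ without letting it interfere with LDR detection.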
Spectral data in the JCAMP file is stored both in its <raw> state and in the <pro>(cessed) expanded format. Any discrepancies between the data in the original JCAMP file and the processed data are annotated in the errors element, with details of the issues.
Graphical user interface
In addition to the index view for spectra, users can search for a specific compound using the search box at the top of the page. The search is performed over the ‘identifier’ database table, which is populated from the PubChem PUG REST interface and contains names, SMILES, PubChem CIDs, InChI strings and InChIKeys.
The basic REST website provides a mechanism to add data to the repository, access it in a standardized way and download the data in multiple formats. However, the key to making the data truly accessible is integration with other platforms and expanded search capabilities. These features have been added to the OSDB website through the following additions.
PubChem lookup for chemical metadata
<input specification>/<operation specification>/[<output specification>][?<operation_options>]
The function cid has two arguments, $name and $debug (used to check that the code is working correctly). First, access to the CakePHP HttpSocket is established, and the URL is constructed from the base PubChem API address, the compound name ($name), and ‘/synonym/JSON’. The URL is requested (equivalent to a web browser request) and the resulting JSON data is converted to a PHP array, $syns. The code then checks for errors in the response and retrieves the CID for the compound. The value of the CID is then returned to the calling function. Other functions in the Chemical class return all the synonyms for a compound, and property data for a compound.
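The same lookup can be sketched in Python. This is a hedged illustration, not a translation of the Chemical class: the function names (synonyms_url, extract_cid) are hypothetical, and the response layout (InformationList, Information, CID, Fault) and the operation name "synonyms" reflect the PUG REST documentation as understood here. The network request is left out so that the example runs offline; an abbreviated, hand-written response stands in for the live result.

```python
import json
from urllib.parse import quote

# Base address of the PubChem PUG REST service.
PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def synonyms_url(name):
    """Build the synonyms request URL for a compound name."""
    return f"{PUG_BASE}/compound/name/{quote(name, safe='')}/synonyms/JSON"


def extract_cid(payload):
    """Return the first CID in a decoded response, or None on an error."""
    if "Fault" in payload:
        # PUG REST reports problems (e.g. unknown name) in a Fault object.
        return None
    info = payload.get("InformationList", {}).get("Information", [])
    return info[0].get("CID") if info else None


# Hypothetical, abbreviated response used instead of a live request.
sample = json.loads(
    '{"InformationList": {"Information": '
    '[{"CID": 2244, "Synonym": ["aspirin", "acetylsalicylic acid"]}]}}'
)
cid = extract_cid(sample)
```

In a live setting the URL from synonyms_url would be fetched (for example with urllib.request.urlopen) and the decoded JSON passed to extract_cid. Separating URL construction from response parsing keeps both parts testable without network access.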
Retrieve Wikidata ID
Generate the Splash for a spectrum
The OSDB website, as outlined above, provides access to spectral data and its metadata in a standardized way. However, it is important to point out that what can be done with the data is up to the user. This applies to the OSDB website as well as after spectra have been downloaded. For instance, the website does not currently allow for searching the raw data/metadata across all spectra (all of it is in the database but can only be found by searching for a complete spectrum).
In order to make this site truly useful the code and data of the project should be made openly available. In this way the user is not limited to the functionality that the original developers envisioned but can develop their own functions/features, enhance the integration of the site, and output the data in new formats for new web or mobile applications. In addition, the openness of the project means it can be used in education as a tool to develop the next generation of cheminformaticians—potentially building their own website from the source code as a course project.
For all these reasons (and many more) the project is available as a free download on GitHub. GitHub is a hosting service for the well-respected Git source code repository system. Git allows multiple developers to write code for one project and centrally coordinate version control, patching, extension and attribution. GitHub does this through a website and adds features like issue tracking, collaborative (discussion based) code review, and team management. Anyone can download the code, work on an enhancement or issue, submit updates, fix issues, and discuss project goals, timelines, and features. The basic site has been built and users can let the developers (that’s all of us) know what needs to be added, changed or removed, and implement it themselves. Readers are encouraged to check out the ‘Projects’ page for ideas on additional features/enhancements that they could work on.
This paper describes a new project to support open spectral research data on the web. Anyone can contribute to the content, to the code, to the concept, or to the management/vision. This paper also outlines the components needed to put together such a project and it can be used as a template to build other websites with different functionality and/or different types of chemical data.
The current version of the OSDB is just a starting point. There are many additional features one can envision for the site, and it is hoped that readers have ideas of their own and add them. Open source code has become a mainstay in the computing world. With the tools, concepts and frameworks outlined in this paper, open source research data will hopefully become a mainstay of the scientific community.
Abbreviations
API: Application Programming Interface
ASDF: ASCII Squeezed Difference Form
CSS: Cascading Style Sheet
GUI: graphical user interface
HTML: Hypertext Markup Language
LAMP: Linux, Apache, MySQL, and PHP
JCAMP-DX: Joint Committee on Atomic and Molecular Physical Data - Data Exchange
LDR: Linked Data Record
MAMP: Mac, Apache, MySQL, and PHP
NMR: nuclear magnetic resonance
OOP: object oriented programming
OSDB: Open Spectral Database
OWL: web ontology language
REST: representational state transfer
SDM: scientific data model
SMILES: simplified molecular-input line-entry system
SPARQL: SPARQL protocol and RDF query language
SQL: structured query language
URI: uniform resource identifier
WAMP: Windows, Apache, MySQL, and PHP
XML: extensible markup language
Thanks to Tony Williams for encouraging me to start this project. Thanks to J. C. Bradley for pioneering open science and paving the way for projects like this to be conceptualized.
The author declares that he has no competing interests.
Availability of data and materials
All data associated with this project is available at https://github.com/stuchalk/OSDB.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- NIST Materials Measurement Laboratory (2016) NIST chemistry WebBook. National Institute for Standards and Technology, Gaithersburg. http://webbook.nist.gov/. Accessed 19 July 2016
- Williams T, Tkachenko V (2016) ChemSpider. Royal Society of Chemistry, Cambridge. http://www.chemspider.com/. Accessed 19 July 2016
- Lancashire R (2016) JCAMP-DX sample files. The University of the West Indies, Mona. http://wwwchem.uwimona.edu.jm/spectra/index.html. Accessed 19 July 2016
- IUPAC (2016) IUPAC subcommittee on electronic data standards. http://jcamp-dx.org/. Accessed 1 Mar 2016
- Davies AN, Lampen P (1993) JCAMP-DX for NMR. Appl Spectrosc. doi:10.1366/0003702934067874
- Grasselli JG (1991) JCAMP-DX, a standard format for exchange of infrared-spectra in computer readable form. Pure Appl Chem. doi:10.1351/pac199163121781
- Lampen P, Hillig H, Davies AN, Linscheid M (1994) JCAMP-DX for mass-spectrometry. Appl Spectrosc. doi:10.1366/0003702944027840
- Chalk S (2016) The Open Spectral Database. University of North Florida, Jacksonville. http://osdb.info/. Accessed 1 Mar 2016
- Sporny M, Longley D, Kellogg G, Lanthaler M, Lindström N (2016) JSON-LD 1.0—a JSON-based Serialization for Linked Data. The World Wide Web Consortium. http://www.w3.org/TR/json-ld/. Accessed 1 Mar 2016
- Chalk S (2016) SciData—a scientific data model. University of North Florida, Jacksonville. http://stuchalk.github.io/scidata/. Accessed 1 Mar 2016
- Fielding RT, Taylor RN (2002) Principled design of the modern Web architecture. ACM Trans Internet Technol 2(2):115–150. doi:10.1145/514183.514185
- Mann A (2014) What’s an API? A beginner’s guide to the application programming interface. http://www.slideshare.net/CAinc/whats-an-api-a-beginners-guide-to-the-application-programming-interface. Accessed 23 June 2016
- ASF (2016) The Apache HTTP server project. The Apache Software Foundation (ASF), Forest Hill. http://httpd.apache.org/. Accessed 1 Mar 2016
- Oracle (2016) MySQL open-source database. Oracle Corporation. http://www.mysql.com/. Accessed 1 Mar 2016
- The PHP Group (2016) PHP: hypertext preprocessor. The PHP Group. http://php.net/. Accessed 1 Mar 2016
- JetBrains (2016) PHPStorm. PHP IDE. https://www.jetbrains.com/phpstorm/. Accessed 1 Mar 2016
- CSF (2016) CakePHP: the rapid PHP development framework. Cake Software Foundation (CSF). http://cakephp.org/. Accessed 1 Mar 2016
- Fowler M (2006) GUI architectures: model–view–controller. ModelViewController. http://martinfowler.com/eaaDev/uiArchs.html. Accessed 23 June 2016
- Oracle (2015) Lesson: object-oriented programming concepts. https://docs.oracle.com/javase/tutorial/java/concepts/. Accessed 23 June 2016
- Chalk S (2016) The Open Spectral Database—system index. University of North Florida, Jacksonville. http://osdb.info/systems. Accessed 1 Mar 2016
- Baumbach JI, Davies AN, Lampen P, Schmidt H (2001) JCAMP-DX. A standard format for the exchange of ion mobility spectrometry data (IUPAC recommendations 2001). Pure Appl Chem. doi:10.1351/pac200173111765
- Cammack R, Fann Y, Lancashire RJ, Maher JP, McIntyre PS, Morse R (2006) JCAMP-DX for electron magnetic resonance (EMR). Pure Appl Chem. doi:10.1351/pac200678030613
- Woollett B, Klose D, Cammack R, Janes RW, Wallace BA (2012) JCAMP-DX for circular dichroism spectra and metadata (IUPAC Recommendations 2012). Pure Appl Chem. doi:10.1351/PAC-REC-12-02-03
- Chalk S (2016) SciData: a data model and ontology for semantic representation of scientific data. J Cheminform
- BCT (2016) Bootstrap. Bootstrap Core Team. http://getbootstrap.com/. Accessed 1 Mar 2016
- Çelik T, Lilley C, Baron LD, Pemberton S, Pettit B (2016) Cascading Style Sheets working group. The World Wide Web Consortium. https://www.w3.org/TR/css3-color/. Accessed 1 Mar 2016
- Hickson I, Berjon R, Faulkner S, Leithead T, Doyle Navara E, O’Connor E, Pfeiffer S (2016) HTML5: a vocabulary and associated APIs for HTML and XHTML. The World Wide Web Consortium. http://www.w3.org/TR/html5/. Accessed 1 Mar 2016
- Chalk S (2016) The Open Spectral Database—spectra index. University of North Florida, Jacksonville. http://osdb.info/spectra. Accessed 1 Mar 2016
- Chalk S (2016) The Open Spectral Database—compound index. University of North Florida, Jacksonville. http://osdb.info/compounds. Accessed 1 Mar 2016
- Chalk S (2016) The Open Spectral Database—analytical technique index. University of North Florida, Jacksonville. http://osdb.info/techniques. Accessed 1 Mar 2016
- Chalk S (2016) The Open Spectral Database—collection index. University of North Florida, Jacksonville. http://osdb.info/collections. Accessed 1 Mar 2016
- NLM (2016) PubChem Power User Gateway (PUG) REST interface documentation. https://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html. Accessed 1 Mar 2016
- NLM (2016) PubChem. National Institutes of Health, Bethesda. https://pubchem.ncbi.nlm.nih.gov/. Accessed 19 July 2016
- WikiMedia (2016) Wikidata. https://www.wikidata.org/. Accessed 1 Mar 2016
- Chalk S (2016) The Open Spectral Database—API. University of North Florida, Jacksonville. http://osdb.info/api. Accessed 1 Mar 2016
- SmartBear (2016) Swagger API framework. Open API Initiative (OAI). http://swagger.io/. Accessed 19 July 2016
- UC Davis (2016) Splash—the spectral hash identifier. http://splash.fiehnlab.ucdavis.edu/. Accessed 1 Mar 2016
- Chalk S (2016) Open Spectral Database Github repository. GitHub Inc, San Francisco. https://github.com/stuchalk/OSDB/. Accessed 1 Mar 2016
- Harris S, Seaborne A (2016) SPARQL query language for RDF. The World Wide Web Consortium. https://www.w3.org/TR/sparql11-query/. Accessed 19 July 2016
- Metabolomics Society (2016) Metabolomics Hackathon. Metabolomics Society. http://metabolomics2015.org/index.php/program/hackathon. Accessed 1 Mar 2016
- SFC (2016) Git distributed version control system. Software Freedom Conservancy. https://git-scm.com/. Accessed 1 Mar 2016
- Chalk S (2016) The Open Spectral Database—projects. University of North Florida, Jacksonville. http://osdb.info/pages/projects. Accessed 1 Mar 2016