In common with most scientific disciplines, drug discovery has in the last few years seen a huge increase in the amount of publicly available and proprietary information pertinent to it. This is owing to a variety of factors, including improvements in experimental technologies (high-throughput screening, microarray assays, etc.), improvements in computer technologies (particularly the Web), funded "grand challenge" projects (such as the Human Genome Project), an imperative to find more treatments for more diseases in an aging population, and various cultural shifts. This has been dubbed "data overload". Significant effort has therefore been put into the development of computational methods for exploiting this information for drug discovery, particularly through the fields of Bioinformatics and Cheminformatics. Of particular note are the provision of large-scale chemical and biological databases, such as PubChem, ChemSpider, the PDB, and KEGG, which house information about massive numbers of compounds, proteins, sequences, assays and pathways; the development of predictive models for biological activity and other biological endpoints; the data mining of chemical and biological data points; and the availability of journal articles in electronic form, together with associated indexing (such as in PubMed) and text mining of their content. Further, we are seeing an unprecedented amount of linking of information resources, for instance through Bio2RDF, Linking Open Drug Data, and the manual linking of database entries.
One of the next great challenges is how to use all of this information together in an intelligent, integrative way. We can think of these information resources as pieces of a jigsaw: each in its own right gives us useful insights, but getting the full picture requires the pieces to be put together in the right fashion. We thus need not only to aggregate the information, but also to be able to data mine it in an integrative fashion. A number of technologies are becoming available that assist with this: in particular, web services and Cyberinfrastructure allow straightforward, standardized interfaces to a variety of data sources, while Semantic Web languages such as XML, RDF and OWL permit the aggregation of data and the representation of meaning and relationships within it.
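The value of this kind of aggregation can be illustrated with a minimal sketch. Here, plain Python tuples stand in for RDF triples, and all identifiers (compound, assay, gene and disease names) are hypothetical, not drawn from any actual database. Two independent sources each hold part of a chain of relationships; only their union reveals the full compound-to-disease path:

```python
# RDF-style (subject, predicate, object) triples from two hypothetical sources.
# Source A (an assay database) links a compound to an assay and its target;
# source B (a pathway database) links that target to a disease.
source_a = {
    ("compound:123", "activeIn", "assay:456"),
    ("assay:456", "hasTarget", "gene:EGFR"),
}
source_b = {
    ("gene:EGFR", "implicatedIn", "disease:lung_cancer"),
}

# Aggregation is simply graph (set) union under shared identifiers.
merged = source_a | source_b

def paths_from(graph, start):
    """Follow predicate links transitively from a starting node."""
    found, frontier = [], [start]
    while frontier:
        node = frontier.pop()
        for s, p, o in graph:
            if s == node:
                found.append((s, p, o))
                frontier.append(o)
    return found

# Only in the merged graph does a compound -> gene -> disease chain appear;
# neither source alone contains the whole path.
for triple in paths_from(merged, "compound:123"):
    print(triple)
```

In a real Semantic Web setting the union would be performed over RDF graphs with shared URIs, but the principle is the same: relationships that cross sources become queryable only after aggregation.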
At Indiana University, we are tackling this problem from several angles. We recently developed a Cyberinfrastructure for cheminformatics, called ChemBioGrid, which has made a multitude of databases and computational tools freely available to the academic community for the first time, in a web service framework. Of particular import, we have been able to successfully index chemical structures in the abstracts of large numbers of scholarly publications through a collaboration with the Murray-Rust group at Cambridge. The infrastructure has spurred the development of several important client applications, including PubChemSR, and the application of Web 2.0 style "mashups" using userscripts for a variety of life-science applications. We are continuing to support and further develop this infrastructure.
With this infrastructure in place, we have investigated a variety of strategies for integrating the chemical and biological data from different sources in the infrastructure, in particular (i) the application of data mining techniques to chemical structure, biological activity and gene expression data in an integrated fashion, (ii) the development of a generalizable four-layer model (storage, interface, aggregation and smart client) for integrative data mining and knowledge discovery, and (iii) the aggregation of web services into automatically generated and ranked workflows. We are now investigating methods for applying these techniques on a larger scale, particularly to extract knowledge from large volumes of chemical and biological data that would not be found by searching single sources, and to use multiple independent sources to corroborate or contradict hypotheses. To do this, we are employing two key technologies: aggregate web services, which call multiple "atomic" web services and aggregate the results, and Semantic Web languages for the representation of integrated data.
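The first of these technologies can be sketched as follows. The atomic services here are local stubs with hypothetical names and canned results; in practice each would be an HTTP call to a source such as PubChem, a predictive model, or a literature index, and none of the identifiers shown is part of WENDI's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for "atomic" web services, each answering one question
# about a compound given as a SMILES string.
def similarity_search(smiles):
    return ["CID2244", "CID3672"]          # similar compounds

def predicted_activities(smiles):
    return ["anti-inflammatory"]           # model predictions

def literature_hits(smiles):
    return ["PMID:12345678"]               # indexed publications

ATOMIC_SERVICES = {
    "similar_compounds": similarity_search,
    "predicted_activities": predicted_activities,
    "literature": literature_hits,
}

def aggregate_query(smiles):
    """Fan the query out to every atomic service in parallel and
    merge the results, keeping each answer under its source's label
    so that provenance is preserved for downstream clients."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, smiles)
                   for name, fn in ATOMIC_SERVICES.items()}
        return {name: fut.result() for name, fut in futures.items()}

result = aggregate_query("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as SMILES
```

Keeping the results labelled by source, rather than flattening them, is what allows a later layer (e.g. an RDF representation) to state where each assertion came from.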
In this paper we describe one of the first products of this work, a tool called WENDI (Web Engine for Non-obvious Drug Information) that is designed to tackle a specific question: given a chemical compound of interest, how can we probe the potential biological properties of the compound using predictive models, databases, and the scholarly literature? In particular, how can we find non-obvious relationships between the compound and assays, genes, and diseases that cross over different types of data source? We present WENDI as a tool for aggregating information related to a compound so that these kinds of relationships can be identified.
Of course, the power of this kind of integration comes from identifying truly non-obvious yet real relationships between these entities. Our aim in this work is to allow a rapid differentiation between known relationships (i.e. those which a scientist with a reasonable understanding of the literature in a field could be expected to already know) and unknown relationships (those which could not be found in literature closely associated with a field, or which are not part of the 'art' of the field). There is clearly some fuzziness in this distinction, and this makes evaluating a tool like WENDI for non-obviousness difficult. However, we do present it as a useful tool based on qualitative feedback from existing users, and we are currently devising a more quantitative evaluation (as described in the concluding section).