The FPS fingerprint format and chemfp toolkit
© Dalke; licensee BioMed Central Ltd. 2013
Published: 22 March 2013
During the GCC 2010 poster session I presented a draft version of the FPS format for storing dense binary fingerprints. That format is now stable, and is supported by RDKit, CACTVS, and other software. The chemfp package is a set of command-line tools and a Python library for fingerprint generation and high-speed Tanimoto search. It can extract pre-computed fingerprints from an SD tag or use OpenEye's OEChem, Open Babel, or RDKit to generate fingerprints. Search uses a combination of careful indexing, CPU-specific instructions (if available), and OpenMP. Nearest-100 similarity searches of PubChem-sized data sets take less than a second on a laptop, and Butina clustering of 2 million compounds takes about 6 hours on a 15-CPU node. In my poster I present the FPS format and chemfp package, and describe how the memory and performance requirements lead to the internal search architecture.
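To make the search idea concrete, here is a minimal pure-Python sketch of a Tanimoto threshold search over FPS-style records. It is not chemfp's implementation or API. The record layout is an assumption for illustration: header lines start with `#`, and each data line holds a hex-encoded fingerprint, a tab, and an identifier. The function names (`parse_fps`, `threshold_search`) and the sample records are hypothetical. The pruning step uses the popcount bound from Swamidass and Baldi (the similarity can never exceed min(|A|, |B|) / max(|A|, |B|)), applied here one fingerprint at a time; a real index could use the same bound to skip whole groups of fingerprints.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fps(lines: Iterable[str]) -> Iterator[Tuple[str, bytes]]:
    """Yield (identifier, fingerprint) pairs from FPS-style lines.

    Assumed layout: '#' header lines, then 'hexfp<TAB>identifier'.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header metadata
        hex_fp, identifier = line.split("\t")[:2]
        yield identifier, bytes.fromhex(hex_fp)


def popcount(fp: bytes) -> int:
    """Number of bits set in the fingerprint."""
    return bin(int.from_bytes(fp, "big")).count("1")


def tanimoto(a: bytes, b: bytes) -> float:
    """Tanimoto similarity: |A & B| / |A | B| (0.0 if both are empty)."""
    ia = int.from_bytes(a, "big")
    ib = int.from_bytes(b, "big")
    union = bin(ia | ib).count("1")
    if union == 0:
        return 0.0
    return bin(ia & ib).count("1") / union


def threshold_search(
    query: bytes, targets: Iterable[Tuple[str, bytes]], threshold: float
) -> List[Tuple[str, float]]:
    """Return (identifier, score) hits with score >= threshold, best first."""
    q_bits = popcount(query)
    hits = []
    for identifier, fp in targets:
        t_bits = popcount(fp)
        largest = max(q_bits, t_bits)
        # Popcount bound: the score cannot exceed min/max of the bit counts,
        # so skip the full comparison when even the bound is too low.
        if largest == 0 or min(q_bits, t_bits) / largest < threshold:
            continue
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((identifier, score))
    return sorted(hits, key=lambda hit: -hit[1])


# Hypothetical 16-bit records, for illustration only.
sample = [
    "#FPS1",
    "#num_bits=16",
    "0f0f\tmol_a",
    "0f00\tmol_b",
    "ffff\tmol_c",
]
targets = list(parse_fps(sample))
query = bytes.fromhex("0f0f")
hits = threshold_search(query, targets, 0.6)
print(hits)
```

With a threshold of 0.6, `mol_b` and `mol_c` are rejected by the bound alone, without computing the intersection, which is the kind of saving that indexing by popcount is meant to provide.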
- Swamidass SJ, Baldi P: Bounds and Algorithms for Fast Exact Searches of Chemical Fingerprints in Linear and Sublinear Time. J Chem Inf Model. 2007, 47: 302-317. 10.1021/ci600358f.
- Butina D: Unsupervised Data Base Clustering Based on Daylight's Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets. J Chem Inf Comput Sci. 1999, 39: 747-750. 10.1021/ci9803381.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.