Substructure searching is relatively slow, and the time required to compare two molecules scales with the number of atoms in each molecule. Molecular fingerprints were invented to reduce this computation time. Two key aspects make fingerprints efficient: first, they have a fixed length, so that the time to compare two molecules is independent of the size of the two structures; second, the fingerprint of a substructure always matches the fingerprint of any molecule that has that substructure, meaning that every bit set for the substructure is also set for the molecule.
In this section we will see two fingerprint types available in the CDK: a substructure-based fingerprint and a path-based fingerprint. Before I explain how these fingerprints are created, we will first look at the BitSet class that the CDK uses to represent them. Consider this code:
BitSetDemo
If we analyze the output, we see that all set bits are listed, and that all other bits are not:
BitSetDemo
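As a minimal sketch of this behaviour, and not the BitSetDemo listing itself, the following uses only the standard java.util.BitSet class. The class name BitSetSketch, the helper example(), and the bit indices 3 and 7 are arbitrary choices made for illustration.

```java
import java.util.BitSet;

// Illustrative sketch: switch on a few bits and inspect the result.
class BitSetSketch {

    // Builds a bit set with bits 3 and 7 switched on (indices chosen arbitrarily).
    static BitSet example() {
        BitSet bits = new BitSet(10); // initial capacity hint, not a hard limit
        bits.set(3);
        bits.set(7);
        return bits;
    }

    public static void main(String[] args) {
        BitSet bits = example();
        System.out.println(bits);               // only the set bits are listed: {3, 7}
        System.out.println(bits.get(3));        // true
        System.out.println(bits.get(4));        // false
        System.out.println(bits.cardinality()); // number of set bits: 2
    }
}
```

Note that BitSet indices start at zero, and that printing the object shows the indices of the set bits only.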
Let us now consider a simple substructure fingerprint of length four with the following bit definitions:
- bit 1: molecule contains a carbon
- bit 2: molecule contains a nitrogen
- bit 3: molecule contains an oxygen
- bit 4: molecule contains a chlorine
Let's call this fingerprinter SimpleFingerprinter:
SimpleFingerprinter
We can then calculate the fingerprints for ethanol and benzene:
SimpleFingerprintDemo
and we get these bit sets:
SimpleFingerprintDemo
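The idea behind these four bit definitions can be sketched without the CDK. The sketch below is not the SimpleFingerprinter class itself: it assumes, purely for illustration, that a molecule is given as a plain list of element symbols for its heavy atoms, and the class name ElementFingerprintSketch is hypothetical. To keep the numbering identical to the bit definitions above, index 0 of the BitSet is left unused.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Illustrative sketch: one bit per element, following the four bit definitions above.
class ElementFingerprintSketch {

    // Element for bit 1, bit 2, bit 3, and bit 4, in that order.
    private static final String[] ELEMENTS = {"C", "N", "O", "Cl"};

    // Assumption: the molecule is passed as the element symbols of its heavy atoms.
    static BitSet fingerprint(List<String> atomSymbols) {
        BitSet bits = new BitSet(ELEMENTS.length + 1);
        for (int i = 0; i < ELEMENTS.length; i++) {
            if (atomSymbols.contains(ELEMENTS[i])) {
                bits.set(i + 1); // index 0 stays unused so that indices match the text
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        List<String> ethanol = Arrays.asList("C", "C", "O");
        List<String> benzene = Arrays.asList("C", "C", "C", "C", "C", "C");
        System.out.println(fingerprint(ethanol)); // carbon and oxygen: {1, 3}
        System.out.println(fingerprint(benzene)); // carbon only: {1}
    }
}
```

The fingerprint has the same length for both molecules, even though benzene has twice as many heavy atoms as ethanol.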
Now, we can replace the presence of a particular atom by the presence of a substructure, such as a phenyl or a carbonyl group. We have then defined a substructure fingerprint.
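The screening property mentioned at the start of this section follows directly from such bit definitions: if a query sets a bit that the target does not, the target cannot contain the query, and the slow substructure search can be skipped. A minimal sketch of that test is given below. The class name FingerprintScreenSketch and the helpers mayContain and bitsOf are hypothetical, and the bit numbers are those of the four-bit example above.

```java
import java.util.BitSet;

// Illustrative sketch: use fingerprints as a cheap pre-filter before substructure search.
class FingerprintScreenSketch {

    // Returns true when every bit set in the query is also set in the target.
    // A true result only means the target is still a candidate; a full
    // substructure search is needed to confirm an actual match.
    static boolean mayContain(BitSet target, BitSet query) {
        BitSet missing = (BitSet) query.clone(); // copy, so the query is not modified
        missing.andNot(target);                  // keep query bits absent from the target
        return missing.isEmpty();
    }

    // Small helper to build a bit set from a list of indices.
    static BitSet bitsOf(int... indices) {
        BitSet bits = new BitSet();
        for (int index : indices) {
            bits.set(index);
        }
        return bits;
    }

    public static void main(String[] args) {
        BitSet ethanol = bitsOf(1, 3); // carbon and oxygen
        BitSet benzene = bitsOf(1);    // carbon only
        BitSet query = bitsOf(1, 3);   // a query containing carbon and oxygen
        System.out.println(mayContain(ethanol, query)); // true
        System.out.println(mayContain(benzene, query)); // false
    }
}
```

The comparison touches a fixed number of bits, so its cost does not depend on how many atoms the two molecules have.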
The CDK has several kinds of fingerprints, including path-based fingerprints (Fingerprinter and HybridizationFingerprinter), a MACCS fingerprint (MACCSFingerprinter) [Q34160151], and the PubChem fingerprint (PubChemFingerprinter). These fingerprints have been used for various tasks, including ligand classification [Q42704791], and in databases like BRENDA [Q24599948] and TIN [Q33874102].
One substructure-based fingerprinter is the MACCSFingerprinter, which partly implements the MACCS fingerprint specification [Q34160151]. The substructures are defined as SMARTS substructure specifications, inherited from RDKit (http://rdkit.org/). For this fingerprint, the implicit hydrogen counts must be set first:
MACCSFingerprint
The object returned by the getBitFingerprint method is an IBitFingerprint, which we can convert into a Java BitSet with the asBitSet method:
MACCSFingerprint
The CDK also has an implementation of the circular ECFP and FCFP fingerprints [Q29616639]. These were developed by Alex M. Clark at Collaborative Drug Discovery, Inc. in the CircularFingerprinter [Q27902272]. It implements each in four variants: ECFP0, ECFP2, ECFP4, ECFP6, FCFP0, FCFP2, FCFP4, and FCFP6. The code is quite similar to that for the other fingerprints, but we do have to indicate which variant we want:
ECFPFingerprint
Again we get an IBitFingerprint, which results in a BitSet of bits:
ECFPFingerprint