Substructure Searching

Fingerprints

Substructure searching is relatively slow, and the time required to compare two molecules scales with the number of atoms in each molecule. To reduce the computation time, molecular fingerprints were invented. Two key aspects make fingerprints efficient: first, they have a fixed length, so the time to compare two molecules is independent of the size of the two structures; second, every bit set in the fingerprint of a substructure is also set in the fingerprint of any molecule that contains that substructure.
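The second property can be turned into a cheap pre-filter. The following is a minimal sketch, assuming fingerprints are plain java.util.BitSet objects; the class name FingerprintScreen, the method name mayContain, and the bit positions are hypothetical and chosen only for illustration:

```java
import java.util.BitSet;

final class FingerprintScreen {

    /**
     * Returns true when every bit set in the query fingerprint is also set
     * in the target fingerprint, meaning the target may contain the query.
     */
    static boolean mayContain(BitSet target, BitSet query) {
        BitSet missing = (BitSet) query.clone(); // leave the caller's bit set untouched
        missing.andNot(target);                  // keep only query bits absent from the target
        return missing.isEmpty();
    }

    public static void main(String[] args) {
        // Illustrative fingerprints; the bit positions carry no chemical meaning here.
        BitSet query = new BitSet(8);
        query.set(0);
        query.set(2);

        BitSet target = new BitSet(8);
        target.set(0);
        target.set(2);
        target.set(5);

        System.out.println(mayContain(target, query)); // true
        System.out.println(mayContain(query, target)); // false, bit 5 is missing
    }
}
```

A molecule that passes this screen is only a candidate: different structures can set the same bits, so the full substructure match still has to be run on the molecules that pass.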

In this section we will see two fingerprint types available in the CDK: a substructure-based fingerprint and a path-based fingerprint. Before explaining how these fingerprints are created, we first look at the BitSet class that the CDK uses to represent them. Consider this code:

BitSetDemo

If we analyze the output, we see that all set bits are listed, and that all other bits are not:

BitSetDemo
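As a minimal sketch of this behavior, assuming nothing beyond java.util.BitSet, the following sets three arbitrarily chosen bits and prints the bit set; the class name BitSetSketch and the bit positions are illustrative:

```java
import java.util.BitSet;

final class BitSetSketch {

    /** Builds a bit set with bits 1, 3, and 5 set. */
    static BitSet example() {
        BitSet bits = new BitSet(10); // 10 is only an initial size hint
        bits.set(1);
        bits.set(3);
        bits.set(5);
        return bits;
    }

    public static void main(String[] args) {
        BitSet bits = example();
        System.out.println(bits);               // prints {1, 3, 5}
        System.out.println(bits.get(3));        // true, bit 3 was set
        System.out.println(bits.get(4));        // false, bit 4 was never set
        System.out.println(bits.cardinality()); // 3, the number of set bits
    }
}
```

The printed form lists only the indices of the set bits, and indices start at zero.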

Let us now consider a simple substructure fingerprint of length four with the following bit definitions:

bit 1: molecule contains a carbon
bit 2: molecule contains a nitrogen
bit 3: molecule contains an oxygen
bit 4: molecule contains a chlorine

Let's call this fingerprinter SimpleFingerprinter:

SimpleFingerprinter

We can then calculate the fingerprints for ethanol and benzene:

SimpleFingerprintDemo

and we get these bit sets:

SimpleFingerprintDemo
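The four bit definitions can be sketched in plain Java. This is a simplified stand-in and not the CDK-based SimpleFingerprinter: it assumes a molecule is given as a list of element symbols instead of a CDK atom container, and it maps bits 1 to 4 onto the zero-based BitSet indices 0 to 3. The class name SimpleFingerprintSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

final class SimpleFingerprintSketch {

    static final int SIZE = 4; // fixed fingerprint length

    /** Sets one bit per element type found in the list of element symbols. */
    static BitSet fingerprint(List<String> elementSymbols) {
        BitSet result = new BitSet(SIZE);
        for (String symbol : elementSymbols) {
            switch (symbol) {
                case "C":  result.set(0); break; // bit 1: carbon
                case "N":  result.set(1); break; // bit 2: nitrogen
                case "O":  result.set(2); break; // bit 3: oxygen
                case "Cl": result.set(3); break; // bit 4: chlorine
                default:   break;                // other elements set no bit
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Heavy atoms only; hydrogens would not set any bit anyway.
        List<String> ethanol = Arrays.asList("C", "C", "O");
        List<String> benzene = Arrays.asList("C", "C", "C", "C", "C", "C");

        System.out.println("Ethanol: " + fingerprint(ethanol)); // Ethanol: {0, 2}
        System.out.println("Benzene: " + fingerprint(benzene)); // Benzene: {0}
    }
}
```

Both fingerprints have the same length regardless of the number of atoms, which is what makes comparing them independent of molecule size.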

If we now replace the presence of a particular atom by the presence of a substructure, such as a phenyl or a carbonyl group, we have defined a substructure fingerprint.

The CDK has several kinds of fingerprints, including path-based fingerprints (Fingerprinter and HybridizationFingerprinter), a MACCS fingerprint (MACCSFingerprinter) [Q34160151], and the PubChem fingerprint (PubChemFingerprinter). These fingerprints have been used for various tasks, including ligand classification [Q42704791], and in databases like BRENDA [Q24599948] and TIN [Q33874102].

MACCS Fingerprints

One substructure-based fingerprinter is the MACCSFingerprinter, which partly implements the MACCS fingerprint specification [Q34160151]. The substructures are defined as SMARTS substructure specifications, inherited from RDKit (http://rdkit.org/). This fingerprint requires that the implicit hydrogen counts are set first:

MACCSFingerprint

The object returned by the getBitFingerprint method is an IBitFingerprint, which we can convert into a Java BitSet with the asBitSet method:

MACCSFingerprint

ECFP and FCFP Fingerprints

The CDK also has an implementation of the circular ECFP and FCFP fingerprints [Q29616639]. It was developed by Alex M. Clark at Collaborative Drug Discovery, Inc. as the CircularFingerprinter [Q27902272]. It implements four variants of each: ECFP0, ECFP2, ECFP4, and ECFP6, and FCFP0, FCFP2, FCFP4, and FCFP6. The code is quite similar to that for other fingerprints, but we do have to indicate which variant we want:

ECFPFingerprint

Again we get an IBitFingerprint, which we can convert into a BitSet:

ECFPFingerprint
