Recursive Agglomerative Clustering (RecAgglo) for categorical data

Code and simple experimental setup for the Recursive Agglomerative Clustering (RecAgglo) algorithm introduced in Detecting organized eCommerce fraud using scalable categorical clustering (Marchal S. and Szyller S., ACSAC 2019). This algorithm is initially designed to cluster orders placed on eCommerce websites with the aim to group fraudulent orders together. RecAgglo generates a large number of small clusters and it is best suited for processing data represented by attributes having high cardinality. A detailed presentation of the algorithm is provided in Section 3 (Marchal S. and Szyller S., ACSAC 2019). RecAgglo clustering performance (Impurity + computation time) is assessed and compared to state-of-the-art categorical clustering algorithms in Section 7. Its capabilities for fraud detection are assessed in Section 8 (Marchal S. and Szyller S., ACSAC 2019).

Requirements and setup

Python 3 packages: numpy, scipy, pandas, argparse. Also, you need a C compiler and Cython to compile the cython code. The code was tested with Python versions 3.6 and 3.7.

pip3 install numpy, scipy, pandas, argparse, Cython

Alternatively, you can use the provided requirements.txt file:

pip3 install -r requirements.txt

Compile the cython package asym_linkage.pyx with this command:

python3 compile_asym_linkage.py build_ext --inplace

Running the code

To run the clustering algorithm with default parameters, you need to provide just the input (data) and output file name.

input file must be a comma separated value (CSV) file with one element per line
output file is also a comma separated value (CSV) file, same shape as input file with one additional column containing the cluster index of each element

To run:

python3 main.py -i test_data/mushroom.data -o result-mushroom.csv --verbose

If you want to use custom parameters, invoke help to get more info:

python3 main.py -h

usage: main.py [-h] --infile INFILE --outfile OUTFILE [--delta_a INT]
               [--delta_fc INT] [--d_max FLOAT] [--rho_mc FLOAT]
               [--rho_s FLOAT] [--skip_index] [--verbose]

RecAgglo clustering. To run: python3 main.py --infile XYZ --outfile XYZ.
Override default args as necessary.

optional arguments:
  -h, --help            show this help message and exit
  --algorithm INT       clustering algorithm to use: 0=RecAgglo,
                        1=SampleClust, 2=AggloClust
  --weight WEIGHT       List of weight for each attribute [0,0.5,...,2.5]
                        (default weights = 1.)
  --delta_a INT         Threshold for cluster sampling (default = 1000). Must
                        be > 0.
  --delta_fc INT        Threshold for full clustering (default = 1). Must be >
                        0.
  --d_max FLOAT         Distance threshold to split clusters (default = 0.5).
                        Must be > 0.
  --rho_mc FLOAT        Divider of max cluster according to n samples,
                        max_clust = n/mclust (default = 6.0). Must be > 0.
  --rho_s FLOAT         Multiplier of sqrt(n) for sample size (default = 0.5).
                        Must be > 0.
  --skip_index          Skip first column if it is an index and not a
                        attribute.
  --verbose             Verbose printing.

Required input/output file arguments:
  --infile INFILE, -i INFILE
                        CSV file containing input data.
  --outfile OUTFILE, -o OUTFILE
                        CSV output file containing input data + cluster
                        number.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
test_data		test_data
LICENSE		LICENSE
README.md		README.md
asym_linkage.pyx		asym_linkage.pyx
clustering.py		clustering.py
compile_asym_linkage.py		compile_asym_linkage.py
main.py		main.py
parsing.py		parsing.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Recursive Agglomerative Clustering (RecAgglo) for categorical data

Requirements and setup

Contents

Code

Datasets

Running the code

About

Releases

Packages

Languages

License

SSGAalto/recagglo

Folders and files

Latest commit

History

Repository files navigation

Recursive Agglomerative Clustering (RecAgglo) for categorical data

Requirements and setup

Contents

Code

Datasets

Running the code

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages