The implementation allows showing the performance of the algorithm on different model datasets, which have differing levels of complexity for nearest-neighbour-based algorithms. The user can also create a custom dataset and observe how the algorithm classifies it.
It also computes the Compactness Profile of the dataset, which characterizes how difficult the data is to classify with nearest-neighbour-based classifiers, and allows computing the Complete Cross-Validation error without actually running cross-validation.
In addition to this functionality, the program gives the ability to observe the distribution of points in the dataset according to their FRiS values (Function of Rival Similarity). This value allows classifying a point into one of the following categories:
- Outlier: FRiS value in $(-1, -0.5]$
- Errors: FRiS value in $(-0.5, -0.15]$
- Boundary: FRiS value in $(-0.15, 0.15]$
- Reliably Classified: FRiS value in $(0.15, 0.7]$
- Representative: FRiS value in $(0.7, 1.0]$
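As a reading aid, the thresholds above can be expressed as a small helper function. This is an illustrative sketch only; the function name is hypothetical and not part of the project:

```python
def fris_category(value: float) -> str:
    """Map a FRiS value in (-1, 1] to the categories listed above."""
    if value <= -0.5:
        return "Outlier"
    if value <= -0.15:
        return "Errors"
    if value <= 0.15:
        return "Boundary"
    if value <= 0.7:
        return "Reliably Classified"
    return "Representative"


print(fris_category(0.42))  # -> Reliably Classified
```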
The project uses Python 3.11. To create an environment to run the project, one needs to have Poetry installed.
To create the environment for the project, run the following commands:
poetry env use 3.11
poetry install
After the installation completes, one can run the demo UI with the following command:
poetry run python fris\wx_python_gui.py
To create a dataset, one can use one of the following methods:
- Input a custom dataset using mouse clicks. In this case, a left mouse click adds a point of the first class, and a right mouse click adds a point of the second class. WARNING: the FRiS algorithm will not produce a classification if the same point is present in two different classes. To build the classification, press the Fit button. After building the initial model, one can update it by adding more points.
- Generate points from model datasets. If needed, enter the number of points to be generated by a model data generator in the Point Count input field. The minimum number of points is two.
NOTE: The complexity of the generated datasets for nearest-neighbour-based classifiers differs between the generators; see the descriptions below.
Create dataset points by pressing one of the following buttons:
- Compact - generates points in two non-overlapping horizontal stripes with an empty stripe between them. Such a dataset is the simplest case for the nearest-neighbours-based classifiers.
- Separable - generates points in two adjacent horizontal stripes.
- Wide Saw - Data with wide saw-like vertical stripes
- Narrow Saw - Data with narrow saw-like vertical stripes
- Checkers - Data sampled from checker-board drawing - This type of data distribution is hard for nearest neighbours-based classifiers.
- Stripes - Data sampled from 20 vertical stripes with differing classes. This data is also hard for nearest neighbours-based classifiers.
- Circles - points are sampled from the sklearn circles data generator.
- Moon - points are sampled from the sklearn moons data generator.
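For reference, the Circles and Moon generators correspond to scikit-learn's `make_circles` and `make_moons`; a minimal standalone sketch of generating such data (the parameter values are illustrative, not necessarily the ones used by the GUI):

```python
from sklearn.datasets import make_circles, make_moons

# Two concentric circles; noise and factor values are illustrative.
X_circles, y_circles = make_circles(n_samples=200, noise=0.05, factor=0.5)

# Two interleaving half-moons.
X_moons, y_moons = make_moons(n_samples=200, noise=0.1)

print(X_circles.shape, y_circles.shape)  # (200, 2) (200,)
```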
To obtain a visualization of the FRiS-Stolp algorithm's predictions, press the Fit button.
Buttons in the next row serve the following purposes:
- Fit - Fits the classifier to the input data
- Clear - Removes all points from the currently active dataset
- Reset Parameters - Resets the algorithm hyperparameters to their default values
- Do Second Round - Performs the second round of the FRiS-Stolp algorithm
- Allocate points to covering stolps - With this option, the algorithm skips the reallocation of points to the nearest representative points in the second round, and the visualization shows to which representative points redundant points were allocated initially by the algorithm.
- Save dataset - Allows saving of manually created or generated model datasets to the file specified by the user. To load a stored dataset at the start of the next session, start the program as follows:
poetry run python fris\wx_python_gui.py --input <dataset_file_name>
with <dataset_file_name> replaced by the actual file name.
The f_0 and Lambda input fields and sliders allow specifying the values of the algorithm's hyperparameters.
The area below the sliders displays the characteristics of the current dataset, and if the model has been built, it also displays the FRiS-Stolp model's score.
The Data tab displays all the points of the current dataset. If the model has been built, it also shows additional information computed by the algorithm.
The Compactness Profile tab shows graphs of the compactness profile (the fraction of objects from other classes among the N nearest neighbours of a point in the dataset, as a function of the number of neighbours) and the graph of the complete cross-validation error as a function of the number of points used in cross-validation.
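A rough sketch of how such a compactness profile can be computed with NumPy. This is an independent illustration of the definition above, not the project's implementation:

```python
import numpy as np


def compactness_profile(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """For each k, the mean fraction of other-class objects among
    the k nearest neighbours of a point (the point itself excluded)."""
    n = len(X)
    # Pairwise Euclidean distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude the point itself
    order = np.argsort(d, axis=1)        # neighbours sorted by distance
    alien = y[order] != y[:, None]       # other-class indicator, per rank
    # Fraction of aliens among the first k neighbours, averaged over points.
    cum_alien = np.cumsum(alien[:, : n - 1], axis=1)
    ks = np.arange(1, n)
    return (cum_alien / ks).mean(axis=0)


X = np.random.rand(50, 2)
y = (X[:, 0] > 0.5).astype(int)
profile = compactness_profile(X, y)
print(profile[:5])  # fraction of alien neighbours for k = 1..5
```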
The FRiS Values Distribution tab displays a graph of the distribution of FRiS values for the dataset points. Experiment with different model datasets to observe how the compactness profile and the FRiS values distribution change with the underlying data distribution.
The algorithm is based on the compactness hypothesis: the assumption that similar objects are much more likely to belong to the same class than to different ones. In other words, it assumes that classes form closely grouped subsets within the object space.
The FRiS-Stolp algorithm is a density-based machine learning algorithm utilizing the Function of Rival Similarity (FRiS).
This function evaluates the similarity of a point z to a prototype a in competition with a rival prototype b. The values of the function lie between -1 and 1, where:
- -1 means that z is equal to b
- 1 means that z is equal to a
- 0 means that z is equidistant from both a and b.
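A common formulation of the rival similarity function from the FRiS literature, sketched here with Euclidean distance (the project's implementation may differ in metric and normalization):

```python
import numpy as np


def rival_similarity(z: np.ndarray, a: np.ndarray, b: np.ndarray) -> float:
    """FRiS of point z to prototype a in competition with rival b:
    F(z, a | b) = (d(z, b) - d(z, a)) / (d(z, b) + d(z, a))."""
    d_a = np.linalg.norm(z - a)
    d_b = np.linalg.norm(z - b)
    return (d_b - d_a) / (d_b + d_a)


z = np.array([0.0, 0.0])
a = np.array([1.0, 0.0])
b = np.array([3.0, 0.0])
print(rival_similarity(z, a, b))  # 0.5: z is closer to a than to b
```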
The FRiS-Stolp algorithm produces a compressed representation of the dataset: it selects a subset of points (stolps) that is smaller than the original dataset but is sufficient to produce a classification similar to the one produced by the full dataset.
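As an illustration of how such a compressed representation can be used, here is a minimal sketch of a decision rule that compares rival similarity between the nearest stolps of the two classes. This rule is an assumption for illustration, not necessarily the project's exact prediction logic:

```python
import numpy as np


def predict(z, stolps_0, stolps_1):
    """Assign z to class 0 or 1 using only the selected stolps."""
    # Nearest stolp of each class.
    a = min(stolps_0, key=lambda s: np.linalg.norm(z - s))
    b = min(stolps_1, key=lambda s: np.linalg.norm(z - s))
    d_a, d_b = np.linalg.norm(z - a), np.linalg.norm(z - b)
    # Rival similarity of z to the class-0 stolp against the class-1 stolp.
    fris = (d_b - d_a) / (d_b + d_a)
    return 0 if fris >= 0 else 1


stolps_0 = [np.array([0.0, 0.0]), np.array([0.0, 1.0])]
stolps_1 = [np.array([2.0, 0.0])]
print(predict(np.array([0.4, 0.2]), stolps_0, stolps_1))  # -> 0
```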
This algorithm has two hyperparameters:
- $F_0$. This value can be seen as the "similarity diameter" of the cluster represented by the object (stolp). The default value of the $F_0$ parameter is 0. The valid value range of this parameter is $[-1, 1]$; the sensible value range is $[0, 1]$.
When this parameter value is
- $\lambda$. In binary classification problems, this parameter dictates which type of errors should be prioritized: whether it is more important to accurately classify instances of the first class or to minimize misclassification of instances of the second class. It influences how the efficiency of representative objects is computed. The default value of the $\lambda$ parameter is $0.5$, which gives equal weight to defensibility and tolerance. The valid value range is $\lambda \in [0, 1]$.
The algorithm was first described in the following publication: N.G. Zagoruiko, I.A. Borisova, V.V. Dyubanov, O.A. Kutnenko, "Methods of recognition based on the function of rival similarity".
The efficiency of an object measures how well it would serve as a representative object (stolp); as described for the $\lambda$ parameter above, it combines the object's defensibility and tolerance.
Defensibility describes how well an object, used as a stolp, represents (defends) the objects of its own class.
Tolerance of object x is a quantitative assessment of the extent to which object x, used as a stolp, avoids attracting objects of the rival class and thereby causing them to be misclassified.
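Consistent with the description of the $\lambda$ parameter above, the efficiency of a candidate stolp can be read as a weighted combination of its defensibility and tolerance. A minimal sketch under that assumption; the exact formulas used by the project may differ:

```python
def stolp_efficiency(defensibility: float, tolerance: float, lam: float = 0.5) -> float:
    """Weighted combination assumed from the lambda description above:
    lam = 0.5 gives equal weight to defensibility and tolerance."""
    return lam * defensibility + (1.0 - lam) * tolerance


print(stolp_efficiency(defensibility=0.8, tolerance=0.6))  # 0.7
```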