Soft Hierarchical Density-Based Spatial Clustering of Applications with Noise.
Class that wraps the HDBSCAN class to provide soft clustering
calculations, cluster selection and custom plots.
Variables:
min_cluster_size (Optional[int]) – Minimum size of cluster. Argument passed to HDBSCAN, by default None.
It is mandatory to provide this argument if the clusterer attribute
is not provided.
allow_single_cluster (bool) – Whether to allow a single cluster or not, by default False.
Argument passed to HDBSCAN. If the data contains only
one cluster and noise, the hierarchical clustering
algorithm will not identify the cluster unless this option
is set to True.
auto_allow_single_cluster (bool) – If True, HDBSCAN will automatically toggle allow_single_cluster
to True if no clusters are found, to return at least 1 cluster. By
default False.
min_samples (Optional[int]) – Number of samples in a neighbourhood for a point to be considered a core point, by default None.
Argument passed to HDBSCAN.
metric (Union[str, Callable]) – Metric to be used in HDBSCAN, by default “euclidean”.
noise_proba_mode (str) –
Method to calculate the noise probability, by default “score”.
Valid options are:
score: Use only HDBSCAN cluster membership scores to calculate
noise probability, as score=1-cluster_membership_score,
where cluster_membership_score is the HDBSCAN
probabilities_ value, which indicates how strongly
a point is tied to its cluster.
outlier: Use scores as in the “score” option, and outlier_scores
to calculate the noise probability, as
noise_proba=max(score,outlier_score).
Outlier scores are calculated by HDBSCAN using the GLOSH [4] algorithm.
conservative: Same method as the “outlier” option but does not
allow for any point classified as noise to
have a noise_proba lower than 1.
cluster_proba_mode (str) –
Method to calculate the cluster probability, by default “soft”.
Valid options are:
soft: Use the HDBSCAN all_points_membership_vectors to calculate
cluster probability, allowing for a point to be a member of multiple
clusters.
hard: Does not allow for a point to be a member of multiple clusters.
A point can be considered noise or member of only one cluster.
outlier_quantile (Optional[float]) – Quantile of outlier scores to be used as a threshold that defines a point
as outlier, classified as noise, by default None.
It must be a value between 0 and 1. If provided,
noise_proba_mode is set to “outlier”.
It scales HDBSCAN outlier scores so
any point with an outlier score higher
than the value of the provided quantile will
be considered as noise.
scaler (Optional[sklearn.base.TransformerMixin]) – Scaler to be used to scale the data before clustering, by default None.
clusterer (Optional[hdbscan.HDBSCAN [3]]) – HDBSCAN clusterer to be used, by default None. Used if more control is needed
over the clustering algorithm. It is mandatory to provide this argument if
the min_cluster_size attribute is not provided.
n_classes (int) – Number of detected classes in the data sample. Only available after the
fit() method is called.
labels (Numeric1DArray) – Labels of the data sample. Only available after the
fit() method is called. Noise
points are labeled as -1, and the rest of the points are labeled
with the cluster index.
proba (Numeric2DArray) – Probability of each point belonging to each class, including the noise class.
Only available after the
fit() method is called. Array of shape
(n_samples,n_classes). The first column corresponds to the noise class.
outlier_scores (Numeric1DArray) – Outlier scores of each point. Only available after the
fit() method is called.
Raises:
ValueError – If neither the min_cluster_size nor the clusterer attribute is provided.
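The three noise_proba_mode options described above can be illustrated with a minimal NumPy sketch. It follows only the formulas given in this documentation and is not the library's implementation. The function name noise_probability and its arguments are hypothetical, and the inputs are assumed to be arrays of values in [0, 1] plus a label array where -1 marks points classified as noise.

```python
import numpy as np


def noise_probability(membership, outlier_scores, labels, mode="score"):
    """Illustrative sketch of the noise_proba_mode options (hypothetical helper).

    membership     : cluster membership scores (HDBSCAN probabilities_), in [0, 1]
    outlier_scores : outlier scores, in [0, 1]
    labels         : class labels, with -1 marking points classified as noise
    """
    membership = np.asarray(membership, dtype=float)
    outlier_scores = np.asarray(outlier_scores, dtype=float)
    labels = np.asarray(labels)

    # "score": noise probability is the complement of the membership score.
    score = 1.0 - membership
    if mode == "score":
        return score

    # "outlier": take the larger of the score and the outlier score.
    noise_proba = np.maximum(score, outlier_scores)
    if mode == "outlier":
        return noise_proba

    if mode == "conservative":
        # Same as "outlier", but points classified as noise are forced to 1.
        return np.where(labels == -1, 1.0, noise_proba)

    raise ValueError(f"unknown mode: {mode!r}")


# Made-up values for three points; the last one is classified as noise.
membership = [0.9, 0.4, 0.3]
outlier_scores = [0.2, 0.7, 0.5]
labels = [0, 0, -1]
for mode in ("score", "outlier", "conservative"):
    print(mode, noise_probability(membership, outlier_scores, labels, mode))
```

In this toy input the modes differ only where the outlier score exceeds the score, or where a point is labeled as noise.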
It uses the provided configuration to identify clusters,
classify the data and provide membership probabilities. The results
are stored in the SHDBSCAN instance. The attributes storing results are
n_classes, labels, proba and outlier_scores.
Parameters:
data (Numeric2DArray) – Data to be clustered.
centers (Union[Numeric2DArrayLike, Numeric1DArrayLike], optional) – Center or array of centers of clusters, by default None.
If provided,
only the clusters that are geometrically closer to the
provided centers will be considered. This option is useful for
ignoring clusters in a multiple cluster scenario.
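One way to picture the effect of the centers parameter is the sketch below. It rests on assumptions that do not come from this documentation: each cluster is summarised by the mean of its members, closeness is measured with Euclidean distance, and discarded clusters are relabeled as noise. The helper name select_clusters_near_centers is hypothetical.

```python
import numpy as np


def select_clusters_near_centers(data, labels, centers):
    """Return the label of the cluster nearest to each provided center.

    Assumptions (not taken from the library): clusters are summarised by
    their mean, and distances are Euclidean.
    """
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    # Accept a single center (1D) or an array of centers (2D).
    centers = np.atleast_2d(np.asarray(centers, dtype=float))

    cluster_ids = np.unique(labels[labels != -1])  # ignore noise points
    centroids = np.array([data[labels == c].mean(axis=0) for c in cluster_ids])

    # Pairwise distances with shape (n_centers, n_clusters).
    dist = np.linalg.norm(centers[:, None, :] - centroids[None, :, :], axis=2)
    return cluster_ids[dist.argmin(axis=1)]


# Toy data: two clusters and one noise point.
data = [[0, 0], [0, 1], [10, 10], [10, 11], [5, 5]]
labels = np.array([0, 0, 1, 1, -1])

kept = select_clusters_near_centers(data, labels, centers=[9, 9])
# Points outside the kept clusters are treated as noise (assumed behavior).
new_labels = np.where(np.isin(labels, kept), labels, -1)
print(kept, new_labels)
```

The sketch expects at least one non-noise cluster in labels; an empty clustering would need to be handled separately.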
It uses the pairprobaplot() function. The colors
of the points represent class labels. The sizes
of the points represent the probability of belonging to
the most probable class.
Returns:
Pairplot of the clustering results.
Return type:
seaborn.PairGrid
Raises:
Exception – If the clustering has not been performed yet.
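The quantities encoded by point color and point size can be derived from a proba array, as in the sketch below. The array values are made up. The sketch also assumes that the class label of a point is its most probable class, which this documentation does not state explicitly.

```python
import numpy as np

# Hypothetical proba array of shape (n_samples, n_classes).
# Column 0 is the noise class; the remaining columns are clusters.
proba = np.array([
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.0, 0.3, 0.7],
])

# Most probable class per point. Subtracting 1 maps the noise column to -1
# and the cluster columns to cluster indices 0, 1, ...
labels = proba.argmax(axis=1) - 1

# Probability of belonging to the most probable class, which a plot
# could map to the point size.
max_proba = proba.max(axis=1)

print(labels)
print(max_proba)
```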
It uses the tsneprobaplot() function.
It represents the data in a two-dimensional space using t-SNE.
The colors of the points represent class labels.
The sizes of the points represent the probability of belonging
to the most probable class.
Returns:
Plot of the clustering results.
Return type:
matplotlib.axes.Axes
Raises:
Exception – If the clustering has not been performed yet.
It uses the scatter3dprobaplot() function.
It represents the data in a three-dimensional space using the
variables given by the user. The colors of the points
represent class labels. The sizes of the points represent
the probability of belonging to the most probable class.
Returns:
Plot of the clustering results.
Return type:
matplotlib.collections.PathCollection
Raises:
Exception – If the clustering has not been performed yet.
It uses the surfprobaplot() function.
The heights of the surface and colors of the points represent
the probability of belonging to the most probable cluster,
excluding the noise class. The data is represented in two
dimensions, given by the user.
Returns:
Plot of the clustering results.
Return type:
matplotlib.collections.PathCollection
Raises:
Exception – If the clustering has not been performed yet.
Includes an indicator of outlier_quantile if provided.
It is useful for choosing an appropriate value for
outlier_quantile. It uses the seaborn displot function [6].
Returns:
Plot of the outlier scores distribution.
Return type:
matplotlib.axes.Axes
Raises:
Exception – If the clustering has not been performed yet.
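When choosing outlier_quantile from such a distribution, it can help to see which score value a given quantile corresponds to. The sketch below uses synthetic scores as a stand-in for the outlier_scores attribute. The scaling step (divide by the threshold, then clip to 1) is an assumption about how the scores might be rescaled, not the library's confirmed method.

```python
import numpy as np

# Synthetic stand-in for the outlier_scores attribute (values in [0, 1]).
rng = np.random.default_rng(0)
outlier_scores = rng.beta(2, 8, size=1000)

outlier_quantile = 0.9
# Score value below which 90% of the points fall.
threshold = np.quantile(outlier_scores, outlier_quantile)

# Points above the threshold would be considered noise.
is_outlier = outlier_scores > threshold

# Assumed scaling: scores at or above the threshold saturate at 1.
scaled = np.clip(outlier_scores / threshold, 0.0, 1.0)

print(f"threshold={threshold:.3f}, outliers={is_outlier.sum()}")
```

Lower quantiles flag more points as noise, so the threshold is best placed where the tail of the distribution separates from its bulk.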