scludam.membership module

Module for Density Based Membership Estimation.

class scludam.membership.DBME(n_iters: int = 2, kde_leave1out: bool = True, kernel_calculation_mode: str = 'per_class', pdf_estimator: HKDE = HKDE(bw=PluginSelector(nstage=None, pilot=None, binned=None, diag=False), error_convolution=False, _kernels=None, _weights=None, _covariances=None, _base_bw=None, _data=None, _n=None, _d=None, _n_eff=None, _eff_mask=None, _maxs=None, _mins=None), n: int | None = None, d: int | None = None, unique_labels: ndarray[Any, dtype[number]][ndarray[Any, dtype[number]]] | None = None, data: ndarray[Any, dtype[number]][ndarray[Any, dtype[number]]] | None = None, estimators: list = _Nothing.NOTHING, n_estimators: int | None = None, iter_priors: list = _Nothing.NOTHING, iter_counts: list = _Nothing.NOTHING, iter_label_diff: list = _Nothing.NOTHING, iter_labels: list = _Nothing.NOTHING, n_classes: int | None = None, labels: ndarray[Any, dtype[number]][ndarray[Any, dtype[number]]] | None = None, counts: ndarray[Any, dtype[number]][ndarray[Any, dtype[number]]] | None = None, posteriors: ndarray[Any, dtype[number]][ndarray[Any, dtype[number]]] | None = None, priors: ndarray[Any, dtype[number]][ndarray[Any, dtype[number]]] | None = None)[source]

Bases: object

Density Based Membership Estimation.

It uses HKDE to estimate the density and calculate smooth membership probabilities for each class, given data and initial probabilities.

Variables:

n_iters (int) – Number of iterations, by default 2. In each iteration, prior probabilities are updated according to the posterior probabilities of the previous iteration.
kernel_calculation_mode (str) –
Mode of kernel calculation, by default per_class. It indicates how many HKDE estimators will be used to estimate the density. Available modes are:
- same: the bandwidth of the kernels is the same for all classes. There will be one estimator.
- per_class: the bandwidth of the kernels is different for each class. There will be one estimator per class.
- per_class_per_iter: the bandwidth of the kernels is different for each class and iteration. There will be one estimator per class which will be updated in each iteration, recalculating the bandwith each time.
kde_leave1out (bool) – Whether to use leave-one-out KDE estimation, by default True.
pdf_estimator (Union[HKDE, List[HKDE]]) – Estimator used to estimate the density, by default an instance of HKDE with default parameters. If list is provided, it is asumed either 1 per class or first for first class and 2nd for the rest.
n_classes (int) – Number of detected classes. Only available after the fit() method is called.
labels (Numeric1DArray) – Labels of the classes, only available after the fit() method is called.
counts (Numeric1DArray) – Number of data points in each class, only available after the fit() method is called.
priors (Numeric1DArray) – Prior probabilities of each class, only available after the fit() method is called.
posteriors (Numeric2DArray) – Posterior probabilities array of shape (n_datapoints, n_classes), only available after the fit() method is called.

Examples

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import multivariate_normal

from scludam import DBME, SHDBSCAN
from scludam.synthetic import (
    BivariateUniform,
    StarCluster,
    StarField,
    Synthetic,
    UniformFrustum,
    polar_to_cartesian,
)

# generate some data

fmix = 0.9
n = 1000
n_clusters = 1
cmix = (1 - fmix) / n_clusters

field = StarField(
    pm=BivariateUniform(locs=(-7, 6), scales=(2.5, 2.5)),
    space=UniformFrustum(locs=(118, -31, 1.2), scales=(6, 6, 0.9)),
    n_stars=int(n * fmix),
)
clusters = [
    StarCluster(
        space=multivariate_normal(mean=polar_to_cartesian([121, -28, 1.6]), cov=50),
        pm=multivariate_normal(mean=(-5.75, 7.25), cov=1.0 / 34),
        n_stars=int(n * cmix),
    ),
]
df = Synthetic(star_field=field, clusters=clusters).rvs()

data = df[["pmra", "pmdec"]].values

# create some random observational error
random_error = np.random.normal(0, 0.1, data.shape)

# calculate some initial probabilities
shdbscan = SHDBSCAN(
    min_cluster_size=150, noise_proba_mode="outlier", auto_allow_single_cluster=True
).fit(data)

# use DBME to fit HKDE models and calculate membership probabilities
dbme = DBME().fit(data, shdbscan.proba, random_error)
print(dbme.posteriors)
# [[9.45802647e-01 5.41973532e-02]
#  ...
#  [2.77988823e-01 7.22011177e-01]]

# plot to compare initial probabilities with membership probabilities
shdbscan.surfplot(cols=["pmra", "pmdec"])
dbme.surfplot(cols=["pmra", "pmdec"])
plt.show()

fit(data: ndarray[Any, dtype[number]][ndarray[Any, dtype[number]]], init_proba: ndarray[Any, dtype[number]][ndarray[Any, dtype[number]]], err: ndarray[Any, dtype[number]][ndarray[Any, dtype[number]]] | None = None, corr: ndarray[Any, dtype[number]] | None = None)[source]

Fit models and calculate posteriors probabilities.

The method takes data and initial probabilities and creates density estimators. Prior probabilities are taken from the initial probabilities. In each iteration, the method calculates the posterior probabilities of each datapoint using the density estimates and prior probabilites. Also, the method updates the prior probabilities considering the posterior probabilities of the past iteration. n_iters=1 uses prior probabilities as provided in the initial probabilities array. n_iters=2 (recommended), updates the prior probabilities once.

Parameters:

data (Numeric2DArray) – Data matrix.
init_proba (Numeric2DArray) – Initial posterior probability array. Must be of shape (n_samples, n_classes). This probabilities are used to create the initial density estimators per class.
err (OptionalNumeric2DArray, optional) – Error parameter to be passed to fit(), by default None.
corr (OptionalNumericArray, optional) – Correlation parameter to be passed to fit(), by default None.

Returns:

Fitted instance of the DBME class.

Return type:

DBME

pairplot(**kwargs)[source]

Plot the clustering results in a pairplot.

It uses the pairprobaplot(). The colors of the points represent class labels. The sizes of the points reresent the probability of belonging to the most probable class.

Returns:: Pairplot of the clustering results.
Return type:: seaborn.PairGrid
Raises:: Exception – If the clustering has not been performed yet.

tsneplot(**kwargs)[source]

Plot the clustering results in a t-SNE plot.

It uses the tsneprobaplot() function. It represents the data in a 2 dimensional space using t-SNE. The colors of the points represent class labels. The sizes of the points represent the probability of belonging to the most probable class.

Returns:: Plot of the clustering results.
Return type:: matplotlib.axes.Axes
Raises:: Exception – If the clustering has not been performed yet.

scatter3dplot(**kwargs)[source]

Plot the clustering results in a 3D scatter plot.

It uses the scatter3dprobaplot() function. It represents the data in a 3 dimensional space using the variables given by the user. The colors of the points represent class labels. The sizes of the points represent the probability of belonging to the most probable class.

Returns:: Plot of the clustering results.
Return type:: matplotlib.collections.PathCollection
Raises:: Exception – If the clustering has not been performed yet.

surfplot(**kwargs)[source]

Plot the clustering results in a 3D surface plot.

It uses the surfprobaplot() function. The heights of the surface and colors of the points represent the probability of belonging to the most probable cluster, excluding the noise class. The data is represented in two dimensions, given by the user.

Returns:: Plot of the clustering results.
Return type:: matplotlib.collections.PathCollection
Raises:: Exception – If the clustering has not been performed yet.