Module for density peak detection in numerical data.
This module provides density peak detection over numerical data and helper functions
for tasks such as defining filtering masks. The main API of
the module includes CountPeakDetector,
default_mask() and extend_1dmask(),
which can be directly imported from scludam. Other functions and classes can be
imported from scludam.detection.
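For example, the main API can be imported from the package itself; the helper nyquist_offsets() is assumed here to be importable from scludam.detection, per the note above:

from scludam import CountPeakDetector, default_mask, extend_1dmask
# Helper assumed to live in the submodule, as described above
from scludam.detection import nyquist_offsets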
default_mask() returns a mean-weighted mask with 5 elements per dimension,
to be used as a filter for convolution. The sum of the
mask weights is equal to 1.
Parameters:
dim (int) – Dimension of the mask.
Returns:
Array with the mask.
Return type:
NDArray[np.number]
Notes
The shape of the mask is chosen so that it takes into account
the values of neighboring bins, but not the value of the bin over
which the mask is applied. The mask is intended to
produce a good estimate of the local background density around
the bin over which it is applied. This mask is used in the
method applied by González-Alejo (2020) [1].
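A quick usage sketch: for two dimensions the mask is a 5x5 array whose weights sum to 1 and whose central weight is zero, consistent with the note above (the exact weight values are not assumed here):

import numpy as np
from scludam import default_mask

mask = default_mask(2)
print(mask.shape)                   # (5, 5): 5 elements per dimension
print(np.isclose(mask.sum(), 1.0))  # True: weights sum to 1
print(mask[2, 2])                   # expected ~0: the central bin is excluded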
Get histogram bins and edges given a bin shape and data.
The method takes into account the data maximum and minimum values
in each dimension and the bin shape to calculate the number
of bins and the edges to be used for a histogram. Half a bin
is added at each extremum so that no bin edge falls exactly
on a data extremum.
Parameters:
data (Numeric2DArray) – Data to be used to get the histogram bins.
bin_shape (Numeric1DArrayLike) – Bin shape (each dimension) to be used.
offsets (Optional[Numeric1DArrayLike], optional) – Offsets to be added to the edges, by default None.
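A minimal sketch of the described calculation in plain NumPy (illustrative only; the actual function in scludam.detection may differ in details such as rounding and offset handling):

import numpy as np

def get_bins_sketch(data, bin_shape, offsets=None):
    # Pad each dimension by half a bin so no edge falls
    # exactly on a data extremum.
    bin_shape = np.asarray(bin_shape, dtype=float)
    lo = data.min(axis=0) - bin_shape / 2
    hi = data.max(axis=0) + bin_shape / 2
    if offsets is not None:
        lo = lo + np.asarray(offsets, dtype=float)
        hi = hi + np.asarray(offsets, dtype=float)
    n_bins = np.ceil((hi - lo) / bin_shape).astype(int)
    edges = [
        np.linspace(l, l + n * s, n + 1)
        for l, n, s in zip(lo, n_bins, bin_shape)
    ]
    return n_bins, edges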
Get all possible offsets for a given bin shape, to be used
to shift the histogram edges following the Nyquist spatial
sampling interval. The offsets are calculated as half of the
bin shape in each dimension.
Parameters:
bin_shape (Numeric1DArrayLike) – Bin shape (in each dimension).
Returns:
Array of shape (n_combinations, n_dimensions) with the
possible offsets.
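The example output further below (offsets such as [0.0, 0.25, 0.0] for bin_shape [0.5, 0.5, 0.05]) suggests the offsets are all per-dimension combinations of 0 and half the bin width. A hedged sketch of that behavior:

import numpy as np
from itertools import product

def nyquist_offsets_sketch(bin_shape):
    # All combinations of 0 and half the bin width per dimension,
    # giving an array of shape (2**dim, dim).
    half = np.asarray(bin_shape, dtype=float) / 2
    return np.array(list(product(*[(0.0, h) for h in half])))

print(nyquist_offsets_sketch([0.5, 0.5, 0.05]).shape)  # (8, 3)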
Extend a 1-dimensional filtering mask to an n-dimensional mask.
From a numeric 1D filtering mask, the function uses the
outer product to
extend it to an n-dimensional one. The resulting mask is
the combination of n 1D masks orthogonal to
each other. The sum of the resulting mask is equal to 1.
Parameters:
mask (ArrayLike) – 1D mask to be extended.
dim (int) – Dimension of the mask to be extended to.
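A minimal sketch of the outer-product extension described above, with renormalization so the result sums to 1 (illustrative; the library implementation may differ):

import numpy as np

def extend_1dmask_sketch(mask, dim):
    mask = np.asarray(mask, dtype=float)
    result = mask
    # Repeatedly take the outer product with the 1D mask,
    # adding one orthogonal dimension at a time.
    for _ in range(dim - 1):
        result = np.multiply.outer(result, mask)
    return result / result.sum()  # normalize so the weights sum to 1

m2 = extend_1dmask_sketch([1, 2, 1], 2)
print(m2.shape)  # (3, 3)
print(m2.sum())  # 1.0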
Uses an n-dimensional histogram (array) to detect density
peaks in the input data.
Variables:
bin_shape (Numeric1DArrayLike) – Bin shape (in each dimension) to be used to create the histogram.
mask (Optional[ArrayLike], optional) – Mask to be used in the filtering operations, by default uses
default_mask() with the data dimensions.
The mask must have the same dimensions as the data, and its weights
must sum to 1 and be appropriate for smoothing.
nyquist_offsets (bool, optional) – If True, the Nyquist spatial sampling interval is used to shift the
histogram edges, by default True. It helps to avoid underestimating
the bin count due to an arbitrarily
chosen bin edge placement. It uses nyquist_offsets().
min_count (Number, optional) – Minimum count for a bin to be eligible as a peak, by default 10. Also used by
remove_low_density_regions if that option is enabled.
remove_low_density_regions (bool, optional) – If True, low density bins are removed from the histogram, by default
True. It removes low density bins from the edges of the histogram,
trimming down the region of interest and reducing the size of the
histogram, which in turn reduces memory usage for sparse data. It uses
the min_count value as the threshold. It also keeps bins that are in the
neighborhood of a valid (dense) bin so the filtering operation can be applied
to the remaining bins correctly. The neighborhood is defined by the size of
the mask to be used for the filtering operations.
min_dif (Number, optional) – Minimum difference between the bin count and the background for a bin to be
eligible as a peak, by default 10. The criterion used is:
eligible if histogram - background > min_dif, where background is
obtained by filtering the histogram with the provided mask.
min_sigma_dif (Number, optional) – Sigma value used to calculate the difference between the background and the
bin count for a bin to be eligible as a peak, by default None (deactivated).
The criterion used is:
eligible if histogram - background > min_sigma_dif * std, where background
is obtained by filtering the histogram with the provided
mask, and std
represents the standard deviation in a window surrounding the bin, calculated
according to the norm_mode parameter.
min_score (Number, optional) – Minimum score for a bin to be eligible as a peak, by default 2. The score is
calculated as the standardized difference between the bin count and the
background:
score = (histogram - background) / std, where background is obtained by
filtering the histogram with the provided mask, and std is
calculated according to the norm_mode parameter.
max_n_peaks (Number, optional) – Maximum number of peaks to be detected, by default 10. Use np.inf
to detect all peaks.
min_interpeak_distance (int, optional) – Minimum number of bins between peaks, by default 1.
norm_mode (str, optional) –
Mode to be used to get the standard deviation used in the
score calculation, by default "std". Can be one of the following
(a numeric check of the "approx" mode is sketched after this parameter list):
"std": Standard deviation of histogram - background, calculated
using the mask provided and
fast_std_filter().
"approx": Approximation [1] to the standard deviation, taking into account
how sharp = histogram - background is obtained.
A common estimate of the standard deviation of a histogram
is the square root of the bin count: std(h) = sqrt(h).
According to uncertainty propagation:
if s = h - b, then
std(s) = sqrt(var(h) + var(b) - 2*cov(h, b)).
Considering 2*cov(h, b) ~ 0, the approximation is:
std(s) = sqrt(h + b).
select_index (int, optional) – If provided, the detection process returns only the peak with the given index,
by default None.
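A quick numeric check of the "approx" normalization above (illustrative only, not library code): for independent Poisson bin counts, the empirical spread of h - b is close to sqrt(h + b), since the covariance term is negligible.

import numpy as np

rng = np.random.default_rng(0)
# Independent Poisson draws standing in for a bin count h
# and its smoothed background b
h = rng.poisson(lam=30.0, size=100_000)
b = rng.poisson(lam=20.0, size=100_000)
s = h - b
print(s.std())               # ~7.07: empirical std of the excess
print(np.sqrt(30.0 + 20.0))  # ~7.07: the sqrt(h + b) approximation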
Notes
The algorithm used is based on the following steps:
1. Remove the data corresponding to low density regions from the edges
toward the center, until a dense enough bin is found,
as described in the remove_low_density_regions parameter.
2. Calculate all possible offsets for the histogram edges, using the
Nyquist spatial sampling interval. The region surveyed is subdivided into
a rectilinear grid of overlapping hypercubes separated by half the side
length of an individual bin [2][3][4].
3. Instead of creating one histogram including all possible offsets, which
can be very large when dimensionality increases, a histogram is created
for each possible offset. Per histogram, the following steps are performed
(a minimal sketch of these steps is given after this list):
a. Estimate the background density, convolving the histogram with the
provided mask, smoothing the histogram over adjacent bins inside
a window defined by the mask size [2][5][6].
b. Calculate the excess of data points in each bin as the difference
between the bin count and the background density. This is equivalent
to applying a high-pass filter to the histogram. It should be noted
that the excess count obtained with this
method can be poorly estimated, especially when the bin shape used
is not appropriate.
c. Calculate the score of each bin as the normalized excess count, using
the methods described in the norm_mode parameter.
d. Apply the min_count, min_sigma_dif, min_dif and
min_interpeak_distance constraints and find peaks in the n-dimensional
score histogram.
4. Take the peaks found in each shifted histogram and merge them into
a single list, keeping only the highest-scoring shift for each peak. The list
is sorted in descending order by score.
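A minimal sketch of steps (a)-(c) for a single histogram, using plain NumPy and SciPy (illustrative only; the library may handle borders, offsets and normalization differently):

import numpy as np
from scipy.ndimage import convolve
from scludam import default_mask

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))

# Build a 2D histogram over the data
hist, _ = np.histogramdd(data, bins=20)

# (a) Background: smooth the histogram with the default mask
mask = default_mask(2)
background = convolve(hist, mask, mode="constant")

# (b) Excess: the high-pass filtered counts
excess = hist - background

# (c) Score with the "approx" normalization: std(s) = sqrt(h + b)
std = np.sqrt(np.clip(hist + background, 1e-9, None))
score = excess / std
print(score.shape)  # (20, 20)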
The fundamental parameter of the method is bin_shape.
In general, the shape must be chosen as the span in each dimension
of the object to be detected.
References
Examples
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import multivariate_normal

from scludam import CountPeakDetector
from scludam.synthetic import (
    StarCluster,
    StarField,
    Synthetic,
    UniformFrustum,
    polar_to_cartesian,
)

# Generate some data
np.random.seed(134)

field_size = int(1e4)
cluster_size = int(1e2)
field = StarField(
    pm=multivariate_normal(mean=(0.0, 0.0), cov=(20, 15)),
    space=UniformFrustum(locs=(120.5, -27.5, 12), scales=(1, 1, -11.8)),
    n_stars=field_size,
)
clusters = [
    StarCluster(
        space=multivariate_normal(
            mean=polar_to_cartesian([120.7, -28.5, 0.5]), cov=0.5
        ),
        pm=multivariate_normal(mean=(0.5, 0), cov=1.0 / 10),
        n_stars=cluster_size,
    ),
    StarCluster(
        space=multivariate_normal(mean=polar_to_cartesian([120.8, -28.6, 5]), cov=0.5),
        pm=multivariate_normal(mean=(4.5, 4), cov=1.0 / 10),
        n_stars=cluster_size,
    ),
    StarCluster(
        space=multivariate_normal(mean=polar_to_cartesian([120.9, -28.7, 8]), cov=0.5),
        pm=multivariate_normal(mean=(7.5, 7), cov=1.0 / 10),
        n_stars=cluster_size,
    ),
]
df = Synthetic(star_field=field, clusters=clusters).rvs()

# Select the data to be used for the detection
data = df[["pmra", "pmdec", "log10_parallax"]].values

# Detect density peaks in the data
detector = CountPeakDetector(bin_shape=[0.5, 0.5, 0.05])
result = detector.detect(data)
print(result)
# DetectionResult(
#     centers=numpy.ndarray([
#         [4.536048743232476, 3.9186486456763276, 0.6994997601345216],
#         [7.538194372795517, 6.998371367057486, 0.9020709230859785],
#         [0.4479574532248377, -0.01917758842077178, -0.3009924913261566]
#     ]),
#     sigmas=numpy.ndarray([
#         [0.5, 0.5, 0.05],
#         [0.5, 0.5, 0.05],
#         [0.5, 0.5, 0.05]
#     ]),
#     scores=numpy.ndarray([
#         6.9706701775336,
#         6.627355856978734,
#         5.886588680967625
#     ]),
#     counts=numpy.ndarray([35.21875, 24.71875, 31.21875]),
#     edges=numpy.ndarray([
#         [
#             [4.016260095931324, 5.016260095931324],
#             [3.1871460374559666, 4.187146037455967],
#             [0.6297442445624907, 0.7297442445624908]
#         ],
#         [
#             [6.766260095931324, 7.766260095931324],
#             [6.437146037455967, 7.437146037455967],
#             [0.8047442445624906, 0.9047442445624907]
#         ],
#         [
#             [-0.23373990406867584, 0.7662600959313242],
#             [-0.5628539625440334, 0.4371460374559666],
#             [-0.37025575543750955, -0.27025575543750957]
#         ]
#     ]),
#     offsets=numpy.ndarray([
#         [0.0, 0.25, 0.0],
#         [0.25, 0.0, 0.025],
#         [0.25, 0.0, 0.0]
#     ]),
#     indices=numpy.ndarray([[27, 21, 28], [32, 28, 31], [18, 14, 8]])
# )
# Note: the above print is actually made with prettyprinter
# (https://pypi.org/project/prettyprinter/)

# Plot the third peak found in pmra vs pmdec
detector.plot(2)
plt.show()
Creates the plot using the result of the last
detect() call.
Returns a custom seaborn heatmap plot. Passes any kwargs
to the plot function. The heatmap is a two-dimensional
histogram slice, where x and y can be set, and the rest
of the dimensions are fixed at the peak center.
Parameters:
peak (int, optional) – Index of the peak to be plotted in the result array,
by default 0
x (int, optional) – Index of the variable that should be placed as the
first dimension, by default 0
y (int, optional) – Index of the variable that should be placed as the
second dimension, by default 1
mode (str, optional) – Histogram type, by default "c". Can be one of:
"c": Counts histogram.
"b": Background histogram.
"e": Excess histogram.
"s": Score histogram.
The meaning of each histogram can be inferred from the
method explained in
CountPeakDetector.
cols (Union[List[str], None], optional) – List of variable names, by default None. If None,
then the variables are called 'var1', 'var2', and so on.
cut_label_prec (int, optional) – Decimal places for the cut message in the title, by default 4.
center_label_prec (int, optional) – Decimal places for the center message in the title, by default 4.
Returns:
Plot of the peak.
Return type:
matplotlib.axes._subplots.AxesSubplot
Raises:
ValueError – If no results are available.
ValueError – If no peaks are detected in the last result.
ValueError – Invalid peak index.
ValueError – Invalid mode.
ValueError – Invalid x or y dimensions.
ValueError – Invalid label length.
Examples
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import multivariate_normal

from scludam import CountPeakDetector
from scludam.synthetic import (
    StarCluster,
    StarField,
    Synthetic,
    UniformFrustum,
    polar_to_cartesian,
)

# Generate some data
np.random.seed(134)

field_size = int(1e4)
cluster_size = int(1e2)
field = StarField(
    pm=multivariate_normal(mean=(0.0, 0.0), cov=(20, 15)),
    space=UniformFrustum(locs=(120.5, -27.5, 12), scales=(1, 1, -11.8)),
    n_stars=field_size,
)
clusters = [
    StarCluster(
        space=multivariate_normal(
            mean=polar_to_cartesian([120.7, -28.5, 0.5]), cov=0.5
        ),
        pm=multivariate_normal(mean=(0.5, 0), cov=1.0 / 10),
        n_stars=cluster_size,
    ),
    StarCluster(
        space=multivariate_normal(mean=polar_to_cartesian([120.8, -28.6, 5]), cov=0.5),
        pm=multivariate_normal(mean=(4.5, 4), cov=1.0 / 10),
        n_stars=cluster_size,
    ),
    StarCluster(
        space=multivariate_normal(mean=polar_to_cartesian([120.9, -28.7, 8]), cov=0.5),
        pm=multivariate_normal(mean=(7.5, 7), cov=1.0 / 10),
        n_stars=cluster_size,
    ),
]
df = Synthetic(star_field=field, clusters=clusters).rvs()

# Select the data to be used for the detection
data = df[["pmra", "pmdec", "log10_parallax"]].values

# Detect density peaks in the data
detector = CountPeakDetector(bin_shape=[0.5, 0.5, 0.05])
result = detector.detect(data)

# Custom plot of third peak scores in pmdec and log10_parallax
detector.plot(
    peak=2,
    x=1,
    y=2,
    mode="s",
    cols=["pmra", "pmdec", "log10_parallax"],
    annot_threshold=5,
    cmap="copper",
)
plt.show()

# Custom plot of third peak background in pmra and pmdec
detector.plot(
    peak=2,
    x=0,
    y=1,
    mode="b",
    cols=["pmra", "pmdec", "log10_parallax"],
    annot=False,
    cmap="copper",
)
plt.show()