Module for Kernel Density Estimation with variable bandwidth matrices.
The module provides a class for multivariate Kernel Density Estimation with one
bandwidth matrix per observation. These matrices are built from a baseline bandwidth
calculated with the Plugin or Rule of Thumb ("scott" or "silverman") methods. Variable
errors and covariances can be added to the matrices.
It uses the Plugin method with unconstrained pilot bandwidth [1][2] as implemented in
the ks R package [3]. See the ks package documentation for more information on
parameter values. All attributes are passed to the ks::Hpi function.
Variables:
nstage (int, optional) – Number of calculation stages; can be 1 or 2, by default 2.
pilot (str, optional) – Kind of pilot bandwidth.
binned (bool, optional) – Use binned estimation, by default False.
diag (bool, optional) – Whether to use a diagonal bandwidth, by default False. If True,
ks::Hpi.diag is used.
bw – Bandwidth to be used, by default an instance of PluginSelector. It can be:
an instance of BandwidthSelector: the base bandwidth is
calculated using the given selector.
a Number: the base bandwidth is calculated as a diagonal
matrix with the given value.
an Array: the base bandwidth is taken as the given
array. The array shape must be (n, d, d), where n is the
number of observations and d is the number of dimensions.
a String: the name of the rule of thumb to be used, one of
"scott" or "silverman".
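The dispatch over the bw argument types can be sketched as follows. This is a simplified, hypothetical helper (resolve_bw is not the library's API; the Scott and Silverman factors shown are the standard multivariate rule-of-thumb factors):

```python
import numpy as np

def resolve_bw(bw, data):
    """Hypothetical sketch: turn a bw argument into an (n, d, d) stack
    of base bandwidth matrices, one per observation."""
    n, d = data.shape
    if isinstance(bw, str):
        # Rule-of-thumb diagonal bandwidths (standard Scott / Silverman factors).
        if bw == "scott":
            factor = n ** (-1.0 / (d + 4))
        elif bw == "silverman":
            factor = (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))
        else:
            raise ValueError(f"unknown rule of thumb: {bw}")
        # Squared because H plays the role of a covariance matrix.
        base = np.diag((factor * data.std(axis=0, ddof=1)) ** 2)
    elif isinstance(bw, (int, float)):
        # A single number becomes a diagonal matrix with that value.
        base = np.eye(d) * bw
    elif isinstance(bw, np.ndarray) and bw.shape == (n, d, d):
        # Already one matrix per observation: use it as-is.
        return bw
    else:
        raise TypeError("unsupported bw argument")
    # Broadcast the single base matrix to one copy per observation.
    return np.broadcast_to(base, (n, d, d)).copy()
```

The BandwidthSelector branch is omitted here since it delegates to the selector object itself.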
error_convolution (bool, optional) – When True:
* Density can only be estimated at the same points as the data.
That is, eval points are equal to data points.
* The estimation is always leave-one-out.
* To calculate the contribution of point A to the density
evaluated at point B, the bandwidth matrix of point A is
convolved with the bandwidth matrix of point B.
* This option should be used to get an accurate measure of the
density at the data points, considering the uncertainty of all
points, themselves included.
* As a new matrix is calculated for each combination of points,
this is the slowest option. Although it has been optimized with a
ball tree to reduce the number of matrices used, it can be
problematic for big, concentrated datasets.
* Default is False.
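A minimal brute-force sketch of this idea (not the library's optimized ball-tree implementation): with per-point bandwidth matrices H_i, the leave-one-out density at data point x_j uses the convolved matrix H_i + H_j for each contribution, since convolving two Gaussian kernels sums their covariances:

```python
import numpy as np
from scipy.stats import multivariate_normal

def loo_convolved_density(data, H):
    """Leave-one-out density at each data point, convolving the bandwidth
    matrices of the contributing point and the evaluated point."""
    n = len(data)
    dens = np.zeros(n)
    for j in range(n):
        for i in range(n):
            if i == j:  # leave-one-out: skip the point itself
                continue
            # Convolution of two Gaussian kernels sums their covariances.
            dens[j] += multivariate_normal.pdf(
                data[j], mean=data[i], cov=H[i] + H[j]
            )
        dens[j] /= n - 1
    return dens
```

The O(n^2) double loop is what the ball-tree optimization mentioned above avoids recomputing for distant point pairs.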
Creates covariance matrices and kernel
instances.
Parameters:
data (Numeric2DArray) – Data.
err (OptionalNumeric2DArray, optional) – Error array of shape (n, d), by default None.
Errors are added to the base bandwidth matrix
to create individual H matrices per datapoint.
corr (OptionalNumericArray, optional) – Correlation coefficients, by default None.
Coefficients are added to the base bandwidth matrix
to create individual H matrices per datapoint.
Can be one of:
NumericArray of shape (d, d): global correlation
matrix, applied in every bandwidth matrix H.
Numeric2DArray of shape (n, d * (d - 1) / 2):
individual correlation coefficients. Each column of
the array represents the correlation between two
variables, for all observations. The order of the
columns must follow the lower triangle of the
correlation matrix.
weights (OptionalNumeric1DArrayLike, optional) – Weights to be used for each data point, by default None.
If None, all datapoints have the same
weight.
Base bandwidth matrix is calculated from the bw parameter. If no additional
parameters are provided, the base bandwidth matrix is
used for all datapoints. If err and/or corr are provided, they are
used to create individual covariance matrices for each datapoint [5]. The final
matrix used for each kernel is the sum of the base matrix and the individual
covariance matrix, which is equivalent to convolving two gaussian kernels, one
for the base bandwidth matrix and one for the individual covariance matrix.
The base bandwidth is considered the minimum bandwidth of the KDE process,
applying to a data point without uncertainty, while the final matrix incorporates
the uncertainty when provided.
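The per-point matrix construction described above can be sketched as follows. This is an illustrative simplification (build_H is a hypothetical helper; here corr is assumed to be a global (d, d) correlation matrix):

```python
import numpy as np

def build_H(base_H, err, corr=None):
    """Sketch: final H_i = base_H + S_i, where S_i is the covariance matrix
    built from the per-point errors and the correlation coefficients."""
    n, d = err.shape
    H = np.empty((n, d, d))
    C = np.eye(d) if corr is None else corr
    for i in range(n):
        sigma = err[i]
        # Covariance from errors and correlations: S = diag(s) @ C @ diag(s),
        # computed elementwise as C * outer(s, s).
        S = C * np.outer(sigma, sigma)
        # Summing the covariances is equivalent to convolving the two
        # Gaussian kernels (base kernel and per-point error kernel).
        H[i] = base_H + S
    return H
```

With corr=None the added term is simply the diagonal matrix of squared errors, so each datapoint's kernel widens independently along each dimension.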
Creates a pairplot of the KDE model evaluated
on a grid that spans the minimum and maximum data
values in each dimension.
Parameters:
gr (int, optional) – Grid resolution, number of bins to be taken into
account for each dimension, by default 50. Note that
data dimensions and grid resolution determine how many
points are evaluated, as eval_points=gr**dims. A high
gr value can result in a long computation time.
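The growth of the evaluated grid is worth checking before plotting; for example, with the default gr=50:

```python
# Number of evaluated points is gr ** dims, so it grows quickly with dimension.
gr = 50
for dims in (2, 3, 4):
    print(dims, gr ** dims)
```

With 50 bins, two dimensions give 2500 evaluation points, three give 125000, and four already give 6250000, so reducing gr is advisable for higher-dimensional data.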