pybilt.common package¶
Submodules¶
pybilt.common.distance_cutoff_clustering module¶
Function to compute hierarchical distance cutoff clusters.
-
pybilt.common.distance_cutoff_clustering.
distance_cutoff_clustering
(vectors, cutoff, dist_func, min_size=1, *df_args, **df_kwargs)[source]¶ Hierarchical distance cutoff clustering.
This function takes a set of vector points and clusters them using a hierarchical distance-based clustering algorithm. A point joins a cluster whenever it is within the cutoff distance of any point already in that cluster.
- Parameters
vectors (np.array, list like) – The array of vector points.
cutoff (float) – The cutoff distance.
dist_func (function) – The function to use when computing the distance between points.
min_size (Optional[int]) – The minimum size of a cluster. Defaults to 1.
*df_args – Any additional arguments to be passed to the distance function (dist_func).
**df_kwargs – Any additional keyword arguments to be passed to the distance function (dist_func).
- Returns
A list of clusters, where each cluster is a list of the indices of the points in that cluster.
- Return type
list
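The clustering rule described above can be illustrated with a minimal, pure-Python sketch. This is not pybilt's implementation: the names cutoff_clusters and euclidean are hypothetical, and treating a distance exactly equal to the cutoff as "within" it is an assumption.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cutoff_clusters(vectors, cutoff, dist_func, min_size=1):
    """Group point indices so that every point is within `cutoff` of at
    least one other member of its cluster (hypothetical sketch)."""
    unvisited = set(range(len(vectors)))
    clusters = []
    while unvisited:
        seed = min(unvisited)  # deterministic starting point
        unvisited.remove(seed)
        cluster, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            # Assumption: a distance equal to the cutoff counts as "within".
            near = [j for j in unvisited
                    if dist_func(vectors[i], vectors[j]) <= cutoff]
            for j in near:
                unvisited.remove(j)
            cluster.extend(near)
            frontier.extend(near)
        # Clusters smaller than min_size are discarded.
        if len(cluster) >= min_size:
            clusters.append(sorted(cluster))
    return clusters


points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0)]
clusters = cutoff_clusters(points, 0.6, euclidean)
```

In this sketch, points 0 and 2 are farther apart than the cutoff but still end up in the same cluster because each is within the cutoff of point 1.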
-
pybilt.common.distance_cutoff_clustering.
distance_euclidean
(v_a, v_b)[source]¶ Compute the Euclidean distance between two vectors.
- Parameters
v_a (numpy.array, array like) – The first input vector.
v_b (numpy.array, array like) – The second input vector.
- Returns
The Euclidean distance between the two vectors.
- Return type
float
Notes
The two vectors should have the same size and dimension.
-
pybilt.common.distance_cutoff_clustering.
distance_euclidean_pbc
(v_a, v_b, box_lengths, center='zero')[source]¶ Compute the Euclidean distance between two vectors under periodic boundaries.
- Parameters
v_a (numpy.array, array like) – The first input vector.
v_b (numpy.array, array like) – The second input vector.
box_lengths (numpy.array, array like) – The periodic boundary box lengths for each dimension.
center (Optional[str, array like]) – Set the coordinate center of the periodic box dimensions. Defaults to ‘zero’, which sets the center to numpy.zeros(len(box_lengths)). Also accepts the string value ‘box_half’, which sets the center to 0.5*box_lengths.
- Returns
The Euclidean distance between the two vectors.
- Return type
float
Notes
The two vectors should have the same size and dimension, while box_lengths should have the length of the vector dimension.
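A periodic-boundary distance is commonly computed with the minimum image convention. The sketch below shows that idea in pure Python. The name distance_pbc_sketch is hypothetical, the center argument is left out, and the sketch does not claim to reproduce how pybilt handles it.

```python
import math


def distance_pbc_sketch(v_a, v_b, box_lengths):
    """Euclidean distance under the minimum image convention (sketch)."""
    total = 0.0
    for a, b, length in zip(v_a, v_b, box_lengths):
        d = abs(a - b) % length   # separation folded into [0, length)
        d = min(d, length - d)    # pick the nearest periodic image
        total += d * d
    return math.sqrt(total)


# Two points near opposite faces of a 10 x 10 box are close neighbors
# once the periodic images are taken into account.
d = distance_pbc_sketch((0.5, 0.5), (9.5, 0.5), (10.0, 10.0))
```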
-
pybilt.common.distance_cutoff_clustering.
vector_difference_pbc
(v_a, v_b, box_lengths, center='zero')[source]¶ Compute the Euclidean distance between two vectors under periodic boundaries.
- Parameters
v_a (numpy.array, array like) – The first input vector.
v_b (numpy.array, array like) – The second input vector.
box_lengths (numpy.array, array like) – The periodic boundary box lengths for each dimension.
center (Optional[str, array like]) – Set the coordinate center of the periodic box dimensions. Defaults to ‘zero’, which sets the center to numpy.zeros(len(box_lengths)). Also accepts the string value ‘box_half’, which sets the center to 0.5*box_lengths.
- Returns
The Euclidean distance between the two vectors.
- Return type
float
Notes
The two vectors should have the same size and dimension, while box_lengths should have the length of the vector dimension.
pybilt.common.gaussian module¶
Define Gaussian function objects.
This module defines the Gaussian class and the GaussianRange class.
-
class
pybilt.common.gaussian.
Gaussian
(mean, std)[source]¶ Bases:
object
A Gaussian function object.
-
mean
¶ The mean of the Gaussian.
- Type
float
-
std
¶ The standard deviation of the Gaussian.
- Type
float
Initialize a Gaussian function object.
- Parameters
mean (float) – Set the mean of the Gaussian.
std (float) – Set the standard deviation of the Gaussian.
-
-
class
pybilt.common.gaussian.
GaussianRange
(in_range, mean, std, npoints=200)[source]¶ Bases:
object
Define a Gaussian function over a range.
This object is used to define a Gaussian function over a defined finite range and store its values as evaluated at points evenly spaced over the range. The points can then for example be used for integrating the Gaussian function over the range using numerical quadrature.
-
mean
¶ The mean of the Gaussian.
- Type
float
-
std
¶ The standard deviation of the Gaussian.
- Type
float
-
upper
¶ The upper boundary of the range.
- Type
float
-
lower
¶ The lower boundary of the range.
- Type
float
-
npoints
¶ The number of points to evaluate in the range.
- Type
int
Initialize the GaussianRange object.
The GaussianRange stores the values of the Gaussian function with the input mean and standard deviation evaluated at evenly spaced points in the specified x-value range.
- Parameters
in_range (tuple, list) – Specify the endpoints for the range, e.g. (x_start, x_end).
mean (float) – The mean of the Gaussian function.
std (float) – The standard deviation of the Gaussian function.
npoints (Optional[int]) – The number of x-value points to evaluate the Gaussian function for in the specified range (i.e. in_range).
-
eval
(x_in)[source]¶ Return the Gaussian function evaluated at the input x value.
- Parameters
x_in (float) – The x value to evaluate the function at.
- Returns
The function evaluation for the Gaussian.
- Return type
float
-
get_values
()[source]¶ Return the x and y values for the Gaussian range function.
- Returns
The x and y values for the function, returned as ( x_values, y_values).
- Return type
tuple
-
integrate_range
(lower, upper)[source]¶ Returns the numerical integration of the Gaussian range.
This function does a simple quadrature for the Gaussian function as evaluated on the range (or subset of the range) specified at initialization.
- Parameters
lower (float) – The lower boundary for the integration.
upper (float) – The upper boundary for the integration.
- Returns
The numerical value of the Gaussian range integrated from lower to upper.
- Return type
float
Notes
This function does not thoroughly check the bounds, so if upper is less than lower the function will break.
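The idea of storing a Gaussian on evenly spaced points and integrating it numerically can be sketched as follows. This is an illustration only: the helper names are hypothetical, the Gaussian is assumed to be normalized, and the trapezoidal rule is an assumed choice of "simple quadrature" that may differ from what pybilt uses.

```python
import math


def gaussian(x, mean, std):
    """Normalized Gaussian density (normalization is an assumption)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def gaussian_on_range(in_range, mean, std, npoints=200):
    """Evaluate the Gaussian at npoints evenly spaced x values."""
    lower, upper = in_range
    step = (upper - lower) / (npoints - 1)
    xs = [lower + i * step for i in range(npoints)]
    ys = [gaussian(x, mean, std) for x in xs]
    return xs, ys


def trapezoid(xs, ys, lower, upper):
    """Trapezoidal rule over the stored points lying in [lower, upper]."""
    pts = [(x, y) for x, y in zip(xs, ys) if lower <= x <= upper]
    return sum(0.5 * (y0 + y1) * (x1 - x0)
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


xs, ys = gaussian_on_range((-5.0, 5.0), mean=0.0, std=1.0)
area = trapezoid(xs, ys, -5.0, 5.0)  # close to 1 for a normalized Gaussian
```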
-
reset_mean
(new_mean)[source]¶ Change the mean of the Gaussian function.
- Parameters
new_mean (float) – The new mean of the Gaussian function.
Notes
This function does not re-evaluate the Gaussian range and therefore only affects the output of the eval function.
-
sum_range
(lower, upper)[source]¶ Returns the sum over the Gaussian range.
This function sums the Gaussian function at the points that were evaluated on the range (or subset of the range) specified at initialization.
- Parameters
lower (float) – The lower boundary for the sum.
upper (float) – The upper boundary for the sum.
- Returns
The numerical value of the Gaussian range as summed from lower to upper.
- Return type
float
Notes
This function does not thoroughly check the bounds, so if upper is less than lower the function will break.
-
pybilt.common.knn_entropy module¶
Functions to evaluate information theoretic measures using knn approaches.
This module defines a set of functions to compute information theoretic measures (i.e. Shannon Entropy, Mutual Information, etc.) using the k-nearest neighbors (knn) approach.
-
pybilt.common.knn_entropy.
conditional_mutual_information
(var_tuple, cond_tuple, k=2)[source]¶ Returns an estimate of the conditional mutual information.
This function computes an estimate of the mutual information between a set of random variables or random vectors conditioned on other random variables or vectors using knn estimators for the entropy calculations.
- Parameters
var_tuple (tuple) – A tuple of random variables or random vectors (i.e. numpy.array) to estimate the mutual information between; e.g. var_tuple = (X, Y) where X has the form X = {x_1, x_2, x_3,…,x_N} and Y has the form Y = {x_1, x_2, x_3,…,x_N}, or where X has the form X = {(x_X1, y_X1), (x_X2, y_X2),…,(x_XN, y_XN)} and Y has the form Y = {(x_Y1, y_Y1), (x_Y2, y_Y2),…,(x_YN, y_YN)}.
cond_tuple (tuple) – A tuple of random variables or random vectors (i.e. numpy.array) that the mutual information is to be conditioned on; e.g. cond_tuple = (X) where X has the form X = {x_1, x_2, x_3,…,x_N}.
k (Optional[int]) – The number of nearest neighbors to store for each point. Defaults to 2.
- Returns
The mutual information estimate.
- Return type
float
Notes
The information entropies used to estimate the mutual information are computed using the shannon_entropy function. All input random variable/vector arrays must have the same shape.
-
pybilt.common.knn_entropy.
k_nearest_neighbors
(X, k=1)[source]¶ Get the k-nearest neighbors between points in a random variable/vector.
Determines the k nearest neighbors for each point in the random variable/vector using Euclidean style distances.
- Parameters
X (np.array) – A random variable of form X = {x_1, x_2, x_3,…,x_N} or a random vector of form X = {(x_1, y_1), (x_2, y_2),…,(x_N, y_N)}.
k (Optional[int]) – The number of nearest neighbors to store for each point. Defaults to 1.
- Returns
A dictionary keyed by the indices of X and containing a list of the k nearest neighbors for each point, along with the distance between the point and each neighbor.
- Return type
dict
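A brute-force version of this neighbor search can be sketched in a few lines. The function name knn_sketch and the (neighbor_index, distance) layout of the dictionary values are assumptions made for illustration; the structure pybilt actually returns may differ. Scalar samples are written as 1-tuples so that math.dist applies.

```python
import math


def knn_sketch(points, k=1):
    """Map each point index to its k nearest (neighbor_index, distance)
    pairs, using Euclidean distances (hypothetical layout)."""
    neighbors = {}
    for i, p in enumerate(points):
        # Distances from point i to every other point, nearest first.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors[i] = [(j, d) for d, j in dists[:k]]
    return neighbors


X = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
nn = knn_sketch(X, k=1)
```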
-
pybilt.common.knn_entropy.
kth_nearest_neighbor_distances
(X, k=1)[source]¶ Returns the distance for the kth nearest neighbor of each point.
- Parameters
X (np.array) – A random variable of form X = {x_1, x_2, x_3,…,x_N} or a random vector of form X = {(x_1, y_1), (x_2, y_2),…,(x_N, y_N)}.
k (Optional[int]) – The number of nearest neighbors to check for each point. Defaults to 1.
- Returns
A list in the same order as X with the distance value to the kth nearest neighbor of each point in X.
- Return type
list
-
pybilt.common.knn_entropy.
mutual_information
(var_tuple, k=2)[source]¶ Returns an estimate of the mutual information.
This function computes an estimate of the mutual information between a set of random variables or random vectors using knn estimators for the entropy calculations.
- Parameters
var_tuple (tuple) – A tuple of random variables or random vectors (i.e. numpy.array); e.g. var_tuple = (X, Y) where X has the form X = {x_1, x_2, x_3,…,x_N} and Y has the form Y = {x_1, x_2, x_3,…,x_N}, or where X has the form X = {(x_X1, y_X1), (x_X2, y_X2),…,(x_XN, y_XN)} and Y has the form Y = {(x_Y1, y_Y1), (x_Y2, y_Y2),…,(x_YN, y_YN)}.
k (Optional[int]) – The number of nearest neighbors to store for each point. Defaults to 2.
- Returns
The mutual information estimate.
- Return type
float
Notes
The information entropies used to estimate the mutual information are computed using the shannon_entropy function. All input random variable/vector arrays must have the same shape.
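For reference, the standard identities that express mutual information and conditional mutual information as sums of Shannon entropies are given below for two variables X and Y and a conditioning variable Z. These are the textbook relations; that mutual_information and conditional_mutual_information combine the shannon_entropy estimates in exactly this way is an assumption.

```latex
I(X;Y) = H(X) + H(Y) - H(X,Y)

I(X;Y \mid Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
```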
-
pybilt.common.knn_entropy.
shannon_entropy
(X, k=1, kth_dists=None)[source]¶ Return the Shannon Entropy of the random variable/vector.
This function computes the Shannon information entropy of the random variable/vector as estimated using the Kozachenko-Leonenko (KL) knn estimator.
- Parameters
X (np.array) – A random variable of form X = {x_1, x_2, x_3,…,x_N} or a random vector of form X = {(x_1, y_1), (x_2, y_2),…,(x_N, y_N)}.
k (Optional[int]) – The number of nearest neighbors to store for each point. Defaults to 1.
kth_dists (Optional[list]) – A list in the same order as the points in X that holds the pre-computed distances between the points in X and their kth nearest neighbors. Defaults to None.
References
- Damiano Lombardi and Sanjay Pant, A non-parametric k-nearest neighbour entropy estimator, arXiv preprint, [cs.IT] 2015, arXiv:1506.06501v1. https://arxiv.org/pdf/1506.06501v1.pdf
- https://www.cs.tut.fi/~timhome/tim/tim/core/differential_entropy_kl_details.htm
- Kozachenko, L. F. & Leonenko, N. N. 1987 Sample estimate of entropy of a random vector. Probl. Inf. Transm. 23, 95-101.
- Returns
The estimate of the Shannon Information entropy of X.
- Return type
float
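The Kozachenko-Leonenko estimator can be sketched in pure Python in its usual textbook form, H ≈ ψ(N) − ψ(k) + ln(c_d) + (d/N) Σ ln(r_i), where r_i is the Euclidean distance from point i to its kth nearest neighbor and c_d is the volume of the d-dimensional unit ball. The function names are hypothetical, the result is in nats, and pybilt's implementation may differ in normalization or distance convention.

```python
import math
import random


def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{m=1}^{n-1} 1/m."""
    euler_gamma = 0.5772156649015329
    return -euler_gamma + sum(1.0 / m for m in range(1, n))


def kl_entropy_sketch(points, k=1):
    """Textbook Kozachenko-Leonenko entropy estimate in nats (sketch)."""
    n = len(points)
    d = len(points[0])
    # Volume of the d-dimensional unit ball.
    unit_ball = math.pi ** (d / 2.0) / math.gamma(d / 2.0 + 1.0)
    log_sum = 0.0
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        log_sum += math.log(dists[k - 1])  # distance to the kth neighbor
    return (digamma_int(n) - digamma_int(k)
            + math.log(unit_ball) + d * log_sum / n)


# Scalar samples are written as 1-tuples so the same code covers vectors.
rng = random.Random(0)
samples = [(rng.gauss(0.0, 1.0),) for _ in range(200)]
estimate = kl_entropy_sketch(samples, k=3)
# For comparison, the exact entropy of a standard normal is 0.5 * ln(2*pi*e).
```

The brute-force neighbor search here scales quadratically with the number of points, so it is only suitable for small illustrative data sets.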
-
pybilt.common.knn_entropy.
shannon_entropy_pc
(X, k=1, kth_dists=None)[source]¶ Return the Shannon Entropy of the random variable/vector.
This function computes the Shannon information entropy of the random variable/vector as estimated using the Perez-Cruz knn estimator described in Reference 1.
- Parameters
X (np.array) – A random variable of form X = {x_1, x_2, x_3,…,x_N} or a random vector of form X = {(x_1, y_1), (x_2, y_2),…,(x_N, y_N)}.
k (Optional[int]) – The number of nearest neighbors to store for each point. Defaults to 1.
kth_dists (Optional[list]) – A list in the same order as the points in X that holds the pre-computed distances between the points in X and their kth nearest neighbors. Defaults to None.
References
- Perez-Cruz (2008). Estimation of Information Theoretic Measures for Continuous Random Variables. Advances in Neural Information Processing Systems 21 (NIPS). Vancouver (Canada), December. https://papers.nips.cc/paper/3417-estimation-of-information-theoretic-measures-for-continuous-random-variables.pdf
- Returns
The estimate of the Shannon Information entropy of X.
- Return type
float
pybilt.common.running_stats module¶
Running stats module.
This module defines the RunningStats and BlockAverager classes, as well as the gen_running_average function.
-
class
pybilt.common.running_stats.
BlockAverager
(points_per_block=1000, min_points_in_block=500, store_data=False)[source]¶ Bases:
object
An object that keeps track of points for block averaging.
-
n_blocks
¶ The current number of active blocks.
- Type
int
Initialize the BlockAverager.
- Parameters
points_per_block (int, Optional) – The number of points to assign to a block before initiating a new block. Default: 1000
min_points_in_block (int, Optional) – The minimum number of points that a block (typically the last block) can have and still be included in computing the final block average and standard error estimates. This value should be <= points_per_block. Default: 500
-
averages_of_blocks
()[source]¶ Return the block average and standard error.
- Returns
Returns a length two tuple with the block average and standard error estimates.
- Return type
tuple
-
get
()[source]¶ Return the block average and standard error.
- Returns
Returns a length two tuple with the block average and standard error estimates.
- Return type
tuple
-
number_of_blocks
()[source]¶ Return the current number of blocks.
- Returns
The number of blocks.
- Return type
int
-
points_per_block
()[source]¶ Return information about the points per block.
- Returns
A three element tuple containing the setting for points per block, the setting for minimum points per block, and the number of points in the last block.
- Return type
tuple
-
push_container
(data)[source]¶ Push a container (array or array like) of data points to the block averaging.
- Parameters
data (array like) – The container (list, tuple, np.array, etc.) of data points to add to the block averaging.
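Block averaging as described for this class can be sketched as a standalone function. The name block_average_sketch is hypothetical, and computing the standard error from the sample standard deviation of the block means is a common convention assumed here rather than confirmed for pybilt.

```python
import math
import statistics


def block_average_sketch(data, points_per_block=1000, min_points_in_block=500):
    """Return (mean of block means, standard error of that mean)."""
    blocks = [data[i:i + points_per_block]
              for i in range(0, len(data), points_per_block)]
    # Drop a short trailing block that has too few points.
    blocks = [b for b in blocks if len(b) >= min_points_in_block]
    means = [statistics.fmean(b) for b in blocks]
    average = statistics.fmean(means)
    # Needs at least two blocks; statistics.stdev raises otherwise.
    std_err = statistics.stdev(means) / math.sqrt(len(means))
    return average, std_err


data = [1.0, 3.0, 2.0, 4.0, 5.0, 7.0, 9.0]
# Blocks: [1, 3], [2, 4], [5, 7]; the single leftover point is excluded.
avg, err = block_average_sketch(data, points_per_block=2, min_points_in_block=2)
```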
-
-
class
pybilt.common.running_stats.
RunningStats
[source]¶ Bases:
object
A RunningStats object.
The RunningStats object keeps running statistics for a single value/quantity.
-
n
¶ The number of points that have been pushed to the running average.
- Type
int
Initialize the RunningStats object.
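Running statistics for a single quantity are often kept with Welford's online algorithm, which updates the mean and variance one value at a time without storing the data. The class below is a sketch of that technique. Apart from the attribute n, the class and method names are hypothetical, and whether pybilt's RunningStats uses this exact update is an assumption.

```python
import math


class RunningStatsSketch:
    """Running mean and standard deviation via Welford's algorithm."""

    def __init__(self):
        self.n = 0         # number of points pushed so far
        self._mean = 0.0
        self._m2 = 0.0     # sum of squared deviations from the running mean

    def push(self, value):
        """Add one value and update the running statistics."""
        self.n += 1
        delta = value - self._mean
        self._mean += delta / self.n
        self._m2 += delta * (value - self._mean)

    def mean(self):
        return self._mean

    def deviation(self):
        """Sample standard deviation (n - 1 in the denominator)."""
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0


stats = RunningStatsSketch()
for value in (2.0, 4.0, 4.0, 6.0):
    stats.push(value)
```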
-
-
pybilt.common.running_stats.
binned_average
(data, positions, n_bins=25, position_range=None, min_count=0)[source]¶ Compute averages over a quantized range of histogram like bins.
- Parameters
data (np.array) – A 1d numpy array of values.
positions (np.array) – A 1d numpy array of positions corresponding to the values in data. These are used to assign the values to the histogram like bins for averaging.
n_bins (Optional[int]) – Set the target number of bins to quantize the position_range up into. Defaults to 25
position_range (Optional[tuple]) – A two element tuple containing the lower and upper range to bin the positions over; i.e. (position_lower, position_upper). Defaults to None, which uses positions.min() and positions.max().
- Returns
Returns a tuple with two numpy arrays of form (bins, averages).
- Return type
tuple
Notes
The function automatically filters out bins that have a zero count, so the final value of the number of bins and values will be len(bins) <= n_bins.
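The binning and filtering behavior described above can be sketched in pure Python. The name binned_average_sketch is hypothetical, the min_count argument is left out, and reporting bin centers (rather than edges) in the first returned list is an assumption.

```python
def binned_average_sketch(data, positions, n_bins=25, position_range=None):
    """Average `data` in equal-width position bins, skipping empty bins."""
    if position_range is None:
        position_range = (min(positions), max(positions))
    lo, hi = position_range
    width = (hi - lo) / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for value, pos in zip(data, positions):
        if not lo <= pos <= hi:
            continue  # ignore positions outside the requested range
        # A position exactly at the upper edge goes into the last bin.
        index = min(int((pos - lo) / width), n_bins - 1)
        sums[index] += value
        counts[index] += 1
    filled = [i for i in range(n_bins) if counts[i] > 0]
    centers = [lo + (i + 0.5) * width for i in filled]
    averages = [sums[i] / counts[i] for i in filled]
    return centers, averages


centers, averages = binned_average_sketch(
    data=[1.0, 3.0, 10.0], positions=[0.1, 0.2, 3.9],
    n_bins=4, position_range=(0.0, 4.0))
```

Only two of the four bins receive data in this example, so the returned lists have length two, which mirrors the len(bins) <= n_bins note.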
-
pybilt.common.running_stats.
block_avg_hist
(nparray_1d, block_size, in_range='auto', scale=False, *args, **kwargs)[source]¶ Creates histograms for each block and averages them to generate a single block-averaged histogram.
-
pybilt.common.running_stats.
gen_running_average
(onednparray)[source]¶ Generates a running average.
- Parameters
onednparray (numpy.array) – A 1d numpy array of measurements (e.g. over time).
- Returns
A 2d array of dimension len(onednparray) x 2, where 2dnparray[i][0] is the running average at i and 2dnparray[i][1] is the running standard deviation at i, for i in range(0, len(onednparray)).
- Return type
numpy.array
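The shape of this output can be illustrated with a short pure-Python sketch that builds one [running average, running standard deviation] row per input value. The name running_average_sketch is hypothetical, plain lists stand in for numpy arrays, and using the population standard deviation (dividing by the count) is an assumption about the convention.

```python
import math


def running_average_sketch(values):
    """Return [[running mean, running std], ...], one row per input value."""
    rows = []
    total = 0.0
    total_sq = 0.0
    for count, value in enumerate(values, start=1):
        total += value
        total_sq += value * value
        mean = total / count
        # Population variance of the first `count` values; clamp tiny
        # negative values caused by floating point round-off.
        variance = max(total_sq / count - mean * mean, 0.0)
        rows.append([mean, math.sqrt(variance)])
    return rows


rows = running_average_sketch([1.0, 3.0, 5.0])
```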