agas.pair_from_array

agas.pair_from_array(input_array, similarity_function: Callable, divergence_function: Callable, similarity_weight: Union[float, int] = 0.5, return_filter: Union[str, float, int] = 'first', return_matrix: bool = False)

Calculate the optimality score of each unique pair of rows in a 2D array.

A pair is more optimal the smaller the difference between its two rows when each is aggregated using similarity_function, and the larger the difference when each is aggregated using divergence_function.

For a detailed description, see Notes below or the agas Tutorial.

Parameters
  • input_array (array-like, shape (N, T)) – The data to be processed. Each of the N rows (the first axis, 0) is a unique group/unit of values to be aggregated together (e.g., a subject in an experiment). Each of the T columns (the second axis, 1) represents a different observation within each group/unit (e.g., a timestamp).

  • similarity_function (Callable) – Used, together with divergence_function, to aggregate groups of observations within input_array (i.e., along the columns axis). The absolute differences between the aggregated values are then calculated for every pair of groups. Pairs with minimal normalized absolute differences on similarity_function and maximal normalized absolute differences on divergence_function are scored as most optimal, while pairs with maximal normalized absolute differences on similarity_function and minimal normalized absolute differences on divergence_function are scored as least optimal.

  • divergence_function (Callable) – Used, together with similarity_function, to aggregate groups of observations within input_array (i.e., along the columns axis). The absolute differences between the aggregated values are then calculated for every pair of groups. Pairs with minimal normalized absolute differences on similarity_function and maximal normalized absolute differences on divergence_function are scored as most optimal, while pairs with maximal normalized absolute differences on similarity_function and minimal normalized absolute differences on divergence_function are scored as least optimal.

  • similarity_weight (int, float, default=0.5) – The weight given to similarity_function in the weighted average of aggregated differences. Must be between 0 and 1, inclusive. The weight of divergence_function will be 1 - similarity_weight.

  • return_filter (Union[str, int, float], default 'first') –

    If str, must be one of {‘first’, ‘top’, ‘bottom’, ‘last’, ‘all’}:
    • ‘first’ returns the indices of the most optimal pair.

    • ‘top’ returns all pairs equivalent to the most optimal pair.

    • ‘last’ returns the least optimal pair.

    • ‘bottom’ returns all pairs equivalent to the least optimal pair.

    • ‘all’ returns all pairs sorted by optimality (descending).

    If int or float:
    • Must be in the range [0, 1] (inclusive). Returns all pairs whose optimality score is up to and including the input value, i.e., 0 is equivalent to ‘top’ and 1 is equivalent to ‘all’.

  • return_matrix (bool, optional, default False) – If True, returns the matrix of optimality scores, regardless of the return_filter value. If False, follows the specification under return_filter.
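
The return_filter options above can be read as selections over the sorted optimality scores. The following is a minimal sketch of that reading, not the library's implementation: sketch_apply_filter is a hypothetical helper, and it assumes the scores are already scaled to [0, 1] with 0 being most optimal, as in the Examples below.

```python
import numpy as np


def sketch_apply_filter(scores, return_filter="first"):
    """Hypothetical helper: positions in `scores` kept by a return_filter value.

    Assumes `scores` is already scaled to [0, 1], where 0 is most optimal.
    """
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="stable")  # most optimal first
    sorted_scores = scores[order]
    if return_filter == "first":
        return order[:1]
    if return_filter == "last":
        return order[-1:]
    if return_filter == "top":  # every pair tied with the best score
        return order[sorted_scores == sorted_scores[0]]
    if return_filter == "bottom":  # every pair tied with the worst score
        return order[sorted_scores == sorted_scores[-1]]
    if return_filter == "all":
        return order
    if isinstance(return_filter, (int, float)) and 0 <= return_filter <= 1:
        # Numeric threshold: keep scores up to and including the value.
        return order[sorted_scores <= return_filter]
    raise ValueError(f"Unsupported return_filter: {return_filter!r}")
```

Under this reading, a numeric filter of 0 keeps the same pairs as ‘top’ (the best scaled score is 0) and a filter of 1 keeps the same pairs as ‘all’.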

Returns

  • If return_matrix is False (default), returns

    indices (npt.NDArray)
    A 2D array whose column-axis size is always 2. Each row contains the row-indices of a pair from the original array.
    • If return_filter is ‘first’ or ‘last’, then indices is of length 1, as only a single pair is returned.

    • If return_filter is ‘all’, then indices is of length N(N-1)/2, as all pairs are returned.

    • If return_filter is ‘top’, ‘bottom’ or numeric, then the shape depends on the data.

    scores (npt.NDArray)

    A 1D array of the optimality scores corresponding to the pairs found in indices.

  • If return_matrix is True, returns a 2D array of size [N(N-1)/2, N(N-1)/2], containing the optimality scores (ranging from 0 to 1, inclusive) between each pair of row-index pairs. The matrix diagonal is filled with NaNs.
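
To illustrate why ‘all’ yields N(N-1)/2 rows of indices, the unique row pairs can be enumerated with NumPy. This is only an illustrative sketch: n_rows and pair_indices are hypothetical names, and the enumeration order is not necessarily the order agas uses internally.

```python
import numpy as np

n_rows = 4  # N, the number of rows in a hypothetical input_array

# All (i, j) combinations with i < j, i.e., each unordered pair exactly once.
first, second = np.triu_indices(n_rows, k=1)
pair_indices = np.column_stack([first, second])

print(pair_indices.shape)  # one row per pair, two columns of row-indices
```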

Notes

Given a matrix of size N x T, for each pair {ri, rj} out of the N * (N - 1) / 2 unique pairs of rows, a set of differences {dij1, dij2} is calculated by applying two aggregation functions {f1, f2} to ri and rj separately (i.e., dij1 = |f1(ri) - f1(rj)|, dij2 = |f2(ri) - f2(rj)|).

Each dijx in {dij1, dij2} is scaled between 0 and 1, relative to the set of absolute differences between all pairs of rows obtained using function fx.

f1 and f2 correspond to the arguments similarity_function and divergence_function, respectively. dij1 rewards minimal differences between pairs and penalizes maximal differences. dij2 is multiplied by -1 to penalize minimal differences and reward maximal differences between pairs. w1 and w2 correspond to similarity_weight and 1 - similarity_weight, respectively.

The optimality score oij is calculated as the weighted sum dij1 * w1 - dij2 * w2, then scaled again between 0 and 1, relative to the set of all scores.
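
The steps in these Notes can be sketched in plain NumPy. This is an illustrative re-implementation under stated assumptions, not the code agas runs: sketch_pair_scores and _min_max_scale are hypothetical names, scaling is assumed to be min-max scaling, and the sketch assumes the differences are not all identical (otherwise the scaling would divide by zero).

```python
import itertools

import numpy as np


def _min_max_scale(values):
    """Scale values to [0, 1]; assumes the values are not all identical."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())


def sketch_pair_scores(input_array, similarity_function, divergence_function,
                       similarity_weight=0.5):
    """Hypothetical sketch of the scoring described in the Notes."""
    arr = np.asarray(input_array, dtype=float)
    # The N * (N - 1) / 2 unique pairs of row indices (i < j).
    pairs = np.array(list(itertools.combinations(range(arr.shape[0]), 2)))
    # Aggregate each row with f1 and f2.
    f1 = np.array([similarity_function(row) for row in arr])
    f2 = np.array([divergence_function(row) for row in arr])
    # dij1 and dij2: absolute differences per pair, each scaled to [0, 1].
    d1 = _min_max_scale(np.abs(f1[pairs[:, 0]] - f1[pairs[:, 1]]))
    d2 = _min_max_scale(np.abs(f2[pairs[:, 0]] - f2[pairs[:, 1]]))
    # Weighted sum dij1 * w1 - dij2 * w2, scaled again to [0, 1].
    w1, w2 = similarity_weight, 1 - similarity_weight
    scores = _min_max_scale(d1 * w1 - d2 * w2)
    # Sort so that the most optimal pair (lowest score) comes first.
    order = np.argsort(scores, kind="stable")
    return pairs[order], scores[order]
```

Under these assumptions, calling sketch_pair_scores with np.std and np.mean on the array from the Examples orders the pairs the same way as the ‘all’ output shown there.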

Examples

Find an optimal pair of sub-arrays which have the most similar standard deviation (relative to all other sub-arrays), and the most different mean (relative to all other sub-arrays).


>>> import numpy as np
>>> import agas
>>> a = np.vstack([[0, 0.5], [0.5, 0.5], [5, 5], [4, 10]])
>>> indices, scores = agas.pair_from_array(a, similarity_function=np.std,
...    divergence_function=np.mean)
>>> indices, a[indices]
(array([1, 2], dtype=int64), array([[0.5, 0.5],
       [5. , 5. ]]))

The pair of rows [[0.5, 0.5], [5, 5]] has the lowest absolute difference in standard deviation (i.e., 0) combined with a large absolute difference in mean values (4.5). It is not the largest mean difference in the input (rows 0 and 3 differ by 6.75), but it gives the best combined score. The score of the most optimal pair is 0.

>>> scores
array([0.])

Optimality scores are more useful when asking agas.pair_from_array for a subset of pairs. For example, to get the optimality scores of all pairs, set the return_filter argument (default ‘first’) to ‘all’.

Printing the pairs of rows from a, sorted by optimality (0 is most optimal, 1 is least optimal):

>>> indices, scores = agas.pair_from_array(a, similarity_function=np.std,
...         divergence_function=np.mean, return_filter='all')
>>> print(*list(zip(indices.tolist(), scores.round(2))), sep='\n')
([1, 2], 0.0)
([0, 2], 0.03)
([0, 3], 0.41)
([1, 3], 0.5)
([0, 1], 0.53)
([2, 3], 1.0)

The return_filter argument can also be specified using a float, selecting pairs whose scores are up to a specific threshold (inclusive).

>>> indices, scores = agas.pair_from_array(a, similarity_function=np.std,
...          divergence_function=np.mean, return_filter=0.5)
>>> print(*list(zip(indices.tolist(), scores.round(2))), sep='\n')
([1, 2], 0.0)
([0, 2], 0.03)
([0, 3], 0.41)
([1, 3], 0.5)

Control the weight of the function maximizing similarity in the calculation of optimality scores using the similarity_weight keyword argument. Here we prioritize differences in means over similarity in standard deviation by decreasing similarity_weight from 0.5 (default) to 0.2. This returns a different pair than before.

>>> agas.pair_from_array(a, similarity_function=np.std,
...         divergence_function=np.mean, similarity_weight=.2)
(array([0, 3], dtype=int64), array([0.]))