agas.pair_from_wide_df

agas.pair_from_wide_df(df: DataFrame, similarity_function: Callable, divergence_function: Callable, similarity_weight: Union[float, int] = 0.5, return_filter: Union[str, float, int] = 'first', values_columns: Optional[Union[Tuple, List, ndarray]] = None, return_matrix: bool = False)

Calculate the optimality score of each unique pair of rows in a wide-format dataframe.

A pair is more optimal when the difference between its two rows is minimal after each row is aggregated using similarity_function, and maximal after each row is aggregated using divergence_function.

For an explanation of the optimization process, see the Notes section in the docstring of pair_from_array.
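The scoring idea can be sketched in a few lines of NumPy. This is a rough illustration and not the library's implementation: the helper names min_max and pair_scores are hypothetical, and min-max scaling is an assumed choice of normalization. The data are the sensor readings from the Examples section.

```python
from itertools import combinations

import numpy as np


def min_max(x):
    """Scale values to [0, 1]; assumed normalization, for illustration only."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span


def pair_scores(values, similarity_function, divergence_function,
                similarity_weight=0.5):
    """Hypothetical helper: score every unique pair of rows in `values`."""
    pairs = list(combinations(range(len(values)), 2))
    # Aggregate each row (along the columns axis) with both functions.
    sim = similarity_function(values, axis=1)
    div = divergence_function(values, axis=1)
    # Normalized absolute differences between the aggregated values of each pair.
    sim_diff = min_max([abs(sim[i] - sim[j]) for i, j in pairs])
    div_diff = min_max([abs(div[i] - div[j]) for i, j in pairs])
    # Small similarity differences and large divergence differences score high.
    scores = (similarity_weight * (1 - sim_diff)
              + (1 - similarity_weight) * div_diff)
    return pairs, scores


data = np.array([(0, 2, 1), (10, 11, 100), (120, 150, 179)])
pairs, scores = pair_scores(data, np.std, np.sum)
best = pairs[int(np.argmax(scores))]
print(best)  # (0, 2): rows 0 and 2, i.e. Yaw and Roll in the Examples section
```

Under these assumptions the sketch selects the same pairs as the documented examples, including the switch to rows 1 and 2 when similarity_weight is raised to 0.8.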

Parameters
  • df (pd.DataFrame) – A wide (unstacked, pivoted) dataframe, where scores are stored in columns and unique units are stored in rows.

  • similarity_function (Callable) – Used to aggregate the observations of each row of df (i.e., along the columns axis). The absolute differences between the aggregated values of every pair of rows are then calculated and normalized. Pairs with minimal normalized absolute differences on similarity_function and maximal normalized absolute differences on divergence_function are scored as most optimal, while pairs with maximal normalized absolute differences on similarity_function and minimal normalized absolute differences on divergence_function are scored as least optimal.

  • divergence_function (Callable) – Used in the same way as similarity_function: it aggregates the observations of each row of df (i.e., along the columns axis), and the normalized absolute differences between the aggregated values of every pair of rows are calculated. Pairs with larger normalized absolute differences on divergence_function are scored as more optimal.

  • similarity_weight (int, float, default=0.5) – The weight given to similarity_function in the weighted average of the aggregated differences. Must be between 0 and 1, inclusive. The weight of divergence_function will be 1 - similarity_weight.

  • return_filter (Union[str, int, float], default 'first') –

    If a string, must be one of {‘first’, ‘top’, ‘bottom’, ‘last’, ‘all’}:
    • ‘first’ returns the indices of the optimal pair.

    • ‘top’ returns all pairs equivalent to the most optimal pair.

    • ‘last’ returns the least optimal pair.

    • ‘bottom’ returns all pairs equivalent to the least optimal pair.

    • ‘all’ returns all pairs sorted by optimality (descending).

    If an int or float:
    • Must be in the range [0, 1] (inclusive). Returns all pairs up to and including the input value; i.e., 0 is equivalent to ‘top’ and 1 is equivalent to ‘all’.

  • return_matrix (bool, optional, default False) – If True, returns the matrix of optimality scores, regardless of the return_filter value. If False, follows the specification under return_filter.

  • values_columns (array-like, default None) – List, tuple or array of the names of the score columns to aggregate. If None, all columns are aggregated.
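As a minimal sketch of what a wide-format input looks like, the following reshapes a long-format table into one row per unit using pandas. The column names sensor, time and angle, the sample values, and the choice of values_columns are illustrative assumptions and are not part of agas.

```python
import pandas as pd

# Long format: one row per (unit, measurement) combination.
long_df = pd.DataFrame({
    "sensor": ["Yaw", "Yaw", "Yaw", "Pitch", "Pitch", "Pitch"],
    "time": ["3PM", "6PM", "9PM"] * 2,
    "angle": [0, 2, 1, 10, 11, 100],
})

# Wide format: unique units in rows, scores in columns.
wide_df = long_df.pivot(index="sensor", columns="time", values="angle")

# A subset of score columns that could be passed as values_columns.
values_columns = ["3PM", "6PM"]
print(wide_df[values_columns])
```

A frame shaped like wide_df is the kind of input df is described as expecting; columns left out of values_columns would presumably be ignored during aggregation.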

Returns

  • If return_matrix is False (default)

    • If return_filter is ‘indices’, returns the indices of the optimal pair of rows out of df (e.g., df.iloc[optimal, :]).

    • If return_filter is ‘scores’, returns a dataframe composed of the optimal pair of rows out of df.

  • If return_matrix is True, returns a 2d array of size [N(N-1)/2, N(N-1)/2], containing the optimality scores (ranging from 0 to 1, inclusive) between each pair of row-index pairs. The matrix diagonal is filled with NaNs.

Notes

Currently Agas doesn’t allow usage of string function names for aggregation, unlike what can be done using pandas.
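Because only callables are accepted, a pandas-style string name has to be swapped for a function object before the call. The sketch below shows one way to do that; as_callable is a hypothetical helper, not part of agas, and it assumes the aggregation function is invoked NumPy-style. It also shows functools.partial for fixing keyword arguments, e.g. to get the sample standard deviation (ddof=1, the pandas default) rather than NumPy's population default (ddof=0).

```python
from functools import partial

import numpy as np


def as_callable(name):
    """Hypothetical helper: map a string such as 'sum' to the NumPy function."""
    func = getattr(np, name, None)
    if not callable(func):
        raise ValueError(f"No NumPy aggregation function named {name!r}")
    return func


# 'sum' (pandas-style string) becomes the callable np.sum.
sum_function = as_callable("sum")

# A callable with a fixed keyword argument: sample standard deviation.
sample_std = partial(np.std, ddof=1)

print(sum_function([0, 2, 1]), sample_std([0, 2, 1]))
```

Either object can then be passed where similarity_function or divergence_function is expected.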

Examples


Setting up a small dataset of angle readings from fictitious sensors, collected in 3-hour intervals.

>>> data = np.array([(0, 2, 1), (10, 11, 100), (120, 150, 179)])
>>> df = pd.DataFrame(data, columns=['3PM', '6PM', '9PM'],
...             index=['Yaw', 'Pitch', 'Roll'])
>>> df.agg([np.std, 'sum'], axis=1)
         std    sum
Yaw     1.00    3.0
Pitch  51.68  121.0
Roll   29.50  449.0

With equal weights, Yaw and Roll offer the best trade-off: their normalized difference in standard deviation is relatively small, and their normalized difference in sum is the largest.

>>> indices, scores = agas.pair_from_wide_df(df, np.std, np.sum)
>>> df.iloc[indices, :]
      3PM  6PM  9PM
Yaw     0    2    1
Roll  120  150  179

Giving the standard deviation a heavier weight leads to Pitch and Roll being selected as the optimal pair.

>>> indices, scores = agas.pair_from_wide_df(df, np.std, np.sum, 0.8)
>>> df.iloc[indices, :]
       3PM  6PM  9PM
Pitch   10   11  100
Roll   120  150  179

Prioritizing small differences between sums and large differences in standard deviation:

>>> indices, scores = agas.pair_from_wide_df(df, np.sum, np.std)
>>> df.iloc[indices, :]
       3PM  6PM  9PM
Yaw      0    2    1
Pitch   10   11  100