agas.pair_from_wide_df

agas.pair_from_wide_df(df: DataFrame, similarity_function: Callable, divergence_function: Callable, similarity_weight: Union[float, int] = 0.5, return_filter: Union[str, float, int] = 'first', values_columns: Optional[Union[Tuple, List, ndarray]] = None, return_matrix: bool = False)

Calculate the optimality score of each unique pair of rows in a wide-format dataframe.

A pair is optimal when the difference between its two rows is minimal after each row is aggregated using similarity_function, and maximal after each row is aggregated using divergence_function.

For an explanation of the optimization process, see the Notes section in the docstring of pair_from_array.
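The description above can be sketched in plain NumPy. This is only an illustration under stated assumptions, not the Agas implementation: the helper name sketch_pair_scores is hypothetical, and min-max scaling is assumed as the normalization of the absolute differences.

```python
import itertools

import numpy as np


def sketch_pair_scores(values, similarity_function, divergence_function,
                       similarity_weight=0.5):
    """Hypothetical helper (not part of agas): score every unique pair of rows."""
    values = np.asarray(values, dtype=float)
    # All unique pairs of row positions, e.g. (0, 1), (0, 2), (1, 2).
    pairs = list(itertools.combinations(range(len(values)), 2))

    def normalized_abs_diffs(func):
        # Aggregate each row, then take the absolute difference per pair.
        aggregated = func(values, axis=1)
        diffs = np.array([abs(aggregated[i] - aggregated[j]) for i, j in pairs])
        # Assumption: min-max scaling to [0, 1]; requires non-identical diffs.
        return (diffs - diffs.min()) / (diffs.max() - diffs.min())

    similarity = normalized_abs_diffs(similarity_function)
    divergence = normalized_abs_diffs(divergence_function)
    # Small similarity differences and large divergence differences score high.
    scores = (similarity_weight * (1 - similarity)
              + (1 - similarity_weight) * divergence)
    return pairs, scores


# Rows are Yaw, Pitch and Roll from the Examples section.
data = np.array([(0, 2, 1), (10, 11, 100), (120, 150, 179)])
pairs, scores = sketch_pair_scores(data, np.std, np.sum)
best = pairs[int(np.argmax(scores))]  # (0, 2), i.e. Yaw and Roll
```

Under these assumptions the sketch selects the same pairs as the Examples section: rows 0 and 2 with the default weight, and rows 1 and 2 when similarity_weight is 0.8.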

Parameters

df (pd.DataFrame) – A wide (unstacked, pivoted) dataframe, where scores are stored in columns and unique units are stored in rows.
similarity_function (Callable) – Used to aggregate the observations within each row of df (i.e., along the columns axis). The absolute difference between the aggregated values is then calculated for every pair of rows and normalized. Pairs with minimal normalized absolute differences on similarity_function and maximal normalized absolute differences on divergence_function are scored as most optimal, while pairs with maximal normalized absolute differences on similarity_function and minimal normalized absolute differences on divergence_function are scored as least optimal.
divergence_function (Callable) – Used to aggregate the observations within each row of df, in the same way as similarity_function. Pairs with maximal normalized absolute differences on divergence_function are scored as more optimal; see similarity_function for the full scoring rule.
similarity_weight (int, float, default=0.5) – Weight of similarity_function in the weighted average of the aggregated differences. Must be between 0 and 1, inclusive. The weight of divergence_function is 1 - similarity_weight.
return_filter (Union[str, int, float], default 'first') –
If string, must be one of {‘first’, ‘top’, ‘bottom’, ‘last’, ‘all’}:
- ’first’ returns the indices of the most optimal pair.
- ’top’ returns all pairs equivalent to the most optimal pair.
- ’last’ returns the least optimal pair.
- ’bottom’ returns all pairs equivalent to the least optimal pair.
- ’all’ returns all pairs sorted by optimality (descending).
If int or float:
- Must be in the range [0, 1] (inclusive). Returns all pairs up to and including the input value, i.e., 0 is equivalent to ‘top’ and 1 is equivalent to ‘all’.
return_matrix (bool, optional, default False) – If True, returns the matrix of optimality scores, regardless of the value of return_filter. If False, follows the specification under return_filter.
values_columns (array-like, default None) – List, tuple or array of the names of the score columns to aggregate. If None, all columns are aggregated.
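As a sketch of the wide layout that df is expected to have, the snippet below pivots a long-format table into one row per unit and one column per score using pandas. The long_df frame and its column names (sensor, time, angle) are made up for illustration.

```python
import pandas as pd

# Hypothetical long-format readings: one row per (sensor, time) observation.
long_df = pd.DataFrame({
    "sensor": ["Yaw", "Yaw", "Yaw", "Pitch", "Pitch", "Pitch"],
    "time": ["3PM", "6PM", "9PM"] * 2,
    "angle": [0, 2, 1, 10, 11, 100],
})

# Pivot to wide format: units in rows, scores in columns.
wide_df = long_df.pivot(index="sensor", columns="time", values="angle")
```

A frame shaped like wide_df could then be passed as df. If it also held non-score columns, the score columns would be listed in values_columns, e.g. ['3PM', '6PM', '9PM'].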

Returns

If return_matrix is False (default):
- If return_filter is ‘indices’, returns the indices of the optimal pair of rows out of df (e.g., df.iloc[optimal, :]).
- If return_filter is ‘scores’, returns a dataframe composed of the optimal pair of rows out of df.
If return_matrix is True, returns a 2d array of size [N(N-1)/2, N(N-1)/2], containing the optimality scores (ranging from 0 to 1, inclusive) between each pair of row-index pairs. The matrix diagonal is filled with NaNs.

See also

pair_from_array()

Notes

Agas currently does not accept string function names (e.g., 'sum') for aggregation, unlike pandas. Pass a callable such as np.sum instead.
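One possible workaround, if function names arrive as strings (for example from a configuration file), is to translate them into callables before calling Agas. The lookup table and the resolve_aggregator helper below are hypothetical and not part of Agas.

```python
import numpy as np

# Hypothetical lookup table (not part of agas): names mapped to NumPy callables.
AGGREGATORS = {
    "mean": np.mean,
    "median": np.median,
    "std": np.std,
    "sum": np.sum,
}


def resolve_aggregator(name_or_callable):
    """Return a callable for a known name; pass callables through unchanged."""
    if callable(name_or_callable):
        return name_or_callable
    try:
        return AGGREGATORS[name_or_callable]
    except KeyError:
        raise ValueError(f"Unknown aggregation name: {name_or_callable!r}")


similarity = resolve_aggregator("std")   # np.std
divergence = resolve_aggregator(np.sum)  # already callable, returned as-is
```

The resolved callables can then be supplied as similarity_function and divergence_function.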

Examples

Setting up a small dataset of angle readings from fictitious sensors, collected in 3-hour intervals.

>>> import numpy as np
>>> import pandas as pd
>>> import agas
>>> data = np.array([(0, 2, 1), (10, 11, 100), (120, 150, 179)])
>>> df = pd.DataFrame(data, columns=['3PM', '6PM', '9PM'],
...             index=['Yaw', 'Pitch', 'Roll'])
>>> df.agg([np.std, 'sum'], axis=1)
         std    sum
Yaw     1.00    3.0
Pitch  51.68  121.0
Roll   29.50  449.0

With equal weights, Yaw and Roll are the optimal pair: their sums differ the most, while their standard deviations are relatively similar.

>>> indices, scores = agas.pair_from_wide_df(df, np.std, np.sum)
>>> df.iloc[indices, :]
      3PM  6PM  9PM
Yaw     0    2    1
Roll  120  150  179

Giving the standard deviation a heavier weight leads to Pitch and Roll being selected as the optimal pair.

>>> indices, scores = agas.pair_from_wide_df(df, np.std, np.sum, 0.8)
>>> df.iloc[indices, :]
       3PM  6PM  9PM
Pitch   10   11  100
Roll   120  150  179

Prioritizing small differences between sums and large differences between standard deviations:

>>> indices, scores = agas.pair_from_wide_df(df, np.sum, np.std)
>>> df.iloc[indices, :]
       3PM  6PM  9PM
Yaw      0    2    1
Pitch   10   11  100