agas.pair_from_wide_df#
- agas.pair_from_wide_df(df: DataFrame, similarity_function: Callable, divergence_function: Callable, similarity_weight: Union[float, int] = 0.5, return_filter: Union[str, float, int] = 'first', values_columns: Optional[Union[Tuple, List, ndarray]] = None, return_matrix: bool = False)#
Calculate the optimality score of each unique pair of rows in a wide-format dataframe.
Optimality is such that when that the difference between two rows is minimal when each is aggregated using the similarity_function and maximal when each is aggregated using divergence_function.
For an explenation on the optimization process See the Notes section in the docstring of
pair_from_array
.- Parameters
df (pd.DataFrame) – A wide (unstacked, pivoted) dataframe, where scores are stored in columns and unique units are stored in rows.
similarity_function (Callable) – Each of these two functions is used to aggregate groups of observations within input_array (i.e., along the columns axis). The absoluet differences between any pair out of the aggregated values for each group . Pairs with minimal normalized absolute differences on similarity_function and maximal normalized absolute differences on divergence_function are scored as more optimal, while pairs with maximal normalized absolute differences on similarity_function and minimal normalized absolute differences on divergence_function are scored as least optimal.
divergence_function (Callable) – Each of these two functions is used to aggregate groups of observations within input_array (i.e., along the columns axis). The absoluet differences between any pair out of the aggregated values for each group . Pairs with minimal normalized absolute differences on similarity_function and maximal normalized absolute differences on divergence_function are scored as more optimal, while pairs with maximal normalized absolute differences on similarity_function and minimal normalized absolute differences on divergence_function are scored as least optimal.
similarity_weight (int, float, default=0.5) – Used to weight the similarity_function function in weighted average of aggragted diffrecnces. Must be between 0 and 1, inclusive. The weight of divergence_function will be 1 - maximize weight.
return_filter (Union[str, int, float], default 'first') –
- If string must be one of {‘first’, ‘top’, ‘bottom’, ‘last’, ‘all’}:
’first’, returns the indices the optimal pair.
’top’ return all pairs equivilent with the most optimal pair.
’last’ returns the least optimal pair.
’bottom’ returns all pairs eqivilent with the least optimal pair.
’all’ returns all pairs sorted by optimality (descending).
- If int or float
Must be in the range of [0, 1] (inclusive). Returns all pairs up to the input value, including. i.e., 0 is equivilent to ‘top’, 1 is equivilent to ‘all’.
return_matrix (bool: optional, default False) – if return_matrix is True, returns the matrix of optimality scores, regardless of return_filter value. If False, follows the specification under return_filter.
values_columns (array-like, Default None) – List, Tuple or Array of the column names of the scores to aggregate. If None, assumes all columns should be aggregated.
- Returns
If return_matrix is False (default) –
If return_filter is ‘indices’, returns the indices of the
optimal pair of rows out of df (e.g., df.iloc[optimal, :]). - If return_filter is ‘scores’ returns a dataframe composed of the optimal pair of rows out of df.
If return_matrix is True, returns a 2d array of size [N(N-1)/2, N(N-1)/2],
containing the optimality scores (ranging from 0 to 1 , inclusive), between
each pair of row-indice pairs. The matrix diagonal is filled with NaNs.
See also
Notes
Currently Agas doesn’t allow usage of string function names for aggregation, unlike what can be done using pandas.
Examples
Setting up a small dataset of angle readings from fictitious sensors, collected in 3-hour intervals.
>>> data = np.array([(0, 2, 1), (10, 11, 100), (120, 150, 179)]) >>> df = pd.DataFrame(data, columns=['3PM', '6PM', '9PM'], ... index=['Yaw', 'Pitch', 'Roll']) >>> df.agg([np.std, 'sum'], axis=1) std sum Yaw 1.00 3.0 Pitch 51.68 121.0 Roll 29.50 449.0
Yaw and Roll display the highest normalized similarity in mean value, and the lowest normalized similarity in sum value.
>>> indices, scores = agas.pair_from_wide_df(df, np.std, np.sum) >>> df.iloc[indices, :] 3PM 6PM 9PM Yaw 0 2 1 Roll 120 150 179
Giving standard deviation a heavier weight, leads to Pitch and Roll selected as the optimal value.
>>> indices, scores = agas.pair_from_wide_df(df, np.std, np.sum, 0.8) >>> df.iloc[indices, :] 3PM 6PM 9PM Pitch 10 11 100 Roll 120 150 179
Prioritizing small differences between sums, and large differences in variance:
>>> indices, scores = agas.pair_from_wide_df(df, np.sum, np.std) >>> df.iloc[indices, :] 3PM 6PM 9PM Yaw 0 2 1 Pitch 10 11 100