Problem Description
In the context of evaluating synthetic data, we care more about whether the synthetic data has preserved trends that were strongly present in the real data. If the real data didn't have a strong trend to begin with, then it's fairly typical for the synthetic data to also not have a trend; synthesizers don't usually invent correlations when there are none to begin with. So this case is uninteresting.
Right now, if the real data doesn't have any strong trends, the CorrelationSimilarity metric will typically report a very high score, since the synthetic data would also lack a strong trend. Instead, it would be better if the metric itself could be configured with a threshold. If that threshold is not met for the real data (i.e. there is no strong trend to begin with), then the metric would return NaN instead.
Expected behavior
In CorrelationSimilarity, add a parameter called real_correlation_threshold. The metric would then work as follows:
- It would compute the correlation on the real data
- If the absolute value of the real data's correlation exceeds the threshold, then the correlation is considered "strong" and the rest of the metric computation continues.
- Otherwise, the metric score would be NaN instead. There is no need to compute the correlation on the synthetic data.
The default value of this parameter can be 0 (meaning the behavior is the same as the status quo), but it will be easy for the user (or a report) to set a new value when running the metric.
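
As a rough sketch of the proposed logic (not the actual sdmetrics internals; it assumes the existing score is computed as `1 - |r_real - r_synth| / 2` using the `scipy.stats` correlation functions):

```python
import numpy as np
from scipy import stats


def correlation_similarity_with_threshold(real_data, synthetic_data,
                                          coefficient='Pearson',
                                          real_correlation_threshold=0.0):
    """Sketch of the proposed behavior, not the actual CorrelationSimilarity code."""
    corr_fn = stats.pearsonr if coefficient == 'Pearson' else stats.spearmanr

    # Correlation between the two columns of the real data.
    real_corr, _ = corr_fn(real_data.iloc[:, 0], real_data.iloc[:, 1])

    # No strong trend in the real data: return NaN without touching the synthetic data.
    if abs(real_corr) < real_correlation_threshold:
        return np.nan

    # Otherwise compute the synthetic correlation and score as before
    # (assumed normalization: identical correlations -> 1, opposite extremes -> 0).
    synth_corr, _ = corr_fn(synthetic_data.iloc[:, 0], synthetic_data.iloc[:, 1])
    return 1 - abs(real_corr - synth_corr) / 2
```

With the default threshold of 0, the NaN branch is never taken, so existing behavior is preserved. Usage would then look the same as today, just with the extra parameter: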
```python
from sdmetrics.column_pairs import CorrelationSimilarity

CorrelationSimilarity.compute(
    real_data=real_table[['column_1', 'column_2']],
    synthetic_data=synthetic_table[['column_1', 'column_2']],
    coefficient='Pearson',
    real_correlation_threshold=0.5
)
```
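
A report or a loop over many column pairs could then skip the pairs that come back as NaN, for example (the `column_pairs` list and the averaging step below are just illustrative, not part of sdmetrics):

```python
import numpy as np

from sdmetrics.column_pairs import CorrelationSimilarity

scores = []
for col_a, col_b in column_pairs:  # hypothetical list of (column, column) tuples
    score = CorrelationSimilarity.compute(
        real_data=real_table[[col_a, col_b]],
        synthetic_data=synthetic_table[[col_a, col_b]],
        coefficient='Pearson',
        real_correlation_threshold=0.5,
    )
    if not np.isnan(score):
        scores.append(score)

# Average only over the pairs that had a strong trend in the real data.
overall = np.mean(scores) if scores else float('nan')
```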