Add a threshold to the CorrelationSimilarity metric #816

@npatki

Problem Description

When evaluating synthetic data, we care most about whether the synthetic data has preserved trends that were strongly present in the real data. If the real data didn't have a strong trend to begin with, then it's fairly typical for the synthetic data to also not have a trend; synthesizers don't usually invent correlations when there are none to begin with. So this case is uninteresting.

Right now, if the real data doesn't have any strong trends, the CorrelationSimilarity metric will typically report a very high score, because the synthetic data also lacks a strong trend. It would be better if the metric itself could be set up with a threshold. If the real data does not meet that threshold (i.e. there is no strong trend to begin with), then the metric would return NaN instead.

Expected behavior

In CorrelationSimilarity, add a parameter called real_correlation_threshold. The metric would then work as follows:

  1. It would compute the correlation on the real data
  2. If the absolute value of the real data's correlation exceeds the threshold, then the correlation is considered "strong" and the rest of the metric computation continues.
  3. Otherwise, the metric score would be a NaN instead. There is no need to compute the correlation on the synthetic data.

The default value of this parameter can be 0 (meaning that the behavior is the same as the status quo); however, it will be easy for the user (or a report) to set a new value when running the metric.

from sdmetrics.column_pairs import CorrelationSimilarity

CorrelationSimilarity.compute(
    real_data=real_table[['column_1', 'column_2']],
    synthetic_data=synthetic_table[['column_1', 'column_2']],
    coefficient='Pearson',
    real_correlation_threshold=0.5
)
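For illustration, the three steps above could be sketched roughly as follows. This is a standalone sketch, not SDMetrics code: the helper names `pearson` and `thresholded_correlation_similarity` are hypothetical, only the Pearson coefficient is covered, and the final score is assumed to be `1 - |real - synthetic| / 2`. It reads "exceeds" as a strict comparison, so a real correlation exactly equal to the threshold yields NaN.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    # A constant column has no defined correlation.
    return cov / denom if denom > 0 else float('nan')


def thresholded_correlation_similarity(real, synthetic, real_correlation_threshold=0.0):
    """Hypothetical sketch of the proposed behavior.

    `real` and `synthetic` are each a pair (column_1_values, column_2_values).
    """
    # Step 1: compute the correlation on the real data only.
    real_corr = pearson(*real)

    # Steps 2 and 3: continue only if |real_corr| strictly exceeds the
    # threshold; otherwise return NaN without touching the synthetic data.
    if not abs(real_corr) > real_correlation_threshold:
        return float('nan')

    synthetic_corr = pearson(*synthetic)
    # Assumed scoring formula: 1 - |difference| / 2, which lies in [0, 1].
    return 1 - abs(real_corr - synthetic_corr) / 2


strong_real = ([1, 2, 3, 4], [2, 4, 6, 8])      # perfectly correlated
weak_real = ([1, 2, 3, 4], [1, -1, -1, 1])      # no linear trend
synthetic = ([1, 2, 3, 4], [2, 4, 6, 8])

strong_score = thresholded_correlation_similarity(strong_real, synthetic, 0.5)
weak_score = thresholded_correlation_similarity(weak_real, synthetic, 0.5)
```

With a threshold of 0.5, the strongly correlated real pair is scored as usual, while the pair with no linear trend returns NaN before the synthetic correlation is ever computed.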
