perf: optimize conflict resolution algorithm in AnonymizerEngine #1797

Ray0907 · 2025-11-29T11:54:12Z

Change Description

Replace O(n) list.remove() with O(1) set operations for tracking active indices
Use pop() instead of remove() in whitespace merge since prev_result is always last
Convert list comprehension to generator expression for short-circuit evaluation
Pre-compile regex pattern to avoid repeated compilation

Issue reference

Fixes #XX

Checklist

I have reviewed the contribution guidelines
I have signed the CLA (if required)
My code includes unit tests
All unit tests and lint checks pass locally
My PR contains documentation updates / additions if required

- Replace O(n) list.remove() with O(1) set operations for tracking active indices - Use pop() instead of remove() in whitespace merge since prev_result is always last - Convert list comprehension to generator expression for short-circuit evaluation - Pre-compile regex pattern to avoid repeated compilation

Ray0907 · 2025-11-29T11:55:47Z

@microsoft-github-policy-service agree

Copilot

Pull request overview

This PR aims to optimize the conflict resolution algorithm in AnonymizerEngine by replacing O(n) list operations with O(1) set operations and making other performance improvements.

Changes:

Replace list.remove() with set-based active_indices tracking in merge and conflict resolution phases
Add outer loop with re-sorting to REMOVE_INTERSECTIONS algorithm
Optimize whitespace merging with pre-compiled regex and pop() instead of remove()
Convert list comprehension to generator expression for short-circuit evaluation

Copilot · 2026-01-12T10:11:17Z

presidio-anonymizer/presidio_anonymizer/anonymizer_engine.py

+            # Process until no more overlaps require re-sorting
+            needs_resort = True
+            while needs_resort:
+                needs_resort = False
+                index = 0
+                elements_length = len(unique_text_metadata_elements)
+                while index < elements_length - 1:
+                    current_entity = unique_text_metadata_elements[index]
+                    next_entity = unique_text_metadata_elements[index + 1]
+                    if current_entity.end <= next_entity.start:
+                        index += 1
                    else:
-                        current_entity.end = next_entity.start
+                        if current_entity.score >= next_entity.score:
+                            next_entity.start = current_entity.end
+                        else:
+                            current_entity.end = next_entity.start
+                        needs_resort = True
+                        break
+                if needs_resort:
                    unique_text_metadata_elements.sort(
                        key=lambda element: element.start
                    )


The addition of the outer loop with needs_resort flag significantly changes the algorithm behavior and reduces performance. The original algorithm correctly handles overlaps by staying at the same index after adjusting boundaries, then re-checking the pair in the next iteration. This is efficient - only the affected pair is re-checked.

The new algorithm breaks after each adjustment, re-sorts the entire list O(n log n), and restarts from index 0, re-checking all pairs. This is less efficient because:

Multiple sorts instead of one initial sort

Re-checking previously resolved pairs

The original algorithm already correctly resolves all overlaps without re-sorting

This change contradicts the PR's stated goal of performance optimization. The original single-pass algorithm with local re-checking should be preserved.

Copilot · 2026-01-12T10:11:17Z

presidio-anonymizer/presidio_anonymizer/anonymizer_engine.py

    @staticmethod
    def __is_result_conflicted_with_other_elements(other_elements, result):
+        # Use generator expression for short-circuit evaluation
        return any(
-            [result.has_conflict(other_element) for other_element in other_elements]
+            result.has_conflict(other_element) for other_element in other_elements
        )


The method __is_result_conflicted_with_other_elements is no longer called anywhere in the code after the refactoring. The functionality has been inlined at line 187. This method should be removed to avoid maintaining dead code.

Copilot · 2026-01-12T10:11:17Z

presidio-anonymizer/presidio_anonymizer/anonymizer_engine.py

+            return []
+
+        # Pre-compile regex pattern for efficiency
+        whitespace_pattern = re.compile(r"^( )+$")


The regex pattern r"^( )+$" contains unnecessary parentheses around the space character. This can be simplified to r"^ +$" for better readability without changing functionality.

Suggested change

whitespace_pattern = re.compile(r"^( )+$")

whitespace_pattern = re.compile(r"^ +$")

Copilot · 2026-01-12T10:11:18Z

presidio-anonymizer/presidio_anonymizer/anonymizer_engine.py

+            return []
+
+        # Pre-compile regex pattern for efficiency
+        whitespace_pattern = re.compile(r"^( )+$")


The regex pattern is compiled inside the method, which means it gets recompiled on every call to _merge_entities_with_whitespace_between. For better performance, consider compiling the pattern once at the class or module level (e.g., as a class constant: WHITESPACE_PATTERN = re.compile(r"^ +$")).

github-actions bot added the external label Nov 29, 2025

Merge branch 'main' into main

0b80309

RonShakutai requested a review from Copilot January 12, 2026 10:04

Copilot started reviewing on behalf of RonShakutai January 12, 2026 10:05 View session

RonShakutai self-requested a review January 12, 2026 10:06

Copilot AI reviewed Jan 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: optimize conflict resolution algorithm in AnonymizerEngine #1797

perf: optimize conflict resolution algorithm in AnonymizerEngine #1797

Uh oh!

Ray0907 commented Nov 29, 2025

Uh oh!

Ray0907 commented Nov 29, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Jan 12, 2026

Uh oh!

Copilot AI Jan 12, 2026

Uh oh!

Copilot AI Jan 12, 2026

Uh oh!

Copilot AI Jan 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	whitespace_pattern = re.compile(r"^( )+$")
	whitespace_pattern = re.compile(r"^ +$")

perf: optimize conflict resolution algorithm in AnonymizerEngine #1797

Are you sure you want to change the base?

perf: optimize conflict resolution algorithm in AnonymizerEngine #1797

Uh oh!

Conversation

Ray0907 commented Nov 29, 2025

Change Description

Issue reference

Checklist

Uh oh!

Ray0907 commented Nov 29, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants