-
Notifications
You must be signed in to change notification settings - Fork 916
perf: optimize conflict resolution algorithm in AnonymizerEngine #1797
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Replace O(n) list.remove() with O(1) set operations for tracking active indices - Use pop() instead of remove() in whitespace merge since prev_result is always last - Convert list comprehension to generator expression for short-circuit evaluation - Pre-compile regex pattern to avoid repeated compilation
|
@microsoft-github-policy-service agree |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR aims to optimize the conflict resolution algorithm in AnonymizerEngine by replacing O(n) list operations with O(1) set operations and making other performance improvements.
Changes:
- Replace list.remove() with set-based active_indices tracking in merge and conflict resolution phases
- Add outer loop with re-sorting to REMOVE_INTERSECTIONS algorithm
- Optimize whitespace merging with pre-compiled regex and pop() instead of remove()
- Convert list comprehension to generator expression for short-circuit evaluation
| # Process until no more overlaps require re-sorting | ||
| needs_resort = True | ||
| while needs_resort: | ||
| needs_resort = False | ||
| index = 0 | ||
| elements_length = len(unique_text_metadata_elements) | ||
| while index < elements_length - 1: | ||
| current_entity = unique_text_metadata_elements[index] | ||
| next_entity = unique_text_metadata_elements[index + 1] | ||
| if current_entity.end <= next_entity.start: | ||
| index += 1 | ||
| else: | ||
| current_entity.end = next_entity.start | ||
| if current_entity.score >= next_entity.score: | ||
| next_entity.start = current_entity.end | ||
| else: | ||
| current_entity.end = next_entity.start | ||
| needs_resort = True | ||
| break | ||
| if needs_resort: | ||
| unique_text_metadata_elements.sort( | ||
| key=lambda element: element.start | ||
| ) |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The addition of the outer loop with needs_resort flag significantly changes the algorithm behavior and reduces performance. The original algorithm correctly handles overlaps by staying at the same index after adjusting boundaries, then re-checking the pair in the next iteration. This is efficient - only the affected pair is re-checked.
The new algorithm breaks after each adjustment, re-sorts the entire list O(n log n), and restarts from index 0, re-checking all pairs. This is less efficient because:
- Multiple sorts instead of one initial sort
- Re-checking previously resolved pairs
- The original algorithm already correctly resolves all overlaps without re-sorting
This change contradicts the PR's stated goal of performance optimization. The original single-pass algorithm with local re-checking should be preserved.
| @staticmethod | ||
| def __is_result_conflicted_with_other_elements(other_elements, result): | ||
| # Use generator expression for short-circuit evaluation | ||
| return any( | ||
| [result.has_conflict(other_element) for other_element in other_elements] | ||
| result.has_conflict(other_element) for other_element in other_elements | ||
| ) |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method __is_result_conflicted_with_other_elements is no longer called anywhere in the code after the refactoring. The functionality has been inlined at line 187. This method should be removed to avoid maintaining dead code.
| return [] | ||
|
|
||
| # Pre-compile regex pattern for efficiency | ||
| whitespace_pattern = re.compile(r"^( )+$") |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The regex pattern r"^( )+$" contains unnecessary parentheses around the space character. This can be simplified to r"^ +$" for better readability without changing functionality.
| whitespace_pattern = re.compile(r"^( )+$") | |
| whitespace_pattern = re.compile(r"^ +$") |
| return [] | ||
|
|
||
| # Pre-compile regex pattern for efficiency | ||
| whitespace_pattern = re.compile(r"^( )+$") |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The regex pattern is compiled inside the method, which means it gets recompiled on every call to _merge_entities_with_whitespace_between. For better performance, consider compiling the pattern once at the class or module level (e.g., as a class constant: WHITESPACE_PATTERN = re.compile(r"^ +$")).
Change Description
Issue reference
Fixes #XX
Checklist