Skip to content

Conversation

@Ray0907
Copy link

@Ray0907 Ray0907 commented Nov 29, 2025

Change Description

  • Replace O(n) list.remove() with O(1) set operations for tracking active indices
  • Use pop() instead of remove() in whitespace merge since prev_result is always last
  • Convert list comprehension to generator expression for short-circuit evaluation
  • Pre-compile regex pattern to avoid repeated compilation

Issue reference

Fixes #XX

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

  - Replace O(n) list.remove() with O(1) set operations for tracking active
  indices
  - Use pop() instead of remove() in whitespace merge since prev_result is
  always last
  - Convert list comprehension to generator expression for short-circuit
  evaluation
  - Pre-compile regex pattern to avoid repeated compilation
@Ray0907
Copy link
Author

Ray0907 commented Nov 29, 2025

@microsoft-github-policy-service agree

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to optimize the conflict resolution algorithm in AnonymizerEngine by replacing O(n) list operations with O(1) set operations and making other performance improvements.

Changes:

  • Replace list.remove() with set-based active_indices tracking in merge and conflict resolution phases
  • Add outer loop with re-sorting to REMOVE_INTERSECTIONS algorithm
  • Optimize whitespace merging with pre-compiled regex and pop() instead of remove()
  • Convert list comprehension to generator expression for short-circuit evaluation

Comment on lines +204 to 225
# Process until no more overlaps require re-sorting
needs_resort = True
while needs_resort:
needs_resort = False
index = 0
elements_length = len(unique_text_metadata_elements)
while index < elements_length - 1:
current_entity = unique_text_metadata_elements[index]
next_entity = unique_text_metadata_elements[index + 1]
if current_entity.end <= next_entity.start:
index += 1
else:
current_entity.end = next_entity.start
if current_entity.score >= next_entity.score:
next_entity.start = current_entity.end
else:
current_entity.end = next_entity.start
needs_resort = True
break
if needs_resort:
unique_text_metadata_elements.sort(
key=lambda element: element.start
)
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The addition of the outer loop with needs_resort flag significantly changes the algorithm behavior and reduces performance. The original algorithm correctly handles overlaps by staying at the same index after adjusting boundaries, then re-checking the pair in the next iteration. This is efficient - only the affected pair is re-checked.

The new algorithm breaks after each adjustment, re-sorts the entire list O(n log n), and restarts from index 0, re-checking all pairs. This is less efficient because:

  1. Multiple sorts instead of one initial sort
  2. Re-checking previously resolved pairs
  3. The original algorithm already correctly resolves all overlaps without re-sorting

This change contradicts the PR's stated goal of performance optimization. The original single-pass algorithm with local re-checking should be preserved.

Copilot uses AI. Check for mistakes.
Comment on lines 262 to 267
@staticmethod
def __is_result_conflicted_with_other_elements(other_elements, result):
# Use generator expression for short-circuit evaluation
return any(
[result.has_conflict(other_element) for other_element in other_elements]
result.has_conflict(other_element) for other_element in other_elements
)
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method __is_result_conflicted_with_other_elements is no longer called anywhere in the code after the refactoring. The functionality has been inlined at line 187. This method should be removed to avoid maintaining dead code.

Copilot uses AI. Check for mistakes.
return []

# Pre-compile regex pattern for efficiency
whitespace_pattern = re.compile(r"^( )+$")
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex pattern r"^( )+$" contains unnecessary parentheses around the space character. This can be simplified to r"^ +$" for better readability without changing functionality.

Suggested change
whitespace_pattern = re.compile(r"^( )+$")
whitespace_pattern = re.compile(r"^ +$")

Copilot uses AI. Check for mistakes.
return []

# Pre-compile regex pattern for efficiency
whitespace_pattern = re.compile(r"^( )+$")
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex pattern is compiled inside the method, which means it gets recompiled on every call to _merge_entities_with_whitespace_between. For better performance, consider compiling the pattern once at the class or module level (e.g., as a class constant: WHITESPACE_PATTERN = re.compile(r"^ +$")).

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants