@mgazza (Collaborator) commented Dec 8, 2025

Summary

This PR implements a comprehensive caching optimization for the Octopus Energy API integration, reducing API load by ~99.8% and adding JWT token persistence across pod restarts.

Problem Statement

Current inefficiency: Every PredBat instance (1000+ users) independently:

  1. Fetches the SAME Octopus Agile rates every 30 minutes
  2. Stores duplicate copies in per-user cache files
  3. Makes redundant API calls (500 users on AGILE-24-10-01 = 500 identical API calls)
  4. Loses Kraken auth tokens on pod restart (in-memory only)

Solution: 3-Part Caching Architecture

1. Cache Split Architecture (Commit 1)

Separated user-specific from shared data:

User-specific cache (`/tmp/cache/{user_id}/octopus_user.yaml`):

  • Account agreements (which tariffs apply when)
  • Saving session enrollments
  • Intelligent device settings
  • Kraken authentication token

Shared cache (`/tmp/cache/shared/`):

  • `tariffs/{product_code}_{tariff_code}.yaml` - Tariff rates (one file per tariff)
  • `urls/{sha256_hash}.yaml` - HTTP responses (one file per URL)

Benefits:

  • 99.9% storage reduction (10 tariff files vs 10,000 duplicate entries)
  • 99.8% API call reduction (1 call per tariff vs 500 calls)
  • Instant propagation (Pod A fetches, Pods B/C/D benefit immediately)
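
As a rough sketch of how the split maps onto paths, helpers along these lines would route each data class to its cache file (the helper names are illustrative assumptions; only the directory layout above comes from the PR):

```python
import hashlib
import os

CACHE_ROOT = "/tmp/cache"

def user_cache_path(user_id: str) -> str:
    """Per-user data: agreements, saving sessions, devices, Kraken token."""
    return os.path.join(CACHE_ROOT, user_id, "octopus_user.yaml")

def tariff_cache_path(product_code: str, tariff_code: str) -> str:
    """Shared rates: one file per tariff, reused by every pod and user."""
    return os.path.join(CACHE_ROOT, "shared", "tariffs", f"{product_code}_{tariff_code}.yaml")

def url_cache_path(url: str) -> str:
    """Shared HTTP responses, keyed by a SHA-256 hash of the URL."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(CACHE_ROOT, "shared", "urls", f"{digest}.yaml")
```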

2. Stale-While-Revalidate Pattern (Commit 2)

Problem: Thundering herd when cache expires (all 1000 pods fetch simultaneously)

Solution: 3-tier cache strategy

  • Fresh (< 30 min): Return immediately
  • Stale (30-35 min): Serve stale data while ONE pod refreshes
  • Too stale (> 35 min): Must fetch

Benefits:

  • Only 1 pod fetches, 999 serve stale data
  • No race conditions (atomic file locking)
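
A minimal sketch of the three-tier check, assuming file modification time stands in for the cache timestamp (the function name and mtime-based aging are assumptions, not the exact mechanism in `octopus.py`):

```python
import os
import time

FRESH_SECONDS = 30 * 60      # under 30 min: fresh
MAX_STALE_SECONDS = 35 * 60  # 30-35 min: stale but servable

def classify_cache(path: str) -> str:
    """Classify a cache file as 'fresh', 'stale', or 'expired' by its age."""
    if not os.path.exists(path):
        return "expired"  # no cache yet: caller must fetch
    age = time.time() - os.path.getmtime(path)
    if age < FRESH_SECONDS:
        return "fresh"    # return cached data immediately
    if age < MAX_STALE_SECONDS:
        return "stale"    # serve cached data; one pod refreshes in the background
    return "expired"      # too stale: caller must fetch
```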

3. JWT Token Caching (Commit 3)

Problem: Kraken tokens were lost on pod restart

Solution: JWT-based token persistence with error-driven refresh

Benefits:

  • Token survives pod restarts
  • Accurate expiry tracking (JWT exp field)
  • Automatic recovery from expired tokens
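
The commits below reference `decode_kraken_token_expiry()`; a minimal version of the underlying idea — reading the `exp` claim straight out of the JWT payload, without signature verification — might look like this (the names and the 60-second safety margin are illustrative):

```python
import base64
import json
import time

def decode_jwt_expiry(token: str) -> float | None:
    """Extract the 'exp' claim (a Unix timestamp) from a JWT payload."""
    try:
        payload_b64 = token.split(".")[1]
        payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64url padding
        payload = json.loads(base64.urlsafe_b64decode(payload_b64))
        return float(payload["exp"])
    except (IndexError, KeyError, ValueError):
        return None  # malformed token: treat as expired

def token_is_usable(token: str, margin: float = 60.0) -> bool:
    """Accept a cached token only if it expires more than `margin` seconds from now."""
    exp = decode_jwt_expiry(token)
    return exp is not None and exp - time.time() > margin
```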

Expected Impact

API Load Reduction:

  • Before: 500 users = 500 API calls every 30 min
  • After: 1 API call per tariff every 30 min
  • Reduction: 99.8%

Storage Efficiency:

  • Before: 10,000 duplicate entries
  • After: 10 unique tariff files
  • Reduction: 99.9%

Files Changed

  • `apps/predbat/octopus.py` - Core caching implementation

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

springfall2008 and others added 9 commits November 30, 2025 20:16
Added debug print statements to trace:
- When fetch_octopus_rates is called with entity_id
- Which entity is being queried for rates
- Data import results (type and length)
- Total accumulated rate entries
- Sample rate data structure

These are temporary development logs for troubleshooting tariff
data loading issues. They can be removed or made conditional on the
debug_enable flag.

Also includes gecloud.py cleanup changes from previous work.

Removed 5 debug print statements that were added during Phase 2B
development to trace fetch_octopus_rates() execution. These were
temporary debugging aids used to verify that:

- fetch_octopus_rates was being called correctly
- Entity IDs were properly constructed
- Data was successfully fetched from Supabase
- Rate data had expected structure

Now that the Octopus NATS integration is working correctly, these
debug prints are no longer needed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Implements a cache refresh strategy that prevents all pods from simultaneously
fetching from the Octopus API when the cache expires.

## Problem
When the cache expires at exactly the same time for all 1000 pods:
- 4:30:00.000 - Cache expires for ALL pods
- 4:30:00.001-4:30:00.500 - All 1000 pods fetch from the Octopus API within half a second
- Result: Thundering herd 💥 overwhelming the Octopus API

## Solution: Stale-While-Revalidate

Implements a three-tier cache strategy:
1. **Fresh (< 30 min)**: Return cached data immediately
2. **Stale (30-35 min)**: Serve stale data while ONE pod refreshes
3. **Too stale (> 35 min)**: Must fetch fresh data

## How It Works

When cache is 30-35 minutes old:
- First pod to check: Acquires atomic file lock, refreshes cache
- Other pods: See lock exists, serve 5-min-stale data (acceptable for tariff rates)
- No blocking: All pods return immediately
- Eventually consistent: Fresh data available within seconds

## Lock Implementation

Uses atomic file creation with O_CREAT | O_EXCL flags:
- Non-blocking: Failed acquisition means another pod is refreshing
- Automatic cleanup: Lock file removed after refresh
- No deadlock risk: Lock holder always completes and removes lock
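
A sketch of that lock, assuming a plain lock-file path (names are illustrative):

```python
import os

def try_acquire_refresh_lock(lock_path: str) -> bool:
    """Non-blocking lock via atomic file creation.

    O_CREAT | O_EXCL guarantees exactly one pod can create the file;
    every other pod fails instantly and serves stale data instead.
    """
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False  # another pod is already refreshing

def release_refresh_lock(lock_path: str) -> None:
    """Remove the lock file once the refresh completes."""
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        pass
```

The acquiring pod would refresh the cache and call `release_refresh_lock()` in a `finally` block, so the lock is removed even if the refresh raises.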

## Expected Impact

Before:
- 1000 pods × cache expiry = 1000 simultaneous API calls
- Octopus API rate limiting and potential failures

After:
- 1 pod fetches, 999 pods serve stale data
- 99.9% reduction in API calls during cache expiry
- 5-minute staleness is acceptable for tariff optimization

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

- Add decode_kraken_token_expiry() to extract expiry from JWT payload
- Update async_refresh_token() to use JWT expiry instead of hardcoded 1-hour
- Save/load Kraken token in per-user cache (octopus_user.yaml)
- Add error-driven token refresh on auth errors (KT-CT-1139, KT-CT-1111, KT-CT-1143)
- Auto-retry GraphQL queries once on authentication failure

Benefits:
- Token survives pod restarts (loaded from cache)
- Accurate expiry tracking (directly from JWT exp field)
- Automatic recovery from expired tokens
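
A hedged sketch of the error-driven refresh with a single retry; the `client` wrapper, its `execute()`/`refresh_token()` methods, and the `extensions.errorCode` field layout are assumptions about the GraphQL error shape, while the error codes themselves come from the commit above:

```python
AUTH_ERROR_CODES = {"KT-CT-1139", "KT-CT-1111", "KT-CT-1143"}

async def graphql_with_retry(client, query: str, variables: dict) -> dict:
    """Run a GraphQL query; on a Kraken auth error, refresh the token and retry once."""
    result = await client.execute(query, variables)
    errors = result.get("errors") or []
    codes = {e.get("extensions", {}).get("errorCode") for e in errors}
    if codes & AUTH_ERROR_CODES:
        await client.refresh_token()                     # error-driven refresh
        result = await client.execute(query, variables)  # retry exactly once
    return result
```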
"args": {
"ge_cloud_direct": {
"required_true": True,
"required": True,
springfall2008 (Owner) commented:

Probably should still be `required_true`

mgazza (Collaborator, Author) replied:

yeah agreed, ignore these, they need reverting!

"name": "GivEnergy Cloud Data",
"args": {
"ge_cloud_data": {
"required_true": True,
springfall2008 (Owner) commented:

Also here

mgazza (Collaborator, Author) replied:

yeah agreed, ignore these :D

@springfall2008 (Owner) commented:

The octopus change was re-implemented on main
