Commits
Fix upload task resilience to pod failures
**Problem:**
Upload task 15fa00a1-b967-4ab2-9199-543d7e41e89d experienced a 6-hour delay
when a pod failure occurred. The retry was delivered to a worker pod that died
before processing it, so the task was lost until the visibility timeout expired.
**Root Causes:**
1. The `retries == 0` condition meant the "currently processing" check only ran on the first attempt (see the sketch after this list)
2. The 6-hour visibility timeout meant a lost task took up to 6 hours to recover
3. Tasks weren't automatically returned to the queue when workers died
4. No metrics to monitor stuck or repeatedly retried uploads
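For illustration, a minimal sketch of the flawed guard described in root cause 1; the helper name and shape are hypothetical, not the real upload task code:
```python
# Minimal sketch of root cause 1 -- hypothetical helper, not the real task code.
def should_defer_to_processor(retries: int, currently_processing: bool) -> bool:
    # Bug: the "is currently processing" check only ran on the first attempt
    # (retries == 0); later retries skipped it entirely.
    return retries == 0 and currently_processing
```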
**Changes:**
1. **Removed `retries == 0` condition** (apps/worker/tasks/upload.py)
- Now EVERY retry checks whether the upload is currently processing (see the sketch after this list)
- Added max retry limit of 10 to prevent infinite loops
- Logs include retry count for debugging
2. **Reduced visibility timeout** (libs/shared/shared/celery_config.py)
- Changed from 6 hours to 15 minutes
- Lost tasks are now retried in 15 minutes instead of 6 hours
3. **Added task_reject_on_worker_lost** (libs/shared/shared/celery_config.py)
- When pods die, tasks are immediately returned to the queue
- Another worker picks up the task within seconds
4. **Added monitoring metrics** (apps/worker/tasks/upload.py)
- UPLOAD_TASK_PROCESSING_RETRY_COUNTER: tracks retry patterns
- UPLOAD_TASK_TOO_MANY_RETRIES_COUNTER: tracks abandoned uploads
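A sketch of what changes 2 and 3 amount to in Celery terms. `visibility_timeout` and `task_reject_on_worker_lost` are standard Celery settings; exactly how celery_config.py lays them out, and whether late acknowledgements are configured there, are assumptions:
```python
# Sketch of the broker/worker settings behind changes 2 and 3 (layout assumed).
broker_transport_options = {
    # Change 2: a delivered-but-unacked task (e.g. its pod died) is redelivered
    # after 15 minutes instead of 6 hours.
    "visibility_timeout": 900,  # seconds; previously 21600
}

# Change 3: only meaningful together with late acks (assumed enabled); a task
# whose worker process is killed is requeued instead of acked and lost.
task_acks_late = True
task_reject_on_worker_lost = True
```
And a rough sketch of changes 1 and 4 inside the task, assuming a helper of roughly this shape; the real run_impl is more involved, and the Prometheus metric strings and countdown policy are assumptions:
```python
import logging

from celery import Task
from prometheus_client import Counter

log = logging.getLogger(__name__)

# Change 4: counters referenced above; the string metric names are assumed.
UPLOAD_TASK_PROCESSING_RETRY_COUNTER = Counter(
    "upload_task_processing_retry_total",
    "Upload task retried while processing was in flight",
)
UPLOAD_TASK_TOO_MANY_RETRIES_COUNTER = Counter(
    "upload_task_too_many_retries_total",
    "Upload task abandoned after hitting the retry limit",
)

MAX_PROCESSING_RETRIES = 10  # Change 1: hard cap so a stuck lock cannot retry forever


def defer_if_processing(task: Task, is_processing: bool, repoid: int, commitid: str):
    """Change 1: runs on EVERY attempt, not only when task.request.retries == 0."""
    if not is_processing:
        return None  # safe to proceed with upload setup
    retries = task.request.retries
    if retries >= MAX_PROCESSING_RETRIES:
        UPLOAD_TASK_TOO_MANY_RETRIES_COUNTER.inc()
        log.error(
            "Upload still processing after max retries; giving up",
            extra={"repoid": repoid, "commitid": commitid, "retries": retries},
        )
        return {"reason": "too_many_processing_retries"}
    UPLOAD_TASK_PROCESSING_RETRY_COUNTER.inc()
    log.info(
        "Upload currently being processed; retrying",
        extra={"repoid": repoid, "commitid": commitid, "retries": retries},
    )
    task.retry(countdown=60 * (retries + 1))  # raises celery.exceptions.Retry
```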
**Protection Against Double Processing:**
Multiple layers prevent race conditions (a condensed sketch follows the list):
- upload_processing_lock check detects if processor is running
- upload_lock (Redis lock) serializes Upload task execution
- has_pending_jobs() recheck inside lock verifies jobs still exist
- Atomic lpop ensures same job isn't processed twice
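A condensed sketch of how those layers might fit together; the Redis key formats and helper structure are assumptions, and the real task interleaves these checks with far more logic:
```python
# Condensed sketch of the layered double-processing protection (key formats assumed).
from typing import Callable

import redis


def process_pending_uploads(
    client: redis.Redis,
    repoid: int,
    commitid: str,
    report_type: str,
    do_process: Callable[[bytes], None],
) -> str:
    uploads_key = f"uploads/{repoid}/{commitid}/{report_type}"  # assumed format
    upload_lock_key = f"upload_lock/{repoid}/{commitid}"        # assumed format

    # Layer 1: if a processor task already holds the processing lock, back off.
    if client.get(f"upload_processing_lock/{repoid}/{commitid}"):
        return "currently_processing"

    # Layer 2: a Redis lock serializes Upload task execution for this commit.
    with client.lock(upload_lock_key, timeout=300, blocking_timeout=5):
        # Layer 3: recheck pending jobs inside the lock -- another task may have
        # drained the list between the first check and lock acquisition.
        if client.llen(uploads_key) == 0:
            return "no_pending_jobs"
        # Layer 4: each lpop is atomic, so two workers can never pop the same job.
        while (job := client.lpop(uploads_key)) is not None:
            do_process(job)
    return "processed"
```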
**Impact:**
- Pod failures now recover in seconds/minutes instead of 6 hours
- Tasks survive worker crashes without data loss
- Better observability via metrics
- Fully backward compatible
See UPLOAD_TASK_RESILIENCE_CHANGES.md for complete analysis and testing plan.
1 day ago by drazisil-codecov
Add Dead Letter Queue to prevent data loss
**Issue:** An AI agent identified that uploads hitting the 10-retry limit would be
silently dropped without a recovery mechanism, potentially causing data loss.
**Root Cause:** When the task gave up after 10 retries, the uploads remained in Redis
but were never processed. They would either:
- Wait for another upload task (unlikely if the processing lock was stuck), or
- Expire after 24 hours (data loss)
**Solution:** Implemented a Dead Letter Queue (DLQ); a sketch follows the list below.
1. **Atomic move to DLQ**: when hitting the retry limit, lpop all pending uploads
and push them to the DLQ key: upload_dlq/{repoid}/{commitid}/{report_type}
2. **7-day retention**: DLQ entries expire after 7 days, giving the team time to
inspect and recover
3. **Monitoring**: Added UPLOAD_TASK_DLQ_COUNTER metric to track entries
4. **Logging**: Error logs include DLQ key and upload count for debugging
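Roughly, the move to the DLQ could be shaped like this. The DLQ key format and counter variable come from this commit message; the source list key format and the Prometheus metric string are assumptions:
```python
import logging

import redis
from prometheus_client import Counter

log = logging.getLogger(__name__)

# Counter referenced in point 3 above; the string metric name is assumed.
UPLOAD_TASK_DLQ_COUNTER = Counter(
    "upload_task_dlq_total", "Pending uploads moved to the dead letter queue"
)

DLQ_TTL_SECONDS = 7 * 24 * 60 * 60  # point 2: 7-day retention


def move_pending_uploads_to_dlq(
    client: redis.Redis, repoid: int, commitid: str, report_type: str
) -> int:
    """Sketch: drain the pending-upload list into the DLQ when retries are exhausted."""
    source_key = f"uploads/{repoid}/{commitid}/{report_type}"  # assumed format
    dlq_key = f"upload_dlq/{repoid}/{commitid}/{report_type}"  # from the commit message

    moved = 0
    # Each lpop is atomic, so a concurrent task can never move the same upload twice.
    while (upload := client.lpop(source_key)) is not None:
        client.rpush(dlq_key, upload)
        moved += 1

    if moved:
        client.expire(dlq_key, DLQ_TTL_SECONDS)
        UPLOAD_TASK_DLQ_COUNTER.inc(moved)
        log.error(
            "Hit upload retry limit; uploads moved to DLQ for inspection",
            extra={"dlq_key": dlq_key, "upload_count": moved},
        )
    return moved
```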
**Benefits:**
- Prevents silent data loss
- Enables manual recovery/inspection
- Visibility via metrics and alerts
- Future: Can add automated recovery script
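A sketch of what that future recovery script might look like, entirely hypothetical: it assumes the DLQ key format above and that pushing entries back onto an `uploads/...`-style list is enough for the next Upload task to pick them up.
```python
# Hypothetical recovery sketch -- not part of this change.
import redis


def recover_dlq_entries(client: redis.Redis, dry_run: bool = True) -> int:
    """Requeue DLQ entries onto their (assumed) original pending-upload lists."""
    recovered = 0
    # scan_iter avoids blocking Redis the way KEYS would on a large keyspace.
    for raw_key in client.scan_iter(match="upload_dlq/*"):
        dlq_key = raw_key.decode() if isinstance(raw_key, bytes) else raw_key
        target_key = dlq_key.replace("upload_dlq/", "uploads/", 1)  # assumed target format
        if dry_run:
            count = client.llen(dlq_key)
            print(f"would requeue {count} uploads from {dlq_key} to {target_key}")
            recovered += count
            continue
        while (entry := client.lpop(dlq_key)) is not None:
            client.rpush(target_key, entry)
            recovered += 1
    return recovered
```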
**Testing:**
- No linter errors
- Backward compatible (DLQ is additive)
- Metric increments properly
- Redis TTL prevents DLQ buildup
This addresses the valid concern raised by the AI agent and ensures no uploads
are lost, even in worst-case scenarios.
1 day ago by drazisil-codecov
Create settings.json
23 hours ago by drazisil-codecov
fix: update tests for corrected retry behavior and new visibility timeout
1. Update test_run_impl_currently_processing_second_retry:
- Test now verifies processing check runs on ANY retry (not just first)
- Task should raise Retry exception when processing is ongoing
- This properly tests the fix for the retries==0 bug
2. Update test_celery_config:
- Update expected visibility_timeout: 21600 (6h) → 900 (15m)
- Add comment explaining the change for pod failure recovery
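For instance, the updated expectation for item 2 might read roughly like this, assuming the shared package exposes a config class along the lines of `BaseCeleryConfig`; the real test_celery_config checks many more settings:
```python
# Sketch of the updated visibility-timeout expectation; names are assumptions.
from shared.celery_config import BaseCeleryConfig


def test_visibility_timeout_supports_fast_pod_recovery():
    # 900s (15m), down from 21600s (6h), so a task lost with a dead pod is
    # redelivered quickly instead of sitting invisible for hours.
    assert BaseCeleryConfig.broker_transport_options["visibility_timeout"] == 900
```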
Both changes align with the resilience improvements in the main PR.
23 hours ago by drazisil-codecov
test: add coverage for TOO_MANY_RETRIES checkpoint logging
Add test_run_impl_too_many_retries_logs_checkpoint to verify that
UploadFlow.TOO_MANY_RETRIES checkpoint is logged when a task hits
the max retry limit (10 retries) while processing is ongoing.
This addresses missing test coverage identified by code review for
the new retry limit behavior added in the resilience improvements.
The test verifies:
- maybe_log_upload_checkpoint is called with TOO_MANY_RETRIES
- Task returns with reason 'too_many_processing_retries'
- Task doesn't proceed with setup when max retries exceeded
23 hours ago by drazisil-codecov
refactor: remove outdated comment in test_celery_config.py
22 hours ago by drazisil-codecov
test: add comprehensive coverage for safe_retry and retry metrics
Add tests for BaseCodecovTask improvements:
- safe_retry succeeds below max retries
- safe_retry fails at max retries with proper metrics
- safe_retry uses exponential backoff by default
- safe_retry handles MaxRetriesExceededError gracefully
- on_retry tracks retry count in metrics
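For reference, a safe_retry helper with the behavior these tests describe might be shaped roughly like this; it is a sketch against the public Celery Task API with assumed metric names, not the actual BaseCodecovTask code:
```python
import logging

from celery import Task
from celery.exceptions import MaxRetriesExceededError
from prometheus_client import Counter

log = logging.getLogger(__name__)

# Metric names are assumptions; the tests above only require that retries and
# exhausted retry budgets are counted somewhere.
TASK_RETRY_COUNTER = Counter("task_retry_total", "Task retries", ["task"])
TASK_RETRY_EXHAUSTED_COUNTER = Counter(
    "task_retry_exhausted_total", "Tasks that exhausted their retry budget", ["task"]
)


class ResilientTask(Task):  # illustrative stand-in for BaseCodecovTask
    def safe_retry(self, countdown: int | None = None, max_retries: int = 10) -> bool:
        """Retry with exponential backoff by default; return False instead of
        raising when the retry budget is exhausted."""
        retries = self.request.retries
        if countdown is None:
            countdown = min(60 * (2**retries), 3600)  # exponential backoff, capped
        try:
            # Raises celery.exceptions.Retry on success, which Celery handles.
            self.retry(countdown=countdown, max_retries=max_retries)
        except MaxRetriesExceededError:
            TASK_RETRY_EXHAUSTED_COUNTER.labels(task=self.name).inc()
            log.warning(
                "safe_retry exhausted", extra={"task": self.name, "retries": retries}
            )
            return False
        return True  # not reached in practice; retry() raises to hand control back

    def on_retry(self, exc, task_id, args, kwargs, einfo):
        # Celery hook called on every retry; used to track retry counts per task.
        TASK_RETRY_COUNTER.labels(task=self.name).inc()
        return super().on_retry(exc, task_id, args, kwargs, einfo)
```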
These tests ensure full coverage of the new resilience features
added to the base task class that all worker tasks inherit from.
53 minutes ago by drazisil-codecov