Commits
Fix upload task resilience to pod failures
**Problem:**
Upload task 15fa00a1-b967-4ab2-9199-543d7e41e89d experienced a 6-hour delay
when a pod failure occurred. The retry was delivered to a worker pod that died
before processing it, so the task was lost until the visibility timeout expired.
**Root Causes:**
1. The `retries == 0` condition meant the "currently processing" check only ran on the first attempt (see the sketch after this list)
2. The 6-hour visibility timeout meant a lost task took up to 6 hours to recover
3. Tasks weren't automatically returned to the queue when workers died
4. No metrics to monitor stuck or repeatedly retried uploads
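For illustration, a minimal sketch of the flawed guard described in root cause 1; the helper name and shape are hypothetical, not the real upload task code:
```python
# Minimal sketch of root cause 1 -- hypothetical helper, not the real task code.
def should_defer_to_processor(retries: int, currently_processing: bool) -> bool:
    # Bug: the "is currently processing" check only ran on the first attempt
    # (retries == 0); later retries skipped it entirely.
    return retries == 0 and currently_processing
```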
**Changes:**
1. **Removed `retries == 0` condition** (apps/worker/tasks/upload.py)
- Now EVERY retry checks whether the upload is currently processing (see the sketch after this list)
- Added max retry limit of 10 to prevent infinite loops
- Logs include retry count for debugging
2. **Reduced visibility timeout** (libs/shared/shared/celery_config.py)
- Changed from 6 hours to 15 minutes
- Lost tasks are now retried in 15 minutes instead of 6 hours
3. **Added task_reject_on_worker_lost** (libs/shared/shared/celery_config.py)
- When pods die, tasks are immediately returned to the queue
- Another worker picks up the task within seconds
4. **Added monitoring metrics** (apps/worker/tasks/upload.py)
- UPLOAD_TASK_PROCESSING_RETRY_COUNTER: tracks retry patterns
- UPLOAD_TASK_TOO_MANY_RETRIES_COUNTER: tracks abandoned uploads
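A sketch of what changes 2 and 3 amount to in Celery terms. `visibility_timeout` and `task_reject_on_worker_lost` are standard Celery settings; exactly how celery_config.py lays them out, and whether late acknowledgements are configured there, are assumptions:
```python
# Sketch of the broker/worker settings behind changes 2 and 3 (layout assumed).
broker_transport_options = {
    # Change 2: a delivered-but-unacked task (e.g. its pod died) is redelivered
    # after 15 minutes instead of 6 hours.
    "visibility_timeout": 900,  # seconds; previously 21600
}

# Change 3: only meaningful together with late acks (assumed enabled); a task
# whose worker process is killed is requeued instead of acked and lost.
task_acks_late = True
task_reject_on_worker_lost = True
```
And a rough sketch of changes 1 and 4 inside the task, assuming a helper of roughly this shape; the real run_impl is more involved, and the Prometheus metric strings and countdown policy are assumptions:
```python
import logging

from celery import Task
from prometheus_client import Counter

log = logging.getLogger(__name__)

# Change 4: counters referenced above; the string metric names are assumed.
UPLOAD_TASK_PROCESSING_RETRY_COUNTER = Counter(
    "upload_task_processing_retry_total",
    "Upload task retried while processing was in flight",
)
UPLOAD_TASK_TOO_MANY_RETRIES_COUNTER = Counter(
    "upload_task_too_many_retries_total",
    "Upload task abandoned after hitting the retry limit",
)

MAX_PROCESSING_RETRIES = 10  # Change 1: hard cap so a stuck lock cannot retry forever


def defer_if_processing(task: Task, is_processing: bool, repoid: int, commitid: str):
    """Change 1: runs on EVERY attempt, not only when task.request.retries == 0."""
    if not is_processing:
        return None  # safe to proceed with upload setup
    retries = task.request.retries
    if retries >= MAX_PROCESSING_RETRIES:
        UPLOAD_TASK_TOO_MANY_RETRIES_COUNTER.inc()
        log.error(
            "Upload still processing after max retries; giving up",
            extra={"repoid": repoid, "commitid": commitid, "retries": retries},
        )
        return {"reason": "too_many_processing_retries"}
    UPLOAD_TASK_PROCESSING_RETRY_COUNTER.inc()
    log.info(
        "Upload currently being processed; retrying",
        extra={"repoid": repoid, "commitid": commitid, "retries": retries},
    )
    task.retry(countdown=60 * (retries + 1))  # raises celery.exceptions.Retry
```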
**Protection Against Double Processing:**
Multiple layers prevent race conditions (a condensed sketch follows the list):
- upload_processing_lock check detects if processor is running
- upload_lock (Redis lock) serializes Upload task execution
- has_pending_jobs() recheck inside lock verifies jobs still exist
- Atomic lpop ensures same job isn't processed twice
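A condensed sketch of how those layers might fit together; the Redis key formats and helper structure are assumptions, and the real task interleaves these checks with far more logic:
```python
# Condensed sketch of the layered double-processing protection (key formats assumed).
from typing import Callable

import redis


def process_pending_uploads(
    client: redis.Redis,
    repoid: int,
    commitid: str,
    report_type: str,
    do_process: Callable[[bytes], None],
) -> str:
    uploads_key = f"uploads/{repoid}/{commitid}/{report_type}"  # assumed format
    upload_lock_key = f"upload_lock/{repoid}/{commitid}"        # assumed format

    # Layer 1: if a processor task already holds the processing lock, back off.
    if client.get(f"upload_processing_lock/{repoid}/{commitid}"):
        return "currently_processing"

    # Layer 2: a Redis lock serializes Upload task execution for this commit.
    with client.lock(upload_lock_key, timeout=300, blocking_timeout=5):
        # Layer 3: recheck pending jobs inside the lock -- another task may have
        # drained the list between the first check and lock acquisition.
        if client.llen(uploads_key) == 0:
            return "no_pending_jobs"
        # Layer 4: each lpop is atomic, so two workers can never pop the same job.
        while (job := client.lpop(uploads_key)) is not None:
            do_process(job)
    return "processed"
```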
**Impact:**
- Pod failures now recover in seconds/minutes instead of 6 hours
- Tasks survive worker crashes without data loss
- Better observability via metrics
- Fully backward compatible
See UPLOAD_TASK_RESILIENCE_CHANGES.md for complete analysis and testing plan.
1 day ago by drazisil-codecov
Add Dead Letter Queue to prevent data loss
**Issue:** An AI agent identified that uploads hitting the 10-retry limit would be
silently dropped without a recovery mechanism, potentially causing data loss.
**Root Cause:** When the task gave up after 10 retries, the uploads remained in Redis
but were never processed. They would either:
- Wait for another upload task (unlikely if the processing lock was stuck), or
- Expire after 24 hours (data loss)
**Solution:** Implemented a Dead Letter Queue (DLQ); a sketch follows the list below.
1. **Atomic move to DLQ**: when hitting the retry limit, lpop all pending uploads
and push them to the DLQ key: upload_dlq/{repoid}/{commitid}/{report_type}
2. **7-day retention**: DLQ entries expire after 7 days, giving the team time to
inspect and recover
3. **Monitoring**: Added UPLOAD_TASK_DLQ_COUNTER metric to track entries
4. **Logging**: Error logs include DLQ key and upload count for debugging
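Roughly, the move to the DLQ could be shaped like this. The DLQ key format and counter variable come from this commit message; the source list key format and the Prometheus metric string are assumptions:
```python
import logging

import redis
from prometheus_client import Counter

log = logging.getLogger(__name__)

# Counter referenced in point 3 above; the string metric name is assumed.
UPLOAD_TASK_DLQ_COUNTER = Counter(
    "upload_task_dlq_total", "Pending uploads moved to the dead letter queue"
)

DLQ_TTL_SECONDS = 7 * 24 * 60 * 60  # point 2: 7-day retention


def move_pending_uploads_to_dlq(
    client: redis.Redis, repoid: int, commitid: str, report_type: str
) -> int:
    """Sketch: drain the pending-upload list into the DLQ when retries are exhausted."""
    source_key = f"uploads/{repoid}/{commitid}/{report_type}"  # assumed format
    dlq_key = f"upload_dlq/{repoid}/{commitid}/{report_type}"  # from the commit message

    moved = 0
    # Each lpop is atomic, so a concurrent task can never move the same upload twice.
    while (upload := client.lpop(source_key)) is not None:
        client.rpush(dlq_key, upload)
        moved += 1

    if moved:
        client.expire(dlq_key, DLQ_TTL_SECONDS)
        UPLOAD_TASK_DLQ_COUNTER.inc(moved)
        log.error(
            "Hit upload retry limit; uploads moved to DLQ for inspection",
            extra={"dlq_key": dlq_key, "upload_count": moved},
        )
    return moved
```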
**Benefits:**
- Prevents silent data loss
- Enables manual recovery/inspection
- Visibility via metrics and alerts
- Future: Can add automated recovery script
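A sketch of what that future recovery script might look like, entirely hypothetical: it assumes the DLQ key format above and that pushing entries back onto an `uploads/...`-style list is enough for the next Upload task to pick them up.
```python
# Hypothetical recovery sketch -- not part of this change.
import redis


def recover_dlq_entries(client: redis.Redis, dry_run: bool = True) -> int:
    """Requeue DLQ entries onto their (assumed) original pending-upload lists."""
    recovered = 0
    # scan_iter avoids blocking Redis the way KEYS would on a large keyspace.
    for raw_key in client.scan_iter(match="upload_dlq/*"):
        dlq_key = raw_key.decode() if isinstance(raw_key, bytes) else raw_key
        target_key = dlq_key.replace("upload_dlq/", "uploads/", 1)  # assumed target format
        if dry_run:
            count = client.llen(dlq_key)
            print(f"would requeue {count} uploads from {dlq_key} to {target_key}")
            recovered += count
            continue
        while (entry := client.lpop(dlq_key)) is not None:
            client.rpush(target_key, entry)
            recovered += 1
    return recovered
```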
**Testing:**
- No linter errors
- Backward compatible (DLQ is additive)
- Metric increments properly
- Redis TTL prevents DLQ buildup
This addresses the valid concern raised by the AI agent and ensures no uploads
are lost, even in worst-case scenarios.
1 day ago by drazisil-codecov
Create settings.json
23 hours ago by drazisil-codecov
fix: update tests for corrected retry behavior and new visibility timeout
1. Update test_run_impl_currently_processing_second_retry:
- Test now verifies processing check runs on ANY retry (not just first)
- Task should raise Retry exception when processing is ongoing
- This properly tests the fix for the retries==0 bug
2. Update test_celery_config:
- Update expected visibility_timeout: 21600 (6h) → 900 (15m)
- Add comment explaining the change for pod failure recovery
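For instance, the updated expectation for item 2 might read roughly like this, assuming the shared package exposes a config class along the lines of `BaseCeleryConfig`; the real test_celery_config checks many more settings:
```python
# Sketch of the updated visibility-timeout expectation; names are assumptions.
from shared.celery_config import BaseCeleryConfig


def test_visibility_timeout_supports_fast_pod_recovery():
    # 900s (15m), down from 21600s (6h), so a task lost with a dead pod is
    # redelivered quickly instead of sitting invisible for hours.
    assert BaseCeleryConfig.broker_transport_options["visibility_timeout"] == 900
```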
Both changes align with the resilience improvements in the main PR.
23 hours ago by drazisil-codecov
test: add coverage for TOO_MANY_RETRIES checkpoint logging
Add test_run_impl_too_many_retries_logs_checkpoint to verify that
UploadFlow.TOO_MANY_RETRIES checkpoint is logged when a task hits
the max retry limit (10 retries) while processing is ongoing.
This addresses missing test coverage identified by code review for
the new retry limit behavior added in the resilience improvements.
The test verifies:
- maybe_log_upload_checkpoint is called with TOO_MANY_RETRIES
- Task returns with reason 'too_many_processing_retries'
- Task doesn't proceed with setup when max retries exceeded
23 hours ago by drazisil-codecov
refactor: remove outdated comment in test_celery_config.py
22 hours ago by drazisil-codecov
test: add comprehensive coverage for safe_retry and retry metrics
Add tests for BaseCodecovTask improvements:
- safe_retry succeeds below max retries
- safe_retry fails at max retries with proper metrics
- safe_retry uses exponential backoff by default
- safe_retry handles MaxRetriesExceededError gracefully
- on_retry tracks retry count in metrics
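For reference, a safe_retry helper with the behavior these tests describe might be shaped roughly like this; it is a sketch against the public Celery Task API with assumed metric names, not the actual BaseCodecovTask code:
```python
import logging

from celery import Task
from celery.exceptions import MaxRetriesExceededError
from prometheus_client import Counter

log = logging.getLogger(__name__)

# Metric names are assumptions; the tests above only require that retries and
# exhausted retry budgets are counted somewhere.
TASK_RETRY_COUNTER = Counter("task_retry_total", "Task retries", ["task"])
TASK_RETRY_EXHAUSTED_COUNTER = Counter(
    "task_retry_exhausted_total", "Tasks that exhausted their retry budget", ["task"]
)


class ResilientTask(Task):  # illustrative stand-in for BaseCodecovTask
    def safe_retry(self, countdown: int | None = None, max_retries: int = 10) -> bool:
        """Retry with exponential backoff by default; return False instead of
        raising when the retry budget is exhausted."""
        retries = self.request.retries
        if countdown is None:
            countdown = min(60 * (2**retries), 3600)  # exponential backoff, capped
        try:
            # Raises celery.exceptions.Retry on success, which Celery handles.
            self.retry(countdown=countdown, max_retries=max_retries)
        except MaxRetriesExceededError:
            TASK_RETRY_EXHAUSTED_COUNTER.labels(task=self.name).inc()
            log.warning(
                "safe_retry exhausted", extra={"task": self.name, "retries": retries}
            )
            return False
        return True  # not reached in practice; retry() raises to hand control back

    def on_retry(self, exc, task_id, args, kwargs, einfo):
        # Celery hook called on every retry; used to track retry counts per task.
        TASK_RETRY_COUNTER.labels(task=self.name).inc()
        return super().on_retry(exc, task_id, args, kwargs, einfo)
```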
These tests ensure full coverage of the new resilience features
added to the base task class that all worker tasks inherit from.
53 minutes ago by drazisil-codecov