feat(ray): Implement dynamic scale-in for RaySwordfishActor
This commit implements the dynamic scaling down (scale-in) functionality for RaySwordfishActor to release idle resources.
Key changes:
- Implement `retire_idle_ray_workers` in `RayWorkerManager` to identify and release idle workers.
- Add `pending_release_blacklist` to track retiring workers and prevent them from being reused or causing "worker died" errors.
- Move scale-down cooldown logic to `RayWorkerManager` to prevent frequent scale-down operations.
- Optimize `retire_idle_ray_workers` to reduce lock contention by releasing the lock before performing Ray/Python operations.
- Update `try_autoscale` in `flotilla.py` to support empty resource requests, enabling Ray to scale down resources.
- Fix unit tests in `src/daft-distributed/src/scheduling/worker.rs` and ensure compatibility with the scheduler loop.
This addresses the issue where `udfActor` could not dynamically scale down and prevents "worker died" errors during graceful shutdown.
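
For reviewers, the retirement flow described above can be sketched roughly as follows. This is a simplified Python illustration, not the code in this commit: the real `RayWorkerManager` differs, and `WorkerManagerSketch`, `release_fn`, `cooldown_s`, `schedulable_workers`, and `on_worker_exit` are hypothetical names introduced only for this sketch. It shows three of the ideas from the list: a cooldown gate, a pending-release blacklist, and releasing the lock before the slow release call.

```python
import threading
import time


class WorkerManagerSketch:
    """Hypothetical stand-in for RayWorkerManager; all names are illustrative."""

    def __init__(self, cooldown_s, release_fn, clock=time.monotonic):
        self._lock = threading.Lock()
        self._workers = {}              # worker_id -> number of in-flight tasks
        self._pending_release = set()   # blacklist of workers being retired
        self._last_scale_down = None    # time of the last scale-down, if any
        self._cooldown_s = cooldown_s
        self._release_fn = release_fn   # assumed slow call (e.g. into Ray/Python)
        self._clock = clock

    def add_worker(self, worker_id, active_tasks=0):
        with self._lock:
            self._workers[worker_id] = active_tasks

    def schedulable_workers(self):
        # Retiring workers are hidden from the scheduler so they are not reused.
        with self._lock:
            return sorted(w for w in self._workers if w not in self._pending_release)

    def retire_idle_workers(self):
        now = self._clock()
        with self._lock:
            # Cooldown gate: skip if a scale-down happened too recently.
            if (self._last_scale_down is not None
                    and now - self._last_scale_down < self._cooldown_s):
                return []
            idle = [w for w, n in self._workers.items()
                    if n == 0 and w not in self._pending_release]
            if not idle:
                return []
            self._pending_release.update(idle)
            self._last_scale_down = now
        # The lock is released here, before the potentially slow release calls.
        for w in idle:
            self._release_fn(w)
        return idle

    def on_worker_exit(self, worker_id):
        # An exit of a blacklisted worker is expected, not a "worker died" error.
        with self._lock:
            self._workers.pop(worker_id, None)
            if worker_id in self._pending_release:
                self._pending_release.discard(worker_id)
                return "retired"
            return "died"
```

Marking workers as pending while the lock is held, then calling `release_fn` outside it, keeps the critical section short while still ensuring a retiring worker is never handed new work.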