[core][autoscaler] Fix incorrectly terminating nodes misclassified as idle in autoscaler v1 #48519
Conversation
I will review this PR today.
horizon = now - (60 * self.config["idle_timeout_minutes"])

# local import to avoid circular dependencies
from ray.autoscaler.v2.sdk import get_cluster_resource_state
Autoscaler v1 relying on Autoscaler v2's functions feels hacky to me. We should avoid that.
Hi, here are my investigations:
The image shows that the idle time from 1. is always 0, while the idle time from 2. is correct.
From the investigations above, we now have two choices:
IMO, 1. would be the best practice, but I guess it will take a certain amount of time; 2. would be the quicker fix. @kevin85421 What do you think?
Let me discuss this with my colleagues. Proceeding with option (2) first, and then having me take it over, is also an option.
@kevin85421 Option 2 is done; the test error doesn't seem to be related to my changes.
cc @rickyyx
Niceeeee! Option 2 is a fine approach to me: while GetClusterResourceState is used by Autoscaler V2, it should also work when V2 is turned off, so having the V1 autoscaler poll that endpoint is fine.
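As a rough sketch of that approach (the function signature and proto field names below are assumptions to be checked against `ray.autoscaler.v2.sdk` and the autoscaler protos, not a quote of the PR's code):

```
from ray.autoscaler.v2.sdk import get_cluster_resource_state


def get_idle_duration_ms_by_node_id(gcs_client):
    # Poll the same GetClusterResourceState endpoint that autoscaler v2 uses;
    # per the discussion above, GCS serves it even when v2 is turned off.
    cluster_resource_state = get_cluster_resource_state(gcs_client)
    return {
        node_state.node_id: node_state.idle_duration_ms
        for node_state in cluster_resource_state.node_states
    }
```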
Would you mind adding some tests since this PR is not only for KubeRay Autoscaler?
ray_nodes_idle_duration_ms_by_ip = (
    self.load_metrics.ray_nodes_idle_duration_ms_by_ip
)
now = time.time()
Do we need to reset `now`, or is it OK to use the `now` arg?
Yes, we can.
@@ -238,13 +239,32 @@ def get_latest_readonly_config():
        prom_metrics=self.prom_metrics,
    )

    def get_cluster_resource_state(self):
If a function is only called by other member functions within the same class, we typically prefix the function name with an underscore `_`.

Suggested change:
- def get_cluster_resource_state(self):
+ def _get_cluster_resource_state(self):
We decided to work around this issue by using the Autoscaler V2 API. Do we still need to define `get_cluster_resource_state` in this file, or can we directly import it from `v2/sdk.py`?
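One way to keep that dependency soft is a function-local import, as the diff earlier in this thread does; the wrapper below is only a sketch (the `gcs_client` attribute and constructor wiring are assumptions):

```
class Monitor:
    def __init__(self, gcs_client):
        self.gcs_client = gcs_client

    def _get_cluster_resource_state(self):
        # Imported inside the method so autoscaler v1 does not depend on the
        # v2 SDK at module load time (avoids the circular import).
        from ray.autoscaler.v2.sdk import get_cluster_resource_state

        return get_cluster_resource_state(self.gcs_client)
```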
)
now = time.time()
last_used = {
    ip: now - duration
In `load_metrics.py`, the type hint of `idle_duration_ms` is `int`. Can we directly subtract `idle_duration_ms` from `time.time()`?
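For reference, a small worked example of the unit handling being discussed (the input values are illustrative): `idle_duration_ms` is an integer number of milliseconds, so it is divided by 1000 before being subtracted from `time.time()`, which is in seconds.

```
import time

idle_duration_ms_by_ip = {"10.0.0.1": 125_000, "10.0.0.2": 0}  # example input

now = time.time()
last_used = {
    # int milliseconds / 1000 -> seconds, so the subtraction stays in seconds
    ip: now - idle_duration_ms / 1000
    for ip, idle_duration_ms in idle_duration_ms_by_ip.items()
}
```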
@@ -97,10 +97,12 @@ def update(
        infeasible_bundles: List[Dict[str, float]] = None,
        pending_placement_groups: List[PlacementGroupTableData] = None,
        cluster_full_of_actors_detected: bool = False,
        node_last_used_time_s: float = time.time(),
What's the goal of setting this as the default value? It is a bit weird to me: `time.time()` will be called only once, when the function is defined. For example, the following program prints the same timestamp twice.

import time

def f(t=time.time()):
    print(t)

f()
time.sleep(5)
f()

Maybe use `node_last_used_time_s: Optional[float] = None` instead?
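A short sketch of the suggested `Optional[float] = None` pattern (a standalone function for illustration, not the actual `update()` signature): the timestamp is resolved on every call rather than once at definition time.

```
import time
from typing import Optional


def update(node_last_used_time_s: Optional[float] = None) -> float:
    # Evaluated per call, unlike a `= time.time()` default, which is
    # evaluated once when the function object is created.
    if node_last_used_time_s is None:
        node_last_used_time_s = time.time()
    return node_last_used_time_s
```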
Good catch!
):
        self.static_resources_by_ip[ip] = static_resources
        self.raylet_id_by_ip[ip] = raylet_id
        self.cluster_full_of_actors_detected = cluster_full_of_actors_detected
        self.ray_nodes_last_used_time_by_ip[ip] = node_last_used_time_s
In which cases will `node_last_used_time_s` not be set? If `node_last_used_time_s` is `None`, an error may occur when performing time (float) - None.
Check added on be86afd
By the way, would you mind opening an issue in the KubeRay repo to track the progress of adding an end-to-end test for this PR?
Issue opened: ray-project/kuberay#2568
@@ -97,6 +97,7 @@ def update(
        infeasible_bundles: List[Dict[str, float]] = None,
        pending_placement_groups: List[PlacementGroupTableData] = None,
        cluster_full_of_actors_detected: bool = False,
        node_last_used_time_s: Optional[float] = None,
In which case will this not be set and the default value be used? How about making the field required instead of optional?
In the testing code, `load_metrics.update()` is called manually, so it's more convenient to have `last_used_time` default to now.
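Purely as an illustration of the two call styles (a trimmed, hypothetical version of the signature; the real `LoadMetrics.update()` takes many more arguments):

```
import time
from typing import Optional

ray_nodes_last_used_time_by_ip = {}


def update(ip: str, node_last_used_time_s: Optional[float] = None) -> None:
    now = time.time()
    ray_nodes_last_used_time_by_ip[ip] = (
        node_last_used_time_s if node_last_used_time_s is not None else now
    )


update("10.0.0.1")  # test-style call: falls back to "now"
update("10.0.0.2", node_last_used_time_s=time.time() - 30)  # raylet-reported value
```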
How about making the field required instead? The current implementation makes `ray_nodes_last_used_time_by_ip` have two different definitions.
):
        self.last_used_time_by_ip[ip] = now
        self.ray_nodes_last_used_time_by_ip[ip] = (
            node_last_used_time_s if node_last_used_time_s else now
When checking whether a variable is `None`, it's better to use `is`.
node_last_used_time_s if node_last_used_time_s is not None else now
@@ -318,6 +337,9 @@ def update_load_metrics(self):
            infeasible_bundles,
            pending_placement_groups,
            cluster_full,
            time.time()
nit: make it into a single line or two lines?
time.time() - idle_duration_ms / 1000, # node_last_used_time_s = now - idle_duration
if node_id in ray_nodes_idle_duration_ms_by_id:
    idle_duration_ms = ray_nodes_idle_duration_ms_by_id[node_id]
else:
    logger.warning(
In which cases will this condition occur?
In theory, `get_all_resource_usage` and `get_cluster_resource_state` should return the same set of nodes. However, since they are implemented in Autoscaler v1 and v2 respectively, I'm uncertain whether there may be discrepancies between them in some special cases.
We may need to understand how severe the inconsistency is to determine whether this warrants a warning or a panic. @rickyyx Would you mind providing some insights?
Both seem to be getting data from GCS, which should be mostly consistent, I believe, but yeah, the code paths that generate the data are different -> I think graceful handling with warnings is fine (they should eventually be consistent).

BTW, is there any reason we don't rely mainly on v2's info from `get_cluster_resource_state` here entirely? I think the only v1 bit of info that's missing in v2 is "cluster_full", which we could probably also add to v2.

I don't have a strong opinion between the options below, given the source of info for both is GCS, and the likelihood of inconsistency is low IMO:

- use mostly v1's info, and patch with v2's idle info
- use mostly v2's info, and patch with v1's cluster_full or other missing ones in v2.

I think merging this PR to fix the idle issue is fine - and we could follow up to bring v2's RPC into parity with V1 so we could use V2 entirely here. Or if we are pushing V2 really hard, we might just deprecate V1 in the future.
> - use mostly v1's info, and patch with v2's idle info
> - use mostly v2's info, and patch with v1's cluster_full or other missing ones in v2.

How about we go with option 1 so that we can reduce the dependency between V1 and V2?
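A sketch of option 1 (hypothetical helper names; the real change lives in the monitor's `update_load_metrics`): keep v1's resource report as the primary source and only patch in v2's idle durations, warning when a node is missing from the v2 reply, as discussed above.

```
import logging
import time

logger = logging.getLogger(__name__)


def patch_last_used_times(v1_node_ids, idle_duration_ms_by_node_id):
    """Return node_id -> last-used timestamp (s), merging v2 idle info into v1's view."""
    now = time.time()
    last_used_time_s = {}
    for node_id in v1_node_ids:
        if node_id in idle_duration_ms_by_node_id:
            idle_duration_ms = idle_duration_ms_by_node_id[node_id]
        else:
            # Both views are derived from GCS, so a mismatch should be rare and
            # transient; degrade gracefully instead of panicking.
            logger.warning("Node %s missing from GetClusterResourceState reply.", node_id)
            idle_duration_ms = 0
        last_used_time_s[node_id] = now - idle_duration_ms / 1000
    return last_used_time_s
```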
assert autoscaler.resource_demand_scheduler.node_types["worker"][
    "resources"
] == {"CPU": 1}
# def _aggressiveAutoscalingHelper(self, foreground_node_launcher: bool = False):
This function is only used by the commented-out function above, so I commented it out as well.
@rickyyx, why are `testAggressiveAutoscaling` and `testAggressiveAutoscalingWithForegroundLauncher` commented out? Should we remove `_aggressiveAutoscalingHelper` if it is not used?
Probably irrelevant to this PR; we could leave it as it is and open another PR to either clean up this dead code or re-enable the test (if possible).
In that case, this PR is ready for merge.
@@ -320,6 +320,8 @@ def update_nodes(self):
SMALL_CLUSTER, **{"available_node_types": TYPES_A, "head_node_type": "empty_node"}
)

DUMMY_IDLE_DURATION_S = 3
Can you add some comments for `DUMMY_IDLE_DURATION_S`? For example, explain when `DUMMY_IDLE_DURATION_S` should be used (e.g., when static resources (total resources) are not equal to dynamic resources (available resources)) and that `DUMMY_IDLE_DURATION_S` should not trigger a scale-down? Thanks!
Done: 1e424ce
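For readers without the commit handy, the requested comment might look roughly like this (the wording below is illustrative, not necessarily what 1e424ce adds):

```
# Idle duration (in seconds) to report for fake nodes in these tests, e.g. when
# a node's static (total) resources differ from its dynamic (available)
# resources. It is far below idle_timeout_minutes, so it must never be enough
# to trigger a scale-down on its own.
DUMMY_IDLE_DURATION_S = 3
```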
[core][autoscaler] Fix incorrectly terminating nodes misclassified as idle in autoscaler v1 (ray-project#48519)

In autoscaler v1, nodes are incorrectly classified as idle based solely on their resource usage metrics. This misclassification can occur under the following conditions:

1. Tasks running on the node do not have assigned resources.
2. All tasks on the node are blocked on get or wait operations.

This will lead to the incorrect termination of nodes during downscaling. To resolve this issue, use the `idle_duration_ms` reported by the raylet instead, which already takes the conditions above into account. ref: ray-project#39582

### Before: NodeDiedError

![image](https://github.com/user-attachments/assets/a126af98-7950-40c4-ad43-2448f4b0d71a)

### After

![image](https://github.com/user-attachments/assets/ae5f6c74-6b7a-4684-a126-66e9a562149c)

### Reproduction Script (on local fake nodes)

- Setting: head_nodes: < 10 cpus, worker nodes: 10 cpus
- Code:

```
import ray
import time


@ray.remote(max_retries=5, num_cpus=10)
def inside_ray_task_with_outside():
    print("start inside_ray_task_with_outside")
    sleep_time = 15
    start_time = time.perf_counter()
    while time.perf_counter() - start_time < sleep_time:
        time.sleep(0.001)


@ray.remote(max_retries=5, num_cpus=10)
def inside_ray_task_without_outside():
    print("start inside_ray_task_without_outside task")
    sleep_time = 50
    start_time = time.perf_counter()
    while time.perf_counter() - start_time < sleep_time:
        time.sleep(0.001)


@ray.remote(max_retries=0, num_cpus=10)
def outside_ray_task():
    print("start outside_ray_task task")
    future_list = [
        inside_ray_task_with_outside.remote(),
        inside_ray_task_without_outside.remote(),
    ]
    ray.get(future_list)


if __name__ == "__main__":
    ray.init()
    ray.get(outside_ray_task.remote())
```

## Related issue number

Closes ray-project#46492

Signed-off-by: Mimi Liao <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
Great work @mimiliaogo! And thanks for the reviews on this too, @kevin85421.