[core][autoscaler] Fix incorrectly terminating nodes misclassified as idle in autoscaler v1 #48519
Conversation
I will review this PR today.
horizon = now - (60 * self.config["idle_timeout_minutes"])

# local import to avoid circular dependencies
from ray.autoscaler.v2.sdk import get_cluster_resource_state
Autoscaler v1 relying on Autoscaler v2's functions feels hacky to me. We should avoid that.
Hi, here are my investigations:
The image shows that the idle time from 1. is always 0, while the idle time from 2. is correct.
From the investigations above, we now have two choices:
IMO, 1. would be the best practice, but I guess it will take a certain amount of time; 2. would be the quicker fix. @kevin85421 What do you think?
Let me discuss this with my colleagues. Proceeding with option (2) first, and then having me take it over, is also an option.
@kevin85421 Option 2 is done; the test error doesn't seem to be related to my changes.
cc @rickyyx
Niceeeee! Option 2 is a fine approach to me: while GetClusterResourceState is used by Autoscaler V2, it should also work when V2 is turned off, so having the V1 autoscaler poll that endpoint is fine.
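As a rough sketch of that approach (the function signature and proto field names below are assumptions to be checked against `ray.autoscaler.v2.sdk` and the autoscaler protos, not a quote of the PR's code):

```
from ray.autoscaler.v2.sdk import get_cluster_resource_state


def get_idle_duration_ms_by_node_id(gcs_client):
    # Poll the same GetClusterResourceState endpoint that autoscaler v2 uses;
    # per the discussion above, GCS serves it even when v2 is turned off.
    cluster_resource_state = get_cluster_resource_state(gcs_client)
    return {
        node_state.node_id: node_state.idle_duration_ms
        for node_state in cluster_resource_state.node_states
    }
```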
Would you mind adding some tests since this PR is not only for KubeRay Autoscaler?
ray_nodes_idle_duration_ms_by_ip = (
    self.load_metrics.ray_nodes_idle_duration_ms_by_ip
)
now = time.time()
Do we need to reset `now`, or is it OK to use the `now` arg?
Yes, we can.
@@ -238,13 +239,32 @@ def get_latest_readonly_config():
        prom_metrics=self.prom_metrics,
    )

    def get_cluster_resource_state(self):
If a function is only called by other member functions within the same class, we typically prefix the function name with an underscore `_`.

Suggested change:
- def get_cluster_resource_state(self):
+ def _get_cluster_resource_state(self):
We decided to work around this issue by using the Autoscaler V2 API. Do we still need to define `get_cluster_resource_state` in this file, or can we directly import it from `v2/sdk.py`?
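One way to keep that dependency soft is a function-local import, as the diff earlier in this thread does; the wrapper below is only a sketch (the `gcs_client` attribute and constructor wiring are assumptions):

```
class Monitor:
    def __init__(self, gcs_client):
        self.gcs_client = gcs_client

    def _get_cluster_resource_state(self):
        # Imported inside the method so autoscaler v1 does not depend on the
        # v2 SDK at module load time (avoids the circular import).
        from ray.autoscaler.v2.sdk import get_cluster_resource_state

        return get_cluster_resource_state(self.gcs_client)
```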
)
now = time.time()
last_used = {
    ip: now - duration
In `load_metrics.py`, the type hint of `idle_duration_ms` is `int`. Can we directly subtract `idle_duration_ms` from `time.time()`?
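For reference, a small worked example of the unit handling being discussed (the input values are illustrative): `idle_duration_ms` is an integer number of milliseconds, so it is divided by 1000 before being subtracted from `time.time()`, which is in seconds.

```
import time

idle_duration_ms_by_ip = {"10.0.0.1": 125_000, "10.0.0.2": 0}  # example input

now = time.time()
last_used = {
    # int milliseconds / 1000 -> seconds, so the subtraction stays in seconds
    ip: now - idle_duration_ms / 1000
    for ip, idle_duration_ms in idle_duration_ms_by_ip.items()
}
```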
@@ -97,10 +97,12 @@ def update(
        infeasible_bundles: List[Dict[str, float]] = None,
        pending_placement_groups: List[PlacementGroupTableData] = None,
        cluster_full_of_actors_detected: bool = False,
        node_last_used_time_s: float = time.time(),
What's the goal of setting this as the default value? It is a bit weird to me: `time.time()` will be called only once, when the function is defined. For example, the following program prints the same timestamp twice.

import time

def f(t=time.time()):
    print(t)

f()
time.sleep(5)
f()

Maybe use `node_last_used_time_s: Optional[float] = None` instead?
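A short sketch of the suggested `Optional[float] = None` pattern (a standalone function for illustration, not the actual `update()` signature): the timestamp is resolved on every call rather than once at definition time.

```
import time
from typing import Optional


def update(node_last_used_time_s: Optional[float] = None) -> float:
    # Evaluated per call, unlike a `= time.time()` default, which is
    # evaluated once when the function object is created.
    if node_last_used_time_s is None:
        node_last_used_time_s = time.time()
    return node_last_used_time_s
```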
Good catch!
):
        self.static_resources_by_ip[ip] = static_resources
        self.raylet_id_by_ip[ip] = raylet_id
        self.cluster_full_of_actors_detected = cluster_full_of_actors_detected
        self.ray_nodes_last_used_time_by_ip[ip] = node_last_used_time_s
In which cases will `node_last_used_time_s` not be set? If `node_last_used_time_s` is `None`, an error may occur when performing time (float) - None.
Check added on be86afd
By the way, would you mind opening an issue in the KubeRay repo to track the progress of adding an end-to-end test for this PR?
Issue opened: ray-project/kuberay#2568
@@ -97,6 +97,7 @@ def update(
        infeasible_bundles: List[Dict[str, float]] = None,
        pending_placement_groups: List[PlacementGroupTableData] = None,
        cluster_full_of_actors_detected: bool = False,
        node_last_used_time_s: Optional[float] = None,
In which case will this not be set and the default value be used? How about making the field required instead of optional?
In the testing code, `load_metrics.update()` is called manually, so it's more convenient to have `last_used_time` default to now.
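Purely as an illustration of the two call styles (a trimmed, hypothetical version of the signature; the real `LoadMetrics.update()` takes many more arguments):

```
import time
from typing import Optional

ray_nodes_last_used_time_by_ip = {}


def update(ip: str, node_last_used_time_s: Optional[float] = None) -> None:
    now = time.time()
    ray_nodes_last_used_time_by_ip[ip] = (
        node_last_used_time_s if node_last_used_time_s is not None else now
    )


update("10.0.0.1")  # test-style call: falls back to "now"
update("10.0.0.2", node_last_used_time_s=time.time() - 30)  # raylet-reported value
```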
How about making the field required instead? The current implementation makes `ray_nodes_last_used_time_by_ip` have two different definitions.
):
        self.last_used_time_by_ip[ip] = now
        self.ray_nodes_last_used_time_by_ip[ip] = (
            node_last_used_time_s if node_last_used_time_s else now
When checking whether a variable is `None`, it's better to use `is`.
node_last_used_time_s if node_last_used_time_s is not None else now
@@ -318,6 +337,9 @@ def update_load_metrics(self):
            infeasible_bundles,
            pending_placement_groups,
            cluster_full,
            time.time()
nit: make it into a single line or two lines?
time.time() - idle_duration_ms / 1000, # node_last_used_time_s = now - idle_duration
if node_id in ray_nodes_idle_duration_ms_by_id:
    idle_duration_ms = ray_nodes_idle_duration_ms_by_id[node_id]
else:
    logger.warning(
In which cases will this condition occur?
In theory, `get_all_resource_usage` and `get_cluster_resource_state` should return the same set of nodes. However, since they are implemented in Autoscaler v1 and v2 respectively, I'm uncertain whether there may be discrepancies between them in some special cases.
We may need to understand how severe the inconsistency is to determine whether this warrants a warning or a panic. @rickyyx Would you mind providing some insights?
Both seem to be getting data from GCS, which should be mostly consistent, I believe, but yeah, the code paths that generate the data are different -> I think graceful handling with warnings is fine (they should eventually be consistent).

BTW, is there any reason we don't rely mainly on v2's info from `get_cluster_resource_state` here entirely? I think the only v1 bit of info that's missing in v2 is "cluster_full", which we could probably also add to v2.

I don't have a strong opinion between the options below, given the source of info for both is GCS, and the likelihood of inconsistency is low IMO:

- use mostly v1's info, and patch with v2's idle info
- use mostly v2's info, and patch with v1's cluster_full or other missing ones in v2.

I think merging this PR to fix the idle issue is fine - and we could follow up to bring v2's RPC into parity with V1 so we could use V2 entirely here. Or if we are pushing V2 really hard, we might just deprecate V1 in the future.
> - use mostly v1's info, and patch with v2's idle info
> - use mostly v2's info, and patch with v1's cluster_full or other missing ones in v2.

How about we go with option 1 so that we can reduce the dependency between V1 and V2?
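A sketch of option 1 (hypothetical helper names; the real change lives in the monitor's `update_load_metrics`): keep v1's resource report as the primary source and only patch in v2's idle durations, warning when a node is missing from the v2 reply, as discussed above.

```
import logging
import time

logger = logging.getLogger(__name__)


def patch_last_used_times(v1_node_ids, idle_duration_ms_by_node_id):
    """Return node_id -> last-used timestamp (s), merging v2 idle info into v1's view."""
    now = time.time()
    last_used_time_s = {}
    for node_id in v1_node_ids:
        if node_id in idle_duration_ms_by_node_id:
            idle_duration_ms = idle_duration_ms_by_node_id[node_id]
        else:
            # Both views are derived from GCS, so a mismatch should be rare and
            # transient; degrade gracefully instead of panicking.
            logger.warning("Node %s missing from GetClusterResourceState reply.", node_id)
            idle_duration_ms = 0
        last_used_time_s[node_id] = now - idle_duration_ms / 1000
    return last_used_time_s
```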
assert autoscaler.resource_demand_scheduler.node_types["worker"][
    "resources"
] == {"CPU": 1}
# def _aggressiveAutoscalingHelper(self, foreground_node_launcher: bool = False):
This function is only used by the commented-out function above, so I commented it out as well.
@rickyyx, why are `testAggressiveAutoscaling` and `testAggressiveAutoscalingWithForegroundLauncher` commented out? Should we remove `_aggressiveAutoscalingHelper` if it is not used?
Probably irrelevant to this PR; we could leave it as it is and open another PR to either clean up this dead code or re-enable the test (if possible).
In that case, this PR is ready for merge.
@@ -320,6 +320,8 @@ def update_nodes(self):
SMALL_CLUSTER, **{"available_node_types": TYPES_A, "head_node_type": "empty_node"}
)

DUMMY_IDLE_DURATION_S = 3
Can you add some comments for `DUMMY_IDLE_DURATION_S`? For example, explain when `DUMMY_IDLE_DURATION_S` should be used (e.g., when static resources (total resources) are not equal to dynamic resources (available resources)) and that `DUMMY_IDLE_DURATION_S` should not trigger a scale-down? Thanks!
Done: 1e424ce
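For readers without the commit handy, the requested comment might look roughly like this (the wording below is illustrative, not necessarily what 1e424ce adds):

```
# Idle duration (in seconds) to report for fake nodes in these tests, e.g. when
# a node's static (total) resources differ from its dynamic (available)
# resources. It is far below idle_timeout_minutes, so it must never be enough
# to trigger a scale-down on its own.
DUMMY_IDLE_DURATION_S = 3
```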
[core][autoscaler] Fix incorrectly terminating nodes misclassified as idle in autoscaler v1 (ray-project#48519)

In autoscaler v1, nodes are incorrectly classified as idle based solely on their resource usage metrics. This misclassification can occur under the following conditions:

1. Tasks running on the node do not have assigned resources.
2. All tasks on the node are blocked on get or wait operations.

This will lead to the incorrect termination of nodes during downscaling. To resolve this issue, use the `idle_duration_ms` reported by the raylet instead, which already takes the conditions above into account. ref: ray-project#39582

### Before: NodeDiedError

![image](https://github.com/user-attachments/assets/a126af98-7950-40c4-ad43-2448f4b0d71a)

### After

![image](https://github.com/user-attachments/assets/ae5f6c74-6b7a-4684-a126-66e9a562149c)

### Reproduction Script (on local fake nodes)

- Setting: head_nodes: < 10 cpus, worker nodes: 10 cpus
- Code:

```
import ray
import time


@ray.remote(max_retries=5, num_cpus=10)
def inside_ray_task_with_outside():
    print("start inside_ray_task_with_outside")
    sleep_time = 15
    start_time = time.perf_counter()
    while time.perf_counter() - start_time < sleep_time:
        time.sleep(0.001)


@ray.remote(max_retries=5, num_cpus=10)
def inside_ray_task_without_outside():
    print("start inside_ray_task_without_outside task")
    sleep_time = 50
    start_time = time.perf_counter()
    while time.perf_counter() - start_time < sleep_time:
        time.sleep(0.001)


@ray.remote(max_retries=0, num_cpus=10)
def outside_ray_task():
    print("start outside_ray_task task")
    future_list = [
        inside_ray_task_with_outside.remote(),
        inside_ray_task_without_outside.remote(),
    ]
    ray.get(future_list)


if __name__ == "__main__":
    ray.init()
    ray.get(outside_ray_task.remote())
```

## Related issue number

Closes ray-project#46492

Signed-off-by: Mimi Liao <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
Great work @mimiliaogo! And thanks for the reviews on this too, @kevin85421.