RSDK-9440 Report machine state through GetMachineStatus #4616

Open: wants to merge 18 commits into main
Conversation

Member

@benjirewis benjirewis commented Dec 10, 2024

RSDK-9440

Changes:

  • Adds a new enumerated State field to robot.MachineStatus both server and client side
  • Starts machines with a "minimal" config to start web service earlier before starting with full config
  • Reports StateInitializing in robot.MachineStatus before reconfigure with full config occurs
  • Reports StateRunning in robot.MachineStatus after reconfigure with full config occurs
  • Exposes a SetInitializing method on robot.LocalRobot for the above two points to work
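
As a rough illustration of the first four points, the sketch below models an enumerated state on a status struct. `MachineState`, `MachineStatus`, and `statusFor` are simplified, hypothetical stand-ins written for this example; they are not the actual `robot` package definitions.

```go
package main

import "fmt"

// MachineState is a hypothetical stand-in for the enumerated State field
// described above; the real type lives in the robot package.
type MachineState int

const (
	StateUnknown MachineState = iota
	StateInitializing
	StateRunning
)

// String makes the state readable in logs and test output.
func (s MachineState) String() string {
	switch s {
	case StateInitializing:
		return "initializing"
	case StateRunning:
		return "running"
	default:
		return "unknown"
	}
}

// MachineStatus is a pared-down, illustrative status struct.
type MachineStatus struct {
	State MachineState
}

// statusFor reports StateInitializing until the reconfigure with the full
// config has completed, and StateRunning afterwards.
func statusFor(fullConfigApplied bool) MachineStatus {
	if fullConfigApplied {
		return MachineStatus{State: StateRunning}
	}
	return MachineStatus{State: StateInitializing}
}

func main() {
	fmt.Println(statusFor(false).State) // before reconfigure with full config
	fmt.Println(statusFor(true).State)  // after reconfigure with full config
}
```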

Testing:

  • Modifies server and client side MachineStatus tests to make assertions on State
  • Basic integration test
  • Checks for state running before returning from client.New when in testing
  • Adds more injected machine status functions to client and client session tests

@viambot viambot added the safe to test This pull request is marked safe to test from a trusted zone label Dec 10, 2024
@benjirewis benjirewis changed the title RSDK-9440 RSDK-9440 Report machine state through GetMachineStatus Dec 10, 2024
Member Author

@benjirewis benjirewis left a comment

Still a WIP wrt testing; will leave in draft. These are my ideas so far, though.

// and immediately start web service. We need the machine to be reachable
// through the web service ASAP, even if some resources take a long time to
// initially configure.
minimalProcessedConfig := &(*fullProcessedConfig)
Member

maybe CopyPublicFields? might look less janky

Member Author

Good idea.
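
One detail worth spelling out about the `&(*fullProcessedConfig)` line: in Go, `&(*p)` evaluates to `p` itself, so it does not produce a copy, and clearing fields through it would also clear them on the full config. Below is a minimal sketch of the difference between that alias and a shallow copy. The `Config` type and both helper functions are hypothetical and trimmed down for illustration; they are not the real config struct or its copy helpers.

```go
package main

import "fmt"

// Config is a hypothetical, trimmed-down config type for illustration.
type Config struct {
	Modules   []string
	Processes []string
}

// aliasOf mirrors the `&(*cfg)` expression: it returns the same pointer.
func aliasOf(cfg *Config) *Config {
	return &(*cfg)
}

// shallowCopyOf copies the struct value, so top-level fields can be
// reassigned independently (slices still share backing arrays).
func shallowCopyOf(cfg *Config) *Config {
	c := *cfg
	return &c
}

func main() {
	full := &Config{Modules: []string{"m1"}, Processes: []string{"p1"}}

	alias := aliasOf(full)
	fmt.Println("alias is same pointer:", alias == full)

	minimal := shallowCopyOf(full)
	minimal.Modules = nil
	minimal.Processes = nil
	fmt.Println("full modules:", len(full.Modules), "minimal modules:", len(minimal.Modules))
}
```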

if err := web.RunWeb(ctx, myRobot, options, s.logger); err != nil {
	return err
}
myRobot.Reconfigure(ctx, fullProcessedConfig)
Member

is there a risk that the config watcher would call reconfigure before this reconfigure is called?

Member Author

Yes; good call out. Discussed offline a bit, we should start the config watcher goroutine only after this reconfigure is called.

That does mean that we have the same behavior as today: while the robot is starting up (during both the initial robotimpl.New(minimalConfig) and myRobot.Reconfigure(fullProcessedConfig)), no new config changes will be seen. So, if a user messes up their config and accidentally adds a module that takes forever to start up, they will not be able to quickly remove that module from their config. Instead, they'll have to restart or shut down their robot if they want to stop the initial construction. Once again, I don't think this is different from what we have currently, and, of course, viam-server is receptive to gRPC requests earlier with the changes in this PR.
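
The ordering described in this reply can be sketched as follows. This is only an illustration: `startupSteps` and `serve` are hypothetical names, and the steps are recorded as strings rather than calling any real RDK function.

```go
package main

import "fmt"

// startupSteps is a hypothetical recorder of the order in which the
// entrypoint performs its startup work.
type startupSteps struct {
	log []string
}

func (s *startupSteps) record(step string) {
	s.log = append(s.log, step)
}

// serve sketches the ordering discussed above. None of these steps call the
// real functions; they only record what would happen, in order.
func serve(s *startupSteps) {
	s.record("new robot with minimal config")
	s.record("start web service")
	s.record("reconfigure with full config")
	// The watcher is started only now, so it cannot trigger a reconfigure
	// before the initial full reconfigure has been called.
	s.record("start config watcher")
}

func main() {
	s := &startupSteps{}
	serve(s)
	for i, step := range s.log {
		fmt.Printf("%d. %s\n", i+1, step)
	}
}
```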

@benjirewis
Member Author

I broke a lot of tests that I'm presuming are expecting resources to be available as soon as the web service is available. Thinking about it.

minimalProcessedConfig.Modules = nil
minimalProcessedConfig.Processes = nil

myRobot, err := robotimpl.New(ctx, minimalProcessedConfig, s.logger, robotOptions...)
Member

Is this the best way of achieving what we want or the most expedient?

I'm fine with this as-is. And I'm kind of fine never coming back to think about this. But the whole "robot owns the web server" feels backwards.

There would be fewer states to consider if we could start a web service and register robots with it. And there'd be a small API that describes:

  • What state the robot is in (startup or running) and
  • which APIs are available, e.g:
    • just "GetMachineStatus" and maybe "ResourceNames"
    • but none of "SetPower"/other resource specific APIs

But in this PR we have 90-100 lines between this comment/robot.New and when the web service is started. That's a lot of lines to accidentally break our contract and add some blocking code.
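
To make the "small API" idea concrete, here is a minimal sketch of gating methods by machine state. Everything in it is hypothetical: the `MachineState` type, the `alwaysAvailable` table, and `methodAllowed` are invented for this example and do not exist in the PR. The method names are the ones mentioned in the comment above.

```go
package main

import "fmt"

// MachineState is a hypothetical two-value state for this sketch.
type MachineState int

const (
	StateInitializing MachineState = iota
	StateRunning
)

// alwaysAvailable lists the methods that this sketch serves in any state.
var alwaysAvailable = map[string]bool{
	"GetMachineStatus": true,
	"ResourceNames":    true,
}

// methodAllowed reports whether a method may be served in the given state:
// everything once running, only the always-available set during startup.
func methodAllowed(state MachineState, method string) bool {
	if state == StateRunning {
		return true
	}
	return alwaysAvailable[method]
}

func main() {
	for _, m := range []string{"GetMachineStatus", "ResourceNames", "SetPower"} {
		fmt.Printf("%s during startup: %v\n", m, methodAllowed(StateInitializing, m))
	}
}
```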

Member Author

> Is this the best way of achieving what we want or the most expedient?

Perhaps not, and I understand your argument there. I've introduced a slightly different/simpler mechanic for controlling the "initializing" value, so that might address some of your concerns here. I didn't go so far as starting a web service and registering robots with it (if I'm understanding what you're saying.)

// Use `fullProcessedConfig` as the initial `oldCfg` for the config watcher
// goroutine, as we want incoming config changes to be compared to the full
// config.
oldCfg := fullProcessedConfig
utils.ManagedGo(func() {
Member

It's probably about time this lambda gets its own function/name. I think a lot of my above concern goes away if these 60 lines of control flow keyword soup are hidden by default.

Member Author

Great idea; working on it + will re-request review when done.

@@ -479,7 +502,8 @@ func (s *robotServer) serveWeb(ctx context.Context, cfg *config.Config) (err err
 	}()
 	defer cancel()

-	options, err := s.createWebOptions(processedConfig)
+	// Create initial web options with `minimalProcessedConfig`.
+	options, err := s.createWebOptions(minimalProcessedConfig)
Member

@dgottlieb dgottlieb Dec 16, 2024

I can't comment off-diff. The goroutine spun off above will check for diff.NetworkEqual and if not, run myRobot.StartWeb (newline 490).

Just below this we call web.RunWeb. I'm not sure what the significance is between having different methods, StartWeb and RunWeb, but assuming that's not interesting: is it possible for those two things to race? Are we guaranteed to end up with the right set of weboptions?

To clarify, this is a question about existing behavior. I don't think this patch changed anything here.

Member Author

> The goroutine spun off above will check for diff.NetworkEqual and if not, run myRobot.StartWeb (newline 490).

Correct; I believe there's a call to StopWeb before that happens, too. And then a Reconfigure after that StopWeb. All about handling network changes in the config.

> Just below this we call web.RunWeb. I'm not sure what the significance is between having different methods, StartWeb and RunWeb, but assuming that's not interesting

It's interesting having those two methods. StartWeb starts up the web service on the robot. RunWeb does that, but also waits on <-ctx.Done(), so it's a blocking call and represents the "main" program that "runs" when you call go run web/cmd/server/main.go.

> Is it possible for those two things to race? Are we guaranteed to end up with the right set of weboptions?

It depends what you mean by race. There is a lock on starting the web service, so I'm not sure we'd see a race manifest as an actual DATA RACE, but I think you are "right" in wondering about the "right set of weboptions." I'm not entirely sure, but I think RunWeb would run into an error if it tried to start the web service with an old set of options after the config watcher goroutine had started it already with a new set of options. So, my guess is we'd see an error from RunWeb and an inability to start the server in the event of the race you're describing.
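
A minimal sketch of the StartWeb/RunWeb distinction described in this reply is shown below. The `webService` type is a hypothetical stand-in, not the real web service. In particular, returning an error on a second start is an assumption that follows the guess above; the actual behavior has not been confirmed in this thread.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// webService is a hypothetical stand-in for the robot's web service.
type webService struct {
	mu      sync.Mutex
	started bool
}

var errAlreadyStarted = errors.New("web service already started")

// StartWeb starts the service under a lock and returns immediately.
// Rejecting a second start is an assumption made for this sketch.
func (w *webService) StartWeb() error {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.started {
		return errAlreadyStarted
	}
	w.started = true
	return nil
}

// RunWeb starts the service and then blocks until ctx is done.
func (w *webService) RunWeb(ctx context.Context) error {
	if err := w.StartWeb(); err != nil {
		return err
	}
	<-ctx.Done()
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()

	w := &webService{}
	// Blocks until the timeout fires, then returns nil.
	fmt.Println("RunWeb:", w.RunWeb(ctx))

	// A later start on the same service is rejected in this sketch.
	fmt.Println("StartWeb:", w.StartWeb())
}
```

Under these assumptions, the mutex means a concurrent StartWeb and RunWeb would not be a data race, but whichever call arrives second would fail rather than apply its options.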

@@ -498,6 +502,8 @@ func newWithResources(
 	}

 	successful = true
+	// Robot is "initializing" until first reconfigure after initial creation completes.
+	r.initializing.Store(true)
Member

@dgottlieb dgottlieb Dec 16, 2024

Technically there's a Reconfigure earlier on newcode 491 that sets this value to false. Not to mention the initializing value is initialized (ugh) to false. Two things:

  • I take it that it's important we exit this function with initializing set to true. But I would expect to see it set at the top, near the constructor. Can we document that the placement here is intentional to avoid prior calls mucking with the state?
  • Is this function guaranteed to not start webserver and expose the GetMachineStatus API? If it can, it seems we might be setting initializing too late and may allow clients to observe an illegal transition of ready -> initializing.

Member

Also -- line 428 (old) 432 (new) refers to the mod manager web server. Do we need to consider/provide guidelines for how module SDKs (which are -- in theory -- different from "application SDKs") use and perhaps expose GetMachineStatus?

Member Author

I've modified the mechanics here slightly. You'll see there's a new robot.Option to start a robot in initializing mode. You can then use SetInitializing(false) to mark the robot as running. This means that only the code here in web/server/entrypoint.go is "special" with respect to initialization. All other calls to robotimpl.New will create robots that always return robot.StateRunning from MachineStatus.

I'm not sure that will address all your concerns here, and I'll think a bit harder about your module question before re-requesting review.
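
The revised mechanics can be sketched with Go's functional-options pattern. The names `localRobot`, `Option`, `WithInitializing`, and `newRobot` are hypothetical and chosen for this example. Only `SetInitializing` and the two state names come from the PR description, and the real option may be named and wired differently.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// MachineState is a hypothetical two-value state for this sketch.
type MachineState int

const (
	StateInitializing MachineState = iota
	StateRunning
)

// localRobot is a hypothetical robot that tracks only its initializing flag.
type localRobot struct {
	initializing atomic.Bool
}

// Option is a hypothetical functional option applied at construction time.
type Option func(*localRobot)

// WithInitializing is a made-up name for an option that starts the robot
// in the initializing state.
func WithInitializing() Option {
	return func(r *localRobot) { r.initializing.Store(true) }
}

// newRobot applies options before the robot is handed to any caller, so no
// client can observe a running -> initializing transition.
func newRobot(opts ...Option) *localRobot {
	r := &localRobot{}
	for _, opt := range opts {
		opt(r)
	}
	return r
}

// SetInitializing flips the flag; the entrypoint would call it with false
// once the reconfigure with the full config has completed.
func (r *localRobot) SetInitializing(v bool) { r.initializing.Store(v) }

// State derives the reported state from the flag.
func (r *localRobot) State() MachineState {
	if r.initializing.Load() {
		return StateInitializing
	}
	return StateRunning
}

func main() {
	plain := newRobot()
	fmt.Println("default robot running:", plain.State() == StateRunning)

	special := newRobot(WithInitializing())
	fmt.Println("entrypoint robot initializing:", special.State() == StateInitializing)
	special.SetInitializing(false)
	fmt.Println("entrypoint robot running:", special.State() == StateRunning)
}
```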

// been closed above. This ensures processes are shutdown before any files
// are deleted they are using.
//
// If initializing, machine will be starting with no modules, but may
Member

I'm not sure I understand this. Does anything actually go wrong if we don't guard this "cleanup" logic? Or are we just suggesting that making these calls would be "wasteful" no-ops?

The existing comment/first paragraph refers to "cleanup unused packages", so if we never started up with any, where are they coming from? Existing files on the file system from a previous start?

Member Author

@cheukt had mentioned it would be good to guard the lines below based on initialization.

> Or are we just suggesting that making these calls would be "wasteful" no-ops?

That's my understanding, yep.

> Existing files on the file system from a previous start?

Also my understanding, yep.

@@ -107,8 +107,11 @@ func TestClientSessionOptions(t *testing.T) {
return &dummyEcho{Named: arbName.AsNamed()}, nil
},
ResourceRPCAPIsFunc: func() []resource.RPCAPI { return nil },
LoggerFunc: func() logging.Logger { return logger },
SessMgr: sessMgr,
MachineStatusFunc: func(_ context.Context) (robot.MachineStatus, error) {
Member Author

I inject an almost identical MachineStatusFunc all over the place in client_session_test.go and client_test.go. client.New now calls MachineStatus when in a testing environment to make sure the robot is in a robot.StateRunning and therefore capable of receiving resource API calls.

I thought about placing this functionality in inject/robot.go, but opted to just put it in every robot inject, since we currently do that for ResourceNamesFunc and ResourceByNameFunc.
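
For readers unfamiliar with the check, here is a rough sketch of waiting for a running state against an injected status function. The thread says only that `client.New` checks for the running state in testing environments. The polling loop, the `waitForRunning` and `demo` helpers, and the simplified `statusFunc` signature are all assumptions made for this example.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// MachineState and MachineStatus are simplified, hypothetical types.
type MachineState int

const (
	StateInitializing MachineState = iota
	StateRunning
)

type MachineStatus struct {
	State MachineState
}

// statusFunc is modeled loosely on an injected MachineStatusFunc.
type statusFunc func(ctx context.Context) (MachineStatus, error)

// waitForRunning polls status until it reports StateRunning, an error
// occurs, or ctx is done. Polling is an assumption of this sketch.
func waitForRunning(ctx context.Context, status statusFunc, interval time.Duration) error {
	for {
		ms, err := status(ctx)
		if err != nil {
			return err
		}
		if ms.State == StateRunning {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
}

// demo runs waitForRunning against a fake that reports running on its third
// call, and returns how many status calls were made.
func demo() (int, error) {
	calls := 0
	status := func(context.Context) (MachineStatus, error) {
		calls++
		if calls < 3 {
			return MachineStatus{State: StateInitializing}, nil
		}
		return MachineStatus{State: StateRunning}, nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	err := waitForRunning(ctx, status, time.Millisecond)
	return calls, err
}

func main() {
	calls, err := demo()
	fmt.Println("status calls:", calls, "err:", err)
}
```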

@@ -129,13 +132,11 @@ func TestClientSessionOptions(t *testing.T) {
Disable: true,
})))
}
roboClient, err := client.New(ctx, addr, logger, opts...)
Member Author

In three of the client session tests, I had to move the instantiation of the robot client to be below the injection of the session manager. client.New now calls MachineStatus in testing environments, which is not exempted from session creation. So, the injected robot will try to Start a session from its session manager, and end up panicking if StartFunc is not defined. I did not leave a comment in-line explaining this.

I'd like to meet offline at some point to walk through TestClientSessionOptions, in particular, and add some documentation to what it's testing.

@@ -859,21 +883,24 @@ func TestClientUnaryDisconnectHandler(t *testing.T) {
info *grpc.UnaryServerInfo,
handler grpc.UnaryHandler,
) (interface{}, error) {
// Allow a single GetMachineStatus through; return `io.ErrClosedPipe`
// after that.
Member Author

Left a comment here to better describe what this test is doing. We also have to call handler(ctx, req) below, now, as we do want to invoke the injected MachineStatus to understand that the robot is running.
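
The "allow one call through, then fail" behavior can be sketched without gRPC as a wrapper around a handler. `unaryHandler` here is a simplified, hypothetical stand-in for `grpc.UnaryHandler`, and `allowFirstN` and `demo` are names invented for this example. The test's actual interceptor is not reproduced.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"sync/atomic"
)

// unaryHandler is a simplified, hypothetical stand-in for grpc.UnaryHandler.
type unaryHandler func(ctx context.Context, req any) (any, error)

// allowFirstN wraps handler so that the first n calls are forwarded to it
// and every later call fails with io.ErrClosedPipe.
func allowFirstN(n int64, handler unaryHandler) unaryHandler {
	var calls atomic.Int64
	return func(ctx context.Context, req any) (any, error) {
		if calls.Add(1) > n {
			return nil, io.ErrClosedPipe
		}
		return handler(ctx, req)
	}
}

// demo lets a single call through to a fake status handler and returns the
// errors from the first and second calls.
func demo() []error {
	h := allowFirstN(1, func(context.Context, any) (any, error) {
		return "running", nil
	})
	_, first := h(context.Background(), nil)
	_, second := h(context.Background(), nil)
	return []error{first, second}
}

func main() {
	errs := demo()
	fmt.Println("first call error:", errs[0])
	fmt.Println("second call error:", errs[1])
}
```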


t.Run("unary call to connected remote", func(t *testing.T) {
t.Helper()
Member Author

[fly-by] Seemingly incorrect test helper annotations.

},
1,
2, // once for client.New call and once for MachineStatus call
Member Author

My comment here and below is hopefully at least somewhat self-explanatory, but the idea here is that this test calls client.New and then rc.MachineStatus to assert on the structure of the machine status. Since client.New now calls MachineStatus in testing environments, the count here and below is increased by a factor of two.

@benjirewis benjirewis marked this pull request as ready for review December 18, 2024 21:28
@benjirewis benjirewis requested a review from a team as a code owner December 18, 2024 21:28
Labels
safe to test This pull request is marked safe to test from a trusted zone
4 participants