Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Errors thrown when iterating over subscription source event streams (AsyncIterables) should be caught #4001

Open
aseemk opened this issue Dec 20, 2023 · 25 comments

Comments

@aseemk
Copy link

aseemk commented Dec 20, 2023

Context

Hi there. We're using graphql-js and serving subscriptions over WebSocket via graphql-ws (as recommended by Apollo for both server and client).

In our subscriptions' subscribe methods, we always return an AsyncIterable pretty much right away. We typically do this either by defining our methods via async generator functions (async function*), or by calling graphql-redis-subscriptions's asyncIterator method. Our subscribe methods effectively never throw an error just providing an AsyncIterable.

However, we occasionally hit errors actually streaming subscription events, when graphql-js calls our AsyncIterable's next() method. E.g. Redis could be momentarily down, or an upstream producer/generator could fail/throw. So we sometimes throw errors during iteration. And importantly, this can happen mid-stream.

Problem

graphql-js does not try/catch/handle errors when iterating over an AsyncIterable:

async next() {
return mapResult(await iterator.next());
},

There's even a test case today that explicitly expects these errors to be re-thrown:

it('should pass through error thrown in source event stream', async () => {
async function* generateMessages() {
yield 'Hello';
throw new Error('test error');
}

graphql-ws doesn't try/catch/handle errors thrown during iteration either:

https://github.com/enisdenjo/graphql-ws/blob/e4a75cc59012cad019fa3711287073a4aef9ed05/src/server.ts#L813-L815

As a result, when occasional errors happen like this, the entire underlying WebSocket connection is closed.

This is obviously not good! 😅 This interrupts every other subscription the client may be subscribed to at that moment, adds reconnection overhead, drops events, etc. And if we're experiencing some downtime on a specific subscription/source stream, this'll result in repeat disconnect-reconnect thrash, because the client also has no signal on which subscription has failed!!

Inconsistency

You could argue that graphql-ws should try/catch these errors and send back an error message itself. The author of graphql-ws believes this is the domain of graphql-js, though (enisdenjo/graphql-ws#333), and I agree.

That's because graphql-js already try/catches and handles errors both earlier in the execution of a subscription and later:

  • Errors producing an AsyncIterable in the first place (the synchronous result of calling the subscription's subscribe method, AKA producing a source event stream in the spec) are caught, and returned as a {data: null, errors: ...} result:

    try {
    const eventStream = executeSubscription(exeContext);
    if (isPromise(eventStream)) {
    return eventStream.then(undefined, (error) => ({ errors: [error] }));
    }
    return eventStream;
    } catch (error) {
    return { errors: [error] };
    }

  • Errors mapping iteration results to response events (the result of calling the subscription's resolve method) are caught, and sent back to the client as a {value: {data: null, errors: ...}, done: false} event:

    return mapAsyncIterable(
    resultOrStream,
    (payload: unknown) =>
    executeImpl(
    buildPerEventExecutionContext(exeContext, payload),
    // typecast to ExecutionResult, not possible to return
    // ExperimentalIncrementalExecutionResults when
    // exeContext.operation is 'subscription'.
    ) as ExecutionResult,
    );

So it's only iterating over the AsyncIterable — the "middle" step of execution — where graphql-js doesn't catch errors and convert them to {data: null, errors: ...} objects.

This seems neither consistent nor desirable, right?

Alternatives

We can change our code to:

  • Have our AsyncIterable never throw in next() (try/catch every iteration ourselves)
    • Have it instead always return a wrapper type, mimicking {data, errors}
  • Define a resolve method just to unwrap this type (even if we have no need for custom resolving otherwise)
    • And have this resolve method throw any errors or return data if no errors

Doing this would obviously be pretty manual, though, and we'd have to do it for every subscription we have.

Relation to spec

Given the explicit test case, I wasn't sure at first if this was an intentional implementation/interpretation of the spec.

I'm not clear from reading the spec, and it looks like at least one other person wasn't either: graphql/graphql-spec#995.

But I think my own interpretation is that the spec doesn't explicitly say to re-throw errors. It just doesn't say what to do.

And I believe that graphql-js is inconsistent in its handling of errors, as shown above. The spec also doesn't seem to clearly specify how to handle errors creating source event streams, yet graphql-js (nicely) handles them.

I hope you'll consider handling errors iterating over source event streams too! Thank you.

@yaacovCR
Copy link
Contributor

I agree that the spec is agnostic, and that it would be useful for graphql-js to be consistent and provide explanatory errors. I think the spec should also be improved.

For context, it seems from #918 that prior to that PR, all subscribe errors threw, and that the argument was made there that explanatory errors would be helpful in some cases. The parts of the PR that I skimmed through doesn't seem to indicate why explanatory errors to the client would not be helpful with iteration errors; my suspicion is that the PR was attacking the low-hanging fruit, and the authors/reviewers there would not necessarily object to even more explanatory errors. :)

@robzhu @leebyron

@yaacovCR
Copy link
Contributor

I think the next step would be to raise this topic at a working group meeting. @aseemk are you interested in championing this there? (I am potentially dangerously assuming that this hasn’t happened already…)

@yaacovCR
Copy link
Contributor

yaacovCR commented Oct 14, 2024

graphql/graphql-spec#1099 has editorial changes to the event stream that I am not sure are 100% clear on this point. The way forward I think still goes through a discussion at a WG meeting.

@yaacovCR
Copy link
Contributor

yaacovCR commented Nov 7, 2024

This was discussed at the November 2024 WG:

My interpretation of the conclusions:

  • Our behavior should be tied to the specification, right now we are aligned, @leebyron 's overall direction would be that graphql-js should never throw, but that would have to be reflected in the spec, which is currently being worked on with respect to event streams at Editorial changes for Event Streams graphql-spec#1099 => so that this change in graphql-js may be able to move forward together with the spec language clarification there
  • This would be a huge breaking change for consumers of graphql-js and would have to be communicated accordingly => luckily we have semver and v17 on the horizon, so that provides an opportunity and out-of-band communication would obviously help, i.e. migration guides, blog posts, docs, etc.

@benjie
Copy link
Member

benjie commented Nov 13, 2024

I, personally, think that event streams that raise an error during execution (typically after yielding results from the stream) should cause the client stream to be terminated via emitting an error - we should not terminate successfully with a GraphQL error payload. The graphql() function itself should basically never throw, but successfully returning an async iterable which later yields an error is not the same.

It's already possible (I think?) for users to wrap the async iterables that they return from subscribe() in such a way that errors are absorbed and the iterable terminates cleanly if such behavior is desired.

@yaacovCR
Copy link
Contributor

yaacovCR commented Nov 14, 2024

It's already possible (I think?) for users to wrap the async iterables that they return from subscribe() in such a way that errors are absorbed and the iterable terminates cleanly if such behavior is desired.

Fwiw, the original poster @aseemk suggested this but was trying to avoid a per-subscription solution.

Doing this would obviously be pretty manual, though, and we'd have to do it for every subscription we have.


To your preference:

I, personally, think that event streams that raise an error during execution [sic] should cause the client stream to be terminated via emitting an error - we should not terminate successfully with a GraphQL error payload...

@benjie would you be able to elaborate more on your reasoning for this preference?

More specifically, I think the proposal would be to return a final { errors: <GraphQLErrror> } as the form in which the iterator completes with an error.... rather than simply throwing.

I think it's important to consider another failure mode. What is the CreateSourceEventStream() algorithm succeeds and produces the event stream, but the ExecutionSubscriptionEvent() algorithm throws a "request errors," emitting an errors payload of this form. If that did happen, I would think the service processing this stream should stop as soon as the first "request error" is generated.

What might be an example of a "request error" of this type? Well, it would not happen with failure of variable coercion, because the variables have already been coerced, that's one of the main differences between ExecutionSubscriptionEvent() and ExecuteQuery(). But it could be that the queryType doesn't exist (although that's pretty contrived). If the service did not stop on that request error, it would receive a "request error" for every event, and it would make sense to stop right away. [Not sure why I forgot that for subscriptions even ExecutionSubscriptionEvent() runs the regular field resolvers from the subsciption root type also, not the query root type, meaning this type of request error cannot exist.]

I can't think of a practical example of a "request error" for ExecutionSubscriptionEvent(), but actually for exactly that reason I think it's safe to dictate that a "request error" of any type is basically equivalent to the response stream "completing with error," and we do not have to reserve a separate failure mode to signify that the response stream has in fact "completed with error." [As @benjie points out later, it doesn't seem like a request error can be made from ExecutionSubscriptionEvent() at all!]

Gaming out what might be driving your preference, maybe you are suggesting that even if that is the case now, it would be prudent to reserve the ability for services processing the response stream to distinguish between these two types of events, and so we should preserve the distinct failure modes. I think that's fair, but I would love to hear more about your exact motivation.


I would say that the way this should be handled should not be on a per-subscription basis, but just by response stream processing services like graphql-ws catching the errors and reporting them more cleanly. The original poster actually suggested that services like graphql-ws handle this, but it was argued that this might violate the spec. I'm not sure why that would be. Maybe @enisdenjo could chime in, I know it's been a while since this was first raised, but I am not sure if I have the answer from the comment over here. It sounds like graphql-ws would like to report the error as an event => is that limited in any way by what graphql-js decides to do?

@benjie
Copy link
Member

benjie commented Nov 21, 2024

My concern was that completing a stream successfully (but with the final payload having errors) and completing it with error (due to an underlying stream error) implicitly have different meanings, and was concerned that a final payload with just { errors } wouldn't be sufficient to differentiate this. However, more careful scrutiny of ExecuteSubscriptionEvent reveals that data is always set (even if set to null in the case of ExecuteSelectionSet handling an error), and thus not setting data in the final payload would be a clear signal this relates to the stream itself rather than the selection set, so I think this would be an acceptable (albeit potentially breaking) change.

All that said, in the event of an error in a single subscription stream across a multiplexed protocol (such as graphql-ws), only that single stream in the multiplex should be terminated. This seems to align with the Error event in the graphql-ws protocol, so I'm surprised to hear that the implementation terminates the multiplex and not just the single stream? I think this specific issue is a separate (but related) concern.

@benjie
Copy link
Member

benjie commented Nov 21, 2024

Here's my first punt at this: graphql/graphql-spec#1126

Here's a go at making this change on top of Lee's editorial changes:

@benjie
Copy link
Member

benjie commented Nov 21, 2024

One important note here is that internal errors will still result in the stream closing with an error - this should still not terminate the entire multiplex, only the individual stream within it.

@enisdenjo
Copy link
Member

enisdenjo commented Nov 26, 2024

this should still not terminate the entire multiplex, only the individual stream within it

Just thinking of cases where internal errors span across the whole GraphQL instance - but couldnt think of any. That being said - I agree. graphql-ws should change to use the "error" message type for internal errors too.

@yaacovCR
Copy link
Contributor

Tried a quick implementation at #4302

@enisdenjo
Copy link
Member

enisdenjo commented Nov 27, 2024

I like the way #4302 looks. Even if/when it lands I'll proceed with #4001 (comment). There will be a distinction, errors from #4302 will be a part of the next message and uncaught ones the error message.

@martinbonnin
Copy link

errors from #4302 will be a part of the next message and uncaught ones the error message.

@enisdenjo not sure I follow. Is there an error message in graphql-sse? I can only see next and complete. Does that mean there's no way to represent "internal errors" in graphql-sse?

@enisdenjo
Copy link
Member

enisdenjo commented Nov 27, 2024

@martinbonnin graphql-sse follows SSE semantics where all events are just data transmitted over the wire.

The only reason why complete exists in the "distinct connections mode" is to indicate to the client that the following close of connection is expected. For example, EventSource will keep reconnecting forever until the client stops the stream.

That being said, next just means "event" - any event, i.e. if the complete werent necessary, next wouldnt exist either.

"Single connection mode", on the other hand, would need the complete nevertheless because of its multiplex nature. But there we're keeping the semantics in line by not introducing an error message type.

P.S. I spoke too fast in my previous comment and have updated it in place to not confuse future readers.

@martinbonnin
Copy link

martinbonnin commented Nov 27, 2024

SSE nextevent is like so:

interface NextMessage {
  event: 'next';
  data: {
    id: '<unique-operation-id>';
    payload: ExecutionResult;
  };
}

Because the payload is always ExecutionResult, we can't distinguish between responseStream emits response and responseStream completes with error when using graphql-sse. Or can we?

PS: not saying this is a big deal but trying to understand if it's worth keeping this difference with graphql-ws. I think I'dd happily drop graphql-ws error event and just use next everywhere but I'm catching up so maybe there's a reason we need both next and error? (next having payload: ExecutionResult and error having payload: [Error])

@enisdenjo
Copy link
Member

Because the payload is always ExecutionResult, we can't distinguish between responseStream emits response and responseStream completes with error when using graphql-sse. Or can we?

This is also an ExecutionResult, but an erroneous one:

{
  "errors": [{ "message": "whatever" }]
}

I think I'dd happily drop graphql-ws error event and just use next everywhere but I'm catching up so maybe there's a reason we need both next and error? (next having payload: ExecutionResult and error having payload: [Error])

Yes, you're on the right track. I was also thinking about it yesterday and maybe having error there too is not necessary - the original reasoning with having the error message in graphql-ws is to follow the Observable pattern (next, error and complete).

@martinbonnin
Copy link

the original reasoning with having the error message in graphql-ws is to follow the Observable pattern (next, error and complete).

Makes complete sense and it's also what is pushed by the current spec (see latest editorial PR):

A response stream can:

  1. "emit a response" (next)
  2. "complete normally" (complete)
  3. "complete with error" (error)

But if a transport cannot support the spec because it doesn't have 3. then it's a problem.

I'd personnally vote for simplicity and only use next and complete (like graphql-sse) but this might be a bigger spec change... I'm curious what's everyone thoughts here.

@enisdenjo
Copy link
Member

enisdenjo commented Nov 28, 2024

Good point. If others deem it necessary, I'd be happy to add the error message to graphql-sse too keeping parity.

However, #4302 is a "problem" then. A "problem" because the iterator wont throw, will instead catch exceptions and emit them as ExecutionResults with errors property. I.e. #4302 removes the error from JS native iterable semantics. Meaning, neither graphql-sse nor graphql-ws will know when to error (unless an error outside graphql-js subscriptions occurs) since all values are just emitted from the iterator.

@martinbonnin
Copy link

martinbonnin commented Nov 28, 2024

If others deem it necessary, I'd be happy to add the error message to graphql-sse too keeping parity.

FWIW, I'm currently in the "opposite" team: the "let's simplify the spec by removing the need for error" team.

That might be a bigger lift but also since we're on the topic now, it might be the good time to do it.

Meaning, neither graphql-sse nor graphql-ws will know when to error (unless an error outside graphql-js subscriptions occurs) since all values are just emitted from the iterator.

Agreed. And this is fine I think?

error has created confusion in Apollo Kotlin users. What should you do when you receive an error? Retry? Not if it's a validation error (edit: or are validation errors next events? looks like yes). But if the source stream fails because redis is down (like here), probably retrying with exponential backoff is desired.

I came to the conclusion that the error vs next distinction is not useful for clients and that other information such as whether the subscription should be retried is better left outside the spec. For an example in error.extensions.

This is a loosely held opinion though and if anyone has use cases for error vs next, I'm very curious to hear them.

@benjie
Copy link
Member

benjie commented Nov 28, 2024

A "problem" because the iterator wont throw, will instead catch exceptions and emit them as ExecutionResults with errors property. I.e. #4302 removes the error from JS native iterable semantics. Meaning, neither graphql-sse nor graphql-ws will know when to error (unless an error outside graphql-js subscriptions occurs) since all values are just emitted from the iterator.

This is the same as what graphql() and execute() do for non-subscription requests anyway, right? They should never throw an error (except if an internal error occurs) - they will always yield success, though that success may just be { errors: [...] }. Or am I misunderstanding your point?


From a consumer point of view, an error yielded deliberately by the server in the process of performing its GraphQL duties ({ errors: [...] }) is very different from an error originating from somewhere else - for example the Nginx proxy reboots, terminating all connections cleanly with an error; or the wifi drops, terminating all connections uncleanly - and we should differentiate these as best we can. In the case of a multiplex, the multiplex takes over the GraphQL duty and must handle the error that occurs on a stream, either by terminating just that stream with an error (highly recommended), or by terminating the entire multiplex (perhaps necessary if the multiplex doesn't support indicating errors within a stream).

@enisdenjo
Copy link
Member

they will always yield success, though that success may just be { errors: [...] }. Or am I misunderstanding your point?

@benjie that's all fine, yielding errors: [] is ok . What I was discussing is the need of an error message type - which would not be needed if the errors: [] are yielded through next.

FWIW, I'm currently in the "opposite" team: the "let's simplify the spec by removing the need for error" team.

@martinbonnin I am on that team too! :D

I love the idea that a GraphQL stream emits data until completed. And the data can contain an error.

@benjie
Copy link
Member

benjie commented Nov 29, 2024

What I was discussing is the need of an error message type - which would not be needed if the { errors: [] } are yielded through next.

Yes, in the case of expected (normal, handled, non-internal) errors, then under https://github.com/graphql/graphql-spec/pull/1127/files the errors would be a regular payload like any other and would come through next() (followed by the iterable terminating). Only internal errors would come through throw(), and those are where something has gone so wrong that we can't represent it in a GraphQL errors payload anyway - internal errors should not happen; they indicate a bug in the implementation (or, maybe, the spec), I think?

@enisdenjo
Copy link
Member

Only internal errors would come through throw(), and those are where something has gone so wrong that we can't represent it in a GraphQL errors payload anyway

I agree, but that wont be the case if #4302 lands. Whatever the underlying error is, internal or not, #4302 will put it in the next().

The way I see it, we have two options at the moment:

  1. Land transmit source event stream errors as an error payload #4302 and have all errors be yielded. This path does not require the error message type since an exception from an iterable cannot occur. Also, without inspecting the errors, there's no way for the consumer to know the severity of the error.
  2. Reject transmit source event stream errors as an error payload #4302 and keep the current approach. This path would require the handling of errors to happen on transport/consumer level, it then would require the error message type that would indicate "something went horribly wrong".

@benjie
Copy link
Member

benjie commented Nov 29, 2024

A: If you have an internal error: that's a bug.

B: If you don't represent an internal error as an error: that's a bug.

So if B ever occurs, fix A, and you've fixed both bugs 😉

@martinbonnin
Copy link

martinbonnin commented Nov 30, 2024

I agree, but that wont be the case if #4302 lands. Whatever the underlying error is, internal or not, #4302 will put it in the next().

FWIW #4302 would remove the need for error for this case (while iterating the stream). But there is the mention of complete with error in @leebyron editorial changes:

  - Let {response} be the result of running
    {ExecuteSubscriptionEvent(subscription, schema, variableValues,
    sourceValue)}.
  - If internal {error} was raised:
    - Cancel {sourceStream}.
    - Complete {responseStream} with {error}.

I'd say whatever we decide here when iterating the event stream should also apply to processing a single event.

Re: what to decide, I think I'd prefer removing the complete with error case completely as it doesn't bring much information to client. I have yet to come accross a good use case where a client handles the error event and does something that wouldn't be possible using a regular next response event (with errors).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants