Add service canary documents (#108)
* Add service canary documents

* Delete empty line

* Use consistent arch in operator test

* Add multiple canaries guide

* Refine some stuff

* Correct 6th diagram of paper

* Add canary chosen table

* Add newline in table

* Make team name more clear
xxx7xxxx authored Dec 30, 2021
1 parent 6e40c35 commit c6f6f4c
Showing 21 changed files with 315 additions and 16 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -401,4 +401,4 @@ EaseMesh is under the Apache 2.0 license. See the [LICENSE](./LICENSE) file for

## 11. User Manual

See [EaseMesh User Manual](./docs/user_manual.md) for details.
See [EaseMesh User Manual](./docs/user-manual.md) for details.
4 changes: 1 addition & 3 deletions README.zh-CN.md
@@ -1,5 +1,3 @@


# EaseMesh

- [EaseMesh](#easemesh)
@@ -404,4 +402,4 @@ EaseMesh采用Apache 2.0许可证。详情请见[LICENSE](./LICENSE)文件。

## 11. 用戶手冊

详情请见[EaseMesh用戶手冊](./docs/user_manual.md)。
详情请见[EaseMesh用戶手冊](./docs/user-manual.md)。
Binary file added docs/imgs/multiple-canaries-guide-01.png
Binary file added docs/imgs/multiple-canaries-guide-02.png
Binary file added docs/imgs/multiple-canaries-guide-03.png
Binary file added docs/imgs/multiple-canaries-guide-04.png
Binary file added docs/imgs/multiple-canaries-guide-05.png
Binary file added docs/imgs/multiple-canaries-guide-06.png
Binary file added docs/imgs/service-canary-01.png
Binary file added docs/imgs/service-canary-02.png
Binary file added docs/imgs/service-canary-03.png
126 changes: 126 additions & 0 deletions docs/multiple-canaries-guide.md
@@ -0,0 +1,126 @@

# Multiple Canaries Guide

- [Multiple Canaries Guide](#multiple-canaries-guide)
- [Background](#background)
- [Local Canary](#local-canary)
- [Global Canary](#global-canary)
- [Practical Guide](#practical-guide)
- [Explicitly Exclusive Traffic Rules](#explicitly-exclusive-traffic-rules)
- [Choose Zero or One Canary](#choose-zero-or-one-canary)
- [FAQ](#faq)

## Background

Canary deployment normally aims to test new features of services with a specific part of the traffic in the production environment. The feature is then evaluated along multiple dimensions, such as online errors, performance, and business feedback from users.

In concept the canary sounds simple, but in the real world we often need to handle more complicated situations, for example:

- Choosing one of several candidate implementations of a new feature, which technically means that multiple canaries run for one service.
- Testing different features simultaneously, so that one service has multiple canaries.
- Either situation above spanning multiple services.

In short, we need to handle explicitly how multiple canaries run in one or more services, and define the relationships among them.

## Local Canary

We start with the simplest situation, the `local canary`: if a feature needs only one service to deploy a canary, we call that canary a `local canary`. Note that there can be many local canaries at the same time, even within one service. So the definition is: a `local canary` is a canary for a feature that requires only one service to deploy it.

A local canary by itself is very simple, but the relationships between local canaries need care. We illustrate the point step by step.

In the basic example (1), 3 services represent the backend of a takeaway ordering app, and primary traffic means all traffic except canary traffic:

![image](imgs/multiple-canaries-guide-01.png)

Then in (2), `the delivery team` deploys a local canary `delivery-beijing` to test a new feature on traffic from Beijing.

![image](imgs/multiple-canaries-guide-02.png)

In (3), another team, `the restaurant team`, deploys another local canary `restaurant-beijing` to test a different new feature. If the two canaries handle some or all of the same traffic, clients might get unexpected results. For example, `restaurant-beijing` returns a cook duration and `delivery-beijing` returns a delivery duration, but the sum of the two separate durations is not consistent with the original total duration. This kind of confusing situation is definitely not what we want.

![image](imgs/multiple-canaries-guide-03.png)

So the situation illustrated in (4) is what we expect in normal scenarios. We need to explicitly split Beijing traffic into two parts, each going through a different canary. Different environments call for different solutions; we demonstrate one later.

![image](imgs/multiple-canaries-guide-04.png)

As the evolution shows, it is unsafe for local canaries to share any part of the traffic. In technical terms: local canaries do not call each other.

## Global Canary

Building on local canary, the term global canary is straightforward: it is for a feature that needs multiple services to each deploy one release in support of one canary. So we need a global canary to:

- Test a feature involving multiple services.
- Transfer traffic through service instances which belong to the same global canary.

Continuing the local canary example, we evolve it with a global canary:

![image](imgs/multiple-canaries-guide-05.png)

The principles of a global canary are almost the same as those of a local canary: it can't share traffic with other local or global canaries. But a global canary needs one more principle: it must call the release of the same global canary in another service if there is one; otherwise it falls back to the primary release. If the traffic choice violates these principles, the request can't get the whole feature, or may even hit unsafe behavior.
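
The extra principle for global canaries can be sketched as a small instance-selection function. This is a minimal sketch under assumptions, not EaseMesh's implementation: the `Instance` type, its `Release` field, the function `pickInstances`, and the canary name `global-android` are all hypothetical, and an empty release stands for the primary deployment.

```go
package main

import "fmt"

// Instance is a hypothetical view of one service instance.
// Release is empty for primary instances, otherwise the canary name.
type Instance struct {
	Addr    string
	Release string
}

// pickInstances returns the instances a request with the given color may call:
// the instances of the same canary if the service has any, otherwise the
// primary instances. It never returns instances of a different canary.
func pickInstances(all []Instance, color string) []Instance {
	var primary, same []Instance
	for _, in := range all {
		switch in.Release {
		case "":
			primary = append(primary, in)
		case color:
			same = append(same, in)
		}
	}
	if color != "" && len(same) > 0 {
		return same
	}
	return primary
}

func main() {
	delivery := []Instance{
		{Addr: "10.0.0.1", Release: ""},
		{Addr: "10.0.0.2", Release: "delivery-beijing"},
		{Addr: "10.0.0.3", Release: "global-android"},
	}
	// Colored with a canary that delivery has deployed: stay in that canary.
	fmt.Println(pickInstances(delivery, "global-android"))
	// Colored with a canary that delivery has not deployed: fall back to primary.
	fmt.Println(pickInstances(delivery, "restaurant-beijing"))
	// Uncolored traffic: primary only.
	fmt.Println(pickInstances(delivery, ""))
}
```

Falling back to primary rather than to any other canary is what keeps a request inside at most one canary along its whole chain.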

At this point we can say clearly: **a local canary is just a special case of a global canary**. So we can conclude 3 core principles for multiple canary deployments:

1. The traffic rules for choosing a canary are explicitly exclusive.
2. The complete chain of a request goes through one canary at most.
3. Normal traffic that matches no canary rule must go through primary deployments.

## Practical Guide

Before jumping into the practice, we define a term to make the wording easier.

- Color: to color a request is to give it a specific tag; in this context the color also identifies the canary the request belongs to.

### Explicitly Exclusive Traffic Rules

To satisfy this goal, we just need to color plain (uncolored) traffic at the endpoint under dedicated rules. For example, we use an integer priority to represent the coloring order, where a lower number means a higher priority. Back to the local canary example: we assign priority 4 to `restaurant-beijing` and 5 to `delivery-beijing`, so the default choice for Beijing traffic is `restaurant-beijing`. Besides this default choice, traffic that has already been colored in advance goes to its own canary regardless of the traffic rules.

![image](imgs/multiple-canaries-guide-06.png)

You may ask what happens if two canaries have the same priority. There are several possible solutions: forbid assigning the same priority, or give canaries with the same priority an explicit second-level order, such as alphabetical order.

### Choose Zero or One Canary

Based on the practice above, we just need to guarantee that traffic can't be recolored anywhere along the path, which means its color can only be initialized, never changed. For example, if an HTTP header `X-Canary-Choice` represents the color, every endpoint in the chain must leave its value unchanged once it has been set.

So far, we can write a simple snippet of pseudocode to explain it technically:

```go
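// Pseudocode: chooseCanary, sendRequestToCanary and sendRequestToPrimary
// stand for the endpoint's own rule matching and routing logic.
canary := request.Headers["X-Canary-Choice"]

if canary != "" {
	// Already colored upstream: never recolor, just follow the color.
	sendRequestToCanary(request, canary)
} else {
	// Uncolored traffic: try to color it exactly once.
	canary = chooseCanary(request)
	if canary != "" {
		request.Headers["X-Canary-Choice"] = canary
		sendRequestToCanary(request, canary)
	} else {
		sendRequestToPrimary(request)
	}
}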
```

Complete examples of the algorithm `chooseCanary` could look like:

| Traffic | Delivery Canary Table <br>(Priority, Traffic Rules, Color) | Decision | Strategy |
| :-----------------------------------: | ------------------------------------------------------------------- | :-----------------------: | :------------------------------------------------: |
| Beijing                               | 1, Beijing, Green<br>2, Beijing, Blue                               |           Green           |                 Based on Priority                     |
| Beijing&Android<br>Beijing<br>Android | 1, Beijing, Green<br>2, Android, Yellow                             | Green<br>Green<br>Yellow  | Based on Priority<br>Based on Rules<br>Based on Rules |
| Beijing&Android<br>Beijing<br>Android | 1, Android, Yellow<br>2, Beijing&Android, Blue<br>3, Beijing, Green | Yellow<br>Green<br>Yellow | Based on Priority<br>Based on Rules<br>Based on Rules |
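
The decisions in the table can be reproduced with a short sketch of `chooseCanary`. It rests on assumptions that are not part of the guide: traffic is modeled as a set of attributes, a rule such as `Beijing&Android` is modeled as a list of attributes that must all be present, and the `Canary` type is hypothetical.

```go
package main

import "fmt"

// Canary is one row of a hypothetical canary table.
type Canary struct {
	Priority int      // lower number means higher priority
	Requires []string // traffic attributes the rule needs, all of them
	Color    string
}

// matches reports whether the traffic carries every attribute the rule requires.
func matches(c Canary, traffic map[string]bool) bool {
	for _, attr := range c.Requires {
		if !traffic[attr] {
			return false
		}
	}
	return true
}

// chooseCanary scans the table and returns the color of the matching canary
// with the highest priority (ties broken by color name), or "" for primary.
func chooseCanary(table []Canary, traffic map[string]bool) string {
	var best *Canary
	for i := range table {
		c := &table[i]
		if !matches(*c, traffic) {
			continue
		}
		if best == nil || c.Priority < best.Priority ||
			(c.Priority == best.Priority && c.Color < best.Color) {
			best = c
		}
	}
	if best == nil {
		return ""
	}
	return best.Color
}

func main() {
	// Third row of the table above.
	table := []Canary{
		{1, []string{"Android"}, "Yellow"},
		{2, []string{"Beijing", "Android"}, "Blue"},
		{3, []string{"Beijing"}, "Green"},
	}
	fmt.Println(chooseCanary(table, map[string]bool{"Beijing": true, "Android": true})) // Yellow
	fmt.Println(chooseCanary(table, map[string]bool{"Beijing": true}))                  // Green
	fmt.Println(chooseCanary(table, map[string]bool{"Android": true}))                  // Yellow
}
```

Note that in the third row the `Blue` canary can never be chosen, because any traffic matching `Beijing&Android` also matches the higher-priority `Android` rule. A more specific rule needs a higher priority than the general rules it overlaps with.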

Finally, note the performance cost of selecting a canary when there are many of them. Administrators had better limit the number of canaries, for example to 5.

## FAQ

- What canary policy should be chosen?

It depends on the real business. Common ways are:

  - On percentage: the simplest policy, but inconsistent behavior for the same user may degrade the user experience.
  - On client device, geographical region, etc.: more stable than the percentage policy, but not as stable as the users policy.
  - On users: more complex than the percentage policy, but it gives more precise control over testing, for example giving VIP users higher priority to use the new canary feature.

- Which component should color traffic?

The moderate way is to let the API gateway (the traffic entry) support configurable canary coloring rules, which requires an API gateway that can easily integrate new features. But for a self-contained solution, every endpoint should be able to color traffic in general ways.
171 changes: 171 additions & 0 deletions docs/service-canary-user-manual.md
@@ -0,0 +1,171 @@
# Service Canary User Manual

- [Service Canary User Manual](#service-canary-user-manual)
- [Quick Start](#quick-start)
- [Config Explained](#config-explained)
- [Another Service Canary](#another-service-canary)
- [Service Canary Across Multiple Services](#service-canary-across-multiple-services)
- [Safety](#safety)

EaseMesh uses service canary to define rules of [canary release](https://martinfowler.com/bliki/CanaryRelease.html) for mesh services.

## Quick Start

We use 3 services to demonstrate a takeaway app, plus a delivery canary release that adds a new feature returning the road duration.

![image](./imgs/service-canary-01.png)

1. Apply takeaway app config:

```bash
$ emctl apply -f https://raw.githubusercontent.com/megaease/easemesh-demo/main/deploy/mesh/easemesh_tenant.yaml
$ emctl apply -f https://raw.githubusercontent.com/megaease/easemesh-demo/main/deploy/mesh/easemesh_order.yaml
$ emctl apply -f https://raw.githubusercontent.com/megaease/easemesh-demo/main/deploy/mesh/easemesh_restaurant.yaml
$ emctl apply -f https://raw.githubusercontent.com/megaease/easemesh-demo/main/deploy/mesh/easemesh_delivery.yaml


$ kubectl apply -f https://raw.githubusercontent.com/megaease/easemesh-demo/main/deploy/mesh/k8s_mesh_namesapce.yaml
$ kubectl apply -f https://raw.githubusercontent.com/megaease/easemesh-demo/main/deploy/mesh/k8s_order.yaml
$ kubectl apply -f https://raw.githubusercontent.com/megaease/easemesh-demo/main/deploy/mesh/k8s_restaurant.yaml
$ kubectl apply -f https://raw.githubusercontent.com/megaease/easemesh-demo/main/deploy/mesh/k8s_delivery.yaml
```

2. Try primary traffic

```bash
# Get order public node port.
$ kubectl get -n mesh-service service order-mesh-public
$ curl http://{node_ip}:{order_public_port}/ -d '{"order_id": "abc1234", "food": "bread"}'
order_id: abc1234
restuarant:
delivery_time: 2021-12-07T13:12:14
food: bread
order_id: abc1234
```

3. Add canary of delivery

```bash
$ emctl apply -f https://raw.githubusercontent.com/megaease/easemesh-demo/main/deploy/mesh/easemesh_delivery_beijing.yaml
$ kubectl apply -f https://raw.githubusercontent.com/megaease/easemesh-demo/main/deploy/mesh/k8s_delivery_beijing.yaml

$ curl http://127.0.0.1:32539/ -d '{"order_id": "abc1234", "food": "bread"}' -H 'X-Location: Beijing'
order_id: abc1234
restuarant:
delivery_time: '2021-12-07T13:22:47 (road duration: 7m)'
food: bread
order_id: abc1234
```

## Config Explained

We just introduce a new definition to describe the service canary in [delivery_beijing.yaml](https://github.com/megaease/easemesh-demo/blob/main/deploy/mesh/easemesh_delivery_beijing.yaml):

```yaml
apiVersion: mesh.megaease.com/v1alpha1
kind: ServiceCanary
metadata:
  name: delivery-mesh-beijing
spec:
  priority: 5 # The range is [1, 9] and the default is 5; the lower the number, the higher the priority.
  selector:
    matchServices: [delivery-mesh] # What services are in the canary.
    matchInstanceLabels: {release: delivery-mesh-beijing} # What instance labels are in the canary.
  trafficRules: # What characteristics of traffic are in the scope of canary.
    headers:
      X-Location:
        exact: Beijing
```

This config tells EaseMesh that traffic with the header `X-Location: Beijing` will be tagged `delivery-mesh-beijing` and will go through the instances of the service `delivery-mesh` labeled `release: delivery-mesh-beijing`.

For details about the config, refer to [service canary](https://github.com/megaease/easemesh-api/blob/main/v1alpha1/meshmodel.md#easemesh.v1alpha1.ServiceCanary).

## Another Service Canary

![image](./imgs/service-canary-02.png)

Now we deploy a restaurant canary that adds a feature predicting the cook duration, which is also tested on Beijing traffic. It reuses the same header `X-Location` to identify Beijing traffic, so EaseMesh resolves the conflict between the two canaries by their priorities. See [restaurant_beijing.yaml](https://github.com/megaease/easemesh-demo/blob/main/deploy/mesh/easemesh_restaurant_beijing.yaml):

```yaml
apiVersion: mesh.megaease.com/v1alpha1
kind: ServiceCanary
metadata:
  name: restaurant-mesh-beijing
spec:
  priority: 4
  selector:
    matchServices: [restaurant-mesh]
    matchInstanceLabels: {release: restaurant-mesh-beijing}
  trafficRules:
    headers:
      X-Location:
        exact: Beijing
```

The lower the number, the higher the priority. So traffic from Beijing will be tagged `restaurant-mesh-beijing` instead of `delivery-mesh-beijing`. If the priority of `restaurant-mesh-beijing` were 6, `delivery-mesh-beijing` would be chosen.

Multiple canaries with the same priority are sorted alphabetically by name when matching, but users had better rely on the priority instead of the name in order to get explicit results.
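
The ordering described here, priority first and name second, can be written as a sort comparator. This is a minimal sketch: the `canary` type and `sortCanaries` are hypothetical, and the priorities in `main` are chosen only to show the tie-break, they are not the values used in this manual.

```go
package main

import (
	"fmt"
	"sort"
)

// canary holds just the two fields that decide the matching order.
type canary struct {
	Name     string
	Priority int
}

// sortCanaries orders canaries the way matching is described above:
// lower priority number first, then name in alphabetical order.
func sortCanaries(cs []canary) {
	sort.Slice(cs, func(i, j int) bool {
		if cs[i].Priority != cs[j].Priority {
			return cs[i].Priority < cs[j].Priority
		}
		return cs[i].Name < cs[j].Name
	})
}

func main() {
	cs := []canary{
		{"restaurant-mesh-beijing", 5},
		{"delivery-mesh-beijing", 5},
		{"refund-android", 4},
	}
	sortCanaries(cs)
	for _, c := range cs {
		fmt.Println(c.Priority, c.Name)
	}
}
```

With equal priorities, `delivery-mesh-beijing` sorts before `restaurant-mesh-beijing` only because of its name, which is exactly the kind of implicit result that distinct priorities avoid.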

Now that we understand the mechanism, we can apply the config:

```bash
$ emctl apply -f https://raw.githubusercontent.com/megaease/easemesh-demo/main/deploy/mesh/easemesh_restaurant_beijing.yaml
$ kubectl apply -f https://raw.githubusercontent.com/megaease/easemesh-demo/main/deploy/mesh/k8s_restaurant_beijing.yaml
$ curl http://127.0.0.1:32539/ -d '{"order_id": "abc1234", "food": "bread"}' -H 'X-Location: Beijing'
order_id: abc1234
restaurant:
delivery_time: '2021-12-07T15:11:33 (cook duration: 5m)'
food: bread
order_id: abc1234
```

Now Beijing traffic goes through the new restaurant canary and does not get mixed with the delivery canary, so there is no confusion. If you want it to go through the delivery canary instead, add the header `X-Mesh-Service-Canary: delivery-mesh-beijing` to get the previous result.
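
A client that wants to pin a request to a specific canary could set that header itself. The sketch below builds the same order request as the `curl` commands above; `newOrderRequest` is a hypothetical helper, and the address `127.0.0.1:32539` is simply the one used in this manual's examples. It only constructs the request, so sending it still needs a running demo cluster.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// newOrderRequest builds the demo order request and, when canary is not
// empty, pins it to that canary with the X-Mesh-Service-Canary header.
func newOrderRequest(baseURL, canary string) (*http.Request, error) {
	body := strings.NewReader(`{"order_id": "abc1234", "food": "bread"}`)
	req, err := http.NewRequest(http.MethodPost, baseURL, body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-Location", "Beijing")
	if canary != "" {
		req.Header.Set("X-Mesh-Service-Canary", canary)
	}
	return req, nil
}

func main() {
	req, err := newOrderRequest("http://127.0.0.1:32539/", "delivery-mesh-beijing")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL, req.Header.Get("X-Mesh-Service-Canary"))
	// To actually send it: resp, err := http.DefaultClient.Do(req)
}
```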

## Service Canary Across Multiple Services

![image](./imgs/service-canary-03.png)

So what if delivery and restaurant need to be tested together for a new feature? We prepare a feature in which restaurant returns a coupon if delivery reports that the delivery time is beyond the deadline.

```bash
$ emctl apply -f https://github.com/megaease/easemesh-demo/raw/main/deploy/mesh/easemesh_android.yaml
$ kubectl apply -f https://raw.githubusercontent.com/megaease/easemesh-demo/main/deploy/mesh/k8s_delivery_android.yaml
$ kubectl apply -f https://raw.githubusercontent.com/megaease/easemesh-demo/main/deploy/mesh/k8s_restaurant_android.yaml
$ curl http://127.0.0.1:32539/ -d '{"order_id": "abc1234", "food": "bread"}' -H 'X-Phone-Os: Android'
order_id: abc1234
restaurant:
coupon: $5
delivery_time: 2021-12-07T16:54:01
food: bread
order_id: abc1234
```

The config shows it very clearly:

```yaml
apiVersion: mesh.megaease.com/v1alpha1
kind: ServiceCanary
metadata:
  name: refund-android
spec:
  priority: 5
  selector:
    matchServices: [restaurant-mesh, delivery-mesh]
    matchInstanceLabels: {release: refund-android}
  trafficRules:
    headers:
      X-Phone-Os:
        exact: Android
```

For details about the config, refer to [service canary](https://github.com/megaease/easemesh-api/blob/main/v1alpha1/meshmodel.md#easemesh.v1alpha1.ServiceCanary).

## Safety

We formulate some rules to guarantee the safety and clarity of service canary:

1. A request is tagged with at most one canary throughout the full chain (technically, the header `X-Mesh-Service-Canary` carries only one value and never changes once it has been set).
2. The tagging rule is defined without any ambiguity (ordered by priority, then name).
10 changes: 7 additions & 3 deletions docs/sidecar_protocol.md → docs/sidecar-protocol.md
@@ -7,7 +7,11 @@

There are three types of traffic that are managed by EaseMesh.

* First, the **RESTful-API HTTP traffic** for RPC inside the mesh. This traffic is invoked by Java applications with popular RPC frameworks, such as Feign, RestTemplate, and so on. EaseAgent will enhance this traffic by adding the target RPC server's name inside the HTTP header for telling the sidecar of the real handler.
* First, the **RESTful-API HTTP traffic** for RPC inside the mesh. This traffic is invoked by Java applications with popular RPC frameworks, such as Feign, RestTemplate, and so on. EaseAgent will enhance this traffic by adding the target RPC server's name inside the HTTP header for telling the sidecar of the real handler. The traffic must satisfy at least one way of:

1. Headers: `X-Mesh-Rpc-Service: {destination_service_name}`
2. Headers: `Host: {destination_service_name}` or `Host: ^(\w+\.)*{destination_service_name}\.(\w+)\.svc\..+`

* Second, the **Health-checking HTTP traffic**. This traffic is sent from the sidecar to the Java application's additional port opened by EaseAgent. The complete URI is `http://localhost:9900/health` by default. The `9900` port is opened by EaseAgent, and the sidecar queries this URI periodically to check the liveness of the Java application. After a successful deployment, the sidecar registers this instance into EaseMesh automatically once this URI returns HTTP 200.
* Third, the **Service-discovery traffic**. This traffic is invoked by the Java Spring Cloud application's RPC framework. During the lifetime of the Java application, the sidecar works as the application's service registry and discovery center. EaseMesh sidecar implements Eureka/Consul/Nacos APIs for hosting the Java application's registry and discovery requests. To make the sidecar serve as the registry and discovery center, set its address to `http://localhost:13009` inside the Java application's XML. The sidecar listens on port `13009` to handle Eureka/Consul/Nacos APIs.
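
The two header rules listed for RESTful-API HTTP traffic can be checked with a small function. This is only a sketch of the rules as written, not the sidecar's code: `isForService` is a hypothetical name, and the Host pattern is copied literally from the rule with the service name substituted in.

```go
package main

import (
	"fmt"
	"regexp"
)

// isForService reports whether a request is addressed to service according
// to the two rules above: the X-Mesh-Rpc-Service header, or the Host header
// as a bare service name or a Kubernetes-style domain.
func isForService(service, rpcServiceHeader, host string) bool {
	if rpcServiceHeader == service || host == service {
		return true
	}
	pattern := `^(\w+\.)*` + regexp.QuoteMeta(service) + `\.(\w+)\.svc\..+`
	return regexp.MustCompile(pattern).MatchString(host)
}

func main() {
	fmt.Println(isForService("order", "order", ""))
	fmt.Println(isForService("order", "", "order.default.svc.cluster.local"))
	fmt.Println(isForService("order", "", "order.example.com"))
}
```

One thing to keep in mind when reading the pattern literally: in Go's regular expression syntax `\w` matches only letters, digits, and underscores, so a namespace segment containing a hyphen would not match `(\w+)` as written.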

@@ -39,11 +43,11 @@ The ports used by EaseMesh sidecar+agnet system

To support non-Java-Spring-Cloud RESTful-API applications, regardless of the programming language used, the application must follow the protocol below


1. It must serve as standard RESTful-API for handling requesting or invoking RPC.

2. It must use a domain for discovering in RESTful-API RPC.
```

```plain
Requirement:
1. Use coreDNS with easemesh specific plugin
2. Valid domain formats:
File renamed without changes.
2 changes: 1 addition & 1 deletion emctl/cmd/client/command/flags/flags.go
@@ -116,7 +116,7 @@ spec:
type: "DirectoryOrCreate"`

// DefaultEasegressImage is default name of Easegress docker image
DefaultEasegressImage = "megaease/easegress:latest"
DefaultEasegressImage = "megaease/easegress:easemesh"
// DefaultEaseMeshOperatorImage is default name of the operator docker image
DefaultEaseMeshOperatorImage = "megaease/easemesh-operator:latest"
// DefaultShadowServiceControllerImage is default name of the shadow service docker image
4 changes: 2 additions & 2 deletions emctl/cmd/client/command/meshclient/zz_ingress_gen.go
@@ -28,10 +28,10 @@ import (
"net/http"
)

type ingressGetter struct {
type ingressInterface struct {
client *meshClient
}
type ingressInterface struct {
type ingressGetter struct {
client *meshClient
}

4 changes: 2 additions & 2 deletions emctl/cmd/client/command/meshclient/zz_resilience_gen.go
@@ -28,10 +28,10 @@ import (
"net/http"
)

type resilienceGetter struct {
type resilienceInterface struct {
client *meshClient
}
type resilienceInterface struct {
type resilienceGetter struct {
client *meshClient
}

4 changes: 2 additions & 2 deletions emctl/cmd/client/command/meshclient/zz_servicecanary_gen.go
@@ -28,10 +28,10 @@ import (
"net/http"
)

type serviceCanaryInterface struct {
type serviceCanaryGetter struct {
client *meshClient
}
type serviceCanaryGetter struct {
type serviceCanaryInterface struct {
client *meshClient
}
