Driver: AWS

Sergei Parshev edited this page Nov 19, 2024 · 33 revisions

AWS driver

AWS is the first cloud driver you can use with Aquarium Fish: in addition to your local resources, you can now borrow resources from outside. It takes some additional preparation to integrate those instances into your infrastructure so that they work like the local environments, but sometimes it is worth it.

How the AWS driver works

Like any other remote driver, this one doesn't consume the resources of your Fish node machine; it uses cloud resources to spin up the environments.

Dedicated hosts

To spin up Mac instances you need a dedicated host, and it's pricey, because Apple decided that on AWS you cannot release the host for the first 24 hours after allocation...

But luckily we found a workaround that utilizes another "feature" of Mac machines on AWS: the mandatory scrubbing process, which runs every time you stop or terminate an instance. It puts the dedicated host in the pending state for ~1h30m, but the nice thing is that you don't have to pay for that time.

So when a dedicated host is younger than 24h and has been idle for the configured delay (scrubbing_delay), Fish just runs and terminates an empty instance on it to trigger scrubbing. It's not an optimal solution (it consumes 5-9 times more machines), but it lets you pay nothing for the time the dedicated host would otherwise just sit there.

The rest of the dedicated hosts run as usual and are released when they are no longer in use.

Budgeting for such a complicated system can be a pain, so I've added a simple simulator for this purpose. It takes workload data in CSV format (check the script header) and passes it through the same processes that happen within the Fish AWS allocator, then prints statistics like the following example (1 year of mac pipeline data with a cap of 500 hosts was used):

Simulator statistics:
Max: Instances: 112 Hosts: 500 Queue: 55 wait (minutes): 11.65

                              Jan        Feb        Mar        Apr        May        Jun        Jul        Aug        Sep        Oct        Nov        Dec
Instances       h/mon:   10038.73   11151.81   10096.55   10309.90   10730.29   10118.50   10598.17   11414.20   13024.22   10912.37    4422.56    6437.40
Hosts           h/mon:   31073.04   33830.46   34200.53   34816.49   35649.71   33562.03   35162.53   36152.82   37476.56   35056.38   14249.24   21735.57
Queue           h/mon:       0.27       3.69       9.88      14.60      18.39      14.50      13.48       6.27      16.18       2.60       6.64       0.00
Queue Mean wait s/mon:       1.26       1.91       2.21       2.65       3.68       2.66       2.32       2.30       2.25       2.23       4.11       0.00

Usage

To use the driver you need to:

  • Create the image - you can use the regular AMIs provided by AWS; Aquarium Bait support is planned in adobe/aquarium-bait#3
  • Put the AWS driver configuration & credentials into the Fish config
  • Run the Aquarium Fish node, create a Label and send an Application to receive the resource you want

Configuration

The driver options are set in the drivers section of the aquarium-fish config file:

drivers:
  - name: aws
    cfg:
      region:     string  # The AWS EC2 region to use
      key_id:     string  # IAM role credential key id
      secret_key: string  # IAM role credential secret key

      account_ids:   []string  # Trusted account IDs to filter vpc, subnet, sg, images, snapshots... Default is the same as creds account
      instance_tags: map       # Instance tags to use when this node provisions them (to identify the node for example)

      dedicated_pool: map  # Managed dedicated pools are used to allocate dedicated hosts on demand and manage their life to save some money
        <name>: map
          type: string  # Type of the dedicated hosts pool (example: "mac2.metal")
          zone: string  # Where to allocate the dedicated host (example: "us-west-2c")
          max:  uint    # Maximum dedicated hosts to allocate

          # A special optimization for Mac dedicated hosts: send them into the [scrubbing process] to
          # save money when the host can't be released because of Apple's [24 hours] minimum license limit.
          #
          # Details:
          #
          # Apple forces AWS and all of its customers to keep Mac dedicated hosts allocated for at
          # least [24 hours]. So after allocation you have no way to release the dedicated host, even if
          # you don't need it. This makes Mac hosts very pricey for any kind of dynamic allocation.
          # To work around this issue, Aquarium implements an optimization that keeps the Mac hosts
          # busy with the [scrubbing process], which is triggered when an instance is stopped or terminated
          # and puts the Mac host in the pending state for 1-2hr. That's the downside of the optimization:
          # you will not be able to use the machine until it becomes available again.
          #
          # That's why the ScrubbingDelay config exists: the Mac host needs some time to give
          # the workload a chance to utilize it. If the host is not utilized within this duration, the
          # manager starts the scrubbing process. When the host becomes old enough, the manager
          # releases it to free up space for a fresh Mac in the roster.
          #
          # * When this option is unset or 0, the optimization is disabled.
          # * When it's set, it's the duration to stay idle before allocating and terminating an empty
          #   instance to trigger scrubbing.
          #
          # The current implementation is attached to the state update, which can consume API requests,
          # so this duration should be >= 1 min, otherwise the API requests will be too frequent.
          #
          # [24 hours]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-mac-instances.html#mac-instance-considerations
          # [scrubbing process]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/mac-instance-stop.html
          scrubbing_delay: duration  # Format example: 10m30s
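
As an illustration, a filled-in driver section might look like the sketch below. Only the key names come from the reference above; the region, tag, pool name, limits and credential placeholders are hypothetical values chosen for the example.

```yaml
# Hypothetical values throughout; only the key names come from the reference above.
drivers:
  - name: aws
    cfg:
      region: us-west-2
      key_id: EXAMPLE_KEY_ID            # placeholder, not a real credential
      secret_key: EXAMPLE_SECRET_KEY    # placeholder, not a real credential

      instance_tags:
        fish_node: node-1               # hypothetical tag identifying the provisioning node

      dedicated_pool:
        mac_arm:                        # hypothetical pool name, referenced by `pool` in a Label
          type: mac2.metal
          zone: us-west-2c
          max: 4                        # hypothetical cap on allocated dedicated hosts
          scrubbing_delay: 10m          # idle time before an empty instance triggers scrubbing
```

Leaving `account_ids` out keeps the default, which is the account the credentials belong to.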
          

Account configuration:

The EC2 user doesn't need many permissions, but it's always a good idea to check the requests used in the aws driver sources or to use AWS IAM Access Advisor to remove unused permissions:

ec2:AllocateHosts                         # Used in dedicated pool to create new dedicated hosts
ec2:CopyImage                             # Used by TaskImage to re-encrypt the temporary image
ec2:CreateImage                           # Used by TaskImage to create image of the instance
ec2:CreateSnapshots                       # To create snapshots of the disks - for caching
ec2:CreateTags                            # Tag the resources we own like instances, volumes & snapshots - very useful
ec2:DeleteSnapshot                        # Used by TaskImage to complete image deletion by removing its snapshots
ec2:DeregisterImage                       # Used by TaskImage to cleanup the tmp image after re-encrypting
ec2:DescribeHosts                         # Get info about the available dedicated hosts to use them during Allocation, also used in dedicated pool for management
ec2:DescribeImages                        # Get info about the available images and find their ID's
ec2:DescribeInstanceAttribute             # Used by TaskImage to detect instance disks
ec2:DescribeInstanceTypes                 # Used to figure out the architecture of the host to find the right image for triggering mac scrubbing process
ec2:DescribeInstances                     # List the running instances
ec2:DescribeSecurityGroups                # To locate the security group by name or ID
ec2:DescribeSnapshots                     # To list the snapshots and their tags and find the latest ID
ec2:DescribeSubnets                       # To find the subnet ID
ec2:DescribeVolumes                       # Locate volumes to connect
ec2:DescribeVpcs                          # To locate the vpc ID by tag
ec2:ReleaseHosts                          # Used in dedicated pool to release dedicated hosts
ec2:RunInstances                          # Run instance duh
ec2:StopInstances                         # To make a safe snapshot after the instance shutdown
ec2:TerminateInstances                    # Terminate instances duh
kms:ListAliases                           # Find the kms key ID by alias
servicequotas:ListServiceQuotas           # Determine the limits for the project to identify the capacity
servicequotas:ListAWSDefaultServiceQuotas # Determine the limits for the project to identify the capacity

Also, triggering the Mac dedicated host scrubbing process requires the default VPC to exist; see #71 for details.

Label definition

The available options of the driver label definition are:

definition:
  driver: aws

  options:
    image:          string  # EC2 AMI ID/Name/Tag:Value of the image you want to use (Tag:Value is usually a bad idea for reproducibility)
    instance_type:  string  # EC2 instance type, [AWS Instance Types](https://aws.amazon.com/ec2/instance-types/)
    security_group: string  # EC2 VPC Security group ID/Name (not a tag) to attach to the instance
    tags:           map     # EC2 Tags to add during instance creation
    encrypt_key:    string  # KMS Key ID or Alias in format "alias/<name>" for newly created disks
    pool:           string  # Which dedicated pool (from configuration) to use to run the instance - otherwise will not use any specific pool

    userdata_format: string  # Empty if not needed or "json", "env", "ps1" to store the metadata in instance userdata field
    userdata_prefix: string  # Could be used with "env" or "ps1" format to add some prefix to each flattened key of the metadata

    # TaskImage options
    task_image_name:        string  # Name for the new image; a "-DATE.TIME" suffix is appended to it
    task_image_encrypt_key: string  # KMS Key ID or Alias in format "alias/<name>" if need to re-encrypt the newly created AMI snapshots

  resources:
    cpu:     uint    # Amount of CPUs (threads); not used, since it is defined by `instance_type`
    ram:     uint    # Amount of memory (in GB); not used, since it is defined by `instance_type`
    network: string  # Empty, VPC ID, Subnet ID or Tag:Value of vpc/subnet, if empty - will use default VPC, if VPC - will use the underused subnet of it

    disks:   map     # Disks to create/use in the VM
      <path>:        # Path of the disk device, [AWS User Guide](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/device_naming.html)
        type:  string  # Disk type and additional data in format "<type>[:iops[:throughput]]", "gp3" by default
        label: string  # Additional tags in format "<tag_key>:<tag_value>,..." - empty by default
        size:  uint    # Size of the new disk (in GB), raw disk will be created
        clone: string  # Disk Snapshot ID/Tag:Value to use as a source disk

    lifetime: duration  # Lifetime of the Resource in "1h2m3s" format. If "" or "0" - then default will be used, if negative - no timeout.

NOTICE: For security reasons, you can use names or tags (where supported) only if the resource owner is the same as the project.
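
For illustration, a filled-in definition could look like the sketch below. The key names follow the reference above; every value (AMI and snapshot IDs, security group name, tags, prefix, sizes) is a hypothetical placeholder.

```yaml
# Hypothetical values; only the key names come from the reference above.
definition:
  driver: aws

  options:
    image: ami-0123456789abcdef0         # placeholder AMI ID
    instance_type: c6a.4xlarge
    security_group: build-workers        # hypothetical security group name
    userdata_format: env
    userdata_prefix: FISH_               # hypothetical prefix for each flattened metadata key
    tags:
      team: build                        # hypothetical tag

  resources:
    cpu: 16                              # informational, the real value comes from instance_type
    ram: 32                              # informational, the real value comes from instance_type
    network: ""                          # empty: use the default VPC
    disks:
      /dev/sdf:
        type: "gp3:3000:125"             # "<type>[:iops[:throughput]]"
        label: "purpose:workspace"       # "<tag_key>:<tag_value>"
        size: 100                        # new raw 100 GB disk
      /dev/sdg:
        clone: snap-0123456789abcdef0    # placeholder snapshot ID used as the source disk
    lifetime: 1h30m
```

For a Mac workload the same shape applies, with `instance_type` set to a Mac type such as mac2.metal and `pool` set to the name of a dedicated pool from the driver configuration.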

Available ApplicationTasks

The AWS driver supports the following tasks, which can be executed while the instance is running:

Task: snapshot

Takes a snapshot of the instance disks.

  • options:
    • full:bool - with full=true will also create a snapshot of the root (image) disk
  • when:
    • ALLOCATED - executes at any time during the ALLOCATED status. Make sure you have synced the disks you want to snapshot, otherwise you risk getting incomplete data in the snapshot.
    • DEALLOCATE - executes after the status changes from ALLOCATED to DEALLOCATE, but before the actual termination procedures. It will soft-stop the instance, so you can be sure the data on the disks will be consistent.
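
A snapshot task request could be sketched as follows. The `options` and `when` fields come from the list above; the `task` key name and the overall payload shape are assumptions, so check the Fish API for the exact format.

```yaml
# Sketch of an ApplicationTask payload; the `task` key name is an assumption.
task: snapshot
when: DEALLOCATE     # soft-stops the instance first, so disk data is consistent
options:
  full: true         # also snapshot the root (image) disk
```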

Task: image

Creates a new AMI from the instance.

  • options:
    • full:bool - with full=true will include the attached disks
  • when:
    • ALLOCATED - executes at any time during the ALLOCATED status. Make sure you have synced the disks you want to create an image from, otherwise you risk getting incomplete data in the AMI.
    • DEALLOCATE - executes after the status changes from ALLOCATED to DEALLOCATE, but before the actual termination procedures. It will soft-stop the instance, so you can be sure the data on the disks will be consistent.
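
The image task works together with the TaskImage options of the label definition. The two fragments below sketch how they might be combined: the first is the relevant part of a Label definition, the second a task payload. The image name and KMS alias are hypothetical, and the `task` key name in the payload is an assumption.

```yaml
# Label definition fragment: TaskImage options (hypothetical values).
options:
  task_image_name: build-worker           # resulting AMI name gets a "-DATE.TIME" suffix
  task_image_encrypt_key: alias/my-key    # hypothetical KMS alias used to re-encrypt the AMI snapshots
---
# Sketch of an ApplicationTask payload; the `task` key name is an assumption.
task: image
when: DEALLOCATE     # soft-stops the instance before the image is created
options:
  full: false        # do not include the attached disks
```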

Examples: