EC2 Launch Times
TLDR - As of October 2023, EBS encryption negatively affects the launch performance of instances. Amazon Linux 2023 is the fastest general-purpose AMI to boot. Mid-range fixed-performance instance types boot in a similar amount of time.
I often hear developers and system administrators complaining about slow start times for EC2 instances or building overly complex solutions to improve start times.
In some scenarios, optimisation might make sense; in others, teams get bogged down in technical debt and complexity to solve an issue that doesn't exist, failing to follow MVP principles and instead overengineering a solution based on a misunderstanding of the limits and capabilities of the underlying architecture.
So, how quickly can EC2 on-demand instances launch? This article will discuss the effects caused by EBS volumes, AMIs and current-generation instance types such as t3a.2xlarge, c6a.2xlarge or m7i.16xlarge, as of October 2023.
What needs to happen?
Before an EC2 instance can be started, a few critical things need to happen:
- Creation of VPC ENI (Elastic Network Interface) - An ENI must be created within the specified VPC subnet to assign network connectivity to the instance.
- Allocation of a Public IP - If selected, an ephemeral public IP must be attached to the VPC ENI [1].
- Creation of an EBS (Elastic Block Store) root volume - An EBS volume must be created, backed by a specific AMI (Amazon Machine Image) containing the operating system and boot partition.
- Allocation of the instance onto a physical host - The EC2 control plane schedules the instance to run on a specific physical host and passes the relevant ENI and EBS information to the Nitro hardware for initialisation. This triggers steps including contacting KMS for volume encryption keys and making the network and block devices available to the host [2].
- Instance Startup - The Nitro Controller sends a message instructing the Nitro Hypervisor to start the instance [2]. The instance firmware, either UEFI (Unified Extensible Firmware Interface) or Legacy BIOS [9], reads the boot volume, the bootloader initialises the kernel, and the operating system launches.
When creating an EBS volume from a snapshot, this API operation will typically complete within a few seconds, and data from the snapshot (stored in Amazon S3) is lazy loaded upon read request into the block volume [3]. This system allows EBS volumes to be used immediately, and if the OS accesses data that hasn't been loaded, EBS downloads the requested data on the fly from Amazon S3.
As AMIs are based on snapshots stored in S3 [4], the process for boot volumes is the same. Only the parts required to boot the instance must be retrieved from the snapshot during the initial instance boot. However, subsequent boots may be faster as the blocks already exist in the EBS volume and do not need to be fetched from S3 [5].
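The lazy-loading behaviour described above can be illustrated with a toy model. This is only a sketch of the concept, not how EBS is implemented: the LazyVolume class, its block IDs and the in-memory "snapshot" dictionary are all hypothetical stand-ins.

```python
class LazyVolume:
    """Toy model of a volume whose blocks are fetched from a snapshot on first read."""

    def __init__(self, snapshot):
        self._snapshot = snapshot   # stands in for snapshot data held in S3
        self._local = {}            # blocks already copied onto the volume
        self.remote_fetches = 0     # counts slow, first-touch reads

    def read(self, block_id):
        if block_id not in self._local:
            # First access: pull the block from the snapshot (the slow path).
            self._local[block_id] = self._snapshot[block_id]
            self.remote_fetches += 1
        return self._local[block_id]


snapshot = {0: b"bootloader", 1: b"kernel", 2: b"rarely-used data"}
volume = LazyVolume(snapshot)

# First boot touches only the blocks it needs; each one is a remote fetch.
for block in (0, 1):
    volume.read(block)
first_boot_fetches = volume.remote_fetches

# A second boot reads the same blocks, which are now local.
for block in (0, 1):
    volume.read(block)
second_boot_fetches = volume.remote_fetches - first_boot_fetches

print(first_boot_fetches, second_boot_fetches)
```

In this model, block 2 is never fetched because nothing read it, which mirrors the point that only the blocks needed to boot are retrieved during the initial launch.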
Once the instance has been launched on the underlying host, its state changes to running; at this point, the instance is ready for use [6]. Status checks for instances are performed every minute to verify that the underlying systems are healthy and that the instance is reachable via an ARP request to the network interface [7]. The status checks do not indicate the second-by-second health of an instance; an instance can be healthy, doing work and accessible while the status checks are still initialising.
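If you want to observe the transition to the running state yourself rather than wait for status checks, one option is to poll the instance state. The sketch below is not the measurement method used in this article (which relies on a callback from the instance); it assumes a client object that exposes describe_instances with the same request and response shape as the boto3 EC2 client. FakeEC2Client and the instance ID are hypothetical and exist only so the example runs without AWS credentials.

```python
import time


def wait_until_running(ec2_client, instance_id, poll_interval=1.0, timeout=300.0):
    """Poll describe_instances until the state is 'running'; return seconds waited."""
    start = time.monotonic()
    while True:
        response = ec2_client.describe_instances(InstanceIds=[instance_id])
        state = response["Reservations"][0]["Instances"][0]["State"]["Name"]
        elapsed = time.monotonic() - start
        if state == "running":
            return elapsed
        if elapsed > timeout:
            raise TimeoutError(f"{instance_id} still {state} after {elapsed:.0f}s")
        time.sleep(poll_interval)


class FakeEC2Client:
    """Hypothetical stand-in for boto3.client('ec2') so the sketch runs offline."""

    def __init__(self, states):
        self._states = iter(states)

    def describe_instances(self, InstanceIds):
        state = next(self._states)
        return {"Reservations": [{"Instances": [
            {"InstanceId": InstanceIds[0], "State": {"Name": state}}]}]}


fake = FakeEC2Client(["pending", "pending", "running"])
waited = wait_until_running(fake, "i-0123456789abcdef0", poll_interval=0.0)
print(f"running after {waited:.3f}s of polling")
```

Note that polling can only resolve the state change to within one poll interval, which is one reason a callback from the instance gives a finer-grained measurement.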
The following image shows a typical boot sequence of a fresh EC2 instance.
If the instance is stopped and started, we can expect it to launch quicker, as there is no need to create ENIs or EBS volumes as they already exist [1]. Additionally, the initial boot process would've already pulled in required blocks from the AMI stored in S3, so the OS should boot faster, as it no longer has to fetch disk blocks from S3.
Testing methodology
To determine how long the launch process takes until user code is run (for example, a service is started), we will configure the instance user data [10] to run on every boot by setting the SCRIPTS-USER parameter to ALWAYS [8]. We will then call a public endpoint from the instance user data, which the test script will use to calculate the duration from the RunInstances or StartInstances command to the incoming HTTPS request.
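As a rough sketch of what such user data might look like, the snippet below builds a multi-part document with a cloud-config part and a shell script part, modelled on the approach described in [8]. The callback URL, its query parameter names and the build_user_data helper are assumptions made for illustration; they are not taken from the benchmarking code.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# cloud-config part: ask cloud-init to run the scripts-user module on every boot.
CLOUD_CONFIG = """#cloud-config
cloud_final_modules:
  - [scripts-user, always]
"""

# Shell part: report the instance ID and kernel uptime to the callback endpoint.
# The instance ID is read from the instance metadata service using an IMDSv2 token.
SCRIPT_TEMPLATE = """#!/bin/bash
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \\
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \\
  http://169.254.169.254/latest/meta-data/instance-id)
UPTIME=$(cut -d' ' -f1 /proc/uptime)
curl -s "__CALLBACK_URL__?instance_id=$INSTANCE_ID&uptime=$UPTIME"
"""


def build_user_data(callback_url):
    """Return multi-part user data that re-runs the callback script on each boot."""
    message = MIMEMultipart()
    message.attach(MIMEText(CLOUD_CONFIG, "cloud-config"))
    # The placeholder is hypothetical; it keeps shell "$" syntax out of Python formatting.
    script = SCRIPT_TEMPLATE.replace("__CALLBACK_URL__", callback_url)
    message.attach(MIMEText(script, "x-shellscript"))
    return message.as_string()


user_data = build_user_data("https://callback.example.com/report")
print(user_data)
```

The resulting string could be passed as the user data of a launch request. Because the script runs on every boot, the same instance reports back after both a fresh launch and a stop/start.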
By setting cloud-init to always run the user data, we can use this method to measure startup times from stopped instances. The callback function will report back the Instance ID and current OS Boot Duration/Uptime (/proc/uptime). The script then measures the overall duration when the HTTP callback request is processed. A summary of the test methodology can be seen in the following diagram.
The pending duration measures the time from just before the API call to when the operating system kernel is booted, including any time spent in POST or the Bootloader. This measurement is taken to ensure that any variance in latency that takes place during the API call or the POST process is captured.
The OS Boot Duration measures the time from kernel initialisation to the execution of the instance user data. In this way, we get an accurate view of how long it takes from an API call to the EC2 instance running user code.
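The relationship between the two measurements can be written as a small calculation. This is a minimal sketch assuming the test script records a timestamp just before the API call and another when the callback arrives, and that the callback carries the uptime read from /proc/uptime; the function name and the numbers are illustrative only. Any network latency on the callback itself ends up in the pending figure.

```python
def split_launch_duration(api_call_time, callback_time, uptime_seconds):
    """Split launch-to-callback time into pending and OS boot components (seconds)."""
    total = callback_time - api_call_time
    if uptime_seconds > total:
        raise ValueError("reported uptime exceeds the total measured duration")
    return {
        "total": total,
        "os_boot": uptime_seconds,           # kernel start to user data execution
        "pending": total - uptime_seconds,   # API call to kernel start
    }


# Illustrative numbers only: API call at t=100.0s, callback at t=114.0s,
# and the instance reported 8.5s of uptime.
result = split_launch_duration(100.0, 114.0, 8.5)
print(result)
```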
The code used for benchmarking is available here:
Tests that include provisioned IOPS volumes are deployed with 3000 IOPS. GP3 tests are provisioned with 3000 IOPS / 125MiB/S. All test cases use a minimum of 3 instances, and each dataset has been repeated over several weeks to verify the validity of the data. All plotted data is summarised using the median to reduce the effect of outliers on the dataset.
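To show why the median is used, here is a short comparison with the mean. The sample values are made up for illustration and are not taken from the test results.

```python
from statistics import mean, median

# Made-up launch times in seconds for one test case; one run is an outlier.
samples = [5.1, 5.3, 5.2, 19.8]

median_seconds = median(samples)
mean_seconds = mean(samples)

print(f"median: {median_seconds:.2f}s")
print(f"mean:   {mean_seconds:.2f}s")
```

A single slow launch drags the mean well away from the typical value, while the median stays close to what most runs experienced.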
EBS Volume Types
It's typical practice to enable disk encryption on EBS volumes, but can this slow down the provisioning of an instance?
In this test, we launch fresh instances using the RunInstances API with different EBS volume types and sizes.
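A test matrix like this can be expressed as a set of block device mappings for the launch request. The sketch below is an assumption about how such a matrix might be built, not the benchmarking code itself: the root_volume_mapping helper, the /dev/xvda device name (the root device differs between AMIs) and the two sizes are chosen for illustration. The dictionary keys follow the BlockDeviceMappings structure accepted by RunInstances.

```python
from itertools import product


def root_volume_mapping(volume_type, size_gib, encrypted, iops=3000,
                        throughput_mibps=125, device_name="/dev/xvda"):
    """Build one BlockDeviceMappings entry for a RunInstances request."""
    ebs = {
        "VolumeType": volume_type,
        "VolumeSize": size_gib,
        "Encrypted": encrypted,
        "DeleteOnTermination": True,
    }
    if volume_type in ("gp3", "io1", "io2"):
        ebs["Iops"] = iops                     # gp2 derives its IOPS from volume size
    if volume_type == "gp3":
        ebs["Throughput"] = throughput_mibps   # only gp3 accepts a throughput value
    return {"DeviceName": device_name, "Ebs": ebs}


# Illustrative matrix: every volume type, encrypted and unencrypted, at two sizes.
volume_types = ("gp2", "gp3", "io1", "io2")
sizes_gib = (250, 1024)
test_matrix = [
    root_volume_mapping(volume_type, size, encrypted)
    for volume_type, size, encrypted in product(volume_types, sizes_gib, (False, True))
]

for mapping in test_matrix[:2]:
    print(mapping)
```

Each entry would be passed as a single-element BlockDeviceMappings list on a separate launch, so that only the volume configuration changes between test cases.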
First, the following graph shows the seconds spent in the pending state when instances are provisioned with unencrypted volume types.
The results for unencrypted volumes are largely consistent across the gp2, gp3 and io1 types, with a slightly higher provisioning latency for the newer io2 volume type. An unencrypted gp2, gp3 or io1 volume results in minimal time spent in the pending state; typically, we can expect ~5 seconds.
The following graph shows the same volume types but with encryption enabled.
Interestingly, the newest volume type, io2, is the only consistent volume type when encryption is enabled, while the time to provision gp3 and io1 volumes scales in a near-linear fashion up to 1TiB, taking close to 40 seconds. This trend resets after 1TiB and is reproducible across multiple regions, instance types and base AMIs. I have reached out to AWS for further explanation of this behaviour, and an investigation is underway.
These findings show that an instance backed by an encrypted 250GiB gp3 volume could expect up to 16 seconds of startup delay, a considerable difference compared to an equivalent unencrypted volume's delay of around 5 seconds. Currently, provisioning encrypted volumes negatively affects instance launch time - it is unclear if this is an AWS bug or intended behaviour.
The same behaviour is not observed when stopping and starting instances; this delay is only seen when launching fresh instances with the RunInstances API.
Additionally, increasing the root volume size causes further delays, as cloud-init will attempt to resize the partition and filesystem [11]. This can range from around 8 seconds of additional boot time when resizing from 8GiB to 128GiB, to over 5 minutes of OS boot time for 8TiB. It is consistent between encrypted and unencrypted volume types, as encryption has a negligible performance impact [2].
Key Takeaways
- EBS encryption currently increases the time spent in the pending state for freshly launched instances.
- Unencrypted gp2, gp3 or io1 result in the lowest time spent in the pending instance state for freshly launched instances.
- Setting the OS volume to larger values increases the time spent resizing the partition and filesystem after the initial boot, potentially adding several seconds to the overall boot time.
Which AMI boots faster?
In the following tests, different AMIs are booted using RunInstances. The instances are then stopped and started to measure the StartInstances startup time.
EKS-related AMIs are included in the test cases, as this might be an important factor for users who wish to provision compute on-demand for Kubernetes clusters using solutions such as Karpenter.
Test Specification
c6a.2xlarge (8vCPU/16GiB)
GP3 Volume (20GiB | 3000 IOPS | 125MiB/S | Unencrypted)
The following AMIs are tested:
- Amazon Linux 2
- Amazon Linux 2023
- Amazon Linux 2 EKS (for EKS 1.28)
- Ubuntu 22.04 LTS
- Ubuntu 20.04 EKS (for EKS 1.28)
- Red Hat Enterprise Linux (RHEL) 9.2.0
As discussed previously, we can expect an instance launched with the RunInstances command to boot more slowly than one started with StartInstances. This is because several things are going on:
- Data blocks are being pulled from an AMI stored in S3 into the EBS volume
- Amazon-based AMIs are likely cached in EBS, allowing them to deliver improved launch performance over other vendors' AMIs by reducing the need to fetch blocks from S3.
- First-time cloud-init processes are running, for example, expanding the disk and filesystem to the target size
The following graph shows the typical time to launch instances with different AMIs on a c6a.2xlarge instance with an unencrypted EBS volume.
The time spent in the pending state is largely consistent for most AMIs at 6-7 seconds, with the outlier being RHEL 9.2.0. Amazon Linux 2 EKS and Amazon Linux 2023 can execute the user data within 14 seconds of making a RunInstances call. The slowest operating system to boot was Ubuntu 20.04 LTS EKS, which took 32.64 seconds and could significantly impact on-demand resource allocation to EKS clusters.
The next graph shows the duration for an instance from a stopped state to executing the user data. For completeness, the EKS AMI types are included here; however, seeing this in production environments would be unusual, as Karpenter and Managed Node Groups would always replace instances rather than stop and start them.
The data shows that Amazon Linux 2023 and Amazon Linux 2, both general-purpose operating systems, boot in largely comparable times from a stopped state. Ubuntu 22.04 LTS boots slightly slower at 14.28 seconds. This might be a deciding factor for some workloads.
The pending duration for RHEL is consistently higher than for other AMIs in both test cases. RHEL has an additional hourly licensing charge; perhaps additional work is done behind the scenes to accommodate this.
Comparing the launch and start instance graphs, it can be seen that stopping and starting an instance when required is consistently faster; however, the advantage is less pronounced for Amazon Linux 2023, at only a 4-second improvement. In addition, there is a cost tradeoff, as customers are charged for provisioned EBS volumes attached to stopped instances.
If you are looking at designing a platform based on dynamic EC2 scaling, using an AMI based on Amazon Linux 2023 would likely give the fastest response times.
Key Takeaways
- Use Amazon Linux 2023 for the fastest general-purpose AMI to boot for launches and stop/starts.
- Ubuntu 20.04 LTS EKS takes more than twice as long to boot (32.64 seconds) as Amazon Linux 2 for EKS (13.63 seconds). This will have a noticeable user-facing impact when using on-demand schedulers such as Karpenter.
- Instances boot faster after being stopped. However, the effect is less pronounced with Amazon Linux 2023. You are charged for EBS volumes attached to stopped instances, so there is a cost trade-off.
- Some AMIs might result in instances spending longer in the pending state, such as RHEL.
How about Instance Types?
With the volume and AMI data in mind, the instance types are the last major variable to change. Based on the results above, we will launch different instance types with the following configuration:
Test Specification
Amazon Linux 2023
GP3 Volume (10GiB | 3000 IOPS | 125MiB/S | Unencrypted)
Instance Types
- t3a.nano, t3a.2xlarge
- c6a.large, c6a.16xlarge, c6a.48xlarge
- m7i.2xlarge, m7i.16xlarge, m7i.48xlarge
The following graph shows the overall waiting time the user experiences, from calling the RunInstances API to the user data being executed.
The pending duration is noticeably higher for both t3a instance types than for the fixed-performance instances, perhaps due to their burstable/oversubscribed model.
Total launch time decreases as allocated vCPU increases, up to a point. The largest instance types tested, c6a.48xlarge and m7i.48xlarge, launched consistently slower than their 16xlarge equivalents.
After stopping and starting the instances, we see the following results.
As expected from previous tests, the pending durations are ~1-2 seconds quicker across the board. However, the trend of pending duration for each test case remains; for example, the t3a type is slower than the fixed-performance types.
We can see that all instance types boot faster from a stop than a fresh start, again somewhat correlated to the allocated vCPU; however, that trend is broken again at the largest c6a and m7i instances.
Finally, to investigate why the 48xlarge instances boot slower, eight instances of each size of the m7i instance type are tested, showing the Pending Duration and OS Boot Duration for a RunInstances API call.
The pending duration tends to be fairly consistent across instance sizes, with most ranging from around 5.5 to 6 seconds. However, the m7i.48xlarge is a clear outlier, consistently taking around a second more in the pending state.
It's clear from the results that a larger instance type doesn't necessarily result in a faster boot. There is potentially some additional overhead when provisioning the largest of instance types, such as additional processing when scheduling or POSTing the virtual machine, and it's likely the OS can't take advantage of all the available compute at the upper end of the instance scale.
Key Takeaways
- Fixed-performance instance types will launch faster than burstable instance types.
- The largest instance sizes are not guaranteed to launch faster than mid-range sizes.
- Launch times are largely consistent across a fixed-performance instance family. However, testing should be done for your specific use case and chosen instance type.
Closing Thoughts
This article has discussed different factors which can affect the launch times of EC2 instances. Factors such as EBS size, EBS encryption, EBS type, selected AMI and instance type can all affect the time it takes to launch an instance.
Spot instances, bare metal instances, hibernation and specialist AMIs such as Bottlerocket were beyond the scope of this article.
If you are looking to optimise your EC2 boot times, consider the following findings:
- EBS encryption currently increases the time spent in the pending state for freshly launched instances.
- Unencrypted gp2, gp3 or io1 result in the lowest time spent in the pending instance state for freshly launched instances.
- Use Amazon Linux 2023 for the fastest general-purpose AMI to boot for launches and stop/starts.
- Ubuntu 20.04 LTS EKS takes more than twice as long to boot (32.64 seconds) as Amazon Linux 2 for EKS (13.63 seconds). This will have a noticeable user-facing impact when using on-demand schedulers such as Karpenter.
- Instances boot faster after being stopped. However, the effect is less noticeable with Amazon Linux 2023. You are charged for EBS volumes attached to stopped instances, so there is a cost trade-off.
- Some AMIs might result in instances spending longer in the pending state, such as RHEL.
- Setting the OS volume to larger values increases the time spent resizing the partition and filesystem after the initial boot, potentially adding several seconds to the overall boot time.
- Fixed-performance instance types will launch faster than burstable instance types.
- The largest instance sizes are not guaranteed to launch faster than mid-range sizes.
- Launch times are largely consistent across a fixed-performance instance family. However, testing should be done for your specific use case and instance type.
Appendix
References
- [1] AWS - EC2 - Instance Lifecycle
- [2] AWS - Nitro - EBS volume attachment
- [3] AWS - Addressing I/O Latency when restoring Amazon EBS volumes
- [4] AWS - Creating an EBS-backed Linux AMI
- [5] AWS - AMI types - Storage for the root device
- [6] AWS - EC2 Run-Instances API
- [7] AWS - Status checks for your instances
- [8] AWS re:Post - Always execute user data with EC2
- [9] AWS - EC2 - Boot Modes
- [10] AWS - EC2 - Work with instance user data
- [11] AWS - Amazon Linux 2023 - Cloud-init