Moving millions of S3 Keys with AWS Batch

Written by: Oli Kingshott on Fri 25 May 2018

At Citymapper, we use AWS S3 as our data lake. We love that we can store everything on S3 - it remains unmatched for reliability, security and versatility.

Just as with any other file system, we sometimes need to move keys around within S3. Some good reasons for this are:

  • Reorganizing your data
  • Moving keys to a better-performing prefixing scheme
  • Moving keys to a new bucket with a different security policy

When you have hundreds of millions of keys, how do you move them all quickly?

Can't you just run the S3 sync command?

aws s3 sync is the fastest way to move data within S3. Under the hood, it will perform two steps in series:

  1. List all of the keys in your bucket
  2. Use this list to recursively copy any new and updated keys from the source to the destination

It is your best option if you have hundreds, thousands or even a million keys.
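
For reference, a single sync between two buckets looks something like this (bucket names are placeholders):

# Copy any new or updated keys from the source bucket to the destination bucket.
aws s3 sync s3://source-bucket/ s3://target-bucket/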

When you have hundreds of millions of keys, you have a big data problem:

  • Just listing all of the available keys takes a long time - and if the process fails, this list is lost
  • Copying 1 million small keys can take between 10 and 30 minutes
  • Copying 300 million small keys in serial could take between 50 and 150 hours

Even though aws s3 sync is really fast (and idempotent), it's no fun to babysit a long-running process on an EC2 instance.

Ideally, we would be able to break this problem down into smaller chunks:

  • We want to move a small amount of data at a time
  • We want to keep track of completion
  • We want to be able to experiment with concurrency without risking progress

We need to partition the keyspace

We suspected that the key to solving this problem (pun intended) was to copy many prefixes concurrently using a fleet of EC2 instances.

Because our keyspace has hierarchical prefixes, we can increase or decrease the amount of data we want to sync by specifying more or less of our prefix:

s3://source-bucket/2018-02-12/05/30/<logfile>

In our case, we found that each date prefix such as s3://source-bucket/2018-02-12/ contained roughly a million small keys - around 10 minutes of aws s3 sync time.
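
In other words, copying a single day of logs is one command (using the same placeholder bucket names):

# Sync roughly a million keys - one date prefix - in about 10 minutes.
aws s3 sync s3://source-bucket/2018-02-12/ s3://target-bucket/2018-02-12/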

If only we could find a way to run a distributed aws s3 sync across hundreds of date prefixes!

Meet AWS Batch

AWS Batch can enqueue up to 1 million invocations of a Docker container and execute them in parallel across a fleet of EC2 instances.

What if we were able to enqueue a couple of thousand invocations of aws s3 sync? Wouldn't that be a fast way to copy keys from one bucket to another?

With just a couple of steps, we were able to create a distributed aws-cli cluster:

  • We used the public Docker image mesosphere/aws-cli as the basis for our job definition (a sketch of what registering it might look like follows this list).
  • We created one job queue for each bucket or top-level prefix that we wanted to migrate. This allowed us to monitor how far each queue had progressed.
  • As icing on the cake, the output of each aws-cli invocation was sent straight into CloudWatch Logs!
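
As a rough sketch, registering that job definition might look like the following. The vCPU and memory numbers are illustrative, we're assuming the image's entrypoint is the aws binary itself (so the command is just its arguments), and S3 permissions would come from the job role or the instance role:

# Sketch: register a job definition that wraps aws-cli.
# "Ref::input" and "Ref::output" are filled in from the --parameters
# passed to each submit-job call.
aws batch register-job-definition \
    --job-definition-name aws-cli-cluster \
    --type container \
    --container-properties '{
        "image": "mesosphere/aws-cli",
        "vcpus": 1,
        "memory": 512,
        "command": ["s3", "sync", "Ref::input", "Ref::output"]
    }'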

Now we just needed to submit all of our date prefixes. We only had a few hundred dates, so we pasted each one into a heredoc:

#!/bin/bash
function execute () {
    # Migrates a given $DATE prefix. These are executed concurrently by AWS Batch!
    DATE=$1
    aws batch submit-job \
        --job-name "copy-logs-${DATE}" \
        --job-queue copy-logs-queue \
        --job-definition aws-cli-cluster \
        --parameters "input=s3://source-bucket/${DATE},output=s3://target-bucket/${DATE}"
}

while read -r DATE; do execute "$DATE"; done << EOF
2013-05-01
2013-05-02
2013-05-03
2013-05-04
... many dates later...
2018-05-11
EOF
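
With one queue per migration, checking progress is a matter of asking AWS Batch what's left. Something along these lines works (using the queue name from the script above):

# Jobs still being worked on
aws batch list-jobs --job-queue copy-logs-queue --job-status RUNNING

# Jobs that have finished successfully
aws batch list-jobs --job-queue copy-logs-queue --job-status SUCCEEDED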

After submitting all of our jobs to AWS Batch, all we needed to do was wait for our distributed aws-cli-cluster to...

503: Slow Down

...hit an S3 rate limit and get throttled.

Dealing with rate limits

Before embarking on a journey like this, we highly recommend reading this article. It's full of tips for getting the most performance out of S3.

With AWS Batch, we simply reduced the maximum number of vCPUs that our compute environment could provision to process our queue. We also provisioned smaller instances in an attempt to maximize network bandwidth.

By experimenting with concurrency on the fly, we were able to find a "good enough" throughput while staying under our rate limits.
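
As a rough sketch of that knob - assuming a managed compute environment named copy-logs-compute-env (a placeholder) - capping concurrency looks something like this:

# Lower the ceiling on concurrent vCPUs to stay under S3 rate limits.
aws batch update-compute-environment \
    --compute-environment copy-logs-compute-env \
    --compute-resources minvCpus=0,maxvCpus=24,desiredvCpus=24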

Victory

With 24 cores running aws s3 sync, we were able to copy 100 million S3 keys between two buckets in 24 hours.

We believe AWS when they say that S3 truly scales - we suspect that even greater performance is possible.

It's common to solve big data problems by processing smaller chunks of data concurrently. AWS Batch is a great tool for distributing work, and we look forward to solving more problems with it.