AlphaFold Quickstart on AWS

ProteinQure · Jul 22, 2021

AlphaFold 2.0 is widely regarded as a breakthrough milestone in predicting 3D structures of proteins using a Deep Neural Network approach. Naturally, when the AlphaFold paper was published and its source code made publicly available earlier this month, we curious folks at ProteinQure could not resist the temptation to take it for a spin ourselves. This blog post details our process for getting it running on a GPU-capable instance using the AWS Deep Learning AMI, and documents some workarounds for potentially unexpected pitfalls.

For the purposes of this post, we’ll assume you already have an AWS account and are somewhat comfortable with its use and related terminology. It is also worth mentioning that this experimentation isn’t free (but not insurmountably expensive either), and we’ll mention some cost saving techniques along the way. With this post we are aiming to provide a high-level overview of the process, but would be happy to reply in the comments with additional technical clarifications.

With that out of the way, let’s get started!

TL;DR: to just see how we fixed the GPU not being recognized, skip to step 6.

. . .

1. Pick your (virtual) hardware

You will presumably want to run the actual workload on a GPU-capable instance, but you will quickly see that a large chunk of time will be spent downloading the datasets required by AlphaFold. There’s no reason to run an expensive GPU instance for this menial task. So, we will use two instances in this experiment:

  • t3a.large instance as our downloading workhorse. It offers 2 vCPUs and 8GB of RAM at $0.0752/hr. The download will take many hours, so we can expect this to cost somewhere between $3 and $6. The reason we pick a large instance type is the archive extraction step that follows the download: if we pick a smaller instance, we may exhaust our CPU credits and the extraction will slow to a crawl.
  • p3.2xlarge instance as our AlphaFold runner. It offers one NVIDIA Tesla V100 GPU, which is perfectly capable for our purpose. AlphaFold supports multiple GPUs, but currently has some issues using them reliably.

💡 To be clear, if you have deep pockets or just don’t want to bother, it’s totally fine to do the downloading on the GPU instance and save yourself the hassle. In that case, adjust accordingly while following the walk-through.

2. Create a persistent volume to hold your data

AlphaFold databases, once downloaded and uncompressed, will require just over 2.5TB of disk space. We certainly wouldn’t want this to be accidentally lost if we terminate our instances. For this, we’ll navigate to EBS and create a dedicated data volume with the following parameters:

  • Volume type: gp3. It offers a good balance of cost and performance, and you can increase its IOPS and throughput later if necessary.
  • Capacity: 3TB. You don’t want to run out of space!
  • IOPS/Throughput: This is the tricky one. One of the steps during AlphaFold inference is bottlenecked by I/O, but you also don’t want to overpay. We recommend leaving IOPS at the default of 3,000 and bumping the throughput from the default 125MB/s to at least 300MB/s. That said, we haven’t experimentally A/B tested this, and it is unlikely to be a major performance bottleneck.

Make sure to give your volume a name so you can find it later in the EBS volume list.

Make especially sure that you stick to a single Availability Zone (e.g. us-east-2b) when you create the volume AND your instances, as you cannot attach EBS volumes to instances in different AZs! Also check that your preferred GPU instance type is available in your selected AZ.
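If you prefer the command line over the console, here is a sketch of an equivalent call (the AZ, size, throughput and Name tag are example values; pick your own):

```bash
# Create a 3TB gp3 data volume with bumped throughput, tagged so it is easy
# to find later. Use the same Availability Zone as your instances.
aws ec2 create-volume \
  --volume-type gp3 \
  --size 3072 \
  --iops 3000 \
  --throughput 300 \
  --availability-zone us-east-2b \
  --tag-specifications 'ResourceType=volume,Tags=[{Key=Name,Value=alphafold-data}]'
```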

3. Set up the downloader instance

Let’s launch a new t3a.large instance using an Ubuntu AMI (any version will do).

  1. Attach the data volume you created in the previous step, for example at /dev/sdc, which will make it visible in the OS at /dev/xvdc.
  2. Log in to the instance and verify this with lsblk — you should see /dev/xvdc as the last block device on the list.
  3. Create a partition, format it with xfs, and mount it under /data (see the sketch after this list).

  4. Install the prerequisites (covered in the sketch after this list).
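A minimal sketch of steps 3 and 4, assuming the volume shows up as /dev/xvdc (on some instance types it may instead appear as an NVMe device such as /dev/nvme1n1, so adjust the device name to whatever lsblk reports):

```bash
# Step 3: partition, format and mount the data volume.
sudo parted --script /dev/xvdc mklabel gpt
sudo parted --script /dev/xvdc mkpart primary xfs 0% 100%
sudo mkfs.xfs /dev/xvdc1
sudo mkdir -p /data
sudo mount /dev/xvdc1 /data
sudo chown ubuntu:ubuntu /data

# Step 4: install the prerequisites used later in this walk-through
# (aria2 for the download scripts, tmux for long-lived sessions).
sudo apt-get update
sudo apt-get install -y aria2 tmux
```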

4. Pull down code and databases

This is somewhat redundant with the official documentation, but worth spelling out here. To be clear, we will not be building any Docker images or testing GPU capabilities just yet (this instance has no GPU); we’re only here to download the databases.

💡 A couple of things worth noting here. First, notice that we installed tmux in the previous step. tmux is a great way to avoid losing your console session if your SSH connection drops: simply run tmux as the first command after you SSH in to the instance. A tmux tutorial is definitely out of scope for this post, but now you have a term to fuel your search if you’re not already familiar with it! Second, it is possible to speed up your downloads by 1) running them in concurrent tmux windows (or sessions), and 2) modifying the individual download scripts (under scripts/) to add the -s8 -x8 options to the aria2c commands. Note, however, that this puts additional load on the servers hosting the data, some of which belong to academic institutions. Armed with this knowledge, please use it in moderation and be a good citizen of the internet!
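If you do decide to go down that route, here is a hypothetical one-liner to apply the change once you have the repository cloned (we clone it in the next step); review the resulting scripts before kicking off any downloads:

```bash
# Prepend the extra aria2c options to every download script. Purely a
# convenience edit; inspect the scripts afterwards to confirm it did
# what you expect.
sed -i 's/aria2c /aria2c -s8 -x8 /g' scripts/download_*.sh
```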

With that out of the way, let’s download:
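Here is a sketch of the commands, run inside tmux. The /data/datasets target directory is our own choice; the all-in-one download script lives under scripts/ in the AlphaFold repository:

```bash
# Clone the AlphaFold repository onto the data volume and kick off the
# bundled download script. This downloads and unpacks all databases and
# will run for many hours.
cd /data
git clone https://github.com/deepmind/alphafold.git
cd alphafold
bash scripts/download_all_data.sh /data/datasets
```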

This will take a looooong time. While that’s happening, we can set up our AlphaFold GPU runner instance.

5. Spin up the runner

Let’s create a p3.2xlarge instance in the AWS console, using the AWS Deep Learning AMI (Ubuntu 18.04). This is actually one of the easiest steps in this entire tutorial: you don’t need to do much to this machine. Once it has started and you’ve logged in, install the couple of missing prerequisites (the rest are already present):
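As an illustration only (the exact gaps will depend on your AMI version; we’re assuming small utilities such as tmux are what’s missing):

```bash
# Illustrative only: the Deep Learning AMI already ships most of what we
# need, so this is just filling in small utilities.
sudo apt-get update
sudo apt-get install -y tmux
```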

You can skip the First Time Setup section of the official documentation: the NVIDIA container toolkit and Docker are pre-installed 🙌. Let’s check that the GPUs are usable by Dockerized workloads (as per official docs):
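The standard NVIDIA container toolkit smoke test looks along these lines (any CUDA base image tag available to you will do):

```bash
# If the toolkit is wired up correctly, nvidia-smi inside the container
# should list the V100.
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
```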

The databases are still being downloaded on the other instance, so we will only run a modified subset of commands from the Running AlphaFold section of the official docs to build the Docker image and install its prerequisites:
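A sketch of what that looks like; the clone, build and pip commands follow the official docs, and the cleanup at the end is our own twist:

```bash
# Clone the repository just to build the image.
git clone https://github.com/deepmind/alphafold.git
cd alphafold

# Build the AlphaFold Docker image, per the official docs.
docker build -f docker/Dockerfile -t alphafold .

# Install the Python dependencies needed by docker/run_docker.py.
pip3 install -r docker/requirements.txt

# Clean up: we'll run from the clone that lives on the data volume later.
cd .. && rm -rf alphafold
```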

Yes, we removed the cloned directory in the last step. Why is that? Because we already have a clone on the data disk (still busy with the download), we’ll be working from that one later, and we don’t like getting confused!

Now all we can do is wait for the downloads to complete. It’s advisable to stop (NOT terminate!) the GPU instance at this time, and go for a long walk.

Once the downloads are done…

…(and the unarchiving has finished, which takes a non-trivial amount of time as well; plan for roughly 3 hours!), we have a very precious 3TB EBS volume on our hands. We could optionally take a snapshot of it for safekeeping, depending on our future use cases.

Let’s stop our downloader instance and detach the data volume (/dev/sdc if you were following along). We can now attach it to our GPU instance, also at /dev/sdc for consistency.

Start the GPU instance and SSH into it. Mount the data volume:
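Assuming the same device naming as before (adjust to whatever lsblk reports on this instance):

```bash
# The partition we created on the downloader should show up again here.
lsblk
sudo mkdir -p /data
sudo mount /dev/xvdc1 /data
```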

You should see the contents of the volume just as they were on the downloader instance.

6. Fix the issue with GPUs not being recognized

Now we’ve finally gotten to the meat of the issue. This is what prompted the writing of this blog post in the first place.

If you were to run the AlphaFold code right now as per the docs, you might (or might not!) notice that the GPUs are not actually being used. The giveaway is some gnarliness in the logs: warnings about CUDA libraries that could not be loaded, after which the computation quietly falls back to the CPU.

But this is a GPU-enabled instance, and the tests ran fine, you might say? Well, it seems that the Docker image we’ve built is missing CUDA libs (though a pull request is open that should hopefully fix it).

Thankfully, the AWS Deep Learning AMI we’re using comes with the CUDA 11.1 and cuDNN libraries required by TensorFlow pre-installed. They are located, as expected, under /usr/local/cuda-11.1/. All we need to do is get them into the container and modify the LD_LIBRARY_PATH variable so that TensorFlow can actually find these libraries. This can easily be tested by running a Docker container, like so:
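Here is a sketch of that kind of test; the entrypoint override and the JAX one-liner are our illustration, and we’re assuming the image is tagged alphafold and has python3 on its PATH:

```bash
# Mount the host's CUDA 11.1 libraries into the container, point
# LD_LIBRARY_PATH at them (clobbering whatever was there), and ask JAX
# which devices it can see.
docker run --rm --gpus all \
  --volume /usr/local/cuda-11.1/lib64:/usr/local/cuda-11.1/lib64 \
  --env LD_LIBRARY_PATH=/usr/local/cuda-11.1/lib64 \
  --entrypoint python3 \
  alphafold -c 'import jax; print(jax.local_devices())'
# A healthy result lists a GPU device rather than only the CPU.
```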

Success! We’ve proven that there is a solution. But we’re not done yet. We now need to modify the docker/run_docker.py script to bake this fix into AlphaFold itself.

To accomplish this, we modify the docker/run_docker.py script as described in this pull request: https://github.com/ProteinQure/alphafold/pull/1/files. The change is too tedious to post here in its entirety, but we are simply doing two things:

  1. Mounting the CUDA libraries from the host;
  2. Setting the LD_LIBRARY_PATH environment variable (admittedly in a very brutal and clobbering way) to include the paths to these libraries.

And that’s all there is to it! You’re ready to Fold!

Please read the official documentation carefully and make sure you set the database dir, as well as the output paths, in your docker/run_docker.py script. Ensuring these directories exist and are writable is left as an exercise for the reader. We highly recommend keeping all of this data on your 3TB data drive. In the original script, the outputs go to /tmp/..., so pay attention.
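For illustration, a hypothetical invocation once run_docker.py has been pointed at the data volume (the FASTA path and the date are placeholders):

```bash
# Run AlphaFold on a single target. --fasta_paths and --max_template_date
# are run_docker.py flags documented upstream; the paths are our own.
cd /data/alphafold
python3 docker/run_docker.py \
  --fasta_paths=/data/targets/my_protein.fasta \
  --max_template_date=2021-07-14
```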

Finally…

Don’t forget that you’re paying for all of these resources. Watch your bill carefully. If you’re feeling adventurous, you can also try to use AWS Spot instances to reduce your spend, but this is an advanced topic requiring a non-trivial amount of automation. If you’re going down this route, we recommend running your workloads on Kubernetes. We’d be happy to share how we do this at ProteinQure!

Conclusion

In this post, we walked through the process of setting up a functional AlphaFold environment on AWS, including some cost-management techniques, and proposed a fix for an issue with missing CUDA libraries that seems to be a blocker for many people trying to run AlphaFold in their environments.