It is quite evident at this stage that we are dealing with a massively parallel computing problem and that we require a cluster. Traditional on-premises clusters force a one-size-fits-all approach to cluster infrastructure. The cloud, by contrast, offers a wide range of possibilities and allows both performance and cost to be optimized.
You can define your entire workload as code (infrastructure as code) and update it alongside the application code in the cloud. This enables you to automate repetitive processes and procedures: you can reproduce infrastructure and implement operational procedures consistently, including automating the job submission process and the responses to events such as job start, completion, or failure.
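The event-driven automation described above can be sketched as a simple dispatch table: each job lifecycle event is mapped to a handler, mirroring how rules and triggers would be wired up in infrastructure-as-code. All names and messages here are hypothetical, for illustration only.

```python
# Hypothetical sketch: route job lifecycle events (start, complete, failure)
# to automated responses, instead of handling them manually.

def on_start(job_id: str) -> str:
    return f"{job_id}: started, logging begun"

def on_complete(job_id: str) -> str:
    return f"{job_id}: completed, results archived"

def on_failure(job_id: str) -> str:
    return f"{job_id}: failed, resubmitting"

HANDLERS = {"start": on_start, "complete": on_complete, "failure": on_failure}

def dispatch(event_type: str, job_id: str) -> str:
    # Route the event to its handler; an unknown event type raises a KeyError.
    return HANDLERS[event_type](job_id)
```

In a real deployment the same mapping would live in the infrastructure definition, so the responses are versioned and reproduced consistently with the rest of the stack.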
AWS, like any other cloud provider, offers the capability to design the cluster around the application. With an individual cluster for each application, a one-size-fits-all model is no longer necessary. When running a variety of applications on AWS, a different architecture can be used to meet each application's demands, allowing for the best performance while minimizing cost.
What cloud architecture is best suited to this type of workload: a massively parallel job of running 150K containers, each computing an event footprint?
Let us delve a little deeper into this massively parallel workload by asking a series of questions and, at the end, ascertaining an appropriate cloud architecture.
- Are our workloads embarrassingly parallel? Yes. As mentioned before, our workload is embarrassingly parallel: computations have little or no dependency on one another, and the workload is not iterative.
- Do our workloads vary in storage requirements? No. Our workload does not differ in storage requirements; storage is driven by the desired performance and reliability for transferring, reading, and writing the data.
- Do our workloads vary in compute requirements? Yes. Some event footprints take a few seconds and others a few hours, hence the need to choose an appropriate memory-to-compute ratio for the containers. You could optimize for a single sweet-spot ratio (2 GB RAM : 1 vCPU) for the entire workload, or customize the ratio (1 GB RAM : 1 vCPU, 2 GB RAM : 1 vCPU) for each workload.
- Do our workloads require high network bandwidth or low latency? No. Because our workloads do not typically interact with each other, their feasibility and performance are not sensitive to the bandwidth and latency of the network between containers. Cluster placement groups are therefore unnecessary in our case: they weaken resiliency without providing a performance gain.
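The per-workload ratio customization mentioned above can be sketched as a small sizing function. The thresholds are illustrative assumptions, not measured values; in practice they would come from profiling each event footprint's peak memory.

```python
# Hypothetical sketch: choose a memory-to-vCPU ratio per task rather than a
# single sweet-spot ratio for the entire workload. Thresholds are illustrative.

def container_size(estimated_memory_gb: float) -> tuple:
    """Return (memory_gb, vcpus) for a task, keeping whole-unit ratios."""
    if estimated_memory_gb <= 1.0:
        return (1, 1)  # 1 GB RAM : 1 vCPU for lighter event footprints
    return (2, 1)      # 2 GB RAM : 1 vCPU otherwise

def assign_ratios(peak_memory_estimates):
    # One container size per event footprint, driven by its memory profile.
    return [container_size(m) for m in peak_memory_estimates]
```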
After analyzing the answers to the questions above, the architecture that lends itself to such a design is a loosely coupled cloud architecture. Note that scalability is conspicuously absent from the list of questions: it is a given, since we are dealing with a massively parallel system.
Loosely coupled applications are found in many areas, including Monte Carlo simulations, image processing, genomics analysis, and Electronic Design Automation (EDA). The loss of one node or job in a loosely coupled workload usually doesn’t delay the entire calculation. The lost work can be picked up later or omitted altogether. The nodes involved in the calculation can vary in specification and power.
The loosely coupled cloud journey often leads to an entirely serverless environment, meaning that you can concentrate on your applications and leave the server provisioning responsibility to managed services. You can run code without the need to provision or manage servers. You pay only for the compute time you consume — there is no charge when your code is not running. You upload your code or your container, and the system takes care of everything required to run and scale your code.
Scalability is another advantage of the serverless approach. Although each task may be modest in size — for example, a compute core with some memory — the architecture can spawn thousands of concurrent nodes, thus reaching a large compute throughput capacity.
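A back-of-the-envelope calculation makes the throughput point concrete. Assuming (illustratively) an average task time and a concurrency ceiling, the idealized wall-clock time is just the number of "waves" of concurrent tasks:

```python
import math

# Idealized makespan for a serverless fan-out: tasks run in waves of
# max_concurrency. The concurrency ceiling and task time are assumptions.

def makespan_hours(num_tasks, avg_task_minutes, max_concurrency):
    waves = math.ceil(num_tasks / max_concurrency)
    return waves * avg_task_minutes / 60
```

For example, 150,000 tasks averaging 10 minutes each, at 5,000 concurrent containers, finish in roughly 30 waves, about 5 hours of wall-clock time, versus over two years if run sequentially.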
Serverless in AWS is often equated with Lambda functions, but Lambdas are not the only serverless compute engine in AWS, and they do have their limitations. One of the main limitations is the available compute time (15 minutes). In our project, the compute time per task varies from a few seconds to hours, hence the choice of Fargate, a serverless compute engine for containers. The other limitation is the choice of programming languages available in Lambda functions: Lambda functions do not have support for FORTRAN code, and we needed to scale FORTRAN code. One of the best ways to do that is to scale it using containers.
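The two constraints above, runtime duration and language support, reduce the engine choice to a simple decision rule. The function, the runtime set, and the thresholds below are an illustrative sketch of that reasoning, not an official AWS API.

```python
# Hedged sketch of the selection logic: Lambda's 15-minute cap and its set of
# supported runtimes push long-running FORTRAN tasks to containers on Fargate.

LAMBDA_MAX_MINUTES = 15
# Illustrative subset of Lambda-managed runtimes; FORTRAN is not among them.
LAMBDA_RUNTIMES = {"python", "node", "java", "go", "ruby", "dotnet"}

def choose_engine(estimated_minutes: float, language: str) -> str:
    if (estimated_minutes <= LAMBDA_MAX_MINUTES
            and language.lower() in LAMBDA_RUNTIMES):
        return "lambda"
    return "fargate"  # containerize (e.g., the FORTRAN binary) and run on Fargate
```

Applied to our workload, any event footprint estimated to run for hours, or any task wrapping the FORTRAN model, lands on Fargate.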