Performing Machine Learning (especially at scale) involves substantial coordination and countless technologies. Until very recently, this required setting up custom environments tailored to specific Machine Learning tasks. However, recent technological advancements have made it possible to create complex Machine Learning workflows that are portable to new environments and runnable with the click of a button.
FlyteHub is a new, free public registry of open-source Machine Learning workflows. With FlyteHub, you can click-to-run any open-source workflow and perform powerful machine learning at scale without writing any code. Since these workflows are portable, they can run on your laptop, or across a cluster of 1000 machines.
Imagine the following scenario:
- A Data Scientist writes a workflow to perform image recognition.
- The Data Scientist realizes that others might benefit from their work, and uploads the workflow to FlyteHub.
- Because FlyteHub workflows are portable, anyone with FlyteHub installed can click-to-import this open-source workflow and perform image recognition without writing any code, or installing any additional packages.
- These users can contribute back to the project, improving the workflow for everyone.
FlyteHub is only possible now due to several industry advancements, including “Flyte”, which was launched just weeks ago at KubeCon. Try FlyteHub now, or read on to learn about the timeline of advancements that made open-source workflows possible today.
Along the way, we will show how open-source workflows can be used to hunt and capture the elusive Sasquatch (better known as “Bigfoot”).
Similar workflows could be used to uncover “Lazarus species” (creatures like Omura’s whale that were once thought to be extinct), or track the ultra-rare Siberian tiger, whose numbers are thought to be below 5,000.
Let’s take it way back to 2013.
Data from the BFRO (Bigfoot Field Researchers Organization) shows multiple Bigfoot sightings in King County just outside Seattle. We’ve set up cameras in the woods near the sighting points that automatically take a photo every few seconds.
We don’t have time to look at all the photos. Luckily, we have a data scientist who wrote some Python code he calls “Bigfoot Detector”. The code uses computer vision to scan each photo and tell us if it contains Bigfoot.
We’d like to share this code with a team in California so they can use it as well. It’s December now and reports indicate the creature may be traveling south.
Before 2013, a user would need to manually install all the necessary requirements on their computer before they could use this code.
To run our Python task, the California team would have to install the correct version of Python, and all required libraries. If the task required a different technology such as Apache Spark, they’d need to make sure that Java and Spark were installed, and configured correctly.
It would be very difficult to create portable “click-to-run” AI this way, since each task comes with a unique set of requirements. Different AI tasks might even require different versions of the same library. Even for seasoned engineers, this is a frustrating experience commonly known as “Dependency Hell”.
The California Team has no engineers, and we don’t want to put them through Dependency Hell. We need all the help we can get. Anyone in the world should be able to simply submit their photos with the click of a button, and join the hunt to capture Bigfoot.
2013: Containers Revolutionize How We Share Code
A “container” is like a piece of code that is bundled up with all of its necessary requirements. For example, our container could include Python version 3.6, and all the Python packages required to complete our Bigfoot Detector task.
Containers run in an isolated environment. If you launch the Bigfoot Detector container on your laptop, the container will run using Python 3.6, even if your laptop has Python 2 installed, or doesn’t have Python installed at all.
When you run an open-source FlyteHub workflow, all the code runs inside containers. This ensures that the code runs reliably on all environments.
Solved? Not So Fast.
Certain AI tasks can take a long time to run (especially at scale).
When we retrieve our cameras, we will have 1 million photos. If the AI code takes 10 seconds to process each image, a single computer will need 115 days to get through all 1 million photos.
I know what you’re thinking: “115 days?! The beast could be half-way to Mexico in that time!”
We need faster feedback.
To accelerate the process, we could use 100 computers and divide the images up between them. If each machine processes 10,000 images, all 1 million photos can be completed in 1 day.
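The back-of-the-envelope arithmetic above is easy to verify with a few lines of Python:

```python
SECONDS_PER_DAY = 60 * 60 * 24

photos = 1_000_000
seconds_per_photo = 10

# One machine processing every photo sequentially.
one_machine_days = photos * seconds_per_photo / SECONDS_PER_DAY
print(int(one_machine_days))  # 115

# 100 machines, each handling an equal share of 10,000 photos.
machines = 100
parallel_days = (photos / machines) * seconds_per_photo / SECONDS_PER_DAY
print(round(parallel_days, 1))  # 1.2
```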
Since the code is packaged into a container, you do not need to worry about installing any requirements on these computers. However, you still need to instruct each of the 100 machines to run the Bigfoot Detector container.
We could write code that loops 100 times, telling each machine to run the container. However, if the code assumes there are exactly 100 machines, it is not portable.
In order for this AI task to be portable, our team should be able to execute it on 100 computers and finish in a day, while the California team (who does not have 100 computers) should be able to execute the same task on a single laptop and process the images in 115 days. In other words, the code should not need to understand the infrastructure it’s running on.
2015: Kubernetes is Introduced
Kubernetes is a platform for launching containers on a “cluster” of computers. Your cluster might have just a single machine (your laptop), or 1,000 machines.
When you instruct Kubernetes to launch 100 containers, Kubernetes will schedule them on your cluster of machines as soon as possible.
Kubernetes keeps track of how many machine resources are available. If your cluster is only big enough to run 50 containers at once, Kubernetes will launch the first 50 containers immediately. As soon as one of the containers is finished running, Kubernetes will detect that there is now free space, and launch the 51st container. As more containers complete, Kubernetes will continue to launch new containers until all containers are finished.
If you only have enough resources to launch one container at a time, the Bigfoot Detector workflow will take 115 days to run, but it will complete eventually.
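This scheduling behavior can be mimicked with a toy simulation. To be clear, this is just an illustration of the idea, not how Kubernetes is actually implemented:

```python
import math

def simulate_schedule(num_containers, capacity, run_time):
    """Toy model of a scheduler: launch up to `capacity` containers at
    once; each takes `run_time` time units. As containers finish, new
    ones start, so the work completes in ceil(n / capacity) "waves"."""
    waves = math.ceil(num_containers / capacity)
    return waves * run_time

# 100 containers on a cluster that fits 50 at a time: two waves.
print(simulate_schedule(100, 50, run_time=1))  # 2

# Only room for one container at a time: fully serialized, but it
# still finishes eventually.
print(simulate_schedule(100, 1, run_time=1))   # 100
```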
FlyteHub open-source AI runs on Kubernetes. The Kubernetes cluster can be as small as a single machine like your laptop, or a large cluster in AWS. Kubernetes ensures that workflows run successfully in either scenario.
Solved? Not So Fast.
Once all these Bigfoot Detector containers have completed, we have 1 million outputs (one output per image). We need to combine these results into something useful.
To solve this, we can create another container that combines the results from each of the image processing containers, and looks for positive results.
The trouble is, ordering is important for this container.
Unlike the Bigfoot Detector containers, which can all run at the same time, we need to wait until after all the Bigfoot Detector tasks are complete before we start this last container.
This is not the only scenario where container ordering is important. Before we run the image processing, we should divide the photos into small batches so that we can launch a Bigfoot Detector container for each batch of photos. Our flow of logic looks something like this:
- Step 1: Divide the photos into small batches.
- Step 2: Launch Bigfoot Detector container for each batch of photos.
- Step 3: Combine the results from step 2, and check for any positive results.
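The three steps above can be sketched in plain Python. The `contains_bigfoot` function here is a stand-in for the real computer-vision model, which is not shown in this post:

```python
def divide_into_batches(photos, batch_size):
    """Step 1: split the photo list into fixed-size batches."""
    return [photos[i:i + batch_size] for i in range(0, len(photos), batch_size)]

def contains_bigfoot(photo):
    """Stand-in for the real computer-vision detector."""
    return "bigfoot" in photo

def detect_batch(batch):
    """Step 2: run the detector on one batch (one container's work)."""
    return [(photo, contains_bigfoot(photo)) for photo in batch]

def combine_results(batch_results):
    """Step 3: merge all batch outputs and keep only positives."""
    return [photo for batch in batch_results
            for photo, positive in batch if positive]

photos = ["pine_01.jpg", "bigfoot_closeup.jpg", "fern_02.jpg"]
batches = divide_into_batches(photos, batch_size=2)
positives = combine_results([detect_batch(b) for b in batches])
print(positives)  # ['bigfoot_closeup.jpg']
```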
Data Science processes tend to take on a form that resembles a “Workflow” of “Tasks” like this. These workflows can become very complex. Tasks in a workflow might include data gathering, data pre-processing, feature extraction, model training, model evaluation, and many more.
Some tasks (like the Bigfoot Detector) can run in parallel. Other tasks (like the “check for positives” task) need to wait for previous tasks to complete before they are launched.
Since Kubernetes does not know anything about the logic of our ordering, it doesn’t know to wait for some tasks to complete before starting other tasks.
We need a “Workflow Manager” that understands our workflow logic. The “Workflow Manager” needs to babysit the workflow while it runs. It should launch containers in Kubernetes only when all previous steps have completed. To complete our Workflow, the Workflow Manager needs to do something like this:
- Tell Kubernetes to Launch the “Divide Photos” task.
- Wait for task to complete.
- Tell Kubernetes to Launch “Bigfoot Detector” tasks for each batch.
- Wait for all Bigfoot Detector tasks to complete.
- Tell Kubernetes to Launch the final task to check for positive results.
- Store the result, and show it to the user.
- User captures Bigfoot, and travels the world with the circus.
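In plain Python, the Workflow Manager’s babysitting loop looks roughly like this. It is only a sketch: a real manager launches containers on Kubernetes and polls for their completion rather than calling functions directly:

```python
def run_workflow(divide, detect, combine, photos, batch_size=10_000):
    # 1. Launch "Divide Photos" and wait for it to complete.
    batches = divide(photos, batch_size)
    # 2. Launch one "Bigfoot Detector" per batch. A real manager runs
    #    these as parallel containers and waits for ALL of them.
    batch_results = [detect(batch) for batch in batches]
    # 3. Only now launch the final "check for positives" task.
    result = combine(batch_results)
    # 4. Store the result and show it to the user.
    return result

# Tiny stand-ins for the three task containers.
split = lambda photos, n: [photos[i:i + n] for i in range(0, len(photos), n)]
detect = lambda batch: [("bigfoot" in p) for p in batch]
combine = lambda results: sum(flag for batch in results for flag in batch)

print(run_workflow(split, detect, combine, ["a.jpg", "bigfoot.jpg"], batch_size=1))  # 1
```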
2019: Lyft Introduces Flyte
Flyte was the missing piece of the puzzle required to make workflows portable.
One of the key concepts introduced by Flyte is the “Workflow Specification”. The Workflow Specification provides a robust way to express the flow of logic in a workflow.
All workflows in Flyte must conform to this specification. Flyte can read these workflows, understand the ordering of tasks, and launch them on Kubernetes in the correct order.
Flyte not only understands ordering, but the flow of data through a workflow. Typically, the outputs of one task become the inputs to the next task. From the Workflow Specification, Flyte understands this.
Flyte makes sure that the outputs of one task are available to the next task, even if these tasks are running on entirely separate computers. Regardless of whether we run Bigfoot Detector on 1 laptop or 1,000 machines, the output of those image recognition tasks will be available to the “Check for Positives” task.
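To make the idea concrete, here is a toy workflow specification written as a plain Python dict, plus a tiny interpreter that runs tasks in dependency order and pipes each task’s outputs into its downstream tasks. Flyte’s real specification is a typed, language-neutral format, so treat this only as an analogy:

```python
# Toy spec: each task declares which tasks' outputs it consumes.
spec = {
    "divide_photos": {
        "depends_on": [],
        "fn": lambda _: [["a.jpg"], ["bigfoot.jpg"]],
    },
    "bigfoot_detector": {
        "depends_on": ["divide_photos"],
        "fn": lambda batches: [any("bigfoot" in p for p in b) for b in batches],
    },
    "check_positives": {
        "depends_on": ["bigfoot_detector"],
        "fn": lambda flags: any(flags),
    },
}

def execute(spec):
    """Run tasks in dependency order, feeding outputs downstream."""
    outputs, remaining = {}, dict(spec)
    while remaining:
        # Pick any task whose dependencies have all produced outputs.
        name = next(n for n, t in remaining.items()
                    if all(d in outputs for d in t["depends_on"]))
        task = remaining.pop(name)
        upstream = [outputs[d] for d in task["depends_on"]]
        outputs[name] = task["fn"](*(upstream or [None]))
    return outputs

print(execute(spec)["check_positives"])  # True
```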
When we upload our workflow to FlyteHub, FlyteHub stores the workflow spec. Any Bigfoot hunter can import the workflow, submit their photos, and get back results, without needing to understand any details about how it works.
In fact, the workflow was recently used by an at-home researcher in Montana to produce the following photo:
Looking at the photo, it’s pretty clear that hair of this type could only come from… the myth himself.
Today’s post just scratches the surface of possibilities unlocked by these technologies. In a follow-up, I will further describe the Flyte workflow specification and show how developers can leverage each other’s work by combining tasks into super-workflows.
Give FlyteHub a try. I hope that by collaborating on open-source workflows, we can create better products.