DevOps for Data Science with GCP

By Ben Weber


Deploying Production-Grade Containers for Model Serving


One of the functions of data science teams is building machine learning (ML) models that provide predictive signals for products and personalization. While DevOps has not always been considered a core responsibility of data science teams, it is becoming increasingly important as these teams start to take more ownership of running and maintaining data products. Instead of handing off code or models to an engineering team for deployment, it is becoming more common for data scientists to own systems in production. While I previously wrote a book on building scalable ML pipelines with Python, I didn’t focus much on the maintenance aspect of building data products. The goal of this post is to work through an example ML system that covers some of the aspects of DevOps for data science.

Model serving refers to hosting an endpoint that other services can call in order to get predictions from a machine learning model. These systems typically provide real-time predictions, where state about a user or web session is passed to the endpoint. Some of the methods for setting up an endpoint include using a web service such as Flask with Gunicorn, using serverless functions such as GCP Cloud Functions, and using a messaging protocol such as GCP PubSub. If you’re looking to set up a system that needs to service a large volume of requests with minimal latency, then you’ll likely want to use a containerized approach rather than serverless functions, because you’ll have more control over the runtime environment of your system and the cost can be significantly lower. For this post, we’ll show how to build a scalable ML system using Flask and Docker in combination with Google Kubernetes Engine (GKE) on Google Cloud Platform (GCP). Kubernetes is a technology that is growing in popularity for DevOps and a skill set that is increasingly in demand for data scientists.

In addition to using these tools to build a data product that scales to a large volume of requests with low latency, we’ll explore the following DevOps concerns for model serving:

  1. How do you deploy ML models with high-availability?
  2. How can you scale your deployment to match demand?
  3. How do you set up alerting and track incidents?
  4. How do you roll out new deployments of your system?

For this post, we’ll focus on tools that enable deployment and monitoring of ML models, instead of touching on other data science issues such as model training. We’ll explore the following topics in this post:

  1. Setting up a development environment for GCP
  2. Developing a Flask application for local model serving
  3. Containerizing the application with Docker
  4. Publishing the container to GCP Container Registry
  5. Using GKE to host the service across a cluster of machines
  6. Using a load balancer to distribute requests
  7. Using Stackdriver for distributed logging and metric tracking
  8. Using GKE with rolling updates to deploy new versions of the service
  9. Setting up alerts with Stackdriver

In this post, we’ll start with a Flask application and use GCP to scale it up to a production-grade application. Along the way, we’ll use Gunicorn, Docker, Kubernetes, and Stackdriver to cover DevOps issues in deploying this endpoint at scale. The full code listings for this post are available on GitHub.

Before getting started, you’ll need to set up a GCP account, which provides $300 in free credits to get started. These credits provide enough funding to get up and running with a variety of GCP tools. You’ll also need to create a new GCP project, as shown in the figure below. We’ll create a project where the name is serving and the full project id is serving-268422.

Creating a new project on GCP.

I recommend using a Linux environment for developing Python applications, because Docker and Gunicorn work best with these environments. You should be able to follow along if you have the following applications installed and set up for command line access:

  • python: The language runtime we’ll use for model serving.
  • pip: We’ll install additional GCP python client libraries.
  • docker: For containerizing our model serving endpoint.
  • gcloud: GCP command line tools.

I won’t dig into the details of installing these applications, because the setup can vary significantly based on your computing environment. One thing to note is that the gcloud tools install Python 2.7 under the hood, so make sure that your command line Python versions are not impacted by this installation. Once you have gcloud set up, you can create a credentials file for programmatic access to GCP using the following commands:
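The commands below sketch this setup with the gcloud tool, assuming the serving-268422 project id from the previous step. The owner role is used here to keep the example simple; in practice you should grant a more narrowly scoped role.

```bash
# point gcloud at the project created earlier
gcloud config set project serving-268422

# create a service account named serving
gcloud iam service-accounts create serving

# grant the service account access to the project
# (roles/owner is broad; prefer a narrower role in production)
gcloud projects add-iam-policy-binding serving-268422 \
    --member "serviceAccount:serving@serving-268422.iam.gserviceaccount.com" \
    --role "roles/owner"

# export a credentials file for programmatic access
gcloud iam service-accounts keys create serving.json \
    --iam-account serving@serving-268422.iam.gserviceaccount.com
```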

This script will create a new service account called serving with access to your GCP resources. The last command creates a credentials file called serving.json that we’ll use to connect with services such as Stackdriver. Next, we’ll use pip to install Python libraries for logging and monitoring in Google Cloud, as well as standard libraries for hosting ML models as endpoints using Python.

pip install google-cloud-logging
pip install google-cloud-monitoring
pip install pandas
pip install scikit-learn
pip install flask
pip install gunicorn

We now have the tools set up that we’ll need in order to build and monitor a web service for serving ML models.

To demonstrate how to host a predictive model as an endpoint, we’ll train a logistic regression model with scikit-learn and then expose the model using Flask. Next, we’ll use Gunicorn to run the service with multiple worker processes so that it can scale to larger request volumes.

The complete code for a sample web application that serves model requests is shown in the snippet below. The code first fetches a training dataset directly from GitHub and then builds a logistic regression model with scikit-learn. Next, the predict function is used to define an endpoint that services model requests. The passed-in parameters (G1, G2, G3, … , G10) are converted from a JSON representation into a Pandas dataframe with a single row. Each of these variables tracks if a customer has previously purchased a specific product in the past, and the output is the propensity of the customer to purchase a new product. The dataframe is passed to the fitted model object to generate a propensity score that is returned as a JSON payload to the client making the model serving request. For additional details on this code, refer to my previous post on models as web endpoints.
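A minimal sketch of such an application is shown below. It assumes the file is named serving.py and that the training data is the games-expand.csv dataset from the book’s GitHub repository; if the download fails, it falls back to a small synthetic sample so the code still runs.

```python
# serving.py: a sketch of the model serving application described above
import flask
import pandas as pd
from sklearn.linear_model import LogisticRegression

# assumed dataset location; replace with your own training data as needed
DATA_URL = ("https://github.com/bgweber/Twitch/raw/master/"
            "Recommendations/games-expand.csv")

try:
    df = pd.read_csv(DATA_URL)
    assert "label" in df.columns
except Exception:
    # fall back to a small synthetic sample if the download fails
    import numpy as np
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.integers(0, 2, size=(100, 11)),
                      columns=[f"G{i}" for i in range(1, 11)] + ["label"])

# fit a logistic regression model on the ten purchase indicators
x = df.drop(columns=["label"])
y = df["label"]
model = LogisticRegression()
model.fit(x, y)

app = flask.Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def predict():
    data = {"success": False}
    params = flask.request.get_json(silent=True) or flask.request.args
    if params:
        # build a single-row dataframe in the same column order as training
        row = pd.DataFrame([[float(params.get(col, 0)) for col in x.columns]],
                           columns=x.columns)
        # return the propensity of the customer to purchase the new product
        data["response"] = str(model.predict_proba(row)[0][1])
        data["success"] = True
    return flask.jsonify(data)

# to launch the development server, uncomment and run `python serving.py`:
# if __name__ == "__main__":
#     app.run(host="0.0.0.0", port=5000)
```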

Assuming that the file is named serving.py, to match the serving:app argument we’ll later pass to Gunicorn, we can test the application by running the following command: python serving.py. To test that the service is working, we can use the requests module in Python:
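A sketch of this test is shown below, assuming the application is running locally on Flask’s default port 5000; the payload keys match the G1 through G10 features described above.

```python
import requests

# ten binary features: has the customer purchased each product before?
payload = {f"G{i}": 0 for i in range(1, 11)}
payload["G1"] = 1  # the sample customer has purchased product G1

try:
    result = requests.post("http://localhost:5000/", json=payload)
    print(result)
    print(result.json())
except requests.exceptions.ConnectionError:
    print("Start the Flask application before running this snippet.")
```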

The output of running this code block is shown below. The results show that the web request was successful and that the model provided a propensity score of 0.0673 for the sample customer.

>>> print(result)
<Response [200]>
>>> print(result.json())
{'response': '0.06730006696024807', 'success': True}

Flask uses a single process to serve web requests, which limits the volume of traffic that the endpoint can service. To scale up to more requests, we can use Flask in combination with a WSGI server such as Gunicorn. To run the model service with Gunicorn, we simply pass the module and application name to Gunicorn, as shown below.

gunicorn --bind 0.0.0.0:8000 serving:app

After running this command, you’ll have a service running on port 8000. In order to test that the service is working, you can change the request snippet above to use port 8000 instead of port 5000.

We now have a model serving application that runs on our local machine, but we need a way of distributing the service so that it can scale to a large volume of requests. While it’s possible to provision hardware using a cloud provider to manually distribute model serving, tools such as Kubernetes provide fully-managed approaches for setting up machines to service requests. We’ll use Docker to containerize the application and then host the service using GKE.

The Dockerfile for containerizing the application is shown in the gist below. We start with an Ubuntu environment and first install Python and pip. Next, we install the required libraries and copy in our application code and credentials file. The last command runs Gunicorn to expose the model serving application on port 8000.
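A sketch of such a Dockerfile is shown below, assuming the application file is named serving.py and the credentials file is serving.json.

```dockerfile
FROM ubuntu:latest
MAINTAINER Ben Weber

# install Python and pip
RUN apt-get update \
 && apt-get install -y python3-pip python3-dev \
 && apt-get clean

# install the libraries used by the model serving application
RUN pip3 install flask gunicorn pandas scikit-learn \
    google-cloud-logging google-cloud-monitoring

# copy in the application code and the GCP credentials file
COPY serving.py serving.py
COPY serving.json serving.json

# expose the model serving application on port 8000
ENTRYPOINT ["gunicorn", "--bind", "0.0.0.0:8000", "serving:app"]
```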

The commands below show how to build the container, view the resulting Docker images on your machine, and run the container locally in interactive mode. It’s useful to test the container locally to make sure that the services you are using in your application work as expected. To test that the application works, update the request snippet above to use port 80.

sudo docker image build -t "model_service" .
sudo docker images

sudo docker run -it -p 80:8000 model_service

To use our container with Kubernetes, we’ll need to push the image to a Docker Registry where the Kubernetes pods can pull it. GCP provides a service called Container Registry that provides this functionality. To push our container to the registry, run the commands shown in the gist below.
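The commands below sketch this process, assuming the serving.json credentials file created earlier and the serving-268422 project id; the us.gcr.io hostname stores the image in the United States.

```bash
# authenticate to Container Registry using the service account key
cat serving.json | sudo docker login -u _json_key --password-stdin https://us.gcr.io

# tag the image with the registry hostname and project id
sudo docker tag model_service us.gcr.io/serving-268422/model_service

# push the image to Container Registry
sudo docker push us.gcr.io/serving-268422/model_service
```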

The first command in the snippet above passes the credentials file to the docker login command in order to sign in to the Container Registry. The next command tags the image with your GCP account ID, and the last command pushes the image to the Registry. After running these commands, you can browse to the Container Registry section of the GCP console to validate that the image was pushed successfully.

The model serving image on Container Registry.

We can now spin up a Kubernetes cluster to host the model serving application in a distributed environment. The first step is to set up a cluster by browsing to the “Kubernetes Engine” tab in the GCP console and clicking on “create cluster”. Assign model-serving as the cluster name, select a pool size of 2 nodes, and use the g1-small instance type for the node pool. This will create a small cluster that we can use for testing.
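If you prefer the command line, an equivalent cluster can be created with gcloud; the parameters below mirror the console settings described above, and the zone is an assumption you should adjust for your region.

```bash
# create a 2-node GKE cluster of g1-small instances
gcloud container clusters create model-serving \
    --num-nodes 2 \
    --machine-type g1-small \
    --zone us-central1-a
```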