Detecting Locked Bicycle Stations: An AWS Serverless Story (Part 1)

By Jean Baptiste Muscat

This series of articles is about me spending way too much time trying to solve a niche problem while learning how to use the AWS Serverless stack (and Kafka, initially).

It’s not supposed to be a tutorial or an exhaustive presentation, but more a story of how I built and hosted a “real” application and the challenges I faced. Hopefully, you will end up with a better understanding of the AWS Serverless stack and what it means to develop with it. Enjoy!

Photo by Guillaume Bontemps / Ville de Paris

Paris, May 2020, end of the first lockdown. I decide to join some of my friends for a picnic. So, like many Parisians, I’m headed toward the Bois de Vincennes, Paris’ largest public park, on the eastern edge of the city.

The weather is perfect for a short bike ride and I quickly rent a Velib, one of the many public bikes that can be rented and returned in any of the 1500 stations scattered across the city. Arriving near the park, I look at the official app to find the nearest station with available slots to return my bike. As I approach, something seems off. I see several people, puzzled, trying to rent or return a bike. I sigh. The station is locked. Dead.

Velib was created in 2007 and quickly proved successful, becoming one of the most used bike-sharing platforms in the world. In 2017, after the end of the initial 10-year contract, the city chose a new operator to manage and develop the network. And let’s just say the change was not smooth. Three years later, the main problems have been solved but, still, Velib users face two common hurdles: broken bikes that are not removed from the stations (whose seats are turned around by disappointed cyclists to kindly warn the next user) and, more rarely, stations that are “locked” for a few hours, without any notice on the official app or website.

After finding another station, I join my friends and wash away my misadventure with a cold beer. But I can’t stop thinking about it. It’s not the first time I’ve had to find another station because one was unusable. Why didn’t the official app show the station as dead? Isn’t it easy to detect a station that has suddenly stopped renting or returning bikes? How hard could it be with the proper data?

The Velib API

Luckily, Velib exposes a public API. The same data can also be found in the Paris OpenData project, which incorporates and exposes a lot of public data about Paris (if you’ve ever looked for the dataset of every one of the 200,000 trees in Paris, it’s there!).

The API is quite simple, with two main endpoints:

  • /station_status.json which returns, in “real-time” (refreshed every minute), the content and status of each station.
//station_status.json simplified structure
{
  "data": {
    "stations": [
      {
        "is_installed": 0,
        "is_renting": 0,
        "is_returning": 0,
        "num_bikes_available_types": [
          {
            "ebike": 0,
            "mechanical": 0
          }
        ],
        "num_docks_available": 0,
        "station_id": 0
      }
    ]
  }
}

  • /station_informations.json which returns static information about each station (its name and location).
//station_informations.json simplified structure
{
  "data": {
    "stations": [
      {
        "lat": 0,
        "lon": 0,
        "name": "string",
        "station_id": 0
      }
    ]
  }
}
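To give a taste of what can be done with these payloads, here is a small sketch in plain JavaScript (the function name and the “locked” heuristic are mine, not part of the API): a station that is installed but neither renting nor returning is a good first candidate. A single snapshot is a weak signal, though; reliable detection needs the station’s history.

```javascript
// Flag stations that look "locked" in a parsed station_status.json payload.
// Heuristic (mine, not the API's): installed, but neither renting nor returning.
function findSuspiciousStations(statusPayload) {
  return statusPayload.data.stations
    .filter((s) => s.is_installed === 1 && s.is_renting === 0 && s.is_returning === 0)
    .map((s) => s.station_id);
}

// A tiny hand-made payload in the shape shown above.
const payload = {
  data: {
    stations: [
      { station_id: 1, is_installed: 1, is_renting: 1, is_returning: 1 },
      { station_id: 2, is_installed: 1, is_renting: 0, is_returning: 0 },
      { station_id: 3, is_installed: 0, is_renting: 0, is_returning: 0 },
    ],
  },
};

console.log(findSuspiciousStations(payload)); // [ 2 ]
```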

I could have wished for a little more. For example, an endpoint that would return the past activity of a given station. Or maybe a callback endpoint that would warn me every time a bike is rented or returned to a station.
But this is fine. With enough time, I can collect the minute-to-minute activity of each station by polling the /station_status endpoint continuously and feeding some kind of database. And by knowing the past activity of each station, I should be able to tell when a station has stopped renting or returning bikes for too long and flag it as “locked”. Easy, right?

I’ve been wanting to learn more about Kafka and event-driven architecture for a long time. And ingesting and processing each update of the /station-status endpoint seems like a good fit for this technology. So, Kafka, it will be.

One event at a time

First, a quick Kafka refresher.

Kafka is a scalable persistent event log system that was originally developed at LinkedIn and has since been open-sourced. Some of its original developers left LinkedIn to create Confluent, a company dedicated to improving the Kafka ecosystem.

You can interact with Kafka in two ways: as a “producer”, which sends new events to be appended to a log (a “topic”), or as a “consumer”, which receives the events from the topic. Contrary to a message queue, events are not pushed synchronously to the consumers. They are stored in the log and can be consumed now or later. Kafka keeps track of the last event received by each consumer, allowing them to consume the events in their own time, at their own rate. If a consumer fails to process an event, the event itself is not lost and can be retried.
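This log-plus-offsets model can be illustrated with a toy in-memory version (plain JavaScript; real Kafka adds partitioning, persistence, and much more, so this is only the mental model):

```javascript
// Toy in-memory "topic": an append-only log plus one offset per consumer.
// Reading never removes events; each consumer just advances its own offset.
class ToyTopic {
  constructor() {
    this.log = [];     // the append-only event log
    this.offsets = {}; // consumerId -> index of the next event to read
  }
  produce(event) {
    this.log.push(event);
  }
  consume(consumerId) {
    const offset = this.offsets[consumerId] || 0;
    const events = this.log.slice(offset);
    this.offsets[consumerId] = this.log.length; // "commit" the new offset
    return events;
  }
}

const topic = new ToyTopic();
topic.produce('bike rented');
topic.produce('bike returned');
console.log(topic.consume('fast')); // both events
topic.produce('bike rented');
console.log(topic.consume('fast')); // only the new event
console.log(topic.consume('late')); // all three events, at its own pace
```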

A Kafka topic with multiple consumers

This is with only one topic. But nothing prevents a consumer from also being a producer for another topic. That way, you can architect an application that reacts to a single source event by triggering different services in cascade. This sort of network of topics, consumers, and producers is called a topology in the Kafka world.

Fictional Kafka topology: Youtube’s video import process

Each of these services is independent. They can be written in different languages. They can be “real-time”, recurrent batches, or one-time exports. They can be scaled on their own.

Still, all those services need to interact with Kafka and often do similar things: grouping events by a given attribute, discarding some of them, and so on.

So it’s no surprise that specific tools have been created to take care of those common tasks. They range from lightweight libraries like Faust (Python) or Kafka Streams (Java) to full platforms like Flink. As I have some experience with Java, I’ll pick Kafka Streams.

Event-driven architecture is not just a Kafka thing. Other tools can be used, or even mixed with Kafka. Flink, for example, supports a variety of event sources and sinks.

The cost of Kafka

With all that new knowledge, I can start building a small application able to ingest the Velib API data and process it to detect (very imprecisely) the locked stations.

The Kafka-based prototype architecture

After a few weeks, this application grows into a sizeable Kafka + Kafka Streams + PostgreSQL + Spring Boot + Thymeleaf mess. Luckily, the whole infrastructure is created from a single Docker Compose file, which makes development easy.

To be more in tune with Kafka’s “event log” philosophy, it first transforms the /station_status payload into station-by-station incremental updates. This essentially allows me to design the rest of the application as if I could subscribe to a Velib endpoint that would warn me each time a bike is rented or returned at a specific station. From that, it computes hour-by-hour usage statistics for each station and feeds a SQL database that is itself used for the locked-station detection. It also tries to detect individual broken bikes, but without much success.
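That first transformation can be sketched as a diff between two consecutive /station_status snapshots (illustrative JavaScript; the event names and shape are made up, and the real application also distinguishes ebikes from mechanical bikes, which this sketch ignores):

```javascript
// Turn two consecutive station_status snapshots into per-station
// incremental update events (hypothetical event shape).
function diffSnapshots(previous, current) {
  const prevById = new Map(previous.data.stations.map((s) => [s.station_id, s]));
  const updates = [];
  for (const station of current.data.stations) {
    const before = prevById.get(station.station_id);
    if (!before) continue; // station unknown in the previous snapshot: skip
    const delta = station.num_docks_available - before.num_docks_available;
    if (delta !== 0) {
      updates.push({
        stationId: station.station_id,
        type: delta > 0 ? 'BIKE_RENTED' : 'BIKE_RETURNED', // more free docks = bikes left
        count: Math.abs(delta),
      });
    }
  }
  return updates;
}

const oldSnapshot = { data: { stations: [{ station_id: 1, num_docks_available: 10 }] } };
const newSnapshot = { data: { stations: [{ station_id: 1, num_docks_available: 12 }] } };
console.log(diffSnapshots(oldSnapshot, newSnapshot));
// [ { stationId: 1, type: 'BIKE_RENTED', count: 2 } ]
```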

Simplified Kafka topology

The resulting Kafka topology is quite simple (Kafka Streams creates additional hidden topics to maintain its inner state). Overall, a little more than 10 topics are used. It “works”. But each time I switch off my computer or the application itself, I create gaps in my hour-by-hour statistics database. It’s time to deploy to production.

This should be simple, as I only need a small Kafka cluster with ~10 topics and minimal replication, a small SQL instance, and one or two servers to run the rest. As learning more about AWS was also on my 2020 wish list, I will try to leverage only AWS services.

Let’s start with the Kafka cluster. What kind of offering is there?

  • Fully managed: I could use AWS MSK. The smallest instance (kafka.t3.small) costs $0.0456 per hour. I need to run it full time, so that means… $400 per year.

It seems I have a choice between spending way more than I was expecting (and it’s just for the Kafka cluster) or massively reworking my application. Maybe the real problem is not the cost of Kafka. Maybe it is that I should have taken my constraints into account when choosing my architecture.

Back to the drawing board

Photo by Diana Polekhina on Unsplash

What do I expect from this project? What should the result look like?

  • It should run “in production”, on AWS.

To ease my burden, I’ll also define what I don’t expect from this project.

  • Kafka is not mandatory. At this point, I feel like I have already learned a lot about it, and would rather focus on delivering something.

If my little experiment with Kafka has taught me one thing, it’s that I need to better understand the AWS offering and its pricing if I want to stay within my budget. So I’ll take some time off to learn about it.

λ function

AWS is a big topic. New services are added every year, and it’s easy to get lost when trying to learn about it. To help me focus on what are supposed to be its most important services for application development, I decided to take the AWS Certified Developer Associate certification. It took me a couple of months, but afterwards I felt way more confident about what could be done using AWS, and at what cost. And it allowed me to discover the serverless offering of AWS.

Already having a side project in mind helped me get the most out of what I learned along the way. So if you intend to take this certification, I can only recommend having some kind of project that you want to “migrate” to AWS.

So what is this serverless stuff? There must be some servers somewhere, right?

Let’s say you want to create some kind of simple web API. A single endpoint, GET /something, coded in JavaScript using NodeJs/Express.

Self-managed: you rent a “server”, for example, an EC2 instance from AWS. This instance lives in your private network inside AWS (your VPC or “Virtual Private Cloud”). On this instance, you are free to install whatever you want. For example NodeJs. You then upload your application source code/package/compiled artifact that will run using the runtime you installed.

If you want to scale your application, you can rent several EC2 instances and use a load balancer to dispatch incoming requests to them. You need to set up each one of them similarly.

Container platform: to make scaling easier and simplify the creation of new instances, you can use containers. Either by manually installing a container service on each EC2 instance or by using a container platform like ECS. This platform will manage the provisioning of the EC2 instances for you. You just need to provide the container and specify the number of instances to be run.

But, in the end, you still own those instances, they are part of your EC2 fleet and you can manually connect to them.

A quick note on auto scaling groups: the number of instances you use can be set manually or automatically, using an auto scaling group. You create a rule that defines how, and within which boundaries, the number of instances should evolve. CPU or memory usage is often used to trigger this kind of scaling. Such a rule could be:

“If my instances’ CPU > 80%, then increase the number of instances, up to 10. If my instances’ CPU < 30%, then decrease the number of instances, down to 1”.
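As a toy illustration, the rule above boils down to something like this (just the logic, not the real AWS Auto Scaling API):

```javascript
// Toy version of the scaling rule quoted above (not the real AWS API).
function desiredInstanceCount(cpuPercent, current, min = 1, max = 10) {
  if (cpuPercent > 80) return Math.min(current + 1, max); // scale out
  if (cpuPercent < 30) return Math.max(current - 1, min); // scale in
  return current; // within bounds: do nothing
}

console.log(desiredInstanceCount(90, 9)); // 10 (capped at the maximum)
console.log(desiredInstanceCount(10, 1)); // 1 (never below one instance)
```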

The key point is that you always need at least one instance.

Lambda: AWS Lambda is a serverless computing platform. Instead of coding a full application, you divide it into smaller “functions” that can each be called independently.

Each function can be seen as a small script that is supposed to run for a few seconds or minutes and perform a specific task. For example, you could have a function in charge of answering GET /something.

When a Lambda is triggered, an available server instance is taken from a global pool. Your source code is deployed to it and run. Then your code is removed and the instance is freed up.

A very simple Lambda function code can look like this:

//a basic javascript lambda performing a REST call
const https = require('https')
let url = "https://example.com"

exports.handler = async function(event) {
  const promise = new Promise(function(resolve, reject) {
    https.get(url, (res) => {
      resolve(res.statusCode)
    }).on('error', (e) => {
      reject(Error(e))
    })
  })
  return promise
}

So your function still runs on a server. But you don’t know which one, and you don’t care.

What about cost?

For the self-managed and container platform options, it’s quite easy: each instance type has an hourly cost. For Lambdas, it’s a little bit more complicated: each invocation costs $0.0000002 ($0.20 per million), plus $0.0000000021 per millisecond of execution time for a basic Lambda, up to $0.0000001667 per millisecond for the biggest one. More on Lambda performance in a later article.
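Using the prices quoted above for the smallest Lambda, a back-of-the-envelope estimate looks like this (a sketch that ignores the free tier and larger memory sizes):

```javascript
// Rough monthly cost of a small Lambda, using the prices quoted above
// (free tier and larger memory sizes ignored).
function monthlyLambdaCost(invocations, avgDurationMs) {
  const perInvocation = 0.0000002;     // $0.20 per million requests
  const perMillisecond = 0.0000000021; // smallest Lambda size
  return invocations * (perInvocation + perMillisecond * avgDurationMs);
}

// Polling the Velib API once a minute, ~200 ms per call, for a 30-day month:
const invocations = 60 * 24 * 30;
console.log(monthlyLambdaCost(invocations, 200)); // roughly $0.027 a month
```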

Cost comparison between the smallest EC2 instance and the smallest Lambda

The biggest takeaway is that a Lambda, unused, costs nothing. So, for my project with a very limited audience and peak usages around 8 AM and 5 PM, using Lambdas instead of having an EC2 instance twiddling its thumbs most of the time makes sense.

The serverless offering

The same principles can be applied to more than computation:

  • A database? DynamoDb (you pay for the number of items created/read)
The current AWS serverless offering

All those services still rely on servers, but those servers are abstracted from the client and the pricing is derived from the usage instead of the infrastructure.

They can be used alongside a classic application. But you can also design complex applications using only serverless services. And what makes everything stick together are the Lambda functions. Or, more exactly, what triggers the functions.

A REST call? A file being pushed to S3? A new message sent in SQS? A new event in a Kafka topic? Almost any AWS service can trigger a Lambda function. And AWS provides event managers (like EventBridge) and schedulers (like Step Functions) for more complex use cases.

Some triggers are very simple (“fire and forget”), but others can be more complex. For example, DynamoDb (a serverless document database) can publish events (object inserted/updated/deleted) as a “stream” that can itself trigger a function. Each function subscribing to a DynamoDb Stream has its own queue in which events pile up, and can process those events at its own pace. And if a Lambda invocation fails, the event is kept to allow for retries. Sounds a little bit like Kafka, no?

So, by combining DynamoDb and Lambda, I can recreate the same kind of “topology” as in my first Kafka prototype.

Side note: the real “serverless” Kafka at AWS is Kinesis. But, in 2020, DynamoDb only supported publishing events to DynamoDb Streams. Support for publishing from DynamoDb to Kinesis was added in 2021.

Let’s go back to the cost. One thing I did not mention in my initial discussion about the cost of a Kafka cluster is the AWS Free Tier.

To attract new clients, AWS has free offers on some of its services. For example, you can have a small EC2 instance “for free”, but only for the first year. Luckily for me, the Lambda and DynamoDb offers are quite generous and, most importantly, they do not expire.

AWS Free Tier for Dynamo and Lambda

So the plan is simple: re-architect my application to leverage the AWS serverless services (mainly Lambda and DynamoDb) while staying within the constraints of the free tier.

Let’s do this!