General-purpose OLTP/OLAP databases are great, and not re-inventing the wheel is always a good principle. However, it doesn’t mean all query use-cases are easy to implement correctly, run quickly or at a reasonable cost. Often, there’s a significant effort to design and manage the database to make it somehow support the needed scale and complexity — not getting the performance or cost where we wanted it to be, especially as data and query volumes grow over time.
At Dynamic Yield, one such challenge was running nuanced ad-hoc queries over raw data: over many billions of end-user events (a.k.a “clickstream”) which we collect for our customers, find me users who’ve performed A, B, and C — potentially with time constraints or an ordered sequence, and filtering on fine-grained details (e.g. a specific attribute of the shirt they’ve added to cart). I wrote in more detail about the challenge here.
Our existing solution was based on Elasticsearch, but with significant code additions in the form of plug-ins to enable the kinds of conditions we needed. This made the solution considerably less out-of-the-box (there’s little documentation or advice on how to actually develop such modules beyond the “hello world” level), and cost to run has inflated significantly over time.
One other pain point was the need to ingest massive data originating from Spark into Elasticsearch — rather than simply appending new documents (which of course in itself can be resource-intensive), we are re-calculating and then re-ingesting data for a large portion of end users.
I’ve talked to a number of other SaaS vendors with similar query needs, and the general direction seems to be migrating from “classic” SQL databases to Presto, BigQuery, and the like — with significant work on modeling and pre-preparing data for performance.
This query feature being a core service that we provide, and considering our past experience of investing significant effort to adopt and customize off-the-shelf tools, a “Build” option wasn’t off the table — though of course, one should approach with healthy skepticism:
- Can we get better cost, speed and functionality for our specific needs with a new solution that runs on serverless, spots, whatever — depending on our SLA needs and cost targets?
- Do we have the high-level building blocks to build it without re-inventing too much?
- Will we have to “massage” it for years now? Is it all gonna be worth it in the final count a few years down the line, or is this yet another case of unchecked tech optimism? on the other hand: I felt like now we had a much better sense of what we needed to build, and much more insight about the limitations of existing tools, than we had years back.
The Tool: Funnel Rocket
Out of this challenge, and admittedly with the time afforded to me in my current “CTO office”-type role, we’ve set out to test the waters. Here is what we built: Funnel Rocket.
To build that engine, we used a set of tools that are each battle-tested and do their own thing really well: Pandas along with Apache Arrow’s Parquet file reader, S3, Redis, and Amazon Web Services‘ Lambda and S3. Outside the scope of Funnel Rocket, we prepare datasets with Apache Spark and write them to S3, but there is no ingestion phase — just calling Funnel Rocket’s API Server to let it know of the dataset’s location and fetch its schema.
We did create a JSON-based query spec that’s tuned to our needs, but with all effort to push any complexity down to the tools used — the design is covered extensively here.
I was actually positively surprised with how much can be achieved from gluing together this mature toolset. There’s still a lot to optimize, but initial results definitely aren’t bad, particularly when many Lambda instances are warm and when iteratively querying the same datasets — which fits our use-case of power users drilling down to research and fine-tune user segments. Being able to know exactly how much each query costs, based on the Lambda’s compute time, is also a big plus. Of course, as always there are a myriad of other costs: data preparation in Spark, storage in S3, etc. — but compared to a beefy always-on query cluster, the most significant part (i.e. actual querying) has no fixed costs and is directly correlated to actual usage.
Here are the initial results for datasets of 100 and 500 million rows each, stored in Parquet format in S3 and partitioned to either 100 or 500 files, respectively.
- Each file is handled by a single Lambda instance, meaning 100/500 instances invoked concurrently.
- The Lambda function is configured with 2GB RAM, which corresponds to a bit over 1 vCPU allocated.
Each dataset was tested both when all instances are cold, and when all are hot.
Despite being geared to be a cloud native citizen, running locally is pretty easy: a docker-compose based stack provides serverless and non-serverless workers, a local S3-compatible storage (MinIO), and there’s a public example dataset to download and play with — guided by a tutorial to the query language.
While Funnel Rocket is ready to use today, for me it’s also served as a testbed for how to construct new, cloud native solutions for scale. And I’m always up for a healthy argument! To learn more, head over to the GitHub repo.
Feature image via Pixabay.