This interview is part of the Decibel OSS Spotlight series where we showcase founders of fast-growing community-led projects that are solving really unique problems and experiencing strong community adoption.
Sudip Chakrabarti spoke to Gaurav T Kakkar and Joy Arulraj, co-creators of EvaDB, an open source database system that aims to simplify the development and deployment of multi-modal AI apps operating on both unstructured data (text documents, videos, PDFs, podcasts, etc.) and structured data (tables, vector indexes).
Gaurav and Joy shared with us their inspiration behind creating EvaDB and their vision to make it a widely adopted project.
Gaurav: I am currently a PhD student in Computer Science at Georgia Tech, advised by Joy. I did my undergrad at IIT Kanpur and worked at Adobe for a couple of years. Subsequently, I joined Joy’s lab at Georgia Tech where I started working on video analytics for my Master’s thesis. That is when I got interested in building a database system. I spent the next couple of years learning everything I could about building database systems and started my PhD research in Joy’s group. Since then, my entire PhD work has been focused on building EvaDB, which is a complete database system for unstructured and structured multi-modal data.
Joy: I am an Assistant Professor of Database Systems at the School of Computer Science at Georgia Tech. I did my PhD at CMU and was advised by Professor Andy Pavlo, one of the leading researchers in database systems. Back then, I was focused on building database systems tailored for emerging hardware technologies. As a part of my PhD research, I led the development of the Peloton DBMS that was explicitly tailored for non-volatile memory technology, improving performance and availability while reducing cost and offering a great developer experience. When I joined Georgia Tech as an Assistant Professor, I started working on data systems and deep learning, bringing the best principles of relational database systems to the world of AI. The focus of my research is to make it easy for people to get insights from large datasets. For example, one of our early open source projects is SQLCheck, which automatically finds anti-patterns in SQL that can slow down query performance; it has a large user community. Our core thesis is that you shouldn’t need to be an expert in the internals of a database system to get insights from your data or build applications that are powered by that data - EvaDB is the latest project in that same direction.
Joy: When we started the research project that eventually became EvaDB, we were focusing on creating a system that could produce analytics on video data. Hence, we named the first version of the project EVA, short for Exploratory Video Analytics, and when we created a full database system we decided to go with EvaDB as a nod to the origin of the project.
Joy: The motivation behind creating EvaDB goes all the way back to my graduate research at CMU when I worked with the Intel Science and Technology Center (ISTC) for Visual Cloud Systems. While deep learning has revolutionized analyzing and searching images, many of those approaches do not scale to video, which is the richest online data source and has mission-critical use cases ranging from self-driving cars to security cameras. I started applying deep learning techniques to video search and analytics and continued that work when I joined Georgia Tech. We started working on EvaDB to enable search and analytics for unstructured data using deep learning by building an end-to-end database system, starting all the way from a SQL parser, through a query optimizer, to a query execution engine, and all the way down to a storage engine.
Since then, the vision behind EvaDB has significantly expanded to create an end-to-end DBMS that empowers software developers to build multi-modal AI applications encompassing unstructured data (text documents, videos, PDFs, podcasts, etc.) and structured data (tables, vector indexes).
Gaurav: Today, to use an AI model, the user needs to program against multiple low-level libraries, like PyTorch, Hugging Face, OpenAI, etc. This tedious process often leads to a complex AI app that glues together these libraries to accomplish the given task. This programming complexity prevents people who are experts in other domains from benefiting from these models.
For example, say you want to build a simple application to search and analyze a lot of podcasts. Without EvaDB, your workflow today would be the following: first, you will need to find and choose a model from Hugging Face to process the audio and convert speech to text; then you would need to figure out where and how to store the extracted text in a database system; next, you will need to write SQL queries against that database system to serve the search application; and finally, you will need to deploy the application in production and monitor its performance. All of this requires building a complex data pipeline and collaboration across multiple teams - data scientists, ML and data engineers, and software developers.
With EvaDB, we fuse all these steps together into one. You could just store the audio data from the podcasts in EvaDB, and then simply use SQL queries to serve the search application. Using SQL and abstracting away all the complexity of building AI data pipelines enables any data scientist or software developer to significantly reduce the time needed to go from an idea to a production app. In addition, because EvaDB is a complete database system, we do all kinds of optimizations for performance and scaling under the hood that lead to a delightful user experience.
Gaurav: The key technical breakthrough here is that EvaDB is an end-to-end database system for both unstructured and structured multi-modal data that makes it easy to chain multiple models in a single query to accomplish complicated tasks with minimal programming. Because EvaDB supports a simple SQL-like query language, it makes it really easy for users to leverage AI models without the need to put together complex data pipelines. The declarative query language reduces the complexity of the app, leading to more maintainable code that allows users to build on top of each other’s queries.
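To make the idea of chaining models in a single declarative query concrete, here is a minimal sketch. It does not use EvaDB or its query language; it uses Python's built-in sqlite3 module as a stand-in, and the two "models" (`speech_to_text` and `topic_of`) are hypothetical placeholder functions with canned outputs. The point is only to show how a nested function call inside one SQL statement can replace a hand-written pipeline.

```python
import sqlite3

# Hypothetical stand-in data: a real app would run a speech-to-text
# model on the audio instead of looking up a canned transcript.
FAKE_TRANSCRIPTS = {
    "ep1.mp3": "today we talk about query optimizers in database systems",
    "ep2.mp3": "a short chat about hiking trails",
}


def speech_to_text(audio_path):
    """Placeholder 'model' 1: return a canned transcript for a file."""
    return FAKE_TRANSCRIPTS.get(audio_path, "")


def topic_of(text):
    """Placeholder 'model' 2: label a transcript with a simple keyword rule."""
    return "databases" if "database" in text else "other"


def find_episodes(topic):
    """Return titles of episodes whose transcript is labeled with `topic`."""
    conn = sqlite3.connect(":memory:")
    # Expose both Python functions to SQL so a query can call them.
    conn.create_function("speech_to_text", 1, speech_to_text)
    conn.create_function("topic_of", 1, topic_of)
    conn.execute("CREATE TABLE podcasts (title TEXT, audio_path TEXT)")
    conn.executemany(
        "INSERT INTO podcasts VALUES (?, ?)",
        [("Optimizers 101", "ep1.mp3"), ("Trail Talk", "ep2.mp3")],
    )
    # The two functions are chained inside a single query.
    rows = conn.execute(
        "SELECT title FROM podcasts "
        "WHERE topic_of(speech_to_text(audio_path)) = ? "
        "ORDER BY title",
        (topic,),
    ).fetchall()
    conn.close()
    return [title for (title,) in rows]
```

Calling `find_episodes("databases")` runs both placeholder functions per row from within the query, so the application code never has to orchestrate the steps itself.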
The batteries-included experience of EvaDB encompasses a wide array of models tailored for unstructured data analysis, including LLMs, image classification, object detection, OCR, face detection, and more. It features integrations with well-established AI pipelines founded on Hugging Face, PyTorch, and OpenAI technologies. EvaDB is fully developed in Python and is licensed under the Apache license.
One of the key technical breakthroughs behind EvaDB is its sophisticated Cascades-style query optimization framework tailored for AI applications. This framework facilitates optimizations such as cost-based model reordering, automated model selection, and fine-grained function caching. This cutting-edge optimizer empowers EvaDB to automatically optimize queries, reducing inference cost and query execution time. For instance, in a social media moderation app, multiple models are utilized for text analysis, image recognition, and contextual understanding. When handling posts containing both text and images, EvaDB would arrange the models in an optimal sequence, prioritizing swift image recognition to identify explicit content, followed by the use of language models for contextual understanding.
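The intuition behind cost-based model reordering can be sketched in a few lines. This is not EvaDB's optimizer; it is a textbook-style ordering rule for independent filters, where each filter is ranked by its cost divided by the fraction of inputs it rejects, and cheaper, more selective filters run first. The filter names, costs, and selectivities below are made-up assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ModelFilter:
    name: str
    cost: float         # assumed average cost per call (arbitrary units)
    selectivity: float  # assumed fraction of inputs that pass the filter
    predicate: Callable[[dict], bool]


def rank(f: ModelFilter) -> float:
    """Lower rank runs first: cheap filters that reject a lot win."""
    if f.selectivity >= 1.0:
        return float("inf")  # a filter that rejects nothing goes last
    return f.cost / (1.0 - f.selectivity)


def order_filters(filters: List[ModelFilter]) -> List[ModelFilter]:
    return sorted(filters, key=rank)


def expected_cost(filters: List[ModelFilter]) -> float:
    """Expected per-input cost when evaluation stops at the first failure."""
    total, reach = 0.0, 1.0
    for f in filters:
        total += reach * f.cost   # only inputs that got this far pay
        reach *= f.selectivity    # fraction that continues to the next one
    return total


def passes_all(post: dict, filters: List[ModelFilter],
               calls: Dict[str, int]) -> bool:
    """Run filters in order, short-circuiting; record which ones ran."""
    for f in filters:
        calls[f.name] = calls.get(f.name, 0) + 1
        if not f.predicate(post):
            return False
    return True


# Hypothetical moderation filters: a fast image check and a slow text check.
FILTERS = [
    ModelFilter("text_check", 10.0, 0.9,
                lambda post: "spam" not in post["text"]),
    ModelFilter("image_check", 1.0, 0.7,
                lambda post: not post["has_explicit_image"]),
]

ordered = order_filters(FILTERS)
```

With these assumed numbers, the fast image check is placed ahead of the expensive text check, so a post with an explicit image is rejected without ever invoking the slower model.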
Joy: We are still in the early stages of gaining adoption but we have been excited by the progress so far. Our open source project has garnered 2,200 GitHub stars and we have 100+ members in our Slack. In particular, we are proud of the number of contributors - 54 right now - which indicates that our users are motivated enough to contribute to the project. That said, we have only just begun to market EvaDB and expect to significantly expand the user community in the coming months.
Joy: We took a different approach to developing EvaDB compared to most academic projects. Instead of focusing on quick publications, we spent the first couple of years building a platform. Yes, it required significant upfront investment, but it is really paying off because we can now easily explore new ideas on a robust platform. In addition, other researchers and industry users are able to use EvaDB as a platform for their own work, which has helped us battle-test the system at large scales much earlier than most other academic projects. Because we have built an end-to-end database system, it has also helped us engage industry practitioners early on, and they have heavily influenced our roadmap to build a scalable platform that is also easy to use. This has been a huge driver behind all the success we have had with EvaDB so far.
Gaurav: As a PhD student, balancing innovative research and building an open source project is not easy, as the requirements for the two tend to be different. For a PhD, you have to focus on innovative contributions, most of which tend to solve specific and complex technical problems. However, to build a successful open source project, you have to make it really usable and accessible to a wide range of people with varying degrees of technical skills. I have at times struggled with balancing the needs of both, but I wouldn’t do it any other way - open sourcing our academic research has helped us gain tremendous visibility, usage and extremely valuable feedback that we wouldn’t have gotten otherwise.
Joy: I wish we had started interacting with enterprise users a bit earlier, which would have helped us better prioritize our project roadmap and not just go after technically challenging features. For example, one of our recent projects was on localizing actions in videos by using reinforcement learning. Localizing actions - e.g., tracking a suspicious vehicle by law enforcement - requires analyzing multiple video frames. We built an AI-powered optimization technique to figure out which frames needed to be analyzed, leading to faster action localization. While this work led to a well-recognized SIGMOD paper, we also found out by talking to industry practitioners that this kind of optimization goes beyond what people currently need in real life.
Gaurav: I follow several open-source projects that are relevant to our field, such as Ray, Ludwig, and MindsDB, and respect the progress they have made. EvaDB has seamless integrations with Ray and Ludwig; in fact, our execution engine thrives on the power of Ray. While MindsDB concentrates on processing structured data, at EvaDB, we tackle the challenge of handling both unstructured and structured data. Our thesis is that people want a unified database system capable of storing and analyzing both data types using a language like SQL, and that is where EvaDB excels.
Joy: I have two specific pieces of advice for those in academia who are building open source projects. First, it is never too early to start marketing your project and engaging with your user community. If you have a broad vision like we do for EvaDB, getting a ton of early user feedback can really help you shape the direction of your project and prioritize your project roadmap. Second, if you are building a system, it pays off to build a strong foundation before you build all the bells and whistles. We were fortunate enough to invest the first couple of years into building a strong foundational system, which is now paying dividends by helping us quickly deliver on new feature requests from our growing user community.