

Managing data in our infrastructure to make it usable for the DoorDash teams who need it requires various pipelines and ETL jobs. Data not only powers workflows in our infrastructure, such as sending an order from a customer to a restaurant, but also supplies models with fresh data and enables our Data Science and Business Intelligence teams to analyze and optimize our services.

As an orchestration engine, Apache Airflow let us quickly build pipelines in our data infrastructure. However, as our business grew to 2 billion orders delivered, scalability became an issue. As we grew to cover more than 4,000 cities, all the data became more complex and voluminous, making orchestration hard to manage. Our solution came from a new Airflow version which let us pair it with Kubernetes, ensuring that our data infrastructure could keep up with our business growth.

To contextualize our legacy system, we will dive into how Apache Airflow was set up to orchestrate all the ETLs that power DoorDash's data platform.

How Airflow helped orchestrate our initial data delivery

Initially, we used Airflow for orchestration to build data pipelines and set up a single node to get started quickly. When scalability became an issue, we looked for another orchestration solution. The open source community came to the rescue with a new Airflow version adding support for Kubernetes pod operators. This solution was perfect for our needs, as we already use Kubernetes clusters and the combination scaled to handle our traffic; a sketch of the pattern follows below.
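To make the pod-operator pattern concrete, here is a minimal sketch of an Airflow DAG that runs an ETL step in its own Kubernetes pod. The DAG id, namespace, image, and command are hypothetical, not DoorDash's actual pipeline code, and the exact import path and parameter names vary across Airflow and provider versions.

```python
# Minimal sketch: one ETL task executed as its own Kubernetes pod.
# All names (DAG id, namespace, image, command) are hypothetical.
# The import path below matches recent cncf.kubernetes provider releases;
# older Airflow versions expose the operator under a different module path.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="example_orders_etl",      # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_orders = KubernetesPodOperator(
        task_id="extract_orders",
        name="extract-orders",
        namespace="data-pipelines",   # hypothetical namespace
        image="registry.example.com/etl/extract-orders:latest",  # hypothetical image
        cmds=["python", "-m", "etl.extract_orders"],
        get_logs=True,
        is_delete_operator_pod=True,  # clean up the pod once the task finishes
    )
```

Because each task runs in its own pod, task workloads no longer compete with the scheduler for resources on a single node, which is what lets the Airflow-plus-Kubernetes combination scale with traffic.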
