Databricks is an exciting and powerful platform for creating solutions that process big data into actionable content. Originally, it lacked several important capabilities needed to fully automate the platform, which historically limited the ability to integrate it into a holistic DevOps practice. Thankfully, those days are long past. Those limitations have been removed, making it possible to utilize the platform more fully.
There are three parts to the DevOps story with this platform: infrastructure, notebooks, and jobs. Over the last few years, the Databricks team has worked hard to build up the DevOps story around these aspects. Today, we’ll explore some of those features!
Infrastructure
Setting up Databricks requires a lot of pieces to be configured: accounts, workspaces, repositories, clusters, and more. For newcomers, this can be pretty daunting! At the same time, understanding these pieces is critical to getting the most out of this system.
If you're used to the API-first nature of most Azure services, Azure Databricks can catch you off guard: the majority of these items are not configurable with ARM templates. While Databricks is treated as a first-party solution on Azure, the platform still has to support deployments on the other cloud vendors that offer it. To do that, it provides its own APIs for deploying most of the infrastructure, and those APIs are not cloud-specific. This means I can script deployment solutions, but I can't rely on cloud-native tools to manage those features. If you're a fan of Terraform, you're in luck: they have implemented providers for Azure, AWS, and Google Cloud (GCP), with the goal of supporting all of the Databricks APIs.
If you prefer to handle things manually, it's not too difficult: it's just a lot of RESTful calls to query and deploy resources. You'll need the API references (Azure, AWS, GCP) to make your own calls. For the most part, these APIs are agnostic to the cloud provider, so scripts can generally target multiple providers with minimal changes.
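As a rough sketch of what that scripting can look like, here's a minimal Python example that lists the clusters in a workspace over the REST API. The workspace URL and personal access token are assumed to be available as environment variables (the names here are illustrative); the same call pattern applies to most of the other endpoints.

```python
# Minimal sketch: listing clusters through the Databricks REST API.
# DATABRICKS_HOST and DATABRICKS_TOKEN are assumed environment variables.
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-<workspace-id>.azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]  # a personal access token

response = requests.get(
    f"{host}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
)
response.raise_for_status()

# The response contains a "clusters" array (absent when the workspace has none).
for cluster in response.json().get("clusters", []):
    print(cluster["cluster_id"], cluster["cluster_name"], cluster["state"])
```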
Notebooks
This is the bread and butter of most workloads! Because of this, Databricks was kind enough to introduce Workspaces v2 (a.k.a. Repos). Repos provide platform-agnostic support for managing and updating notebooks at scale across multiple teams while still enabling collaboration.
Repos still let multiple users collaborate on the same notebook simultaneously, while also supporting full Git integration (with the user controlling pushes, pulls, and merges). Together, these two options avoid the conflicts that concurrent edits can create (a problem that can occur in Azure Data Factory when two users work on the same branch).
At a high level, a Repo associates a Git repository with a folder, so changes can be committed, pushed, and pulled from a central repository such as GitHub. Unlike the original Workspaces, deployments are handled by simply instructing a Repo to update itself to the latest copy of a branch. As a workflow, users work on their own copies in their own personal Repos and create pull requests to merge the code into a collaboration branch. Once they are ready, they can create a branch or a release, triggering a workflow that calls a Databricks API to update the branch used by their production Repo (or, alternatively, uses the Databricks CLI for Repos). The workflow looks something like this: develop in a personal Repo, open a pull request into the collaboration branch, and then update the production Repo as part of the release.
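As a sketch of that final deployment step, here's what the API call might look like in Python. The repo ID, branch name, and environment variable names are placeholders; the endpoint itself is the Repos API's update operation.

```python
# Minimal sketch: updating a production Repo to the latest commit of a branch.
# REPO_ID is the numeric ID of the Repo in the workspace (illustrative only).
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]
repo_id = os.environ["REPO_ID"]  # e.g. discovered once via GET /api/2.0/repos

response = requests.patch(
    f"{host}/api/2.0/repos/{repo_id}",
    headers={"Authorization": f"Bearer {token}"},
    json={"branch": "release"},  # check out the latest commit of this branch
)
response.raise_for_status()
print(response.json())  # includes the branch and head commit now checked out
```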
You can read more about the Git integration for Azure, GCP, or AWS.
That covers part of the CI/CD and dev processes, but what about testing? It's certainly possible to build unit tests for notebooks (typically in a separate, parallel notebook containing those tests), but this can be painful. Scala code is often compiled into JARs and tested with standard test frameworks to validate the logic (optionally alongside some Spark-based tests). The same pattern was often used with Python. This usually results in the need to deploy the compiled code to clusters.
Databricks has a feature called “Files in Repos” (still in public preview as of November 2022) that aims to solve this problem. Available on Azure, GCP, and AWS, it allows you to include non-notebook files in the repositories. This makes it possible to create standard Python code (and unit tests).
For Python projects, this allows you to move away from .whl file deployments and test notebooks. Instead, the notebooks act as basic orchestrators that connect inputs (such as widget values) to testable methods. By making it possible to use traditional tooling and native test frameworks (such as pytest), this simplifies the process of creating maintainable code.
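To make that concrete, here's a small illustrative sketch of the pattern: a plain Python module holding the logic and a pytest test beside it. The file layout and function names are made up for the example.

```python
# lib/cleaning.py -- ordinary, importable, unit-testable Python (illustrative)
def normalize_country(value: str) -> str:
    """Collapse a few common spellings into a two-letter code."""
    aliases = {"usa": "US", "u.s.": "US", "united states": "US"}
    cleaned = value.strip().lower()
    return aliases.get(cleaned, cleaned.upper())


# tests/test_cleaning.py -- runs anywhere pytest runs; no cluster required
from lib.cleaning import normalize_country

def test_normalize_country():
    assert normalize_country(" u.s. ") == "US"
    assert normalize_country("de") == "DE"
```

The notebook itself then shrinks to a few lines of wiring: read a parameter with dbutils.widgets.get, import normalize_country from the repo, and apply it to the incoming data.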
Jobs
Jobs can follow a few different DevOps workflows, depending on the kind of job you're using and how you deploy the dependencies. For notebook-based jobs, it's possible to have Databricks pull the latest version of a notebook from a Git repository. This makes it easy to deploy updates to the jobs and doesn't require any special action to trigger the update process. This is quickly becoming the preferred method.
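For example, the Jobs API (2.1) lets you point a notebook task directly at a Git branch via a git_source block. The sketch below is illustrative: the repository URL, notebook path, and cluster ID are placeholders.

```python
# Minimal sketch: creating a job whose notebook is pulled from a Git branch.
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

job_spec = {
    "name": "nightly-etl",
    "git_source": {
        "git_url": "https://github.com/example-org/etl-notebooks",
        "git_provider": "gitHub",
        "git_branch": "release",
    },
    "tasks": [
        {
            "task_key": "run-etl",
            "notebook_task": {"notebook_path": "notebooks/etl", "source": "GIT"},
            "existing_cluster_id": os.environ["CLUSTER_ID"],
        }
    ],
}

response = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
response.raise_for_status()
print("Created job", response.json()["job_id"])
```

Because the job resolves the notebook from the branch at run time, a release is just a branch update; no artifact has to be copied into the workspace.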
For more advanced situations (or for projects using legacy approaches), you can deploy JAR files to the clusters or scripts directly to Workspaces or the DBFS filesystem. In this case, you need to have a process for creating and automating those deployments to ensure that jobs are using the appropriate version of the code for a given release. Databricks provides APIs that help with this process.
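As a hedged sketch of that flow, the steps might look like the following in Python: push the built JAR into DBFS, then attach it to the target cluster through the Libraries API. Paths, file names, and environment variables are placeholders, and the single-shot dbfs/put call is only suitable for small files (the streaming create/add-block/close endpoints handle larger artifacts).

```python
# Minimal sketch: shipping a built JAR to DBFS and installing it on a cluster.
import base64
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]
headers = {"Authorization": f"Bearer {token}"}

local_jar = "target/etl-assembly-1.2.0.jar"            # artifact from the build
dbfs_path = "/FileStore/jars/etl-assembly-1.2.0.jar"   # destination inside DBFS

# 1. Upload the artifact (single-shot put; fine for small files only).
with open(local_jar, "rb") as f:
    contents = base64.b64encode(f.read()).decode("ascii")
resp = requests.post(
    f"{host}/api/2.0/dbfs/put",
    headers=headers,
    json={"path": dbfs_path, "contents": contents, "overwrite": True},
)
resp.raise_for_status()

# 2. Install the uploaded JAR as a library on the target cluster.
resp = requests.post(
    f"{host}/api/2.0/libraries/install",
    headers=headers,
    json={
        "cluster_id": os.environ["CLUSTER_ID"],
        "libraries": [{"jar": f"dbfs:{dbfs_path}"}],
    },
)
resp.raise_for_status()
```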
Testing is handled similarly to Repos – either as modular code or paired notebooks.
Putting it all together
Combining these aspects, you can plan an effective approach to managing the complete software and deployment lifecycle on Databricks. In a future post, I’ll explore ways to implement some of these processes.
Until then, Happy DevOp’ing!