Enables the management of jobs on Databricks clusters by building and uploading Python wheel packages, creating job definitions, triggering runs with specific arguments, and monitoring job execution status.
Click on "Install Server".
Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
In the chat, type @ followed by the MCP server name and your instructions, e.g., "@Lakeflow MCP Server launch 4 copies of this job with arguments 'fi', 'fie', 'fo', and 'fum'"
That's it! The server will respond to your query, and you can continue using it as needed.
Here is a step-by-step guide with screenshots.
Launching jobs on Lakeflow
Lakeflow is Databricks' way to manage jobs on a cluster. Databricks lets you edit scripts in its UI (it calls these "notebooks"). That's too simple for some of our more complex experiments. Thankfully, Databricks supports lots of other ways to submit jobs.
This tool is an opinionated way to spawn jobs on Databricks: it asks you to author your code as a Python package, which forces you to specify its dependencies. It then uploads that package (as a Python wheel) for Databricks to run. This is heavier-weight than Databricks' notebook approach of shipping scripts, but it lets you capture large package dependencies across repos via git submodules. It's lighter-weight than job submission systems that operate on Docker containers. For most of our work, wheels represent all the containerization we need.
It has one more opinion: that uv is a good way to capture those Python dependencies, with
a pyproject.toml.
Once you have your package defined in a pyproject.toml, you can use this tool
to build the wheel, upload it to Databricks, and spawn copies of it with
different command line arguments. Databricks gives you a UI to check the state
of your jobs.
The tool provides several interfaces:
A set of Python functions you can call.
A CLI you can use from the shell.
An MCP server you can connect to from an AI agent.
Getting access to Databricks
Check whether you have access to Databricks by visiting this url. If you get stuck in an infinite loop where Databricks sends you a code that doesn't work, it means you don't have an account. Ask for one in #help-data-platform.
Your package's structure
First, ensure you can run your package by calling uv run. Here's the structure we'll assume:
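Roughly, something like this (a sketch assuming uv's default src layout; the names match the placeholders used later in this guide):

my_project/
├── pyproject.toml
├── uv.lock
└── src/
    └── my_package/
        ├── __init__.py
        └── my_package_py.py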
The package lakeflow_demo under this directory follows this structure.
Lakeflow assumes you've added an entry point called "lakeflow-task" to your pyproject.toml.
If your package is called my_package, and it has a driver called my_package_py.py, and the
main function in it is called main, you would define the entry point like this:
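A sketch of that declaration, assuming the entry point lives in the standard [project.scripts] table of pyproject.toml:

[project.scripts]
lakeflow-task = "my_package.my_package_py:main"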
This entry point is called with no arguments. Instead, sys.argv will be populated with the arguments you pass when you trigger a run, and the environment variable DATABRICKS_RUN_ID will be set.
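For illustration, a minimal driver that picks up its inputs this way (a sketch; my_package_py.py and main are just the names from the example above):

import os
import sys

def main() -> None:
    # Arguments passed when the run is triggered arrive via sys.argv,
    # not as function parameters.
    args = sys.argv[1:]
    # Databricks populates DATABRICKS_RUN_ID for the run executing this task.
    run_id = os.environ.get("DATABRICKS_RUN_ID")
    print(f"Run {run_id} started with arguments: {args}")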
Building and launching your package
To launch a package on Databricks, you first have to build the wheel, then upload it, then tell Databricks to run it:
Build the wheel:
uv run lakeflow.py build-wheel ~/my_project  # Output: /path/to/dist/my_package-0.1.0-py3-none-any.whl
This outputs the local wheel path, which we'll use in the next step.
Upload the wheel:
uv run lakeflow.py upload-wheel /path/to/dist/my_package-0.1.0-py3-none-any.whl  # Output: /Users/me/wheels/my_package-0.1.0-py3-none-any.whl
This outputs the remote wheel path, which we'll use in the next step.
Create the Job:
python lakeflow.py create-job \
  "my-lakeflow-job" \
  "my-package" \
  "/Users/me/wheels/my_package-0.1.0-py3-none-any.whl"
# Output: 123456 (Job ID)
This returns the job ID, which we'll use in the next step. This doesn't yet run any jobs. It just starts a cluster that can run them.
Trigger a Run:
python lakeflow.py trigger-run 123456 argv1 argv2
This starts one instance of the job with the given arguments. If you have shards of data, you can call this operation multiple times with different arguments to kick off a bunch of jobs in parallel.
Monitor the Runs:
python lakeflow.py list-job-runs 123456
This lists the runs for the given job ID.
Python Interface
The steps above use the CLI, but you can also use the package's programmatic interface. See run_lakeflow_demo.py for an example.
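To give a flavor, here is a sketch of the programmatic flow, assuming functions that mirror the CLI subcommands (build_wheel, upload_wheel, create_job, and trigger_run are assumed names here; run_lakeflow_demo.py shows the real interface):

import lakeflow

# Assumed function names mirroring the CLI subcommands.
local_wheel = lakeflow.build_wheel("~/my_project")
remote_wheel = lakeflow.upload_wheel(local_wheel)
job_id = lakeflow.create_job("my-lakeflow-job", "my-package", remote_wheel)

# Kick off one run per shard of data, each with its own argument.
for shard in ["fi", "fie", "fo", "fum"]:
    lakeflow.trigger_run(job_id, shard)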
Model Context Protocol (MCP) Integration
You can install this package as an MCP server. To do that, add this to ~/.cursor/mcp.json:
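The entry takes the usual mcpServers shape; the server name, command, and path below are placeholders, so substitute whatever launches the Lakeflow MCP server in your checkout:

{
  "mcpServers": {
    "lakeflow": {
      "command": "uv",
      "args": ["run", "/path/to/lakeflow_mcp.py"]
    }
  }
}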
Then you can ask the agent to do things like: "@Lakeflow MCP Server launch 4 copies of this job with arguments 'fi', 'fie', 'fo', and 'fum'"