
Launching jobs on Lakeflow

Lakeflow is Databricks' way to manage jobs on a cluster. Databricks lets you edit scripts in their UI (they call these "notebooks"). That's too simple for some of our more complex experiments. Thankfully, Databricks supports lots of other ways to submit jobs.

This tool is an opinionated way to spawn jobs on Databricks: it asks you to author your code as a Python package, which forces you to specify its dependencies. It then uploads that package (as a Python wheel) for Databricks to run. This is heavier-weight than Databricks' notebook approach of shipping scripts, but it lets you capture large package dependencies across repos via git submodules. It's lighter-weight than other job submission systems that operate on Docker containers. For most of our work, wheels represent all the containerization we need.

It has one more opinion: that uv is a good way to capture those Python dependencies, with a pyproject.toml.

Once you have your package defined in a pyproject.toml, you can use this tool to build the wheel, upload it to Databricks, and spawn copies of it with different command line arguments. Databricks gives you a UI to check the state of your jobs.

The tool provides several interfaces:

  • A set of Python functions you can call.

  • A CLI you can use from the shell.

  • An MCP server you can connect to from an AI agent.

Getting access to Databricks

Check if you have access to Databricks by visiting this url. If you get stuck in an infinite loop where Databricks sends you a code that doesn't work, it means you don't have an account. Ask for one in #help-data-platform.

Your package's structure

First, ensure you can run your package by calling uv run. Here's the structure we'll assume:

    my_project/
    ├── pyproject.toml
    ├── src/
    │   └── my_package/
    │       ├── __init__.py
    │       └── my_package_py.py

The package lakeflow_demo under this directory follows this structure.
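
For reference, a minimal pyproject.toml for this layout might look like the sketch below. The name, version, dependency list, and the choice of hatchling as the build backend are placeholders, not requirements of this tool; adapt them to your project. The lakeflow-task entry point is added on top of this, as described next.

    [project]
    name = "my_package"
    version = "0.1.0"
    requires-python = ">=3.10"
    # Runtime dependencies listed here end up in the wheel's metadata.
    dependencies = []

    [build-system]
    # Any standard build backend works; hatchling is just one common choice.
    requires = ["hatchling"]
    build-backend = "hatchling.build"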

Lakeflow assumes you've added an entry point called "lakeflow-task" to your pyproject.toml. If your package is called my_package, and it has a driver called my_package_py.py, and the main function in it is called main, you would define the entry point like this:

    [project.scripts]
    lakeflow-task = "my_package.my_package_py:main"

This entry point is called with no arguments. Instead, the arguments you pass when triggering a run arrive via sys.argv, and the environment variable DATABRICKS_RUN_ID will be populated.
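
As a concrete sketch (assuming the layout above; the print statement stands in for your real work), the driver's main function might look like this:

    # src/my_package/my_package_py.py
    import os
    import sys

    def main() -> None:
        # Arguments passed at trigger time arrive in sys.argv.
        args = sys.argv[1:]
        # Databricks populates DATABRICKS_RUN_ID for each run.
        run_id = os.environ.get("DATABRICKS_RUN_ID", "unknown")
        print(f"Run {run_id} started with arguments: {args}")

    if __name__ == "__main__":
        main()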

Building and launching your package

To launch a package on Databricks, you have to first build the wheel, then upload it, then tell Databricks to run it:

  1. Build the wheel:

    uv run lakeflow.py build-wheel ~/my_project
    # Output: /path/to/dist/my_package-0.1.0-py3-none-any.whl

    This outputs the local wheel path, which we'll use in the next step.

  2. Upload the wheel:

    uv run lakeflow.py upload-wheel /path/to/dist/my_package-0.1.0-py3-none-any.whl
    # Output: /Users/me/wheels/my_package-0.1.0-py3-none-any.whl

    This outputs the remote wheel path, which we'll use in the next step.

  3. Create the Job:

    python lakeflow.py create-job \
        "my-lakeflow-job" \
        "my-package" \
        "/Users/me/wheels/my_package-0.1.0-py3-none-any.whl"
    # Output: 123456 (Job ID)

    This returns the job ID, which we'll use in the next step. This doesn't yet run any jobs. It just starts a cluster that can run them.

  4. Trigger a Run:

    python lakeflow.py trigger-run 123456 argv1 argv2

    This starts one instance of the job with the given arguments. If you have shards of data, you can call this operation multiple times with different arguments to kick off a bunch of runs in parallel (see the sketch after this list).

  5. Monitor the Runs:

    python lakeflow.py list-job-runs 123456

    This lists the runs for the given job ID.
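
As referenced in step 4, fanning out one run per data shard is just a loop over trigger-run. Here is a minimal Python sketch that drives the CLI via subprocess; the job ID and shard names are placeholders.

    import subprocess

    JOB_ID = "123456"  # the job ID returned by create-job
    SHARDS = ["shard-00", "shard-01", "shard-02"]  # placeholder shard names

    for shard in SHARDS:
        # Each call starts one run; the shard name shows up in that run's sys.argv.
        subprocess.run(
            ["python", "lakeflow.py", "trigger-run", JOB_ID, shard],
            check=True,
        )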

Python Interface

The above illustrated how to use the CLI. But you can also use the programmatic interface to the package. See run_lakeflow_demo.py for an example.

Model Context Protocol (MCP) Integration

You can install this package as an MCP server. To do that, add this to ~/.cursor/mcp.json:

{ "mcpServers": { "lakeflow": { "command": "/Users/arahimi/.local/bin/uv", "args": [ "run", "--quiet", "--directory", "/Users/arahimi/lakeflow-mcp", "python", "lakeflow.py" ], "env": { "DATABRICKS_HOST": "https://hims-machine-learning-staging-workspace.cloud.databricks.com", "DATABRICKS_TOKEN": "<your tocken>>" } }, ... } }

Then you can ask the agent to do things like this:

let's launch 4 copies of this job on lakeflow, and pass them the arguments "fi", "fie", "fo", and "fum" respectively.
