# Hive Query Examples for Dataproc MCP Server
This file contains example Hive queries that can be executed on a Dataproc cluster using the `submit_hive_query` tool.
## Basic Queries
### Create a Table
```sql
CREATE TABLE IF NOT EXISTS example_table (
  id INT,
  name STRING,
  value DOUBLE,
  created_date DATE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```
### Load Data from GCS
```sql
LOAD DATA INPATH 'gs://your-bucket/path/to/data.csv'
OVERWRITE INTO TABLE example_table;
```
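As an alternative to moving files into the warehouse with `LOAD DATA`, you can define an external table directly over the GCS location so Hive reads the CSV files in place. This is a minimal sketch; the `example_table_ext` name and the `gs://your-bucket/path/to/` prefix are illustrative placeholders.
```sql
-- Sketch: read CSV files in place from GCS instead of moving them.
-- Dropping an external table leaves the underlying files untouched.
CREATE EXTERNAL TABLE IF NOT EXISTS example_table_ext (
  id INT,
  name STRING,
  value DOUBLE,
  created_date DATE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'gs://your-bucket/path/to/';
```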
### Simple SELECT Query
```sql
SELECT * FROM example_table LIMIT 10;
```
### Aggregation Query
```sql
SELECT
  created_date,
  COUNT(*) AS count,
  AVG(value) AS avg_value,
  MAX(value) AS max_value
FROM example_table
GROUP BY created_date
ORDER BY created_date DESC;
```
## Advanced Queries
### Create a Partitioned Table
```sql
CREATE TABLE IF NOT EXISTS partitioned_table (
  id INT,
  name STRING,
  value DOUBLE
)
PARTITIONED BY (dt DATE)
STORED AS PARQUET;
```
### Insert Data with Partition
```sql
INSERT INTO partitioned_table PARTITION (dt='2023-01-01')
SELECT id, name, value FROM example_table
WHERE created_date = '2023-01-01';
```
### Join Example
```sql
SELECT
  a.id,
  a.name,
  b.additional_info
FROM example_table a
JOIN another_table b ON a.id = b.id
WHERE a.value > 100;
```
### Window Functions
```sql
SELECT
  id,
  name,
  value,
  created_date,
  RANK() OVER (PARTITION BY created_date ORDER BY value DESC) AS rank
FROM example_table;
```
## Using the MCP Server
### Synchronous Query Execution
To execute a query and wait for results:
```json
{
"projectId": "your-project-id",
"region": "us-central1",
"clusterName": "your-cluster-name",
"query": "SELECT * FROM example_table LIMIT 10;",
"async": false
}
```
### Asynchronous Query Execution
To submit a query and get a job ID for later status checking:
```json
{
"projectId": "your-project-id",
"region": "us-central1",
"clusterName": "your-cluster-name",
"query": "SELECT * FROM example_table LIMIT 10;",
"async": true
}
```
### Query with Custom Properties
To set Hive configuration properties and an execution timeout for the query:
```json
{
"projectId": "your-project-id",
"region": "us-central1",
"clusterName": "your-cluster-name",
"query": "SELECT * FROM example_table LIMIT 10;",
"queryOptions": {
"properties": {
"hive.exec.dynamic.partition.mode": "nonstrict",
"hive.exec.max.dynamic.partitions": "1000"
},
"timeoutMs": 300000
}
}
```
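The two `hive.exec.*` properties above enable dynamic partitioning, which matters when partition values come from the data rather than being written into the statement. As a hedged sketch reusing the `partitioned_table` and `example_table` defined earlier, a query that relies on these settings might look like:
```sql
-- Sketch: dynamic-partition insert. Hive derives dt from the last
-- column of the SELECT list; because no static partition value is
-- given, hive.exec.dynamic.partition.mode must be nonstrict.
INSERT INTO TABLE partitioned_table PARTITION (dt)
SELECT id, name, value, created_date AS dt
FROM example_table;
```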
## PySpark Job Examples with Local File Staging
The MCP server supports automatic local file staging for PySpark jobs, allowing you to reference local Python files using template syntax. Files are automatically uploaded to GCS and the job configuration is transformed to use the staged files.
### Basic PySpark Job with Local File
```json
{
"projectId": "your-project-id",
"region": "us-central1",
"clusterName": "pyspark-cluster",
"tool": "submit_dataproc_job",
"arguments": {
"jobType": "pyspark",
"jobConfig": {
"mainPythonFileUri": "{@./test-spark-job.py}",
"args": ["--mode", "test"]
}
}
}
```
### PySpark Job with Multiple Local Files
```json
{
"projectId": "your-project-id",
"region": "us-central1",
"clusterName": "analytics-cluster",
"tool": "submit_dataproc_job",
"arguments": {
"jobType": "pyspark",
"jobConfig": {
"mainPythonFileUri": "{@./scripts/data_processor.py}",
"pythonFileUris": [
"{@./utils/data_utils.py}",
"{@./utils/spark_helpers.py}",
"{@/absolute/path/to/shared_library.py}"
],
"args": ["--input", "gs://data-bucket/raw/", "--output", "gs://data-bucket/processed/"],
"properties": {
"spark.sql.adaptive.enabled": "true",
"spark.executor.memory": "4g"
}
}
}
}
```
### Local File Staging Features
- **Template Syntax**: Use `{@./relative/path}` or `{@/absolute/path}` to reference local files
- **Automatic Upload**: Files are automatically staged to the cluster's GCS staging bucket
- **Unique Naming**: Staged files get unique names with timestamps to avoid conflicts
- **Cleanup**: Staged files are automatically cleaned up after job completion
- **Supported Extensions**: `.py`, `.jar`, `.sql`, `.R` files are supported
### Example Test Job Output
When using the `test-spark-job.py` file, you can expect output similar to:
```
=== PySpark Local File Staging Test ===
Spark version: 3.1.3
Arguments received: ['--mode', 'test']
=== Sample Data ===
+---------+---+---------+
|     name|age|     role|
+---------+---+---------+
|    Alice| 25| Engineer|
|      Bob| 30|  Manager|
|  Charlie| 35|  Analyst|
|    Diana| 28| Designer|
+---------+---+---------+
=== Data Analysis ===
Total records: 4
Average age: 29.5
=== Role Distribution ===
+---------+-----+
|     role|count|
+---------+-----+
| Engineer|    1|
|  Manager|    1|
|  Analyst|    1|
| Designer|    1|
+---------+-----+
=== Test Completed Successfully! ===
Local file staging is working correctly.
```
### Successful Test Cases
The following job IDs demonstrate successful local file staging:
- Job ID: `db620480-135f-4de6-b9a6-4045b308fe97` - Basic PySpark job with local file
- Job ID: `36ed88b2-acad-4cfb-8fbf-88ad1ba22ad7` - PySpark job with multiple local files
These examples show that local file staging works seamlessly with the Dataproc MCP server, providing the same experience as using `gcloud dataproc jobs submit pyspark` with local files.