Glama

scrublet

Identify and filter doublets in single-cell RNA sequencing data to improve analysis accuracy by detecting merged cell artifacts.

Instructions

Predict doublets using Scrublet

Input Schema

Name | Required | Description | Default
adata_sim | No | Optional path to AnnData object with simulated doublets. |
batch_key | No | Key in adata.obs for batch information. |
sim_doublet_ratio | No | Number of doublets to simulate relative to observed transcriptomes. |
expected_doublet_rate | No | Estimated doublet rate for the experiment. |
stdev_doublet_rate | No | Uncertainty in the expected doublet rate. |
synthetic_doublet_umi_subsampling | No | Rate for sampling UMIs when creating synthetic doublets. |
knn_dist_metric | No | Distance metric used when finding nearest neighbors. | euclidean
normalize_variance | No | Normalize data such that each gene has variance of 1. |
log_transform | No | Whether to log-transform the data prior to PCA. |
mean_center | No | Center data such that each gene has mean of 0. |
n_prin_comps | No | Number of principal components used for embedding. |
use_approx_neighbors | No | Use approximate nearest neighbor method (annoy). |
get_doublet_neighbor_parents | No | Return parent transcriptomes that generated doublet neighbors. |
n_neighbors | No | Number of neighbors used to construct KNN graph. |
threshold | No | Doublet score threshold for calling a transcriptome a doublet. |
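
To make the schema concrete, here is a minimal sketch of an MCP `tools/call` request for this tool. The argument values are illustrative assumptions, not recommendations, and `batch_key="sample"` presumes that such a column exists in `adata.obs`. Parameters left out of `arguments` fall back to the schema defaults.

```python
import json

# Hypothetical argument values, chosen only to illustrate the schema.
arguments = {
    "expected_doublet_rate": 0.06,
    "sim_doublet_ratio": 2.0,
    "n_prin_comps": 30,
    "knn_dist_metric": "euclidean",
    "batch_key": "sample",  # assumes adata.obs has a "sample" column
}

# JSON-RPC envelope that MCP uses for tool invocation.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "scrublet", "arguments": arguments},
}

payload = json.dumps(request, indent=2)
print(payload)
```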

Implementation Reference

  • Generic handler that executes the scrublet tool: it looks up sc.pp.scrublet in the pp_func mapping, sets inplace=True, forwards only the arguments that the target function's signature accepts, calls the function on the active AnnData, and logs the operation. A KeyError is re-raised with a clearer message.
    def run_pp_func(ads, func, arguments):
        adata = ads.adata_dic[ads.active]
        if func not in pp_func:
            raise ValueError(f"Unsupported function: {func}")
        
        run_func = pp_func[func]
        parameters = inspect.signature(run_func).parameters
        arguments["inplace"] = True
        kwargs = {k: arguments.get(k) for k in parameters if k in arguments}
        try:
            res = run_func(adata, **kwargs)
            add_op_log(adata, run_func, kwargs)
        except KeyError as e:
            # Chain the original error so the traceback keeps its cause.
            raise KeyError(f"Cannot find {e} column in adata.obs or adata.var") from e
        return res
  • Pydantic model defining the input schema and validation for the scrublet tool, including parameters like sim_doublet_ratio, expected_doublet_rate, and validators for positive numbers and distance metrics.
    class ScrubletModel(JSONParsingModel):
        """Input schema for the scrublet doublet prediction tool."""
        
        adata_sim: Optional[str] = Field(
            default=None,
            description="Optional path to AnnData object with simulated doublets."
        )
        
        batch_key: Optional[str] = Field(
            default=None,
            description="Key in adata.obs for batch information."
        )
        
        sim_doublet_ratio: float = Field(
            default=2.0,
            description="Number of doublets to simulate relative to observed transcriptomes.",
            gt=0
        )
        
        expected_doublet_rate: float = Field(
            default=0.05,
            description="Estimated doublet rate for the experiment.",
            ge=0,
            le=1
        )
        
        stdev_doublet_rate: float = Field(
            default=0.02,
            description="Uncertainty in the expected doublet rate.",
            ge=0,
            le=1
        )
        
        synthetic_doublet_umi_subsampling: float = Field(
            default=1.0,
            description="Rate for sampling UMIs when creating synthetic doublets.",
            gt=0,
            le=1
        )
        
        knn_dist_metric: str = Field(
            default="euclidean",
            description="Distance metric used when finding nearest neighbors."
        )
        
        normalize_variance: bool = Field(
            default=True,
            description="Normalize data such that each gene has variance of 1."
        )
        
        log_transform: bool = Field(
            default=False,
            description="Whether to log-transform the data prior to PCA."
        )
        
        mean_center: bool = Field(
            default=True,
            description="Center data such that each gene has mean of 0."
        )
        
        n_prin_comps: int = Field(
            default=30,
            description="Number of principal components used for embedding.",
            gt=0
        )
        
        use_approx_neighbors: Optional[bool] = Field(
            default=None,
            description="Use approximate nearest neighbor method (annoy)."
        )
        
        get_doublet_neighbor_parents: bool = Field(
            default=False,
            description="Return parent transcriptomes that generated doublet neighbors."
        )
        
        n_neighbors: Optional[int] = Field(
            default=None,
            description="Number of neighbors used to construct KNN graph.",
            gt=0
        )
        
        threshold: Optional[float] = Field(
            default=None,
            description="Doublet score threshold for calling a transcriptome a doublet.",
            ge=0,
            le=1
        )
        
        @field_validator('sim_doublet_ratio', 'expected_doublet_rate', 'stdev_doublet_rate',
                         'synthetic_doublet_umi_subsampling', 'n_prin_comps', 'n_neighbors')
        @classmethod
        def validate_positive_numbers(cls, v: Optional[Union[int, float]]) -> Optional[Union[int, float]]:
            """Validate positive numbers where applicable"""
            # Note: this also rejects 0 for the two rate fields,
            # although their Field(...) definitions declare ge=0.
            if v is not None and v <= 0:
                raise ValueError("must be a positive number")
            return v
        
        @field_validator('knn_dist_metric')
        @classmethod
        def validate_knn_dist_metric(cls, v: str) -> str:
            """Validate distance metric is supported"""
            valid_metrics = ['euclidean', 'manhattan', 'cosine', 'correlation']
            if v.lower() not in valid_metrics:
                raise ValueError(f"knn_dist_metric must be one of {valid_metrics}")
            return v.lower()
  • Registers the scrublet tool as an MCP Tool with name, description, and references the ScrubletModel schema.
    scrublet = types.Tool(
        name="scrublet",
        description="Predict doublets using Scrublet",
        inputSchema=ScrubletModel.model_json_schema(),
    )
  • Maps the scrublet tool name to the underlying scanpy.pp.scrublet function for execution in the handler.
    pp_func = {
        "filter_genes": sc.pp.filter_genes,
        "filter_cells": sc.pp.filter_cells,
        "calculate_qc_metrics": partial(sc.pp.calculate_qc_metrics, inplace=True),
        "log1p": sc.pp.log1p,
        "normalize_total": sc.pp.normalize_total,
        "pca": sc.pp.pca,
        "highly_variable_genes": sc.pp.highly_variable_genes,
        "regress_out": sc.pp.regress_out,
        "scale": sc.pp.scale,
        "combat": sc.pp.combat,
        "scrublet": sc.pp.scrublet,
        "neighbors": sc.pp.neighbors,
    }
  • Registers the scrublet Tool object in the pp_tools dictionary, likely used for MCP server tool listing.
    # Mapping between function names and models
    pp_tools = {
        "filter_genes": filter_genes,
        "filter_cells": filter_cells,
        "calculate_qc_metrics": calculate_qc_metrics,
        "log1p": log1p,
        "normalize_total": normalize_total,
        "pca": pca,
        "highly_variable_genes": highly_variable_genes,
        "regress_out": regress_out,
        "scale": scale,
        "combat": combat,
        "scrublet": scrublet,
        "neighbors": neighbors,
    }
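
The handler's argument filtering can be illustrated without scanpy. The sketch below reproduces only the `inspect.signature` step from `run_pp_func`; `fake_scrublet` and `filter_kwargs` are hypothetical stand-ins, not part of the server or of scanpy. It shows that keys absent from the target signature, including the injected `inplace`, are dropped before the call.

```python
import inspect

def fake_scrublet(adata, *, expected_doublet_rate=0.05, threshold=None):
    """Hypothetical stand-in for a preprocessing function."""
    return {"expected_doublet_rate": expected_doublet_rate, "threshold": threshold}

def filter_kwargs(func, arguments):
    """Keep only the arguments that func's signature accepts."""
    parameters = inspect.signature(func).parameters
    return {k: arguments[k] for k in parameters if k in arguments}

# "inplace" and "unknown" are not parameters of fake_scrublet.
arguments = {"expected_doublet_rate": 0.1, "inplace": True, "unknown": 1}
kwargs = filter_kwargs(fake_scrublet, arguments)
result = fake_scrublet(adata=None, **kwargs)
print(kwargs)
```

One consequence of this design is that a misspelled parameter name is ignored rather than reported, so the call silently runs with the default value.
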

Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries full burden for behavioral disclosure. 'Predict doublets' implies a read-only analysis operation, but doesn't disclose whether this modifies input data, what output format to expect, whether it's computationally intensive, or what happens with the results. For a 15-parameter tool with complex statistical operations, this is a significant gap in behavioral context.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is extremely concise at just 4 words, with zero wasted language. It's front-loaded with the core purpose. While it may be too brief for adequate completeness, it earns full marks for conciseness.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a complex 15-parameter statistical tool with no annotations and no output schema, the description is severely inadequate. It doesn't explain what doublets are, what Scrublet is, what domain this applies to, what the output looks like, or how results should be interpreted. The description fails to provide the necessary context for effective tool selection and use.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema already documents all 15 parameters thoroughly. The description adds no parameter-specific information beyond what's in the schema. According to scoring rules, when schema coverage is high (>80%), the baseline is 3 even with no param info in the description.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
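
As a rough illustration of the value ranges involved, the sketch below restates the numeric bounds from the `ScrubletModel` fields in plain Python. `BOUNDS`, `METRICS`, and `check_arguments` are hypothetical names, not part of the server, and the check is a simplification of what Pydantic does. Because the model's `validate_positive_numbers` validator rejects 0 for the two rate parameters even though their fields declare `ge=0`, the sketch treats them as strictly positive.

```python
# (lower bound, lower bound inclusive?, upper bound) per parameter,
# taken from the Field(...) constraints and tightened where the
# positive-number validator rejects zero.
BOUNDS = {
    "sim_doublet_ratio": (0, False, None),
    "expected_doublet_rate": (0, False, 1),
    "stdev_doublet_rate": (0, False, 1),
    "synthetic_doublet_umi_subsampling": (0, False, 1),
    "n_prin_comps": (0, False, None),
    "n_neighbors": (0, False, None),
    "threshold": (0, True, 1),
}
METRICS = {"euclidean", "manhattan", "cosine", "correlation"}

def check_arguments(arguments):
    """Return a list of human-readable problems; empty means OK."""
    problems = []
    for name, value in arguments.items():
        if value is None:
            continue  # optional parameters may be left unset
        if name == "knn_dist_metric" and value.lower() not in METRICS:
            problems.append(f"{name}: unsupported metric {value!r}")
        if name in BOUNDS:
            low, inclusive, high = BOUNDS[name]
            too_low = value < low if inclusive else value <= low
            if too_low or (high is not None and value > high):
                problems.append(f"{name}: {value} out of range")
    return problems

print(check_arguments({"expected_doublet_rate": 0.0, "threshold": 0.0}))
```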

Purpose: 3/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description 'Predict doublets using Scrublet' states the action ('predict') and target ('doublets'), but is vague about what doublets are in this context and doesn't distinguish this tool from sibling tools like 'filter_cells' or 'score_genes' which might also involve cell quality assessment. It doesn't specify what Scrublet is or what domain this applies to (single-cell RNA-seq analysis).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives. With many sibling tools for single-cell analysis (e.g., 'filter_cells', 'score_genes', 'calculate_qc_metrics'), there's no indication whether this should be used before/after other quality control steps, what data prerequisites exist, or what alternatives might be available for doublet detection.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
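
To show what a fuller description could look like, here is a hypothetical rewrite expressed as a plain Python dictionary. The wording is a suggestion only, not the server's actual text. It assumes scanpy's documented behavior for `scanpy.pp.scrublet`: scores are written to `adata.obs['doublet_score']` and `adata.obs['predicted_doublet']`, extra details go to `adata.uns['scrublet']`, and the method works best on raw, unnormalized counts.

```python
# Hypothetical rewrite of the tool description; wording is a suggestion only.
description = (
    "Predict doublets (two cells captured as one) in single-cell RNA-seq data "
    "using Scrublet via scanpy.pp.scrublet. "
    "Modifies the active AnnData in place: adds obs['doublet_score'] and "
    "obs['predicted_doublet'], plus uns['scrublet']. "
    "Works best on raw, unnormalized counts, so run it before "
    "normalize_total or log1p. "
    "Does not remove cells; filter on obs['predicted_doublet'] afterwards."
)

# Plain-dict stand-in for the registered tool metadata.
tool_spec = {"name": "scrublet", "description": description}
print(tool_spec["description"])
```

A description along these lines would address the side-effect, output-format, and ordering gaps raised in the review, at the cost of a longer text.
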

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/huang-sh/scmcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.