Glama

scrublet

Identify and filter doublets in single-cell RNA sequencing data to improve analysis accuracy by detecting merged cell artifacts.

Instructions

Predict doublets using Scrublet

Input Schema

Name                               Required  Description                                                          Default
adata_sim                          No        Optional path to AnnData object with simulated doublets.
batch_key                          No        Key in adata.obs for batch information.
sim_doublet_ratio                  No        Number of doublets to simulate relative to observed transcriptomes.
expected_doublet_rate              No        Estimated doublet rate for the experiment.
stdev_doublet_rate                 No        Uncertainty in the expected doublet rate.
synthetic_doublet_umi_subsampling  No        Rate for sampling UMIs when creating synthetic doublets.
knn_dist_metric                    No        Distance metric used when finding nearest neighbors.                 euclidean
normalize_variance                 No        Normalize data such that each gene has variance of 1.
log_transform                      No        Whether to log-transform the data prior to PCA.
mean_center                        No        Center data such that each gene has mean of 0.
n_prin_comps                       No        Number of principal components used for embedding.
use_approx_neighbors               No        Use approximate nearest neighbor method (annoy).
get_doublet_neighbor_parents       No        Return parent transcriptomes that generated doublet neighbors.
n_neighbors                        No        Number of neighbors used to construct KNN graph.
threshold                          No        Doublet score threshold for calling a transcriptome a doublet.
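
As an illustration, a call to this tool could be packaged as an MCP tools/call request, roughly as in the sketch below. The argument values and the request id are invented for the example. Only the parameter names and the tool name come from this page, and any parameter left out is expected to fall back to its default.

```python
import json

# Hypothetical argument values; only the parameter names come from the schema.
arguments = {
    "expected_doublet_rate": 0.06,
    "sim_doublet_ratio": 2.0,
    "n_prin_comps": 30,
    "knn_dist_metric": "euclidean",
}

# JSON-RPC 2.0 envelope that MCP uses for tool invocation.
request = {
    "jsonrpc": "2.0",
    "id": 1,  # arbitrary request id chosen for this example
    "method": "tools/call",
    "params": {"name": "scrublet", "arguments": arguments},
}

payload = json.dumps(request, indent=2)
print(payload)
```

How the payload is sent depends on the MCP client and transport in use, which this page does not specify.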

Implementation Reference

  • Generic handler that runs the scrublet tool. It looks up sc.pp.scrublet in the pp_func mapping, adds inplace=True to the arguments, forwards only the arguments that appear in the function's signature, applies the function to the active AnnData, and logs the operation. A KeyError is re-raised with a clearer message.
    import inspect

    def run_pp_func(ads, func, arguments):
        # Look up the currently active AnnData object.
        adata = ads.adata_dic[ads.active]
        if func not in pp_func:
            raise ValueError(f"Unsupported function: {func}")

        run_func = pp_func[func]
        parameters = inspect.signature(run_func).parameters
        # Request in-place modification; dropped below if run_func has no such parameter.
        arguments["inplace"] = True
        # Forward only the arguments that run_func actually accepts.
        kwargs = {k: arguments.get(k) for k in parameters if k in arguments}
        try:
            res = run_func(adata, **kwargs)
            add_op_log(adata, run_func, kwargs)
        except KeyError as e:
            raise KeyError(f"Cannot find {e} column in adata.obs or adata.var") from e
        return res
  • Pydantic model defining the input schema and validation for the scrublet tool, including parameters like sim_doublet_ratio, expected_doublet_rate, and validators for positive numbers and distance metrics.
    class ScrubletModel(JSONParsingModel):
        """Input schema for the scrublet doublet prediction tool."""
        
        adata_sim: Optional[str] = Field(
            default=None,
            description="Optional path to AnnData object with simulated doublets."
        )
        
        batch_key: Optional[str] = Field(
            default=None,
            description="Key in adata.obs for batch information."
        )
        
        sim_doublet_ratio: float = Field(
            default=2.0,
            description="Number of doublets to simulate relative to observed transcriptomes.",
            gt=0
        )
        
        expected_doublet_rate: float = Field(
            default=0.05,
            description="Estimated doublet rate for the experiment.",
            ge=0,
            le=1
        )
        
        stdev_doublet_rate: float = Field(
            default=0.02,
            description="Uncertainty in the expected doublet rate.",
            ge=0,
            le=1
        )
        
        synthetic_doublet_umi_subsampling: float = Field(
            default=1.0,
            description="Rate for sampling UMIs when creating synthetic doublets.",
            gt=0,
            le=1
        )
        
        knn_dist_metric: str = Field(
            default="euclidean",
            description="Distance metric used when finding nearest neighbors."
        )
        
        normalize_variance: bool = Field(
            default=True,
            description="Normalize data such that each gene has variance of 1."
        )
        
        log_transform: bool = Field(
            default=False,
            description="Whether to log-transform the data prior to PCA."
        )
        
        mean_center: bool = Field(
            default=True,
            description="Center data such that each gene has mean of 0."
        )
        
        n_prin_comps: int = Field(
            default=30,
            description="Number of principal components used for embedding.",
            gt=0
        )
        
        use_approx_neighbors: Optional[bool] = Field(
            default=None,
            description="Use approximate nearest neighbor method (annoy)."
        )
        
        get_doublet_neighbor_parents: bool = Field(
            default=False,
            description="Return parent transcriptomes that generated doublet neighbors."
        )
        
        n_neighbors: Optional[int] = Field(
            default=None,
            description="Number of neighbors used to construct KNN graph.",
            gt=0
        )
        
        threshold: Optional[float] = Field(
            default=None,
            description="Doublet score threshold for calling a transcriptome a doublet.",
            ge=0,
            le=1
        )
        
        @field_validator('sim_doublet_ratio', 'expected_doublet_rate', 'stdev_doublet_rate',
                       'synthetic_doublet_umi_subsampling', 'n_prin_comps', 'n_neighbors')
        @classmethod
        def validate_positive_numbers(cls, v: Optional[Union[int, float]]) -> Optional[Union[int, float]]:
            """Validate positive numbers where applicable"""
            # Note: this also rejects 0 for expected_doublet_rate and stdev_doublet_rate,
            # even though their Field constraints (ge=0) would accept it.
            if v is not None and v <= 0:
                raise ValueError("must be a positive number")
            return v
        
        @field_validator('knn_dist_metric')
        @classmethod
        def validate_knn_dist_metric(cls, v: str) -> str:
            """Validate distance metric is supported"""
            valid_metrics = ['euclidean', 'manhattan', 'cosine', 'correlation']
            # Comparison is case-insensitive; the stored value is normalized to lowercase.
            if v.lower() not in valid_metrics:
                raise ValueError(f"knn_dist_metric must be one of {valid_metrics}")
            return v.lower()
  • Registers the scrublet tool as an MCP Tool with name, description, and references the ScrubletModel schema.
    scrublet = types.Tool(
        name="scrublet",
        description="Predict doublets using Scrublet",
        inputSchema=ScrubletModel.model_json_schema(),
    )
  • Maps the scrublet tool name to the underlying scanpy.pp.scrublet function for execution in the handler.
    pp_func = {
        "filter_genes": sc.pp.filter_genes,
        "filter_cells": sc.pp.filter_cells,
        "calculate_qc_metrics": partial(sc.pp.calculate_qc_metrics, inplace=True),
        "log1p": sc.pp.log1p,
        "normalize_total": sc.pp.normalize_total,
        "pca": sc.pp.pca,
        "highly_variable_genes": sc.pp.highly_variable_genes,
        "regress_out": sc.pp.regress_out,
        "scale": sc.pp.scale,
        "combat": sc.pp.combat,
        "scrublet": sc.pp.scrublet,
        "neighbors": sc.pp.neighbors,
    }
  • Registers the scrublet Tool object in the pp_tools dictionary, likely used for MCP server tool listing.
    # Mapping between models and function names
    pp_tools = {
        "filter_genes": filter_genes,
        "filter_cells": filter_cells,
        "calculate_qc_metrics": calculate_qc_metrics,
        "log1p": log1p,
        "normalize_total": normalize_total,
        "pca": pca,
        "highly_variable_genes": highly_variable_genes,
        "regress_out": regress_out,
        "scale": scale,
        "combat": combat,
        "scrublet": scrublet,
        "neighbors": neighbors,
    }
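
The lookup-and-filter step performed by the handler above can be tried without scanpy. The sketch below is a minimal stand-in, not the project's code: fake_scrublet, registry, and dispatch are hypothetical names introduced only for illustration, and the stand-in function's parameters are assumptions rather than the scanpy signature.

```python
import inspect


def fake_scrublet(adata, *, threshold=None, n_neighbors=None, copy=False):
    """Hypothetical stand-in for a preprocessing function; not the scanpy API."""
    return {"threshold": threshold, "n_neighbors": n_neighbors, "copy": copy}


# Stand-in registry playing the role of pp_func.
registry = {"scrublet": fake_scrublet}


def dispatch(name, adata, arguments):
    """Look up `name`, then forward only the arguments its function declares."""
    if name not in registry:
        raise ValueError(f"Unsupported function: {name}")
    func = registry[name]
    parameters = inspect.signature(func).parameters
    # Copy instead of mutating the caller's dict (a deliberate difference
    # from the handler shown above, which writes into `arguments`).
    requested = {**arguments, "inplace": True}
    kwargs = {k: requested[k] for k in parameters if k in requested}
    return kwargs, func(adata, **kwargs)


kwargs, result = dispatch(
    "scrublet", object(), {"threshold": 0.25, "unknown_option": 1}
)
print(kwargs)  # {'threshold': 0.25}
```

Because the stand-in declares no inplace parameter, that key is dropped together with unknown_option. One consequence of this filtering style is that a misspelled argument name is silently ignored rather than reported.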

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/huang-sh/scmcp'
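
A rough Python equivalent of the curl command, using only the standard library, might look like the sketch below. The helper names build_request and fetch_server_info are hypothetical, and decoding the body as JSON is an assumption, since this page does not describe the response format.

```python
import json
import urllib.request

URL = "https://glama.ai/api/mcp/v1/servers/huang-sh/scmcp"


def build_request(url=URL):
    """Build a GET request for the given URL without sending it."""
    return urllib.request.Request(url, method="GET")


def fetch_server_info(url=URL, timeout=10):
    """Send the request and decode the body, assuming the API returns JSON."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request()
print(request.get_method(), request.full_url)
```

Calling fetch_server_info() performs the actual network request; it is left uncalled here so the snippet runs offline.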

If you have feedback or need assistance with the MCP directory API, please join our Discord server