scrublet
Predict doublets in single-cell RNA sequencing data by comparing observed transcriptomes against simulated doublets, so that downstream analysis is not distorted by multi-cell artifacts. Simulation and detection parameters are configurable.
Instructions
Predict doublets using Scrublet
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| adata_sim | No | Optional path to AnnData object with simulated doublets. | |
| batch_key | No | Key in adata.obs for batch information. | |
| expected_doublet_rate | No | Estimated doublet rate for the experiment. | 0.05 |
| get_doublet_neighbor_parents | No | Return parent transcriptomes that generated doublet neighbors. | False |
| knn_dist_metric | No | Distance metric used when finding nearest neighbors. | euclidean |
| log_transform | No | Whether to log-transform the data prior to PCA. | False |
| mean_center | No | Center data such that each gene has mean of 0. | True |
| n_neighbors | No | Number of neighbors used to construct KNN graph. | |
| n_prin_comps | No | Number of principal components used for embedding. | 30 |
| normalize_variance | No | Normalize data such that each gene has variance of 1. | True |
| sim_doublet_ratio | No | Number of doublets to simulate relative to observed transcriptomes. | 2.0 |
| stdev_doublet_rate | No | Uncertainty in the expected doublet rate. | 0.02 |
| synthetic_doublet_umi_subsampling | No | Rate for sampling UMIs when creating synthetic doublets. | 1.0 |
| threshold | No | Doublet score threshold for calling a transcriptome a doublet. | |
| use_approx_neighbors | No | Use approximate nearest neighbor method (annoy). | |
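As an illustration of how these parameters might be supplied, the sketch below builds a JSON-RPC request for the MCP `tools/call` method. The argument values and the `batch_key` column name `sample` are hypothetical; how the request is transported to the server depends on the client and is not shown.

```python
import json

# Hypothetical argument values. Any parameter left out falls back to the
# server-side default listed in the table above.
arguments = {
    "expected_doublet_rate": 0.06,
    "sim_doublet_ratio": 2.0,
    "n_prin_comps": 30,
    "knn_dist_metric": "euclidean",
    "batch_key": "sample",  # assumed column name in adata.obs
}

# JSON-RPC 2.0 request for the MCP "tools/call" method.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "scrublet", "arguments": arguments},
}

payload = json.dumps(request, indent=2)
print(payload)
```

All parameters are optional, so an empty `arguments` object should also be accepted and would run the tool with its defaults.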
Implementation Reference
- src/scmcp/tool/pp.py:120-137 (handler): Generic handler function for all preprocessing (pp) tools, including 'scrublet'. With func='scrublet' it dispatches to sc.pp.scrublet and runs it on the active AnnData object, passing only the arguments that appear in the target function's signature.

  ```python
  import inspect

  # pp_func and add_op_log are defined outside this excerpt.
  def run_pp_func(ads, func, arguments):
      # Operate on the currently active AnnData object.
      adata = ads.adata_dic[ads.active]
      if func not in pp_func:
          raise ValueError(f"Unsupported function: {func}")
      run_func = pp_func[func]
      # Keep only the arguments that the target function actually accepts.
      parameters = inspect.signature(run_func).parameters
      arguments["inplace"] = True
      kwargs = {k: arguments.get(k) for k in parameters if k in arguments}
      try:
          res = run_func(adata, **kwargs)
          add_op_log(adata, run_func, kwargs)
      except KeyError as e:
          raise KeyError(f"Cannot find {e} column in adata.obs or adata.var") from e
      return res
  ```
- src/scmcp/tool/pp.py:88-101 (helper): Dictionary mapping tool names to their corresponding scanpy.pp functions. 'scrublet' maps to sc.pp.scrublet, which the handler uses to dispatch execution.

  ```python
  from functools import partial

  import scanpy as sc

  # Tool name -> scanpy preprocessing function.
  pp_func = {
      "filter_genes": sc.pp.filter_genes,
      "filter_cells": sc.pp.filter_cells,
      "calculate_qc_metrics": partial(sc.pp.calculate_qc_metrics, inplace=True),
      "log1p": sc.pp.log1p,
      "normalize_total": sc.pp.normalize_total,
      "pca": sc.pp.pca,
      "highly_variable_genes": sc.pp.highly_variable_genes,
      "regress_out": sc.pp.regress_out,
      "scale": sc.pp.scale,
      "combat": sc.pp.combat,
      "scrublet": sc.pp.scrublet,
      "neighbors": sc.pp.neighbors,
  }
  ```
- src/scmcp/tool/pp.py:104-117 (registration): Registration of all pp Tool objects, including 'scrublet', in the pp_tools dictionary, which is imported and used by the MCP server.

  ```python
  # Tool name -> Tool object (each defined earlier in the same module).
  pp_tools = {
      "filter_genes": filter_genes,
      "filter_cells": filter_cells,
      "calculate_qc_metrics": calculate_qc_metrics,
      "log1p": log1p,
      "normalize_total": normalize_total,
      "pca": pca,
      "highly_variable_genes": highly_variable_genes,
      "regress_out": regress_out,
      "scale": scale,
      "combat": combat,
      "scrublet": scrublet,
      "neighbors": neighbors,
  }
  ```
- src/scmcp/tool/pp.py:75-79 (registration): Creation of the 'scrublet' MCP Tool object with its name, description, and input schema.

  ```python
  # The input schema is generated from the ScrubletModel Pydantic model.
  scrublet = types.Tool(
      name="scrublet",
      description="Predict doublets using Scrublet",
      inputSchema=ScrubletModel.model_json_schema(),
  )
  ```
- src/scmcp/schema/pp.py:392-496 (schema): Pydantic model defining the input schema for the 'scrublet' tool, including fields such as sim_doublet_ratio and expected_doublet_rate, plus field validators.

  ```python
  from typing import Optional, Union

  from pydantic import Field, field_validator

  # JSONParsingModel is the project's base model, defined outside this excerpt.
  class ScrubletModel(JSONParsingModel):
      """Input schema for the scrublet doublet prediction tool."""

      adata_sim: Optional[str] = Field(
          default=None,
          description="Optional path to AnnData object with simulated doublets."
      )
      batch_key: Optional[str] = Field(
          default=None,
          description="Key in adata.obs for batch information."
      )
      sim_doublet_ratio: float = Field(
          default=2.0,
          description="Number of doublets to simulate relative to observed transcriptomes.",
          gt=0
      )
      expected_doublet_rate: float = Field(
          default=0.05,
          description="Estimated doublet rate for the experiment.",
          ge=0, le=1
      )
      stdev_doublet_rate: float = Field(
          default=0.02,
          description="Uncertainty in the expected doublet rate.",
          ge=0, le=1
      )
      synthetic_doublet_umi_subsampling: float = Field(
          default=1.0,
          description="Rate for sampling UMIs when creating synthetic doublets.",
          gt=0, le=1
      )
      knn_dist_metric: str = Field(
          default="euclidean",
          description="Distance metric used when finding nearest neighbors."
      )
      normalize_variance: bool = Field(
          default=True,
          description="Normalize data such that each gene has variance of 1."
      )
      log_transform: bool = Field(
          default=False,
          description="Whether to log-transform the data prior to PCA."
      )
      mean_center: bool = Field(
          default=True,
          description="Center data such that each gene has mean of 0."
      )
      n_prin_comps: int = Field(
          default=30,
          description="Number of principal components used for embedding.",
          gt=0
      )
      use_approx_neighbors: Optional[bool] = Field(
          default=None,
          description="Use approximate nearest neighbor method (annoy)."
      )
      get_doublet_neighbor_parents: bool = Field(
          default=False,
          description="Return parent transcriptomes that generated doublet neighbors."
      )
      n_neighbors: Optional[int] = Field(
          default=None,
          description="Number of neighbors used to construct KNN graph.",
          gt=0
      )
      threshold: Optional[float] = Field(
          default=None,
          description="Doublet score threshold for calling a transcriptome a doublet.",
          ge=0, le=1
      )

      @field_validator('sim_doublet_ratio', 'expected_doublet_rate', 'stdev_doublet_rate',
                       'synthetic_doublet_umi_subsampling', 'n_prin_comps', 'n_neighbors')
      @classmethod
      def validate_positive_numbers(cls, v: Optional[Union[int, float]]) -> Optional[Union[int, float]]:
          """Validate positive numbers where applicable"""
          if v is not None and v <= 0:
              raise ValueError("must be a positive number")
          return v

      @field_validator('knn_dist_metric')
      @classmethod
      def validate_knn_dist_metric(cls, v: str) -> str:
          """Validate distance metric is supported"""
          valid_metrics = ['euclidean', 'manhattan', 'cosine', 'correlation']
          if v.lower() not in valid_metrics:
              raise ValueError(f"knn_dist_metric must be one of {valid_metrics}")
          return v.lower()
  ```
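The handler's dispatch pattern (look up a function by name, then pass only the arguments its signature accepts) can be tried in isolation. The following is a minimal sketch using only the standard library: `fake_scrublet`, `fake_qc`, `registry`, and `dispatch` are hypothetical stand-ins rather than the project's code, and unlike the handler above the sketch copies the argument dictionary instead of mutating it.

```python
import inspect
from functools import partial


def fake_scrublet(adata, sim_doublet_ratio=2.0, expected_doublet_rate=0.05, threshold=None):
    """Hypothetical stand-in for sc.pp.scrublet; echoes the values it received."""
    return {
        "sim_doublet_ratio": sim_doublet_ratio,
        "expected_doublet_rate": expected_doublet_rate,
        "threshold": threshold,
    }


def fake_qc(adata, percent_top=None, inplace=False):
    """Hypothetical stand-in for a function that does accept `inplace`."""
    return {"inplace": inplace}


# Name -> callable, mirroring the shape of the pp_func dictionary.
registry = {
    "scrublet": fake_scrublet,
    "calculate_qc_metrics": partial(fake_qc, inplace=True),
}


def dispatch(adata, func, arguments):
    if func not in registry:
        raise ValueError(f"Unsupported function: {func}")
    run_func = registry[func]
    parameters = inspect.signature(run_func).parameters
    # Copy so the caller's dictionary is left untouched (a choice of this sketch).
    arguments = {**arguments, "inplace": True}
    # Arguments the target function does not declare are silently dropped.
    kwargs = {k: arguments[k] for k in parameters if k in arguments}
    return run_func(adata, **kwargs)


result = dispatch(object(), "scrublet", {"expected_doublet_rate": 0.1, "not_a_parameter": 123})
qc = dispatch(object(), "calculate_qc_metrics", {})
print(result)
print(qc)
```

Because filtering is driven by the signature, the forced `inplace` entry only reaches functions that declare an `inplace` parameter; for `fake_scrublet` it is dropped along with the unknown key.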
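The two validators in ScrubletModel can be restated in plain Python to see what they accept and reject. This is a sketch of the validation logic only, not the Pydantic implementation; `check_positive` and `normalize_metric` are hypothetical names. One consequence that appears to follow from the listing: because expected_doublet_rate and stdev_doublet_rate are in the positive-number validator's field list, a value of 0 would be rejected even though their Field constraints use ge=0.

```python
VALID_METRICS = ["euclidean", "manhattan", "cosine", "correlation"]


def check_positive(value):
    """Mirror of validate_positive_numbers: None passes, values <= 0 fail."""
    if value is not None and value <= 0:
        raise ValueError("must be a positive number")
    return value


def normalize_metric(value):
    """Mirror of validate_knn_dist_metric: case-insensitive, returns lowercase."""
    if value.lower() not in VALID_METRICS:
        raise ValueError(f"knn_dist_metric must be one of {VALID_METRICS}")
    return value.lower()


metric = normalize_metric("Cosine")   # accepted and lowercased
neighbors = check_positive(None)      # optional fields may stay unset

try:
    check_positive(0)                 # zero is not considered positive
    zero_rejected = False
except ValueError:
    zero_rejected = True

print(metric, neighbors, zero_rejected)
```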