read_tool
Extract and process single-cell RNA sequencing data from multiple file formats (h5ad, 10x, text files) or directories. Supports memory-efficient backed modes, URL retrieval, and customizable parsing options for analysis.
Instructions
Read data from various file formats (h5ad, 10x, text files, etc.) or directory path.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| backed | No | If 'r', load AnnData in 'backed' mode instead of fully loading it into memory ('memory' mode). If you want to modify backed attributes of the AnnData object, you need to choose 'r+'. | |
| backup_url | No | Retrieve the file from a URL if it is not present on disk. | |
| cache | No | If False, read from the source; if True, read from the fast 'h5ad' cache. | False |
| cache_compression | No | Either 'gzip' or 'lzf'. See the h5py dataset compression options. (Default: settings.cache_compression) | |
| delimiter | No | Delimiter that separates data within a text file. If None, splits on runs of whitespace of any length, which differs from splitting at every single whitespace character. | |
| ext | No | Extension that indicates the file type. If None, uses extension of filename. | |
| filename | Yes | Path to the file to read. | |
| first_column_names | No | Assume the first column stores row names. This is only necessary if these are not strings: strings in the first column are automatically assumed to be row names. | False |
| first_column_obs | No | If True, assume the first column stores observation (cell or barcode) names when a text file is provided. If False, the data will be transposed. | True |
| gex_only | No | Only keep 'Gene Expression' data and ignore other feature types, e.g. 'Antibody Capture', 'CRISPR Guide Capture', or 'Custom'. Used for 10x formats. | True |
| make_unique | No | Whether to make the variables index unique by appending '-1', '-2', etc. Used for 10x mtx format. | True |
| prefix | No | Any prefix before matrix.mtx, genes.tsv and barcodes.tsv. For instance, if the files are named patientA_matrix.mtx, patientA_genes.tsv and patientA_barcodes.tsv the prefix is patientA_. Used for 10x mtx format. | |
| sampleid | No | Sample identifier to mark and distinguish different samples. | |
| sheet | No | Name of sheet/table in hdf5 or Excel file. | |
| var_names | No | The variables index for 10x mtx format. Either 'gene_symbols' or 'gene_ids'. | gene_symbols |
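As a rough illustration of how the parameters above combine, the sketch below builds two argument sets: one for a single h5ad file opened in backed mode, and one for a 10x directory with prefixed file names. The paths are hypothetical, `check_arguments` is a made-up client-side helper (the server runs its own validation), and the `name`/`arguments` envelope is an assumption about how a client might package the call.

```python
import json

# Allowed values, taken from the table and the schema validators on this page.
ALLOWED = {
    "backed": {"r", "r+"},
    "cache_compression": {"gzip", "lzf"},
    "var_names": {"gene_symbols", "gene_ids"},
}


def check_arguments(arguments: dict) -> dict:
    """Hypothetical client-side check; not part of the server code."""
    if "filename" not in arguments:
        raise ValueError("filename is required")
    for name, allowed in ALLOWED.items():
        value = arguments.get(name)
        if value is not None and value not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}")
    return arguments


# Hypothetical paths, for illustration only.
h5ad_args = check_arguments({"filename": "data/pbmc3k.h5ad", "backed": "r"})
tenx_args = check_arguments({
    "filename": "data/patientA/",   # a directory selects the 10x mtx reader
    "prefix": "patientA_",          # files named patientA_matrix.mtx, etc.
    "var_names": "gene_ids",
    "sampleid": "patientA",
})

# Serialized form of the call, as a client might send it (assumed envelope).
payload = json.dumps({"name": "read_tool", "arguments": tenx_args}, indent=2)
print(payload)
```

Only `filename` is required; every other key can be left out, in which case the defaults listed in the table apply.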
Implementation Reference
- src/scmcp/tool/io.py:28-44 (handler): Core handler function that implements the logic to read AnnData objects from files or directories using scanpy.read or sc.read_10x_mtx based on input parameters.

      def read_func(**kwargs):
          file = Path(kwargs["filename"])
          if file.is_dir():
              kwargs["path"] = kwargs["filename"]
              parameters = inspect.signature(sc.read_10x_mtx).parameters
              func_kwargs = {k: kwargs.get(k) for k in parameters if k in kwargs}
              adata = sc.read_10x_mtx(**func_kwargs)
          elif file.is_file():
              parameters = inspect.signature(sc.read).parameters
              func_kwargs = {k: kwargs.get(k) for k in parameters if k in kwargs}
              logger.info(func_kwargs)
              adata = sc.read(**func_kwargs)
              if not kwargs.get("first_column_obs", True):
                  adata = adata.T
          else:
              adata = "there are no file"
          return adata

- src/scmcp/schema/io.py:13-92 (schema): Pydantic model that defines the input schema (parameters and validators) for the read_tool MCP tool.

      class ReadModel(JSONParsingModel):
          """Input schema for the read tool."""

          filename: str = Field(
              description="Path to the file to read."
          )
          sampleid: Optional[str] = Field(
              default=None,
              description="Sample identifier to mark and distinguish different samples."
          )
          backed: Optional[Literal['r', 'r+']] = Field(
              default=None,
              description="If 'r', load AnnData in 'backed' mode instead of fully loading it into memory ('memory' mode). If you want to modify backed attributes of the AnnData object, you need to choose 'r+'."
          )
          sheet: Optional[str] = Field(
              default=None,
              description="Name of sheet/table in hdf5 or Excel file."
          )
          ext: Optional[str] = Field(
              default=None,
              description="Extension that indicates the file type. If None, uses extension of filename."
          )
          delimiter: Optional[str] = Field(
              default=None,
              description="Delimiter that separates data within text file. If None, will split at arbitrary number of white spaces, which is different from enforcing splitting at any single white space."
          )
          first_column_names: bool = Field(
              default=False,
              description="Assume the first column stores row names. This is only necessary if these are not strings: strings in the first column are automatically assumed to be row names."
          )
          first_column_obs: bool = Field(
              default=True,
              description="If True, assume the first column stores observations (cell or barcode) names when provide text file. If False, the data will be transposed."
          )
          backup_url: Optional[str] = Field(
              default=None,
              description="Retrieve the file from an URL if not present on disk."
          )
          cache: bool = Field(
              default=False,
              description="If False, read from source, if True, read from fast 'h5ad' cache."
          )
          cache_compression: Optional[Literal['gzip', 'lzf']] = Field(
              default=None,
              description="See the h5py dataset_compression. (Default: settings.cache_compression)"
          )
          var_names: Optional[str] = Field(
              default="gene_symbols",
              description="The variables index for 10x mtx format. Either 'gene_symbols' or 'gene_ids'."
          )
          make_unique: bool = Field(
              default=True,
              description="Whether to make the variables index unique by appending '-1', '-2' etc. or not. Used for 10x mtx format."
          )
          gex_only: bool = Field(
              default=True,
              description="Only keep 'Gene Expression' data and ignore other feature types, e.g. 'Antibody Capture', 'CRISPR Guide Capture', or 'Custom'. Used for 10x formats."
          )
          prefix: Optional[str] = Field(
              default=None,
              description="Any prefix before matrix.mtx, genes.tsv and barcodes.tsv. For instance, if the files are named patientA_matrix.mtx, patientA_genes.tsv and patientA_barcodes.tsv the prefix is patientA_. Used for 10x mtx format."
          )

          @field_validator('backed')
          def validate_backed(cls, v: Optional[str]) -> Optional[str]:
              if v is not None and v not in ['r', 'r+']:
                  raise ValueError("If backed is provided, it must be either 'r' or 'r+'")
              return v

          @field_validator('cache_compression')
          def validate_cache_compression(cls, v: Optional[str]) -> Optional[str]:
              if v is not None and v not in ['gzip', 'lzf']:
                  raise ValueError("cache_compression must be either 'gzip', 'lzf', or None")
              return v

          @field_validator('var_names')
          def validate_var_names(cls, v: Optional[str]) -> Optional[str]:
              if v is not None and v not in ['gene_symbols', 'gene_ids']:
                  raise ValueError("var_names must be either 'gene_symbols' or 'gene_ids'")
              return v

- src/scmcp/tool/io.py:15-19 (registration): Registers the read_tool as an MCP Tool with name, description, and input schema from ReadModel.

      read_tool = types.Tool(
          name="read_tool",
          description="Read data from various file formats (h5ad, 10x, text files, etc.) or directory path.",
          inputSchema=ReadModel.model_json_schema(),
      )

- src/scmcp/tool/io.py:52-55 (registration): Adds the read_tool to the io_tools dictionary, likely used for server registration.

      io_tools = {
          "read_tool": read_tool,
          "write_tool": write_tool,
      }

- src/scmcp/tool/io.py:47-50 (helper): Maps tool names to their handler functions, linking 'read_tool' to read_func.

      io_func = {
          "read_tool": read_func,
          "write_tool": sc.write,
      }

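The handler and the io_func mapping listed above rely on two small patterns: filtering the incoming arguments down to those the target function accepts (via `inspect.signature`), and looking up the handler by tool name. The sketch below isolates both so it can run without scanpy installed. `fake_read` and `fake_read_10x_mtx` are hypothetical stand-ins with abbreviated parameter lists, `filter_kwargs` and `read_dispatch` are names invented here, and the directory check is passed in as a flag instead of touching the file system.

```python
import inspect


def filter_kwargs(func, kwargs: dict) -> dict:
    """Keep only the keys that appear in func's signature."""
    parameters = inspect.signature(func).parameters
    return {k: kwargs[k] for k in parameters if k in kwargs}


# Hypothetical stand-ins for sc.read and sc.read_10x_mtx (assumption: the real
# functions take more parameters than shown here).
def fake_read(filename, backed=None, delimiter=None, cache=False):
    return ("read", filename, backed)


def fake_read_10x_mtx(path, var_names="gene_symbols", prefix=None):
    return ("read_10x_mtx", path, var_names, prefix)


def read_dispatch(is_dir: bool, **kwargs):
    """Mirror the handler's branching, with the directory check passed in."""
    if is_dir:
        # The 10x reader expects 'path', so copy 'filename' under that key.
        kwargs["path"] = kwargs["filename"]
        return fake_read_10x_mtx(**filter_kwargs(fake_read_10x_mtx, kwargs))
    return fake_read(**filter_kwargs(fake_read, kwargs))


# Name-to-function mapping, analogous to io_func.
handlers = {"read_tool": read_dispatch}

args = {"filename": "data/patientA/", "prefix": "patientA_", "backed": None}
result = handlers["read_tool"](is_dir=True, **args)
print(result)
```

Keys the target function does not accept (here `backed` and `filename` in the directory branch) are dropped silently. Values that do pass the filter are forwarded as given, so an explicit None replaces the callee's own default.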