read_tool
Read data from various file formats including h5ad, 10x, and text files for single-cell RNA sequencing analysis. Supports multiple formats and configurations to load data into the SCMCP server.
Instructions
Read data from various file formats (h5ad, 10x, text files, etc.) or directory path.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| filename | Yes | Path to the file to read. If the path is a directory, it is read as a 10x mtx directory. | |
| sampleid | No | Sample identifier to mark and distinguish different samples. | None |
| backed | No | If 'r', load AnnData in 'backed' mode instead of fully loading it into memory ('memory' mode). To modify backed attributes of the AnnData object, choose 'r+'. | None |
| sheet | No | Name of sheet/table in an hdf5 or Excel file. | None |
| ext | No | Extension that indicates the file type. If None, the extension of filename is used. | None |
| delimiter | No | Delimiter that separates data within a text file. If None, splits on any run of whitespace, which differs from splitting on every single whitespace character. | None |
| first_column_names | No | Assume the first column stores row names. This is only necessary if these are not strings: strings in the first column are automatically assumed to be row names. | False |
| first_column_obs | No | If True, assume the first column stores observation (cell or barcode) names when a text file is provided. If False, the data will be transposed. | True |
| backup_url | No | Retrieve the file from a URL if it is not present on disk. | None |
| cache | No | If False, read from source; if True, read from the fast 'h5ad' cache. | False |
| cache_compression | No | Either 'gzip' or 'lzf'. See the h5py dataset_compression. (Default: settings.cache_compression) | None |
| var_names | No | The variables index for 10x mtx format. Either 'gene_symbols' or 'gene_ids'. | gene_symbols |
| make_unique | No | Whether to make the variables index unique by appending '-1', '-2', etc. Used for 10x mtx format. | True |
| gex_only | No | Only keep 'Gene Expression' data and ignore other feature types, e.g. 'Antibody Capture', 'CRISPR Guide Capture', or 'Custom'. Used for 10x formats. | True |
| prefix | No | Any prefix before matrix.mtx, genes.tsv and barcodes.tsv. For instance, if the files are named patientA_matrix.mtx, patientA_genes.tsv and patientA_barcodes.tsv, the prefix is patientA_. Used for 10x mtx format. | None |
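To illustrate how these parameters combine, the sketch below builds a few argument payloads and checks them against the value constraints listed in the table. The file paths, the sample identifier, and the `check_arguments` helper are hypothetical; the server validates input with its own schema, so this is only a standard-library approximation of those checks.

```python
import json

# Allowed values, as documented for the read_tool parameters.
ALLOWED = {
    "backed": {"r", "r+"},
    "cache_compression": {"gzip", "lzf"},
    "var_names": {"gene_symbols", "gene_ids"},
}


def check_arguments(arguments: dict) -> list[str]:
    """Return a list of problems found in a read_tool argument payload.

    Hypothetical helper: it only mirrors the documented constraints and is
    not part of SCMCP.
    """
    problems = []
    if not isinstance(arguments.get("filename"), str):
        problems.append("filename is required and must be a string")
    for name, allowed in ALLOWED.items():
        value = arguments.get(name)
        if value is not None and value not in allowed:
            problems.append(f"{name} must be one of {sorted(allowed)}")
    return problems


# Hypothetical payloads; none of these paths are assumed to exist.
h5ad_args = {"filename": "data/pbmc3k.h5ad", "sampleid": "pbmc3k", "backed": "r"}
tenx_args = {
    "filename": "data/patientA/",  # a directory path selects the 10x mtx reader
    "var_names": "gene_ids",
    "prefix": "patientA_",
}
text_args = {
    "filename": "data/counts.csv",
    "delimiter": ",",
    "first_column_obs": False,  # the loaded matrix is transposed
}
bad_args = {"filename": "data/pbmc3k.h5ad", "backed": "w"}  # 'w' is not allowed

for payload in (h5ad_args, tenx_args, text_args, bad_args):
    print(json.dumps(payload), "->", check_arguments(payload) or "ok")
```

Only `filename` is required; every other key can be omitted, in which case the default from the table applies.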
Implementation Reference
- src/scmcp/tool/io.py:28-44 (handler): Core handler function implementing the reading logic for various file formats using `sc.read` or `sc.read_10x_mtx`.

  ```python
  def read_func(**kwargs):
      file = Path(kwargs["filename"])
      if file.is_dir():
          kwargs["path"] = kwargs["filename"]
          parameters = inspect.signature(sc.read_10x_mtx).parameters
          func_kwargs = {k: kwargs.get(k) for k in parameters if k in kwargs}
          adata = sc.read_10x_mtx(**func_kwargs)
      elif file.is_file():
          parameters = inspect.signature(sc.read).parameters
          func_kwargs = {k: kwargs.get(k) for k in parameters if k in kwargs}
          logger.info(func_kwargs)
          adata = sc.read(**func_kwargs)
          if not kwargs.get("first_column_obs", True):
              adata = adata.T
      else:
          adata = "there are no file"
      return adata
  ```
- src/scmcp/tool/io.py:70-84 (handler): Dispatch handler for IO tools. For `read_tool`, it manages the AnnData state and calls `read_func`.

  ```python
  def run_io_func(ads, func, arguments):
      if func == "read_tool":
          adata_id = f"adata{len(ads.adata_dic)}"
          if arguments.get("sampleid", None) is not None:
              adata_id = arguments["sampleid"]
          else:
              adata_id = f"adata{len(ads.adata_dic)}"
          res = read_func(**arguments)
          ads.active = adata_id
          ads.adata_dic[adata_id] = res
          return res
      else:
          adata = ads.adata_dic[ads.active]
          return write_func(adata, func, arguments)
  ```
- src/scmcp/schema/io.py:13-92 (schema): Pydantic model defining the input schema for `read_tool`, including all parameters and validators.

  ```python
  class ReadModel(JSONParsingModel):
      """Input schema for the read tool."""

      filename: str = Field(
          description="Path to the file to read."
      )
      sampleid: Optional[str] = Field(
          default=None,
          description="Sample identifier to mark and distinguish different samples."
      )
      backed: Optional[Literal['r', 'r+']] = Field(
          default=None,
          description="If 'r', load AnnData in 'backed' mode instead of fully loading it into memory ('memory' mode). If you want to modify backed attributes of the AnnData object, you need to choose 'r+'."
      )
      sheet: Optional[str] = Field(
          default=None,
          description="Name of sheet/table in hdf5 or Excel file."
      )
      ext: Optional[str] = Field(
          default=None,
          description="Extension that indicates the file type. If None, uses extension of filename."
      )
      delimiter: Optional[str] = Field(
          default=None,
          description="Delimiter that separates data within text file. If None, will split at arbitrary number of white spaces, which is different from enforcing splitting at any single white space."
      )
      first_column_names: bool = Field(
          default=False,
          description="Assume the first column stores row names. This is only necessary if these are not strings: strings in the first column are automatically assumed to be row names."
      )
      first_column_obs: bool = Field(
          default=True,
          description="If True, assume the first column stores observations (cell or barcode) names when provide text file. If False, the data will be transposed."
      )
      backup_url: Optional[str] = Field(
          default=None,
          description="Retrieve the file from an URL if not present on disk."
      )
      cache: bool = Field(
          default=False,
          description="If False, read from source, if True, read from fast 'h5ad' cache."
      )
      cache_compression: Optional[Literal['gzip', 'lzf']] = Field(
          default=None,
          description="See the h5py dataset_compression. (Default: settings.cache_compression)"
      )
      var_names: Optional[str] = Field(
          default="gene_symbols",
          description="The variables index for 10x mtx format. Either 'gene_symbols' or 'gene_ids'."
      )
      make_unique: bool = Field(
          default=True,
          description="Whether to make the variables index unique by appending '-1', '-2' etc. or not. Used for 10x mtx format."
      )
      gex_only: bool = Field(
          default=True,
          description="Only keep 'Gene Expression' data and ignore other feature types, e.g. 'Antibody Capture', 'CRISPR Guide Capture', or 'Custom'. Used for 10x formats."
      )
      prefix: Optional[str] = Field(
          default=None,
          description="Any prefix before matrix.mtx, genes.tsv and barcodes.tsv. For instance, if the files are named patientA_matrix.mtx, patientA_genes.tsv and patientA_barcodes.tsv the prefix is patientA_. Used for 10x mtx format."
      )

      @field_validator('backed')
      def validate_backed(cls, v: Optional[str]) -> Optional[str]:
          if v is not None and v not in ['r', 'r+']:
              raise ValueError("If backed is provided, it must be either 'r' or 'r+'")
          return v

      @field_validator('cache_compression')
      def validate_cache_compression(cls, v: Optional[str]) -> Optional[str]:
          if v is not None and v not in ['gzip', 'lzf']:
              raise ValueError("cache_compression must be either 'gzip', 'lzf', or None")
          return v

      @field_validator('var_names')
      def validate_var_names(cls, v: Optional[str]) -> Optional[str]:
          if v is not None and v not in ['gene_symbols', 'gene_ids']:
              raise ValueError("var_names must be either 'gene_symbols' or 'gene_ids'")
          return v
  ```
- src/scmcp/tool/io.py:15-19 (registration): Definition of the MCP Tool object for `read_tool`, specifying name, description, and input schema.

  ```python
  read_tool = types.Tool(
      name="read_tool",
      description="Read data from various file formats (h5ad, 10x, text files, etc.) or directory path.",
      inputSchema=ReadModel.model_json_schema(),
  )
  ```
- src/scmcp/tool/io.py:52-55 (registration): Registration of `read_tool` in the `io_tools` dictionary, which the MCP server uses for listing and dispatching tools.

  ```python
  io_tools = {
      "read_tool": read_tool,
      "write_tool": write_tool,
  }
  ```
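The handler and dispatcher listed above rely on two small patterns: forwarding only the arguments that the chosen reader declares in its signature, and picking a storage key from `sampleid` or a running counter. The sketch below reproduces both with the standard library only. `fake_read` and `fake_read_10x_mtx` are hypothetical stand-ins with made-up signatures, not the scanpy functions, and `filter_kwargs` and `next_adata_id` are illustrative names that do not appear in SCMCP.

```python
import inspect


def fake_read(filename, backed=None, delimiter=None, cache=False):
    """Stand-in for a single-file reader (hypothetical signature)."""
    return ("file", filename, backed, delimiter, cache)


def fake_read_10x_mtx(path, var_names="gene_symbols", make_unique=True, prefix=None):
    """Stand-in for a 10x directory reader (hypothetical signature)."""
    return ("10x", path, var_names, make_unique, prefix)


def filter_kwargs(func, kwargs: dict) -> dict:
    """Keep only the entries of kwargs that func declares as parameters."""
    parameters = inspect.signature(func).parameters
    return {name: kwargs[name] for name in parameters if name in kwargs}


def next_adata_id(store: dict, sampleid=None) -> str:
    """Use sampleid when given, otherwise number datasets adata0, adata1, ..."""
    return sampleid if sampleid is not None else f"adata{len(store)}"


# One flat argument payload, as the tool would receive it.
arguments = {
    "filename": "data/counts.csv",  # hypothetical path
    "delimiter": ",",
    "var_names": "gene_ids",        # only declared by the 10x stand-in
    "sampleid": "run1",             # used for the storage key, not by a reader
}

# The file reader sees filename and delimiter; var_names and sampleid are dropped.
file_kwargs = filter_kwargs(fake_read, arguments)

# For a directory, the handler copies filename into "path" before filtering.
tenx_kwargs = filter_kwargs(
    fake_read_10x_mtx, {**arguments, "path": arguments["filename"]}
)

store = {}
store[next_adata_id(store)] = fake_read(**file_kwargs)
store[next_adata_id(store, arguments["sampleid"])] = fake_read_10x_mtx(**tenx_kwargs)

print(file_kwargs)
print(tenx_kwargs)
print(list(store))
```

Because results are stored under a dictionary key, reading a second dataset with a `sampleid` that is already in use replaces the earlier entry.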