
extract_douyin_text

Extract the spoken text of a Douyin video from a share link. Accepts an optional speech recognition model for transcription (paraformer-v2 by default). Requires the DASHSCOPE_API_KEY environment variable to be set.

Instructions

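Extracts the text content of a video from a Douyin share link.
Parameters:
- share_link: Douyin share link, or text containing the link
- model: speech recognition model (optional; defaults to paraformer-v2)
Returns:
- The extracted text content
Note: the DASHSCOPE_API_KEY environment variable must be set.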

Input Schema

Name       | Required | Description | Default
model      | No       |             |
share_link | Yes      |             |
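
As an illustration of how the two parameters map onto a request, the sketch below builds an MCP `tools/call` JSON-RPC payload for this tool. The helper name `build_tool_call`, the `request_id` default, and the example share link are hypothetical; only the tool name and the `share_link` / `model` arguments come from the schema above.

```python
from typing import Optional


def build_tool_call(share_link: str, model: Optional[str] = None, request_id: int = 1) -> dict:
    """Build a JSON-RPC `tools/call` payload for extract_douyin_text (illustrative helper)."""
    if not share_link:
        # share_link is the only required argument in the schema
        raise ValueError("share_link is required")

    arguments = {"share_link": share_link}
    if model is not None:
        # model is optional; when omitted, the server picks its own default
        arguments["model"] = model

    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "extract_douyin_text", "arguments": arguments},
    }


# Hypothetical share link, used only to show the payload shape
payload = build_tool_call("https://v.douyin.com/EXAMPLE/", model="paraformer-v2")
```

Leaving `model` out of `arguments` entirely, rather than sending `null`, keeps the payload minimal and lets the server apply its default.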

Implementation Reference

  • Main handler function for the extract_douyin_text tool. Decorated with @mcp.tool() for automatic registration in the FastMCP server. Orchestrates URL parsing and text extraction using DouyinProcessor.
    @mcp.tool()
    async def extract_douyin_text(
        share_link: str,
        model: Optional[str] = None,
        ctx: Context = None
    ) -> str:
        """
        Extract the text content of a video from a Douyin share link.

        Parameters:
        - share_link: Douyin share link, or text containing the link
        - model: speech recognition model (optional, defaults to paraformer-v2)

        Returns:
        - The extracted text content

        Note: the DASHSCOPE_API_KEY environment variable must be set.
        """
        try:
            # Read the API key from the environment
            api_key = os.getenv('DASHSCOPE_API_KEY')
            if not api_key:
                raise ValueError("DASHSCOPE_API_KEY environment variable is not set; add your Alibaba Cloud Bailian API key to the configuration")

            processor = DouyinProcessor(api_key, model)

            # Parse the video link
            ctx.info("Parsing Douyin share link...")
            video_info = processor.parse_share_url(share_link)

            # Extract text directly from the video URL
            ctx.info("Extracting text from video...")
            text_content = processor.extract_text_from_video_url(video_info['url'])

            ctx.info("Text extraction complete!")
            return text_content

        except Exception as e:
            ctx.error(f"Error during processing: {str(e)}")
            raise Exception(f"Failed to extract Douyin video text: {str(e)}")
  • Helper method in DouyinProcessor class that parses Douyin share text/link to extract no-watermark video URL, title, and ID by scraping the page and parsing JSON data.
    def parse_share_url(self, share_text: str) -> dict:
        """Extract the no-watermark video URL from the share text."""
        # Extract the share link
        urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', share_text)
        if not urls:
            raise ValueError("No valid share link found")

        share_url = urls[0]
        share_response = requests.get(share_url, headers=HEADERS)
        video_id = share_response.url.split("?")[0].strip("/").split("/")[-1]
        share_url = f'https://www.iesdouyin.com/share/video/{video_id}'

        # Fetch the video page
        response = requests.get(share_url, headers=HEADERS)
        response.raise_for_status()

        pattern = re.compile(
            pattern=r"window\._ROUTER_DATA\s*=\s*(.*?)</script>",
            flags=re.DOTALL,
        )
        find_res = pattern.search(response.text)

        if not find_res or not find_res.group(1):
            raise ValueError("Failed to parse video info from the HTML")

        # Parse the JSON data
        json_data = json.loads(find_res.group(1).strip())
        VIDEO_ID_PAGE_KEY = "video_(id)/page"
        NOTE_ID_PAGE_KEY = "note_(id)/page"

        if VIDEO_ID_PAGE_KEY in json_data["loaderData"]:
            original_video_info = json_data["loaderData"][VIDEO_ID_PAGE_KEY]["videoInfoRes"]
        elif NOTE_ID_PAGE_KEY in json_data["loaderData"]:
            original_video_info = json_data["loaderData"][NOTE_ID_PAGE_KEY]["videoInfoRes"]
        else:
            raise Exception("Could not parse video or image-gallery info from the JSON")

        data = original_video_info["item_list"][0]

        # Collect the video info
        video_url = data["video"]["play_addr"]["url_list"][0].replace("playwm", "play")
        desc = data.get("desc", "").strip() or f"douyin_{video_id}"

        # Replace characters that are illegal in file names
        desc = re.sub(r'[\\/:*?"<>|]', '_', desc)

        return {
            "url": video_url,
            "title": desc,
            "video_id": video_id
        }
  • Core helper method in DouyinProcessor that extracts text from the video URL using the DashScope ASR Transcription API (Alibaba Cloud). Starts an asynchronous transcription task, waits for it to finish, and returns the text result.
    def extract_text_from_video_url(self, video_url: str) -> str:
        """Extract text from a video URL (using the Alibaba Cloud Bailian API)."""
        try:
            # Start the asynchronous transcription task
            task_response = dashscope.audio.asr.Transcription.async_call(
                model=self.model,
                file_urls=[video_url],
                language_hints=['zh', 'en']
            )

            # Wait for the transcription to finish
            transcription_response = dashscope.audio.asr.Transcription.wait(
                task=task_response.output.task_id
            )

            if transcription_response.status_code == HTTPStatus.OK:
                # Fetch the transcription result
                for transcription in transcription_response.output['results']:
                    url = transcription['transcription_url']
                    result = json.loads(request.urlopen(url).read().decode('utf8'))

                    # Save the result to a temporary file
                    temp_json_path = self.temp_dir / 'transcription.json'
                    with open(temp_json_path, 'w') as f:
                        json.dump(result, f, indent=4, ensure_ascii=False)

                    # Extract the text content
                    if 'transcripts' in result and len(result['transcripts']) > 0:
                        return result['transcripts'][0]['text']
                    else:
                        return "No text recognized"

            else:
                raise Exception(f"Transcription failed: {transcription_response.output.message}")

        except Exception as e:
            raise Exception(f"Error while extracting text: {str(e)}")

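The network-free parts of the helpers above can be tried in isolation. The following is a minimal sketch, not the server's actual code: it reuses the URL regex and the ID-splitting expression from `parse_share_url`, and the transcript lookup from the transcription helper. The function names (`first_share_url`, `video_id_from_url`, `transcript_text`), the sample share text, and the sample video ID are hypothetical.

```python
import re
from typing import Any, Dict

# Same URL pattern as in parse_share_url
URL_PATTERN = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)


def first_share_url(share_text: str) -> str:
    """Return the first URL embedded in a share message."""
    match = URL_PATTERN.search(share_text)
    if match is None:
        raise ValueError("No valid share link found")
    return match.group(0)


def video_id_from_url(resolved_url: str) -> str:
    """Take the last path segment of the redirected URL, ignoring the query string."""
    return resolved_url.split("?")[0].strip("/").split("/")[-1]


def transcript_text(result: Dict[str, Any], fallback: str = "No text recognized") -> str:
    """Pick the first transcript's text from a transcription result dict."""
    transcripts = result.get("transcripts") or []
    if transcripts:
        return transcripts[0]["text"]
    return fallback


# Hypothetical inputs, for illustration only
share_text = "7.43 复制打开抖音 https://v.douyin.com/abc123/ 看看"
link = first_share_url(share_text)
vid = video_id_from_url("https://www.iesdouyin.com/share/video/7300000000000000000/?region=CN")
text = transcript_text({"transcripts": [{"text": "hello"}]})
```

Splitting these steps out of the network calls makes them easy to unit test; the HTTP redirect, page scrape, and DashScope request still need live services.
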
MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/yzfly/douyin-mcp-server'
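
The same GET request can be issued from Python with the standard library. This is a rough sketch: the helper names are hypothetical, and it assumes the endpoint responds with a JSON body, which the text above does not state.

```python
import json
from urllib import request

API_BASE = "https://glama.ai/api/mcp/v1/servers"


def build_server_request(owner: str, repo: str) -> request.Request:
    """Build (but do not send) the GET request for one server entry."""
    return request.Request(f"{API_BASE}/{owner}/{repo}", method="GET")


def fetch_server_info(owner: str, repo: str, timeout: float = 10.0) -> dict:
    """Send the request and decode the body, assuming the response is JSON."""
    with request.urlopen(build_server_request(owner, repo), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Builds the request for the server shown in the curl example; nothing is sent here
req = build_server_request("yzfly", "douyin-mcp-server")
```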
