Web page fetching tool implementation.

Imported names:
- logging, requests
- typing: Any, Dict, Optional
- DEFAULT_MAX_LENGTH, DEFAULT_USER_AGENT_AUTONOMOUS, DEFAULT_USER_AGENT_MANUAL, PROXY_URL
- check_robots_txt, html_to_markdown

Parameters: url, max_length, start_index, raw, check_robots, user_agent, timeout, use_proxy

Fetch web page content.

This function fetches web pages and converts them to readable formats,
with support for chunked reading, robots.txt checking, and proxy configuration.
Args:
url: Target URL to fetch.
max_length: Maximum characters to return (default: 5000).
start_index: Starting character index for chunked reading (default: 0).
raw: Return raw HTML instead of Markdown (default: False).
check_robots: Check robots.txt before fetching (default: False).
user_agent: Custom User-Agent string.
timeout: Request timeout in seconds (default: 15).
use_proxy: Use proxy at PROXY_URL (default: True).
Returns:
Dictionary containing:
- content: The page content (Markdown or HTML)
- is_truncated: Whether content was truncated
- next_index: Next starting index if truncated, None otherwise
- total_length: Total content length
- format: "markdown" or "html"
Raises:
Exception: On network errors, proxy failures, or robots.txt disallow.
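
The chunked-reading contract in the Returns section above can be sketched in isolation from the network code. This is a minimal illustration, not the module's own implementation: the function names `slice_content` and `read_all` are hypothetical, and the default of 5000 is taken from the docstring rather than from `DEFAULT_MAX_LENGTH` itself.

```python
from typing import Any, Dict, List, Optional


def slice_content(full_content: str, start_index: int = 0,
                  max_length: int = 5000,
                  content_format: str = "markdown") -> Dict[str, Any]:
    """Return one chunk of full_content plus paging metadata (hypothetical helper)."""
    total_length = len(full_content)

    # A start index at or past the end yields an empty, non-truncated result.
    if start_index >= total_length:
        return {"content": "", "is_truncated": False, "next_index": None,
                "total_length": total_length, "format": content_format}

    remaining = full_content[start_index:]
    is_truncated = len(remaining) > max_length
    content = remaining[:max_length] if is_truncated else remaining
    # next_index is only meaningful when more content is left to read.
    next_index: Optional[int] = start_index + max_length if is_truncated else None

    return {"content": content, "is_truncated": is_truncated,
            "next_index": next_index, "total_length": total_length,
            "format": content_format}


def read_all(text: str, chunk_size: int) -> List[str]:
    """Follow next_index until the whole text has been read (hypothetical helper)."""
    chunks: List[str] = []
    index: Optional[int] = 0
    while index is not None:
        page = slice_content(text, start_index=index, max_length=chunk_size)
        chunks.append(page["content"])
        index = page["next_index"]
    return chunks
```

A caller would pass the returned `next_index` back as `start_index` on the following call and stop once `is_truncated` is false.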

Request headers:
- User-Agent: value of user_agent
- Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
- Accept-Language: zh-CN,zh;q=0.9,en;q=0.8
- Accept-Encoding: gzip, deflate, br
- Connection: keep-alive
- Upgrade-Insecure-Requests: 1

Request options:
- allow_redirects: True
- proxies: "http" and "https" entries, set only when use_proxy is true
- encoding: response.apparent_encoding is used when response.encoding is None or "ISO-8859-1"

Log messages:
- Fetching page: {url} (raw={raw}, start={start_index}, max={max_length}, proxy={use_proxy})
- Using proxy: {proxies}
- Fetch successful - format: {content_format}, total length: {total_length}, returned length: {len(content)}, truncated: {is_truncated}

Error messages:
- According to robots.txt rules, automatic fetching of this URL is not allowed: {url}
  Suggestion: You can manually visit this URL or use a browser to view the content.
- Request timeout ({timeout}s): {url}
- Proxy connection failed: {e}
  Please check:
  1. Is the proxy server {PROXY_URL} running
  2. Is the proxy configuration correct
- Connection failed: {e}
  Note: Currently using proxy {PROXY_URL}, if the proxy is not running, please disable proxy or check proxy settings
- Request failed: {e}