URL Reputation and Validity Checker

extractors.cpython-313.pyc•6.97 KiB

Compiled CPython 3.13 bytecode for url_reputation_checker/extractors.py (source path recorded in the file: /home/josh/Projects/reputation-and-validity-checker/url_reputation_checker/extractors.py). The bytecode itself is binary; the strings that remain readable in it are listed here.

Module docstring: "Link extraction utilities."

Imports: re; List, Set (typing); urljoin, urlparse (urllib.parse); BeautifulSoup (bs4); validators

class LinkExtractor: "Extract links from HTML or text content."

Class-level patterns:
URL_PATTERN (compiled with re.IGNORECASE): https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&/=]*)
MARKDOWN_PATTERN: \[([^\]]+)\]\(([^)]+)\)

Methods:

extract_links(content: str, content_type: str = "auto", base_url: str = None) -> List[str]
"Extract all links from content."
Args:
content: The content to extract links from
content_type: "html", "text", or "auto" (auto-detect)
base_url: Base URL for resolving relative links
Returns: List of unique URLs found in the content
When content_type is "auto", the type is detected first. HTML links are collected only for "html" content, text links are always collected, and only links that pass _is_valid_link are returned, as a sorted list.

_detect_content_type(content: str) -> str
"Auto-detect content type."
Returns "html" if the lowercased content contains "<html", "<body", or "<a href"; otherwise "text".

_extract_html_links(html_content: str, base_url: str = None) -> Set[str]
"Extract links from HTML content."
Parses with BeautifulSoup("html.parser"), reads href and src attributes (tag names visible in the bytecode: a, img, script), and resolves each against base_url with urljoin when a base URL is given. Also handles meta tags with http-equiv="refresh" by matching url=([^;]+) in the content attribute and stripping surrounding quotes. Exceptions are caught and the links collected so far are returned.

_extract_text_links(text_content: str) -> Set[str]
"Extract links from plain text using regex."
Collects URL_PATTERN matches, the targets of MARKDOWN_PATTERN matches that start with http:// or https://, and matches of a quoted-URL pattern: ["\']?(https?://[^"\'\s<>]+)["\']?

_is_valid_link(url: str) -> bool
"Check if a URL is valid and should be included."
Rejects empty URLs, URLs beginning with javascript:, mailto:, tel:, ftp:, file:, data:, about:, chrome:, or edge:, and URLs beginning with #. URLs beginning with http:// or https:// are checked with validators.url.

extract_domains(urls: List[str]) -> List[str]
"Extract unique domains from a list of URLs."
Adds the lowercased urlparse(url).hostname of each URL to a set, skips URLs that raise, and returns the result as a sorted list.
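The docstrings and constants readable in the compiled module above describe a pipeline: detect the content type, collect href/src values and plain-text URLs, then filter out unwanted schemes. The following is a minimal sketch of that flow, not the original source. It assumes the standard-library html.parser in place of BeautifulSoup, a simplified URL regex (URL_RE) in place of URL_PATTERN, and a netloc check in place of the validators package. The helper names _HrefSrcCollector, URL_RE and SKIP_PREFIXES are hypothetical.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Prefixes the compiled module lists as skipped.
SKIP_PREFIXES = ("javascript:", "mailto:", "tel:", "ftp:", "file:",
                 "data:", "about:", "chrome:", "edge:")

# Simplified stand-in for URL_PATTERN (assumption, not the original regex).
URL_RE = re.compile(r"https?://[^\s\"'<>)\]]+", re.IGNORECASE)


class _HrefSrcCollector(HTMLParser):
    """Collect href/src attribute values from a, link, img and script tags."""

    WANTED = {"a": "href", "link": "href", "img": "src", "script": "src"}

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        wanted_attr = self.WANTED.get(tag)
        if wanted_attr is None:
            return
        for name, value in attrs:
            if name == wanted_attr and value:
                self.found.append(value)


def detect_content_type(content):
    """Return "html" if the content contains common HTML markers, else "text"."""
    lowered = content.lower()
    if "<html" in lowered or "<body" in lowered or "<a href" in lowered:
        return "html"
    return "text"


def is_valid_link(url):
    """Keep only http(s) URLs that have a host; drop skipped schemes and fragments."""
    if not url:
        return False
    lowered = url.lower()
    if lowered.startswith(SKIP_PREFIXES) or lowered.startswith("#"):
        return False
    if lowered.startswith(("http://", "https://")):
        return bool(urlparse(url).netloc)  # stand-in for validators.url
    return False


def extract_links(content, content_type="auto", base_url=None):
    """Return a sorted list of unique, valid links found in the content."""
    if content_type == "auto":
        content_type = detect_content_type(content)
    links = set()
    if content_type == "html":
        parser = _HrefSrcCollector()
        parser.feed(content)
        for value in parser.found:
            links.add(urljoin(base_url, value) if base_url else value)
    # Plain-text URLs are collected for every content type.
    links.update(match.group(0) for match in URL_RE.finditer(content))
    return sorted(link for link in links if is_valid_link(link))
```

Calling extract_links(page_html, base_url="https://example.com") resolves relative href values against the base URL before filtering, so a link such as /docs is kept while mailto: links are dropped.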

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/prismon/reputation-checker-mcp'
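The same request can be made from Python with the standard library. This is a sketch only: the base path is taken from the curl command above, the helper names server_url and fetch_server are hypothetical, and the response is assumed to be JSON, with no assumption about its fields.

```python
import json
import urllib.request

# Base path copied from the curl command in the page.
API_BASE = "https://glama.ai/api/mcp/v1/servers"


def server_url(owner, name):
    """Build the directory API URL for one server."""
    return f"{API_BASE}/{owner}/{name}"


def fetch_server(owner, name, timeout=10.0):
    """GET the server record and decode it as JSON (assumed response format)."""
    request = urllib.request.Request(server_url(owner, name), method="GET")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For the server on this page, the call would be fetch_server("prismon", "reputation-checker-mcp").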

If you have feedback or need assistance with the MCP directory API, please join our Discord server.