Glama

URL Reputation and Validity Checker

by prismon
extractors.cpython-313.pyc (7.13 kB)
Compiled CPython 3.13 bytecode for /home/josh/Projects/reputation-and-validity-checker/url_reputation_checker/extractors.py. The file is binary; the strings recoverable from it are listed below.

Module docstring: "Link extraction utilities."

Imports: re; List and Set from typing; urljoin and urlparse from urllib.parse; BeautifulSoup from bs4; validators.

class LinkExtractor
"Extract links from HTML or text content."

URL_PATTERN: https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&/=]*)
MARKDOWN_PATTERN: \[([^\]]+)\]\(([^)]+)\)

extract_links(self, content: str, content_type: str, base_url: str) -> List[str]
"Extract all links from content."
Args:
    content: The content to extract links from
    content_type: "html", "text", or "auto" (auto-detect)
    base_url: Base URL for resolving relative links
Returns:
    List of unique URLs found in the content

_detect_content_type(self, content: str) -> str
"Auto-detect content type."
Strings checked in the lowercased content: <html, <body, <a href. Result strings: html, text.

_extract_html_links(self, html_content: str, base_url: str) -> Set[str]
"Extract links from HTML content."
Parser: html.parser. Tags and attributes read: a and link (href), img and script (src), meta with http-equiv "refresh" (content, matched against url=([^;]+)).

_extract_text_links(self, text_content: str) -> Set[str]
"Extract links from plain text using regex."
Additional quoted-URL pattern: ["\']?(https?://[^"\'\s<>]+)["\']?
Markdown link targets are checked for the prefixes http:// and https://.

_is_valid_link(self, url: str) -> bool
"Check if a URL is valid and should be included."
Skipped prefixes: javascript:, mailto:, tel:, ftp:, file:, data:, about:, chrome:, edge:, and #. Prefixes checked before validation: http://, https://.

extract_domains(self, urls: List[str]) -> List[str]
"Extract unique domains from a list of URLs."
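The two regular expressions embedded in the compiled module (URL_PATTERN and MARKDOWN_PATTERN) can be tried out in isolation. The following is a minimal sketch of regex-based text link extraction, using only the standard library. The function name extract_text_links and its exact control flow are assumptions made for illustration; it is not the author's original source, which is not shown on this page.

```python
import re

# Patterns copied from the string constants visible in the compiled module.
URL_PATTERN = re.compile(
    r"https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b"
    r"(?:[-a-zA-Z0-9()@:%_\+.~#?&/=]*)",
    re.IGNORECASE,
)
MARKDOWN_PATTERN = re.compile(r"\[([^\]]+)\]\(([^)]+)\)", re.IGNORECASE)


def extract_text_links(text: str) -> list[str]:
    """Return the sorted, de-duplicated http(s) URLs found in plain text.

    Hypothetical helper: the name and structure are illustrative only.
    """
    links = set()
    # Bare URLs anywhere in the text.
    for match in URL_PATTERN.finditer(text):
        links.add(match.group(0))
    # Markdown links: group 2 is the target inside the parentheses.
    for match in MARKDOWN_PATTERN.finditer(text):
        target = match.group(2)
        if target.startswith(("http://", "https://")):
            links.add(target)
    return sorted(links)
```

Because the trailing character class of URL_PATTERN admits "." and ")", a bare URL followed directly by such punctuation is captured with that character attached, so callers may want to strip trailing punctuation before using the result.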


MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/prismon/reputation-checker-mcp'
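The same request can be made from Python. The sketch below uses only the standard library; the helper names build_server_url and fetch_server are hypothetical, and it assumes the endpoint returns a JSON object, which this page does not state.

```python
import json
import urllib.request

# Base path taken from the curl example above.
API_BASE = "https://glama.ai/api/mcp/v1/servers"


def build_server_url(owner: str, name: str) -> str:
    """Build the directory API URL for one server entry."""
    return f"{API_BASE}/{owner}/{name}"


def fetch_server(owner: str, name: str, timeout: float = 10.0) -> dict:
    """GET one server entry. Assumes the response body is a JSON object."""
    request = urllib.request.Request(build_server_url(owner, name), method="GET")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling fetch_server("prismon", "reputation-checker-mcp") issues the same GET request as the curl command.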

If you have feedback or need assistance with the MCP directory API, please join our Discord server.