SEC Filing MCP Server

chunking.cpython-311.pyc (10.8 kB)
[Compiled CPython 3.11 bytecode. The binary content is not readable as text; the names, docstrings, and string constants embedded in it indicate the following. Numeric constants, including default chunk sizes, are not legible in the dump.]

Source file: /Users/sharhad/mcp/ingest/chunking.py

Imports: os, glob, sys, tiktoken, openai; List, Dict, Tuple from typing; dataclass from dataclasses; Pinecone from pinecone; tqdm from tqdm. The module also inserts a parent directory of the file into sys.path.

Filing (dataclass, all fields str):
    ticker, company_name, report_type, filing_date, file_path, summary_path, content, summary

SECFilingProcessor:

    __init__(self, pinecone_index_name: str = "mcp")
        Creates a Pinecone client from the PINECONE_API_KEY environment variable, opens the named index, and loads the tiktoken encoding for text-embedding-3-small. Maps tickers to company names: AAPL (Apple Inc.), AMZN (Amazon.com Inc.), FL (Foot Locker Inc.), KO (The Coca-Cola Company), META (Meta Platforms Inc.), MSFT (Microsoft Corporation), NVDA (NVIDIA Corporation), TSLA (Tesla Inc.).

    parse_filename(self, filename: str) -> Tuple[str, str, str]
        "Extract ticker, report type, and date from filename". Strips ".txt", splits on "_", and maps the report code "10K" to "yearly" and any other code to "quarterly".

    create_chunks(self, text: str, chunk_size: int, overlap: int) -> List[str]
        "Create overlapping chunks from text". Encodes the text to tokens, takes windows of at most chunk_size tokens, decodes each window back to text, and stops once a window reaches the end of the token list. Both numeric parameters have defaults.

    load_filing(self, file_path: str) -> Filing
        "Load a filing and its summary". Returns None for files whose name contains "_summary". Otherwise reads the filing as UTF-8, looks for a sibling file with the ".txt" suffix replaced by "_summary.txt", reads it if it exists (the summary is empty otherwise), and returns a Filing. The company name falls back to the ticker when the ticker is not in the mapping.

    embed(self, docs: List[str]) -> List[List[float]]
        "Generate embeddings for a batch of documents". Calls openai.embeddings.create with input=docs, model="text-embedding-3-small", and a dimensions argument, then returns the embedding of each result.

    process_ticker(self, ticker: str, data_dir: str, chunk_size: int, namespace: str = "default") -> int
        "Process all files for a single ticker". Globs "*.txt" under data_dir/ticker, skips summary files, and chunks each filing. Each chunk's text is prefixed with a header of the form "Summary of <company_name> <report_type> report (<filing_date>): <summary>" followed by "Content (Part <n>): <chunk>". Each record carries an id built from filing fields and the chunk index, plus metadata: ticker, company_name, report_type, filing_date, chunk_index, and original_text (a truncated copy of the chunk). Records are embedded and upserted to the Pinecone index in batches under the given namespace, with progress printed per batch. Returns the number of vectors uploaded.

    process_all_tickers(self, data_dir: str, chunk_size: int, namespace: str = "default") -> int
        "Process all tickers". Runs process_ticker for AAPL, AMZN, FL, KO, META, MSFT, NVDA, and TSLA, prints "✅ Complete: Uploaded <total_vectors> total vectors to Pinecone", and returns the total.

When run as a script (__name__ == "__main__"): sets DATA_DIR = "../data", creates a SECFilingProcessor, and calls process_all_tickers with data_dir=DATA_DIR, an explicit chunk_size, and namespace="sec-filings".
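The create_chunks method in this file carries the docstring "Create overlapping chunks from text" and works on token windows. The following is a minimal sketch of that sliding-window idea, not the file's actual source. It assumes whitespace-separated words as stand-in tokens (the original uses a tiktoken encoding), and the function names chunk_tokens and create_chunks_sketch, as well as the parameter values in the usage note, are hypothetical.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of at most chunk_size items.

    Consecutive windows share `overlap` tokens. Hypothetical helper,
    written to illustrate the technique only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        # An overlap >= chunk_size would keep the window from advancing.
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(list(tokens[start:end]))
        if end == len(tokens):
            break  # the last window reached the end of the input
        start = end - overlap  # step back so the next window overlaps
    return chunks


def create_chunks_sketch(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Chunk text using whitespace words as stand-in tokens (an assumption)."""
    words = text.split()
    return [" ".join(chunk) for chunk in chunk_tokens(words, chunk_size, overlap)]
```

With a chunk size of 3 and an overlap of 1, the seven-word input "a b c d e f g" yields three chunks, each starting with the last word of the previous one. Swapping the whitespace split for a real tokenizer's encode and decode calls would give token-accurate sizes for an embedding model.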

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/SharhadBashar/SEC-filing-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.