"""
Centralized dataset processing module for both local and server use.
"""

import os
from io import BytesIO
from typing import Dict, List, Any, Tuple

import pandas as pd

# The compiled module has one more top-level import, and it also imports
# get_drive_service, download_file_from_drive, extract_metadata_from_dataframe,
# suggest_dq_rules, create_contract_excel and generate_dq_report from a project
# module. Neither module name is recoverable from this excerpt.


def create_dataset_readme(
    metadata: Dict[str, Any],
    dq_rules: List[Dict[str, Any]],
    dq_report: Dict[str, Any],
    readme_path: str,
):
    """Create a comprehensive README file for the processed dataset."""
    # Header and the opening rows of the column table.
    content = f"""# Dataset Processing Report

## Dataset Information
- **Filename**: {metadata['filename']}
- **Rows**: {metadata['row_count']:,}
- **Columns**: {metadata['column_count']}
- **Processing Date**: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}

## Column Details
| Column Name | Data Type | Null Count | Null % | Unique Count | Min Value | Max Value | Mean |
|-------------|-----------|------------|--------|--------------|-----------|-----------|------|
"""

    # One table row per column; statistics that are missing render as "N/A".
    for col in metadata['columns']:
        min_val = f"{col.get('min_value', 'N/A'):.4f}" if col.get('min_value') is not None else "N/A"
        max_val = f"{col.get('max_value', 'N/A'):.4f}" if col.get('max_value') is not None else "N/A"
        mean_val = f"{col.get('mean_value', 'N/A'):.4f}" if col.get('mean_value') is not None else "N/A"

        content += (
            f"| {col['name']} | {col['data_type']} | {col['null_count']} | "
            f"{col['null_percentage']:.1f}% | {col['unique_count']} | "
            f"{min_val} | {max_val} | {mean_val} |\n"
        )

    # Rule counts from the data quality report.
    content += f"""
## Data Quality Summary
- **Total Rules**: {dq_report['data_quality_summary']['total_rules']}
- **Error Rules**: {dq_report['data_quality_summary']['error_rules']}
- **Warning Rules**: {dq_report['data_quality_summary']['warning_rules']}

## Data Quality Rules
"""

    # One bullet per rule, marked red for errors and yellow otherwise.
    for rule in dq_rules:
        severity_icon = "🔴" if rule['severity'] == 'error' else "🟡"
        content += f"- {severity_icon} **{rule['rule_type'].upper()}**: {rule['description']}\n"

    content += "\n## Column Quality Metrics\n"

    for col_quality in dq_report['column_quality']:
        content += (
            f"- **{col_quality['column_name']}**: "
            f"{col_quality['completeness']:.1f}% complete, "
            f"{col_quality['uniqueness']:.1f}% unique\n"
        )

    # Generated artifacts are named after the dataset filename up to its first dot.
    content += f"""
## Files Generated
- `{metadata['filename']}` - Original dataset
- `{metadata['filename'].split('.')[0]}_metadata.json` - Detailed metadata
- `{metadata['filename'].split('.')[0]}_contract.xlsx` - Data contract with schema and DQ rules
- `{metadata['filename'].split('.')[0]}_dq_report.json` - Comprehensive data quality report
- `README.md` - This summary document

## Usage
This dataset has been processed through the MCP Dataset Onboarding pipeline. All artifacts are ready for catalog publication or further analysis.
"""

    with open(readme_path, 'w', encoding='utf-8') as f:
        f.write(content)

�k�3�� 1� �Q� ������ � �s �G/�/G8�file_id�
output_folder�returnc � � t d| � �� t d� t � }t | |� }|j � j | �� j � }|d }t d|� �� |j
d� d }t j j ||� }t j |d� � t d
|� �� |j � j d� rt j t |� � }nI|j � j d� rt j t |� � }nt# d
� �t% ||� } t d| d � d| d � d�� t d� t'