Edit File: charsetprober.cpython-39.pyc
[Compiled CPython 3.9 bytecode; the binary payload is not representable as text. Readable strings recovered from the dump follow.]

Source path: /home/gouroczh/virtualenv/pat/3.9/lib/python3.9/site-packages/pip/_vendor/chardet/charsetprober.py

Imported names: Optional, Union, LanguageFilter, ProbingState. The bodies also reference re.sub and logging.getLogger.

Module-level constant: INTERNATIONAL_WORDS_PATTERN, a compiled bytes regex: [a-zA-Z]*[\x80-\xFF]+[a-zA-Z]*[^a-zA-Z\x80-\xFF]?

Class: CharSetProber

Members, in order of appearance:
  __init__(self, lang_filter)   sets _state to ProbingState.DETECTING, active to True, stores lang_filter, creates logger via logging.getLogger(__name__)
  reset(self)                   sets _state back to ProbingState.DETECTING
  charset_name                  returns None
  language                      raises NotImplementedError
  feed(self, byte_str)          raises NotImplementedError
  state                         returns _state
  get_confidence(self)          returns a float constant
  filter_high_byte_only(buf)    calls re.sub on buf, replacing matches with a single space
  filter_international_words(buf)
  (one further method taking buf; its name is not present in the readable portion)

Docstring of filter_international_words:
  We define three types of bytes:
    alphabet: english alphabets [a-zA-Z]
    international: international characters [\x80-\xFF]
    marker: everything else [^a-zA-Z\x80-\xFF]
  The input buffer can be thought to contain a series of words delimited
  by markers. This function works to filter all words that contain at
  least one international character. All contiguous sequences of markers
  are replaced by a single space ascii character.
  This filter applies to all scripts which do not use English characters.

Docstring of the final method:
  Returns a copy of ``buf`` that retains only the sequences of English
  alphabet and high byte characters that are not between <> characters.
  This filter can be applied to all scripts which contain both English
  characters and extended ASCII characters, but is currently only used by
  ``Latin1Prober``.