python-readability/readability/encoding.py

import re
import chardet
import sys


RE_CHARSET = re.compile(br'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
RE_PRAGMA = re.compile(br'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
RE_XML = re.compile(br'^<\?xml.*?encoding=["\']*(.+?)["\'>]')

CHARSETS = {
    "big5": "big5hkscs",
    "gb2312": "gb18030",
    "ascii": "utf-8",
    "maccyrillic": "cp1251",
    "win1251": "cp1251",
    "win-1251": "cp1251",
    "windows-1251": "cp1251",
}


def fix_charset(encoding):
    """Map a charset name to a superset codec when one is known.

    Pages often declare (or chardet detects) a charset that is only a
    subset of the one actually in use, e.g. gb2312 where gb18030 is needed.
    Created because of issues with Chinese websites.
    """
    encoding = encoding.lower()
    return CHARSETS.get(encoding, encoding)


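def get_encoding(page):
    """Return the name of the encoding to use for `page` (bytes).

    Declared encodings (HTML meta tags, XML prolog) are tried first. If none
    of them decodes the page, chardet guesses from the tag-stripped text,
    with utf-8 as the final fallback.
    """
    # Collect charset declarations from HTML meta tags and the XML prolog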
    declared_encodings = (
        RE_CHARSET.findall(page) + RE_PRAGMA.findall(page) + RE_XML.findall(page)
    )

    # Try any declared encodings
    for declared_encoding in declared_encodings:
        try:
            if sys.version_info[0] >= 3:
                # findall() on a bytes pattern returns bytes, but bytes.decode()
                # requires the codec name as str. Decode it as ASCII, since
                # encoding names should never contain non-ASCII characters.
                declared_encoding = declared_encoding.decode("ascii", "replace")

            encoding = fix_charset(declared_encoding)

            # Now let's decode the page
            page.decode(encoding)
            # It worked!
            return encoding
        except (UnicodeDecodeError, LookupError):
            pass

    # Fall back to chardet if no declared encoding works.
    # Strip all HTML tags and leave only the text for chardet.
    text = re.sub(br"(\s*</?[^>]*>)+\s*", b" ", page).strip()
    enc = "utf-8"
    if len(text) < 10:
        return enc  # too little text to guess from
    res = chardet.detect(text)
    enc = res["encoding"] or "utf-8"
    # print("->", enc, "%.2f" % res["confidence"])
    enc = fix_charset(enc)
    return enc
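
The declared-encoding step above can be illustrated in isolation. The sketch below is not part of the module: `declared_encoding` is a hypothetical helper, and it repeats the `RE_CHARSET` pattern plus a trimmed copy of `CHARSETS` so that it runs without chardet installed. It shows the gb2312 alias being widened to gb18030, and a malformed charset string being rejected through `LookupError` rather than crashing.

```python
import re

# Copied from the module above so the sketch is self-contained.
RE_CHARSET = re.compile(br'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
# Trimmed copy of CHARSETS; only the entries this example needs.
CHARSETS = {"gb2312": "gb18030", "ascii": "utf-8"}


def declared_encoding(page):
    """Hypothetical helper: first declared charset that decodes `page`, else None."""
    for raw in RE_CHARSET.findall(page):
        # The regex yields bytes; codec names must be str.
        name = raw.decode("ascii", "replace").lower()
        name = CHARSETS.get(name, name)
        try:
            page.decode(name)
            return name
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown codec name: try the next declaration.
            continue
    return None


# A page that declares gb2312 but is encoded as gb18030.
page = '<html><head><meta charset="gb2312"></head><body>中文</body></html>'.encode("gb18030")
# A malformed declaration of the kind that raises LookupError.
bad = b'<meta charset="utf-8, ie=edge, chrome=1">'

result_good = declared_encoding(page)
result_bad = declared_encoding(bad)
```

In the real `get_encoding`, a `None` result here corresponds to falling through to the chardet guess.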