13 Commits (master)

Author SHA1 Message Date
Éloi Rivard e9acdd091b Use black to format the code 4 years ago
Yuri Baburov 615ce803c6
Merge pull request #124 from dariobig/patch-1
Catch LookupError in case of bad encoding string
4 years ago
Éloi Rivard 6c1c6391e2 Fixed a few regex warnings 4 years ago
Dario 0442358942
Catch LookupError in case of bad encoding string
I've seen cases where bad encoding strings result in errors. Catching LookupError should solve the problem by falling back onto `chardet` or `utf-8`.

Here's one case:

```
 textPayload: "Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/readability/readability.py", line 189, in summary
    self._html(True)
  File "/opt/conda/lib/python3.7/site-packages/readability/readability.py", line 132, in _html
    self.html = self._parse(self.input)
  File "/opt/conda/lib/python3.7/site-packages/readability/readability.py", line 141, in _parse
    doc, self.encoding = build_doc(input)
  File "/opt/conda/lib/python3.7/site-packages/readability/htmls.py", line 17, in build_doc
    encoding = get_encoding(page) or 'utf-8'
  File "/opt/conda/lib/python3.7/site-packages/readability/encoding.py", line 46, in get_encoding
    page.decode(encoding)
LookupError: unknown encoding: utf-8, ie=edge, chrome=1
```
5 years ago
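
The fallback described in this commit can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: `CHARSET_RE` and `decode_with_fallback` are hypothetical names, and the final step goes straight to `utf-8` where the real code may consult `chardet` first.

```python
import re
from typing import Tuple

# Hypothetical pattern for illustration; the real module may use different expressions.
CHARSET_RE = re.compile(rb'<meta[^>]+charset=["\']?([^"\'>]+)', re.I)


def decode_with_fallback(page: bytes, default: str = "utf-8") -> Tuple[str, str]:
    """Try each declared encoding, skipping names Python does not recognise."""
    for raw in CHARSET_RE.findall(page):
        name = raw.decode("ascii", "ignore").strip()
        try:
            return page.decode(name), name
        except (LookupError, UnicodeDecodeError):
            # LookupError: unknown codec name; UnicodeDecodeError: wrong codec.
            continue
    # Assumed last resort; a detector such as chardet could be consulted first.
    return page.decode(default, "replace"), default


# A declaration like the one in the traceback above.
page = b'<meta content="text/html; charset=utf-8, ie=edge, chrome=1"><p>hi</p>'
text, used = decode_with_fallback(page)
```

Here the malformed name `utf-8, ie=edge, chrome=1` is skipped instead of aborting the whole parse, and the page is decoded with the default.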
Chris Curvey 9a31587192 fix encoding detection to use the encoding being tested 7 years ago
Yuri Baburov 24bb20c761 Added dev branch features.
Bumped to version 0.6
9 years ago
Martin Thurau 386e48d29b Fixes checking of declared encodings in get_encoding.
In Python 3, .decode() on bytes requires the encoding name to be a str, so the extracted encoding has to be converted before it can be used.
9 years ago
Martin Thurau 046d2c10c3 Fixes regex declaration in get_encoding.
Since get_encoding() is only called when the input is *not* already unicode, the regexes need to be declared as bytes patterns so they continue to work in Python 3.
9 years ago
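
The two Python 3 fixes above (bytes patterns, then a str encoding name) fit together roughly like this. It is a sketch under assumptions: `CHARSET_RE` and `declared_encodings` are illustrative names, not the identifiers used in the repository.

```python
import re

# Hypothetical pattern; declared as bytes (rb'...') so it can search raw bytes.
# A str pattern would raise TypeError when applied to a bytes page in Python 3.
CHARSET_RE = re.compile(rb'charset=["\']?([\w-]+)', re.I)


def declared_encodings(page: bytes) -> list:
    """Return the declared encoding names as str values."""
    # findall() on a bytes pattern yields bytes, but bytes.decode() only
    # accepts a str encoding name, so each match is converted first.
    return [match.decode("ascii") for match in CHARSET_RE.findall(page)]


page = '<meta charset="utf-8"><p>caf\u00e9</p>'.encode("utf-8")
names = declared_encodings(page)
text = page.decode(names[0])
```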
Nathan Breit 75e2e0cb3a Defaulting to utf-8 when chardet returns None
On articles like this one chardet returns None:
http://news.zing.vn/nhip-song-tre/thay-giao-gay-sot-tung-bo-luat-tinh-yeu/a291427.html
This causes exceptions later on when encoding.lower() is called.
10 years ago
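
A minimal sketch of the guard this commit describes. The helper name `pick_encoding` is hypothetical, and it takes the detection result as an argument so the example runs without `chardet` installed; `chardet.detect()` returns a dict of this shape, with `"encoding"` set to `None` when nothing is recognised.

```python
def pick_encoding(detected: dict, default: str = "utf-8") -> str:
    """Normalise a chardet-style result, falling back when nothing was detected."""
    # `or default` covers both a missing key and an explicit None,
    # so the .lower() call below can no longer fail on None.
    encoding = detected.get("encoding") or default
    return encoding.lower()


undetected = pick_encoding({"encoding": None, "confidence": 0.0})
detected = pick_encoding({"encoding": "GB2312", "confidence": 0.99})
```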
Mark Perdomo 3a43a3fe7e Added code to check declared encodings first and check them
from kennethreitz/requests/utils.py. I also added some superset
encodings I have found in Chinese pages that are mishandled by
chardet/character declarations.
10 years ago
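
The superset idea in this commit can be illustrated as below. The mapping is an assumption chosen for the example (GB18030 extends GB2312, and Big5-HKSCS extends Big5); the entries actually added in the commit are not shown on this page.

```python
# Hypothetical mapping from a declared or detected encoding to a superset codec.
SUPERSETS = {
    "gb2312": "gb18030",
    "big5": "big5hkscs",
}


def widen(encoding: str) -> str:
    """Return a superset codec name when one is known, else the name itself."""
    name = encoding.lower()
    return SUPERSETS.get(name, name)


# Bytes from a page labelled gb2312 that really uses the wider repertoire.
raw = "\u9555".encode("gb18030")
text = raw.decode(widen("GB2312"))
```

Decoding with the superset accepts everything the narrower codec would, plus the extra characters such pages tend to contain.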
Yuri Baburov 11c4d95411 Fixed indentation, encoding issue and README bug. Thanks to Greg Jastrab. Bump version to 0.2.3 13 years ago
Yuri Baburov c2ec1d1c38 Sorted out unicode issues, thanks to Lee Semel. 13 years ago
Yuri Baburov 43c34bacc1 Renamed encodings to encoding to avoid conflicts with system module. 13 years ago