python-readability

Commit Graph

Author	SHA1	Message	Date
Éloi Rivard	e9acdd091b	Use black to format the code	4 years ago
Yuri Baburov	615ce803c6	Merge pull request #124 from dariobig/patch-1: Catch LookupError in case of bad encoding string	4 years ago
Éloi Rivard	6c1c6391e2	Fixed a few regex warnings	4 years ago
Dario	0442358942	Catch LookupError in case of bad encoding string. I've seen cases where bad encoding strings result in errors; catching LookupError should solve the problem by falling back to `chardet` or `utf-8`. Here's one case: ``` textPayload: "Traceback (most recent call last): File "/opt/conda/lib/python3.7/site-packages/readability/readability.py", line 189, in summary self._html(True) File "/opt/conda/lib/python3.7/site-packages/readability/readability.py", line 132, in _html self.html = self._parse(self.input) File "/opt/conda/lib/python3.7/site-packages/readability/readability.py", line 141, in _parse doc, self.encoding = build_doc(input) File "/opt/conda/lib/python3.7/site-packages/readability/htmls.py", line 17, in build_doc encoding = get_encoding(page) or 'utf-8' File "/opt/conda/lib/python3.7/site-packages/readability/encoding.py", line 46, in get_encoding page.decode(encoding) LookupError: unknown encoding: utf-8, ie=edge, chrome=1 ```	5 years ago
Chris Curvey	9a31587192	fix encoding detection to use the encoding being tested	7 years ago
Yuri Baburov	24bb20c761	Added dev branch features. Bumped to version 0.6	9 years ago
Martin Thurau	386e48d29b	Fixes checking of declared encodings in get_encoding. In Python 3, .decode() on bytes requires the name of the encoding to be a str, which means we have to convert the extracted encoding before we can use it.	9 years ago
Martin Thurau	046d2c10c3	Fixes regex declaration in get_encoding. Since get_encoding() is only called when the input is not already unicode, we need to declare the regexes as bytes patterns so they continue to work in Python 3.	9 years ago
Nathan Breit	75e2e0cb3a	Defaulting to utf-8 when chardet returns None. On articles like this one, chardet returns None: http://news.zing.vn/nhip-song-tre/thay-giao-gay-sot-tung-bo-luat-tinh-yeu/a291427.html This causes exceptions later on when encoding.lower() is called.	10 years ago
Mark Perdomo	3a43a3fe7e	Added code to check declared encodings first and check them from kennethreitz/requests/utils.py. Also I added some superset encodings I have found in Chinese pages that are mishandled by chardet/character declarations.	10 years ago
Yuri Baburov	11c4d95411	Fixed indentation, encoding issue and README bug. Thanks to Greg Jastrab. Bump version to 0.2.3	13 years ago
Yuri Baburov	c2ec1d1c38	Sorted out unicode issues, thanks to Lee Semel.	13 years ago
Yuri Baburov	43c34bacc1	Renamed encodings to encoding to avoid conflicts with system module.	13 years ago

13 Commits (master)
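
The behaviour these commits describe (bytes regexes, converting the declared name to str, and catching LookupError for bad encoding strings) can be sketched as below. This is a minimal illustration, not the library's actual get_encoding: the name guess_encoding, the two regexes, and the default parameter are assumptions, and the chardet fallback mentioned in the log is left out so the sketch depends only on the standard library.

```python
import re

# The input is bytes, so the patterns must be bytes patterns in Python 3.
# Both regexes are illustrative assumptions, not the ones used by the library.
CHARSET_RE = re.compile(rb'<meta[^>]*?charset=["\']*([^"\'>]+)', flags=re.I)
XML_RE = re.compile(rb'^<\?xml[^>]*?encoding=["\']*([^"\'>]+)', flags=re.I)


def guess_encoding(page: bytes, default: str = "utf-8") -> str:
    """Return the first declared encoding that decodes `page`, else `default`."""
    declared = CHARSET_RE.findall(page) + XML_RE.findall(page)
    for raw in declared:
        # bytes.decode() needs a str codec name, so convert the bytes match.
        name = raw.decode("ascii", "ignore").strip().lower()
        try:
            page.decode(name)
            return name
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the declaration does not match the content.
            # LookupError: the declared string is not a known codec name.
            continue
    # Hypothetical simplification: fall straight back to the default
    # instead of trying a detector such as chardet first.
    return default
```

With the malformed declaration from the traceback above, guess_encoding(b'<meta charset="utf-8, ie=edge, chrome=1">') falls through to the default instead of raising.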