python-readability

Commit Graph

Author	SHA1	Message	Date
Yuri Baburov	73c598df81	Updated version to 0.8.1.1	4 years ago
Nabin Khadka	531ecc7a29	Changes log level #141	4 years ago
anekos	667114463d	Fix UnicodeDecodeError on python2	4 years ago
Yuri Baburov	1997b80eaf	Update __init__.py	4 years ago
anekos	6842ea906e	Fix causing lxml error	4 years ago
Éloi Rivard	e9acdd091b	Use black to format the code	4 years ago
Adrien Barbaresi	bd8293eb63	code linting	4 years ago
Yuri Baburov	615ce803c6	Merge pull request #124 from dariobig/patch-1 Catch LookupError in case of bad encoding string	4 years ago
Yuri Baburov	52f767c812	Update __init__.py	4 years ago
Yuri Baburov	da9e285f73	Merge pull request #128 from azmeuk/self-closing Replaced XHTML output with HTML5 output in summary for empty elements (a, br), issue #125	4 years ago
Yuri Baburov	5032e2d3ab	Merge pull request #127 from azmeuk/warnings Fixed a few regex warnings, thanks azmeuk !	4 years ago
Éloi Rivard	f9977b727d	Documentation draft	4 years ago
Éloi Rivard	0846955dd7	Fixed issue with self-closing tags. Fix #125	4 years ago
Éloi Rivard	6c1c6391e2	Fixed a few regex warnings	4 years ago
Dario	0442358942	Catch LookupError in case of bad encoding string I've seen cases where bad encoding strings will result in errors, catching LookupError should solve the problem by falling back onto `chardet` or `utf-8` Here's one case: ``` textPayload: "Traceback (most recent call last): File "/opt/conda/lib/python3.7/site-packages/readability/readability.py", line 189, in summary self._html(True) File "/opt/conda/lib/python3.7/site-packages/readability/readability.py", line 132, in _html self.html = self._parse(self.input) File "/opt/conda/lib/python3.7/site-packages/readability/readability.py", line 141, in _parse doc, self.encoding = build_doc(input) File "/opt/conda/lib/python3.7/site-packages/readability/htmls.py", line 17, in build_doc encoding = get_encoding(page) or 'utf-8' File "/opt/conda/lib/python3.7/site-packages/readability/encoding.py", line 46, in get_encoding page.decode(encoding) LookupError: unknown encoding: utf-8, ie=edge, chrome=1 ```	5 years ago
baby5	0ac3c5bbc6	Fix compile_pattern not support uppercase	5 years ago
jkclee	bac691a0a4	Fix #99	5 years ago
Yuri Baburov	494b19ed4e	Merge branch 'master' into many_repeated_spaces_timeout	6 years ago
Linas Valiukas	0233936e72	Add __version__ constant to __init__.py, read it in setup.py Users wouldn't need to install, import and use Pip ("pkg_resources") to find out which version of readability-lxml is being used.	6 years ago
Linas Valiukas	747c46abce	Trim many repeated spaces to make clean() faster When Readability encounters many repeated whitespace, the cleanup regexes in clean() take forever to run, so trim the amount of whitespace to 255 characters. Additionally, test the extracting performance with "timeout_decorator".	6 years ago
Yuri Baburov	f7f439d019	Improved positive_keywords and negative_keywords processing for the CLI	6 years ago
Yuri Baburov	0c8f040d53	Updated docs for positive_keywords and negative_keywords, cleaner implementation.	6 years ago
Yuri Baburov	0e50b53d05	Release version 0.7 . Better HTML5 support and an important bugfix.	6 years ago
Yuri Baburov	537de2b8f6	Improved remove_unlikely_candidates following an advice from issue #102	6 years ago
Chris Curvey	9a31587192	fix encoding detection to use the encoding being tested	7 years ago
Yuri Baburov	e4efc87a20	Update readability.py	8 years ago
Yuri Baburov	b20d5c15ef	Improved Document class documentation	8 years ago
alphapapa	8443a87f5c	Update readability.py	8 years ago
alphapapa	5fc2d3684a	Use Mozilla User-Agent Use a "Mozilla" user-agent to avoid HTTP 403 errors. Fixes #71.	8 years ago
Yuri Baburov	65d1ebb06d	Fixed #70 and added xpath option	9 years ago
Yuri Baburov	c0d794fdd8	Update readability.py Fixed logging namespace	9 years ago
Yuri Baburov	8ff11e68a6	Debugging improvements. Bump to 0.6.0.5	9 years ago
Yuri Baburov	fcdbe563a5	Fixed #49 . Bump to 0.6.0.4	9 years ago
Yuri Baburov	24bb20c761	Added dev branch features. Bumped to version 0.6	9 years ago
Yuri Baburov	154658798b	Merge pull request #64 from martinth/master Added python 3 support (Supported: python 2.6, 2.7, 3.3, 3.4). Thanks a lot to @martinth	9 years ago
Marko Horvatic	f0ff9b2425	Move logging.basicConfig to main function	9 years ago
Yuri Baburov	e2bc1ea055	Improved #65 which has given warning, added cssselect lib, bumped to 0.5.1	9 years ago
Mariusz Osiecki	bf9e7404fa	Failure if best_elem is root (fix #58 )	9 years ago
Martin Thurau	386e48d29b	Fixes checking of declared encodings in get_encoding. In Python 3, .decode() on bytes requires the name of the encoding to be a str type, which means we have to convert the extracted encoding before we can use it.	9 years ago
Martin Thurau	046d2c10c3	Fixes regex declaration in get_encoding. Since get_encoding() is only called when the input is not already unicode we need to declare the regexs as byte type so they continue to work in Python 3.	9 years ago
Martin Thurau	ce7ca26835	Adds compatibility `raise_with_traceback` method to support different `raise` syntax Unfortunately the Python 2 `raise` syntax is not supported in Python 3.3 and not all 3.4.x versions so we deal with that by using conditional imports and a compatibility layer.	9 years ago
Martin Thurau	3ac56329e2	Corrects some things where 2to3 did too much.	9 years ago
Martin Thurau	aa4132f57a	Adds Python 3.4 support. Code now supports Python 2.6, 2.7 and 3.4. Python 3.3 isn't supported because of some issues with the parser and the difference between old and new `raise` syntax.	9 years ago
Yuri Baburov	987570bef0	Updated package links for Python 2.7 and Python 3 support	9 years ago
Yuri Baburov	1fac7e685a	Added a feature to allow more images per article (with a test)	9 years ago
Miguel Galves	d04d41b749	Insert text inside iframe for correct output	9 years ago
Miguel Galves	be2a1c4646	Let width and height attributes	9 years ago
Miguel Galves	f1759c1404	Allows iframes containing youtube or vimeo videos. People like them	9 years ago
Yuri Baburov	e4bcbe57d7	Fixes #53	9 years ago
Nathan Breit	75e2e0cb3a	Defaulting to utf-8 when chardet returns None On articles like this one chardet returns None: http://news.zing.vn/nhip-song-tre/thay-giao-gay-sot-tung-bo-luat-tinh-yeu/a291427.html This causes exceptions later on when encoding.lower() is called	10 years ago
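
Several of the commits above (0442358942, 386e48d29b, 046d2c10c3, 75e2e0cb3a) describe the same idea: try the encoding declared in the page, skip it if it is unknown or fails to decode, then fall back to a detector and finally to utf-8. The following is a minimal sketch of that fallback chain, not the library's actual code. The names `guess_encoding`, `CHARSET_RE`, and the `detect` callable are hypothetical and exist only for illustration.

```python
import re

# Hypothetical pattern: declared as bytes because the input is bytes,
# as the commit about regex declarations in Python 3 describes.
CHARSET_RE = re.compile(rb'charset=["\']?([^"\'>]+)', re.I)


def guess_encoding(page, detect=None):
    """Return an encoding name for `page` (bytes), defaulting to utf-8.

    `detect` is an assumed callable taking bytes and returning an
    encoding name or None; it stands in for a detector such as chardet.
    """
    for raw in CHARSET_RE.findall(page):
        # Python 3's bytes.decode() needs the encoding name as str.
        declared = raw.decode("ascii", "ignore").strip()
        try:
            page.decode(declared)
            return declared
        except (UnicodeDecodeError, LookupError):
            # LookupError covers bad declarations such as
            # "utf-8, ie=edge, chrome=1"; move on to the next candidate.
            continue
    if detect is not None:
        guessed = detect(page)
        if guessed:  # the detector may return None
            return guessed.lower()
    return "utf-8"
```

With chardet installed, a detector could be passed as `detect=lambda data: chardet.detect(data)["encoding"]`. Keeping the detector as a parameter is a design choice for this sketch only, so that it runs without third-party packages.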

88 Commits (master)