88 Commits (master)

Author SHA1 Message Date
Yuri Baburov 73c598df81 Updated version to 0.8.1.1 4 years ago
Nabin Khadka 531ecc7a29
Changes log level #141 4 years ago
anekos 667114463d Fix UnicodeDecodeError on python2 4 years ago
Yuri Baburov 1997b80eaf
Update __init__.py 4 years ago
anekos 6842ea906e Fix causing lxml error 4 years ago
Éloi Rivard e9acdd091b Use black to format the code 4 years ago
Adrien Barbaresi bd8293eb63 code linting 4 years ago
Yuri Baburov 615ce803c6
Merge pull request #124 from dariobig/patch-1
Catch LookupError in case of bad encoding string
4 years ago
Yuri Baburov 52f767c812
Update __init__.py 4 years ago
Yuri Baburov da9e285f73
Merge pull request #128 from azmeuk/self-closing
Replaced XHTML output with HTML5 output in summary for empty elements (a, br), issue #125
4 years ago
Yuri Baburov 5032e2d3ab
Merge pull request #127 from azmeuk/warnings
Fixed a few regex warnings, thanks azmeuk!
4 years ago
Éloi Rivard f9977b727d Documentation draft 4 years ago
Éloi Rivard 0846955dd7 Fixed issue with self-closing tags. Fix #125 4 years ago
Éloi Rivard 6c1c6391e2 Fixed a few regex warnings 4 years ago
Dario 0442358942
Catch LookupError in case of bad encoding string
I've seen cases where bad encoding strings result in errors. Catching LookupError should solve the problem by falling back onto `chardet` or `utf-8`.

Here's one case:

```
 textPayload: "Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/readability/readability.py", line 189, in summary
    self._html(True)
  File "/opt/conda/lib/python3.7/site-packages/readability/readability.py", line 132, in _html
    self.html = self._parse(self.input)
  File "/opt/conda/lib/python3.7/site-packages/readability/readability.py", line 141, in _parse
    doc, self.encoding = build_doc(input)
  File "/opt/conda/lib/python3.7/site-packages/readability/htmls.py", line 17, in build_doc
    encoding = get_encoding(page) or 'utf-8'
  File "/opt/conda/lib/python3.7/site-packages/readability/encoding.py", line 46, in get_encoding
    page.decode(encoding)
LookupError: unknown encoding: utf-8, ie=edge, chrome=1
```
5 years ago
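
The fallback described in this commit can be sketched roughly as follows. This is a minimal illustration, not the project's actual `get_encoding` code: `safe_decode` and its parameters are hypothetical names, and the fallback here goes straight to `utf-8` instead of consulting `chardet`.

```python
def safe_decode(page, declared, fallback="utf-8"):
    """Decode `page` with the declared encoding; fall back if it is unusable.

    Hypothetical helper: returns a (text, encoding_used) pair.
    """
    try:
        return page.decode(declared), declared
    except (LookupError, UnicodeDecodeError):
        # LookupError is raised when the codec name itself is unknown,
        # e.g. "utf-8, ie=edge, chrome=1" from a malformed meta tag.
        return page.decode(fallback, errors="replace"), fallback
```

With the bad encoding string from the traceback above, `safe_decode(b"hello", "utf-8, ie=edge, chrome=1")` would take the `except` branch and decode as `utf-8`.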
baby5 0ac3c5bbc6 Fix compile_pattern not supporting uppercase 5 years ago
jkclee bac691a0a4 Fix #99 5 years ago
Yuri Baburov 494b19ed4e
Merge branch 'master' into many_repeated_spaces_timeout 6 years ago
Linas Valiukas 0233936e72 Add __version__ constant to __init__.py, read it in setup.py
Users wouldn't need to install, import and use Pip ("pkg_resources") to
find out which version of readability-lxml is being used.
6 years ago
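
One common way for `setup.py` to read a `__version__` constant without importing the package is to parse the source text. The sketch below assumes that approach; `read_version` is a hypothetical helper and may differ from what the commit actually does.

```python
import re


def read_version(init_source):
    """Extract the __version__ string from the text of an __init__.py file."""
    match = re.search(
        r"""^__version__\s*=\s*['"]([^'"]+)['"]""",
        init_source,
        re.MULTILINE,
    )
    if match is None:
        raise RuntimeError("no __version__ assignment found")
    return match.group(1)
```

In a `setup.py`, the argument would typically be the contents of the package's `__init__.py` read from disk, so the version lives in one place only.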
Linas Valiukas 747c46abce Trim many repeated spaces to make clean() faster
When Readability encounters many repeated whitespace characters, the cleanup
regexes in clean() take forever to run, so trim the amount of whitespace
to 255 characters.

Additionally, test the extracting performance with "timeout_decorator".
6 years ago
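
The trimming step described above can be illustrated with a short sketch. The names `MAX_RUN` and `trim_whitespace_runs` are assumptions made for this example; only the 255-character limit comes from the commit message.

```python
import re

MAX_RUN = 255  # limit taken from the commit message

# Matches any whitespace run longer than MAX_RUN characters.
_LONG_WHITESPACE = re.compile(r"\s{%d,}" % (MAX_RUN + 1))


def trim_whitespace_runs(text):
    """Collapse each over-long whitespace run to MAX_RUN spaces."""
    return _LONG_WHITESPACE.sub(" " * MAX_RUN, text)
```

Shorter runs are left untouched, so ordinary documents pass through unchanged while pathological inputs are capped before the slower cleanup regexes see them.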
Yuri Baburov f7f439d019 Improved positive_keywords and negative_keywords processing for the CLI 6 years ago
Yuri Baburov 0c8f040d53 Updated docs for positive_keywords and negative_keywords, cleaner implementation. 6 years ago
Yuri Baburov 0e50b53d05 Release version 0.7. Better HTML5 support and an important bugfix. 6 years ago
Yuri Baburov 537de2b8f6 Improved remove_unlikely_candidates following advice from issue #102 6 years ago
Chris Curvey 9a31587192 fix encoding detection to use the encoding being tested 7 years ago
Yuri Baburov e4efc87a20 Update readability.py 8 years ago
Yuri Baburov b20d5c15ef Improved Document class documentation 8 years ago
alphapapa 8443a87f5c Update readability.py 8 years ago
alphapapa 5fc2d3684a Use Mozilla User-Agent
Use a "Mozilla" user-agent to avoid HTTP 403 errors.  Fixes #71.
8 years ago
Yuri Baburov 65d1ebb06d Fixed #70 and added xpath option 9 years ago
Yuri Baburov c0d794fdd8 Update readability.py
Fixed logging namespace
9 years ago
Yuri Baburov 8ff11e68a6 Debugging improvements. Bump to 0.6.0.5 9 years ago
Yuri Baburov fcdbe563a5 Fixed #49. Bump to 0.6.0.4 9 years ago
Yuri Baburov 24bb20c761 Added dev branch features.
Bumped to version 0.6
9 years ago
Yuri Baburov 154658798b Merge pull request #64 from martinth/master
Added python 3 support (Supported: python 2.6, 2.7, 3.3, 3.4).
Thanks a lot to @martinth
9 years ago
Marko Horvatic f0ff9b2425 Move logging.basicConfig to main function 9 years ago
Yuri Baburov e2bc1ea055 Improved #65 which has given warning, added cssselect lib, bumped to 0.5.1 9 years ago
Mariusz Osiecki bf9e7404fa Failure if best_elem is root (fix #58) 9 years ago
Martin Thurau 386e48d29b Fixes checking of declared encodings in get_encoding.
In Python 3, .decode() on bytes requires the name of the encoding to be a str type, which means we have to convert the extracted encoding before we can use it.
9 years ago
Martin Thurau 046d2c10c3 Fixes regex declaration in get_encoding.
Since get_encoding() is only called when the input is *not* already unicode, we need to declare the regexes as byte type so they continue to work in Python 3.
9 years ago
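
The two `get_encoding` fixes above address the same Python 3 constraint from both sides: patterns applied to `bytes` must themselves be bytes patterns, and the encoding name passed to `.decode()` must be a `str`. A minimal sketch, where `CHARSET_RE` and `declared_encodings` are hypothetical names and the pattern is a simplified stand-in for the real ones:

```python
import re

# Bytes pattern (rb"...") so it can be applied to undecoded input.
CHARSET_RE = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def declared_encodings(page):
    """Return encoding names declared in `page` (bytes) as str objects."""
    # findall() on bytes yields bytes; convert each name to str so it can
    # later be passed to bytes.decode().
    return [name.decode("ascii") for name in CHARSET_RE.findall(page)]
```

For example, a page containing `<meta charset="ISO-8859-1">` would yield a list holding the `str` value `ISO-8859-1`, which `page.decode(...)` accepts directly.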
Martin Thurau ce7ca26835 Adds compatibility `raise_with_traceback` method to support different `raise` syntax
Unfortunately, the Python 2 `raise` syntax is not supported in Python 3.3 and not in all 3.4.x versions, so we deal with that by using conditional imports and a compatibility layer.
9 years ago
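
As a rough illustration of such a compatibility helper, the Python 3 side could look like the sketch below. The signature is an assumption made for this example, and `to_int` is a hypothetical caller; the Python 2 counterpart would live in a separate module using the old three-argument `raise` form, which does not parse under Python 3.

```python
import sys


def raise_with_traceback(exc_type, traceback, *args, **kwargs):
    """Python 3 flavour: raise a new exception carrying an existing traceback."""
    raise exc_type(*args, **kwargs).with_traceback(traceback)


def to_int(value):
    """Example caller: re-raise parse failures as RuntimeError."""
    try:
        return int(value)
    except ValueError:
        raise_with_traceback(
            RuntimeError, sys.exc_info()[2], "not a number: %r" % (value,)
        )
```

Callers import the helper from whichever module matches the running interpreter, so the calling code stays identical on both major versions.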
Martin Thurau 3ac56329e2 Corrects some things where 2to3 did too much. 9 years ago
Martin Thurau aa4132f57a Adds Python 3.4 support.
Code now supports Python 2.6, 2.7 and 3.4. Python 3.3 isn't supported
because of some issues with the parser and the difference between old and
new `raise` syntax.
9 years ago
Yuri Baburov 987570bef0 Updated package links for Python 2.7 and Python 3 support 9 years ago
Yuri Baburov 1fac7e685a Added a feature to allow more images per article (with a test) 9 years ago
Miguel Galves d04d41b749 Insert text inside iframe for correct output 9 years ago
Miguel Galves be2a1c4646 Let width and height attributes 9 years ago
Miguel Galves f1759c1404 Allows iframes containing youtube or vimeo videos. People like them 9 years ago
Yuri Baburov e4bcbe57d7 Fixes #53 9 years ago
Nathan Breit 75e2e0cb3a Defaulting to utf-8 when chardet returns None
On articles like this one chardet returns None:
http://news.zing.vn/nhip-song-tre/thay-giao-gay-sot-tung-bo-luat-tinh-yeu/a291427.html
This causes exceptions later on when encoding.lower() is called
10 years ago
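
The failure mode in this last commit can be sketched without calling `chardet` itself. `chardet.detect` returns a dict with an `encoding` key that may be `None`; the helper name `normalize_detected` and its `default` parameter are assumptions for illustration.

```python
def normalize_detected(detected, default="utf-8"):
    """Return a lowercase encoding name from a chardet-style result dict."""
    # Guard against both a missing result and an "encoding" value of None,
    # which would otherwise make .lower() raise AttributeError.
    encoding = (detected or {}).get("encoding") or default
    return encoding.lower()
```

Passing `{"encoding": None, "confidence": 0.0}` falls through to the default, so the later `.lower()` call is always made on a string.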