65 Commits (master)

Author SHA1 Message Date
Nabin Khadka 531ecc7a29 Changes log level #141 4 years ago
anekos 667114463d Fix UnicodeDecodeError on python2 4 years ago
anekos 6842ea906e Fix causing lxml error 4 years ago
Éloi Rivard e9acdd091b Use black to format the code 4 years ago
Adrien Barbaresi bd8293eb63 code linting 4 years ago
Yuri Baburov da9e285f73 Merge pull request #128 from azmeuk/self-closing
Replaced XHTML output with HTML5 output in summary for empty elements (a, br), issue #125
4 years ago
Yuri Baburov 5032e2d3ab Merge pull request #127 from azmeuk/warnings
Fixed a few regex warnings, thanks azmeuk!
4 years ago
Éloi Rivard f9977b727d Documentation draft 4 years ago
Éloi Rivard 0846955dd7 Fixed issue with self-closing tags. Fix #125 4 years ago
Éloi Rivard 6c1c6391e2 Fixed a few regex warnings 4 years ago
baby5 0ac3c5bbc6 Fix compile_pattern not support uppercase 5 years ago
jkclee bac691a0a4 Fix #99 5 years ago
Linas Valiukas 747c46abce Trim many repeated spaces to make clean() faster
When Readability encounters many repeated whitespace characters, the cleanup
regexes in clean() take forever to run, so trim the amount of whitespace
to 255 characters.

Additionally, test the extracting performance with "timeout_decorator".
6 years ago
Yuri Baburov f7f439d019 Improved positive_keywords and negative_keywords processing for the CLI 6 years ago
Yuri Baburov 0c8f040d53 Updated docs for positive_keywords and negative_keywords, cleaner implementation. 6 years ago
Yuri Baburov 0e50b53d05 Release version 0.7. Better HTML5 support and an important bugfix. 6 years ago
Yuri Baburov 537de2b8f6 Improved remove_unlikely_candidates following advice from issue #102 6 years ago
Yuri Baburov e4efc87a20 Update readability.py 8 years ago
Yuri Baburov b20d5c15ef Improved Document class documentation 8 years ago
alphapapa 8443a87f5c Update readability.py 8 years ago
alphapapa 5fc2d3684a Use Mozilla User-Agent
Use a "Mozilla" user-agent to avoid HTTP 403 errors.  Fixes #71.
8 years ago
Yuri Baburov 65d1ebb06d Fixed #70 and added xpath option 9 years ago
Yuri Baburov c0d794fdd8 Update readability.py
Fixed logging namespace
9 years ago
Yuri Baburov 8ff11e68a6 Debugging improvements. Bump to 0.6.0.5 9 years ago
Yuri Baburov fcdbe563a5 Fixed #49. Bump to 0.6.0.4 9 years ago
Yuri Baburov 24bb20c761 Added dev branch features.
Bumped to version 0.6
9 years ago
Yuri Baburov 154658798b Merge pull request #64 from martinth/master
Added python 3 support (Supported: python 2.6, 2.7, 3.3, 3.4).
Thanks a lot to @martinth
9 years ago
Marko Horvatic f0ff9b2425 Move logging.basicConfig to main function 9 years ago
Yuri Baburov e2bc1ea055 Improved #65 which has given warning, added cssselect lib, bumped to 0.5.1 9 years ago
Mariusz Osiecki bf9e7404fa Failure if best_elem is root (fix #58) 9 years ago
Martin Thurau ce7ca26835 Adds compatibility `raise_with_traceback` method to support different `raise` syntax
Unfortunately the Python 2 `raise` syntax is not supported in Python 3.3 and not all 3.4.x versions so we deal with that by using conditional imports and a compatibility layer.
9 years ago
Martin Thurau 3ac56329e2 Corrects some things where 2to3 did too much. 9 years ago
Martin Thurau aa4132f57a Adds Python 3.4 support.
Code now supports Python 2.6, 2.7 and 3.4. Python 3.3 isn't supported
because of some issues with the parser and the difference between old and
new `raise` syntax.
9 years ago
Yuri Baburov 987570bef0 Updated package links for Python 2.7 and Python 3 support 9 years ago
Yuri Baburov 1fac7e685a Added a feature to allow more images per article (with a test) 9 years ago
Miguel Galves d04d41b749 Insert text inside iframe for correct output 9 years ago
Miguel Galves f1759c1404 Allows iframes containing youtube or vimeo videos. People like them 9 years ago
Yuri Baburov 638f73f6a2 Fix for #52: <input type="hidden"> are not counted any more for "form removal" heuristic. 10 years ago
Yuri Baburov 08658d1d31 Released v 0.3, and uploaded to the pypi. 11 years ago
hush-hush e2e78e4d55 Make lxml clean tree available for user modifications. 12 years ago
Richard Harding e9a5cbfe7f Remove pdb dummy 12 years ago
Richard Harding f1a79fb8f8 Update to make sure we don't drop the html tag when ditching elements 12 years ago
Richard Harding 46f0302ebc rename the document_only flag to html_partial 12 years ago
Richard Harding a46dc14251 Try to pep8 all the things but give up when I got close. 12 years ago
Richard Harding 5a98e2c1b8 Correct appending and allow for document only
- Fix the appending of siblings to the correct nested element
- Add a document only flag so that you can get a dom tree you can nest
yourself without html/body tags.
12 years ago
Richard Harding edccec5d3b Work on why we have an empty <body/> tag
- Seems to come because the sanitizer ends up with two nodes, not one. The
first is an empty body, the second is the article div.
- Fix up the tabs so we can work with the file. Needs lots of pep8 love.
- Implement an initial hack that at least gets it working atm.
- Start to add test cases, sample html files we can test against, etc.
12 years ago
Jan Weiß 3cdc3d67af Adding comment about oversight in transform_misused_divs_into_paragraphs(). 12 years ago
Jan Weiß 960f885edf Continue early in remove_unlikely_candidates() in case there is neither a class nor an id attribute. 12 years ago
Jan Weiß 6b3961cd30 Fixing gap in node_length coverage. 12 years ago
facundo bb93ae1e5f fixed a small issue on the Document score_paragraphs method 12 years ago
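
Note on commit 747c46abce: its message says that long runs of whitespace make the cleanup regexes in clean() very slow, and that the fix is to trim such runs to 255 characters. The snippet below is only a rough sketch of that idea, not the project's actual code. The helper name `trim_whitespace_runs` and its `limit` parameter are hypothetical and chosen for illustration.

```python
import re


def trim_whitespace_runs(text, limit=255):
    """Cap every run of whitespace in `text` at `limit` characters.

    Hypothetical helper illustrating the idea from the commit message;
    the 255 default is the figure that message mentions.
    """
    # Match only runs that are strictly longer than the limit.
    too_long = re.compile(r"\s{%d,}" % (limit + 1))
    # Keep the first `limit` characters of each over-long run.
    return too_long.sub(lambda match: match.group(0)[:limit], text)
```

Running such a step before the heavier cleanup regexes bounds the amount of whitespace they can backtrack over, which appears to be the motivation the commit describes.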