Commit Graph

88 Commits (master)

Author SHA1 Message Date
Yuri Baburov 638f73f6a2 Fix for #52: <input type="hidden"> are not counted any more for "form removal" heuristic. 10 years ago
Mark Perdomo 3a43a3fe7e Added code to check declared encodings first and check them from kennethreitz/requests/utils.py. Also I added some superset encodings I have found in Chinese pages that are mishandled by chardet/character declarations.
10 years ago
Yuri Baburov d8595b7103 Quickfix for #41 11 years ago
Yuri Baburov 318f25c577 Minor fix in encoding guessing. Claiming it v0.3.0.1 11 years ago
Yuri Baburov 08658d1d31 Released v 0.3, and uploaded to the pypi. 11 years ago
hush-hush e2e78e4d55 Make lxml clean tree available for user modifications. 12 years ago
Drew Vogel fdba8d9e11 Added check on title.text to avoid a TypeError on None. 12 years ago
Zach Denton 0843d9cdf2 Explicitly check if title is None. fixes #22
This fixes #22 which caused all titles to be blank.
12 years ago
Andrey Popp 95852d5c18 readability.htmls: some docs do not have title elem 12 years ago
Richard Harding e9a5cbfe7f Remove pdb dummy 12 years ago
Richard Harding f1a79fb8f8 Update to make sure we don't drop the html tag when ditching elements 12 years ago
Richard Harding 46f0302ebc rename the document_only flag to html_partial 12 years ago
Richard Harding a46dc14251 Try to pep8 all the things but give up when I got close. 12 years ago
Richard Harding 5a98e2c1b8 Correct appending and allow for document only
- Fix the appending of siblings to the correct nested element
- Add a document only flag so that you can get a dom tree you can nest yourself without html/body tags.
12 years ago
Richard Harding edccec5d3b Work on why we have an empty <body/> tag
- Seems to come because the sanitizer ends up with two nodes, not one. The first is an empty body, the second is the article div.
- Fix up the tabs so we can work with the file. Needs lots of pep8 love.
- Implement an initial hack that at least gets it working atm.
- Start to add test cases, sample html files we can test against, etc.
12 years ago
Jan Weiß 3cdc3d67af Adding comment about oversight in transform_misused_divs_into_paragraphs(). 12 years ago
Jan Weiß 960f885edf Continue early in remove_unlikely_candidates() in case there is neither a class nor an id attribute. 12 years ago
Jan Weiß 6b3961cd30 Fixing gap in node_length coverage. 12 years ago
facundo bb93ae1e5f fixed a small issue on the Document score_paragraphs method 12 years ago
Yuri Baburov 11c4d95411 Fixed indentation, encoding issue and README bug. Thanks to Greg Jastrab. Bump version to 0.2.3 13 years ago
Yuri Baburov 61715dca0a Bump to version 0.2 13 years ago
Yuri Baburov c2ec1d1c38 Sorted out unicode issues, thanks to Lee Semel. 13 years ago
Yuri Baburov 97ba2a0369 Debug utilities. 13 years ago
Lee Semel f3d0a8d842 Allow passing unicode objects 13 years ago
Jerry Charumilind 8c1adc5141 Expose Document in readability package 13 years ago
Yuri Baburov 43c34bacc1 Renamed encodings to encoding to avoid conflicts with system module. 13 years ago
Yuri Baburov f55f16baa1 Updated scoring algorithm to match readability.js v1.7.1 13 years ago
Yuri Baburov 96f476181c Improved title shortener method, and added it to the Document class. 13 years ago
Yuri Baburov dada82099b Moved to lxml (based on decruft version); better encoding recognition. 13 years ago
gfxmonk 2b6a2d3db4 removing empty paragraphs is not very useful, and can break some (stupid) websites 14 years ago
gfxmonk 1d862a00c3 fixed bug where only immediate text was being considered for weights, instead of all nested text 14 years ago
gfxmonk 0eacd959a4 failsafe parsing and more logging 14 years ago
gfxmonk 87ad057706 unicode, dammit! 14 years ago
gfxmonk a224c5b759 minor 14 years ago
gfxmonk f73b5f05c4 split out into content and summary methods 14 years ago
gfxmonk c952f421b7 clean up content method and debug 14 years ago
gfxmonk c0ca60ee26 use a more lenient parser 14 years ago
gfxmonk ad3d52ade4 initial 14 years ago