Commit Graph

20 Commits (edccec5d3b4cecee3fdccff7667dd81bb3ed6258)

Author SHA1 Message Date
Richard Harding edccec5d3b Work on why we have an empty <body/> tag
- Seems to come because the sanitizer ends up with two nodes, not one. The
first is an empty body, the second is the article div.
- Fix up the tabs so we can work with the file. Needs lots of pep8 love.
- Implement an initial hack that at least gets it working atm.
- Start to add test cases, sample html files we can test against, etc.
12 years ago
Jan Weiß 3cdc3d67af Adding comment about oversight in transform_misused_divs_into_paragraphs(). 12 years ago
Jan Weiß 960f885edf Continue early in remove_unlikely_candidates() in case there is neither a class nor an id attribute. 12 years ago
Jan Weiß 6b3961cd30 Fixing gap in node_length coverage. 12 years ago
facundo bb93ae1e5f fixed a small issue on the Document score_paragraphs method 12 years ago
Yuri Baburov 11c4d95411 Fixed indentation, encoding issue and README bug. Thanks to Greg Jastrab. Bump version to 0.2.3 13 years ago
Yuri Baburov 61715dca0a Bump to version 0.2 13 years ago
Yuri Baburov c2ec1d1c38 Sorted out unicode issues, thanks to Lee Semel. 13 years ago
Yuri Baburov f55f16baa1 Updated scoring algorithm to match readability.js v1.7.1 13 years ago
Yuri Baburov 96f476181c Improved title shortener method, and added it to the Document class. 13 years ago
Yuri Baburov dada82099b Moved to lxml (based on decruft version); better encoding recognition. 13 years ago
gfxmonk 2b6a2d3db4 removing empty paragraphs is not very useful, and can break some (stupid) websites 14 years ago
gfxmonk 1d862a00c3 fixed bug where only immediate text was being considered for weights, instead of all nested text 14 years ago
gfxmonk 0eacd959a4 failsafe parsing and more logging 14 years ago
gfxmonk 87ad057706 unicode, dammit! 14 years ago
gfxmonk a224c5b759 minor 14 years ago
gfxmonk f73b5f05c4 split out into content and summary methods 14 years ago
gfxmonk c952f421b7 clean up content method and debug 14 years ago
gfxmonk c0ca60ee26 use a more lenient parser 14 years ago
gfxmonk ad3d52ade4 initial 14 years ago