Commit Graph

204 Commits (master)

Author SHA1 Message Date
Richard Harding f1a79fb8f8 Update to make sure we don't drop the html tag when ditching elements 12 years ago
Richard Harding 46f0302ebc rename the document_only flag to html_partial 12 years ago
Rick Harding 6e8a1f5ce2 Merge pull request #18 from mitechie/add_makefile
Add makefile, update .gitignore for venv potential testfile output.
12 years ago
Richard Harding b8fc399fac Fix rebase issue in the Makefile 12 years ago
Richard Harding 82804b664d Update .gitignore file for venv and nosetests. 12 years ago
Richard Harding 4376eedc13 Add makefile testing, building, uploading.
- Adds a makefile with helpers
- make all will setup a virtualenv and get deps
- make test will install test deps and run nosetests
- make version_update will open the setup.py for updating version string
- make upload will build and upload sdist to pypi
12 years ago
Yuri Baburov 7338e9ef63 Added test suite to setup.py
Bump to version 0.2.4
12 years ago
Yuri Baburov a1ae4eaf72 Merge pull request #15 from mitechie/master
New option only_document of Document.summary(), fixed issue GH-13 with "<body/>", added some docs, tests, and code quality improvements. Thanks, Rick!
12 years ago
Richard Harding 8d3e39f04e Update readme 12 years ago
Richard Harding a46dc14251 Try to pep8 all the things but give up when I got close. 12 years ago
Richard Harding 5a98e2c1b8 Correct appending and allow for document only
- Fix the appending of siblings to the correct nested element
- Add a document only flag so that you can get a dom tree you can nest yourself without html/body tags.
12 years ago
Richard Harding edccec5d3b Work on why we have an empty <body/> tag
- Seems to come because the sanitizer ends up with two nodes, not one. The first is an empty body, the second is the article div.
- Fix up the tabs so we can work with the file. Needs lots of pep8 love.
- Implement an initial hack that at least gets it working atm.
- Start to add test cases, sample html files we can test against, etc.
12 years ago
Yuri Baburov ab783b25b7 Merge pull request #11 from JanX2/master
Fixing gap in node_length coverage (length=80 was missed)
Continue early in remove_unlikely_candidates() in case there is neither a class nor an id attribute.
Adding comment about oversight in transform_misused_divs_into_paragraphs
12 years ago
Jan Weiß 3cdc3d67af Adding comment about oversight in transform_misused_divs_into_paragraphs(). 12 years ago
Jan Weiß 960f885edf Continue early in remove_unlikely_candidates() in case there is neither a class nor an id attribute. 12 years ago
Jan Weiß 6b3961cd30 Fixing gap in node_length coverage. 12 years ago
Yuri Baburov f9b604c9a8 Merge pull request #10 from facundo/master
Fix: Document.score_paragraphs should use ._html() not .html in case it's used not from .summary() method.
Thanks to facundo.
12 years ago
facundo bb93ae1e5f fixed a small issue on the Document score_paragraphs method 12 years ago
Yuri Baburov fc6a500298 Merge pull request #9 from Psycojoker/master
Add lxml to the dependencies list in the setup.py
Please note that lxml sometimes can't be built from sources, lots of people use binary distributions, which setup.py/pip can't handle properly!
13 years ago
Laurent Peuch 1583d8a794 add lxml missing dependency 13 years ago
Yuri Baburov 11c4d95411 Fixed indentation, encoding issue and README bug. Thanks to Greg Jastrab. Bump version to 0.2.3 13 years ago
Yuri Baburov 6bf4948e69 More README fixes for pypi and github. Bump to version 0.2.2 13 years ago
Yuri Baburov f189ab905d Fixed README for pypi. 13 years ago
Yuri Baburov 61715dca0a Bump to version 0.2 13 years ago
Yuri Baburov 21906f1c44 Better setup.py, now we're "readability-lxml" in pypi. Thanks to Jerry Charumilind. 13 years ago
Yuri Baburov c2ec1d1c38 Sorted out unicode issues, thanks to Lee Semel. 13 years ago
Yuri Baburov 45781a600f Added command-line usage 13 years ago
Yuri Baburov 97ba2a0369 Debug utilities. 13 years ago
Lee Semel f3d0a8d842 Allow passing unicode objects 13 years ago
Jerry Charumilind ad38fac40a Add chardet to installation requirements 13 years ago
Jerry Charumilind 8c1adc5141 Expose Document in readability package 13 years ago
Jerry Charumilind bae87079e9 Change to automatically find packages 13 years ago
Jerry Charumilind 5bf5192d03 Add version number to track changes more easily 13 years ago
Yuri Baburov 7a1e063c22 Updated setup.py to my fork, changed package name to lxml-readability 13 years ago
Yuri Baburov 43c34bacc1 Renamed encodings to encoding to avoid conflicts with system module. 13 years ago
Yuri Baburov 096d4db6ce Added usage 13 years ago
Yuri Baburov f55f16baa1 Updated scoring algorithm to match readability.js v1.7.1 13 years ago
Yuri Baburov 96f476181c Improved title shortener method, and added it to the Document class. 13 years ago
Yuri Baburov f925e3ef05 Corrected README 13 years ago
Yuri Baburov dada82099b Moved to lxml (based on decruft version); better encoding recognition. 13 years ago
gfxmonk b5639a0822 well that was quick; first fork added 13 years ago
gfxmonk 324e280e16 added note to readme to make it clear that I'm not actively working on this library 13 years ago
Tim Cuthbertson 7ebbcc03d2 made setup.py executable 14 years ago
Sean Brant a5d47a1129 added setup.py 14 years ago
gfxmonk 2b6a2d3db4 removing empty paragraphs is not very useful, and can break some (stupid) websites 14 years ago
gfxmonk 1d862a00c3 fixed bug where only immediate text was being considered for weights, instead of all nested text 14 years ago
gfxmonk 0eacd959a4 failsafe parsing and more logging 14 years ago
gfxmonk 87ad057706 unicode, dammit! 14 years ago
gfxmonk a224c5b759 minor 14 years ago
gfxmonk e42a39e1aa modified readme 14 years ago