747c46abce
When Readability encounters many repeated whitespace, the cleanup regexes in clean() take forever to run, so trim the amount of whitespace to 255 characters. Additionally, test the extracting performance with "timeout_decorator". | 6 years ago | |
---|---|---|
readability | 6 years ago | |
tests | 6 years ago | |
.gitignore | 9 years ago | |
.travis.yml | 6 years ago | |
Makefile | 6 years ago | |
README.rst | 6 years ago | |
requirements.txt | 9 years ago | |
setup.py | 6 years ago | |
tox.ini | 6 years ago |
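
The most recent commit message above describes capping long whitespace runs at 255 characters before the cleanup regexes run. A minimal sketch of that idea, using only the standard library, might look like the following. The helper name ``trim_whitespace_runs`` and the constant ``MAX_WHITESPACE_RUN`` are hypothetical and chosen for illustration; this is not necessarily how ``clean()`` is written in the library.

```python
import re

# Cap taken from the commit message; the constant name is an assumption.
MAX_WHITESPACE_RUN = 255


def trim_whitespace_runs(text, limit=MAX_WHITESPACE_RUN):
    """Replace every whitespace run of `limit` or more characters
    with exactly `limit` spaces.

    Hypothetical helper for illustration only.
    """
    pattern = r"\s{%d,}" % limit
    return re.sub(pattern, " " * limit, text)


# A pathological input: two words separated by 10,000 spaces.
sample = "Hello" + " " * 10000 + "world"
trimmed = trim_whitespace_runs(sample)

# Runs shorter than the limit are left untouched.
short = trim_whitespace_runs("a  b")

print(len(sample), len(trimmed))
```

Bounding the run length first means that later patterns with nested or adjacent ``\s*`` quantifiers have far less input to backtrack over, which is presumably why the commit applies the cap before the other cleanup steps.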
README.rst
.. image:: https://travis-ci.org/buriy/python-readability.svg?branch=master
    :target: https://travis-ci.org/buriy/python-readability

python-readability
==================

Given an HTML document, it pulls out the main body text and cleans it up.

This is a Python port of a Ruby port of `arc90's readability project <http://lab.arc90.com/experiments/readability/>`__.

Installation
------------

Installation is easy with ``pip``:

::

    $ pip install readability-lxml

Usage
-----

::

    >>> import requests
    >>> from readability import Document
    >>>
    >>> response = requests.get('http://example.com')
    >>> doc = Document(response.text)
    >>> doc.title()
    'Example Domain'

Change Log
----------

- 0.7 Improved HTML5 tags handling. Heuristics were changed for a lot of sites: fixed an important bug with stripping unwanted HTML nodes (previously only the first matching node was removed).
- 0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3 and 3.4
- 0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4
- 0.4 Added Videos loading and allowed more images per paragraph
- 0.3 Added ``Document.encoding``, ``positive_keywords`` and ``negative_keywords``

Licensing
---------

This code is licensed under `the Apache License 2.0 <http://www.apache.org/licenses/LICENSE-2.0>`__.

Thanks to
---------

- Latest `readability.js <https://github.com/MHordecki/readability-redux/blob/master/readability/readability.js>`__
- Ruby port by starrhorne and iterationlabs
- `Python port <https://github.com/gfxmonk/python-readability>`__ by gfxmonk
- `Decruft effort <http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/>`__ to move to lxml
- "BR to P" fix from readability.js which improves quality for smaller texts
- Contributions from GitHub users.