
.. image:: https://travis-ci.org/buriy/python-readability.svg?branch=master
    :target: https://travis-ci.org/buriy/python-readability


python-readability
==================

Given an HTML document, it pulls out the main body text and cleans it up.

This is a Python port of a Ruby port of `arc90's readability
project <http://lab.arc90.com/experiments/readability/>`__.

Installation
------------

Install it with ``pip``:

::

    $ pip install readability-lxml

Usage
-----

::

    >>> import requests
    >>> from readability import Document
    >>>
    >>> response = requests.get('http://example.com')
    >>> doc = Document(response.text)
    >>> doc.title()
    'Example Domain'

Change Log
----------

-  0.7 Improved HTML5 tag handling. Heuristics were changed for a lot of
   sites. Fixed an important bug with stripping unwanted HTML nodes
   (previously only the first matching node was removed).
-  0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3
   and 3.4
-  0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and
   3.4
-  0.4 Added video loading and allowed more images per paragraph
-  0.3 Added ``Document.encoding``, ``positive_keywords`` and
   ``negative_keywords``

Licensing
---------

This code is licensed under `the Apache License
2.0 <http://www.apache.org/licenses/LICENSE-2.0>`__.

Thanks to
---------

-  Latest
   `readability.js <https://github.com/MHordecki/readability-redux/blob/master/readability/readability.js>`__
-  Ruby port by starrhorne and iterationlabs
-  `Python port <https://github.com/gfxmonk/python-readability>`__ by
   gfxmonk
-  `Decruft
   effort <http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/>`__
   to move to lxml
-  "BR to P" fix from readability.js which improves quality for smaller
   texts
-  Contributions from GitHub users.