You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
 
 
 
Go to file
Richard Harding b1966df1c3 Fix docs for changed method 12 years ago
src Remove the get_ in method name, doesn't fit rest of api 12 years ago
.gitignore Move the module into the readable_lxml space so that we can actually import it nicely. 12 years ago
CREDITS Add credits file 12 years ago
LICENSE Add a license file 12 years ago
Makefile Make sure we update both version strings until we can figure out how to pull it into the setup.py by magic 12 years ago
README.rst Fix docs for changed method 12 years ago
setup.py Fix setup.py to pull the rst readme 12 years ago

README.rst

readability_lxml
================

This is a python port of a ruby port of `arc90's readability`_ project

Given a html document, it pulls out the main body text and cleans it up.
It also can clean up title based on latest readability.js code.


Inspiration
-----------
- Latest readability.js ( https://github.com/MHordecki/readability-redux/blob/master/readability/readability.js )
- Ruby port by starrhorne and iterationlabs
- Python port by gfxmonk ( https://github.com/gfxmonk/python-readability , based on BeautifulSoup )
- Decruft effort to move to lxml ( http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/ )
- "BR to P" fix from readability.js which improves quality for smaller texts.
- Github users contributions.


Try it out!
-----------
You can try out the parser by entering your test urls on the following test
service.

http://readable.bmark.us


Installation
-------------
::

    $ easy_install readability-lxml
    # or
    $ pip install readability-lxml


Usage
------

Command Line Client
~~~~~~~~~~~~~~~~~~~
::

    $ readability http://pypi.python.org/pypi/readability-lxml
    $ readability /home/rharding/sampledoc.html

As a Library
~~~~~~~~~~~~
::

    from readability.readability import Document
    import urllib
    html = urllib.urlopen(url).read()
    readable_article = Document(html).summary()
    readable_title = Document(html).short_title()

You can also use the `get_summary_with_metadata` method to get back other
metadata such as the confidence score found while processing the input.

::

    doc = Document(html).summary_with_metadata()
    print doc.html
    print doc.confidence


Optional `Document` keyword argument:

- attributes:
- debug: output debug messages
- min_text_length:
- retry_length:
- url: will allow adjusting links to be absolute


Test and BUild Status
---------------------
Tests are run against the package at:

http://build.bmark.us/job/readability-lxml/

You can view it for build history and test status.


History
-------

- `0.2.5` Update setup.py for uploading .tar.gz to pypi


.. _arc90's readability: http://lab.arc90.com/experiments/readability/