readability_lxml
================

This is a Python port of a Ruby port of `arc90's readability`_ project.

Given an HTML document, it extracts the main body text and cleans it up.
It can also clean up the title, based on the latest readability.js code.


Inspiration
-----------
- Latest readability.js ( https://github.com/MHordecki/readability-redux/blob/master/readability/readability.js )
- Ruby port by starrhorne and iterationlabs
- Python port by gfxmonk ( https://github.com/gfxmonk/python-readability , based on BeautifulSoup )
- Decruft effort to move to lxml ( http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/ )
- "BR to P" fix from readability.js which improves quality for smaller texts.
- Contributions from GitHub users.
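
To make the "BR to P" item above more concrete, here is a minimal sketch of the
idea: runs of consecutive ``<br>`` tags are treated as paragraph breaks. The
function name ``br_to_p`` and the exact pattern are assumptions made for
illustration; this is not the code used by this package or by readability.js.

```python
import re

# Two or more consecutive <br> tags, optionally separated by whitespace.
# The pattern is an illustrative assumption, not the library's own regex.
_DOUBLE_BR = re.compile(r'(?:<br\s*/?>\s*){2,}', re.IGNORECASE)


def br_to_p(html):
    """Replace each run of two or more <br> tags with a paragraph boundary."""
    return _DOUBLE_BR.sub('</p><p>', html)


print(br_to_p('<p>first line<br><br>second line</p>'))
# prints: <p>first line</p><p>second line</p>
```

A single ``<br>`` is left untouched, so ordinary line breaks inside a paragraph
survive.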


Try it out!
-----------
You can try out the parser by entering your test URLs into the following test
service.

http://readable.bmark.us


Installation
-------------
::

    $ easy_install readability-lxml
    # or
    $ pip install readability-lxml


Usage
------

Command Line Client
~~~~~~~~~~~~~~~~~~~
::

    $ readability http://pypi.python.org/pypi/readability-lxml
    $ readability /home/rharding/sampledoc.html

As a Library
~~~~~~~~~~~~
::

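    from readability.readability import Document
    import urllib  # Python 2 module; urlopen lives here

    # Any page you want to extract; this URL is only an example.
    url = 'http://pypi.python.org/pypi/readability-lxml'
    html = urllib.urlopen(url).read()

    readable_article = Document(html).summary()    # main body, cleaned up
    readable_title = Document(html).short_title()  # cleaned-up title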
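
The snippet above relies on ``urllib.urlopen``, which only exists on Python 2.
As a hedged sketch, the download step could look like this on Python 3. The
helper name ``fetch_html`` and its ``timeout`` default are assumptions made for
illustration and are not part of this package.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the raw bytes served at `url` (hypothetical helper)."""
    # The context manager closes the connection once the body is read.
    with urlopen(url, timeout=timeout) as response:
        return response.read()
```

The returned bytes would then be handed to ``Document`` as shown above. Whether
this release of the package itself runs on Python 3 is not stated in this
README.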

You can also use the `summary_with_metadata` method to get back other
metadata, such as the confidence score found while processing the input.

::

    doc = Document(html).summary_with_metadata()
    print(doc.html)        # the cleaned-up article
    print(doc.confidence)  # confidence score found while processing


Optional `Document` keyword arguments:

- attributes:
- debug: output debug messages
- min_text_length:
- retry_length:
- url: the source URL of the document; allows relative links to be rewritten as absolute ones
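
To illustrate what adjusting links to be absolute means for the `url` argument,
here is a small standard-library sketch. It is not how this package does it
internally. The function name ``absolutize_links`` is hypothetical, and the
regular expression only handles double-quoted ``href`` and ``src`` attributes.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or src="..." with double-quoted values only
# (a simplifying assumption for this sketch).
_LINK_ATTR = re.compile(r'\b(href|src)="([^"]*)"')


def absolutize_links(html, base_url):
    """Resolve every href/src value in `html` against `base_url`."""
    def _replace(match):
        attr, value = match.group(1), match.group(2)
        # urljoin leaves values that are already absolute unchanged.
        return '{}="{}"'.format(attr, urljoin(base_url, value))
    return _LINK_ATTR.sub(_replace, html)


snippet = '<a href="/pypi/readability-lxml">package</a>'
print(absolutize_links(snippet, 'http://pypi.python.org/pypi'))
# prints: <a href="http://pypi.python.org/pypi/readability-lxml">package</a>
```

With the library itself, the equivalent step would presumably be passing
``url=...`` when constructing the `Document`, as listed above.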


Test and Build Status
---------------------
Tests are run against the package at:

http://build.bmark.us/job/readability-lxml/

Visit that page for the build history and test status.


History
-------

- `0.2.5` Update setup.py for uploading .tar.gz to pypi


.. _arc90's readability: http://lab.arc90.com/experiments/readability/