Update README to be a rst file and clean up a little bit.

0.3.0.dev
Richard Harding 12 years ago
parent 8b0210c4dc
commit 58c69651d3

@@ -1,14 +1,14 @@
This code is under the Apache License 2.0. http://www.apache.org/licenses/LICENSE-2.0 readability_lxml
================
This is a python port of a ruby port of arc90's readability project This is a python port of a ruby port of `arc90's readability`_ project
http://lab.arc90.com/experiments/readability/
In few words,
Given a html document, it pulls out the main body text and cleans it up. Given a html document, it pulls out the main body text and cleans it up.
It also can clean up title based on latest readability.js code. It also can clean up title based on latest readability.js code.
Based on:
Inspiration
-----------
- Latest readability.js ( https://github.com/MHordecki/readability-redux/blob/master/readability/readability.js ) - Latest readability.js ( https://github.com/MHordecki/readability-redux/blob/master/readability/readability.js )
- Ruby port by starrhorne and iterationlabs - Ruby port by starrhorne and iterationlabs
- Python port by gfxmonk ( https://github.com/gfxmonk/python-readability , based on BeautifulSoup ) - Python port by gfxmonk ( https://github.com/gfxmonk/python-readability , based on BeautifulSoup )
@@ -16,13 +16,29 @@ Based on:
- "BR to P" fix from readability.js which improves quality for smaller texts. - "BR to P" fix from readability.js which improves quality for smaller texts.
- Github users contributions. - Github users contributions.
Installation::
easy_install readability-lxml Installation
or -------------
pip install readability-lxml ::
$ easy_install readability-lxml
# or
$ pip install readability-lxml
Usage
------
Usage:: Command Line Client
~~~~~~~~~~~~~~~~~~~
::
$ readability http://pypi.python.org/pypi/readability-lxml
$ readability /home/rharding/sampledoc.html
As a Library
~~~~~~~~~~~~
::
from readability.readability import Document from readability.readability import Document
import urllib import urllib
@@ -30,21 +46,19 @@ Usage::
readable_article = Document(html).summary() readable_article = Document(html).summary()
readable_title = Document(html).short_title() readable_title = Document(html).short_title()
Command-line usage:: Optional `Document` keyword argument:
python -m readability.readability -u http://pypi.python.org/pypi/readability-lxml
Document() kwarg options: - attributes:
- debug: output debug messages
- min_text_length:
- retry_length:
- url: will allow adjusting links to be absolute
- attributes:
- debug: output debug messages
- min_text_length:
- retry_length:
- url: will allow adjusting links to be absolute
History
-------
Updates - `0.2.5`` Update setup.py for uploading .tar.gz to pypi
- 0.2.5 Update setup.py for uploading .tar.gz to pypi
.. _arc90's readability: http://lab.arc90.com/experiments/readability/
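
The `url` option listed above is described as making links absolute. As a rough illustration of what that means, and not the library's own implementation, the standard library's `urljoin` resolves a relative link against a base URL. The base URL and link values below are made up for the example.

```python
from urllib.parse import urljoin  # Python 3; the README snippet itself targets Python 2

# Hypothetical URL of the page the HTML was fetched from.
base_url = "http://example.com/articles/post.html"

# A relative path, a root-relative path, and an already absolute link.
relative_links = ["images/figure.png", "/about", "http://other.example/page"]

# Resolve each link against the base URL; absolute links pass through unchanged.
absolute_links = [urljoin(base_url, link) for link in relative_links]
```

Without a base URL there is nothing to resolve relative links against, which is presumably why the option is needed at all.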

@@ -102,11 +102,6 @@ class Document:
self.options = options self.options = options
self.html = None self.html = None
def _html(self, force=False):
if force or self.html is None:
self.html = self._parse(self.input_doc)
return self.html
def _parse(self, input_doc): def _parse(self, input_doc):
doc = build_doc(input_doc) doc = build_doc(input_doc)
doc = html_cleaner.clean_html(doc) doc = html_cleaner.clean_html(doc)
@@ -136,7 +131,8 @@ class Document:
try: try:
ruthless = True ruthless = True
while True: while True:
self._html(True) self.html = self._parse(self.input_doc)
for i in self.tags(self.html, 'script', 'style'): for i in self.tags(self.html, 'script', 'style'):
i.drop_tree() i.drop_tree()
for i in self.tags(self.html, 'body'): for i in self.tags(self.html, 'body'):
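
The Python hunks above remove the `_html` helper and inline its body at the call site shown in the diff. A minimal sketch of why the two forms behave the same, using a hypothetical stand-in class (`ToyDocument`, with a fake `_parse` that only counts calls; none of this is the library's code):

```python
class ToyDocument:
    """Hypothetical stand-in for Document; _parse is faked for illustration."""

    def __init__(self, input_doc):
        self.input_doc = input_doc
        self.html = None
        self.parse_calls = 0

    def _parse(self, input_doc):
        # Stand-in for build_doc + clean_html: returns a fresh object per call.
        self.parse_calls += 1
        return {"source": input_doc, "parse_number": self.parse_calls}

    def _html(self, force=False):
        # The helper removed by this commit: parse lazily, or re-parse when forced.
        if force or self.html is None:
            self.html = self._parse(self.input_doc)
        return self.html


# Before the commit: the loop called the helper with force=True every time.
old_style = ToyDocument("<p>hi</p>")
old_style._html(True)
old_style._html(True)

# After the commit: the same assignment is written inline.
new_style = ToyDocument("<p>hi</p>")
new_style.html = new_style._parse(new_style.input_doc)
new_style.html = new_style._parse(new_style.input_doc)

same_result = old_style.html == new_style.html
same_call_count = old_style.parse_calls == new_style.parse_calls
```

With `force=True` the `None` check never decides anything, so at this call site the helper added a layer of indirection without caching a result.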
