This code is under the Apache License 2.0. http://www.apache.org/licenses/LICENSE-2.0
This is a Python port of a Ruby port of arc90's Readability project
http://lab.arc90.com/experiments/readability/
In a few words,
given an HTML document, it pulls out the main body text and cleans it up.
It can also clean up the title, based on the latest readability.js code.
Based on:
- Latest readability.js ( https://github.com/MHordecki/readability-redux/blob/master/readability/readability.js )
- Ruby port by starrhorne and iterationlabs
- Python port by gfxmonk ( https://github.com/gfxmonk/python-readability , based on BeautifulSoup )
- Decruft effort to move to lxml ( http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/ )
- "BR to P" fix from readability.js which improves quality for smaller texts.
- Contributions from GitHub users.
Installation::

    easy_install readability-lxml
    or
    pip install readability-lxml
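Usage::

    # Python 3 import; on Python 2 use urllib.urlopen instead
    from urllib.request import urlopen
    from readability.readability import Document

    url = "http://pypi.python.org/pypi/readability-lxml"  # any page to extract
    html = urlopen(url).read()
    doc = Document(html)
    readable_article = doc.summary()
    readable_title = doc.short_title()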
Command-line usage::

    python -m readability.readability -u http://pypi.python.org/pypi/readability-lxml

Using positive/negative keywords example::

    python -m readability.readability -p intro -n newsindex,homepage-box,news-section -u http://python.org
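The keyword options above take comma-separated patterns that are searched for in element classes and ids. As a rough illustration of that idea only, here is a minimal sketch. The helpers `compile_keywords` and `keyword_vote` are hypothetical names, and the +1/-1 vote is an assumption made for this example. It is not the library's actual scoring.

```python
import re


def compile_keywords(csv):
    """Turn a comma-separated keyword string into one regex (hypothetical helper)."""
    words = [w.strip() for w in csv.split(",") if w.strip()]
    return re.compile("|".join(re.escape(w) for w in words), re.I)


def keyword_vote(class_and_id, positive, negative):
    """Return +1, -1 or 0 for an element's combined class/id string.

    The vote values are an assumption for illustration only.
    """
    vote = 0
    if positive.search(class_and_id):
        vote += 1
    if negative.search(class_and_id):
        vote -= 1
    return vote


# Same keywords as in the command-line example above.
positive = compile_keywords("intro")
negative = compile_keywords("newsindex,homepage-box,news-section")

votes = {
    name: keyword_vote(name, positive, negative)
    for name in ["intro-text", "homepage-box left", "footer"]
}
```

In this sketch an element whose class or id contains "intro" is favoured, one containing "homepage-box" is penalised, and anything else is left neutral.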
Document() kwarg options:
- attributes:
- debug: output debug messages
- min_text_length:
- retry_length:
- url: will allow adjusting links to be absolute
- positive_keywords: the list of positive search patterns in classes and ids, for example: ["news-item", "block"]
- negative_keywords: the list of negative search patterns in classes and ids, for example: ["mysidebar", "related", "ads"]
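The `url` option in the list above is described as allowing links to be made absolute. To show what that means in practice, the sketch below resolves a few hrefs against a base URL with the standard library's `urljoin`. The helper name `absolutize` and the sample URLs are assumptions for illustration, and this is not taken from the library's own code.

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve each href against base_url (hypothetical helper)."""
    return [urljoin(base_url, href) for href in hrefs]


links = absolutize(
    "http://python.org/news/index.html",
    ["/about/", "item.html", "http://pypi.python.org/pypi"],
)
# Root-relative and relative hrefs gain the scheme and host of the base URL;
# hrefs that are already absolute are returned unchanged.
```

With the library itself, the base URL would presumably be supplied through the `url` kwarg, for example `Document(html, url=url)`.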
Updates
- 0.2.5 Update setup.py for uploading .tar.gz to pypi
- 0.2.6 Don't crash on documents with no title
- 0.2.6.1 Document.short_title() now works properly
- 0.3 Added Document.encoding, positive_keywords and negative_keywords