Checkpoint multi-page readability work

Restructured code to better support multi-page readability.  Improved tests.

Conflicts:

	src/readability_lxml/readability.py
	src/tests/regression.py
0.3.0.dev
Jerry Charumilind 13 years ago committed by Richard Harding
parent 5cb4b8b8c0
commit f02fe79840

@@ -27,6 +27,7 @@ log = logging.getLogger()
REGEXES = {
<<<<<<< HEAD:src/readability_lxml/readability.py
'unlikelyCandidatesRe': re.compile(
('combx|comment|community|disqus|extra|foot|header|menu|remark|rss|'
'shoutbox|sidebar|sponsor|ad-break|agegate|pagination|pager|popup|'
@@ -46,7 +47,7 @@ REGEXES = {
'divToPElementsRe': re.compile(
'<(a|blockquote|dl|div|img|ol|p|pre|table|ul)', re.I),
# Match: next, continue, >, >>, but not >|, as those usually mean last.
-'nextLink': re.compile(r'(next|weiter|continue|>[^\|]|$)', re.I),
+'nextLink': re.compile(r'(next|weiter|continue|>[^\|]$)', re.I), # Match: next, continue, >, >>, but not >|, as those usually mean last.
'prevLink': re.compile(r'(prev|earl|old|new|<)', re.I),
'page': re.compile(r'pag(e|ing|inat)', re.I),
'firstLast': re.compile(r'(first|last)', re.I)
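
The hunk above carries two `nextLink` patterns, apparently the removed line and its replacement. The sketch below is not part of the commit; it only exercises both patterns against a few made-up link texts to show what the change does. The sample strings, the `matches` helper, and the `NEXT_LINK_GROUPED` variant are assumptions introduced for illustration. The first pattern has `$` as a top-level alternative, so it matches any string at all. The second pattern no longer does that, but as written it also rejects a bare `>`, since `>[^\|]$` needs one more character after the `>`. Grouping the tail as `>([^\|]|$)` is one possible way to accept a bare `>` again.

```python
import re

# First pattern from the hunk: the trailing "|$" is a top-level alternative,
# so the empty match at end-of-string succeeds for every input.
OLD_NEXT_LINK = re.compile(r'(next|weiter|continue|>[^\|]|$)', re.I)

# Second pattern from the hunk: ">" must be followed by exactly one
# non-pipe character at the end of the string.
NEXT_LINK = re.compile(r'(next|weiter|continue|>[^\|]$)', re.I)

# Hypothetical variant, NOT in the commit: ">" followed by either a
# non-pipe character or the end of the string.
NEXT_LINK_GROUPED = re.compile(r'(next|weiter|continue|>([^\|]|$))', re.I)

# Made-up link texts for illustration only.
SAMPLES = ['Next page', 'weiter', '>>', '>', '>|', 'Page 2']


def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None


if __name__ == '__main__':
    for text in SAMPLES:
        print(repr(text),
              matches(OLD_NEXT_LINK, text),
              matches(NEXT_LINK, text),
              matches(NEXT_LINK_GROUPED, text))
```

Whether a bare `>` should count as a "next" link is a design choice for the project; the inline comment in the hunk suggests it was intended to.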

@@ -17,6 +17,7 @@ import os.path
import re
import sys
import unittest
import readability.urlfetch
import yaml
from lxml.html import builder as B

@@ -0,0 +1,60 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>A Simple Multi-Page Article For Testing : Page 3</title>
</head>
<body>
<h1>A Simple Multi-Page Article For Testing : Page 3</h1>
<p>
Nullam laoreet, nibh non faucibus dictum, tellus libero varius
erat, lobortis varius est massa quis metus. Donec vitae justo
lacus, nec convallis metus. Suspendisse potenti. Nunc et rutrum
justo. Maecenas ultrices ipsum in magna fermentum eleifend. Fusce
sagittis pretium aliquam. Vestibulum et gravida lorem. Sed turpis
quam, placerat ac ultrices eu, tempor sit amet elit. Curabitur eu
imperdiet velit. Quisque pharetra ornare nunc, a volutpat metus
aliquam quis. Vivamus semper aliquam cursus. Nullam ac nibh nulla,
luctus pharetra nunc. Etiam ut sapien sem. Fusce vehicula, sem sit
amet viverra pretium, magna tortor suscipit nisi, id interdum lorem
orci in tellus. Vivamus vel ipsum eros. Fusce porttitor convallis
ultricies. Etiam in risus diam, viverra suscipit felis. Duis vitae
imperdiet est.
</p>
<p>
Nunc nunc magna, facilisis blandit venenatis ut, scelerisque ac
tortor. Cras condimentum fermentum lectus ac convallis. Suspendisse
cursus, lacus sit amet sodales molestie, dui erat varius velit, non
tincidunt metus dui sed nulla. Aliquam lacus orci, convallis ut
pellentesque ac, molestie et dolor. Ut pretium enim ut nunc auctor
eget placerat magna luctus. Duis mollis ligula a orci ultrices in
facilisis felis feugiat. Morbi eget odio eget erat pulvinar
placerat sed nec erat. Duis dignissim, dolor a lacinia commodo,
metus erat laoreet dui, in lacinia felis lacus vitae nulla. Fusce
imperdiet condimentum volutpat. Vivamus ut lacus a eros cursus
scelerisque non sit amet orci. Phasellus id quam odio. Nulla
adipiscing venenatis lorem nec feugiat. Aenean sit amet nisl odio,
tincidunt scelerisque nisl. Curabitur ut nisl a dui facilisis
vulputate. Mauris eu elit et felis hendrerit blandit. Cras magna
dolor, imperdiet eget rutrum tempus, euismod nec augue.
</p>
<p>
Ut in sem sit amet felis scelerisque elementum. Suspendisse vitae
neque magna, in laoreet felis. Aenean elit ligula, tempor in
vestibulum ac, porttitor nec lacus. Aenean urna mi, dictum feugiat
placerat eget, congue nec dolor. Etiam pellentesque dictum nulla id
vulputate. Etiam sit amet vehicula purus. Integer quis mi nisl,
gravida malesuada enim. Donec malesuada felis nisi. Etiam id magna
a libero pulvinar ullamcorper in nec neque. Duis pulvinar massa nec
magna scelerisque vitae vulputate ipsum luctus.
</p>
<ul id="pageNumbers">
<li> 1 </li>
<li>
<a title="Page 1" href="/article.html">1</a>
</li>
<li>
<a title="Page 2" href="/article.html?pagewanted=2">2</a>
</li>
</ul>
</body>
</html>