Compare commits


127 Commits
dev ... master

Author SHA1 Message Date
Yuri Baburov 73c598df81 Updated version to 0.8.1.1 4 years ago
Yuri Baburov 67f46604dd A fix for mac version test. 4 years ago
Yuri Baburov 449dc2066b Releases packaging improvements. 4 years ago
Yuri Baburov bb81dc2c74
Merge pull request #148 from nabinkhadka/master
Changed log level of doc candidates from .info to .debug. #141
4 years ago
Nabin Khadka 531ecc7a29
Changes log level #141 4 years ago
Yuri Baburov 1e3b8504bb
Merge pull request #147 from anekos/fix/UnicodeDecodeError-on-python2
Fix UnicodeDecodeError on python2
4 years ago
anekos 667114463d Fix UnicodeDecodeError on python2 4 years ago
Yuri Baburov e4a699bbb0
Update README.rst 4 years ago
Yuri Baburov 1997b80eaf
Update __init__.py 4 years ago
Yuri Baburov 9e0fb6ec77
Update README.rst 4 years ago
Yuri Baburov a4dbaee02e
Merge pull request #145 from anekos/fix/causing-lxml-error
Fixed lxml error on some Chinese texts.
4 years ago
anekos 6842ea906e Fix causing lxml error 4 years ago
Yuri Baburov ede4d015ab
Merge pull request #142 from tim77/add-license-1
Add LICENSE file
4 years ago
Artem Polishchuk 5916527898
Add LICENSE file 4 years ago
Yuri Baburov 5800210e99
Merge pull request #136 from adbar/master
add coverage testing
4 years ago
Yuri Baburov 4b864d6306
Merge pull request #131 from azmeuk/black
Used black to format the code
4 years ago
Adrien Barbaresi 14d4474f33 add coverage tests 4 years ago
Éloi Rivard e9acdd091b Use black to format the code 4 years ago
Yuri Baburov 5a74140fdb
Merge pull request #132 from azmeuk/readme
Syntax highlight the README
4 years ago
Yuri Baburov 07f6861ece
Merge pull request #135 from adbar/master
unnecessary imports removed
added lines for conformity and readability
linted code parts
4 years ago
Adrien Barbaresi bd8293eb63 code linting 4 years ago
Yuri Baburov 17ffad5a26
Merge pull request #134 from adbar/patch-1
Extended travis config:
 - Python versions added (3.9, pypy)
 - OS added (MacOS, 2 different versions)
4 years ago
Yuri Baburov baf03e0d8e
Update .travis.yml 4 years ago
Yuri Baburov 8c122cc862
Update .travis.yml 4 years ago
Yuri Baburov 28db33a1ad
Update .travis.yml 4 years ago
Yuri Baburov 44ee1c4a87
Update .travis.yml 4 years ago
Adrien Barbaresi 9a85102555
Set TOXENV for macOS tests 4 years ago
Adrien Barbaresi 8ea6a20e01
Skip missing interpreters in tox.ini 4 years ago
Adrien Barbaresi a98151e6dd
Extended travis config
- Python versions added (3.9, pypy)
- OS added (MacOS, 2 different versions)
4 years ago
Éloi Rivard 0556abb794 Syntax highlight the README 4 years ago
Yuri Baburov 615ce803c6
Merge pull request #124 from dariobig/patch-1
Catch LookupError in case of bad encoding string
4 years ago
Yuri Baburov 52f767c812
Update __init__.py 4 years ago
Yuri Baburov c24808fbb2
Update README.rst 4 years ago
Yuri Baburov da9e285f73
Merge pull request #128 from azmeuk/self-closing
Replaced XHTML output with HTML5 output in summary for empty elements (a, br), issue #125
4 years ago
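A rough illustration of the serialization difference described in that PR, using lxml directly on a made-up fragment (not code from the repository):

```
from lxml.html import fromstring
from lxml.etree import tounicode

fragment = fromstring("<div>line one<br>line two</div>")

# Pre-0.8 behaviour: XML-style serialization self-closes void elements.
print(tounicode(fragment))                 # <div>line one<br/>line two</div>

# 0.8 behaviour: HTML5 serialization keeps them as plain <br>.
print(tounicode(fragment, method="html"))  # <div>line one<br>line two</div>
```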
Yuri Baburov 5032e2d3ab
Merge pull request #127 from azmeuk/warnings
Fixed a few regex warnings, thanks azmeuk !
4 years ago
Yuri Baburov 471d89dde9
Merge pull request #126 from azmeuk/py38
Added official python 3.8 support, dropped python 3.4 support.
Thanks Éloi Rivard (@azmeuk) !
4 years ago
Yuri Baburov 4980b0c141
Merge branch 'master' into py38 4 years ago
Yuri Baburov 331b58ef50
Merge pull request #129 from azmeuk/doc
Added basic documentation
4 years ago
Éloi Rivard f9977b727d Documentation draft 4 years ago
Éloi Rivard 0846955dd7 Fixed issue with self-closing tags. Fix #125 4 years ago
Éloi Rivard 6c1c6391e2 Fixed a few regex warnings 4 years ago
Éloi Rivard 326fb43b4c Drop support for python 3.4 - Add support for python 3.8 4 years ago
Dario 0442358942
Catch LookupError in case of bad encoding string
I've seen cases where bad encoding strings result in errors; catching LookupError should solve the problem by falling back to `chardet` or `utf-8`.

Here's one case:

```
 textPayload: "Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/readability/readability.py", line 189, in summary
    self._html(True)
  File "/opt/conda/lib/python3.7/site-packages/readability/readability.py", line 132, in _html
    self.html = self._parse(self.input)
  File "/opt/conda/lib/python3.7/site-packages/readability/readability.py", line 141, in _parse
    doc, self.encoding = build_doc(input)
  File "/opt/conda/lib/python3.7/site-packages/readability/htmls.py", line 17, in build_doc
    encoding = get_encoding(page) or 'utf-8'
  File "/opt/conda/lib/python3.7/site-packages/readability/encoding.py", line 46, in get_encoding
    page.decode(encoding)
LookupError: unknown encoding: utf-8, ie=edge, chrome=1
```
5 years ago
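A minimal sketch of the fallback described above, with simplified names rather than the library's exact code: an unknown declared charset raises LookupError, which is treated the same as a failed decode.

```
import chardet

def pick_encoding(page_bytes, declared_encodings):
    for declared in declared_encodings:
        try:
            page_bytes.decode(declared)  # raises LookupError for names like "utf-8, ie=edge, chrome=1"
            return declared
        except (UnicodeDecodeError, LookupError):
            continue
    # fall back to chardet, then utf-8
    return chardet.detect(page_bytes)["encoding"] or "utf-8"
```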
Yuri Baburov de20908e57
Update README.rst 5 years ago
Yuri Baburov 4fa85d2778
Merge pull request #116 from baby5/master
Fixed compile_pattern to support uppercase.
5 years ago
baby5 0ac3c5bbc6 Fix compile_pattern not supporting uppercase 5 years ago
Yuri Baburov a4ac1c7704
Merge pull request #115 from johnklee/Issue99
Fix #99 - Hide exceptions raised during "a href" normalization; added a handle_failures parameter defaulting to "discard" for bad URLs.
5 years ago
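Hypothetical usage of the parameter introduced in that PR, matching the Document signature shown in the readability.py diff further down this page; a malformed href is discarded instead of aborting link absolutization.

```
from readability import Document

html = '<html><body><p>Text <a href="http://[http://broken">link</a></p></body></html>'
doc = Document(html, url="https://example.com/article", handle_failures="discard")
print(doc.summary())
```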
jkclee bac691a0a4 Fix #99 5 years ago
Yuri Baburov 3cbede6be4
Update README.rst 6 years ago
Yuri Baburov d40c4dd34d
Update README.rst 6 years ago
Yuri Baburov 9aba330e68
Update README.rst 6 years ago
Yuri Baburov 0b28643f0d
Update README.rst 6 years ago
Yuri Baburov 59b99ffa0b
Merge pull request #105 from pypt/many_repeated_spaces_timeout
Trim many repeated spaces to make clean() faster
6 years ago
Yuri Baburov 494b19ed4e
Merge branch 'master' into many_repeated_spaces_timeout 6 years ago
Yuri Baburov dca6e2197a
Merge pull request #107 from pypt/module_version_constant
Add __version__ constant to __init__.py, read it in setup.py
6 years ago
Yuri Baburov 5215ab657b
Merge pull request #106 from pypt/python_3_7
Improvements for Python 3.7 support and CI
6 years ago
Linas Valiukas 68fb5ad4c0 Try a workaround to make build work on 3.7
https://github.com/travis-ci/travis-ci/issues/9815
6 years ago
Linas Valiukas 34fce7664d Update Python version in .travis.yml 6 years ago
Linas Valiukas 0233936e72 Add __version__ constant to __init__.py, read it in setup.py
Users no longer need to install, import and use "pkg_resources" to
find out which version of readability-lxml is being used.
6 years ago
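One common way to implement what this commit describes, not necessarily the exact diff: keep the version string in readability/__init__.py and have setup.py read it with a regex instead of relying on pkg_resources.

```
# readability/__init__.py
__version__ = "0.8.1.1"

# setup.py (illustrative)
import re

with open("readability/__init__.py") as f:
    version = re.search(r'__version__\s*=\s*"([^"]+)"', f.read()).group(1)

# setup(name="readability-lxml", version=version, ...)
```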
Linas Valiukas 63fbc36cb8 Close sample input file after reading it
Otherwise tests spit out:

    ResourceWarning: unclosed file <_io.TextIOWrapper name='/Users/pypt/Dropbox/etc-MediaCloud/python-readability/tests/samples/si-game.sample.html' mode='r' encoding='UTF-8'>
    return open(os.path.join(SAMPLES, filename)).read()
6 years ago
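The usual fix for that warning, sketched here with an assumed helper name rather than the repository's test code: read the sample inside a context manager so the file is closed afterwards.

```
import os

def load_sample(filename, samples_dir="tests/samples"):
    with open(os.path.join(samples_dir, filename)) as f:
        return f.read()
```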
Linas Valiukas bdb6d671d8 Test with Python 3.7 on Travis 6 years ago
Linas Valiukas 34d198fe5a Add Python 3.7 classifier 6 years ago
Linas Valiukas 2bbb70b3e5 Fix Travis build
Add "test" extra and install dependencies for said extra as detailed in:

https://stackoverflow.com/a/41398850/200603
6 years ago
Linas Valiukas 747c46abce Trim many repeated spaces to make clean() faster
When Readability encounters long runs of repeated whitespace, the cleanup
regexes in clean() take an extremely long time to run, so trim each whitespace run
to 255 characters.

Additionally, test the extracting performance with "timeout_decorator".
6 years ago
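The trimming step itself appears in the readability.py diff near the end of this page; for reference, the idea is a single regex pass that caps any whitespace run at 255 characters before the slower cleanup regexes run:

```
import re

def clean(text):
    text = re.sub(r"\s{255,}", " " * 255, text)  # cap long whitespace runs first
    text = re.sub(r"\s*\n\s*", "\n", text)
    text = re.sub(r"\t|[ \t]{2,}", " ", text)
    return text.strip()
```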
Yuri Baburov 8235f0794c Trying to pass travis tests. 6 years ago
Yuri Baburov f7f439d019 Improved positive_keywords and negative_keywords processing for the CLI 6 years ago
Yuri Baburov 0c8f040d53 Updated docs for positive_keywords and negative_keywords, cleaner implementation. 6 years ago
Yuri Baburov 0e50b53d05 Release version 0.7 . Better HTML5 support and an important bugfix. 6 years ago
Yuri Baburov 537de2b8f6 Improved remove_unlikely_candidates following an advice from issue #102 6 years ago
Yuri Baburov 97e86c4559
Merge pull request #101 from hugovk/add-3.5-3.6
Add support for Python 3.5 and 3.6, drop support for Python 3.3 and 2.6
7 years ago
Hugo f4a04732fd Workaround for py35 7 years ago
Hugo 4172699812 Add Python 3.5 and 3.6 7 years ago
Hugo f74adc6893 Drop support for EOL Python 3.3 7 years ago
Hugo 27159f45b3 Drop support for EOL Python 2.6 7 years ago
Yuri Baburov 78cac34bb3 Merge pull request #96 from ccurvey/master
fix encoding detection to use the encoding being tested
7 years ago
Chris Curvey 9a31587192 fix encoding detection to use the encoding being tested 7 years ago
Yuri Baburov e4efc87a20 Update readability.py 8 years ago
Yuri Baburov b20d5c15ef Improved Document class documentation 8 years ago
Yuri Baburov b6e5921f27 Merge pull request #85 from lwm/add-badge
Add travis badge
8 years ago
Luke Murphy aafcf52e58 add travis badge
[ci skip]
8 years ago
Yuri Baburov 5337adc590 Merge pull request #82 from lwm/fixup-readme
Makeover for the README
8 years ago
Yuri Baburov 86cde20ba6 Merge pull request #84 from lwm/add-travis-file
Add travis file
8 years ago
Luke Murphy a1d6bbcd3f add travis file 8 years ago
Luke Murphy 82837e4b5c makeover for the README [ci skip] 8 years ago
Yuri Baburov 75f2ea0d00 Version bump to 0.6.2 8 years ago
Yuri Baburov 47e473fb91 Merge pull request #73 from alphapapa/patch-1
Using Mozilla User-Agent by default
8 years ago
alphapapa 8443a87f5c Update readability.py 8 years ago
alphapapa 5fc2d3684a Use Mozilla User-Agent
Use a "Mozilla" user-agent to avoid HTTP 403 errors.  Fixes #71.
8 years ago
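For illustration only (the actual change is in readability's URL-fetching helper, which is not reproduced on this page): sending a Mozilla-style User-Agent header avoids the HTTP 403 responses some servers return for the default Python user agent.

```
try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen         # Python 2

def fetch(url):
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    return urlopen(req).read()
```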
Yuri Baburov 65d1ebb06d Fixed #70 and added xpath option 9 years ago
Yuri Baburov fae95bad45 Bump to v0.6.1 -- fixed logging 9 years ago
Yuri Baburov c0d794fdd8 Update readability.py
Fixed logging namespace
9 years ago
Yuri Baburov 398f6ad748 Bump to 0.6.0.5 9 years ago
Yuri Baburov 8ff11e68a6 Debugging improvements. Bump to 0.6.0.5 9 years ago
Yuri Baburov fcdbe563a5 Fixed #49. Bump to 0.6.0.4 9 years ago
Yuri Baburov c51886d923 Bump to 0.6.0.3
Fixed installation.
9 years ago
Yuri Baburov 24bb20c761 Added dev branch features.
Bumped to version 0.6
9 years ago
Yuri Baburov 154658798b Merge pull request #64 from martinth/master
Added python 3 support (Supported: python 2.6, 2.7, 3.3, 3.4).
Thanks a lot to @martinth
9 years ago
Yuri Baburov 83a7ce67c1 Merge pull request #68 from digitaldavenyc/python3
fix for setup, convert print to python 3 compatible format
9 years ago
Dave Padovano 1ac3e019bd fix for setup, convert print to python 3 compatible format 9 years ago
Yuri Baburov 1aabdb3d27 Merge pull request #67 from horva/fix-logging-config
Move logging.basicConfig to main function
9 years ago
Marko Horvatic f0ff9b2425 Move logging.basicConfig to main function 9 years ago
Yuri Baburov e2bc1ea055 Improved #65 which was emitting a warning, added the cssselect lib, bumped to 0.5.1 9 years ago
Yuri Baburov 1cb17d919b Merge pull request #65 from avalanchy/best_elem_is_root
Failure if best_elem is root (fix #58)
Thanks a lot @avalanchy and @jnothman !
9 years ago
Mariusz Osiecki bf9e7404fa Failure if best_elem is root (fix #58) 9 years ago
Martin Thurau 386e48d29b Fixes checking of declared encodings in get_encoding.
In Python 3, .decode() on bytes requires the encoding name to be a str, which means we have to convert the extracted encoding before we can use it.
9 years ago
Martin Thurau 046d2c10c3 Fixes regex declaration in get_encoding.
Since get_encoding() is only called when the input is *not* already unicode, we need to declare the regexes as byte patterns so they continue to work in Python 3.
9 years ago
Martin Thurau ce7ca26835 Adds compatibility `raise_with_traceback` method to support different `raise` syntax
Unfortunately the Python 2 `raise` syntax is not supported in Python 3.3 and not in all 3.4.x versions, so we deal with that by using conditional imports and a compatibility layer.
9 years ago
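An illustrative call site for that helper (the import path and the failing function are assumptions; the helper's signature matches the compat diffs shown later on this page):

```
import sys
from readability.compat import raise_with_traceback  # assumed import path

class Unparseable(ValueError):
    pass

def parse_document():
    raise ValueError("boom")  # stand-in for a real parsing failure

try:
    parse_document()
except Exception as e:
    # re-raise as Unparseable while keeping the original traceback,
    # on both Python 2 and Python 3
    raise_with_traceback(Unparseable, sys.exc_info()[2], str(e))
```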
Martin Thurau 3ac56329e2 Corrects some things where 2to3 did too much. 9 years ago
Martin Thurau aa4132f57a Adds Python 3.4 support.
Code now supports Python 2.6, 2.7 and 3.4. Python 3.3 isn't supported
because of some issues with the parser and the difference between old and
new `raise` syntax.
9 years ago
Martin Thurau 13cca1dd19 Adds tox configuration.
Adds tox.ini to support running the tests on multiple versions. Adds
requirements.txt to support dependency installation via pip.
9 years ago
Yuri Baburov 1d4ee9d421 Releasing as version 0.5 9 years ago
Yuri Baburov 987570bef0 Updated package links for Python 2.7 and Python 3 support 9 years ago
Yuri Baburov dc648e7d0b Added a test for issue #48 but can't reproduce it -- seems to work fine. 9 years ago
Yuri Baburov c715426584 Releasing as version 0.4 9 years ago
Yuri Baburov 1fac7e685a Added a feature to allow more images per article (with a test) 9 years ago
Yuri Baburov c6796195a7 Fixed makefile testing. 9 years ago
Miguel Galves d04d41b749 Insert text inside iframe for correct output 9 years ago
Miguel Galves be2a1c4646 Leave width and height attributes in place 8 years ago
Miguel Galves f1759c1404 Allows iframes containing youtube or vimeo videos. People like them 9 years ago
Yuri Baburov 332ad810de Bumped to 0.3.0.6 9 years ago
Yuri Baburov e4bcbe57d7 Fixes #53 9 years ago
Yuri Baburov aeb4f4c782 Merge pull request #59 from seomoz/mac_10_10
Fix mac version comparison in setup.py for 10.10
9 years ago
Matthew Peters c8c2f8809c Fix mac version comparison in setup.py for 10.10 9 years ago
Yuri Baburov 2d4cfdb2c8 Merge pull request #56 from nathanathan/patch-1
Defaulting to utf-8 when chardet returns None
10 years ago
Nathan Breit 75e2e0cb3a Defaulting to utf-8 when chardet returns None
On articles like this one, chardet returns None:
http://news.zing.vn/nhip-song-tre/thay-giao-gay-sot-tung-bo-luat-tinh-yeu/a291427.html
This causes exceptions later on when encoding.lower() is called.
10 years ago
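The guard itself shows up in the encoding.py diff below; it is just a one-line fallback applied to chardet's result:

```
import chardet

res = chardet.detect(b"<html><body>...</body></html>")
enc = res["encoding"] or "utf-8"  # chardet may return None when it cannot guess
```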
Yuri Baburov 0c2f29ed0d Version bump. 10 years ago
Yuri Baburov 638f73f6a2 Fix for #52: <input type="hidden"> elements are no longer counted by the "form removal" heuristic. 10 years ago

10
.gitignore vendored

@@ -1,7 +1,8 @@
*.pyc
__pycache__
*.egg-info
build
dist
/build
/dist
/bin
/include
/lib
@@ -9,3 +10,8 @@ dist
/man
nosetests.xml
.coverage
.tox
.idea
.cache
/.noseids
/.venv

@@ -0,0 +1,60 @@
language: python
os: linux
cache: pip
matrix:
include:
- name: "Python 2.7 on Linux"
python: 2.7
env: PIP=pip
- name: "Python 3.5 on Linux"
python: 3.5
- name: "Python 3.6 on Linux"
python: 3.6
- name: "Python 3.7 on Linux"
python: 3.7
- name: "Python 3.8 on Linux"
dist: xenial
python: 3.8
- name: "Python 3.9 Nightly on Linux"
dist: bionic
python: nightly
- name: "Pypy on Linux"
python: pypy
env: PIP=pip
- name: "Pypy 3 on Linux"
python: pypy3
- name: "Python 3.7 on older macOS"
os: osx
osx_image: xcode9.4
language: shell
env: TOXENV=py37
before_install:
- sw_vers
- python3 --version
- pip3 --version
- name: "Python 3.7 on macOS"
os: osx
osx_image: xcode11
language: shell
env: TOXENV=py37
before_install:
- sw_vers
- python3 --version
- pip3 --version
allow_failures:
- python: nightly
- python: pypy
- python: pypy3
- os: osx
install:
- if [ $PIP ]; then true; else PIP=pip3; fi
- travis_retry $PIP install -U pip wheel tox-travis pytest-cov codecov
- travis_retry $PIP install -U -r requirements.txt -e ".[test]"
script:
- tox
after_success:
- codecov

@@ -0,0 +1,201 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

@@ -1,10 +1,10 @@
# Makefile to help automate tasks
WD := $(shell pwd)
PY := bin/python
PIP := bin/pip
PEP8 := bin/pep8
NOSE := bin/nosetests
PY := .venv/bin/python
PIP := .venv/bin/pip
PEP8 := .venv/bin/pep8
NOSE := .venv/bin/nosetests
TWINE := twine
# ###########
# Tests rule!
@@ -13,25 +13,29 @@ NOSE := bin/nosetests
test: venv develop $(NOSE)
$(NOSE) --with-id -s tests
$(NOSE):
$(PIP) install nose pep8 coverage
$(NOSE): setup
# #######
# INSTALL
# #######
.PHONY: all
all: venv develop
all: setup develop
venv: .venv/bin/python
setup: venv
$(PIP) install -r requirements-dev.txt
venv: bin/python
bin/python:
virtualenv .
.venv/bin/python:
test -d .venv || which python3 && python3 -m venv .venv || virtualenv .venv
.PHONY: clean_venv
clean_venv:
rm -rf bin include lib local man
.PHONY: clean
clean:
rm -rf .venv
develop: lib/python*/site-packages/bookie-api.egg-link
lib/python*/site-packages/bookie-api.egg-link:
develop: .venv/lib/python*/site-packages/readability-lxml.egg-link
.venv/lib/python*/site-packages/readability-lxml.egg-link:
$(PY) setup.py develop
@@ -41,17 +45,17 @@ lib/python*/site-packages/bookie-api.egg-link:
.PHONY: clean_all
clean_all: clean_venv
# ###########
# Deploy
# ###########
.PHONY: dist
dist:
$(PY) setup.py sdist
$(PY) setup.py sdist bdist_wheel
$(TWINE) check dist/*
.PHONY: upload
upload:
$(PY) setup.py sdist upload
$(TWINE) upload dist/*
.PHONY: version_update
version_update:

@@ -1,59 +0,0 @@
This code is under the Apache License 2.0. http://www.apache.org/licenses/LICENSE-2.0
This is a python port of a ruby port of arc90's readability project
http://lab.arc90.com/experiments/readability/
In few words,
Given a html document, it pulls out the main body text and cleans it up.
It also can clean up title based on latest readability.js code.
Based on:
- Latest readability.js ( https://github.com/MHordecki/readability-redux/blob/master/readability/readability.js )
- Ruby port by starrhorne and iterationlabs
- Python port by gfxmonk ( https://github.com/gfxmonk/python-readability , based on BeautifulSoup )
- Decruft effort to move to lxml ( http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/ )
- "BR to P" fix from readability.js which improves quality for smaller texts.
- Github users contributions.
Installation::
easy_install readability-lxml
or
pip install readability-lxml
Usage::
from readability.readability import Document
import urllib
html = urllib.urlopen(url).read()
readable_article = Document(html).summary()
readable_title = Document(html).short_title()
Command-line usage::
python -m readability.readability -u http://pypi.python.org/pypi/readability-lxml
Using positive/negative keywords example::
python -m readability.readability -p intro -n newsindex,homepage-box,news-section -u http://python.org
Document() kwarg options:
- attributes:
- debug: output debug messages
- min_text_length:
- retry_length:
- url: will allow adjusting links to be absolute
- positive_keywords: the list of positive search patterns in classes and ids, for example: ["news-item", "block"]
- negative_keywords: the list of negative search patterns in classes and ids, for example: ["mysidebar", "related", "ads"]
Updates
- 0.2.5 Update setup.py for uploading .tar.gz to pypi
- 0.2.6 Don't crash on documents with no title
- 0.2.6.1 Document.short_title() properly works
- 0.3 Added Document.encoding, positive_keywords and negative_keywords

@@ -0,0 +1,68 @@
.. image:: https://travis-ci.org/buriy/python-readability.svg?branch=master
:target: https://travis-ci.org/buriy/python-readability
python-readability
==================
Given a html document, it pulls out the main body text and cleans it up.
This is a python port of a ruby port of `arc90's readability
project <http://lab.arc90.com/experiments/readability/>`__.
Installation
------------
It's easy using ``pip``, just run:
.. code-block:: bash
$ pip install readability-lxml
Usage
-----
.. code-block:: python
>>> import requests
>>> from readability import Document
>>> response = requests.get('http://example.com')
>>> doc = Document(response.text)
>>> doc.title()
'Example Domain'
>>> doc.summary()
"""<html><body><div><body id="readabilityBody">\n<div>\n <h1>Example Domain</h1>\n
<p>This domain is established to be used for illustrative examples in documents. You may
use this\n domain in examples without prior coordination or asking for permission.</p>
\n <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div>
\n</body>\n</div></body></html>"""
Change Log
----------
- 0.8.1 Fixed processing of non-ascii HTMLs via regexps.
- 0.8 Replaced XHTML output with HTML5 output in summary() call.
- 0.7.1 Support for Python 3.7 . Fixed a slowdown when processing documents with lots of spaces.
- 0.7 Improved HTML5 tags handling. Fixed stripping unwanted HTML nodes (only first matching node was removed before).
- 0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3 - 3.6
- 0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4
- 0.4 Added Videos loading and allowed more images per paragraph
- 0.3 Added Document.encoding, positive\_keywords and negative\_keywords
Licensing
---------
This code is under `the Apache License
2.0 <http://www.apache.org/licenses/LICENSE-2.0>`__ license.
Thanks to
---------
- Latest `readability.js <https://github.com/MHordecki/readability-redux/blob/master/readability/readability.js>`__
- Ruby port by starrhorne and iterationlabs
- `Python port <https://github.com/gfxmonk/python-readability>`__ by gfxmonk
- `Decruft effort <http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/>` to move to lxml
- "BR to P" fix from readability.js which improves quality for smaller texts
- Github users contributions.

@@ -0,0 +1,30 @@
Reference
=========
.. automodule:: readability
:members:
:show-inheritance:
.. automodule:: readability.browser
:members:
:show-inheritance:
.. automodule:: readability.cleaners
:members:
:show-inheritance:
.. automodule:: readability.debug
:members:
:show-inheritance:
.. automodule:: readability.encoding
:members:
:show-inheritance:
.. automodule:: readability.htmls
:members:
:show-inheritance:
.. automodule:: readability.readability
:members:
:show-inheritance:

@@ -0,0 +1,164 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# readability documentation build configuration file, created by
# sphinx-quickstart on Thu Mar 23 16:29:38 2017.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys
sys.path.insert(0, os.path.abspath("../.."))
import readability
# -- General configuration ------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
"sphinx.ext.autodoc",
"sphinx.ext.doctest",
"sphinx.ext.intersphinx",
"sphinx.ext.todo",
"recommonmark",
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
source_suffix = [".rst", ".md"]
# The master toctree document.
master_doc = "index"
# General information about the project.
project = "readability"
copyright = "2020, Yuri Baburov"
author = "Yuri Baburov"
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
# The short X.Y version.
version = readability.__version__
# The full version, including alpha/beta/rc tags.
release = readability.__version__
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This patterns also effect to html_static_path and html_extra_path
exclude_patterns = []
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = "sphinx"
# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = False
# -- Options for HTML output ----------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = "sphinx_rtd_theme"
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#
# html_theme_options = {}
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = [] #'_static']
# -- Options for HTMLHelp output ------------------------------------------
# Output file base name for HTML help builder.
htmlhelp_basename = "readabilitydoc"
# -- Options for LaTeX output ---------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#
# 'papersize': 'letterpaper',
# The font size ('10pt', '11pt' or '12pt').
#
# 'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
#
# 'preamble': '',
# Latex figure (float) alignment
#
# 'figure_align': 'htbp',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [(master_doc, "readability.tex", "Readability Documentation", "Yuri Baburov", "manual")]
# -- Options for manual page output ---------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [(master_doc, "readability", "readability Documentation", [author], 1)]
# -- Options for Texinfo output -------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(
master_doc,
"readability",
"Readability Documentation",
author,
"readability",
"One line description of project.",
"Miscellaneous",
)
]
intersphinx_mapping = {
"python": ("https://docs.python.org/3", None),
}

@@ -0,0 +1,13 @@
.. include:: ../../README.rst
.. toctree::
:maxdepth: 2
api
Indices and tables
------------------
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

@@ -1 +1,3 @@
__version__ = "0.8.1.1"
from .readability import Document

@@ -0,0 +1,21 @@
def open_in_browser(html):
"""
Open the HTML document in a web browser, saving it to a temporary
file to open it. Note that this does not delete the file after
use. This is mainly meant for debugging.
"""
import os
import webbrowser
import tempfile
handle, fn = tempfile.mkstemp(suffix=".html")
f = os.fdopen(handle, "wb")
try:
f.write(b"<meta charset='UTF-8' />")
f.write(html.encode("utf-8"))
finally:
# we leak the file itself here, but we should at least close it
f.close()
url = "file://" + fn.replace(os.path.sep, "/")
webbrowser.open(url)
return url

@@ -2,31 +2,51 @@
import re
from lxml.html.clean import Cleaner
bad_attrs = ['width', 'height', 'style', '[-a-z]*color', 'background[-a-z]*', 'on*']
bad_attrs = ["width", "height", "style", "[-a-z]*color", "background[-a-z]*", "on*"]
single_quoted = "'[^']+'"
double_quoted = '"[^"]+"'
non_space = '[^ "\'>]+'
htmlstrip = re.compile("<" # open
"([^>]+) " # prefix
"(?:%s) *" % ('|'.join(bad_attrs),) + # undesirable attributes
'= *(?:%s|%s|%s)' % (non_space, single_quoted, double_quoted) + # value
"([^>]*)" # postfix
">" # end
, re.I)
non_space = "[^ \"'>]+"
htmlstrip = re.compile(
"<" # open
"([^>]+) " # prefix
"(?:%s) *" % ("|".join(bad_attrs),)
+ "= *(?:%s|%s|%s)" # undesirable attributes
% (non_space, single_quoted, double_quoted)
+ "([^>]*)" # value # postfix
">", # end
re.I,
)
def clean_attributes(html):
while htmlstrip.search(html):
html = htmlstrip.sub('<\\1\\2>', html)
html = htmlstrip.sub("<\\1\\2>", html)
return html
def normalize_spaces(s):
if not s: return ''
if not s:
return ""
"""replace any sequence of whitespace
characters with a single space"""
return ' '.join(s.split())
return " ".join(s.split())
html_cleaner = Cleaner(scripts=True, javascript=True, comments=True,
style=True, links=True, meta=False, add_nofollow=False,
page_structure=False, processing_instructions=True, embedded=False,
frames=False, forms=False, annoying_tags=False, remove_tags=None,
remove_unknown_tags=False, safe_attrs_only=False)
html_cleaner = Cleaner(
scripts=True,
javascript=True,
comments=True,
style=True,
links=True,
meta=False,
add_nofollow=False,
page_structure=False,
processing_instructions=True,
embedded=False,
frames=False,
forms=False,
annoying_tags=False,
remove_tags=None,
remove_unknown_tags=False,
safe_attrs_only=False,
)

@@ -0,0 +1,20 @@
"""
This module contains compatibility helpers for Python 2/3 interoperability.
It mainly exists because their are certain incompatibilities in the Python
syntax that can only be solved by conditionally importing different functions.
"""
import sys
from lxml.etree import tostring
if sys.version_info[0] == 2:
bytes_ = str
str_ = unicode
def tostring_(s):
return tostring(s, encoding='utf-8').decode('utf-8')
elif sys.version_info[0] == 3:
bytes_ = bytes
str_ = str
def tostring_(s):
return tostring(s, encoding='utf-8')

@@ -0,0 +1,6 @@
def raise_with_traceback(exc_type, traceback, *args, **kwargs):
"""
Raise a new exception of type `exc_type` with an existing `traceback`. All
additional (keyword-)arguments are forwarded to `exc_type`
"""
raise exc_type(*args, **kwargs).with_traceback(traceback)

@@ -0,0 +1,6 @@
def raise_with_traceback(exc_type, traceback, *args, **kwargs):
"""
Raise a new exception of type `exc_type` with an existing `traceback`. All
additional (keyword-)arguments are forwarded to `exc_type`
"""
raise exc_type(*args, **kwargs), None, traceback

@@ -1,25 +1,51 @@
def save_to_file(text, filename):
f = open(filename, 'wt')
f.write('<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />')
f.write(text.encode('utf-8'))
f.close()
uids = {}
def describe(node, depth=2):
if not hasattr(node, 'tag'):
import re
# FIXME: use with caution, can leak memory
uids = {}
uids_document = None
def describe_node(node):
global uids
if node is None:
return ""
if not hasattr(node, "tag"):
return "[%s]" % type(node)
name = node.tag
if node.get('id', ''): name += '#'+node.get('id')
if node.get('class', ''):
name += '.' + node.get('class').replace(' ','.')
if name[:4] in ['div#', 'div.']:
if node.get("id", ""):
name += "#" + node.get("id")
if node.get("class", "").strip():
name += "." + ".".join(node.get("class").split())
if name[:4] in ["div#", "div."]:
name = name[3:]
if name in ['tr', 'td', 'div', 'p']:
if not node in uids:
uid = uids[node] = len(uids)+1
else:
uid = uids.get(node)
name += "%02d" % (uid)
if depth and node.getparent() is not None:
return name+' - '+describe(node.getparent(), depth-1)
if name in ["tr", "td", "div", "p"]:
uid = uids.get(node)
if uid is None:
uid = uids[node] = len(uids) + 1
name += "{%02d}" % uid
return name
def describe(node, depth=1):
global uids, uids_document
doc = node.getroottree().getroot()
if doc != uids_document:
uids = {}
uids_document = doc
# return repr(NodeRepr(node))
parent = ""
if depth and node.getparent() is not None:
parent = describe(node.getparent(), depth=depth - 1) + ">"
return parent + describe_node(node)
RE_COLLAPSE_WHITESPACES = re.compile(r"\s+", re.U)
def text_content(elem, length=40):
content = RE_COLLAPSE_WHITESPACES.sub(" ", elem.text_content().replace("\r", ""))
if len(content) < length:
return content
return content[:length] + "..."

@@ -1,48 +1,63 @@
import re
import chardet
import sys
def get_encoding(page):
# Regex for XML and HTML Meta charset declaration
charset_re = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
pragma_re = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
xml_re = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')
declared_encodings = (charset_re.findall(page) +
pragma_re.findall(page) +
xml_re.findall(page))
RE_CHARSET = re.compile(br'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
RE_PRAGMA = re.compile(br'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
RE_XML = re.compile(br'^<\?xml.*?encoding=["\']*(.+?)["\'>]')
# Try any declared encodings
if len(declared_encodings) > 0:
for declared_encoding in declared_encodings:
try:
page.decode(custom_decode(declared_encoding))
return custom_decode(declared_encoding)
except UnicodeDecodeError:
pass
CHARSETS = {
"big5": "big5hkscs",
"gb2312": "gb18030",
"ascii": "utf-8",
"maccyrillic": "cp1251",
"win1251": "cp1251",
"win-1251": "cp1251",
"windows-1251": "cp1251",
}
# Fallback to chardet if declared encodings fail
text = re.sub('</?[^>]*>\s*', ' ', page)
enc = 'utf-8'
if not text.strip() or len(text) < 10:
return enc # can't guess
res = chardet.detect(text)
enc = res['encoding']
#print '->', enc, "%.2f" % res['confidence']
enc = custom_decode(enc)
return enc
def custom_decode(encoding):
def fix_charset(encoding):
"""Overrides encoding when charset declaration
or charset determination is a subset of a larger
charset. Created because of issues with Chinese websites"""
encoding = encoding.lower()
alternates = {
'big5': 'big5hkscs',
'gb2312': 'gb18030',
'ascii': 'utf-8',
'MacCyrillic': 'cp1251',
}
if encoding in alternates:
return alternates[encoding]
else:
return encoding
return CHARSETS.get(encoding, encoding)
def get_encoding(page):
# Regex for XML and HTML Meta charset declaration
declared_encodings = (
RE_CHARSET.findall(page) + RE_PRAGMA.findall(page) + RE_XML.findall(page)
)
# Try any declared encodings
for declared_encoding in declared_encodings:
try:
if sys.version_info[0] == 3:
# declared_encoding will actually be bytes but .decode() only
# accepts `str` type. Decode blindly with ascii because no one should
# ever use non-ascii characters in the name of an encoding.
declared_encoding = declared_encoding.decode("ascii", "replace")
encoding = fix_charset(declared_encoding)
# Now let's decode the page
page.decode(encoding)
# It worked!
return encoding
except (UnicodeDecodeError, LookupError):
pass
# Fallback to chardet if declared encodings fail
# Remove all HTML tags, and leave only text for chardet
text = re.sub(br"(\s*</?[^>]*>)+\s*", b" ", page).strip()
enc = "utf-8"
if len(text) < 10:
return enc # can't guess
res = chardet.detect(text)
enc = res["encoding"] or "utf-8"
# print '->', enc, "%.2f" % res['confidence']
enc = fix_charset(enc)
return enc

@@ -1,76 +1,101 @@
from cleaners import normalize_spaces, clean_attributes
from encoding import get_encoding
from lxml.html import tostring
import logging
import lxml.html
import re, sys
import re
from .cleaners import normalize_spaces, clean_attributes
from .encoding import get_encoding
from .compat import str_
utf8_parser = lxml.html.HTMLParser(encoding="utf-8")
utf8_parser = lxml.html.HTMLParser(encoding='utf-8')
def build_doc(page):
if isinstance(page, unicode):
enc = None
page_unicode = page
if isinstance(page, str_):
encoding = None
decoded_page = page
else:
enc = get_encoding(page) or 'utf-8'
page_unicode = page.decode(enc, 'replace')
doc = lxml.html.document_fromstring(page_unicode.encode('utf-8', 'replace'), parser=utf8_parser)
return doc, enc
encoding = get_encoding(page) or "utf-8"
decoded_page = page.decode(encoding, "replace")
# XXX: we have to do .decode and .encode even for utf-8 pages to remove bad characters
doc = lxml.html.document_fromstring(
decoded_page.encode("utf-8", "replace"), parser=utf8_parser
)
return doc, encoding
def js_re(src, pattern, flags, repl):
return re.compile(pattern, flags).sub(src, repl.replace('$', '\\'))
return re.compile(pattern, flags).sub(src, repl.replace("$", "\\"))
def normalize_entities(cur_title):
entities = {
u'\u2014':'-',
u'\u2013':'-',
u'&mdash;': '-',
u'&ndash;': '-',
u'\u00A0': ' ',
u'\u00AB': '"',
u'\u00BB': '"',
u'&quot;': '"',
u"\u2014": "-",
u"\u2013": "-",
u"&mdash;": "-",
u"&ndash;": "-",
u"\u00A0": " ",
u"\u00AB": '"',
u"\u00BB": '"',
u"&quot;": '"',
}
for c, r in entities.iteritems():
for c, r in entities.items():
if c in cur_title:
cur_title = cur_title.replace(c, r)
return cur_title
def norm_title(title):
return normalize_entities(normalize_spaces(title))
def get_title(doc):
title = doc.find('.//title')
if title is None or len(title.text) == 0:
return '[no-title]'
title = doc.find(".//title")
if title is None or title.text is None or len(title.text) == 0:
return "[no-title]"
return norm_title(title.text)
def add_match(collection, text, orig):
text = norm_title(text)
if len(text.split()) >= 2 and len(text) >= 15:
if text.replace('"', '') in orig.replace('"', ''):
if text.replace('"', "") in orig.replace('"', ""):
collection.add(text)
TITLE_CSS_HEURISTICS = [
"#title",
"#head",
"#heading",
".pageTitle",
".news_title",
".title",
".head",
".heading",
".contentheading",
".small_header_red",
]
def shorten_title(doc):
title = doc.find('.//title')
title = doc.find(".//title")
if title is None or title.text is None or len(title.text) == 0:
return ''
return ""
title = orig = norm_title(title.text)
candidates = set()
for item in ['.//h1', './/h2', './/h3']:
for item in [".//h1", ".//h2", ".//h3"]:
for e in list(doc.iterfind(item)):
if e.text:
add_match(candidates, e.text, orig)
if e.text_content():
add_match(candidates, e.text_content(), orig)
for item in ['#title', '#head', '#heading', '.pageTitle', '.news_title', '.title', '.head', '.heading', '.contentheading', '.small_header_red']:
for item in TITLE_CSS_HEURISTICS:
for e in doc.cssselect(item):
if e.text:
add_match(candidates, e.text, orig)
@@ -80,7 +105,7 @@ def shorten_title(doc):
if candidates:
title = sorted(candidates, key=len)[-1]
else:
for delimiter in [' | ', ' - ', ' :: ', ' / ']:
for delimiter in [" | ", " - ", " :: ", " / "]:
if delimiter in title:
parts = orig.split(delimiter)
if len(parts[0].split()) >= 4:
@@ -90,25 +115,30 @@ def shorten_title(doc):
title = parts[-1]
break
else:
if ': ' in title:
parts = orig.split(': ')
if ": " in title:
parts = orig.split(": ")
if len(parts[-1].split()) >= 4:
title = parts[-1]
else:
title = orig.split(': ', 1)[1]
title = orig.split(": ", 1)[1]
if not 15 < len(title) < 150:
return orig
return title
# is it necessary? Cleaner from LXML is initialized correctly in cleaners.py
def get_body(doc):
[ elem.drop_tree() for elem in doc.xpath('.//script | .//link | .//style') ]
raw_html = unicode(tostring(doc.body or doc))
for elem in doc.xpath(".//script | .//link | .//style"):
elem.drop_tree()
# tostring() always return utf-8 encoded string
# FIXME: isn't better to use tounicode?
raw_html = str_(tostring(doc.body or doc))
cleaned = clean_attributes(raw_html)
try:
#BeautifulSoup(cleaned) #FIXME do we really need to try loading it?
# BeautifulSoup(cleaned) #FIXME do we really need to try loading it?
return cleaned
except Exception: #FIXME find the equivalent lxml error
#logging.error("cleansing broke html content: %s\n---------\n%s" % (raw_html, cleaned))
except Exception: # FIXME find the equivalent lxml error
# logging.error("cleansing broke html content: %s\n---------\n%s" % (raw_html, cleaned))
return raw_html

@@ -1,39 +1,49 @@
#!/usr/bin/env python
from __future__ import print_function
import logging
import re
import sys
from collections import defaultdict
from lxml.etree import tostring
from lxml.etree import tounicode
from lxml.html import document_fromstring
from lxml.html import fragment_fromstring
from cleaners import clean_attributes
from cleaners import html_cleaner
from htmls import build_doc
from htmls import get_body
from htmls import get_title
from htmls import shorten_title
from .cleaners import clean_attributes
from .cleaners import html_cleaner
from .htmls import build_doc
from .htmls import get_body
from .htmls import get_title
from .htmls import shorten_title
from .compat import str_, bytes_, tostring_
from .debug import describe, text_content
logging.basicConfig(level=logging.INFO)
log = logging.getLogger()
log = logging.getLogger("readability.readability")
REGEXES = {
'unlikelyCandidatesRe': re.compile('combx|comment|community|disqus|extra|foot|header|menu|remark|rss|shoutbox|sidebar|sponsor|ad-break|agegate|pagination|pager|popup|tweet|twitter', re.I),
'okMaybeItsACandidateRe': re.compile('and|article|body|column|main|shadow', re.I),
'positiveRe': re.compile('article|body|content|entry|hentry|main|page|pagination|post|text|blog|story', re.I),
'negativeRe': re.compile('combx|comment|com-|contact|foot|footer|footnote|masthead|media|meta|outbrain|promo|related|scroll|shoutbox|sidebar|sponsor|shopping|tags|tool|widget', re.I),
'divToPElementsRe': re.compile('<(a|blockquote|dl|div|img|ol|p|pre|table|ul)', re.I),
#'replaceBrsRe': re.compile('(<br[^>]*>[ \n\r\t]*){2,}',re.I),
#'replaceFontsRe': re.compile('<(\/?)font[^>]*>',re.I),
#'trimRe': re.compile('^\s+|\s+$/'),
#'normalizeRe': re.compile('\s{2,}/'),
#'killBreaksRe': re.compile('(<br\s*\/?>(\s|&nbsp;?)*){1,}/'),
#'videoRe': re.compile('http:\/\/(www\.)?(youtube|vimeo)\.com', re.I),
#skipFootnoteLink: /^\s*(\[?[a-z0-9]{1,2}\]?|^|edit|citation needed)\s*$/i,
"unlikelyCandidatesRe": re.compile(
r"combx|comment|community|disqus|extra|foot|header|menu|remark|rss|shoutbox|sidebar|sponsor|ad-break|agegate|pagination|pager|popup|tweet|twitter",
re.I,
),
"okMaybeItsACandidateRe": re.compile(r"and|article|body|column|main|shadow", re.I),
"positiveRe": re.compile(
r"article|body|content|entry|hentry|main|page|pagination|post|text|blog|story",
re.I,
),
"negativeRe": re.compile(
r"combx|comment|com-|contact|foot|footer|footnote|masthead|media|meta|outbrain|promo|related|scroll|shoutbox|sidebar|sponsor|shopping|tags|tool|widget",
re.I,
),
"divToPElementsRe": re.compile(
r"<(a|blockquote|dl|div|img|ol|p|pre|table|ul)", re.I
),
#'replaceBrsRe': re.compile(r'(<br[^>]*>[ \n\r\t]*){2,}',re.I),
#'replaceFontsRe': re.compile(r'<(\/?)font[^>]*>',re.I),
#'trimRe': re.compile(r'^\s+|\s+$/'),
#'normalizeRe': re.compile(r'\s{2,}/'),
#'killBreaksRe': re.compile(r'(<br\s*\/?>(\s|&nbsp;?)*){1,}/'),
"videoRe": re.compile(r"https?:\/\/(www\.)?(youtube|vimeo)\.com", re.I),
# skipFootnoteLink: /^\s*(\[?[a-z0-9]{1,2}\]?|^|edit|citation needed)\s*$/i,
}
@@ -41,121 +51,170 @@ class Unparseable(ValueError):
pass
def describe(node, depth=1):
if not hasattr(node, 'tag'):
return "[%s]" % type(node)
name = node.tag
if node.get('id', ''):
name += '#' + node.get('id')
if node.get('class', ''):
name += '.' + node.get('class').replace(' ', '.')
if name[:4] in ['div#', 'div.']:
name = name[3:]
if depth and node.getparent() is not None:
return name + ' - ' + describe(node.getparent(), depth - 1)
return name
def to_int(x):
if not x:
return None
x = x.strip()
if x.endswith('px'):
if x.endswith("px"):
return int(x[:-2])
if x.endswith('em'):
if x.endswith("em"):
return int(x[:-2]) * 12
return int(x)
def clean(text):
text = re.sub('\s*\n\s*', '\n', text)
text = re.sub('[ \t]{2,}', ' ', text)
# Many spaces make the following regexes run forever
text = re.sub(r"\s{255,}", " " * 255, text)
text = re.sub(r"\s*\n\s*", "\n", text)
text = re.sub(r"\t|[ \t]{2,}", " ", text)
return text.strip()
def text_length(i):
return len(clean(i.text_content() or ""))
regexp_type = type(re.compile('hello, world'))
def compile_pattern(elements):
if not elements:
return None
if isinstance(elements, regexp_type):
elif isinstance(elements, re._pattern_type):
return elements
if isinstance(elements, basestring):
elements = elements.split(',')
return re.compile(u'|'.join([re.escape(x.lower()) for x in elements]), re.U)
elif isinstance(elements, (str_, bytes_)):
if isinstance(elements, bytes_):
elements = str_(elements, "utf-8")
elements = elements.split(u",")
if isinstance(elements, (list, tuple)):
return re.compile(u"|".join([re.escape(x.strip()) for x in elements]), re.U)
else:
raise Exception("Unknown type for the pattern: {}".format(type(elements)))
# assume string or string like object
class Document:
"""Class to build a etree document out of html."""
TEXT_LENGTH_THRESHOLD = 25
RETRY_LENGTH = 250
def __init__(self, input, positive_keywords=None, negative_keywords=None, **options):
def __init__(
self,
input,
positive_keywords=None,
negative_keywords=None,
url=None,
min_text_length=25,
retry_length=250,
xpath=False,
handle_failures="discard",
):
"""Generate the document
:param input: string of the html content.
kwargs:
- attributes:
- debug: output debug messages
- min_text_length:
- retry_length:
- url: will allow adjusting links to be absolute
- positive_keywords: the list of positive search patterns in classes and ids, for example: ["news-item", "block"]
- negative_keywords: the list of negative search patterns in classes and ids, for example: ["mysidebar", "related", "ads"]
Also positive_keywords and negative_keywords could be a regexp.
:param positive_keywords: regex, list or comma-separated string of patterns in classes and ids
:param negative_keywords: regex, list or comma-separated string in classes and ids
:param min_text_length: Tunable. Set to a higher value for more precise detection of longer texts.
:param retry_length: Tunable. Set to a lower value for better detection of very small texts.
:param xpath: If set to True, adds x="..." attribute to each HTML node,
containing xpath path pointing to original document path (allows to
reconstruct selected summary in original document).
:param handle_failures: Parameter passed to `lxml` for handling failure during exception.
Support options = ["discard", "ignore", None]
Examples:
positive_keywords=["news-item", "block"]
positive_keywords=["news-item, block"]
positive_keywords=re.compile("news|block")
negative_keywords=["mysidebar", "related", "ads"]
The Document class is not re-enterable.
It is designed to create a new Document() for each HTML file to process it.
API methods:
.title() -- full title
.short_title() -- cleaned up title
.content() -- full content
.summary() -- cleaned up content
"""
self.input = input
self.options = options
self.html = None
self.encoding = None
self.positive_keywords = compile_pattern(positive_keywords)
self.negative_keywords = compile_pattern(negative_keywords)
self.url = url
self.min_text_length = min_text_length
self.retry_length = retry_length
self.xpath = xpath
self.handle_failures = handle_failures
def _html(self, force=False):
if force or self.html is None:
self.html = self._parse(self.input)
if self.xpath:
root = self.html.getroottree()
for i in self.html.getiterator():
# print root.getpath(i)
i.attrib["x"] = root.getpath(i)
return self.html
def _parse(self, input):
doc, self.encoding = build_doc(input)
doc = html_cleaner.clean_html(doc)
base_href = self.url
if base_href:
# trying to guard against bad links like <a href="http://[http://...">
try:
# such support is added in lxml 3.3.0
doc.make_links_absolute(
base_href,
resolve_base_href=True,
handle_failures=self.handle_failures,
)
except TypeError: # make_links_absolute() got an unexpected keyword argument 'handle_failures'
# then we have lxml < 3.3.0
# please upgrade to lxml >= 3.3.0 if you're failing here!
doc.make_links_absolute(base_href, resolve_base_href=True)
else:
doc.resolve_base_href(handle_failures=self.handle_failures)
return doc
def content(self):
"""Returns document body"""
return get_body(self._html(True))
def title(self):
"""Returns document title"""
return get_title(self._html(True))
def short_title(self):
"""Returns cleaned up document title"""
return shorten_title(self._html(True))
def get_clean_html(self):
"""
An internal method, which can be overridden in subclasses, for example,
to disable or to improve DOM-to-text conversion in the .summary() method.
"""
return clean_attributes(tounicode(self.html, method="html"))
def summary(self, html_partial=False):
"""Generate the summary of the html docuemnt
"""
Given a HTML file, extracts the text of the article.
:param html_partial: return only the div of the document, don't wrap
in html and body tags.
in html and body tags.
Warning: It mutates internal DOM representation of the HTML document,
so it is better to call other API methods before this one.
"""
try:
ruthless = True
while True:
self._html(True)
for i in self.tags(self.html, "script", "style"):
i.drop_tree()
for i in self.tags(self.html, "body"):
i.set("id", "readabilityBody")
if ruthless:
self.remove_unlikely_candidates()
self.transform_misused_divs_into_paragraphs()
@@ -164,29 +223,35 @@ class Document:
best_candidate = self.select_best_candidate(candidates)
if best_candidate:
article = self.get_article(
candidates, best_candidate, html_partial=html_partial
)
else:
if ruthless:
log.debug("ruthless removal did not work. ")
log.info("ruthless removal did not work. ")
ruthless = False
self.debug(
("ended up stripping too much - "
"going for a safer _parse"))
log.debug(
(
"ended up stripping too much - "
"going for a safer _parse"
)
)
# try again
continue
else:
log.debug(
("Ruthless and lenient parsing did not work. "
"Returning raw html"))
article = self.html.find('body')
(
"Ruthless and lenient parsing did not work. "
"Returning raw html"
)
)
article = self.html.find("body")
if article is None:
article = self.html
cleaned_article = self.sanitize(article, candidates)
article_length = len(cleaned_article or "")
retry_length = self.retry_length
of_acceptable_length = article_length >= retry_length
if ruthless and not of_acceptable_length:
ruthless = False
@@ -194,32 +259,38 @@ class Document:
continue
else:
return cleaned_article
except Exception as e:
log.exception("error getting summary: ")
if sys.version_info[0] == 2:
from .compat.two import raise_with_traceback
else:
from .compat.three import raise_with_traceback
raise_with_traceback(Unparseable, sys.exc_info()[2], str_(e))
def get_article(self, candidates, best_candidate, html_partial=False):
# Now that we have the top candidate, look through its siblings for
# content that might also be related.
# Things like preambles, content split by ads that we removed, etc.
sibling_score_threshold = max([10, best_candidate["content_score"] * 0.2])
# create a new html document with a html->body->div
if html_partial:
output = fragment_fromstring("<div/>")
else:
output = document_fromstring("<div/>")
best_elem = best_candidate["elem"]
parent = best_elem.getparent()
siblings = parent.getchildren() if parent is not None else [best_elem]
for sibling in siblings:
# in lxml there no concept of simple text
# if isinstance(sibling, NavigableString): continue
append = False
if sibling is best_elem:
append = True
sibling_key = sibling # HashableElement(sibling)
if (
sibling_key in candidates
and candidates[sibling_key]["content_score"] >= sibling_score_threshold
):
append = True
if sibling.tag == "p":
@@ -229,9 +300,11 @@ class Document:
if node_length > 80 and link_density < 0.25:
append = True
elif (
node_length <= 80
and link_density == 0
and re.search(r"\.( |$)", node_content)
):
append = True
if append:
@@ -241,21 +314,21 @@ class Document:
output.append(sibling)
else:
output.getchildren()[0].getchildren()[0].append(sibling)
# if output is not None:
# output.append(best_elem)
return output
def select_best_candidate(self, candidates):
if not candidates:
return None
sorted_candidates = sorted(
candidates.values(), key=lambda x: x["content_score"], reverse=True
)
for candidate in sorted_candidates[:5]:
elem = candidate["elem"]
log.debug("Top 5 : %6.3f %s" % (candidate["content_score"], describe(elem)))
best_candidate = sorted_candidates[0]
return best_candidate
@@ -263,15 +336,13 @@ class Document:
link_length = 0
for i in elem.findall(".//a"):
link_length += text_length(i)
#if len(elem.findall(".//div") or elem.findall(".//p")):
# if len(elem.findall(".//div") or elem.findall(".//p")):
# link_length = link_length
total_length = text_length(elem)
return float(link_length) / max(total_length, 1)
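# Rough worked example (numbers assumed): for an element with 400 characters of text
# of which 30 sit inside <a> tags, get_link_density() returns 30 / 400 = 0.075,
# well under the 0.2 / 0.5 link-density thresholds used by sanitize() below.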
def score_paragraphs(self):
MIN_LEN = self.min_text_length
candidates = {}
ordered = []
for elem in self.tags(self._html(), "p", "pre", "td"):
@@ -293,20 +364,19 @@ class Document:
ordered.append(parent_node)
if grand_parent_node is not None and grand_parent_node not in candidates:
candidates[grand_parent_node] = self.score_node(grand_parent_node)
ordered.append(grand_parent_node)
content_score = 1
content_score += len(inner_text.split(","))
content_score += min((inner_text_len / 100), 3)
# if elem not in candidates:
# candidates[elem] = self.score_node(elem)
# WTF? candidates[elem]['content_score'] += content_score
candidates[parent_node]["content_score"] += content_score
if grand_parent_node is not None:
candidates[grand_parent_node]["content_score"] += content_score / 2.0
# Scale the final candidates score based on link density. Good content
# should have a relatively small link density (5% or less) and be
@@ -314,24 +384,23 @@ class Document:
for elem in ordered:
candidate = candidates[elem]
ld = self.get_link_density(elem)
score = candidate["content_score"]
log.debug(
"Branch %6.3f %s link density %.3f -> %6.3f"
% (score, describe(elem), ld, score * (1 - ld))
)
candidate["content_score"] *= 1 - ld
return candidates
def class_weight(self, e):
weight = 0
for feature in [e.get("class", None), e.get("id", None)]:
if feature:
if REGEXES["negativeRe"].search(feature):
weight -= 25
if REGEXES["positiveRe"].search(feature):
weight += 25
if self.positive_keywords and self.positive_keywords.search(feature):
@@ -340,10 +409,10 @@ class Document:
if self.negative_keywords and self.negative_keywords.search(feature):
weight -= 25
if self.positive_keywords and self.positive_keywords.match("tag-" + e.tag):
weight += 25
if self.negative_keywords and self.negative_keywords.match("tag-" + e.tag):
weight -= 25
return weight
@@ -351,107 +420,126 @@ class Document:
def score_node(self, elem):
content_score = self.class_weight(elem)
name = elem.tag.lower()
if name == "div":
if name in ["div", "article"]:
content_score += 5
elif name in ["pre", "td", "blockquote"]:
content_score += 3
elif name in ["address", "ol", "ul", "dl", "dd", "dt", "li", "form"]:
elif name in ["address", "ol", "ul", "dl", "dd", "dt", "li", "form", "aside"]:
content_score -= 3
elif name in ["h1", "h2", "h3", "h4", "h5", "h6", "th"]:
elif name in [
"h1",
"h2",
"h3",
"h4",
"h5",
"h6",
"th",
"header",
"footer",
"nav",
]:
content_score -= 5
return {"content_score": content_score, "elem": elem}
def remove_unlikely_candidates(self):
for elem in self.html.findall(".//*"):
s = "%s %s" % (elem.get("class", ""), elem.get("id", ""))
if len(s) < 2:
continue
if (
REGEXES["unlikelyCandidatesRe"].search(s)
and (not REGEXES["okMaybeItsACandidateRe"].search(s))
and elem.tag not in ["html", "body"]
):
log.debug("Removing unlikely candidate - %s" % describe(elem))
elem.drop_tree()
def transform_misused_divs_into_paragraphs(self):
for elem in self.tags(self.html, "div"):
# transform <div>s that do not contain other block elements into
# <p>s
# FIXME: The current implementation ignores all descendants that
# are not direct children of elem
# This results in incorrect results in case there is an <img>
# buried within an <a> for example
if not REGEXES["divToPElementsRe"].search(
str_(b"".join(map(tostring_, list(elem))))
):
# log.debug("Altering %s to p" % (describe(elem)))
elem.tag = "p"
#print "Fixed element "+describe(elem)
# print "Fixed element "+describe(elem)
for elem in self.tags(self.html, "div"):
if elem.text and elem.text.strip():
p = fragment_fromstring("<p/>")
p.text = elem.text
elem.text = None
elem.insert(0, p)
#print "Appended "+tounicode(p)+" to "+describe(elem)
# print "Appended "+tounicode(p)+" to "+describe(elem)
for pos, child in reversed(list(enumerate(elem))):
if child.tail and child.tail.strip():
p = fragment_fromstring("<p/>")
p.text = child.tail
child.tail = None
elem.insert(pos + 1, p)
#print "Inserted "+tounicode(p)+" to "+describe(elem)
if child.tag == 'br':
#print 'Dropped <br> at '+describe(elem)
# print "Inserted "+tounicode(p)+" to "+describe(elem)
if child.tag == "br":
# print 'Dropped <br> at '+describe(elem)
child.drop_tree()
def tags(self, node, *tag_names):
for tag_name in tag_names:
for e in node.findall(".//%s" % tag_name):
yield e
def reverse_tags(self, node, *tag_names):
for tag_name in tag_names:
for e in reversed(node.findall(".//%s" % tag_name)):
yield e
def sanitize(self, node, candidates):
MIN_LEN = self.min_text_length
for header in self.tags(node, "h1", "h2", "h3", "h4", "h5", "h6"):
if self.class_weight(header) < 0 or self.get_link_density(header) > 0.33:
header.drop_tree()
for elem in self.tags(node, "form", "iframe", "textarea"):
for elem in self.tags(node, "form", "textarea"):
elem.drop_tree()
for elem in self.tags(node, "iframe"):
if "src" in elem.attrib and REGEXES["videoRe"].search(elem.attrib["src"]):
elem.text = "VIDEO" # ADD content to iframe text node to force <iframe></iframe> proper output
else:
elem.drop_tree()
allowed = {}
# Conditionally clean <table>s, <ul>s, and <div>s
for el in self.reverse_tags(node, "table", "ul", "div"):
for el in self.reverse_tags(
node, "table", "ul", "div", "aside", "header", "footer", "section"
):
if el in allowed:
continue
weight = self.class_weight(el)
if el in candidates:
content_score = candidates[el]["content_score"]
# print '!',el, '-> %6.3f' % content_score
else:
content_score = 0
tag = el.tag
if weight + content_score < 0:
self.debug("Cleaned %s with score %6.3f and weight %-3s" %
(describe(el), content_score, weight, ))
log.debug(
"Removed %s with score %6.3f and weight %-3s"
% (describe(el), content_score, weight,)
)
el.drop_tree()
elif el.text_content().count(",") < 10:
counts = {}
for kind in ["p", "img", "li", "a", "embed", "input"]:
counts[kind] = len(el.findall(".//%s" % kind))
counts["li"] -= 100
counts["input"] -= len(el.findall('.//input[@type="hidden"]'))
# Count the text length excluding any surrounding whitespace
content_length = text_length(el)
@@ -459,161 +547,210 @@ class Document:
parent_node = el.getparent()
if parent_node is not None:
if parent_node in candidates:
content_score = candidates[parent_node]["content_score"]
else:
content_score = 0
# if parent_node is not None:
# pweight = self.class_weight(parent_node) + content_score
# pname = describe(parent_node)
# else:
# pweight = 0
# pname = "no parent"
to_remove = False
reason = ""
#if el.tag == 'div' and counts["img"] >= 1:
# if el.tag == 'div' and counts["img"] >= 1:
# continue
if counts["p"] and counts["img"] > counts["p"]:
if counts["p"] and counts["img"] > 1 + counts["p"] * 1.3:
reason = "too many images (%s)" % counts["img"]
to_remove = True
elif counts["li"] > counts["p"] and tag != "ul" and tag != "ol":
elif counts["li"] > counts["p"] and tag not in ("ol", "ul"):
reason = "more <li>s than <p>s"
to_remove = True
elif counts["input"] > (counts["p"] / 3):
reason = "less than 3x <p>s than <input>s"
to_remove = True
elif content_length < (MIN_LEN) and (counts["img"] == 0 or counts["img"] > 2):
reason = "too short content length %s without a single image" % content_length
elif content_length < MIN_LEN and counts["img"] == 0:
reason = (
"too short content length %s without a single image"
% content_length
)
to_remove = True
elif content_length < MIN_LEN and counts["img"] > 2:
reason = (
"too short content length %s and too many images"
% content_length
)
to_remove = True
elif weight < 25 and link_density > 0.2:
reason = "too many links %.3f for its weight %s" % (
link_density, weight)
to_remove = True
reason = "too many links %.3f for its weight %s" % (
link_density,
weight,
)
to_remove = True
elif weight >= 25 and link_density > 0.5:
reason = "too many links %.3f for its weight %s" % (
link_density,
weight,
)
to_remove = True
elif (counts["embed"] == 1 and content_length < 75) or counts["embed"] > 1:
reason = "<embed>s with too short content length, or too many <embed>s"
elif (counts["embed"] == 1 and content_length < 75) or counts[
"embed"
] > 1:
reason = (
"<embed>s with too short content length, or too many <embed>s"
)
to_remove = True
elif not content_length:
reason = "no content"
to_remove = True
# if el.tag == 'div' and counts['img'] >= 1 and to_remove:
# imgs = el.findall('.//img')
# valid_img = False
# log.debug(tounicode(el))
# for img in imgs:
#
# height = img.get('height')
# text_length = img.get('text_length')
# log.debug ("height %s text_length %s" %(repr(height), repr(text_length)))
# if to_int(height) >= 100 or to_int(text_length) >= 100:
# valid_img = True
# log.debug("valid image" + tounicode(img))
# break
# if valid_img:
# to_remove = False
# log.debug("Allowing %s" %el.text_content())
# for desnode in self.tags(el, "table", "ul", "div"):
# allowed[desnode] = True
# find x non empty preceding and succeeding siblings
i, j = 0, 0
x = 1
siblings = []
for sib in el.itersiblings():
# log.debug(sib.text_content())
sib_content_length = text_length(sib)
if sib_content_length:
i += 1
siblings.append(sib_content_length)
if i == x:
break
for sib in el.itersiblings(preceding=True):
# log.debug(sib.text_content())
sib_content_length = text_length(sib)
if sib_content_length:
j += 1
siblings.append(sib_content_length)
if j == x:
break
# log.debug(str_(siblings))
if siblings and sum(siblings) > 1000:
to_remove = False
self.debug("Allowing %s" % describe(el))
for desnode in self.tags(el, "table", "ul", "div"):
log.debug("Allowing %s" % describe(el))
for desnode in self.tags(el, "table", "ul", "div", "section"):
allowed[desnode] = True
if to_remove:
self.debug("Cleaned %6.3f %s with weight %s cause it has %s." %
(content_score, describe(el), weight, reason))
#print tounicode(el)
#self.debug("pname %s pweight %.3f" %(pname, pweight))
log.debug(
"Removed %6.3f %s with weight %s cause it has %s."
% (content_score, describe(el), weight, reason)
)
# print tounicode(el)
# log.debug("pname %s pweight %.3f" %(pname, pweight))
el.drop_tree()
else:
log.debug(
"Not removing %s of length %s: %s"
% (describe(el), content_length, text_content(el))
)
self.html = node
return self.get_clean_html()
class HashableElement():
def __init__(self, node):
self.node = node
self._path = None
def _get_path(self):
if self._path is None:
reverse_path = []
node = self.node
while node is not None:
node_id = (node.tag, tuple(node.attrib.items()), node.text)
reverse_path.append(node_id)
node = node.getparent()
self._path = tuple(reverse_path)
return self._path
path = property(_get_path)
def __hash__(self):
return hash(self.path)
def __eq__(self, other):
return self.path == other.path
def __getattr__(self, tag):
return getattr(self.node, tag)
def main():
VERBOSITY = {1: logging.WARNING, 2: logging.INFO, 3: logging.DEBUG}
from optparse import OptionParser
parser = OptionParser(usage="%prog: [options] [file]")
parser.add_option("-v", "--verbose", action="count", default=0)
parser.add_option(
"-b", "--browser", default=None, action="store_true", help="open in browser"
)
parser.add_option(
"-l", "--log", default=None, help="save logs into file (appended)"
)
parser.add_option(
"-u", "--url", default=None, help="use URL instead of a local file"
)
parser.add_option("-x", "--xpath", default=None, help="add original xpath")
parser.add_option(
"-p",
"--positive-keywords",
default=None,
help="positive keywords (comma-separated)",
action="store",
)
parser.add_option(
"-n",
"--negative-keywords",
default=None,
help="negative keywords (comma-separated)",
action="store",
)
(options, args) = parser.parse_args()
if options.verbose:
logging.basicConfig(
level=VERBOSITY[options.verbose],
filename=options.log,
format="%(asctime)s: %(levelname)s: %(message)s (at %(filename)s: %(lineno)d)",
)
if not (len(args) == 1 or options.url):
parser.print_help()
sys.exit(1)
file = None
if options.url:
headers = {"User-Agent": "Mozilla/5.0"}
if sys.version_info[0] == 3:
import urllib.request, urllib.parse, urllib.error
request = urllib.request.Request(options.url, None, headers)
file = urllib.request.urlopen(request)
else:
import urllib2
request = urllib2.Request(options.url, None, headers)
file = urllib2.urlopen(request)
else:
file = open(args[0], "rt")
try:
doc = Document(
file.read(),
url=options.url,
positive_keywords=options.positive_keywords,
negative_keywords=options.negative_keywords,
)
if options.browser:
from .browser import open_in_browser
result = "<h2>" + doc.short_title() + "</h2><br/>" + doc.summary()
open_in_browser(result)
else:
enc = (
sys.__stdout__.encoding or "utf-8"
) # XXX: this hack could not always work, better to set PYTHONIOENCODING
result = "Title:" + doc.short_title() + "\n" + doc.summary()
if sys.version_info[0] == 3:
print(result)
else:
print(result.encode(enc, "replace"))
finally:
file.close()
if __name__ == "__main__":
main()
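# Example invocation (URL assumed; module path follows this repository's layout):
#
#     python -m readability.readability -u https://example.com/article -p "news-item,block" -v
#
# -u fetches the page instead of reading a local file, -p/-n pass comma-separated
# positive/negative keywords, and repeating -v increases the log verbosity.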

@@ -0,0 +1,6 @@
lxml
chardet
nose
pep8
coverage
timeout_decorator

@@ -1,34 +1,78 @@
#!/usr/bin/env python
from setuptools import setup, find_packages
from __future__ import print_function
import codecs
import os
import re
from setuptools import setup
import sys
lxml_requirement = "lxml"
if sys.platform == "darwin":
import platform
mac_ver = platform.mac_ver()[0]
mac_major, mac_minor = mac_ver.split('.')[:2]
if int(mac_major) == 10 and int(mac_minor) < 9:
print("Using lxml<2.4")
lxml_requirement = "lxml<2.4"
test_deps = [
# Test timeouts
"timeout_decorator",
]
extras = {
"test": test_deps,
}
# Adapted from https://github.com/pypa/pip/blob/master/setup.py
def find_version(*file_paths):
here = os.path.abspath(os.path.dirname(__file__))
# Intentionally *not* adding an encoding option to open, See:
# https://github.com/pypa/virtualenv/issues/201#issuecomment-3145690
with codecs.open(os.path.join(here, *file_paths), "r") as fp:
version_file = fp.read()
version_match = re.search(
r"^__version__ = ['\"]([^'\"]*)['\"]", version_file, re.M,
)
if version_match:
return version_match.group(1)
raise RuntimeError("Unable to find version string.")
setup(
name="readability-lxml",
version="0.3.0.3",
version=find_version("readability", "__init__.py"),
author="Yuri Baburov",
author_email="burchik@gmail.com",
description="fast python port of arc90's readability tool",
test_suite = "tests.test_article_only",
long_description=open("README").read(),
description="fast html to text parser (article readability tool) with python 3 support",
test_suite="tests.test_article_only",
long_description=open("README.rst").read(),
long_description_content_type='text/x-rst',
license="Apache License 2.0",
url="http://github.com/buriy/python-readability",
packages=["readability", "readability.compat"],
install_requires=["chardet", lxml_requirement, "cssselect"],
tests_require=test_deps,
extras_require=extras,
classifiers=[
"Environment :: Web Environment",
"Intended Audience :: Developers",
"Operating System :: OS Independent",
"Topic :: Text Processing :: Indexing",
"Topic :: Utilities",
"Topic :: Internet",
"Topic :: Software Development :: Libraries :: Python Modules",
"Programming Language :: Python",
"Programming Language :: Python :: 2",
"Programming Language :: Python :: 2.7",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.5",
"Programming Language :: Python :: 3.6",
"Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
],
)

File diff suppressed because one or more lines are too long

@@ -0,0 +1,60 @@
<!DOCTYPE html>
<html lang="ja">
<body>
<div>
<article>
<div>
草枕
夏目漱石
 山路を登りながら、こう考えた。
 智に働けば角が立つ。情に棹させば流される。意地を通せば窮屈だ。とかくに人の世は住みにくい。
 住みにくさが高じると、安い所へ引き越したくなる。どこへ越しても住みにくいと悟った時、詩が生れて、画が出来る。
 人の世を作ったものは神でもなければ鬼でもない。やはり向う三軒両隣りにちらちらするただの人である。ただの人が作った人の世が住みにくいからとて、越す国はあるまい。あれば人でなしの国へ行くばかりだ。人でなしの国は人の世よりもなお住みにくかろう。
 越す事のならぬ世が住みにくければ、住みにくい所をどれほどか、寛容て、束の間の命を、束の間でも住みよくせねばならぬ。ここに詩人という天職が出来て、ここに画家という使命が降る。あらゆる芸術の士は人の世を長閑にし、人の心を豊かにするが故に尊とい。
 住みにくき世から、住みにくき煩いを引き抜いて、ありがたい世界をまのあたりに写すのが詩である、画である。あるは音楽と彫刻である。こまかに云えば写さないでもよい。ただまのあたりに見れば、そこに詩も生き、歌も湧く。着想を紙に落さぬとも※(「王膠のつくり」、第3水準1-88-22)鏘の音は胸裏に起る。丹青は画架に向って塗抹せんでも五彩の絢爛は自から心眼に映る。ただおのが住む世を、かく観じ得て、霊台方寸のカメラに澆季溷濁の俗界を清くうららかに収め得れば足る。この故に無声の詩人には一句なく、無色の画家には尺※(「糸賺のつくり」、第3水準1-90-17)なきも、かく人世を観じ得るの点において、かく煩悩を解脱するの点において、かく清浄界に出入し得るの点において、またこの不同不二の乾坤を建立し得るの点において、我利私慾の覊絆を掃蕩するの点において、――千金の子よりも、万乗の君よりも、あらゆる俗界の寵児よりも幸福である。
 世に住むこと二十年にして、住むに甲斐ある世と知った。二十五年にして明暗は表裏のごとく、日のあたる所にはきっと影がさすと悟った。三十の今日はこう思うている。――喜びの深きとき憂いよいよ深く、楽みの大いなるほど苦しみも大きい。これを切り放そうとすると身が持てぬ。片づけようとすれば世が立たぬ。金は大事だ、大事なものが殖えれば寝る間も心配だろう。恋はうれしい、嬉しい恋が積もれば、恋をせぬ昔がかえって恋しかろ。閣僚の肩は数百万人の足を支えている。背中には重い天下がおぶさっている。うまい物も食わねば惜しい。少し食えば飽き足らぬ。存分食えばあとが不愉快だ。……
 余の考がここまで漂流して来た時に、余の右足は突然坐りのわるい角石の端を踏み損くなった。平衡を保つために、すわやと前に飛び出した左足が、仕損じの埋め合せをすると共に、余の腰は具合よく方三尺ほどな岩の上に卸りた。肩にかけた絵の具箱が腋の下から躍り出しただけで、幸いと何の事もなかった。
 立ち上がる時に向うを見ると、路から左の方にバケツを伏せたような峰が聳えている。杉か檜か分からないが根元から頂きまでことごとく蒼黒い中に、山桜が薄赤くだんだらに棚引いて、続ぎ目が確と見えぬくらい靄が濃い。少し手前に禿山が一つ、群をぬきんでて眉に逼る。禿げた側面は巨人の斧で削り去ったか、鋭どき平面をやけに谷の底に埋めている。天辺に一本見えるのは赤松だろう。枝の間の空さえ判然している。行く手は二丁ほどで切れているが、高い所から赤い毛布が動いて来るのを見ると、登ればあすこへ出るのだろう。路はすこぶる難義だ。
 土をならすだけならさほど手間も入るまいが、土の中には大きな石がある。土は平らにしても石は平らにならぬ。石は切り砕いても、岩は始末がつかぬ。掘崩した土の上に悠然と峙って、吾らのために道を譲る景色はない。向うで聞かぬ上は乗り越すか、廻らなければならん。巌のない所でさえ歩るきよくはない。左右が高くって、中心が窪んで、まるで一間幅を三角に穿って、その頂点が真中を貫いていると評してもよい。路を行くと云わんより川底を渉ると云う方が適当だ。固より急ぐ旅でないから、ぶらぶらと七曲りへかかる。
 たちまち足の下で雲雀の声がし出した。谷を見下したが、どこで鳴いてるか影も形も見えぬ。ただ声だけが明らかに聞える。せっせと忙しく、絶間なく鳴いている。方幾里の空気が一面に蚤に刺されていたたまれないような気がする。あの鳥の鳴く音には瞬時の余裕もない。のどかな春の日を鳴き尽くし、鳴きあかし、また鳴き暮らさなければ気が済まんと見える。その上どこまでも登って行く、いつまでも登って行く。雲雀はきっと雲の中で死ぬに相違ない。登り詰めた揚句は、流れて雲に入って、漂うているうちに形は消えてなくなって、ただ声だけが空の裡に残るのかも知れない。
 巌角を鋭どく廻って、按摩なら真逆様に落つるところを、際どく右へ切れて、横に見下すと、菜の花が一面に見える。雲雀はあすこへ落ちるのかと思った。いいや、あの黄金の原から飛び上がってくるのかと思った。次には落ちる雲雀と、上る雲雀が十文字にすれ違うのかと思った。最後に、落ちる時も、上る時も、また十文字に擦れ違うときにも元気よく鳴きつづけるだろうと思った。
 春は眠くなる。猫は鼠を捕る事を忘れ、人間は借金のある事を忘れる。時には自分の魂の居所さえ忘れて正体なくなる。ただ菜の花を遠く望んだときに眼が醒める。雲雀の声を聞いたときに魂のありかが判然する。雲雀の鳴くのは口で鳴くのではない、魂全体が鳴くのだ。魂の活動が声にあらわれたもののうちで、あれほど元気のあるものはない。ああ愉快だ。こう思って、こう愉快になるのが詩である。
 たちまちシェレーの雲雀の詩を思い出して、口のうちで覚えたところだけ暗誦して見たが、覚えているところは二三句しかなかった。その二三句のなかにこんなのがある。
  We look before and after
    And pine for what is not:
  Our sincerest laughter
    With some pain is fraught;
Our sweetest songs are those that tell of saddest thought.
「前をみては、後えを見ては、物欲しと、あこがるるかなわれ。腹からの、笑といえど、苦しみの、そこにあるべし。うつくしき、極みの歌に、悲しさの、極みの想、籠るとぞ知れ」
 なるほどいくら詩人が幸福でも、あの雲雀のように思い切って、一心不乱に、前後を忘却して、わが喜びを歌う訳には行くまい。西洋の詩は無論の事、支那の詩にも、よく万斛の愁などと云う字がある。詩人だから万斛で素人なら一合で済むかも知れぬ。して見ると詩人は常の人よりも苦労性で、凡骨の倍以上に神経が鋭敏なのかも知れん。超俗の喜びもあろうが、無量の悲も多かろう。そんならば詩人になるのも考え物だ。
 しばらくは路が平で、右は雑木山、左は菜の花の見つづけである。足の下に時々蒲公英を踏みつける。鋸のような葉が遠慮なく四方へのして真中に黄色な珠を擁護している。菜の花に気をとられて、踏みつけたあとで、気の毒な事をしたと、振り向いて見ると、黄色な珠は依然として鋸のなかに鎮座している。呑気なものだ。また考えをつづける。
 詩人に憂はつきものかも知れないが、あの雲雀を聞く心持になれば微塵の苦もない。菜の花を見ても、ただうれしくて胸が躍るばかりだ。蒲公英もその通り、桜も――桜はいつか見えなくなった。こう山の中へ来て自然の景物に接すれば、見るものも聞くものも面白い。面白いだけで別段の苦しみも起らぬ。起るとすれば足が草臥れて、旨いものが食べられぬくらいの事だろう。
 しかし苦しみのないのはなぜだろう。ただこの景色を一幅の画として観、一巻の詩として読むからである。画であり詩である以上は地面を貰って、開拓する気にもならねば、鉄道をかけて一儲けする了見も起らぬ。ただこの景色が――腹の足しにもならぬ、月給の補いにもならぬこの景色が景色としてのみ、余が心を楽ませつつあるから苦労も心配も伴わぬのだろう。自然の力はここにおいて尊とい。吾人の性情を瞬刻に陶冶して醇乎として醇なる詩境に入らしむるのは自然である。
 恋はうつくしかろ、孝もうつくしかろ、忠君愛国も結構だろう。しかし自身がその局に当れば利害の旋風に捲き込まれて、うつくしき事にも、結構な事にも、目は眩んでしまう。したがってどこに詩があるか自身には解しかねる。
 これがわかるためには、わかるだけの余裕のある第三者の地位に立たねばならぬ。三者の地位に立てばこそ芝居は観て面白い。小説も見て面白い。芝居を見て面白い人も、小説を読んで面白い人も、自己の利害は棚へ上げている。見たり読んだりする間だけは詩人である。
 それすら、普通の芝居や小説では人情を免かれぬ。苦しんだり、怒ったり、騒いだり、泣いたりする。見るものもいつかその中に同化して苦しんだり、怒ったり、騒いだり、泣いたりする。取柄は利慾が交らぬと云う点に存するかも知れぬが、交らぬだけにその他の情緒は常よりは余計に活動するだろう。それが嫌だ。
 苦しんだり、怒ったり、騒いだり、泣いたりは人の世につきものだ。余も三十年の間それを仕通して、飽々した。飽き飽きした上に芝居や小説で同じ刺激を繰り返しては大変だ。余が欲する詩はそんな世間的の人情を鼓舞するようなものではない。俗念を放棄して、しばらくでも塵界を離れた心持ちになれる詩である。いくら傑作でも人情を離れた芝居はない、理非を絶した小説は少かろう。どこまでも世間を出る事が出来ぬのが彼らの特色である。ことに西洋の詩になると、人事が根本になるからいわゆる詩歌の純粋なるものもこの境を解脱する事を知らぬ。どこまでも同情だとか、愛だとか、正義だとか、自由だとか、浮世の勧工場にあるものだけで用を弁じている。いくら詩的になっても地面の上を馳けてあるいて、銭の勘定を忘れるひまがない。シェレーが雲雀を聞いて嘆息したのも無理はない。
 うれしい事に東洋の詩歌はそこを解脱したのがある。採菊東籬下、悠然見南山。ただそれぎりの裏に暑苦しい世の中をまるで忘れた光景が出てくる。垣の向うに隣りの娘が覗いてる訳でもなければ、南山に親友が奉職している次第でもない。超然と出世間的に利害損得の汗を流し去った心持ちになれる。独坐幽篁裏、弾琴復長嘯、深林人不知、明月来相照。ただ二十字のうちに優に別乾坤を建立している。この乾坤の功徳は「不如帰」や「金色夜叉」の功徳ではない。汽船、汽車、権利、義務、道徳、礼義で疲れ果てた後に、すべてを忘却してぐっすり寝込むような功徳である。
 二十世紀に睡眠が必要ならば、二十世紀にこの出世間的の詩味は大切である。惜しい事に今の詩を作る人も、詩を読む人もみんな、西洋人にかぶれているから、わざわざ呑気な扁舟を泛べてこの桃源に溯るものはないようだ。余は固より詩人を職業にしておらんから、王維や淵明の境界を今の世に布教して広げようと云う心掛も何もない。ただ自分にはこう云う感興が演芸会よりも舞踏会よりも薬になるように思われる。ファウストよりも、ハムレットよりもありがたく考えられる。こうやって、ただ一人絵の具箱と三脚几を担いで春の山路をのそのそあるくのも全くこれがためである。淵明、王維の詩境を直接に自然から吸収して、すこしの間でも非人情の天地に逍遥したいからの願。一つの酔興だ。
 もちろん人間の一分子だから、いくら好きでも、非人情はそう長く続く訳には行かぬ。淵明だって年が年中南山を見詰めていたのでもあるまいし、王維も好んで竹藪の中に蚊帳を釣らずに寝た男でもなかろう。やはり余った菊は花屋へ売りこかして、生えた筍は八百屋へ払い下げたものと思う。こう云う余もその通り。いくら雲雀と菜の花が気に入ったって、山のなかへ野宿するほど非人情が募ってはおらん。こんな所でも人間に逢う。じんじん端折りの頬冠りや、赤い腰巻の姉さんや、時には人間より顔の長い馬にまで逢う。百万本の檜に取り囲まれて、海面を抜く何百尺かの空気を呑んだり吐いたりしても、人の臭いはなかなか取れない。それどころか、山を越えて落ちつく先の、今宵の宿は那古井の温泉場だ。
 ただ、物は見様でどうでもなる。レオナルド・ダ・ヴィンチが弟子に告げた言に、あの鐘の音を聞け、鐘は一つだが、音はどうとも聞かれるとある。一人の男、一人の女も見様次第でいかようとも見立てがつく。どうせ非人情をしに出掛けた旅だから、そのつもりで人間を見たら、浮世小路の何軒目に狭苦しく暮した時とは違うだろう。よし全く人情を離れる事が出来んでも、せめて御能拝見の時くらいは淡い心持ちにはなれそうなものだ。能にも人情はある。七騎落でも、墨田川でも泣かぬとは保証が出来ん。しかしあれは情三分芸七分で見せるわざだ。我らが能から享けるありがた味は下界の人情をよくそのままに写す手際から出てくるのではない。そのままの上へ芸術という着物を何枚も着せて、世の中にあるまじき悠長な振舞をするからである。
 しばらくこの旅中に起る出来事と、旅中に出逢う人間を能の仕組と能役者の所作に見立てたらどうだろう。まるで人情を棄てる訳には行くまいが、根が詩的に出来た旅だから、非人情のやりついでに、なるべく節倹してそこまでは漕ぎつけたいものだ。南山や幽篁とは性の違ったものに相違ないし、また雲雀や菜の花といっしょにする事も出来まいが、なるべくこれに近づけて、近づけ得る限りは同じ観察点から人間を視てみたい。芭蕉と云う男は枕元へ馬が尿するのをさえ雅な事と見立てて発句にした。余もこれから逢う人物を――百姓も、町人も、村役場の書記も、爺さんも婆さんも――ことごとく大自然の点景として描き出されたものと仮定して取こなして見よう。もっとも画中の人物と違って、彼らはおのがじし勝手な真似をするだろう。しかし普通の小説家のようにその勝手な真似の根本を探ぐって、心理作用に立ち入ったり、人事葛藤の詮議立てをしては俗になる。動いても構わない。画中の人間が動くと見れば差し支ない。画中の人物はどう動いても平面以外に出られるものではない。平面以外に飛び出して、立方的に働くと思えばこそ、こっちと衝突したり、利害の交渉が起ったりして面倒になる。面倒になればなるほど美的に見ている訳に行かなくなる。これから逢う人間には超然と遠き上から見物する気で、人情の電気がむやみに双方で起らないようにする。そうすれば相手がいくら働いても、こちらの懐には容易に飛び込めない訳だから、つまりは画の前へ立って、画中の人物が画面の中をあちらこちらと騒ぎ廻るのを見るのと同じ訳になる。間三尺も隔てていれば落ちついて見られる。あぶな気なしに見られる。言を換えて云えば、利害に気を奪われないから、全力を挙げて彼らの動作を芸術の方面から観察する事が出来る。余念もなく美か美でないかと鑒識する事が出来る。
 ここまで決心をした時、空があやしくなって来た。煮え切れない雲が、頭の上へ靠垂れ懸っていたと思ったが、いつのまにか、崩れ出して、四方はただ雲の海かと怪しまれる中から、しとしとと春の雨が降り出した。菜の花は疾くに通り過して、今は山と山の間を行くのだが、雨の糸が濃かでほとんど霧を欺くくらいだから、隔たりはどれほどかわからぬ。時々風が来て、高い雲を吹き払うとき、薄黒い山の背が右手に見える事がある。何でも谷一つ隔てて向うが脈の走っている所らしい。左はすぐ山の裾と見える。深く罩める雨の奥から松らしいものが、ちょくちょく顔を出す。出すかと思うと、隠れる。雨が動くのか、木が動くのか、夢が動くのか、何となく不思議な心持ちだ。
 路は存外広くなって、かつ平だから、あるくに骨は折れんが、雨具の用意がないので急ぐ。帽子から雨垂れがぽたりぽたりと落つる頃、五六間先きから、鈴の音がして、黒い中から、馬子がふうとあらわれた。
「ここらに休む所はないかね」
「もう十五丁行くと茶屋がありますよ。だいぶ濡れたね」
 まだ十五丁かと、振り向いているうちに、馬子の姿は影画のように雨につつまれて、またふうと消えた。
 糠のように見えた粒は次第に太く長くなって、今は一筋ごとに風に捲かれる様までが目に入る。羽織はとくに濡れ尽して肌着に浸み込んだ水が、身体の温度で生暖く感ぜられる。気持がわるいから、帽を傾けて、すたすた歩行く。
 茫々たる薄墨色の世界を、幾条の銀箭が斜めに走るなかを、ひたぶるに濡れて行くわれを、われならぬ人の姿と思えば、詩にもなる、句にも咏まれる。有体なる己れを忘れ尽して純客観に眼をつくる時、始めてわれは画中の人物として、自然の景物と美しき調和を保つ。ただ降る雨の心苦しくて、踏む足の疲れたるを気に掛ける瞬間に、われはすでに詩中の人にもあらず、画裡の人にもあらず。依然として市井の一豎子に過ぎぬ。雲煙飛動の趣も眼に入らぬ。落花啼鳥の情けも心に浮ばぬ。蕭々として独り春山を行く吾の、いかに美しきかはなおさらに解せぬ。初めは帽を傾けて歩行た。後にはただ足の甲のみを見詰めてあるいた。終りには肩をすぼめて、恐る恐る歩行た。雨は満目の樹梢を揺かして四方より孤客に逼る。非人情がちと強過ぎたようだ。
</div>
</article>
</div>
<div>
<a href="https://www.aozora.gr.jp/cards/000148/card776.html">青空文庫 - 図書カードNo.776</a>
</div>
</html>

@@ -2,14 +2,17 @@ import os
import unittest
from readability import Document
import timeout_decorator
SAMPLES = os.path.join(os.path.dirname(__file__), "samples")
def load_sample(filename):
"""Helper to get the content out of the sample files"""
with open(os.path.join(SAMPLES, filename)) as f:
html = f.read()
return html
class TestArticleOnly(unittest.TestCase):
@@ -23,17 +26,101 @@ class TestArticleOnly(unittest.TestCase):
def test_si_sample(self):
"""Using the si sample, load article with only opening body element"""
sample = load_sample("si-game.sample.html")
doc = Document(
sample,
url="http://sportsillustrated.cnn.com/baseball/mlb/gameflash/2012/04/16/40630_preview.html",
)
res = doc.summary()
self.assertEqual("<html><body><div><div class", res[0:27])
def test_si_sample_html_partial(self):
"""Using the si sample, make sure we can get the article alone."""
sample = load_sample("si-game.sample.html")
doc = Document(
sample,
url="http://sportsillustrated.cnn.com/baseball/mlb/gameflash/2012/04/16/40630_preview.html",
)
res = doc.summary(html_partial=True)
self.assertEqual('<div><div class="', res[0:17])
def test_too_many_images_sample_html_partial(self):
"""Using the too-many-images sample, make sure we still get the article."""
sample = load_sample("too-many-images.sample.html")
doc = Document(sample)
res = doc.summary(html_partial=True)
self.assertEqual('<div><div class="post-body', res[0:26])
def test_wrong_link_issue_49(self):
"""We shouldn't break on bad HTML."""
sample = load_sample("the-hurricane-rubin-carter-denzel-washington.html")
doc = Document(sample)
res = doc.summary(html_partial=True)
self.assertEqual('<div><div class="content__article-body ', res[0:39])
def test_best_elem_is_root_and_passing(self):
sample = (
'<html class="article" id="body">'
" <body>"
" <p>1234567890123456789012345</p>"
" </body>"
"</html>"
)
doc = Document(sample)
doc.summary()
def test_correct_cleanup(self):
sample = """
<html>
<body>
<section>test section</section>
<article class="">
<p>Lot of text here.</p>
<div id="advertisement"><a href="link">Ad</a></div>
<p>More text is written here, and contains punctuation and dots.</p>
</article>
<aside id="comment1"/>
<div id="comment2">
<a href="asd">spam</a>
<a href="asd">spam</a>
<a href="asd">spam</a>
</div>
<div id="comment3"/>
<aside id="comment4">A small comment.</aside>
<div id="comment5"><p>The comment is also helpful, but it's
still not the correct item to be extracted.</p>
<p>It's even longer than the article itself!"</p></div>
</body>
</html>
"""
doc = Document(sample)
s = doc.summary()
# print(s)
assert "punctuation" in s
assert not "comment" in s
assert not "aside" in s
# Many spaces make some regexes run forever
@timeout_decorator.timeout(seconds=3, use_signals=False)
def test_many_repeated_spaces(self):
long_space = " " * 1000000
sample = "<html><body><p>foo" + long_space + "</p></body></html>"
doc = Document(sample)
s = doc.summary()
assert "foo" in s
def test_not_self_closing(self):
sample = '<h2><a href="#"></a>foobar</h2>'
doc = Document(sample)
assert (
'<body id="readabilityBody"><h2><a href="#"></a>foobar</h2></body>'
== doc.summary()
)
def test_utf8_kanji(self):
"""Using the UTF-8 kanji sample, load article which is written in kanji"""
sample = load_sample("utf-8-kanji.sample.html")
doc = Document(sample)
res = doc.summary()

@@ -0,0 +1,33 @@
# Tox (http://tox.testrun.org/) is a tool for running tests
# in multiple virtualenvs. This configuration file will run the
# test suite on all supported python versions. To use it, "pip install tox"
# and then run "tox" from this directory.
[tox]
envlist =
py{27,35,36,37,38,py,py3}, doc
skip_missing_interpreters =
True
[testenv]
deps =
pytest
doc: sphinx
doc: sphinx_rtd_theme
doc: recommonmark
# This creates the virtual envs with --site-packages so packages that are
# already installed will be reused. This is especially useful on Windows:
# instead of compiling lxml locally (which in turn requires a compiler and
# the build dependencies), you can download a prebuilt wheel from
# http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml and install it via
# $PYTHONDIR\Scripts\pip.exe install *.whl
sitepackages=
True
commands =
pip install -r requirements.txt -e ".[test]"
py.test
[testenv:doc]
commands =
python setup.py build_sphinx
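# Typical local usage (assuming tox is installed): plain "tox" runs every environment
# in the envlist above, "tox -e py38" runs a single interpreter, and "tox -e doc"
# builds the Sphinx documentation.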