python-readability

Commit Graph

Author	SHA1	Message	Date
Yuri Baburov	638f73f6a2	Fix for #52 : <input type="hidden"> are not counted any more for "form removal" heuristic.	10 years ago
Mark Perdomo	3a43a3fe7e	Added code to check declared encodings first and check them from kennethreitz/requests/utils.py. Also I added some superset encodings I have found in Chinese pages that are mishandled by chardet/character declarations.	10 years ago
Yuri Baburov	d8595b7103	Quickfix for #41	11 years ago
Yuri Baburov	318f25c577	Minor fix in encoding guessing. Claiming it v0.3.0.1	11 years ago
Yuri Baburov	08658d1d31	Released v 0.3, and uploaded to the pypi.	11 years ago
hush-hush	e2e78e4d55	Make lxml clean tree available for user modifications.	12 years ago
Drew Vogel	fdba8d9e11	Added check on title.text to avoid a TypeError on None.	12 years ago
Zach Denton	0843d9cdf2	Explicitly check if title is None. Fixes #22, which caused all titles to be blank.	12 years ago
Andrey Popp	95852d5c18	readability.htmls: some docs do not have title elem	12 years ago
Richard Harding	e9a5cbfe7f	Remove pdb dummy	12 years ago
Richard Harding	f1a79fb8f8	Update to make sure we don't drop the html tag when ditching elements	12 years ago
Richard Harding	46f0302ebc	rename the document_only flag to html_partial	12 years ago
Richard Harding	a46dc14251	Try to pep8 all the things but give up when I got close.	12 years ago
Richard Harding	5a98e2c1b8	Correct appending and allow for document only - Fix the appending of siblings to the correct nested element - Add a document only flag so that you can get a dom tree you can nest yourself without html/body tags.	12 years ago
Richard Harding	edccec5d3b	Work on why we have an empty <body/> tag - Seems to come because the sanitizer ends up with two nodes, not one. The first is an empty body, the second is the article div. - Fix up the tabs so we can work with the file. Needs lots of pep8 love. - Implement an initial hack that at least gets it working atm. - Start to add test cases, sample html files we can test against, etc.	12 years ago
Jan Weiß	3cdc3d67af	Adding comment about oversight in transform_misused_divs_into_paragraphs().	12 years ago
Jan Weiß	960f885edf	Continue early in remove_unlikely_candidates() in case there is neither a class nor an id attribute.	12 years ago
Jan Weiß	6b3961cd30	Fixing gap in node_length coverage.	12 years ago
facundo	bb93ae1e5f	fixed a small issue on the Document score_paragraphs method	12 years ago
Yuri Baburov	11c4d95411	Fixed indentation, encoding issue and README bug. Thanks to Greg Jastrab. Bump version to 0.2.3	13 years ago
Yuri Baburov	61715dca0a	Bump to version 0.2	13 years ago
Yuri Baburov	c2ec1d1c38	Sorted out unicode issues, thanks to Lee Semel.	13 years ago
Yuri Baburov	97ba2a0369	Debug utilities.	13 years ago
Lee Semel	f3d0a8d842	Allow passing unicode objects	13 years ago
Jerry Charumilind	8c1adc5141	Expose Document in readability package	13 years ago
Yuri Baburov	43c34bacc1	Renamed encodings to encoding to avoid conflicts with system module.	13 years ago
Yuri Baburov	f55f16baa1	Updated scoring algorithm to match readability.js v1.7.1	13 years ago
Yuri Baburov	96f476181c	Improved title shortener method, and added it to the Document class.	13 years ago
Yuri Baburov	dada82099b	Moved to lxml (based on decruft version); better encoding recognition.	13 years ago
gfxmonk	2b6a2d3db4	removing empty paragraphs is not very useful, and can break some (stupid) websites	14 years ago
gfxmonk	1d862a00c3	fixed bug where only immediate text was being considered for weights, instead of all nested text	14 years ago
gfxmonk	0eacd959a4	failsafe parsing and more logging	14 years ago
gfxmonk	87ad057706	unicode, dammit!	14 years ago
gfxmonk	a224c5b759	minor	14 years ago
gfxmonk	f73b5f05c4	split out into content and summary methods	14 years ago
gfxmonk	c952f421b7	clean up content method and debug	14 years ago
gfxmonk	c0ca60ee26	use a more lenient parser	14 years ago
gfxmonk	ad3d52ade4	initial	14 years ago

88 Commits (master)