Commit Graph

143 Commits (0.3.0.dev)
 

Author SHA1 Message Date
Richard Harding d708744822 Clean up tests/changes to merge into 0.3.0.dev 12 years ago
Jerry Charumilind eefb8e1125 Implement duplicate page detection
This adds detection of duplicate pages to avoid adding duplicate pages to a
multi-page article.  It adds a simple unit test and regenerates the nytimes
regression test with the new, and more correct, result.  Previously, we were
including page 2 again after page 5.

Conflicts:

	src/readability_lxml/readability.py
12 years ago
Richard Harding c931a80ba8 Tweak tests post merging 12 years ago
Jerry Charumilind 883a02ad5d Add a regression for a multi-page nytimes article
It does not quite work yet, as we wrongly pull in page 2 at the end of the
article due to yet-to-be-implemented duplicate avoidance.

Conflicts:

	src/readability_lxml/readability.py
	src/tests/gen_test.py
	src/tests/regression.py
12 years ago
Richard Harding cfc6f94634 Fix test for the multipage test with actual content 12 years ago
Jerry Charumilind 816c66482e Improve unit test for basic multi-page handling
The test now actually asserts something instead of just printing some stuff out
for manual inspection.

Conflicts:

	src/readability_lxml/readability.py
12 years ago
Richard Harding 99d5fc0a87 Update for merge with Jerry Checkpoint multi-page readability work 12 years ago
Jerry Charumilind f02fe79840 Checkpoint multi-page readability work
Restructured code to better support multi-page readability.  Improved tests.

Conflicts:

	src/readability_lxml/readability.py
	src/tests/regression.py
12 years ago
Richard Harding 5cb4b8b8c0 Tweaks after the code reorg 12 years ago
Jerry Charumilind f8315d011c Checkpoint multi-page readability work
Restructured code to better support multi-page readability.  Improved tests.

Rick:
This generally works and the tests pass, but there are some broken cases with
the multipage bits that are causing me grief. It does pass the one test case.
I made the multipage an option vs doing it by default. The more I change the
code the harder future merges will be, but man it needs some cleanup, reorg,
and comments.

Conflicts:

	src/readability_lxml/readability.py
	src/tests/regression.py
12 years ago
Richard Harding 99efa5c10b PEP8 again ... 12 years ago
Richard Harding a012fd2362 urlfetch is in src 12 years ago
Jerry Charumilind 3fe416a5d1 Refactor code for easier testing
Conflicts:

	src/readability_lxml/readability.py
12 years ago
Richard Harding 8cadc4a958 Fix links in the regression test set 12 years ago
Richard Harding 9765d13e90 Garden 12 years ago
Jerry Charumilind 32d1764e83 Add scoring of next page link ancestry and href
This adds the scoring of next page link candidates' ancestry and href values
from the readability algorithm.
12 years ago
Richard Harding 0951647c8e Complete move from test_data/output to regression_test* 12 years ago
Richard Harding ace51a6819 Combine our tests with the new regression_test stuff 12 years ago
Jerry Charumilind 2505c78e5b Jerry Merge: First working find_next_page_link case 12 years ago
Richard Harding edc0e4d4c6 Move tests to testfile 12 years ago
Jerry Charumilind 6abc6f7ef2 Add cleaning of short segments
Conflicts:

	src/readability_lxml/readability.py
12 years ago
Jerry Charumilind 1e30e33302 Move the tests to the testfile 12 years ago
Richard Harding e8a6250605 Clean up merge, put tests in right place, adjust imports 12 years ago
Jerry Charumilind 62df35570d Checkpoint of multi-page article work
This implements some basic tools needed by the multi-page article algorithm.

Conflicts:

	src/readability_lxml/readability.py
12 years ago
Richard Harding 29fceeb4b1 Fix regression to run with metadata 12 years ago
Richard Harding 6f8184be27 Doh, move the tests to the right dir 12 years ago
Richard Harding 9aef5e36b7 Move the test data into the tests/test_data dir 12 years ago
Jerry Charumilind 8988b6b767 Add comment for read_orig 12 years ago
Jerry Charumilind 7d097d5f11 Add subcommand parsing to gen_test
There are now subcommands to generate new tests or just regenerate readable
versions of old tests.

Conflicts:

	src/tests/gen_test.py
12 years ago
Jerry Charumilind b04f75239c Add option to not generate yaml file
Sometimes you just want to generate the data files without the YAML
specification.  This change lets you do that.  In doing so, I switched to use
the argparse module for argument parsing.

Conflicts:

	src/tests/gen_test.py
12 years ago
Jerry Charumilind c21f00b1ee Reorganize constants
Conflicts:

	src/tests/regression.py
12 years ago
Richard Harding 9fec245ae4 garden 12 years ago
Jerry Charumilind 6af808bc14 Add docstring briefly describing gen_test program 12 years ago
Jerry Charumilind 7980ca84c9 Add regression tests for readability results
These test cases provide a baseline from which we can start improving the
readability algorithm and making sure that we do not horribly break anything.

Conflicts:

	src/tests/regression.py
12 years ago
Richard Harding a700bb8bd4 Update makefile regression test helper to open html results 12 years ago
Jerry Charumilind bf203b5a4b Add summary page for test results
Conflicts:

	src/tests/regression.py
12 years ago
Jerry Charumilind 65989b538a Remove obsolete code
Conflicts:

	src/tests/regression.py
12 years ago
Jerry Charumilind 9b7e5bb327 Jerry Merge: Remove obsolete code 12 years ago
Jerry Charumilind 068eba19ae Jerry Merge: Add reading of test information from YAML file 12 years ago
Richard Harding 6d3ad559f6 Move test_data, add regression_test make command 12 years ago
Jerry Charumilind 5222ed0628 Jerry Merge: Initial regression test data 12 years ago
Richard Harding 6454fb3f37 Clean up merge bits a little bit 12 years ago
Richard Harding 9366436861 Merge Jerry: pull in initial set of regression tests 12 years ago
Richard Harding 7dc373e9c5 Add the title and the short title to the metadata set.
- Tested for perf. hit, 100 iterations add .03s total time.
- Added the -m flag to the cmd line client to get all metadata output.
- Added test for making sure title/short title come back as well.
12 years ago
Richard Harding b1966df1c3 Fix docs for changed method 12 years ago
Richard Harding 57694cb352 Remove the get_ in method name, doesn't fit rest of api 12 years ago
Jerry Charumilind b78d7e8501 Merge Jerry: pull in the ability to get back confidence score as well as the processed html 12 years ago
Richard Harding a2b17e757c Update readme for the build location 12 years ago
Richard Harding 3347f16d93 Fix the flipped nature of the <html> wrapping setting 12 years ago
Richard Harding 93ac1111a1 Add try it out to the readable server 12 years ago