feat: implemented extractBestNode functionality

Squashed commit of the following:

commit 9af554dd975ff1778ed70c71fa9bde667fc5f880
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Aug 30 15:19:32 2016 -0400

    feat: add cleanHeaders

commit 0dfea98eedc4f97fcbd78866322595c705e20521
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Aug 30 14:30:49 2016 -0400

    fix: scoring parent nodes recursively

commit b6e5897a694adeb81e25a905aba72c0f45a8cc94
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Aug 30 12:47:24 2016 -0400

    feat: extract clean node up and running

commit fb652c5db13db6bce7271efd68ba4b20515e9549
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Aug 30 09:57:21 2016 -0400

    chore: added test for p tags with nested tags (e.g., img, iframe)

commit 731d0a2e4d89121dfafad195e9d0911805c4f8e4
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Aug 29 17:50:33 2016 -0400

    feat: extract clean node integrates most functions

commit 322bc6534d30feb7c1c08d3813132badc6286b40
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Aug 29 16:46:04 2016 -0400

    feat: removing empty nodes as defined in constants

commit f1d38932ea12a865814d2326970031fcb8515baa
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Aug 29 16:33:31 2016 -0400

    feat: cleaning attributes from nodes

commit 0aa73ada6854af0ecd504bfe3d926a9524787ab5
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Aug 29 16:09:56 2016 -0400

    feat: cleaning h1s from text

commit 12d4a309246285c278ce7765e4fbaa8271bb5889
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Aug 29 15:52:03 2016 -0400

    feat: removing spacer images

commit 4e74ff830cc67586560f6fc72e2cfa432a3a2647
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Aug 29 15:38:49 2016 -0400

    feat: stripping unwanted html from doc

commit c774166e90169fd0c1aa89898d3f7a975e82bf0a
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Aug 29 15:17:32 2016 -0400

    feat: removing small images, height attribute from images

commit 3a8642f42cda451669c832482c5e1611b1ff2ea9
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Aug 29 12:57:45 2016 -0400

    feat: rewrite top level

commit a1c03e779234b0aea02206d92ec3dcc15758507e
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Aug 26 17:34:36 2016 -0400

    in a weird place rn

@ -1,3 +1,7 @@
{
"presets": ["es2015"]
"presets": ["es2015"],
"plugins": [
"transform-es2015-destructuring",
"transform-object-rest-spread"
]
}
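
For context, the two plugins added above enable the syntax the rewritten extractor leans on; a minimal sketch of what they allow (values here are illustrative only):

// Object rest/spread (babel-plugin-transform-object-rest-spread),
// used later in content-extractor.js to merge default options:
const defaultOpts = { stripUnlikelyCandidates: true, weightNodes: true }
const opts = { ...defaultOpts, weightNodes: false }
// => { stripUnlikelyCandidates: true, weightNodes: false }

// Destructuring (babel-plugin-transform-es2015-destructuring):
const { stripUnlikelyCandidates, ...rest } = opts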

@ -1,8 +1,18 @@
Next: Work on score-content, making sure it's working as intended (seems to be)
TODO:
- Test re-initializing $ if/when it needs to loop again
- Make sure weightNodes flag is being passed properly
- Get better sense of when cheerio returns a raw node and when a cheerio object
- Remove $ from function calls to getScore
- Remove $ whenever possible
- Test if .is method is faster than regex methods
- Separate constants into activity-specific folders (dom, scoring)
- `extract` (this kicks it all off)
DONE:
x `cleanHeaders` Remove any headers that are before any p tags, matching title, etc
x `extract` (this kicks it all off)
x `node_is_sufficient`
- `_extract_best_node`
x `_extract_best_node`
x `get_weight`
x `_strip_unlikely_candidates`
x `_convert_to_paragraphs`
@ -21,13 +31,6 @@ x `_score_paragraph`
## Top Candidate
x `_find_top_candidate`
- `extract_clean_node`
- `_clean_conditionally`
x `extract_clean_node`
x `_clean_conditionally`
Make sure weightNodes flag is being passed properly
Get better sense of when cheerio returns a raw node and when a cheerio object
Remove $ from function calls to getScore
Remove $ whenever possible
Test if .is method is faster than regex methods
Separate constants into activity-specific folders (dom, scoring)

@ -10,6 +10,8 @@
"author": "",
"license": "ISC",
"devDependencies": {
"babel-plugin-transform-es2015-destructuring": "^6.9.0",
"babel-plugin-transform-object-rest-spread": "^6.8.0",
"babel-preset-es2015": "^6.13.2",
"babel-register": "^6.11.6",
"mocha": "^3.0.2"

@ -1,831 +1,116 @@
import cheerio from 'cheerio'
import CONSTANTS from './constants'
import extractBestNode from './extract-best-node'
import nodeIsSufficient from '../utils/node-is-sufficient'
import extractCleanNode from './extract-clean-node'
import { normalizeSpaces } from './utils/text'
const GenericContentExtractor = {
flags: {},
// Entry point for parsing html
parse(html, flags={}) {
let $ = cheerio.load(html)
if (flags) {
this.flags = flags
} else {
this.flags = {
"strip_unlikely_candidates": True,
"weight_nodes": True,
"clean_conditionally": True,
}
}
this.extract($)
defaultOpts: {
stripUnlikelyCandidates: true,
weightNodes: true,
cleanConditionally: true,
},
extract($) {
`Extract the content for this resource - initially, pass in our
most restrictive flags which will return the highest quality
content. On each failure, retry with slightly more lax flags.
:param return_type: string. If "node", should return the content
as an LXML node rather than as an HTML string.
Flags:
strip_unlikely_candidates: Remove any elements that match
non-article-like criteria first.(Like, does this element
have a classname of "comment")
weight_nodes: Modify an elements score based on whether it has
certain classNames or IDs. Examples: Subtract if a node has
a className of 'comment', Add if a node has an ID of
'entry-content'.
// Entry point for parsing html
parse(html, opts={}) {
let $ = cheerio.load(html)
opts = { ...this.defaultOpts, ...opts }
clean_conditionally: Clean the node to return of some
superfluous content. Things like forms, ads, etc.
`
const extraction_flags = [
'strip_unlikely_candidates',
'weight_nodes',
'clean_conditionally'
]
// TODO: Title is used to clean headers.
// Should be passed from title extraction.
const title = ''
return this.extract($, opts, title)
},
// Extract the content for this resource - initially, pass in our
// most restrictive flags which will return the highest quality
// content. On each failure, retry with slightly more lax flags.
//
// :param return_type: string. If "node", should return the content
// as a cheerio node rather than as an HTML string.
//
// Flags:
// stripUnlikelyCandidates: Remove any elements that match
// non-article-like criteria first. (Like, does this element
// have a classname of "comment")
//
// weightNodes: Modify an element's score based on whether it has
// certain classNames or IDs. Examples: Subtract if a node has
// a className of 'comment', Add if a node has an ID of
// 'entry-content'.
//
// cleanConditionally: Clean the node to return of some
// superfluous content. Things like forms, ads, etc.
extract($, opts, title) {
// Cascade through our extraction-specific flags in an ordered fashion,
// turning them off as we try to extract content.
{/* node = this.extractCleanNode( */}
{/* this.extractBestNode(), */}
{/* flags.cleanConditionally) */}
console.log('hi')
},
let node = extractCleanNode(
extractBestNode($, opts),
$,
opts.cleanConditionally)
if (nodeIsSufficient(node)) {
console.log("success on first run!!!!!")
return this.cleanAndReturnNode(node, $)
} else {
// We didn't succeed on first pass, one by one disable our
// extraction flags and try again.
console.log("no success doing again!!!!!")
for (const key of Reflect.ownKeys(opts).filter(key => opts[key] === true)) {
opts[key] = false
node = extractCleanNode(
extractBestNode($, opts),
$,
opts.cleanConditionally)
if (nodeIsSufficient(node)) {
break
}
}
}
extractBestNode($) {
` Using a variety of scoring techniques, extract the content most
likely to be article text.
If strip_unlikely_candidates is True, remove any elements that
match certain criteria first. (Like, does this element have a
classname of "comment")
return node
},
If weight_nodes is True, use classNames and IDs to determine the
worthiness of nodes.
// Once we got here, either we're at our last-resort node, or
// we broke early. Make sure we at least have -something- before we
// move forward.
cleanAndReturnNode(node, $) {
if (!node) {
return null
}
// Remove our scoring information from our content
node.removeAttr('score')
node.find('[score]').removeAttr('score')
Returns cheerio instance $
`
return normalizeSpaces($.html(node))
// deep clone the node so we can get back to our initial parsed state
// if needed
// TODO: Performance improvements here? Deepcopy is known to be slow.
// Can we avoid this somehow?
{/* root = deepcopy(self.resource) */}
{/* */}
{/* if self.flags['strip_unlikely_candidates']: */}
{/* self._strip_unlikely_candidates(root) */}
{/* */}
{/* self._convert_to_paragraphs(root) */}
{/* self._score_content(root, weight_nodes=self.flags['weight_nodes']) */}
{/* */}
{/* # print structure(root) */}
{/* */}
{/* top_candidate = self._find_top_candidate(root) */}
{/* */}
{/* return top_candidate */}
// if return_type == "html":
// return normalize_spaces(node_to_html(node))
// else:
// return node
},
}
}
export default GenericContentExtractor
// if node is None:
// return None
//
// if not self.node_is_sufficient(node):
// # We didn't succeed on first pass, one by one disable our
// # extraction flags and try again.
// for flag in extraction_flags:
// self.flags[flag] = False
// clean_conditionally = self.flags.get(
// 'clean_conditionally',
// False
// )
// node = self.extract_clean_node(
// self._extract_best_node(),
// clean_conditionally=clean_conditionally
// )
//
// # If we found a good node, break out of the flag loop.
// if self.node_is_sufficient(node):
// break
//
// # Once we got here, either we're at our last-resort node, or
// # we broke early. Make sure we at least have -something- before we
// # move forward.
// if node is None:
// return None
//
// # Remove our scoring information from our content
// if 'score' in node.attrib:
// del node.attrib['score']
// for scored_node in node.xpath('.//*[@score]'):
// del scored_node.attrib['score']
//
// if return_type == "html":
// return normalize_spaces(node_to_html(node))
// else:
// return node
//
//
// def _extract_best_node(self):
// """ Using a variety of scoring techniques, extract the content most
// likely to be article text.
//
// If strip_unlikely_candidates is True, remove any elements that
// match certain criteria first. (Like, does this element have a
// classname of "comment")
//
// If weight_nodes is True, use classNames and IDs to determine the
// worthiness of nodes.
//
// Returns an lxml node.
// """
//
// # deep clone the node so we can get back to our initial parsed state
// # if needed
// # TODO: Performance improvements here? Deepcopy is known to be slow.
// # Can we avoid this somehow?
// root = deepcopy(self.resource)
//
// if self.flags['strip_unlikely_candidates']:
// self._strip_unlikely_candidates(root)
//
// self._convert_to_paragraphs(root)
// self._score_content(root, weight_nodes=self.flags['weight_nodes'])
//
// # print structure(root)
//
// top_candidate = self._find_top_candidate(root)
//
// return top_candidate
//
// def get_weight(self, node):
// """ Get the score of a node based on its className and id. """
// score = 0
//
// if node.get('id'):
// if constants.NEGATIVE_SCORE_RE.search(node.get('id')):
// score -= 25
// if constants.POSITIVE_SCORE_RE.search(node.get('id')):
// score += 25
//
// if node.get('class'):
// # Only score classes on negative/positive if the ID didn't match.
// if score == 0:
// if constants.NEGATIVE_SCORE_RE.search(node.get('class')):
// score -= 25
// if constants.POSITIVE_SCORE_RE.search(node.get('class')):
// score += 25
//
// # Try to keep photos if we can.
// if constants.PHOTO_HINTS_RE.search(node.get('class')):
// score += 10
//
// # Bonus for entry-content-asset, which is explicitly denoted to be
// # more valuable to Readability in the publisher guidelines.
// if 'entry-content-asset' in node.get('class'):
// score += 25
//
// return score
//
//
// # The removal is implemented as a blacklist and whitelist, this test finds
// # blacklisted elements that aren't whitelisted. We do this all in one
// # expression-both because it's only one pass, and because this skips the
// # serialization for whitelisted nodes.
// candidates_blacklist = '|'.join(constants.UNLIKELY_CANDIDATES_BLACKLIST)
// candidates_whitelist = '|'.join(constants.UNLIKELY_CANDIDATES_WHITELIST)
//
// # Note: Regular expressions appear to be about 3 times as fast as looping
// # over each key and matching with contains().
// #
// # TODO: Consider mapping all classnames and ids to hashes and using set
// # intersections for performance.
// candidates_xpath = (
// './/*['
// 'not(self::a) and '
// 're:test(concat(@id, " ", @class), "%s", "i") and '
// 'not( re:test(concat(@id, " ", @class), "%s", "i"))'
// ']'
// ) % (candidates_blacklist, candidates_whitelist)
// def _strip_unlikely_candidates(self, doc):
// """ Loop through the provided document and remove any non-link nodes
// that are unlikely candidates for article content.
//
// Links are ignored because there are very often links to content
// that are identified as non-body-content, but may be inside
// article-like content.
//
// :param doc: an LXML doc to strip nodes from
// :return node: The node itself (even though the conversion happens
// by-reference)
// """
// unlikely_candidates = doc.xpath(self.candidates_xpath,
// namespaces=RE_NAMESPACE)
//
// for node in unlikely_candidates:
// node.drop_tree()
//
// return doc
//
// def _convert_to_paragraphs(self, doc):
// """ Loop through the provided doc, and convert any p-like elements to
// actual paragraph tags.
//
// Things fitting this criteria:
// * Multiple consecutive <br /> tags.
// * <div /> tags without block level elements inside of them
// * <span /> tags who are not children of <p /> or <div /> tags.
//
// :param doc: An LXML node to search through.
// :return an LXML node of the element, cleaned up.
// (By-reference mutation, though. Returned just for convenience.)
// """
//
// # Convert every doubled-<br /> to a paragraph tag.
// self._brs_to_paragraphs(doc)
//
// # Convert every shallow <div /> to a paragraph tag. Ignore divs that
// # contain other block level elements.
// inner_block_tags = './/' + ' or .//'.join(constants.DIV_TO_P_BLOCK_TAGS)
// shallow_divs = doc.xpath('.//div[not(%s)]' % inner_block_tags)
//
// for div in shallow_divs:
// div.tag = 'p'
//
// # Convert every span tag who has no ancestor p or div tag within their
// # family tree to a P as well.
// p_like_spans = doc.xpath('.//span[not(ancestor::p or ancestor::div)]')
// for span in p_like_spans:
// span.tag = 'p'
//
// # If, after all of this, we have no P tags at all, we are probably
// # dealing with some very ugly content that is separated by single BR
// # tags. Convert them individually to P tags.
// if int(doc.xpath('count(//p)')) == 0:
// self._brs_to_paragraphs(doc, min_consecutive=1)
//
// # Remove font and center tags, which are ugly and annoying
// for fonttag in doc.xpath('.//font | .//center'):
// fonttag.drop_tag()
//
//
// ### DO WE EVEN NEED THIS?? -Chris ###
//
// # # Due to the way the paras are inserted, the first paragraph does not
// # # get captured. Since this first para can contain all sorts of random
// # # junk (links, drop caps, images) it's not easy to regex our way to
// # # victory so we do it via dom. - Karl G
// # try:
// # first = node.xpath('.//p[@class = "rdb_br"][position() = 1]')[0]
// # except IndexError:
// # pass
// # else:
// # parent = first.getparent()
// # breaker = None
// # if parent is None:
// # parent = node
// # para = E.P({'class':'rdb_br firstp'})
// # has_predecessors = False
// # for sibling in first.itersiblings(preceding = True):
// # has_predecessors = True
// # if sibling.tag in ['p', 'div']:
// # breaker = sibling
// # break
// # para.insert(0,sibling)
// #
// # if (not has_predecessors and parent.text is not None and
// # parent.text.strip() != ""):
// # para.text = parent.text
// # parent.text = ''
// # else:
// # para.text = (para.text or '') + (parent.tail or '')
// #
// # parent.tail = ''
// # if breaker is None:
// # parent.insert(0,para)
// # else:
// # parent.insert(parent.index(breaker)+1,para)
//
// return doc
//
// def _brs_to_paragraphs(self, doc, min_consecutive=2):
// """ Given an LXML document, convert consecutive <br /> tags into
// <p /> tags instead.
//
// :param doc: An LXML document to convert within.
// :param min_consecutive: Integer, the minimum number of consecutive
// <br /> tags that must exist for them to be converted to <p />
// tags. Must be at least 1.
//
// A word to the wise: This is deceptively tricky, as break tags
// don't behave like normal XML should. Make sure you test
// thoroughly if you make any changes to this code.
// """
// brs = doc.xpath('.//br')
//
// # Loop through all of our break tags, looking for consecutive
// # <br />s with no content in between them. If found, replace them
// # with a single P tag.
// for br in brs:
// # Generate a list of all the breaks in a row, with no text in
// # between them.
// joined_brs = []
// cur_br = br
// while True:
// joined_brs.append(cur_br)
//
// if cur_br.tail:
// break
//
// next = cur_br.getnext()
// next_is_br = next is not None and next.tag.lower() == 'br'
//
// if next_is_br:
// cur_br = next
// else:
// break
//
// if len(joined_brs) < min_consecutive:
// continue
//
// last_br = joined_brs[-1]
//
// # Now loop through following siblings, until we hit a block
// # tag or the end, and append them to this P if they are not a
// # block tag that is not a BR.
// self._paragraphize(last_br)
//
// # Drop every break that we no longer need because of the P.
// # The first BR has been turned into a P tag.
// for joined_br in joined_brs:
// if joined_br is not last_br:
// joined_br.drop_tag()
//
// # If we had any new p tags that are already inside a P tag, resolve
// # those by paragraphizing them, which will append their block level
// # contents.
// for fix_count in xrange(1000):
// # Find the first p that contains another p, and paragraphize it.
// # We do this in a loop because we're modifying the dom as we go.
// try:
// parent_p = doc.xpath('//p[./p][1]')[0]
// self._paragraphize(parent_p)
// except IndexError:
// break
// else:
// # We exhausted our loop, which means we've looped too many times
// # such that it's unreasonable. Log a warning.
// logger.warning("Bailing on p parent fix due to crazy "
// "looping for url %s" % self.resource.url)
//
// def _paragraphize(self, node):
// """ Given a node, turn it into a P if it is not already a P, and
// make sure it conforms to the constraints of a P tag (I.E. does
// not contain any other block tags.)
//
// If the node is a <br />, it treats the following inline siblings
// as if they were its children.
//
// :param node: The node to paragraphize
// """
// is_br = (node.tag.lower() == 'br')
//
// if is_br and node.tail:
// node.text = node.tail
// node.tail = None
//
// node.tag = 'p'
//
// if is_br:
// sibling = node.getnext()
// while True:
// if (sibling is None or (
// sibling.tag in constants.BLOCK_LEVEL_TAGS and
// sibling.tag != 'br'
// )):
// break
//
// next_sibling = sibling.getnext()
// node.append(sibling)
// sibling = next_sibling
//
// else:
// children = node.getchildren()
// i = 0
// il = len(children)
// # Ghetto looping so we have access to the iterator afterward
// while i < il:
// child = children[i]
// if (child is None or
// (child.tag in constants.BLOCK_LEVEL_TAGS and
// child.tag != 'br')
// ):
// break
// i = i+1
//
// # This means we encountered a block level tag within our P,
// # so we should pop the rest down to siblings.
// if i < il:
// for j in xrange(i, il):
// node.addnext(children[j])
//
//
// ### --- SCORING --- ###
//
// def _get_score(self, node, weight_nodes=True):
// """Get a node's score. If weight_nodes is true, weight classes when
// getting the score as well.
//
// """
// score = node.get('score')
// if score is None:
// score = self._score_node(node)
// if weight_nodes:
// score += self.get_weight(node)
// parent = node.getparent()
// if parent is not None:
// self._set_score(parent, self._get_score(parent) + .25 * score)
// else:
// score = float(score)
// return score
//
// def _set_score(self, node, val):
// """Set the score of a node to val"""
// return node.set('score', str(val))
//
// def _add_score(self, node, val, weight_nodes=True):
// return self._set_score(node, self._get_score(node, weight_nodes) + val)
//
// def _score_content(self, doc, weight_nodes=True):
// """score content. Parents get the full value of their children's
// content score, grandparents half
// """
//
// # First, look for special hNews based selectors and give them a big
// # boost, if they exist
// for selector in constants.HNEWS_CONTENT_SELECTORS:
// # Not self.resource.extract_by_selector because our doc is a copy
// # of the resource doc.
// nodes = extract_by_selector(doc, selector,
// AttribMap(doc))
// for node in nodes:
// self._add_score(node, 80)
//
// paras = doc.xpath('.//p | .//pre')
//
// # If we don't have any paragraphs at all, we can't score based on
// # paragraphs, so return without modifying anything else.
// if len(paras) == 0:
// return doc
//
// for para in paras:
// # Don't score invalid tags
// if not isinstance(para.tag, basestring):
// continue
//
// # The raw score for this paragraph, before we add any parent/child
// # scores.
// raw_score = self._score_node(para)
// self._set_score(para, self._get_score(para, weight_nodes))
//
// parent = para.getparent()
// if parent is not None:
// if parent.tag == 'span':
// parent.tag = 'div'
//
// # Add the individual content score to the parent node
// self._add_score(parent, raw_score, weight_nodes=weight_nodes)
//
// grandparent = parent.getparent()
// if grandparent is not None:
// if grandparent.tag == 'span':
// grandparent.tag = 'div'
//
// # Add half of the individual content score to the
// # grandparent
// gp_score = raw_score / 2.0
// self._add_score(grandparent, gp_score, weight_nodes=weight_nodes)
//
// return doc
//
//
// def _score_node(self, node):
// """Score an individual node. Has some smarts for paragraphs, otherwise
// just scores based on tag currently.
//
// """
// score = 0
//
// if node.tag in ['p', 'li', 'span', 'pre']:
// score += self._score_paragraph(node)
// if node.tag in ['div']:
// score += 5
// elif node.tag in ['pre', 'td', 'blockquote', 'ol', 'ul', 'dl']:
// score += 3
// elif node.tag in ['address', 'form']:
// score -= 3
// elif node.tag in ['th']:
// score -= 5
//
// return score
//
// def _score_paragraph(self, node):
// """Score a paragraph using various methods. Things like number of
// commas, etc. Higher is better.
//
// """
//
// # Start with a point for the paragraph itself as a base.
// score = 1
// text = inner_text(node)
// text_len = len(text)
//
// if text_len == 0:
// if node.getparent() is not None and len(node.getchildren()) == 0:
// node.drop_tree()
// return 0
//
// # If this paragraph is less than 25 characters, don't count it.
// if text_len < 25:
// return 0
//
// # Add points for any commas within this paragraph
// score += text.count(',')
//
// # For every 50 characters in this paragraph, add another point. Up
// # to 3 points.
// chunk_count = (text_len / 50)
// if chunk_count > 0:
// length_bonus = 0
// if node.tag in ('pre', 'p'):
// length_bonus += chunk_count - 2
// else:
// length_bonus += chunk_count - 1.25
// score += min(max(length_bonus, 0), 3)
//
// # Articles can end with short paragraphs when people are being clever
// # but they can also end with short paragraphs setting up lists of junk
// # that we strip. This negative tweaks junk setup paragraphs just below
// # the cutoff threshold.
// if text.endswith(':'):
// score -= 1
//
// return score
//
// ### ------- TOP CANDIDATE EXTRACTION ------ ###
//
// def _find_top_candidate(self, root):
// # After we've calculated scores, loop through all of the possible
// # candidate nodes we found and find the one with the highest score.
// top_candidate = None
// top_candidate_score = 0
// # Note: .//* is faster than .//*[@score], believe it or not.
// for candidate in root.xpath('.//*'):
//
// if candidate.tag in constants.NON_TOP_CANDIDATE_TAGS:
// continue
//
// candidate_score = self._get_score(candidate)
// if top_candidate is None or candidate_score > top_candidate_score:
// top_candidate = candidate
// top_candidate_score = self._get_score(top_candidate)
//
//
// # If we still have no candidate, just use the body
// if top_candidate is None or len(inner_text(top_candidate)) < 250:
// to_ret = root.find('body')
// if to_ret is None:
// to_ret = root.xpath('.')[0]
// elif top_candidate.getparent() is not None:
// # Now that we have a top_candidate, look through the siblings of
// # it to see if any of them are decently scored. If they are, they
// # may be split parts of the content (Like two divs, a preamble and
// # a body.) Example:
// # http://articles.latimes.com/2009/oct/14/business/fi-bigtvs14
// to_ret = E.DIV()
// sibling_score_threshold = max(10, top_candidate_score * 0.2)
// for child in top_candidate.getparent().iterchildren():
// if not isinstance(child.tag, basestring):
// continue
//
// if self._get_score(child):
// append = False
//
// if child == top_candidate:
// to_ret.append(child)
// continue
//
// density = link_density(child)
// content_bonus = 0
//
// # If the sibling has a very low link density, give a small
// # bonus.
// if density < 0.05:
// content_bonus += 20
//
// # If it's high, give it a penalty
// if density >= 0.5:
// content_bonus -= 20
//
// # If sibling nodes and top candidates have the exact same
// # className, give a bonus
// if child.get('class', False) == top_candidate.get('class'):
// content_bonus += top_candidate_score * 0.2
//
// sibling_score = self._get_score(child) + content_bonus
// if sibling_score >= sibling_score_threshold:
// append = True
// elif child.tag == 'p':
// child_content = child.text_content()
// child_content_len = len(child_content)
//
// if child_content_len > 80 and density < 0.25:
// append = True
// elif (child_content_len <= 80 and density == 0 and
// has_sentence_end(child_content)):
// append = True
//
// if append:
// to_ret.append(child)
// else:
// to_ret = top_candidate
//
// return to_ret
//
// def extract_clean_node(self, article, clean_conditionally=False):
// """ Clean our article content, returning a new, cleaned node. """
// doc = deepcopy(article)
//
// # Rewrite the tag name to div if it's a top level node like body or
// # html to avoid later complications with multiple body tags.
// if doc.tag in ['html','body']:
// doc.tag = 'div'
//
// for img in doc.xpath('.//img'):
// try:
// img_height = int(img.attrib.get('height', 20))
// img_width = int(img.attrib.get('width', 20))
// if img_height < 10 or img_width < 10:
// # Remove images that explicitly have very small heights or
// # widths, because they are most likely shims or icons,
// # which aren't very useful for reading.
// img.drop_tree()
// elif 'height' in img.attrib:
// # Don't ever specify a height on images, so that we can
// # scale with respect to width without screwing up the
// # aspect ratio.
// del img.attrib['height']
// except:
// pass
//
// # Drop certain tags like <title>, etc
// # This is -mostly- for cleanliness, not security. The lxml Cleaner
// # method in Resource does most of the security stuff for us.
// for tag in doc.xpath('.//' + ' | .//'.join(constants.STRIP_OUTPUT_TAGS)):
// tag.drop_tree()
//
// # Drop spacer images
// spacer_path = './/img[re:test(@src, "trans|transparent|spacer|blank", "i")]'
// for tag in doc.xpath(spacer_path, namespaces={'re': constants.RE_NS}):
// tag.drop_tree()
//
// # H1 tags are typically the article title, which should be extracted
// # by the title extractor instead. If there's less than 3 of them (<3),
// # strip them. Otherwise, turn 'em into H2s.
// hOnes = doc.xpath('.//h1')
// if len(hOnes) < 3:
// for e in hOnes:
// e.drop_tree()
// else:
// for e in hOnes:
// e.tag = 'h2'
//
// headers = doc.xpath('.//h2 | .//h3 | .//h4 | .//h5 | .//h6')
// for header in headers:
// drop_header = False
//
// # Remove any headers that are before any p tags in the
// # document. This probably means that it was part of the title, a
// # subtitle or something else extraneous like a datestamp or byline,
// # all of which should be handled by other metadata handling.
// no_previous_ps = int(header.xpath("count(preceding::p[1])")) == 0
// if no_previous_ps:
// similar_header_count = int(doc.xpath('count(.//%s)' % header.tag))
// if similar_header_count < 3:
// drop_header = True
//
// # Remove any headers that match the title exactly.
// if inner_text(header) == self.title:
// drop_header = True
//
// # If this header has a negative weight, it's probably junk.
// # Get rid of it.
// if self.get_weight(header) < 0:
// drop_header = True
//
// if drop_header:
// try:
// header.drop_tree()
// except AssertionError:
// # No parent exists for this node, so just blank it out.
// header.text = ''
//
// for tag in doc.xpath('.//*[@style or @align]'):
// try:
// del tag.attrib['style']
// except KeyError:
// pass
// try:
// del tag.attrib['align']
// except KeyError:
// pass
//
// for para in doc.xpath('.//p'):
// # We have a blank tag
// if (len(inner_text(para)) < 3 and
// len(para.xpath('.//img')) == 0 and
// len(para.xpath('.//iframe')) == 0):
// para.drop_tree()
//
// if clean_conditionally:
// # We used to clean UL's and OL's here, but it was leading to
// # too many in-article lists being removed. Consider a better
// # way to detect menus particularly and remove them.
// self._clean_conditionally(doc, ['ul', 'ol', 'table', 'div'])
//
// return doc
//
// def _clean_conditionally(self, doc, tags):
// """Given a doc, clean it of some superfluous content specified by
// tags. Things like forms, ads, etc.
//
// Tags is an array of tag name's to search through. (like div, form,
// etc)
//
// Return this same doc.
// """
// for node in doc.xpath('.//' + ' | .//'.join(tags)):
//
// node_is_list = node.tag in ('ul', 'ol')
//
// weight = self._get_score(node)
// if node.getparent() is None:
// continue
// if weight < 0:
// node.drop_tree()
// else:
// node_content = inner_text(node)
// if node_content.count(',') < 10:
// remove_node = False
// p_count = int(node.xpath('count(.//p)'))
// img_count = int(node.xpath('count(.//img)'))
// input_count = int(node.xpath('count(.//input)'))
// script_count = int(node.xpath('count(.//script)'))
// density = link_density(node)
// content_length = len(inner_text(node))
//
// # Looks like a form, too many inputs.
// if input_count > (p_count / 3):
// remove_node = True
//
// # Content is too short, and there are no images, so
// # this is probably junk content.
// elif content_length < 25 and img_count == 0:
// remove_node = True
//
// # Too high of link density, is probably a menu or
// # something similar.
// elif (weight < 25 and
// density > 0.2 and
// content_length > 75):
// remove_node = True
//
// # Too high of a link density, despite the score being
// # high.
// elif weight >= 25 and density > 0.5:
// remove_node = True
// # Don't remove the node if it's a list and the
// # previous sibling starts with a colon though. That
// # means it's probably content.
// if node_is_list:
// previous_sibling = node.getprevious()
// if (previous_sibling is not None and
// inner_text(previous_sibling)[-1:] == ':'):
// remove_node = False
//
// # Too many script tags, not enough content.
// elif script_count > 0 and len(node_content) < 150:
// remove_node = True
// print "#######SCORE########"
// print self.high_score
// print self.top_node.tag
// # Remove our scoring information from our content
// if 'score' in node.attrib:
// del node.attrib['score']
// for scored_node in node.xpath('.//*[@score]'):
// del scored_node.attrib['score']
//
// # Explicitly save entry-content-asset tags, which are
// # noted as valuable in the Publisher guidelines. For now
// # this works everywhere. We may want to consider making
// # this less of a sure-thing later.
// if 'entry-content-asset' in node.get('class', ''):
// remove_node = False
// if return_type == "html":
// return normalize_spaces(node_to_html(node))
// else:
// return node
//
// if remove_node:
// node.drop_tree()
// return doc
export default GenericContentExtractor
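
A minimal usage sketch of the extractor defined above; the sample HTML is made up, and the note about nodeIsSufficient is an assumption since that helper is not shown in this diff:

// Sketch only -- not part of this commit.
import GenericContentExtractor from './content-extractor'

const html = `
  <html><body>
    <div class="entry-content">
      <p>Plenty of article text goes here...</p>
    </div>
  </body></html>
`

// parse() starts with the most restrictive flags; extract() relaxes
// them one by one whenever nodeIsSufficient() rejects the cleaned
// node (presumably a minimum-content check -- an assumption here).
const article = GenericContentExtractor.parse(html, {
  stripUnlikelyCandidates: true,
  weightNodes: true,
  cleanConditionally: true,
})
// On success this is the cleaned, space-normalized HTML string
// returned by cleanAndReturnNode().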

@ -0,0 +1,18 @@
import assert from 'assert'
import cheerio from 'cheerio'
import fs from 'fs'
import { clean } from './utils/dom/test-helpers'
import GenericContentExtractor from './content-extractor'
describe('GenericContentExtractor', () => {
describe('parse(html, opts)', () => {
it("parses html and returns the article", () => {
const html = fs.readFileSync('../fixtures/latimes.html', 'utf-8')
const result = clean(GenericContentExtractor.parse(html))
// console.log(result)
})
})
})

@ -0,0 +1,36 @@
import {
scoreContent,
findTopCandidate,
} from './utils/scoring'
import {
stripUnlikelyCandidates,
convertToParagraphs,
} from './utils/dom'
// Using a variety of scoring techniques, extract the content most
// likely to be article text.
//
// If strip_unlikely_candidates is True, remove any elements that
// match certain criteria first. (Like, does this element have a
// classname of "comment")
//
// If weight_nodes is True, use classNames and IDs to determine the
// worthiness of nodes.
//
// Returns a cheerio object $
export default function extractBestNode($, opts) {
// clone the node so we can get back to our
// initial parsed state if needed
// TODO Do I need this? AP
// let $root = $.root().clone()
if (opts.stripUnlikelyCandidates) {
$ = stripUnlikelyCandidates($)
}
$ = convertToParagraphs($)
$ = scoreContent($, opts.weightNodes)
const topCandidate = findTopCandidate($)
return topCandidate
}

@ -0,0 +1,24 @@
import assert from 'assert'
import cheerio from 'cheerio'
import fs from 'fs'
// import HTML from './fixtures/html'
import extractBestNode from './extract-best-node'
describe('extractBestNode($, flags)', () => {
it("scores the dom nodes and returns the best option", () => {
const html = fs.readFileSync('../fixtures/latimes.html', 'utf-8')
const opts = {
stripUnlikelyCandidates: true,
weightNodes: true,
}
let $ = cheerio.load(html)
const bestNode = extractBestNode($, opts)
// console.log(bestNode.html())
// assert.equal($(bestNode).text().length, 3652)
})
})

@ -0,0 +1,88 @@
import {
convertNodeTo,
rewriteTopLevel,
cleanImages,
stripJunkTags,
cleanHOnes,
cleanHeaders,
cleanTags,
cleanAttributes,
removeEmpty,
} from './utils/dom'
// Clean our article content, returning a new, cleaned node.
export default function extractCleanNode(article, $, cleanConditionally=true, title='') {
// do I need to copy/clone?
// Can't I just start over w/fresh html if I need to?
// Look into this
// let doc = article
// Rewrite the tag name to div if it's a top level node like body or
// html to avoid later complications with multiple body tags.
rewriteTopLevel(article, $)
// Drop small images and spacer images
cleanImages(article, $)
// Drop certain tags like <title>, etc
// This is -mostly- for cleanliness, not security.
stripJunkTags(article, $)
// H1 tags are typically the article title, which should be extracted
// by the title extractor instead. If there are fewer than 3 of them (<3),
// strip them. Otherwise, turn 'em into H2s.
cleanHOnes(article, $)
// Clean headers
cleanHeaders(article, $, title)
// Remove style or align attributes
cleanAttributes(article, $)
// We used to clean UL's and OL's here, but it was leading to
// too many in-article lists being removed. Consider a better
// way to detect menus particularly and remove them.
cleanTags(article, $, cleanConditionally)
// Remove empty paragraph nodes
removeEmpty(article, $)
return article
}
// headers = doc.xpath('.//h2 | .//h3 | .//h4 | .//h5 | .//h6')
// for header in headers:
// drop_header = False
//
// # Remove any headers that are before any p tags in the
// # document. This probably means that it was part of the title, a
// # subtitle or something else extraneous like a datestamp or byline,
// # all of which should be handled by other metadata handling.
// no_previous_ps = int(header.xpath("count(preceding::p[1])")) == 0
// if no_previous_ps:
// similar_header_count = int(doc.xpath('count(.//%s)' % header.tag))
// if similar_header_count < 3:
// drop_header = True
//
// # Remove any headers that match the title exactly.
// if inner_text(header) == self.title:
// drop_header = True
//
// # If this header has a negative weight, it's probably junk.
// # Get rid of it.
// if self.get_weight(header) < 0:
// drop_header = True
//
// if drop_header:
// try:
// header.drop_tree()
// except AssertionError:
// # No parent exists for this node, so just blank it out.
// header.text = ''
//
// if clean_conditionally:
// # We used to clean UL's and OL's here, but it was leading to
// # too many in-article lists being removed. Consider a better
// # way to detect menus particularly and remove them.
// self._clean_conditionally(doc, ['ul', 'ol', 'table', 'div'])
//
// return doc

@ -0,0 +1,34 @@
import assert from 'assert'
import cheerio from 'cheerio'
import fs from 'fs'
// import HTML from './fixtures/html'
import extractCleanNode from './extract-clean-node'
import extractBestNode from './extract-best-node'
describe('extractCleanNode(article, $, { cleanConditionally })', () => {
it("cleans cruft out of a DOM node", () => {
const html = fs.readFileSync('../fixtures/wired.html', 'utf-8')
let $ = cheerio.load(html)
const opts = {
stripUnlikelyCandidates: true,
weightNodes: true,
cleanConditionally: true,
}
const bestNode = extractBestNode($, opts)
let result = $.html(bestNode)
// console.log(result)
console.log(result.length)
const cleanNode = extractCleanNode(bestNode, $, opts)
result = $.html(cleanNode)
console.log(result.length)
// console.log(result)
// console.log(bestNode.html())
// assert.equal($(bestNode).text().length, 3652)
})
})

@ -559,6 +559,7 @@ export const UNLIKELY_CANDIDATES_BLACKLIST = [
'presence_control_external', // lifehacker.com container full of false positives
'popup',
'printfriendly',
'related',
'remove',
'remark',
'rss',
@ -834,6 +835,9 @@ export const STRIP_OUTPUT_TAGS = [
'hr',
]
// Spacer images to be removed
export const SPACER_RE = new RegExp("trans|transparent|spacer|blank", "i")
// XPath to try to determine if a page is wordpress. Not always successful.
export const IS_WP_XPATH = "//meta[@name='generator'][starts-with(@value,'WordPress')]"
@ -972,3 +976,17 @@ export const PARAGRAPH_SCORE_TAGS = new RegExp('^(p|li|span|pre)$', 'i')
export const CHILD_CONTENT_TAGS = new RegExp('^(td|blockquote|ol|ul|dl)$', 'i')
export const BAD_TAGS = new RegExp('^(address|form)$', 'i')
export const HTML_OR_BODY_RE = new RegExp('^(html|body)$', 'i')
export const REMOVE_ATTRS = ['style', 'align']
export const REMOVE_ATTR_SELECTORS = REMOVE_ATTRS.map(selector => `[${selector}]`)
export const REMOVE_ATTR_LIST = REMOVE_ATTRS.join(',')
export const REMOVE_EMPTY_TAGS = ['p']
export const REMOVE_EMPTY_SELECTORS = REMOVE_EMPTY_TAGS.map(tag => `${tag}:empty`).join(',')
export const CLEAN_CONDITIONALLY_TAGS = ['ul', 'ol', 'table', 'div'].join(',')
const HEADER_TAGS = ['h2', 'h3', 'h4', 'h5', 'h6']
export const HEADER_TAG_LIST = HEADER_TAGS.join(',')
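
For reference, the derived selector constants above evaluate to the following strings (worked out from the definitions; nothing new is exported):

// REMOVE_ATTR_SELECTORS    => ['[style]', '[align]']
// REMOVE_ATTR_LIST         => 'style,align'
// REMOVE_EMPTY_SELECTORS   => 'p:empty'
// CLEAN_CONDITIONALLY_TAGS => 'ul,ol,table,div'
// HEADER_TAG_LIST          => 'h2,h3,h4,h5,h6'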

@ -0,0 +1,14 @@
import {
REMOVE_ATTR_SELECTORS,
REMOVE_ATTR_LIST,
REMOVE_ATTRS,
} from '../constants'
// Remove attributes like style or align
export default function cleanAttributes(article, $) {
REMOVE_ATTRS.forEach((attr) => {
$(`[${attr}]`, article).removeAttr(attr)
})
return $
}

@ -0,0 +1,24 @@
import cheerio from 'cheerio'
import assert from 'assert'
import HTML from '../fixtures/html'
import { assertClean } from './test-helpers'
import { cleanAttributes } from './index'
describe('cleanAttributes($)', () => {
it("removes style attributes from nodes", () => {
let $ = cheerio.load(HTML.removeStyle.before)
let result = cleanAttributes($('*').first(), $)
assertClean(result.html(), HTML.removeStyle.after)
})
it("removes align attributes from nodes", () => {
let $ = cheerio.load(HTML.removeAlign.before)
let result = cleanAttributes($('*').first(), $)
assertClean(result.html(), HTML.removeAlign.after)
})
})

@ -0,0 +1,18 @@
import { convertNodeTo } from './index'
// H1 tags are typically the article title, which should be extracted
// by the title extractor instead. If there are fewer than 3 of them (<3),
// strip them. Otherwise, turn 'em into H2s.
export default function cleanHOnes(article, $) {
// const hOnes = $.find('h1')
const hOnes = $('h1', article)
if (hOnes.length < 3) {
hOnes.each((index, node) => $(node).remove())
} else {
hOnes.each((index, node) => {
convertNodeTo(node, $, 'h2')
})
}
return $
}

@ -0,0 +1,28 @@
import assert from 'assert'
import cheerio from 'cheerio'
import HTML from '../fixtures/html'
import { assertClean } from './test-helpers'
import { cleanHOnes } from './index'
describe('cleanHOnes($)', () => {
it("removes H1s if there are less than 3 of them", () => {
let $ = cheerio.load(HTML.removeTwoHOnes.before)
let result = cleanHOnes($('*').first(), $)
assertClean(result.html(), HTML.removeTwoHOnes.after)
})
it("converts H1s to H2s if there are 3 or more of them", () => {
let $ = cheerio.load(HTML.convertThreeHOnes.before)
let result = cleanHOnes($('*').first(), $)
assertClean(result.html(), HTML.convertThreeHOnes.after)
})
})

@ -0,0 +1,38 @@
import { HEADER_TAG_LIST } from '../constants'
import { normalizeSpaces } from '../text'
import { getWeight } from '../scoring'
export default function cleanHeaders(article, $, title='') {
$(HEADER_TAG_LIST, article).each((index, header) => {
// Remove any headers that appear before all other p tags in the
// document. This probably means that it was part of the title, a
// subtitle or something else extraneous like a datestamp or byline,
// all of which should be handled by other metadata handling.
if ($(header, article).prevAll('p').length === 0) {
return $(header).remove()
}
// Remove any headers that match the title exactly.
if (normalizeSpaces($(header).text()) === title) {
return $(header).remove()
}
// If this header has a negative weight, it's probably junk.
// Get rid of it.
if (getWeight($(header)) < 0) {
return $(header).remove()
}
})
return $
}
// # If this header has a negative weight, it's probably junk.
// # Get rid of it.
// if self.get_weight(header) < 0:
// drop_header = True
//
// if drop_header:
// try:
// header.drop_tree()
// except AssertionError:
// # No parent exists for this node, so just blank it out.
// header.text = ''
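
A small illustration of the three removal rules in cleanHeaders above, using hypothetical markup rather than a fixture from this commit (the third case assumes "comment" matches the negative-weight regex, as the extractor comments suggest):

// cleanHeaders(article, $, 'Title Match') applied to:
// <div>
//   <h2>Posted May 1, 2016</h2>       <- removed: no <p> precedes it
//   <p>Body text...</p>
//   <h3>Title Match</h3>              <- removed: matches the title
//   <h2 class="comment">Replies</h2>  <- removed: negative weight (assumed)
//   <h2>An actual subhead</h2>        <- kept
// </div>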

@ -0,0 +1,31 @@
import assert from 'assert'
import cheerio from 'cheerio'
import HTML from '../fixtures/html'
import { assertClean } from './test-helpers'
import { cleanHeaders } from './index'
describe('cleanHeaders(article, $)', () => {
it("parses html and returns the article", () => {
let $ = cheerio.load(HTML.cleanFirstHeds.before)
let result = cleanHeaders($('*').first(), $)
assertClean(result.html(), HTML.cleanFirstHeds.after)
})
it("removes headers when the header text matches the title", () => {
let $ = cheerio.load(HTML.cleanTitleMatch.before)
let result = cleanHeaders($('*').first(), $, 'Title Match')
assertClean(result.html(), HTML.cleanTitleMatch.after)
})
it("removes headers with a negative weight", () => {
let $ = cheerio.load(HTML.dropWithNegativeWeight.before)
let result = cleanHeaders($('*').first(), $)
assertClean(result.html(), HTML.dropWithNegativeWeight.after)
})
})

@ -0,0 +1,41 @@
import { SPACER_RE } from '../constants'
export default function cleanImages(article, $) {
$(article).find('img').each((index, img) => {
img = $(img)
cleanForHeight(img, $)
removeSpacers(img, $)
})
return $
}
function cleanForHeight(img, $) {
const height = parseInt(img.attr('height'))
const width = parseInt(img.attr('width')) || 20
// Remove images that explicitly have very small heights or
// widths, because they are most likely shims or icons,
// which aren't very useful for reading.
if ((height || 20) < 10 || width < 10) {
$(img).remove()
} else if (height) {
// Don't ever specify a height on images, so that we can
// scale with respect to width without screwing up the
// aspect ratio.
img.removeAttr('height')
}
return $
}
// Cleans out images where the source string matches transparent/spacer/etc
// TODO This seems very aggressive - AP
function removeSpacers(img, $) {
if (SPACER_RE.test(img.attr('src'))) {
$(img).remove()
}
return $
}
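
To make the height/width defaulting in cleanForHeight explicit, here is how a few images fare (consistent with the cleanSmallImages and cleanHeight fixtures further down):

// <img width="5" height="5">    -> removed (height 5 < 10)
// <img height="8">              -> removed (height 8 < 10; width defaults to 20)
// <img width="50">              -> kept as-is (missing height defaults to 20)
// <img width="50" height="50">  -> kept, but the height attribute is stripped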

@ -0,0 +1,33 @@
import assert from 'assert'
import cheerio from 'cheerio'
import HTML from '../fixtures/html'
import { assertClean } from './test-helpers'
import { cleanImages } from './index'
describe('cleanImages($)', () => {
it("removes images with small heights/widths", () => {
let $ = cheerio.load(HTML.cleanSmallImages.before)
let result = cleanImages($('*').first(), $)
assertClean(result.html(), HTML.cleanSmallImages.after)
})
it("removes height attribute from images that remain", () => {
let $ = cheerio.load(HTML.cleanHeight.before)
let result = cleanImages($('*').first(), $)
assertClean(result.html(), HTML.cleanHeight.after)
})
it("removes spacer/transparent images", () => {
let $ = cheerio.load(HTML.cleanSpacer.before)
let result = cleanImages($('*').first(), $)
assertClean(result.html(), HTML.cleanSpacer.after)
})
})

@ -0,0 +1,101 @@
import { CLEAN_CONDITIONALLY_TAGS } from '../constants'
import {
getScore,
setScore,
getOrInitScore,
scoreCommas,
} from '../scoring'
import { normalizeSpaces } from '../text'
import { linkDensity } from './index'
// Given an article, clean it of some superfluous content specified by
// tags. Things like forms, ads, etc.
//
// Tags is an array of tag names to search through. (like div, form,
// etc)
//
// Return this same doc.
export default function cleanConditionally(article, $) {
$(CLEAN_CONDITIONALLY_TAGS, article).each((index, node) => {
let weight = getScore($(node))
if (!weight) {
weight = getOrInitScore($(node), $)
setScore(weight, $)
}
// drop node if its weight is < 0
if (weight < 0) {
$(node).remove()
} else {
// determine if node seems like content
removeUnlessContent(node, $, weight)
}
})
return $
}
function removeUnlessContent(node, $, weight) {
// Explicitly save entry-content-asset tags, which are
// noted as valuable in the Publisher guidelines. For now
// this works everywhere. We may want to consider making
// this less of a sure-thing later.
if ($(node).hasClass('entry-content-asset')) {
return
}
const content = normalizeSpaces($(node).text())
if (scoreCommas(content) < 10) {
const pCount = $('p', node).length
const inputCount = $('input', node).length
// Looks like a form, too many inputs.
if (inputCount > (pCount / 3)) {
return $(node).remove()
}
const contentLength = content.length
const imgCount = $('img', node).length
// Content is too short, and there are no images, so
// this is probably junk content.
if (contentLength < 25 && imgCount === 0) {
return $(node).remove()
}
const density = linkDensity($(node))
// Too high of link density, is probably a menu or
// something similar.
// console.log(weight, density, contentLength)
if (weight < 25 && density > 0.2 && contentLength > 75) {
return $(node).remove()
}
// Too high of a link density, despite the score being
// high.
if (weight >= 25 && density > 0.5) {
// Don't remove the node if it's a list and the
// previous sibling starts with a colon though. That
// means it's probably content.
const nodeIsList = node.tagName === 'ol' || node.tagName === 'ul'
if (nodeIsList) {
const previousNode = $(node).prev()
if (previousNode && normalizeSpaces(previousNode.text()).slice(-1) === ':') {
return
}
}
return $(node).remove()
}
const scriptCount = $('script', node).length
// Too many script tags, not enough content.
if (scriptCount > 0 && contentLength < 150) {
return $(node).remove()
}
}
}
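
linkDensity is imported from ./index but its implementation is not part of this diff; the 0.2 and 0.5 thresholds above read most naturally if it is the ratio of link text to total text, roughly as in this sketch (an assumption, not the committed code):

// Assumed shape of linkDensity -- for illustration only.
function linkDensitySketch($node) {
  const textLength = $node.text().length
  if (textLength === 0) return 0
  const linkLength = $node.find('a').text().length
  return linkLength / textLength
}
// e.g. a <ul> whose items are all links scores ~1.0, so a node with
// weight < 25 and more than 75 characters of such text gets removed.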

@ -0,0 +1,70 @@
import assert from 'assert'
import cheerio from 'cheerio'
import HTML from '../fixtures/html'
import { assertClean } from './test-helpers'
import { cleanTags } from './index'
describe('cleanTags($)', () => {
it("drops a matching node with a negative score", () => {
let $ = cheerio.load(HTML.dropNegativeScore.before)
let result = cleanTags($('*').first(), $)
assertClean(result.html(), HTML.dropNegativeScore.after)
})
it("removes a node with too many inputs", () => {
let $ = cheerio.load(HTML.removeTooManyInputs.before)
let result = cleanTags($('*').first(), $)
$('[score]').each((i, e) => $(e).removeAttr('score'))
assertClean(result.html(), HTML.removeTooManyInputs.after)
})
it("removes a div with no images and very little text", () => {
let $ = cheerio.load(HTML.removeShortNoImg.before)
let result = cleanTags($('*').first(), $)
$('[score]').each((i, e) => $(e).removeAttr('score'))
assertClean(result.html(), HTML.removeShortNoImg.after)
})
it("removes a node with a link density that is too high", () => {
let $ = cheerio.load(HTML.linkDensityHigh.before)
let result = cleanTags($('*').first(), $)
$('[score]').each((i, e) => $(e).removeAttr('score'))
assertClean(result.html(), HTML.linkDensityHigh.after)
})
it("removes a node with a good score but link density > 0.5", () => {
let $ = cheerio.load(HTML.linkDensityHigh.before)
let result = cleanTags($('*').first(), $)
$('[score]').each((i, e) => $(e).removeAttr('score'))
assertClean(result.html(), HTML.linkDensityHigh.after)
})
it("keeps node with a good score but link density > 0.5 if preceding text ends in colon", () => {
let $ = cheerio.load(HTML.previousEndsInColon.before)
let result = cleanTags($('*').first(), $)
assertClean(result.html(), HTML.previousEndsInColon.before)
})
it("keeps anything with a class of entry-content-asset", () => {
let $ = cheerio.load(HTML.cleanEntryContentAsset.before)
let result = cleanTags($('*').first(), $)
assertClean(result.html(), HTML.cleanEntryContentAsset.before)
})
})

@ -25,7 +25,7 @@ function convertDivs($) {
const convertable = $(div).children()
.not(DIV_TO_P_BLOCK_TAGS).length == 0
if (convertable) {
convertNodeTo(div, $)
convertNodeTo(div, $, 'p')
}
})
@ -36,7 +36,7 @@ function convertSpans($) {
$('span').each((index, span) => {
const convertable = $(span).parents('p, div').length == 0
if (convertable) {
convertNodeTo(span, $)
convertNodeTo(span, $, 'p')
}
})

@ -2,5 +2,13 @@
export { default as stripUnlikelyCandidates } from './strip-unlikely-candidates'
export { default as brsToPs } from './brs-to-ps'
export { default as paragraphize } from './paragraphize'
export { default as rewriteTopLevel } from './rewrite-top-level'
export { default as cleanImages } from './clean-images'
export { default as stripJunkTags } from './strip-junk-tags'
export { default as cleanHOnes } from './clean-h-ones'
export { default as cleanAttributes } from './clean-attributes'
export { default as removeEmpty } from './remove-empty'
export { default as cleanTags } from './clean-tags'
export { default as cleanHeaders } from './clean-headers'
export { textLength, linkDensity } from './link-density'
export { convertToParagraphs, convertNodeTo } from './convert-to-paragraphs'

@ -0,0 +1,7 @@
import { REMOVE_EMPTY_SELECTORS } from '../constants'
export default function removeEmpty(article, $) {
$(REMOVE_EMPTY_SELECTORS, article).remove()
return $
}

@ -0,0 +1,25 @@
import assert from 'assert'
import cheerio from 'cheerio'
import HTML from '../fixtures/html'
import { assertClean } from './test-helpers'
import { removeEmpty } from './index'
describe('removeEmpty($)', () => {
it("removes empty P tags", () => {
let $ = cheerio.load(HTML.removeEmptyP.before)
let result = removeEmpty($('*').first(), $)
assertClean(result.html(), HTML.removeEmptyP.after)
})
it("does not remove empty DIV tags", () => {
let $ = cheerio.load(HTML.removeEmptyP.before)
let result = removeEmpty($('*').first(), $)
assertClean(result.html(), HTML.removeEmptyP.after)
})
})

@ -0,0 +1,13 @@
import { convertNodeTo } from './index'
// Rewrite the tag name to div if it's a top level node like body or
// html to avoid later complications with multiple body tags.
export default function rewriteTopLevel(article, $) {
// I'm not using context here because
// it's problematic when converting the
// top-level/root node - AP
$ = convertNodeTo($('html'), $, 'div')
$ = convertNodeTo($('body'), $, 'div')
return $
}

@ -0,0 +1,18 @@
import assert from 'assert'
import cheerio from 'cheerio'
import HTML from '../fixtures/html'
import { assertClean } from './test-helpers'
import { rewriteTopLevel } from './index'
describe('rewriteTopLevel(node, $)', () => {
it("turns html and body tags into divs", () => {
let $ = cheerio.load(HTML.rewriteHTMLBody.before)
let result = rewriteTopLevel($('html').first(), $)
assertClean(result.html(), HTML.rewriteHTMLBody.after)
})
})

@ -0,0 +1,9 @@
import {
STRIP_OUTPUT_TAGS
} from '../constants'
export default function stripJunkTags(article, $) {
$(STRIP_OUTPUT_TAGS.join(','), article).remove()
return $
}

@ -0,0 +1,20 @@
import assert from 'assert'
import cheerio from 'cheerio'
import HTML from '../fixtures/html'
import { assertClean } from './test-helpers'
import { stripJunkTags } from './index'
describe('stripJunkTags($)', () => {
it("strips script and other junk tags", () => {
let $ = cheerio.load(HTML.stripsJunk.before)
let result = stripJunkTags($('*').first(), $)
assertClean(result.html(), HTML.stripsJunk.after)
})
})

@ -235,6 +235,436 @@ const HTML = {
linkDensity0: `
<div><p><a href=""></a></p></div>
`,
// rewriteTopLevel
rewriteHTMLBody: {
before: `
<html><body><div><p><a href="">Wow how about that</a></p></div></body></html>
`,
after: `
<div><div><div><p><a href="">Wow how about that</a></p></div></div></div>
`
},
// cleanImages
cleanSmallImages: {
before: `
<div>
<img width="5" height="5" />
<img width="50" />
</div>
`,
after: `
<div>
<img width="50">
</div>
`
},
cleanHeight: {
before: `
<div>
<img width="50" height="50" />
</div>
`,
after: `
<div>
<img width="50">
</div>
`
},
cleanSpacer: {
before: `
<div>
<img src="/foo/bar/baz/spacer.png" />
<img src="/foo/bar/baz/normal.png" />
<p>Some text</p>
</div>
`,
after: `
<div>
<img src="/foo/bar/baz/normal.png">
<p>Some text</p>
</div>
`
},
// stripJunkTags
stripsJunk: {
before: `
<div>
<style>.red { color: 'red'; }</style>
<title>WOW</title>
<link rel="asdflkjawef" />
<p>What an article</p>
<script type="text/javascript">alert('hi!');</script>
<noscript>Don't got it</noscript>
<hr />
</div>
`,
after: `
<div>
<p>What an article</p>
</div>
`
},
// stripHOnes
removeTwoHOnes: {
before: `
<div>
<h1>Look at this!</h1>
<p>What do you think?</p>
<h1>Can you believe it?!</h1>
</div>
`,
after: `
<div>
<p>What do you think?</p>
</div>
`
},
convertThreeHOnes: {
before: `
<div>
<h1>Look at this!</h1>
<p>What do you think?</p>
<h1>Can you believe it?!</h1>
<p>What do you think?</p>
<h1>Can you believe it?!</h1>
</div>
`,
after: `
<div>
<h2>Look at this!</h2>
<p>What do you think?</p>
<h2>Can you believe it?!</h2>
<p>What do you think?</p>
<h2>Can you believe it?!</h2>
</div>
`
},
// cleanAttributes
removeStyle: {
before: `
<div>
<p style="color: red;">What do you think?</p>
</div>
`,
after: `
<div>
<p>What do you think?</p>
</div>
`
},
removeAlign: {
before: `
<div>
<p style="color: red;" align="center">What do you think?</p>
</div>
`,
after: `
<div>
<p>What do you think?</p>
</div>
`
},
// removeEmpty
removeEmptyP: {
before: `
<div>
<p>What do you think?</p>
<p></p>
</div>
`,
after: `
<div>
<p>What do you think?</p>
</div>
`
},
doNotRemoveBr: {
before: `
<div>
<p>What do you think?</p>
<p></p>
<div></div>
<p>What do you think?</p>
</div>
`,
after: `
<div>
<p>What do you think?</p>
<div></div>
<p>What do you think?</p>
</div>
`
},
doNotNested: {
before: `
<div>
<p>What do you think?</p>
<p><img src="foo/bar.jpg" /></p>
<p><iframe src="foo/bar.jpg" /></p>
<p>What do you think?</p>
</div>
`,
after: `
<div>
<p>What do you think?</p>
<p><img src="foo/bar.jpg" /></p>
<p>What do you think?</p>
</div>
`
},
// cleanConditionally
dropNegativeScore: {
before: `
<div>
<p>What do you think?</p>
<p>
<ul score="-10">
<li>Foo</li>
<li>Bar</li>
</ul>
</p>
<p>What do you think?</p>
</div>
`,
after: `
<div>
<p>What do you think?</p>
<p>
</p>
<p>What do you think?</p>
</div>
`
},
removeTooManyInputs: {
before: `
<div>
<p>What do you think?</p>
<p>What do you think?</p>
<p>What do you think?</p>
<p>What do you think?</p>
<p>What do you think?</p>
<p>What do you think?</p>
<p>What do you think?</p>
<div>
<p>What is your name?</p>
<input type="text"></input>
<p>What is your name?</p>
<input type="text"></input>
<p>What is your name?</p>
<input type="text"></input>
</div>
<p>What do you think?</p>
</div>
`,
after: `
<div>
<p>What do you think?</p>
<p>What do you think?</p>
<p>What do you think?</p>
<p>What do you think?</p>
<p>What do you think?</p>
<p>What do you think?</p>
<p>What do you think?</p>
<p>What do you think?</p>
</div>
`
},
removeShortNoImg: {
before: `
<div>
<p>What do you think?</p>
<div>
<p>Keep this one</p>
<img src="asdf" />
</div>
<div>
<p>Lose this one</p>
</div>
</div>
`,
after: `
<div>
<p>What do you think?</p>
<div>
<p>Keep this one</p>
<img src="asdf">
</div>
</div>
`
},
linkDensityHigh: {
before: `
<div score="0">
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu. In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus. Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu.</p>
<ul>
<li>Keep this one</li>
<li>Keep this one</li>
<li>Keep this one</li>
<li>Keep this one</li>
<li>Keep this one</li>
<li>Keep this one</li>
<li>Keep this one</li>
</ul>
<ul score="20">
<li><a href="#">Lose this one</a></li>
<li><a href="#">Lose this one</a></li>
<li><a href="#">Lose this one</a></li>
<li><a href="#">Lose this one</a></li>
</ul>
</div>
`,
after: `
<div>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu. In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus. Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu.</p>
<ul>
<li>Keep this one</li>
<li>Keep this one</li>
<li>Keep this one</li>
<li>Keep this one</li>
<li>Keep this one</li>
<li>Keep this one</li>
<li>Keep this one</li>
</ul>
</div>
`
},
goodScoreTooDense: {
before: `
<div>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu. In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus. Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu.</p>
<ul>
<li>Keep this one</li>
<li>Keep this one</li>
<li>Keep this one</li>
<li>Keep this one</li>
<li>Keep this one</li>
<li>Keep this one</li>
<li>Keep this one</li>
</ul>
<ul score="30">
<li><a href="#">Lose this one</a></li>
<li><a href="#">Lose this one</a></li>
<li><a href="#">Lose this one</a></li>
<li><a href="#">Lose this one</a></li>
<li><a href="#">Lose this one</a></li>
<li><a href="#">Lose this one</a></li>
<li><a href="#">Lose this one</a></li>
<li><a href="#">Lose this one</a></li>
</ul>
</div>
`,
after: `
<div>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu. In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus. Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu.</p>
<ul>
<li>Keep this one</li>
<li>Keep this one</li>
<li>Keep this one</li>
<li>Keep this one</li>
<li>Keep this one</li>
<li>Keep this one</li>
<li>Keep this one</li>
</ul>
</div>
`
},
previousEndsInColon: {
before: `
<div weight="40">
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu. In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus. Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu.</p>
<p>Now read these links: </p>
<ul score="30">
<li><a href="#">Lose this one</a></li>
<li><a href="#">Lose this one</a></li>
<li><a href="#">Lose this one</a></li>
<li><a href="#">Lose this one</a></li>
<li><a href="#">Lose this one</a></li>
<li><a href="#">Lose this one</a></li>
<li><a href="#">Lose this one</a></li>
<li><a href="#">Lose this one</a></li>
</ul>
</div>
`,
},
cleanEntryContentAsset: {
before: `
<div score="100">
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu. In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus. Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu.</p>
<ul score="20" class="entry-content-asset">
<li><a href="#">Lose this one</a></li>
<li><a href="#">Lose this one</a></li>
<li><a href="#">Lose this one</a></li>
<li><a href="#">Lose this one</a></li>
</ul>
</div>
`,
},
// normalizeSpaces
normalizeSpaces: {
before: `
<div>
<p>What do you think?</p>
</div>
`,
after: `What do you think?`
},
// cleanHeaders
cleanFirstHeds: {
before: `
<div>
<h2>Lose me</h2>
<p>What do you think?</p>
<h2>Keep me</h2>
<p>What do you think?</p>
</div>
`,
after: `
<div>
<p>What do you think?</p>
<h2>Keep me</h2>
<p>What do you think?</p>
</div>
`
},
cleanTitleMatch: {
before: `
<div>
<p>What do you think?</p>
<h2>Title Match</h2>
<p>What do you think?</p>
</div>
`,
after: `
<div>
<p>What do you think?</p>
<p>What do you think?</p>
</div>
`
},
dropWithNegativeWeight: {
before: `
<div>
<p>What do you think?</p>
<h2 class="advert">Bad Class, Bad Weight</h2>
<p>What do you think?</p>
</div>
`,
after: `
<div>
<p>What do you think?</p>
<p>What do you think?</p>
</div>
`
},
}
export default HTML
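
For reference, the specs consume these before/after pairs the same way the normalizeSpaces spec further down does: load `before` into cheerio, run the transform, and compare the rendered markup against `after`. A minimal sketch for the cleanImages fixtures (the `cleanImages(article, $)` name and signature are assumptions here, not something this diff confirms):

import assert from 'assert'
import cheerio from 'cheerio'

import HTML from '../fixtures/html'
import { cleanImages } from './index'

// rendered markup and the indented fixture strings differ only in whitespace
const normalize = s => s.trim().replace(/\s+/g, ' ')

describe('cleanImages(article, $)', () => {
  it("removes spacer-sized images", () => {
    const $ = cheerio.load(HTML.cleanSmallImages.before)

    // run the transform on the loaded fixture, then compare markup
    const result = cleanImages($('div').first(), $)
    assert.equal(normalize($.html(result)), normalize(HTML.cleanSmallImages.after))
  })
})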

@ -1,10 +1,16 @@
import {
getScore,
getOrInitScore,
setScore,
} from './index'
export default function addScore(node, $, amount) {
const score = getScore(node, $) + amount
setScore(node, $, score)
return node
try {
const score = getOrInitScore(node, $) + amount
setScore(node, $, score)
} catch(e) {
console.debug(e)
} finally {
return node
}
}
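
Assembled from the hunk above, the updated helper now reads roughly as follows (a sketch; only the pieces shown in this diff are assumed). getOrInitScore replaces getScore so un-scored nodes get a starting score before the bump, and the try/finally guarantees the node is handed back to the caller even if scoring throws:

import {
  getOrInitScore,
  setScore,
} from './index'

// Bump a node's score by `amount`, initializing the score first when the
// node has not been scored yet. Always returns the node so calls can chain.
export default function addScore(node, $, amount) {
  try {
    const score = getOrInitScore(node, $) + amount
    setScore(node, $, score)
  } catch (e) {
    console.debug(e)
  } finally {
    return node
  }
}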

@ -85,8 +85,8 @@ export function mergeSiblings(candidate, topScore, $) {
if (newScore >= siblingScoreThreshold) {
return wrappingDiv.append(child)
} else if (node.tagName === 'p') {
childContentLength = textLength(child.text())
} else if (child.tagName === 'p') {
const childContentLength = textLength($(child).text())
if (childContentLength > 80 && density < .25) {
return wrappingDiv.append(child)
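
For clarity, the corrected branch reads roughly as below (a sketch of just this conditional; siblingScoreThreshold, density, wrappingDiv, and textLength come from earlier in mergeSiblings, and any branches after this one are omitted). The fix is that the paragraph check and its content length now look at child, the sibling under consideration, rather than the undefined node:

// Keep a sibling when its score clears the threshold, or when it is a
// long-enough, link-sparse paragraph.
if (newScore >= siblingScoreThreshold) {
  return wrappingDiv.append(child)
} else if (child.tagName === 'p') {
  const childContentLength = textLength($(child).text())

  if (childContentLength > 80 && density < .25) {
    return wrappingDiv.append(child)
  }
}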

@ -47,13 +47,11 @@ describe('findTopCandidate($)', () => {
it("appends a sibling with a good enough score", () => {
const html = fs.readFileSync('../fixtures/latimes.html', 'utf-8')
.replace(/<!--[\s\S]*?-->/g, '')
let $ = cheerio.load(html)
$ = scoreContent($)
const topCandidate = findTopCandidate($)
assert.equal($(topCandidate).text().length, 3652)
})
})

@ -55,7 +55,7 @@ describe('getOrInitScore(node, $)', () => {
const score = getOrInitScore(node, $)
assert.equal(getScore(node.parent(), $), score/4)
assert.equal(getScore(node.parent(), $), 16)
})
})
})

@ -18,7 +18,7 @@ describe('scoreContent($, weightNodes)', () => {
const $ = cheerio.load(HTML.hNews.before)
const result = scoreContent($).html()
assert.equal(getScore($('div').first(), $), 110)
assert.equal(getScore($('div').first(), $), 140)
// assert.equal(getScore($('div').first(), $), 99)
})
@ -27,7 +27,7 @@ describe('scoreContent($, weightNodes)', () => {
const result = scoreContent($).html()
// assert.equal(getScore($('div').first(), $), 38)
assert.equal(getScore($('div').first(), $), 60)
assert.equal(getScore($('div').first(), $), 65)
})
it("scores this Wired article the same", () => {
@ -36,7 +36,7 @@ describe('scoreContent($, weightNodes)', () => {
const result = scoreContent($).html()
// assert.equal(getScore($('article').first(), $), 63.75)
assert.equal(getScore($('article').first(), $), 70.5)
assert.equal(getScore($('article').first(), $), 65.5)
})
// it("scores this NYT article the same", () => {

@ -0,0 +1,2 @@
export { default as normalizeSpaces } from './normalize-spaces'

@ -0,0 +1,5 @@
// escape the backslash so the constructor builds \s (whitespace), and use the
// global flag so every run of spaces is collapsed, not just the first
const NORMALIZE_RE = new RegExp('\\s{2,}', 'g')
export default function normalizeSpaces(text) {
return text.replace(NORMALIZE_RE, ' ').trim()
}
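
A quick usage sketch of the helper (behavior shown assumes the escaped, global regex above):

import normalizeSpaces from './normalize-spaces'

// runs of whitespace collapse to a single space; the ends are trimmed
normalizeSpaces('   What   do you\n\tthink?  ')
// => 'What do you think?'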

@ -0,0 +1,16 @@
import assert from 'assert'
import cheerio from 'cheerio'
import HTML from '../fixtures/html'
import { normalizeSpaces } from './index'
describe('normalizeSpaces(text)', () => {
it("normalizes spaces from text", () => {
let $ = cheerio.load(HTML.normalizeSpaces.before)
let result = normalizeSpaces($('*').first().text())
assert.equal(result, HTML.normalizeSpaces.after)
})
})

@ -0,0 +1 @@
<div><div id="area-article-first-block" cla ="area" score="48.25"><div id="mod-a-body-first-para" class="mod-latarticlesarticletext mod-articletext" score="56.875"><p score="3.5">SACRAMENTO &#x2014; The influential lobby group Consumer Electronics Assn. is fighting what appears to be a losing battle to dissuade California regulators from passing the nation&apos;s first ban on energy-hungry big-screen televisions.</p><p score="7">On Tuesday, executives and consultants for the Arlington, Va., trade group asked members of the California Energy Commission to instead let consumers use their wallets to decide whether they want to buy the most energy-saving new models of liquid-crystal display and plasma high-definition TVs.</p><p score="7">&quot;Voluntary efforts are succeeding without regulations,&quot; said Doug Johnson, the association&apos;s senior director for technology policy. Too much government interference could hamstring industry innovation and prove expensive to manufacturers and consumers, he warned.</p><p score="4">But those pleas didn&apos;t appear to elicit much support from commissioners at a public hearing on the proposed rules that would set maximum energy-consumption standards for televisions to be phased in over two years beginning in January 2011. A vote could come as early as Nov. 4.</p></div></div><div id="mod-a-body-after-first-para" class="mod-latarticlesarticletextwithadcpc mod-latarticlesarticletext mod-articletext" score="109.725"><p score="6">The association&apos;s views weren&apos;t shared by everyone in the TV business. Representatives of some TV makers, including top-seller Vizio Inc. of Irvine, said they would have little trouble complying with tighter state standards without substantially increasing prices.</p><p score="4.02">&quot;We&apos;re comfortable with our ability to meet the proposed levels and implementation dates,&quot; said Kenneth R. Lowe, Vizio&apos;s co-founder and vice president.</p><p score="5.64">Last month, the commission formally unveiled its proposal to require manufacturers to limit television energy consumption in a way that has been done with refrigerators, air conditioners and dozens of other products since the 1970s.</p><p score="7">&quot;We would not propose TV efficiency standards if we thought there was any evidence in the record that they will hurt the economy,&quot; said Commissioner Julia Levin, who has been in charge of the two-year rule-making procedure. &quot;This will actually save consumers money and help the California economy grow and create new clean, sustainable jobs.&quot;</p><p score="5.04">Tightening efficiency ratings by using new technology and materials should result in &quot;zero increase in cost to consumers,&quot; said Harinder Singh, an Energy Commission staffer on the TV regulation project.</p><p score="5">California&apos;s estimated 35 million TVs and related electronic devices account for about 10% of all household electricity consumption, the Energy Commission staff reported. But manufacturers quickly are coming up with new technologies that are making even 50-inch-screen models much more economical to operate.</p><p score="7.88">New features, such as light-emitting diodes that consume tiny amounts of power, special reflective films and sensors that automatically adjust TV brightness to a room&apos;s viewing conditions, are driving down electricity consumption, experts said.</p><p score="6.2">The payoff could be big for TV owners, said Ken Rider, a commission staff engineer. 
Average first-year savings from reduced electricity use would be an estimated $30 per set and $912 million statewide, he said.</p><p score="7">If all TVs met state standards, Rider added, California could avoid the $600-million cost of building a natural-gas-fired power plant. Switching to more-efficient TVs could have an estimated net benefit to the state of $8.1 billion, the commission staff reported.</p><p score="10">Consumer Electronics Assn. officials disputed that figure, arguing that it was based on out-of-date numbers that fail to account for recent industry innovations. &quot;With voluntary compliance, manufacturers can meet the targets over time, managing the cost impact, yet not in any way impeding innovation,&quot; said Seth Greenstein, an association consultant.</p><p score="0">--</p><p score="0"><a href="mailto:marc.lifsher@latimes.com">marc.lifsher@latimes.com</a></p></div></div>