Bug fix: only partial content is grabbed from many pages (dirty.ru, nytimes.com)
1) Avoid converting whitespace-only text nodes into paragraphs. They create a lot of noise and prevent the sibling-joining logic from working on many pages.
2) Handle the case where adjacent content is located in a sibling of the top candidate's parent rather than in a sibling of the top candidate itself.
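
The two changes can be illustrated with a minimal sketch. This is not the code from this commit: it uses Python's standard xml.dom.minidom, and the helper names wrap_text_nodes and sibling_pool, the max_levels parameter, and the "climb only when the candidate has no element siblings" rule are all assumptions made for illustration.

```python
from xml.dom import minidom


def wrap_text_nodes(div):
    """Wrap non-blank text children of `div` in <p> elements.

    Whitespace-only text nodes are dropped instead of wrapped, so they
    do not turn into empty paragraphs (change 1).
    """
    doc = div.ownerDocument
    for child in list(div.childNodes):
        if child.nodeType != child.TEXT_NODE:
            continue
        if not child.data.strip():
            # Whitespace-only: remove it rather than create a noisy <p>.
            div.removeChild(child)
            continue
        p = doc.createElement("p")
        div.replaceChild(p, child)
        p.appendChild(child)


def sibling_pool(top_candidate, max_levels=1):
    """Return the element nodes to scan for adjacent content.

    Assumed rule: if the top candidate has no element siblings, climb to
    its parent (up to `max_levels` times) and use that node's siblings
    instead (change 2).
    """
    node = top_candidate
    for _ in range(max_levels + 1):
        parent = node.parentNode
        if parent is None or parent.nodeType != parent.ELEMENT_NODE:
            break
        siblings = [n for n in parent.childNodes
                    if n.nodeType == n.ELEMENT_NODE]
        if len(siblings) > 1:
            return siblings
        node = parent
    return [top_candidate]
```

In this sketch, a top candidate wrapped in a single-child container still gets the container's neighbours considered for joining, which is the situation the second change describes.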
parent a58913d975
commit 486927ebd9