mercury-parser

Commit Graph

Author	SHA1	Message	Date
Sarah Doire	c0364ec52b	feat: update all fixtures and custom parsers to match (#713 ) * feat: Refactor and update fixtures This patch changes how fixtures are stored. Previously, a fixture's folder identified its domain and its filename identified when it was fetched. This has been changed so that the filename indicates the domain and the modified time of the file indicates how recently it was fetched. A fixture's filename can optionally include a modifier to distinguish between two different page types on the same domain, for example. Also included here are changes to the update-fixture script, both to accommodate the new filename scheme as well as to actually update all fixtures. The functionality for running automatically and opening PRs has been removed but will likely be reintroduced. Finally, all fixtures have been updated. * Remove reference to deleted extractor * feat: first batch of test and parser updates due to new fixtures * feat: update more custom parsers and unit tests * feat: update more custom parsers and unit tests and remove unnecessary parser * feat: update more custom parsers and unit tests * feat: update more parsers and add correct bloomberg html files * fix: remove console statement * feat: all parsers updated and tests passing * fix: update date_published tests to account for test server time difference * fix: cleanup remaining fixtures in folders * feat: move fixtures for newest custom parsers * feat: remove script changes * fix: update dist files to account for reverting script changes * adding .DS_Store to .gitignore * adding .DS_Store to .gitignore -- 2 * adding .DS_Store to .gitignore -- 3 lol * cleaning up some tests * fix: ran build:generator command to update generate-custom-parser dist file * fix: update rollup configs to generate source maps and update source maps * fix: use underscore in place of unused error variable * fix: remove unused fixture Co-authored-by: Postlight Bot <adam.pash+postlight-bot@postlight.com> Co-authored-by: flbn <overasc@gmail.com>	1 year ago
John Holdun	112846f74f	chore: Inline test fixtures (#683 ) Not to be confused with extractor fixtures, which are snapshots of a webpage. This change removes the pattern of separate JS files that provide "fixtures" for tests, which are used as provided or expected strings in tests. They were inconsistent and disorganized, and generally just served to add indirection to test files. So now all those strings are defined where they are used in their respective tests.	2 years ago
John Holdun	f259d13753	feat: Add figcaption to list of non-convertible span parents (#682 ) Based on this comment: https://github.com/postlight/mercury-parser/issues/530#issuecomment-580105171	2 years ago
Nate Weaver	de314a9728	Add li to the list of non-convertible parents for spans (#531 ) Co-authored-by: John Holdun <john@johnholdun.com>	2 years ago
Toufic Mouallem	3f46859d14	fix: skip absolutizing invalid srcsets (#386 ) * fix: skip absolutizing empty srcsets * test: empty srcsets are handled properly	5 years ago
Toufic Mouallem	3614e31abc	fix: skip absolutizing empty hrefs (#372 )	5 years ago
Drew Bell	b3e2a0ffd1	feat: extract custom types with extend option (#313 ) * feat: extract custom types with extend option Adds an `extend` option that lets you add custom types to be extracted and returned alongside the defaults, either in a call to `parse()` or in a custom extractor. ``` Mercury.parse(url, { extend: { last_edited: { selectors: ['#last-edited'], defaultCleaner: false } } }) ``` * chore: use Reflect.ownKeys * feat: add CLI options * doc: add extend param to cli help * refactor: extract selectExtendedTypes * feat: only overwrite null extended results * feat: add allowMultiple extraction option * feat: accept extendList CLI args * feat: allow attribute selectors in extends on CLI * test: update extend tests * fix: don't invoke cleaner for custom types * feat: always return array if allowMultiple * test: add test for array of single result * refactor: extract extractHtml * refactor: destructure allowMultiple * fix: wrap multiple matches in $ for cheerio shim * fix: find extended types before any other munging * feat: absolutize all links * fix: clean content more directly * doc: Update CLI docs in README * chore: update dist * doc: Document extend in custom extractor README	5 years ago
Toufic Mouallem	136d6df798	feat: Return specific errors on failed parse attempts	5 years ago
Toufic Mouallem	a250f403f5	fix: Preserve whitespace in certain HTML elements (#333 )	5 years ago
Toufic Mouallem	0940971069	fix: better handling for responsive images (#312 )	5 years ago
Toufic Mouallem	7844129fda	feat: Add custom parser for Reddit (#307 )	5 years ago
Ben Ubois	0e27448866	feat: Various Character Encoding Improvements (#270 ) * Support HTML5 charset tag In HTML5 `<meta charset="">` is shorthand for `<meta http-equiv="content-type" content="">` https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta * Handle more character encoding declaration methods.	5 years ago
Adam Pash	663cc45bf4	fresh run of prettier; remove NOTES.md (#233 )	5 years ago
Toufic Mouallem	bb6ad2682b	fix: Transform relative URLs in srcset attributes to absolute URLs (#190 )	5 years ago
Jad Termsani	15a5229998	fix: womansay.net image urls (#196 )	5 years ago
Adam Pash	76d333f0be	deps: upgrade (#218 )	5 years ago
George Haddad	56badb51f5	dx: remove unnec comments in source (#205 ) * dx: remove commented code and obvious comments that can be looked up * dx: remove commented out eslint options * dx: remove commented out code * dx: remove commented out code * dx: remove commented out code * dx: remove test block as all its code was commented out * dx: remove commented out code * dx: remove commented out code * dx: remove commented out code * dx: remove regex example comments * dx: remove commented out code * dx: remove commented out code * dx: remove commented out import * dx: remove commented out code * dx: remove commented out code * dx: remove commented out code * dx: remove commented out code * dx: remove commented out code * dx: remove commented out code * dx: remove commented out code * dx: remove commented out code * dx: remove commented out code * dx: remove commented out code * dx: remove commented out code * chore: remove empty files * chore: re-prettier code that may have missed it * added back nec comments	5 years ago
Adam Pash	e4b057f9ea	chore: update node and some deps (#209 ) * chore: update .nvmrc * added prettier and pre-commit hooks * update docker image to new node * add karma-cli to get web tests working * explicitly install karma... seems to fix problem * remove pre-built phantomjs * swap install order	5 years ago
Adam Pash	61f0f4e1af	fix: kept elements being removed (#166 ) Elements marked to keep were removable under specific circumstances. This PR fixes these edge cases.	7 years ago
Adam Pash	453419de72	feat: improve wh.gov parser (#163 ) * feat: support youtube-nocookie domain * feat: updated wh.gov parser to support speeches	7 years ago
Kevin Ngao	afbef9bc39	Fix Encoding on Body (#143 ) * fix: check encoding on body	7 years ago
Adam Pash	3297ab079d	feat: bloomberg extractor (#59 ) Bloomberg has several templates. I'm supporting three different templates here, but I'm not sure that this is complete by any means. It's also worth noting that SVGs don't make it through the parser terribly well for many reasons. One, for example, is that a lot of SVGs require custom CSS in order for them to make sense. I'm not sure this is something we can expect to address in the parser.	8 years ago
Adam Pash	783a9cfb2f	fix: changed overly liberal regex for removing transparent images	8 years ago
Adam Pash	7411922c55	feat: encoding response body based on content-type charset (#21 ) Also some small code organization	8 years ago
Adam Pash	60a6861e18	Feat: browser support (#19 ) Big undertaking to support Mercury in the browser. Builds are working and all tests are passing both for web and node builds. Most code is closely shared.	8 years ago
Adam Pash	65c641a879	feat: enforcing line break rules in linter	8 years ago
Adam Pash	de5b120b79	feat: allowing extractors to support multiple domains	8 years ago
Adam Pash	007ddec8ac	feat: allowing iframes from src domain	8 years ago
Adam Pash	17317823de	fix: bug that stopped proper attr cleaning in certain cases	8 years ago
Adam Pash	38c90d239e	fix: removeEmpty shouldn't remove elements with images or iframes inside	8 years ago
Adam Pash	d3b11be473	feat: keeping youtube and vimeo iframe embeds (#14 ) * feat: keeping youtube and vimeo iframe embeds * fix: removing class from article correctly	8 years ago
Adam Pash	3b87b557be	feat: pulling score from whitelist	8 years ago
Adam Pash	422deb4600	feat: generator generates potential selectors for all custom selectable fields	8 years ago
Adam Pash	c314e3befa	feat: dek returns null if it's basically the same as the excerpt Squashed commit of the following: commit 0ee7d51ce609ad23d2deca1af41e7b4e56681bd7 Author: Adam Pash <adam.pash@gmail.com> Date: Mon Oct 10 15:44:28 2016 -0700 feat: dek does not return if it's basically the same as the excerpt commit 6ad27f994fff3652e04ffe7c81f1ae0b1647e941 Author: Adam Pash <adam.pash@gmail.com> Date: Mon Oct 10 14:35:54 2016 -0700 feat: added excerpt util	8 years ago
Adam Pash	63c06c8a00	fix: babel-polyfill mess (I think)	8 years ago
Adam Pash	173f885674	feat: custom parser + generator + detailed readme instructions Squashed commit of the following: commit 02563daa67712c3679258ebebac60dfa9568dffb Author: Adam Pash <adam.pash@gmail.com> Date: Fri Sep 30 12:25:44 2016 -0400 updated readme, added newyorker parser for readme guide commit 0ac613ef823efbffbf4cc9a89e5cb2489d1c4f6f Author: Adam Pash <adam.pash@gmail.com> Date: Fri Sep 30 11:16:52 2016 -0400 feat: updated parser so the saved fixture absolutizes urls commit 85c7a2660b21f95c2205ca4a4378a7570687fed0 Author: Adam Pash <adam.pash@gmail.com> Date: Fri Sep 30 10:15:26 2016 -0400 refactor: attribute selectors must be an array for custom extractors commit f60f93d5d3d9b2f2d9ec6f28d27ae9dcf16ef01e Author: Adam Pash <adam.pash@gmail.com> Date: Thu Sep 29 10:13:14 2016 -0400 fix: whitelisting srcset and alt attributes commit e31cb1f4e8a9fc9c3d9b20ef9f40ca6c8d6ad51a Author: Adam Pash <adam.pash@gmail.com> Date: Thu Sep 29 09:44:21 2016 -0400 some housekeeping for coverage tests commit 39eafe420c776a1fe7f9fea634fb529a3ed75a71 Author: Adam Pash <adam.pash@gmail.com> Date: Wed Sep 28 17:52:08 2016 -0400 fix: word count for multi-page articles commit b04e0066b52f190481b1b604c64e3d0b1226ff02 Author: Adam Pash <adam.pash@gmail.com> Date: Thu Sep 22 10:40:23 2016 -0400 major improvements to output commit 3f3a880b63b47fe21953485da670b6e291ac60e5 Author: Adam Pash <adam.pash@gmail.com> Date: Wed Sep 21 17:27:53 2016 -0400 updated test command commit 14503426557a870755453572221d95c92cff4bd2 Author: Adam Pash <adam.pash@gmail.com> Date: Wed Sep 21 16:00:30 2016 -0400 shortened generator command commit 5ebd8343cd4b87b3f5787dab665bff0de96846e1 Author: Adam Pash <adam.pash@gmail.com> Date: Wed Sep 21 15:59:14 2016 -0400 feat: can disable fallback to generic parser (this will be useful for testing custom parsers)	8 years ago
Adam Pash	8dc6042dc9	build for comparisons	8 years ago
Adam Pash	cbd0636dcf	chore: cleaned up python and other unneeded comments	8 years ago
Adam Pash	bf13b38a9b	feat: some basic error handling for bad urls	8 years ago
Adam Pash	7c375aded7	chore: cleanup	8 years ago
Adam Pash	6263e505d5	fix: handling case where node.get(0) returns null	8 years ago
Adam Pash	daa9266182	feat: generic extractor for word count Squashed commit of the following: commit 0aba26ef9efba71a72c76fa351a9037e97fc1e9e Author: Adam Pash <adam.pash@gmail.com> Date: Wed Sep 14 14:56:45 2016 -0400 fix: normalizeSpaces regex fix broke a test commit 07d60c1c8c6599d6c94d92e5a70649c28d03d6ea Author: Adam Pash <adam.pash@gmail.com> Date: Wed Sep 14 14:52:41 2016 -0400 feat: generic extractor for word count	8 years ago
Adam Pash	76df30e303	chore: cleanup	8 years ago
Adam Pash	b325a4acdd	chore: clean up junk tests	8 years ago
Adam Pash	62ae330db2	fix: bug in scoring and converting to paragraphs	8 years ago
Adam Pash	7e2a34945f	chore: refactored and linted	8 years ago
Adam Pash	9906bd36a4	chore: moved content scoring out of utils, removed no-longer-necessary utils	8 years ago
Adam Pash	7ec0ed0d31	feat: nextPageUrl handles multi-page articles Squashed commit of the following: commit b5070c0967a7f1a0c0c449ba7ea40aebe8fe4bb8 Author: Adam Pash <adam.pash@gmail.com> Date: Tue Sep 13 10:03:00 2016 -0400 root extractor includes next page url commit 79be83127d5342d89eef33665586fabea227d6b3 Author: Adam Pash <adam.pash@gmail.com> Date: Tue Sep 13 09:58:20 2016 -0400 small score adjustment commit 0f00507dbff43401145a892e849311518edec68a Author: Adam Pash <adam.pash@gmail.com> Date: Mon Sep 12 18:17:38 2016 -0400 feat: nextPageUrl generic parser up and running commit be91c589fc0c6d6f9b573080a76c9b1ac7af710c Author: Adam Pash <adam.pash@gmail.com> Date: Mon Sep 12 11:53:58 2016 -0400 feat: pageNumFromUrl extracts the pagenum of the current url commit ad879d7aabedadfd051c01b42d841703bf4763fa Author: Adam Pash <adam.pash@gmail.com> Date: Mon Sep 12 11:52:37 2016 -0400 feat: isWordpress checks if a page is generated by wordpress	8 years ago
Adam Pash	a89b9b785e	feat: small improvement to author selectors	8 years ago
Adam Pash	74694ba8e2	debugging: cheerio isn't always consistent in setting scores	8 years ago

57 Commits (fix-remove-moment-js)