174 Commits (fix-remove-moment-js)

Author SHA1 Message Date
Sarah Doire c0364ec52b
feat: update all fixtures and custom parsers to match (#713)
* feat: Refactor and update fixtures

This patch changes how fixtures are stored. Previously, a fixture's folder identified its domain and its filename identified when it was fetched. Now the filename indicates the domain, and the file's modified time indicates how recently it was fetched. A fixture's filename can optionally include a modifier, for example to distinguish between two different page types on the same domain.

Also included here are changes to the update-fixture script, both to accommodate the new filename scheme and to actually update all fixtures. The functionality for running automatically and opening PRs has been removed but will likely be reintroduced.

Finally, all fixtures have been updated.

* Remove reference to deleted extractor

* feat: first batch of test and parser updates due to new fixtures

* feat: update more custom parsers and unit tests

* feat: update more custom parsers and unit tests and remove unnecessary parser

* feat: update more custom parsers and unit tests

* feat: update more parsers and add correct bloomberg html files

* fix: remove console statement

* feat: all parsers updated and tests passing

* fix: update date_published tests to account for test server time difference

* fix: cleanup remaining fixtures in folders

* feat: move fixtures for newest custom parsers

* feat: remove script changes

* fix: update dist files to account for reverting script changes

* adding .DS_Store to .gitignore

* adding .DS_Store to .gitignore -- 2

* adding .DS_Store to .gitignore -- 3 lol

* cleaning up some tests

* fix: ran build:generator command to update generate-custom-parser dist file

* fix: update rollup configs to generate source maps and update source maps

* fix: use underscore in place of unused error variable

* fix: remove unused fixture

Co-authored-by: Postlight Bot <adam.pash+postlight-bot@postlight.com>
Co-authored-by: flbn <overasc@gmail.com>
1 year ago
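
The fixture naming scheme described in commit c0364ec52b (domain in the filename, optional modifier, fetch time taken from the file's modified time) could be read back roughly as follows. This is only a sketch: the `--` separator between domain and modifier, the `.html` extension, and the helper names `parseFixtureName` and `fixtureAgeInDays` are assumptions for illustration and are not confirmed by the commit message.

```javascript
const fs = require('fs');
const path = require('path');

// ASSUMPTION: the modifier is appended to the domain with "--".
// The commit message does not state the actual separator.
const MODIFIER_SEPARATOR = '--';

// Hypothetical helper: split a fixture filename into domain and modifier.
function parseFixtureName(filePath) {
  // ASSUMPTION: fixtures are stored as .html files.
  const base = path.basename(filePath, '.html');
  const idx = base.indexOf(MODIFIER_SEPARATOR);
  if (idx === -1) {
    return { domain: base, modifier: null };
  }
  return {
    domain: base.slice(0, idx),
    modifier: base.slice(idx + MODIFIER_SEPARATOR.length),
  };
}

// Hypothetical helper: derive how long ago a fixture was fetched
// from the file's modified time, as the commit message describes.
function fixtureAgeInDays(filePath, now = Date.now()) {
  const { mtimeMs } = fs.statSync(filePath);
  return (now - mtimeMs) / (24 * 60 * 60 * 1000);
}
```

Under these assumptions, `parseFixtureName('fixtures/www.example.com--video.html')` would yield the domain `www.example.com` and the modifier `video`. A domain that itself contains `--` (such as a punycode host) would need a different separator.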
Sarah Doire 7b68bcd94c
feat: remove obsolete custom extractors (#712) 2 years ago
Andrei Zhemaituk 4981355628
fixed and improved extraction for latest layout of politico.com (#701)
* fixed and improved extraction for latest layout of politico.com

* explicit timezone for politico.com extractor

* handling more layout of politico.com

Co-authored-by: Andrei Zhemaituk <azhemoytuk@workfusion.com>
Co-authored-by: Sarah Doire <sarah.doire@postlight.com>
2 years ago
Andrei Zhemaituk 45bb28e217
custom parser for www.investmentexecutive.com (#700)
Co-authored-by: Andrei Zhemaituk <azhemoytuk@workfusion.com>
Co-authored-by: Sarah Doire <sarah.doire@postlight.com>
2 years ago
Andrei Zhemaituk 6532316973
custom parser for cbc.ca (#699)
Co-authored-by: Andrei Zhemaituk <azhemoytuk@workfusion.com>
Co-authored-by: Sarah Doire <sarah.doire@gmail.com>
Co-authored-by: Sarah Doire <sarah.doire@postlight.com>
2 years ago
Sarah Doire 8ca8a5f7e5
feat: add postlight.com custom extractor (#695) 2 years ago
Simon Reinhardt 035aa65dbc
Added custom extractor for www.spektrum.de (#677)
Co-authored-by: Simon Reinhardt <simon.reinhardt@hype.de>
Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
John Brayton 9a961aa595
feat: Add a custom extractor for www.ndtv.com. (#554)
* feat:Add a custom extractor for ma.ttias.be.

When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows:

* Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight.
* Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3".
* Add class="entry-content-asset" to "ul" elements to avoid them being removed.

* removed redundant comment.

* feat: Add a custom extractor for engadget.com.

* feat: Add a custom extractor for www.ndtv.com.

* Works, but I need to figure out how to make pagination work correctly.

* fixed pagination - would only retrieve first or second page because we would send contentOnly: true on subsequent pages (page 2).
removed failover: true from preview.

* rolled back { fallback: false } option removal

* Clarified comments.

* rolling back yarn.lock changes

Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
John Brayton 143631b4b7
feat: arstechnica.com extractor (#553)
* feat:Add a custom extractor for ma.ttias.be.

When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows:

* Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight.
* Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3".
* Add class="entry-content-asset" to "ul" elements to avoid them being removed.

* removed redundant comment.

* feat: Add a custom extractor for engadget.com.

* Works, but I need to figure how to make pagination work correctly.

* fixed pagination - would only retrieve first or second page because we would send contentOnly: true on subsequent pages (page 2).
removed failover: true from preview.

* rolled back { fallback: false } option removal

* Clarified comments.

Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
John Brayton 3c5c0bdba9
feat: Add a custom extractor for www.engadget.com. (#552)
* feat:Add a custom extractor for ma.ttias.be.

When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows:

* Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight.
* Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3".
* Add class="entry-content-asset" to "ul" elements to avoid them being removed.

* removed redundant comment.

* feat: Add a custom extractor for engadget.com.

Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
Sven Wiegand 13dfe720bd
Custom extractor for www.gruene.de (#485)
* Implemented custom extractor gruene.de

* Cleaner output of custom extractor www.gruene.de

* Updated fixture for www.gruene.de from real page

* Trying to pick image from og:image -- doesn't work ...

Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
Marco Wiedemeyer d0c78911e6
Add a new custom extractor for www.abendblatt.de (#559)
* Add custom extractor for www.abendblatt.de

* update

Co-authored-by: Marco Wiedemeyer <marco.wiedemeyer@ottogroup.com>
Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
Felipe Canejo 6014016283
feat: Add a custom extractor for pastebin.com (#556)
* feat: Add a custom extractor for pastebin.com

* feat: transforms <li> to <p> in pastebin.com

Co-authored-by: Felipe Canejo <felipecanejo@gmail.com>
Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
John Brayton e217648c0b
feat: ma.ttias.be extractor (#551)
* feat:Add a custom extractor for ma.ttias.be.

When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows:

* Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight.
* Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3".
* Add class="entry-content-asset" to "ul" elements to avoid them being removed.

* removed redundant comment.

Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
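
The three content fixes listed for ma.ttias.be in commit e217648c0b could be expressed as extractor transforms along the following lines. This is a sketch rather than the code from the commit: the object name `maTtiasBeSketch` and the `.content` selector are placeholders, and it assumes that a transform function receives a cheerio-style node and that returning a string converts the node to that tag.

```javascript
// Illustrative sketch only, not the extractor from the commit.
const maTtiasBeSketch = {
  domain: 'ma.ttias.be',
  content: {
    // ASSUMPTION: placeholder selector for the article body.
    selectors: ['.content'],
    transforms: {
      // Drop the "id" attribute so the heading is not given a low weight.
      h1: node => {
        node.removeAttr('id');
      },
      // Drop the "id" attribute and demote to h3, since h1 elements
      // are themselves demoted to h2 by the parser.
      h2: node => {
        node.removeAttr('id');
        return 'h3';
      },
      // Mark lists with the class that keeps them from being removed.
      ul: node => {
        node.attr('class', 'entry-content-asset');
      },
    },
  },
};
```

Each transform maps directly to one bullet in the commit message, which keeps the site-specific workaround in the extractor instead of in the generic cleaning logic.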
James Shakespeare 70e99d56cf
Feat: update qz.com selectors and tests (#538)
* feat: update qz.com selectors and tests

* chore: remove out of date fixture
2 years ago
Joe Moon fb44ab0244
Bugfix new yorker wired extractors (#604)
* www.newyorker.com: add updated fixtures and fix extractors

* www.wired.com: add updated fixtures and fix extractors

Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
Nitin Khanna 8c9982247b
feat: Ladbible.com extractor (#624)
* Ladbible.com extractors and test

* CircleCI says timezone needs to be Europe/London aka BST

Co-authored-by: Postlight Bot <adam.pash+postlight-bot@postlight.com>
Co-authored-by: Jad Termsani <32297675+JadTermsani@users.noreply.github.com>
3 years ago
Nitin Khanna 30d6f472ee
feat: Times of India extractor (#503)
* Adding custom parser for Times of India

* moved transforms to clean

The transforms were just working as cleans. Moved things around as per recommendations.

Co-authored-by: Postlight Bot <adam.pash+postlight-bot@postlight.com>
3 years ago
Wajeeh Zantout b0e708aac6
feat: update nytimes extractor (#506)
* feat: update custom extractor for nytimes.com
5 years ago
Michael Ashley e12c916499
feat: ability to add custom extractors via api (#484)
* feat: ability to add custom extractors via api

* docs: updating readme

* fix: example.com was being used in another test

* fix: timezone was messing up date_published test

* fix: using a unique site for testing

* fix: updated custom extractor api

* docs: updating readme

* fix: removing unused fixture

* fix: updating test description

* feat: ability to add custom extractors via cli
5 years ago
Sven Wiegand f95947fe88 Implemented custom extractor epaper.zeit.de (#488) 5 years ago
Michael Ashley 2422e4717d
fix: incorrect parsing on medium.com (#477)
* fix: medium extractor now pulls content

* fix: remove youtube caption if no preview available

* fix: remove youtube node if no image

* fix: removing dek from medium.com extractor
5 years ago
Michael Ashley 0686ee7956
fix: incorrect parsing on theatlantic.com (#475)
* fix: incorrect parsing on theatlantic.com

* chore: updating theatlantic.com tests & fixtures

* chore: removing script data from minified fixture
5 years ago
Michael Ashley 5e33263d25
chore: minifying biorxiv.com fixture (#478) 5 years ago
david0leong 911b0f87c8 Add custom extractor for biorxiv.org (#467)
* Add custom extractor for biorxiv.org

* Fix content selector

* Improve content selector
5 years ago
Ben Ubois 0942c37876 feat: custom parser for phoronix.com. (#431) 5 years ago
Michael P. Geraci 571a913745 feat: pitchfork extractor (#439)
* generate the custom extractor and get the first test to pass

* add the basic extractors (title, author, date, etc)

* select the score as well as the review text, and break the content test

* prepend the score to the content

* get the date from the datetime attribute

* mangle this test a little, but just a little (it does work properly)

* move from prepending the score to the review text to adding it as a custom field in the extractor
5 years ago
david0leong 694ea820aa Custom Extractor for clinicaltrials.gov (#305)
* Add prototype of custom extractor for clinicaltrials.gov

* Add .DS_Store to gitignore

* Make tests for title, author and date_published selectors pass

* Make content selector test pass

* Fix date_published test

* Rebuild

* Remove .DS-Store from gitignore

* Improve extractor and test/fixture of clinicaltrials.gov
5 years ago
Wajeeh Zantout 7c8de71c52 fix: new yorker extractor (#414)
* fix: new yorker extractor

* fix: date_published selector

* fix: remove footer from content

* feat: add additional selector for title

* feat: support article with multiple authors
5 years ago
Wajeeh Zantout e66ad8b81c feat: add le monde extractor (#415) 5 years ago
kik0220 f81dc63617 feat: add rbbtoday.com custom parser (#411)
* feat: add rbbtoday.com custom parser

* fix: content test

* fix: dek and content
5 years ago
kik0220 5e1113b3a9 feat: add japan.zdnet.com custom parser (#410)
* feat: add japan.zdnet.com custom parser

* fix: author and date_published selector
5 years ago
kik0220 77e3bc00e2 feat: add wired.jp custom parser (#409)
* feat: add wired.jp custom parser

* fix: author test

* fix: date_published selector

* test: fix dek and content

* test: fix content (without clean dek)
5 years ago
kik0220 0b36c96de0 feat: add techlog.iij.ad.jp custom parser (#405)
* feat: add techlog.iij.ad.jp custom parser

* fix: date_published and content selector
5 years ago
kik0220 406bf1b1a9 feat: add weekly.ascii.jp custom parser (#401)
* feat: add weekly.ascii.jp custom parser

* fix: title and date_published selector
5 years ago
kik0220 216bfade00 feat: add www.ipa.go.jp custom parser (#408) 5 years ago
kik0220 3ae8f3bde3 feat: add www.oreilly.co.jp custom parser (#407) 5 years ago
kik0220 7396e81b72 feat: add sect.iij.ad.jp custom parser (#404) 5 years ago
kik0220 3f1d9030ee feat: add www.lifehacker.jp custom parser (#403) 5 years ago
kik0220 b077000c4a feat: add getnews.jp custom parser (#402) 5 years ago
kik0220 b5425c3e8a feat: add www.gizmodo.jp custom parser (#400) 5 years ago
kik0220 a38c727a0a feat: add deadline.com custom parser (#383)
* feat: add deadline.com custom parser

* fix: timezone

* fix: date_published selectors

* fix: title and author selector

* test: transform .embed-twitter

* fix: regenerate the fixture and fix content selector
5 years ago
kik0220 74a3c49a3c feat: add japan.cnet.com custom parser (#382)
* feat: add japan.cnet.com custom parser

* fix: remove transform
5 years ago
kik0220 7b07f88448 feat: add www.yomiuri.co.jp custom parser (#381) 5 years ago
kik0220 8ca2894751 feat: add bookwalker.jp custom parser (#374) 5 years ago
kik0220 a5f06ce27a feat: add takagi-hiromitsu.jp custom parser (#364) 5 years ago
kik0220 b9c57dbc2f feat: add www.publickey1.jp custom parser (#365)
* feat: add www.publickey1.jp custom parser

* fix: date_published selector
5 years ago
kik0220 d7dbea8a95 feat: add www.itmedia.co.jp custom parser (#366)
* feat: add www.itmedia.co.jp custom parser

* feat: add nlab.itmedia.co.jp support

* fix: title selectors
5 years ago
kik0220 9218f80da6 feat: add www.moongift.jp custom parser (#367)
* feat: add www.moongift.jp custom parser

* fix: date_published selectors

* fix: pass test

* fix: add timezone
5 years ago
kik0220 4eb73dffb0 feat: add www.infoq.com custom parser (#368)
* feat: add www.infoq.com custom parser

* fix: date_published selector
5 years ago