Instaparser

Cleanly pull content from any website

14 followers

Launch tags:

Web App • API • Tech

Launch Team

Brian Donohue

Publish Current Page to Medium

Maker

Brian from Instapaper here! Over the past few years we've gotten a significant number of requests from developers to have access to Instapaper's parser. Yesterday we launched Instaparser, an API to access Instapaper's parser. Instaparser is a paid service, but there's a free tier under https://www.instaparser.com/sign... that can be used for testing or just quick weekend hacks. Personally, this is the first developer-focused product I've launched, and I'm very excited to get it out into the community and see what people will do with it.

9yr ago

PJ Camillieri

aiden.ai

@bthdonohue This looks very interesting. I am not trying to be negative here, but I am just curious (as a potential customer): how do you guys compare to open source (and frankly: popular) solutions such as Newspaper? https://github.com/codelucas/new...

9yr ago

Brian Donohue

Publish Current Page to Medium

Maker

@cam_pj Hi PJ! I'm unfamiliar with Newspaper, so I just took a look through the source code to get a feel for how they're doing the article parsing. It looks like a great tool for an open source parsing framework, and also appears to be at least somewhat influenced by the Readability parser (similar paragraph scoring, checking sibling nodes, etc). I think the major difference here is that, to get good coverage across as many domains as possible, you need to implement and maintain a flexible system for domain-by-domain parser configurations. We have a dedicated support/community person who's trained to resolve parsing issues on a domain-by-domain basis when they do come up, and we use a variety of signals to make sure the parser is up-to-date. We have signals coming from the "Report a Problem" button in the Instapaper app, scheduled integration tests against our most popular domains, and recorded failures from the Instaparser API, and we use a combination of those signals and domain popularity to prioritize fixes for parsing issues, both proactively and reactively. Creating an accurate parser requires constant maintenance from a dedicated team, and while I'm sure there are open source projects out there that will come up with 65%-75% accuracy, getting to 90%+ accuracy is the really tricky bit. Hope that's helpful!

9yr ago

PJ Camillieri

aiden.ai

@bthdonohue Understood. It makes sense. Like you said - the last 20% are always tricky with data extraction. Thanks for clarifying this.

9yr ago

Paul Arterburn

Save This!

@bthdonohue I'm curious about the reason to build in-house vs. using something like Embedly for the video section. This looks like a direct competitor to them (at a 2.5x price increase per call). What are the main reasons someone would use Instaparser over Embedly? No official association with Embedly, just a happy user of both services.

9yr ago

Brian Donohue

Publish Current Page to Medium

Maker

@parterburn hi Paul! We did outsource some of our parsing for a while to Diffbot (https://www.diffbot.com), but we ran into a lot of issues using their service: everything from inaccurate parses, to returned elements that were incompatible with Instapaper, to slow speeds. When we rewrote Instapaper's original parser for a newer, more modern web and replaced all of our parsing with the new parser, we saw a 10x drop in parsing time, among other benefits like increased accuracy and better integration with Instapaper (e.g. inline video support). I'm not sure how you're figuring the 2.5x price increase per call for Embedly. We're pretty competitive in pricing, although slightly more expensive on the lower end and slightly less expensive on the upper end: https://s3-us-west-2.amazonaws.c... Thanks for your questions!

9yr ago

Paul Arterburn

Save This!

@bthdonohue Great explanation. I was only looking at the lowest tier, and I applaud you for "choosing your customers" via price. It's not something every company is comfortable doing. @Shpigford just published a great article on this.

9yr ago

Jamie

@bthdonohue this is really awesome! Thanks for building this. Now that you no longer use Diffbot (and they have a competing product) you should probably request that they remove Instapaper from their website.

9yr ago