Product looks good, but I'm gonna roast you for having too little stuff on your landing page while also asking for a signup. I probably will sign up, I just have a reflexive aversion to doing so and generating yet another telemetry stream and set of incoming marketing emails.
The knowledge base and API documentation look good to me, but maybe not ideal for your target customer: the person looking for a no-code solution who is probably somewhat intimidated by anything beyond a CSV. I think you should add a step-by-step guide, or maybe a video showing in outline how the HTML selectors and rules work. When I first got interested in this topic there were two main stumbling blocks: cursors/pagination, and how to identify selectors on a page with multiple similar but distinct items (social media mutuals lists, product catalogs, etc.). Since you're aiming at a non-technical audience, I think you need to give them a feel for the walkthrough before they download the app.
> We use real browser instances to perform fast but human web scrapings, resulting in a much lower block ratio.
"won't be blocked" implies a zero block ratio. (I do a lot of work with Puppeteer and Playwright, and some larger websites are pretty advanced in their heuristics for catching automation, so true zero really isn't a defensible claim.)
It's obviously an exaggeration, but I think the point is to suggest that you'll have much higher success (as opposed to being blocked) with this service vs rolling your own.
Anyway, if you want to be technical about it, the marketing is correct. YOU won't be blocked. The agent running on your behalf might be blocked, however...
But from a marketing perspective, this "you won't be blocked" falls into the acceptable simplification category. Maybe they could add a * footnote, giving some more detail elsewhere. But at this point in the landing page, it wouldn't make sense to try to state it more accurately as that would require too many words.
There’s a difference between acceptable simplification and misleading, and while the line is not stark, landing on the wrong side of it won’t build as much trust over time.
How about “you’ll be blocked less”, or some variation of that form?
Still simple, less risk of disappointment/trust issues.
Oftentimes being "blocked" is more nuanced than whether the site returns a 200 vs a 4xx. The site may render, but the backend API may respond differently based on the behavior it sees.
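To make that concrete, here's a rough sketch of detecting a "soft" block where the page still returns 200; the marker strings are hypothetical examples, not taken from any real site:

```javascript
// Classify a response as blocked even when the HTTP status is 200.
// The marker strings below are illustrative stand-ins.
function classifyResponse(status, body) {
  if (status === 403 || status === 429) return 'hard-block';
  const softBlockMarkers = [
    'unusual traffic',
    'verify you are a human',
    'access denied',
  ];
  const text = body.toLowerCase();
  if (status === 200 && softBlockMarkers.some(m => text.includes(m))) {
    return 'soft-block'; // the page rendered, but it's a challenge page
  }
  return 'ok';
}
```

A scraping service's real heuristics would be far richer (response size, missing API calls, layout changes), but the point is the same: a 200 is not proof of success.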
Removing “you won’t be blocked.” should be sufficient then.
It looks interesting. I tried Puppeteer and Playwright but never got the hang of it, so I might be a client for one of these scraper services one day. The first time I tried it I got immediately blocked (probably because it had no user agent and was running on a Raspberry Pi).
Additions to libraries like Puppeteer that help ensure that the browser being used looks more "organic", often by returning fake data that a normal browser would have (browsers have APIs with things like plugins and fonts installed etc)
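As a rough illustration of what those additions do — the detector and the mocked `navigator` objects below are simplified stand-ins, not any real anti-bot script or plugin:

```javascript
// A naive version of the checks anti-bot scripts run against `navigator`,
// and the kind of values stealth plugins patch in. Both objects are mocks.
function looksAutomated(nav) {
  if (nav.webdriver) return true;                              // set by headless Chrome
  if (!nav.plugins || nav.plugins.length === 0) return true;   // bare automation profile
  if (!nav.languages || nav.languages.length === 0) return true;
  return false;
}

const headlessDefault = { webdriver: true, plugins: [], languages: [] };
const patchedByStealth = {
  webdriver: false,                          // property overridden by the plugin
  plugins: [{ name: 'Chrome PDF Plugin' }],  // fake plugin list returned
  languages: ['en-US', 'en'],                // plausible language data
};
```

Real detection also looks at fonts, canvas/WebGL fingerprints, and timing, which is why this is a cat-and-mouse game rather than a solved problem.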
> anti-blocking measures are implemented ethically
Your assumption that blocking is somehow ethical by default is not unproblematic.
There's a world wide web built by academics for free exchange of information, and there's a walled-garden web built by major capitalists.
Just how free that exchange of information should be is not a settled problem. Some staunch libertarians argue along the lines of information "wanting to be free". Some commercial entities seem to identify copyright and trademark law with moral doctrines. There are plenty of arguments for in-between positions as well.
If we look at less democratic societies, efforts made to circumvent state censorship are publicly lauded as morally good actions by the international community. Could an analogy be drawn to large corporations censoring the less fortunate in economically uneven societies too, for instance?
> Your assumption that blocking is somehow ethical by default is not unproblematic.
is itself an assumption.
The problem I'm concerned with is aggressive (either deliberately or ignorantly) crawling/scraping of non-commercial sites which often lack the financial resources to defend against activities enabled without apparent concern by tools like the site here.
If a site allows reasonable access in good faith, then subverting those limits and constraints for self-serving reasons is ethically dubious at best, and any service not addressing that while promising to enable that subversion should be questioned.
I've had a look at a number of these "simple" (i.e. ones where I don't have to write a complex script) scraping tools recently, and none of them seem to support what I consider to be a fairly common scenario of navigating to sub pages.
In my case I have a landing page (with pagination) with a list of records I want to extract. However, to extract the full information I need for each record, I need to click on each item and navigate to a detail page to extract further info.
Looking at your app and docs you don't seem to support this either. Is this something you are considering?
I'm currently working on standard pagination (click next page button) and click button + infinite scroll.
What you describe is not currently possible with a single scraper; you would need to send one to collect links and then scrape those links. But I'm also working on a "nesting data" feature, and what you describe should be possible in an ETA of 2-3 weeks max.
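In the meantime, the manual two-pass approach looks roughly like this; the `site` object is a mock keyed by URL, and a real scraper would fetch and parse each page instead:

```javascript
// Pass 1: pull detail-page links from the list page.
// Pass 2: visit each link and merge its data with its URL.
function scrapeListThenDetails(site, listUrl) {
  const links = site[listUrl].links;
  return links.map(url => ({ url, ...site[url].data }));
}

// Mock pages standing in for fetched-and-parsed HTML.
const site = {
  '/catalog?page=1': { links: ['/item/1', '/item/2'] },
  '/item/1': { data: { name: 'Widget', price: 10 } },
  '/item/2': { data: { name: 'Gadget', price: 25 } },
};
```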
I had a nice experience with https://simplescraper.io for a similar use-case. Was able to scrape a few thousand URLs without too much fuss.
The biggest complication with visual scrapers is all the edge cases. The selector algorithms usually become a mess on any complex website especially if there's uneven data.
Then you have CSS selectors no longer working, and so on. Very brittle.
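One common mitigation is to try several candidate selectors in order and fail loudly when none match, so a redesign doesn't silently produce empty data. A sketch, where `query` stands in for `document.querySelector` or Puppeteer's `page.$`:

```javascript
// Try candidate selectors in priority order; return the first hit,
// or null so the caller can flag the scraper as broken.
function firstMatch(query, selectors) {
  for (const sel of selectors) {
    const el = query(sel);
    if (el != null) return { selector: sel, element: el };
  }
  return null; // nothing matched: surface this as a scraper failure
}
```

This doesn't solve brittleness, it just turns silent breakage into a detectable one.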
for an (unlimited) free local option, https://webscraper.io/ may do what you want. It is simpler than this one (no proxy/scheduling/API...) but the scraping rules are quite elaborate.
"""
What happens if my scraping fails?
Not to worry! We will make every effort to determine the cause of the problem and assist you in resolving any issues with your scraper.
Additionally, please note that unsuccessful scrapings will not be included in your monthly quota.
"""
I'm curious about the feedback mechanism for failed scrapes. Is there any validation configuration or email notification I can set up in the event the target changes their page layout or DOM, or whatever else happens to cause interference?
totally, that's going to be a huge piece of functionality that will add a ton of value to the user. It's what I always worry about in maintaining my scrapers, updating xpaths, all the rigmarole parsing and what not. Stable pages are much appreciated but we all know web designers are gonna design. Thanks for making this product, being in a post https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn world is gonna make for some great adventures I think. Keep up the good work.
I signed up but on the setup wizard (https://app.mrscraper.com/onboarding) I can't seem to edit any of the input boxes ("Give your scraper a name", "Enter the URLs you want to scrape"). I'm on Chrome on Mac with uBlock Origin.
Sorry for not being more clear. The onboarding process is a simplified version of the actual scraper builder. The fields are not editable; it's just there to get you used to the scraping flow.
I've noted down your suggestion and I'll make this more clear or add field editing.
Yeah, this took me a minute to figure out, too. I'd change it to make it clear those are just static slides. Even better, remove it entirely, and when the user lands on the home page (after verifying), open a wizard that guides them through setting up their first scraper.
I also had a real problem naming the "Store as" field for my data extractor. It didn't seem to like things in the format "foo_bar_baz56" (i.e. ending with digits). This page https://mrscraper.freshdesk.com/support/solutions/articles/1... says "The variable name can not contain special signs" but doesn't explain what special signs are. Anything other than [A-Z_]?
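The docs don't define "special signs", so this is only a guess, but a validator consistent with the behavior I saw (names with trailing digits rejected) would be letters and underscores only:

```javascript
// Hypothetical reconstruction of the naming rule; the real one is undocumented.
const isValidVariableName = name => /^[A-Za-z_]+$/.test(name);
```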
Now that I've finally set up and tested my first scraper, I'm really impressed. It was much easier to set up than I would have guessed, and specifying a selector made it dead simple. Results worked out of the box, on a site that is super touchy about being scraped.
However, now that I'm viewing my scraper, I see no way of editing the scraper or data extractor. What's the trick to editing a scraper once you've saved it and gone back to view it?
I've noted everything down and I'll improve the onboarding experience, make variable names more understandable, and improve the edit button.
A scraper is not editable once it is queued to run or currently running; after that, you need to reload the page for the edit button to appear again. I will improve this.
Honestly, I've been working on it for two months and didn't reach this part of the roadmap yet. I was planning to use a 2captcha integration for the first approach.
Congrats on the launch, I think the description of what MrScraper does vs what you'd have to do yourself really nails it, and that is the value prop. Having experience in that area myself I can say that this looks like a great product and great pricing as well.
I'd have no doubt about the demand given the tools we see daily.
However, what I do wonder is why we are still interested in pushing this trend to the top. I'm surprised to see yet another web scraper featured on the front page of HN.
I always wonder what web scraping tools use as their proxy solution, because afaict they tend to be quite expensive, especially for residential IPs. How are you handling that?
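For what it's worth, one cost-conscious pattern is to try cheap datacenter IPs first and escalate a target to residential only after repeated failures; the pool contents and failure threshold below are made up for illustration:

```javascript
// Pick a proxy for a target: datacenter by default, residential after
// the target has failed twice; rotate round-robin within the tier.
function pickProxy(pool, failures, target) {
  const fails = failures[target] || 0;
  const tier = fails >= 2 ? 'residential' : 'datacenter';
  const candidates = pool.filter(p => p.tier === tier);
  return candidates[fails % candidates.length];
}

const pool = [
  { host: 'dc1.example', tier: 'datacenter' },
  { host: 'dc2.example', tier: 'datacenter' },
  { host: 'res1.example', tier: 'residential' },
];
```

That keeps the expensive residential bandwidth reserved for the sites that actually need it.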
I was hoping for something that would allow me to load a page to be scraped, mark the things I'm interested in, and have it help with the selection expressions.
I'm experimenting with something like you described, but there are lots of edge cases and it needs a bit of polishing. For now, though, I made a Chrome extension that helps with tinkering with CSS selectors.
I could, depending on the proxy I got from my provider. I'm currently working on adding the ability to select a higher-quality proxy for difficult-to-scrape websites, and on adding captcha solvers as well.
But if I have to be honest, I can not guarantee it at the present time.
This app is the side project I started 2 months ago; it's evolving fast, but I still need to add some key features for enterprise customers.
scrapeninja.net /scrape-js endpoint scrapes company pages of g2 without big troubles (with "us"/"eu" proxy geo in their online sandbox: https://scrapeninja.net/scraper-sandbox ).
They also have /scrape, which is much faster because it does not bootstrap a real browser, and it bypasses Cloudflare's TLS fingerprint check: https://pixeljets.com/blog/bypass-cloudflare/
What would the expression language for that even look like, given that PDFs are basically "canvas as a service"?
I'm aware there are pdf2html toys, and sometimes they do something reasonable, but just like with web scraping the markup of the target matters a lot and so, too, would the "markup" of the target PDF
Further, just like often it is better to go after the underlying XHR instead of trying to de-React the HTML, I'll offer that when possible it would be far better to try and identify the upstream source of the information in the PDF than trying to reverse engineer a postscript VM
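To illustrate that last point: the same record is trivial to read from a JSON endpoint and brittle to dig out of markup. The endpoint body and HTML below are invented for the sketch:

```javascript
// Pulled from the underlying XHR/JSON response: structured and stable.
const apiBody = '{"items":[{"name":"Widget","price":10}]}';
const fromApi = JSON.parse(apiBody).items[0];

// Scraped out of rendered HTML: depends entirely on the markup's shape,
// and a regex like this breaks on the next redesign.
const html = '<div class="item"><span class="name">Widget</span></div>';
const fromHtml = { name: html.match(/class="name">([^<]+)</)[1] };
```

The same logic applies to PDFs: finding the upstream data feed beats reverse engineering the rendered output.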
Just how many web scraping tools do we need? It seems like every month there's a new web scraper on HN, is that really such a common task that we need dozens of tools?