Show HN: MrScraper – A visual web-scraping tool

anigbrowl · on Feb 10, 2023

Product looks good, but I'm gonna roast you for having too little stuff on your landing page while also asking for a signup. I probably will sign up, I just have a reflexive aversion to doing so and generating yet another telemetry stream and set of incoming marketing emails.

The knowledge base and API documentation is good to me, but maybe not ideal for your target customer, the person looking for a no-code solution and probably somewhat intimidated by anything beyond a CSV. I think you should add a step-by-step or maybe a video showing how the HTML selectors and rules work in outline. When I first got interested in this topic there were two main stumbling blocks: cursors/pagination, and how to identify selectors on a page with multiple similar but distinct items (social media mutuals lists, product catalogs etc.). Since you're aiming at a non-technical audience, I think you need to give them feel of a walkthrough before downloading the app.

buffer_overflow · on Feb 10, 2023

Thanks for this amazing feedback. I can get some action items from this.

I've noted down your comment and I'll be improving things for next week!

jonatron · on Feb 10, 2023

I scraped https://bot.incolumitas.com/ . Results do not look good, sorry!

{ "new_tests": "{\n \"puppeteerEvaluationScript\": \"OK\",\n \"webdriverPresent\": \"FAIL\",\n \"connectionRTT\": \"FAIL\",\n \"overrideTest\": \"OK\",\n \"puppeteerExtraStealthUsed\": \"OK\",\n \"inconsistentServiceWorkerNavigatorPropery\": \"OK\",\n \"inconsistentWebWorkerNavigatorPropery\": \"OK\"\n}", "detection_tests": "{\n \"intoli\": {\n \"userAgent\": \"OK\",\n \"webDriver\": \"FAIL\",\n \"webDriverAdvanced\": \"FAIL\",\n \"pluginsLength\": \"FAIL\",\n \"pluginArray\": \"FAIL\",\n \"languages\": \"OK\"\n },\n \"fpscanner\": {\n \"PHANTOM_UA\": \"OK\",\n \"PHANTOM_PROPERTIES\": \"OK\",\n \"PHANTOM_ETSL\": \"OK\",\n \"PHANTOM_LANGUAGE\": \"OK\",\n \"PHANTOM_WEBSOCKET\": \"OK\",\n \"MQ_SCREEN\": \"OK\",\n \"PHANTOM_OVERFLOW\": \"OK\",\n \"PHANTOM_WINDOW_HEIGHT\": \"OK\",\n \"HEADCHR_UA\": \"OK\",\n \"WEBDRIVER\": \"FAIL\",\n \"HEADCHR_CHROME_OBJ\": \"FAIL\",\n \"HEADCHR_PERMISSIONS\": \"FAIL\",\n \"HEADCHR_PLUGINS\": \"WARN\",\n \"HEADCHR_IFRAME\": \"FAIL\",\n \"CHR_DEBUG_TOOLS\": \"OK\",\n \"SELENIUM_DRIVER\": \"OK\",\n \"CHR_BATTERY\": \"OK\",\n \"CHR_MEMORY\": \"OK\",\n \"TRANSPARENT_PIXEL\": \"OK\",\n \"SEQUENTUM\": \"OK\",\n \"VIDEO_CODECS\": \"OK\"\n }\n}" }

jetter · on Feb 10, 2023

Interesting test suite, thanks! I have tested scrapeninja.net via https://scrapeninja.net/scraper-sandbox and I got { "puppeteerEvaluationScript": "OK", "webdriverPresent": "OK", "connectionRTT": "OK", "refMatch": "OK", "overrideTest": "OK", "overflowTest": "OK", "puppeteerExtraStealthUsed": "OK", "inconsistentWebWorkerNavigatorPropery": "OK", "inconsistentServiceWorkerNavigatorPropery": "OK" }

and ip range of "us" geo proxy gives is_abuse: true. Consider this to be okayish though, given that this is a default proxy pool.

buffer_overflow · on Feb 10, 2023

Thanks for reporting. I'll review this!

hummus_bae · on Feb 10, 2023

wow, you actually went through the trouble of running all this detection tests? that’s super cool :)

bdcravens · on Feb 10, 2023

This marketing bit seems a bit conflicting:

"With MrScraper, you won't be blocked.

We use real browser instances to perform fast but human web scrapings, resulting in a much lower block ratio."

"won't be blocked" implies a zero block ratio. (I do a lot of work with Puppeteer and Playwright, and some larger websites are pretty advanced at their heuristics at catching automation, so true zero really isn't a defensible claim)

michaelteter · on Feb 10, 2023

It's obviously an exaggeration, but I think the point is to suggest that you'll have much higher success (as opposed to being blocked) with this service vs rolling your own.

Anyway, if you want to be technical about it, the marking is correct. YOU won't be blocked. The agent running on your behalf might be blocked, however...

But from a marketing perspective, this "you won't be blocked" falls into the acceptable simplification category. Maybe they could add a * footnote, giving some more detail elsewhere. But at this point in the landing page, it wouldn't make sense to try to state it more accurately as that would require too many words.

WhitneyLand · on Feb 10, 2023

There’s a difference between acceptable simplification and misleading, and while the line is not stark landing on the wrong side of it won’t build as much trust over time.

How block you’ll be “blocked less” or some variation of that form?

Still simple, less risk of disappointment/trust issues.

bdcravens · on Feb 10, 2023

Oftentimes being "blocked" is more nuanced than whether the site returns a 200 vs a 4xx. The site may render, but the backend API may respond differently based on the behavior it sees.

prox · on Feb 10, 2023

Removing “you won’t be blocked.” should be sufficient then.

It looks interesting. I tried puppeteer and playwright but never got the hang of it, so I might be a client for one of these scraper services one day. The first time I tried it I got immediately blocked (probably because it had no agent, which was a raspberry pi)

bdcravens · on Feb 10, 2023

The best results always come when you run the browser in full GUI mode, rather than headless.

buffer_overflow · on Feb 10, 2023

Thanks for sharing your point of view!

I will rewrite the copy to make better statements. Thank you so much

jefozabuss · on Feb 10, 2023

"It won't be blocked" = they imported the stealth plugin most likely

bdcravens · on Feb 10, 2023

The stealth plugin is good, but not 100%. Some sites rely on heuristics other than what the browser reports.

jamestimmins · on Feb 10, 2023

What is a stealth plugin?

robjan · on Feb 10, 2023

https://www.npmjs.com/package/puppeteer-extra-plugin-stealth

bdcravens · on Feb 10, 2023

Additions to libraries like Puppeteer that help ensure that the browser being used looks more "organic", often by returning fake data that a normal browser would have (browsers have APIs with things like plugins and fonts installed etc)

jamestimmins · on Feb 10, 2023

Cool thanks for explaining!

mxkopy · on Feb 10, 2023

> so true zero really isn't a defensible claim

I feel like this is saying your systems have perfect security, which itself is not a defensible claim

asdadsdad · on Feb 10, 2023

also considering tests above - \"webDriver\": \"FAIL\" - seems like you'll totally get blocked by any anti-bot

SturgeonsLaw · on Feb 11, 2023

My actual browser that I use as a human failed that test so it's probably more on them than anything.

Or I might have some kinda of addin/setting configured from hacking around on something over the years.

mellosouls · on Feb 10, 2023

One would hope that anti-blocking measures are implemented ethically and the documentation clarified to reflect that.

sigg3 · on Feb 10, 2023

> anti-blocking measures are implemented ethically

Your assumption that blocking is somehow ethical by default is not unproblematic.

There's a world wide web built by academics for free exchange of information and there's a closed garden web built by major capitalists.

Just how free that exchange of information should be is not a settled problem. Some very libertarians argue along the lines of information "wanting to be free". Some commercial entities seem to identify copyright and trademark law with moral doctrines. There are plenty of arguments for in-between positions as well.

If we look at less democratic societies, the efforts made to circumnavigate state censorship are publicly lauded as morally good actions by the international community. Could an analogy be drawn to large corporations censoring the less fortunate in a economically uneven societies too, for instance?

mellosouls · on Feb 10, 2023

That is a good reply generally, but this

Your assumption that blocking is somehow ethical by default is not unproblematic.

is itself an assumption.

The problem I'm concerned with is aggressive (either deliberately or ignorantly) crawling/scraping of non-commercial sites which often lack the financial resources to defend against activities enabled without apparent concern by tools like the site here.

If a site allows reasonable access in good faith, then subverting those limits and constraints for self-serving reasons is ethically dubious at best, and any service not addressing that while promising to enable that subversion should be questioned.

jerriep · on Feb 10, 2023

I've had a look at a number of these "simple" (i.e ones where I don't have to write a complex script) scraping tools recently and none of them seem to support what I consider to be a fairly common scenario of navigating to sub pages.

In my case I have a landing page (with pagination) with a list of records I want to extract. However, to extract the full information I need for each record, I need to click on each item and navigate to a detail page to extract further info.

Looking at your app and docs you don't seem to support this either. Is this something you are considering?

buffer_overflow · on Feb 10, 2023

Hi there,

I'm currently working on standard pagination (click next page button) and click button + infinite scroll.

What you comment is not currently possible with a single scraper, you would need to send one to collect links and then scrape those links. But I'm also working on "nesting data" feature, and what you comment should be possible in an ETA 2-3 weeks max.

Thanks for commenting!

factsaresacred · on Feb 10, 2023

I had a nice experience with https://simplescraper.io for a similar use-case. Was able to scrape a few thousand URLs without too much fuss.

The biggest complication with visual scrapers is all the edge cases. The selector algorithms usually become a mess on any complex website especially if there's uneven data.

Then you have css selectors no longer working and so on. Very brittle.

t_a_v_i_s · on Feb 10, 2023

You might want to try https://www.kadoa.com (disclaimer: I'm one of the founders)

CrypticShift · on Feb 10, 2023

for an (unlimited) free local option, https://webscraper.io/ may do what you want. It is simpler than this one (no proxy/scheduling/API...) but the scraping rules are quite elaborate.

martinsbalodis · on Feb 10, 2023

I'm the founder of webscraper.io. The paid version includes proxy, scheduling, data export, data parsing, data quality notifications and much more.

idjdhdhddjaj · on Feb 10, 2023

Try browserflow.

jerriep · on Feb 10, 2023

Ooh yeah, works great, thanks! It's a pity I have to buy a subscription as my needs are more of a once-off.

pocket_cheese · on Feb 10, 2023

I 100% recommend browserflow. It's fucking awesome!

jxramos · on Feb 10, 2023

Glad to see this text

""" What happens if my scraping fails? Not to worry! We will make every effort to determine the cause of the problem and assist you in resolving any issues with your scraper.

Additionally, please note that unsuccessful scrapings will not be included in your monthly quota. """

I'm curious the feedback mechanism for failed scrapes. Is there any validation configuration or an email notification I can configure in the event the target changes their page layout or DOM or whatever happens to cause interference.

buffer_overflow · on Feb 11, 2023

If the target page or DOM changes, it's not considered a failed scraping, just the data you wanted will return empty.

But you gave me the idea to send an alert or email notification if a scraper stops returning data or content changes.

Thanks

jxramos · on Feb 12, 2023

totally, that's going to be a huge piece of functionality that will add a ton of value to the user. It's what I always worry about in maintaining my scrapers, updating xpaths, all the rigmarole parsing and what not. Stable pages are much appreciated but we all know web designers are gonna design. Thanks for making this product, being in a post https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn world is gonna make for some great adventures I think. Keep up the good work.

marcopicentini · on Feb 10, 2023

On this page text field are disabled (Chrome, MacOsx)

https://app.mrscraper.com/onboarding

buffer_overflow · on Feb 11, 2023

The onboarding process is not a fillable form, it's just a simplification of the scraper builder to show how it works. You just have to click next.

Other users have also reported that this is a bit confusing, so I'm going to start working on improving this.

Thanks for you feedback!

somsak2 · on Feb 10, 2023

I signed up but on the setup wizard (https://app.mrscraper.com/onboarding) I can't seem to edit any of the input boxes ("Give your scraper a name", "Enter the URLs you want to scrape"). I'm on Chrome on Mac with uBlock Origin.

buffer_overflow · on Feb 10, 2023

Sorry for not being more clear. The onboarding process is a simplified version of the actual scraper builder. The fields are not editable, it's just to get used to the scraping flow.

I've noted down your suggestion and I'll make this more clear or add field edition.

Thanks!

kmoser · on Feb 11, 2023

Yeah, this took me a minute to figure out, too. I'd change it to make it clear those are just static slides. Even better, remove it entirely, and when the user lands on the home page (after verifying), open a wizard that guides them through setting up their first scraper.

I also had a real problem naming the "Store as" field for my data extractor. It didn't seem to like things in the format "foo_bar_baz56" (i.e. ending with digits). This page https://mrscraper.freshdesk.com/support/solutions/articles/1... says "The variable name can not contain special signs" but doesn't explain what special signs are. Anything other than [A-Z_]?

Now that I've finally set up and tested my first scraper, I'm really impressed. It was much easier to set up than I would have guessed, and specifying a selector made it dead simple. Results worked out of the box, on a site that is super touch about being scraped.

However, now that I'm viewing my scraper, I see no way of editing the scraper or data extractor. What's the trick to editing a scraper once you've saved it and gone back to view it?

buffer_overflow · on Feb 11, 2023

Thanks for your precious feedback!

I've noted everything down and I'll improve the onboarding experience, fix and make variable names more understandable and improve the edit button.

A scraper is not editable once is queued to run or currently running, then, you need to reload the page for the edit button to appear again. I will improve this.

Thank you again

samanator · on Feb 10, 2023

How do you approach pages that sometimes, non deterministically, present captcha challenges?

Are you using a service like 2captcha to auto-solve captchas?

buffer_overflow · on Feb 10, 2023

Honestly, I've been working on it for two months and didn't reach this part of the roadmap yet. I was planning to use a 2captcha integration for the first approach.

heipei · on Feb 10, 2023

Congrats on the launch, I think the description of what MrScraper does vs what you'd have to do yourself really nails it, and that is the value prop. Having experience in that area myself I can say that this looks like a great product and great pricing as well.

buffer_overflow · on Feb 10, 2023

Thank you so much. Appreciate it!

designaas · on Feb 18, 2023

Hey if you're looking for a roast we just launched https://Roastd.io to do exactly this!

moneywoes · on Feb 10, 2023

How big is the market for no code scrapers? Seems I see new tools daily

alvis · on Feb 10, 2023

I'd have no doubt about the demand given the tools we see daily. However, what beg the question is why are we still interesting in pushing this tread to the top? I'm surprise to see yet another web scraper be featured on the front page on HN

Taig · on Feb 10, 2023

Looks great, congrats on launching.

I always wonder what web scraping tools use as their proxy solution, because afaict they tend to be quite expensive, especially for residential IPs. How are you handling that?

arcturus17 · on Feb 10, 2023

How is this different to Octoparse and other tools in the space?

buffer_overflow · on Feb 10, 2023

Honestly I haven't tried Octoparse, but doing a quick check, I can see it is way more expensive.

rfeague · on Feb 10, 2023

I was hoping for something that would allow me to load a page to be scraped, mark the things I'm interested in, and have it help with the selection expressions.

buffer_overflow · on Feb 11, 2023

I'm experimenting with something like you described, but there are lots of edge cases and needs a bit of polishing. But for now, I made a Chrome extension that helps tinkering with CSS selectors.

robbiejs · on Feb 10, 2023

Website looks good; but it begs for a video

svdr · on Feb 10, 2023

Looking good, but I would not underline text if it's not a link.

can16358p · on Feb 10, 2023

Yup. Had to tap them a few times on my phone just to realize that they weren't tappable.

buffer_overflow · on Feb 10, 2023

Noted! I'll improve this

buffer_overflow · on Feb 10, 2023

Thanks, I will do one!

jarek83 · on Feb 10, 2023

Basic question before I can recommend this to my boss: can it scrape G2? (or any other page behind CF)

buffer_overflow · on Feb 10, 2023

I could depending on the proxy I got from my provider. I'm currently working on adding the ability to select a higher quality proxy for difficult to scrape websites and to add captcha solvers as well.

But if I have to be honest, I can not guarantee it at the present time.

This app is the side-project I starter 2 months ago, it's evolving fast but I still need to add some key features for enterprise customers.

jetter · on Feb 10, 2023

scrapeninja.net /scrape-js endpoint scrapes company pages of g2 without big troubles (with "us"/"eu" proxy geo in their online sandbox: https://scrapeninja.net/scraper-sandbox ). They also have /scrape which is much faster because it does not bootstrap real browser, and bypasses CloudFlare TLS fingeprint check: https://pixeljets.com/blog/bypass-cloudflare/

tomcam · on Feb 10, 2023

What are G2 and CF?

duckmysick · on Feb 10, 2023

G2 is a software (as a service) comparison website: https://www.g2.com/

CF is Cloudflare, which offers an anti-scraping protection for websites (among other things): https://www.cloudflare.com/

tomcam · on Feb 10, 2023

Thank you!

aleksiy123 · on Feb 10, 2023

Curious how well ChatGPT could write css selectors for these no code scrapers.

If running the model was cheaper I would even say run the whole page through ChatGPT and ask it to format the information on the page for you.

buffer_overflow · on Feb 11, 2023

Already tried that. Not worth it with the current token limitations.

But I'm currently tinkering with other applications of AI and MrScraper. Will ship something when I think it's reliable enough.

Buttons840 · on Feb 11, 2023

I haven't looked at many web scrapers. Do most use CSS selectors? What about xpath? I always found xpath to be much more powerful.

bofadeez · on Feb 11, 2023

Too expensive. Make it 50x cheaper and you'd be competitive with python scrapy + rotating residential proxies.

jdthedisciple · on Feb 10, 2023

Looks good.

Last time I needed it I just used Python + Selenium

Obv at the downside of needing to code but I don't mind that.

theFletch · on Feb 10, 2023

What about scraping PDFs on the web? Anyone have suggestions for that?

mdaniel · on Feb 10, 2023

What would the expression language for that even look like, given that PDFs are basically "canvas as a service"?

I'm aware there are pdf2html toys, and sometimes they do something reasonable, but just like with web scraping the markup of the target matters a lot and so, too, would the "markup" of the target PDF

Further, just like often it is better to go after the underlying XHR instead of trying to de-React the HTML, I'll offer that when possible it would be far better to try and identify the upstream source of the information in the PDF than trying to reverse engineer a postscript VM

KomoD · on Feb 11, 2023

Cool, looks like a more polished version of datagrab.io

xupybd · on Feb 10, 2023

I love the name.

buffer_overflow · on Feb 10, 2023

Thanks buddy!

brap · on Feb 10, 2023

Just how many web scraping tools do we need? It seems like every month there's a new web scraper on HN, is that really such a common task that we need dozens of tools?

firtoz · on Feb 10, 2023

If you don't need it don't use it.

brap · on Feb 10, 2023

I wasn't going to, still I think it's a fair question.

cal85 · on Feb 10, 2023

Rule of thumb: If your comment amounts to “I’m not interested in this”, consider not posting it, and just look at something else instead.

brap · on Feb 10, 2023

My comment was very clearly asking why dozens of different web scrapers are needed.

vinceguidry · on Feb 10, 2023

Wasn't a fair question.