Show HN: MrScraper – A visual web-scraping tool (mrscraper.com)
215 points by buffer_overflow on Feb 10, 2023 | 82 comments
Two months ago, I started building this side-project in the morning, before my full-time job.

A visual and easy-to-use web scraping app.

Please, roast it a bit so I can work on improving it. Thanks.



Product looks good, but I'm gonna roast you for having too little stuff on your landing page while also asking for a signup. I probably will sign up; I just have a reflexive aversion to doing so and to generating yet another telemetry stream and set of incoming marketing emails.

The knowledge base and API documentation are good to me, but maybe not ideal for your target customer: the person looking for a no-code solution who is probably somewhat intimidated by anything beyond a CSV. I think you should add a step-by-step guide, or maybe a video, showing in outline how the HTML selectors and rules work. When I first got interested in this topic there were two main stumbling blocks: cursors/pagination, and how to identify selectors on a page with multiple similar but distinct items (social media mutuals lists, product catalogs, etc.). Since you're aiming at a non-technical audience, I think you need to give them the feel of a walkthrough before they download the app.
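To make the "multiple similar but distinct items" stumbling block concrete, here's a stdlib-Python sketch (made-up markup, not MrScraper's engine): the trick a visual scraper has to teach is that one selector matches every repeated card, and the field selector is resolved relative to each card.

```python
# Toy illustration of scraping repeated items: every card shares
# class="item", and the title is found inside each card. Stdlib-only;
# real tools run CSS selectors against a proper DOM.
from html.parser import HTMLParser

HTML = """
<ul>
  <li class="item"><span class="title">Red mug</span></li>
  <li class="item"><span class="title">Blue mug</span></li>
</ul>
"""

class ItemTitles(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # Enter "collecting" mode whenever a title element starts.
        if dict(attrs).get("class") == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        self.in_title = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.titles.append(data.strip())

parser = ItemTitles()
parser.feed(HTML)
print(parser.titles)  # ['Red mug', 'Blue mug']
```

A walkthrough could show exactly this: click one title, have the tool generalize the selector to all sibling cards.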


Thanks for this amazing feedback. I can get some action items from this.

I've noted down your comment and I'll be improving things for next week!


I scraped https://bot.incolumitas.com/ . Results do not look good, sorry!

{
  "new_tests": {
    "puppeteerEvaluationScript": "OK",
    "webdriverPresent": "FAIL",
    "connectionRTT": "FAIL",
    "overrideTest": "OK",
    "puppeteerExtraStealthUsed": "OK",
    "inconsistentServiceWorkerNavigatorPropery": "OK",
    "inconsistentWebWorkerNavigatorPropery": "OK"
  },
  "detection_tests": {
    "intoli": {
      "userAgent": "OK",
      "webDriver": "FAIL",
      "webDriverAdvanced": "FAIL",
      "pluginsLength": "FAIL",
      "pluginArray": "FAIL",
      "languages": "OK"
    },
    "fpscanner": {
      "PHANTOM_UA": "OK",
      "PHANTOM_PROPERTIES": "OK",
      "PHANTOM_ETSL": "OK",
      "PHANTOM_LANGUAGE": "OK",
      "PHANTOM_WEBSOCKET": "OK",
      "MQ_SCREEN": "OK",
      "PHANTOM_OVERFLOW": "OK",
      "PHANTOM_WINDOW_HEIGHT": "OK",
      "HEADCHR_UA": "OK",
      "WEBDRIVER": "FAIL",
      "HEADCHR_CHROME_OBJ": "FAIL",
      "HEADCHR_PERMISSIONS": "FAIL",
      "HEADCHR_PLUGINS": "WARN",
      "HEADCHR_IFRAME": "FAIL",
      "CHR_DEBUG_TOOLS": "OK",
      "SELENIUM_DRIVER": "OK",
      "CHR_BATTERY": "OK",
      "CHR_MEMORY": "OK",
      "TRANSPARENT_PIXEL": "OK",
      "SEQUENTUM": "OK",
      "VIDEO_CODECS": "OK"
    }
  }
}


Interesting test suite, thanks! I have tested scrapeninja.net via https://scrapeninja.net/scraper-sandbox and I got { "puppeteerEvaluationScript": "OK", "webdriverPresent": "OK", "connectionRTT": "OK", "refMatch": "OK", "overrideTest": "OK", "overflowTest": "OK", "puppeteerExtraStealthUsed": "OK", "inconsistentWebWorkerNavigatorPropery": "OK", "inconsistentServiceWorkerNavigatorPropery": "OK" }

and the IP range of the "us" geo proxy gives is_abuse: true. I consider this okayish though, given that it's a default proxy pool.


Thanks for reporting. I'll review this!


Wow, you actually went through the trouble of running all these detection tests? That's super cool :)


This marketing bit seems a bit conflicting:

"With MrScraper, you won't be blocked.

We use real browser instances to perform fast but human web scrapings, resulting in a much lower block ratio."

"won't be blocked" implies a zero block ratio. (I do a lot of work with Puppeteer and Playwright, and some larger websites are pretty advanced in their heuristics for catching automation, so true zero really isn't a defensible claim.)


It's obviously an exaggeration, but I think the point is to suggest that you'll have much higher success (as opposed to being blocked) with this service vs rolling your own.

Anyway, if you want to be technical about it, the marketing is correct. YOU won't be blocked. The agent running on your behalf might be blocked, however...

But from a marketing perspective, this "you won't be blocked" falls into the acceptable simplification category. Maybe they could add a * footnote, giving some more detail elsewhere. But at this point in the landing page, it wouldn't make sense to try to state it more accurately as that would require too many words.


There’s a difference between acceptable simplification and misleading, and while the line is not stark, landing on the wrong side of it won’t build as much trust over time.

How about “you’ll be blocked less”, or some variation of that form?

Still simple, less risk of disappointment/trust issues.


Oftentimes being "blocked" is more nuanced than whether the site returns a 200 vs a 4xx. The site may render, but the backend API may respond differently based on the behavior it sees.
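A minimal sketch of that nuance (the markers and thresholds here are illustrative assumptions, not any product's actual logic): treating a 200 response with a captcha page or an empty payload as a block too, instead of only looking at the status code.

```python
# Classify a scrape result, catching "soft" blocks: the site returns
# HTTP 200 but serves a captcha interstitial or an empty/stripped page.
def classify_response(status: int, body: str) -> str:
    if status >= 400:
        return "hard_block"   # the obvious case: 403/429/5xx
    lowered = body.lower()
    if "captcha" in lowered or len(body.strip()) == 0:
        return "soft_block"   # 200, but not the content we wanted
    return "ok"

print(classify_response(200, "<html>Please solve this CAPTCHA</html>"))  # soft_block
print(classify_response(403, ""))                                        # hard_block
print(classify_response(200, "<div class='item'>data</div>"))            # ok
```

Real detection is fuzzier (different prices, shadow-banned content, etc.), which is exactly the parent's point.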


Removing “you won’t be blocked.” should be sufficient then.

It looks interesting. I tried Puppeteer and Playwright but never got the hang of them, so I might be a client for one of these scraper services one day. The first time I tried it I got blocked immediately (probably because it sent no user agent; it was running on a Raspberry Pi).


The best results always come when you run the browser in full GUI mode, rather than headless.


Thanks for sharing your point of view!

I will rewrite the copy to make better statements. Thank you so much


"It won't be blocked" = they imported the stealth plugin most likely


The stealth plugin is good, but not 100%. Some sites rely on heuristics other than what the browser reports.


What is a stealth plugin?



Additions to libraries like Puppeteer that help ensure the browser being used looks more "organic", often by returning fake data that a normal browser would have (browsers have APIs exposing things like installed plugins, fonts, etc.)
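A toy illustration of the cat-and-mouse (stdlib Python with a mock navigator dict, not any real anti-bot vendor's code): two of the signals the detection suites above check, and which a stealth plugin patches.

```python
# Toy "anti-bot" check against a mock navigator object. Automation-driven
# browsers set navigator.webdriver to true, and headless browsers
# historically reported an empty plugins list; stealth plugins spoof
# exactly these kinds of properties. Real suites check dozens more signals.
def looks_automated(navigator: dict) -> bool:
    if navigator.get("webdriver"):       # the classic giveaway
        return True
    if len(navigator.get("plugins", [])) == 0:
        return True                      # no plugins at all is suspicious
    return False

headless  = {"webdriver": True,  "plugins": []}
stealthed = {"webdriver": False, "plugins": ["Chrome PDF Viewer"]}
print(looks_automated(headless))   # True
print(looks_automated(stealthed))  # False
```

This is why the "webDriver": "FAIL" result upthread matters: that one property alone is enough for many detectors.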


Cool thanks for explaining!


> so true zero really isn't a defensible claim

I feel like this is like saying your systems have perfect security, which itself is not a defensible claim.


Also, considering the tests above ("webDriver": "FAIL"), it seems like you'll totally get blocked by any anti-bot.


My actual browser that I use as a human failed that test so it's probably more on them than anything.

Or I might have some kind of add-in/setting configured from hacking around on something over the years.


One would hope that anti-blocking measures are implemented ethically and the documentation clarified to reflect that.


> anti-blocking measures are implemented ethically

Your assumption that blocking is somehow ethical by default is not unproblematic.

There's a world wide web built by academics for free exchange of information and there's a closed garden web built by major capitalists.

Just how free that exchange of information should be is not a settled problem. Some libertarians argue along the lines of information "wanting to be free". Some commercial entities seem to identify copyright and trademark law with moral doctrine. There are plenty of arguments for in-between positions as well.

If we look at less democratic societies, efforts to circumvent state censorship are publicly lauded as morally good actions by the international community. Could an analogy be drawn to large corporations censoring the less fortunate in economically uneven societies, too?


That is a good reply generally, but this

> Your assumption that blocking is somehow ethical by default is not unproblematic.

is itself an assumption.

The problem I'm concerned with is aggressive (either deliberately or ignorantly) crawling/scraping of non-commercial sites which often lack the financial resources to defend against activities enabled without apparent concern by tools like the site here.

If a site allows reasonable access in good faith, then subverting those limits and constraints for self-serving reasons is ethically dubious at best, and any service not addressing that while promising to enable that subversion should be questioned.


I've had a look at a number of these "simple" scraping tools recently (i.e. ones where I don't have to write a complex script) and none of them seem to support what I consider a fairly common scenario: navigating to sub-pages.

In my case I have a landing page (with pagination) with a list of records I want to extract. However, to extract the full information I need for each record, I need to click on each item and navigate to a detail page to extract further info.

Looking at your app and docs you don't seem to support this either. Is this something you are considering?


Hi there,

I'm currently working on standard pagination (click next page button) and click button + infinite scroll.

What you describe is not currently possible with a single scraper; you would need to run one to collect links and then scrape those links. But I'm also working on a "nesting data" feature, and what you describe should be possible within 2-3 weeks max.
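The two-pass workaround (one scraper collects the detail links, a second scrapes each link) can be sketched like this. The pages and regexes are mocked-up assumptions, not MrScraper's internals; a real run would do HTTP requests and proper DOM parsing instead.

```python
# Pass 1: extract detail-page links from a listing page.
# Pass 2: fetch each link and pull the full record.
import re

PAGES = {  # stand-in for the network
    "/listing": '<a class="detail" href="/item/1"></a><a class="detail" href="/item/2"></a>',
    "/item/1": "<h1>Red mug</h1>",
    "/item/2": "<h1>Blue mug</h1>",
}

def fetch(url: str) -> str:
    return PAGES[url]

links = re.findall(r'href="([^"]+)"', fetch("/listing"))
records = [re.search(r"<h1>(.*?)</h1>", fetch(u)).group(1) for u in links]
print(records)  # ['Red mug', 'Blue mug']
```

A "nesting data" feature essentially folds these two passes into one scraper definition.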

Thanks for commenting!


I had a nice experience with https://simplescraper.io for a similar use-case. Was able to scrape a few thousand URLs without too much fuss.

The biggest complication with visual scrapers is all the edge cases. The selector algorithms usually become a mess on any complex website, especially if there's uneven data.

Then you have css selectors no longer working and so on. Very brittle.


You might want to try https://www.kadoa.com (disclaimer: I'm one of the founders)


For an (unlimited) free local option, https://webscraper.io/ may do what you want. It is simpler than this one (no proxy/scheduling/API...), but the scraping rules are quite elaborate.


I'm the founder of webscraper.io. The paid version includes proxy, scheduling, data export, data parsing, data quality notifications and much more.


Try browserflow.


Ooh yeah, works great, thanks! It's a pity I have to buy a subscription as my needs are more of a once-off.


I 100% recommend browserflow. It's fucking awesome!


Glad to see this text

""" What happens if my scraping fails? Not to worry! We will make every effort to determine the cause of the problem and assist you in resolving any issues with your scraper.

Additionally, please note that unsuccessful scrapings will not be included in your monthly quota. """

I'm curious about the feedback mechanism for failed scrapes. Is there any validation configuration or email notification I can set up in the event the target changes their page layout or DOM, or whatever else happens to cause interference?


If the target page or DOM changes, it's not considered a failed scraping; the data you wanted will just come back empty.

But you gave me the idea to send an alert or email notification if a scraper stops returning data or content changes.
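A minimal sketch of that alert idea (function name and threshold are my own assumptions): flag a run whose result set is empty, or shrinks drastically versus the previous run, since a layout change typically shows up as silently missing data rather than an HTTP error.

```python
# Decide whether a scraper run should trigger an alert, based on how its
# result count compares to the previous run's count.
def needs_alert(current_count: int, previous_count: int,
                drop_threshold: float = 0.5) -> bool:
    if current_count == 0 and previous_count > 0:
        return True   # scraper suddenly returns nothing
    if previous_count > 0 and current_count < previous_count * drop_threshold:
        return True   # results dropped by more than half
    return False

print(needs_alert(0, 120))    # True
print(needs_alert(40, 120))   # True
print(needs_alert(110, 120))  # False - normal variation
```

Comparing against a rolling average of past runs, rather than just the last one, would make this less noisy.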

Thanks


totally, that's going to be a huge piece of functionality that will add a ton of value to the user. It's what I always worry about in maintaining my scrapers, updating xpaths, all the rigmarole parsing and what not. Stable pages are much appreciated but we all know web designers are gonna design. Thanks for making this product, being in a post https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn world is gonna make for some great adventures I think. Keep up the good work.


On this page text field are disabled (Chrome, MacOsx)

https://app.mrscraper.com/onboarding

<input x-data="{}" wire:model.defer="name" type="text" dusk="filament.forms.name" disabled="" id="name" class="block w-full transition duration-75 rounded-lg shadow-sm focus:border-primary-500 focus:ring-1 focus:ring-inset focus:ring-primary-500 disabled:opacity-70 border-gray-300" x-bind:class="{ 'border-gray-300': ! ('name' in $wire.__instance.serverMemo.errors), 'dark:border-gray-600': ! ('name' in $wire.__instance.serverMemo.errors) && false, 'border-danger-600 ring-danger-600': ('name' in $wire.__instance.serverMemo.errors), 'dark:border-danger-400 dark:ring-danger-400': ('name' in $wire.__instance.serverMemo.errors) && false, }">


The onboarding process is not a fillable form, it's just a simplification of the scraper builder to show how it works. You just have to click next.

Other users have also reported that this is a bit confusing, so I'm going to start working on improving this.

Thanks for your feedback!


I signed up but on the setup wizard (https://app.mrscraper.com/onboarding) I can't seem to edit any of the input boxes ("Give your scraper a name", "Enter the URLs you want to scrape"). I'm on Chrome on Mac with uBlock Origin.


Sorry for not being more clear. The onboarding process is a simplified version of the actual scraper builder. The fields are not editable, it's just to get used to the scraping flow.

I've noted down your suggestion and I'll make this more clear or add field edition.

Thanks!


Yeah, this took me a minute to figure out, too. I'd change it to make it clear those are just static slides. Even better, remove it entirely, and when the user lands on the home page (after verifying), open a wizard that guides them through setting up their first scraper.

I also had a real problem naming the "Store as" field for my data extractor. It didn't seem to like things in the format "foo_bar_baz56" (i.e. ending with digits). This page https://mrscraper.freshdesk.com/support/solutions/articles/1... says "The variable name can not contain special signs" but doesn't explain what special signs are. Anything other than [A-Z_]?
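For what it's worth, a common convention for "no special signs" is a Python-style identifier rule. This is purely a guess at what the docs might mean, not MrScraper's documented rule, and it doesn't even match the observed behavior (it would accept the trailing digits that got rejected), which is exactly why the docs should spell it out:

```python
# Hypothetical identifier rule: a letter or underscore first, then any
# mix of letters, digits, and underscores. NOT MrScraper's actual rule.
import re

def valid_variable_name(name: str) -> bool:
    return re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name) is not None

print(valid_variable_name("foo_bar_baz56"))  # True under this convention
print(valid_variable_name("foo-bar"))        # False - hyphen is a "special sign"
```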

Now that I've finally set up and tested my first scraper, I'm really impressed. It was much easier to set up than I would have guessed, and specifying a selector made it dead simple. Results worked out of the box, on a site that is super touchy about being scraped.

However, now that I'm viewing my scraper, I see no way of editing the scraper or data extractor. What's the trick to editing a scraper once you've saved it and gone back to view it?


Thanks for your precious feedback!

I've noted everything down; I'll improve the onboarding experience, fix variable naming and make it more understandable, and improve the edit button.

A scraper is not editable once it is queued to run or currently running; after that, you need to reload the page for the edit button to appear again. I will improve this.

Thank you again


How do you approach pages that sometimes, non-deterministically, present captcha challenges?

Are you using a service like 2captcha to auto-solve captchas?


Honestly, I've been working on it for two months and didn't reach this part of the roadmap yet. I was planning to use a 2captcha integration for the first approach.


Congrats on the launch, I think the description of what MrScraper does vs what you'd have to do yourself really nails it, and that is the value prop. Having experience in that area myself I can say that this looks like a great product and great pricing as well.


Thank you so much. Appreciate it!


Hey if you're looking for a roast we just launched https://Roastd.io to do exactly this!


How big is the market for no code scrapers? Seems I see new tools daily


I have no doubt about the demand, given the tools we see daily. What begs the question, though, is why we are still interested in pushing this trend to the top. I'm surprised to see yet another web scraper featured on the front page of HN.


Looks great, congrats on launching.

I always wonder what web scraping tools use as their proxy solution, because afaict they tend to be quite expensive, especially for residential IPs. How are you handling that?


How is this different to Octoparse and other tools in the space?


Honestly I haven't tried Octoparse, but doing a quick check, I can see it is way more expensive.


I was hoping for something that would allow me to load a page to be scraped, mark the things I'm interested in, and have it help with the selection expressions.


I'm experimenting with something like you described, but there are lots of edge cases and it needs a bit of polishing. For now, I made a Chrome extension that helps with tinkering with CSS selectors.


Website looks good; but it begs for a video


Looking good, but I would not underline text if it's not a link.


Yup. Had to tap them a few times on my phone just to realize that they weren't tappable.


Noted! I'll improve this


Thanks, I will do one!


Basic question before I can recommend this to my boss: can it scrape G2? (or any other page behind CF)


It could, depending on the proxy I get from my provider. I'm currently working on adding the ability to select a higher-quality proxy for difficult-to-scrape websites, and on adding captcha solvers as well.

But if I have to be honest, I can not guarantee it at the present time.

This app is the side-project I started 2 months ago; it's evolving fast, but I still need to add some key features for enterprise customers.


scrapeninja.net's /scrape-js endpoint scrapes company pages of G2 without big trouble (with the "us"/"eu" proxy geo in their online sandbox: https://scrapeninja.net/scraper-sandbox ). They also have /scrape, which is much faster because it does not bootstrap a real browser, and it bypasses the Cloudflare TLS fingerprint check: https://pixeljets.com/blog/bypass-cloudflare/


What are G2 and CF?


G2 is a software (as a service) comparison website: https://www.g2.com/

CF is Cloudflare, which offers an anti-scraping protection for websites (among other things): https://www.cloudflare.com/


Thank you!


Curious how well ChatGPT could write CSS selectors for these no-code scrapers.

If running the model was cheaper I would even say run the whole page through ChatGPT and ask it to format the information on the page for you.


Already tried that. Not worth it with the current token limitations.

But I'm currently tinkering with other applications of AI and MrScraper. Will ship something when I think it's reliable enough.


I haven't looked at many web scrapers. Do most use CSS selectors? What about xpath? I always found xpath to be much more powerful.
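To illustrate the XPath point: predicates on a child's text are awkward or impossible in plain CSS, but trivial in XPath. A stdlib sketch using ElementTree's (limited) XPath subset; full XPath 1.0 needs lxml. The markup here is made up for the example.

```python
# XPath lets you select a node by the text of a sibling/child, e.g.
# "the price of the product named 'Plate'" - something CSS selectors
# can't express.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<products>"
    "<product><name>Mug</name><price>9</price></product>"
    "<product><name>Plate</name><price>12</price></product>"
    "</products>"
)

# [name='Plate'] is a predicate on a child element's complete text.
price = doc.find(".//product[name='Plate']/price").text
print(price)  # 12
```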


Too expensive. Make it 50x cheaper and you'd be competitive with python scrapy + rotating residential proxies.


Looks good.

Last time I needed it I just used Python + Selenium

Obv at the downside of needing to code but I don't mind that.


What about scraping PDFs on the web? Anyone have suggestions for that?


What would the expression language for that even look like, given that PDFs are basically "canvas as a service"?

I'm aware there are pdf2html toys, and sometimes they do something reasonable, but just like with web scraping the markup of the target matters a lot and so, too, would the "markup" of the target PDF

Further, just as it is often better to go after the underlying XHR instead of trying to de-React the HTML, I'll offer that, when possible, it would be far better to identify the upstream source of the information in the PDF than to try to reverse engineer a PostScript VM.


Cool, looks like a more polished version of datagrab.io


I love the name.


Thanks buddy!


Just how many web scraping tools do we need? It seems like every month there's a new web scraper on HN. Is that really such a common task that we need dozens of tools?


If you don't need it don't use it.


I wasn't going to, still I think it's a fair question.


Rule of thumb: If your comment amounts to “I’m not interested in this”, consider not posting it, and just look at something else instead.


My comment was very clearly asking why dozens of different web scrapers are needed.


Wasn't a fair question.



