Web Scraping Technology: Lots of Interest
The Initial Data Offering (IDO) community is a place for data enthusiasts to discover new datasets daily.
The mission is to build a community of data enthusiasts and curate high-quality, unique datasets for businesses, researchers, and organizations worldwide.
If there is one thing I think is becoming more and more important, it’s web scraping, or web harvesting, technology. With the explosion in data-hungry AI companies, everyone is looking to train models, and every business around the world is looking for insights, so having in-house web scraping technology is becoming more and more valuable. A recent Wired article goes into how LLM-based services like Perplexity are web scraping and, in this case, potentially not obeying some of the unofficial rules of web scraping, like robots.txt.
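For readers unfamiliar with robots.txt, here is a minimal sketch of how a well-behaved scraper might check it before fetching a page, using Python's standard-library `urllib.robotparser`. The rules, the `ExampleBot` and `OtherBot` user agents, and the example.com URLs are all made up for illustration; a real scraper would download the site's actual robots.txt rather than use an inlined string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
# A real scraper would fetch https://<site>/robots.txt instead.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# ExampleBot may fetch public pages but not anything under /private/.
print(is_allowed(parser, "ExampleBot", "https://example.com/articles/1"))
print(is_allowed(parser, "ExampleBot", "https://example.com/private/report"))

# Any other bot falls under the wildcard rule, which disallows everything.
print(is_allowed(parser, "OtherBot", "https://example.com/articles/1"))
```

Nothing technically enforces these rules. The check is voluntary, which is why robots.txt counts as an "unofficial" rule and why ignoring it is controversial.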
Building a web scraper seems easy at first. But as websites change, sophisticated sites fight bots, and more and more custom scrapes are needed, it becomes a gigantic pain to maintain and manage. Very few companies can manage this internally without an army of people. The companies offering this as a service are charging more and more money, and the demand is only getting stronger.
I used web scraping to build datasets when I worked in the hedge fund space. We did this both internally and via third-party providers. What we realized is that maintaining these scrapers over time was not something we wanted to do with our own team. Many times we set up offshore teams to tackle these tasks, and eventually we used third parties. Monitoring jobs while adding new custom scrapes just became too much for a small team of engineers to own.
Modern web scraping startups are getting off the ground leveraging AI as part of their technology stack. I think this is a very intriguing area of development where we will likely see some early winners emerging in the next 12-18 months.
After managing a team that used a handful of third-party web scraping companies, I have seen both the good and the bad. Many of these companies are scaling, and many are also pivoting to become both scraping service providers and builders of datasets they can sell to many firms.
The industry is evolving and I think we will see some consolidation in the next year and at the same time some big winners given the data demands of AI. I also know of a company with great web scraping technology that is currently looking to be acquired.
We saw Vertical Knowledge, a well-established player in the space, get acquired, and I think this is just the beginning.
I asked a question on LinkedIn: who are the leaders in the web scraping space? The overwhelming feedback was Nimble, which happens to be an investment from one of our fund managers in the Social Leverage Fund of Funds. The other one that was mentioned a lot was Bright Data. The topic was definitely of interest, as I saw over 10k impressions and 41+ comments.
Needless to say, web scraping is a hot topic.
