Robots.txt


Almost every website has a file where you can see what it doesn't want you to scrape. It's called robots.txt.

Robots.txt lives at the root of the site (e.g. ibm.com/robots.txt) and tells Google and other crawlers which URLs they can crawl and index. For example, check out IBM's robots.txt page. It tells crawlers what not to crawl via Disallow rules, and it addresses specific crawlers with User-agent lines.
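
To make that concrete, here's an illustrative sketch of what a robots.txt file looks like. The paths and crawler names are made up for the example, not taken from IBM's actual file:

    # Rules for every crawler
    User-agent: *
    Disallow: /admin/
    Disallow: /search

    # Rules for one specific crawler
    User-agent: Googlebot
    Disallow: /drafts/

A crawler that respects the file matches its own name against the User-agent lines and skips any path listed under the Disallow rules that apply to it.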

Why do I know this and why does it matter? When you work at a hedge fund, or any company that wants to scrape data from the web, you'll eventually run into robots.txt. Some firms obey it and only crawl what the site allows; others ignore it. There is endless case law and plenty of lawsuits you can read up on about the topic, with LinkedIn involved in one of the more famous cases.
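
If you want to be one of the firms that obeys it, checking is straightforward. Here's a minimal sketch using Python's built-in urllib.robotparser; the crawler name and target URL are just placeholders:

    from urllib import robotparser

    # Download and parse the site's robots.txt
    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.ibm.com/robots.txt")
    rp.read()

    # Ask whether our (hypothetical) crawler may fetch a given URL
    crawler_name = "MyCrawler"  # placeholder user-agent string
    url = "https://www.ibm.com/products"  # placeholder URL
    if rp.can_fetch(crawler_name, url):
        print("Allowed to crawl:", url)
    else:
        print("Disallowed by robots.txt:", url)

Note that nothing technically enforces this check; the file is only a request, which is exactly why the legal questions below get interesting.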

I decided to write about robots.txt because, with all of the AI tools being developed and everyone trying to get their hands on as much data as possible, the rules around web scraping and the rights to data will only get more contested in the years ahead. It will be interesting to see whether we get more clarity on the need to fully obey robots.txt, or whether we reach a point where anything on the web is treated as free and open.