OpenAI’s Web Crawler and FTC Missteps
<p>With AI adoption rising steeply, it is becoming increasingly important for data professionals to think about data sourcing. The initial wave of high-performing LLMs was trained using a common yet controversial tactic, data scraping, and this practice has been <a href="https://thisisunpacked.substack.com/p/data-scraping-in-the-spotlight-language-models" rel="noopener ugc nofollow" target="_blank">in the spotlight lately</a>, prompting lawsuits and questions about data ownership. This article explains the legal concepts behind data scraping and how regulators are addressing the problem (spoiler: not so effectively).</p>
<p><strong><em>Note from Towards Data Science’s editors:</em></strong><em> While we allow independent authors to publish articles in accordance with our </em><a href="https://towardsdatascience.com/questions-96667b06af5" rel="noopener" target="_blank"><em>rules and guidelines</em></a><em>, we do not endorse each author’s contribution. You should not rely on an author’s works without seeking professional advice. See our </em><a href="https://towardsdatascience.com/readers-terms-b5d780a700a4" rel="noopener" target="_blank"><em>Reader Terms</em></a><em> for details.</em></p>
<p>Last week, OpenAI (maker of ChatGPT) officially announced its <a href="https://platform.openai.com/docs/gptbot" rel="noopener ugc nofollow" target="_blank">web crawler</a>, a piece of software that scrapes content from websites across the internet so that the content can be used for AI model training. The existence of the crawler is not surprising: several legitimate web crawlers exist today, including Google’s crawler, which indexes the web for search. However, this is the first time OpenAI has explicitly announced its crawler’s existence and provided a mechanism for websites to opt out of being scraped.</p>
<p>Note that websites are <strong>opted in by default</strong>, i.e. the crawler will scrape your site unless you explicitly add a rule to your site’s <code>robots.txt</code> file asking it not to. Opt-in and opt-out defaults are sticky and often determine what the majority behavior is, because most people don’t make the effort to change them. This is the same reason <a href="https://developer.apple.com/documentation/apptrackingtransparency" rel="noopener ugc nofollow" target="_blank">Apple’s iOS 14 privacy changes</a>, which required apps to ask users to opt in to tracking, have had a major impact on the digital advertising industry.</p>
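<p>To make the opt-out mechanism concrete, here is a minimal sketch of how a site owner might check what a given <code>robots.txt</code> tells the crawler. It assumes the crawler identifies itself with the user agent token <code>GPTBot</code> and honors a standard <code>Disallow</code> rule, as described on the OpenAI page linked above. The helper name <code>is_allowed</code>, the bot name <code>SomeOtherBot</code>, and the <code>example.com</code> URL are hypothetical and only for illustration.</p>

```python
from urllib.robotparser import RobotFileParser

# Assumed opt-out rule: block the GPTBot user agent from the whole site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""

# Hypothetical page on a hypothetical site, used only for illustration.
PAGE_URL = "https://example.com/article"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# With the rule in place, GPTBot is asked to stay away.
gptbot_allowed = is_allowed(ROBOTS_TXT, "GPTBot", PAGE_URL)

# The rule names GPTBot only, so other crawlers are unaffected.
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", PAGE_URL)

# With no rules at all (the default), GPTBot is allowed: sites are opted in.
default_allowed = is_allowed("", "GPTBot", PAGE_URL)

print(gptbot_allowed, other_allowed, default_allowed)
```

<p>The last check illustrates the point about defaults: a site that does nothing is treated as having consented. Note also that <code>robots.txt</code> is a request rather than an enforcement mechanism, so this only reflects what a well-behaved crawler would do.</p>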