The large language models that power ChatGPT and other chatbots get their mastery of language from essentially two things: massive amounts of training data scraped from the web, and massive amounts of compute power to learn from that data. That second ingredient is very expensive, but the first ingredient, so far, has been completely free.
However, creators, publishers, and businesses increasingly see the data they put on the web as their property. If some tech company wants to use it to train its LLMs, they want to be paid. Just ask the Associated Press, which struck a training data licensing deal with OpenAI. Meanwhile, X (née Twitter) has taken steps to block AI companies from scraping content on the platform.
Now individual consumers also seem to understand the unfairness and risk of unwittingly contributing their data to train AIs. The recent debacle over Zoom’s terms of use bears this out. When word spread that Zoom’s terms of service seemed to allow the company to train its AI models with user data, the user backlash came hard and fast. The company was forced to relent, assuring users that it wouldn’t use audio, video, or chat content from Zoom calls to train models without users’ explicit consent. (Not everyone is convinced.)
With everybody from consumers to corporations now aware of LLM makers' practices, the training-data free-for-all is likely coming to an end. For OpenAI, the loss of free data may hurt, but it will hurt its competitors more. OpenAI already seized a mountain of training data from the web long before most people knew it was happening, and has used it to build the market-leading general-purpose LLMs. The company seems to acknowledge the end of the free-data party with the recent announcement of its own web crawler, GPTBot, which it openly tells website operators how to block. Other LLM makers will be under pressure to provide a similar option.
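The blocking mechanism itself is nothing exotic. According to OpenAI's published guidance, a site can shut GPTBot out with two lines in its robots.txt file (the user-agent string below is the one OpenAI documents):

User-agent: GPTBot
Disallow: /

Robots.txt is an honor system, though; compliance is up to the crawler, not the site owner.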
There is a clear parallel between OpenAI and Facebook here. For years Facebook vacuumed up users' personal data for its advertising engine while being vague and evasive about what data was being grabbed and how it was being used. Facebook was able to deflect attention from these practices long enough to gain a critical mass of users and advertisers, which gave it an unassailable lead in the social networking market.
OpenAI may already have a large enough lead in the LLM market that a dearth of free training data will only prolong its dominance.