As artificial intelligence (AI) rapidly becomes a cornerstone of the digital world, a new breed of bot is causing ripples in the tech community. These bots, identified by Vercel CEO Guillermo Rauch, extract information from powerful AI models such as OpenAI’s GPT-4. This new form of data scraping, which Rauch dubs "web scraper 2.0", is a growing problem that has already left one developer with an unexpected $35,000 bill from OpenAI.
The rise of these bots can be attributed to a combination of factors: an insatiable demand for quality data, the rising cost of using top-performing models, and attempts to bypass restrictions in certain countries. Rauch explains that the bots scrape the outputs of models like GPT-4 and use those outputs as fresh training data for models of their own. This practice, known as "model distillation", can in theory let a new model absorb much of what an existing model knows, posing a significant threat to the AI industry.
AI Bots Scrape Data from Models, Cause Financial Havoc
Vercel CEO Guillermo Rauch has identified a new breed of digital bot that scrapes data from AI models, particularly OpenAI’s GPT-4, with costly consequences for developers.
The New Breed of Bot
Vercel, a startup that helps developers integrate AI models into their websites, has uncovered this new breed of bot. Dubbed "web scraper 2.0" by Rauch, these bots are designed to extract intelligence from AI models. Speaking with venture capitalists Elad Gil and Sarah Guo on the No Priors podcast, Rauch explained that the bots seek free access to models like GPT-4, creating significant problems for developers and AI companies.
The Threat of Model Distillation
As AI technology advances, the demand for quality data has skyrocketed. AI models need this data for training, and without it their performance suffers. Rauch points out that the scarcity of quality data is driving the creation of these bots: by scraping the outputs of powerful models like GPT-4 or Llama 2, one can generate fresh training data for one's own models, a practice known as "model distillation". Rauch warns that distillation could allow competitors to replicate a model from its high-quality outputs. This is why top AI companies such as OpenAI, Google, and Anthropic ban the use of their models' outputs to train other models.
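To make the mechanics concrete, here is a minimal sketch of what such a distillation pipeline might look like. The API endpoint and model name are real OpenAI details, but the prompt list, output file, and overall script are hypothetical illustrations, not anyone's actual scraping code, and running something like this against a provider's API would violate the terms of service described above.

```typescript
// distill.ts -- hypothetical sketch: harvest model outputs as training data.
// Requires Node 18+ (built-in fetch); assumes OPENAI_API_KEY is set.
import { appendFileSync } from "node:fs";

const prompts = [
  "Explain model distillation in two sentences.",
  "Summarize the tradeoffs of usage-based pricing.",
]; // a real scraping run would use a very large prompt list

async function ask(prompt: string): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const data = (await res.json()) as any;
  return data.choices[0].message.content;
}

async function main() {
  for (const prompt of prompts) {
    const completion = await ask(prompt);
    // One JSONL line per prompt/completion pair: the shape most
    // fine-tuning pipelines consume as training data.
    appendFileSync(
      "distilled.jsonl",
      JSON.stringify({ prompt, completion }) + "\n"
    );
  }
}

main();
```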
The Impact: High Financial Costs
Another incentive for this bot-driven approach is the high cost of using top-performing models. Companies like OpenAI impose rate limits, restricting the number of questions even paying users can ask per minute or per day. To circumvent these restrictions, malicious actors deploy bots that funnel their questions through other people's applications, leaving those developers to foot the bill. Rauch recounts the story of one developer whose application was used by bots as a proxy to access the AI model, leaving her with a $35,000 bill from OpenAI. After months of disputing the charges, OpenAI eventually refunded her.
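The attack pattern hinges on an application exposing an unauthenticated route that relays user input straight to the model on the developer's API key. The handler below is a hypothetical illustration of that exposed pattern (the port and request shape are assumptions, not the victim's actual code): every request a bot sends here is billed to whoever owns the key.

```typescript
// vulnerable-proxy.ts -- hypothetical sketch of the exposed pattern.
// No authentication, no rate limit: any bot that discovers this URL
// gets free GPT-4 access, billed to the developer's OPENAI_API_KEY.
import { createServer } from "node:http";

createServer(async (req, res) => {
  let body = "";
  for await (const chunk of req) body += chunk;
  const { prompt } = JSON.parse(body || "{}");

  // Relays the caller's prompt straight to OpenAI on the developer's key.
  const upstream = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4",
      messages: [{ role: "user", content: prompt }],
    }),
  });

  res.setHeader("Content-Type", "application/json");
  res.end(JSON.stringify(await upstream.json()));
}).listen(3000);
```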
Bypassing China’s AI Blockade
China’s recent blocking of top AI models like ChatGPT and GPT-4 is another driver for this new breed of bot. Bots are being used to collect outputs from these models, bypassing the country’s censorship. With hundreds of thousands of AI applications deployed on Vercel’s platform each month, there is no shortage of targets. To combat this, Vercel offers technology to help developers protect against these attacks.
The Threat to SaaS Businesses
Rauch also warns that this new breed of bot threatens SaaS businesses. Companies that sell per-seat subscriptions at a flat rate for unlimited use are particularly vulnerable: if bots hammer their product, they end up paying model providers for output that no paying customer ever sees. Rauch predicts a shift towards usage-based charging, such as per-token or per-query pricing, to counter this. Vercel already integrates rate limits for developers, capping the number of times a user can query AI models per day and thwarting bot attacks.
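The article doesn't detail Vercel's implementation, so the following is a minimal sketch of how a per-user daily cap can work, assuming an in-memory counter, an illustrative 50-query limit, and a caller-supplied user identifier. A production version would back the counters with shared storage such as Redis so the limit holds across server instances.

```typescript
// rate-limit.ts -- minimal sketch of a fixed-window daily quota per user.
// The 50-query limit and in-memory Map are illustrative assumptions.
const DAILY_LIMIT = 50;
const counters = new Map<string, { day: string; count: number }>();

export function allowQuery(userId: string): boolean {
  const today = new Date().toISOString().slice(0, 10); // e.g. "2024-01-31"
  const entry = counters.get(userId);

  if (!entry || entry.day !== today) {
    // First query of the day (or a new day): reset the window.
    counters.set(userId, { day: today, count: 1 });
    return true;
  }
  if (entry.count >= DAILY_LIMIT) return false; // quota exhausted
  entry.count += 1;
  return true;
}

// Usage inside a request handler (identifier choice is up to the app,
// e.g. an authenticated user ID or, more weakly, the client IP):
//   if (!allowQuery(userId)) { /* respond with HTTP 429 */ }
```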
Takeaways
The emergence of this new breed of bot underscores the evolving challenges in the AI industry. As AI models become more capable, so do the strategies for exploiting them, and the AI community must develop robust safeguards to protect both developers and businesses from ruinous bills. The shift towards usage-based pricing may prove an effective countermeasure. As AI continues to advance, the risks will keep evolving with it, and they deserve close attention.