Data Scraping Continues Despite Regulation, Quivr Founder Raises Alarm


In the era of digitalization, personal data has become a new asset class, a fact recognized by the World Economic Forum over a decade ago. This data, which tech companies collect and treat as a valuable commodity, has become the lifeblood of the algorithms that power platforms from Instagram to Netflix. These algorithms use it to deliver personalized content and targeted ads to users, often with uncanny precision. However, as we delve deeper into this data-driven digital landscape, we’re forced to confront uncomfortable truths about data privacy and the extent of data scraping.

Generative AI technology now repackages scraped data and presents it back to the users it was harvested from, renewing discussions about data privacy. According to Joe Miller, co-founder of Quivr, data scraping is a longstanding practice that often went unnoticed because users received little feedback about it. However, with the advent of Large Language Models (LLMs) like ChatGPT, the results of data scraping are becoming more visible, and so are the potential risks to personal reputation and identity. Miller warns that our words can be taken out of context and used in ways we did not intend, posing a reputation risk that hasn’t been encountered before. As we navigate this new terrain, it’s time to reevaluate our engagement with the internet and take personal responsibility for protecting our identities.

The New Age of Data Privacy in the Era of AI

Personal data has long been a valuable commodity for tech companies. In fact, the World Economic Forum classified personal data as a new asset class over a decade ago. This data feeds the AI-powered algorithms that are ubiquitous on social media platforms and drives targeted advertising. Now, generative AI tools such as ChatGPT repackage this data and present it back to the people it was scraped from, raising fresh questions about data privacy and data scraping.

Data Scraping and Personal Data Visibility

Data scraping is not a new concept. However, the visibility of personal data has changed with the advent of Large Language Models (LLMs) like ChatGPT. Previously, the most tangible result of data scraping was personalized ads. Now, as Quivr’s co-founder Joe Miller points out, the scraped data itself becomes visible, and users’ words can be taken and used in contexts they did not anticipate. This presents a reputation risk that users have not encountered before.

Miller highlighted an additional issue: the leakage of personally identifiable information (PII) when AI companies indiscriminately scrape the internet for data to train their models. Publicly shared personal information can inadvertently surface in the models’ outputs, leading to potential privacy breaches.

The Role of Personal Responsibility and Regulation

Miller believes that this new understanding of data scraping risks should prompt people to reconsider how they interact with the internet. He argues that individuals should take more personal responsibility for protecting their identities. While he advocates for penalties for the scraping and use of PII, he acknowledges that it is often difficult for individuals to trace when their data has been directly used, which amounts to a kind of "perfect crime" by AI companies.

Miller doesn’t foresee a future where scraping public websites becomes illegal, but he does suggest that scraped data should be scrubbed of PII.
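To make the idea of scrubbing PII from scraped text concrete, here is a minimal Python sketch. It is an illustration only and not a method described by Miller or used by Quivr: the function name scrub_pii, the two regular expressions, and the [EMAIL] and [PHONE] placeholders are all assumptions chosen for this example. It covers only email addresses and US-style phone numbers, whereas real PII removal would also need to handle names, addresses, ID numbers, and many other formats.

```python
import re

# Hypothetical patterns for illustration; they catch only simple email
# addresses and US-style phone numbers, not PII in general.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(
    r"(?<!\w)(?:\+?\d{1,3}[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}(?!\w)"
)


def scrub_pii(text: str) -> str:
    """Replace matched emails and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    sample = "Contact jane.doe@example.com or 555-123-4567."
    print(scrub_pii(sample))
```

Pattern matching of this kind is cheap to run over large scraped datasets, but it misses anything it was not written to recognize, so it should be read as a starting point rather than a guarantee that the data is free of PII.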

Open Source Versus Centralized Models

The debate extends to the structure of AI models: open source versus closed and centralized. Closed-source models, such as those behind ChatGPT, are less transparent, which makes them harder to regulate. Open-source models, on the other hand, are harder to control once released. Miller asserts that the root of the problem lies in how much personal data is publicly available on the internet and how easy it is to acquire that data and train LLMs on it.

Cleaning Up Your Digital Footprint

Miller emphasizes the importance of personal responsibility in maintaining a cleaner digital footprint. He encourages internet users to exercise caution and thoughtfulness when posting on social media or providing information online. He co-founded Quivr, a social media service where users keep full ownership of their data, which is protected, and can choose to monetize it. That is a stark contrast to the prevalent practice in which having one’s data scraped is the price of using a platform.


As the landscape of data privacy continues to evolve with advancements in AI technology, the onus of protecting personal data seems to be shifting towards individuals. The discussions around data scraping, visibility, and the structure of AI models underscore the importance of understanding and managing our digital footprints. While regulation and penalties may play a part, it’s clear that personal responsibility and thoughtful online behavior are becoming increasingly critical in this new era.

Crive - News that matters