Chatbots are relatively old by technology standards, but the newest generation, led by OpenAI’s ChatGPT, Microsoft’s Bing, and Google’s Bard, is proving far more capable than its predecessors, and not always for positive reasons.
Recent breakthroughs in AI development have already raised concerns about misinformation, disinformation, plagiarism, and machine-generated malware. According to experts, how much of a privacy problem generative AI poses for the average internet user largely depends on how these bots are trained and how much we end up interacting with them.
AI chatbots and Common Crawl
To mimic human-like interactions, AI chatbots are trained on enormous amounts of data, a significant portion of which is drawn from repositories such as Common Crawl, which has scoured the open web and collected petabytes of data over the years. “These models train on large datasets of publicly available data on the internet,” says Megha Srivastava, a PhD student in Stanford’s computer science department and a former AI resident at Microsoft Research. Although ChatGPT and Bard use a “filtered” portion of Common Crawl’s data, the sheer size of the model makes it impossible for “one to fully examine and sanitize the data,” Srivastava said.
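Neither OpenAI nor Google publishes the filtering it applies to crawl data, but a minimal sketch of the kind of sanitization pass Srivastava is describing might look like the Python below. Everything in it, including the regexes, the `sanitize` and `build_training_corpus` names, and the 20-word cutoff, is a hypothetical illustration; its very simplicity also shows why fully scrubbing petabytes of crawled text is impractical, since crude patterns like these miss far more personal data than they catch.

```python
import re

# Hypothetical example only: the actual filtering pipelines behind ChatGPT or
# Bard are not public. These patterns are a minimal, illustrative PII pass.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")  # loose match for phone-like strings


def sanitize(record: str) -> str:
    """Redact obvious contact details from one crawled text record."""
    record = EMAIL_RE.sub("[EMAIL]", record)
    record = PHONE_RE.sub("[PHONE]", record)
    return record


def build_training_corpus(records):
    """Yield redacted records that pass a very basic quality check."""
    for record in records:
        cleaned = sanitize(record.strip())
        if len(cleaned.split()) >= 20:  # drop very short fragments
            yield cleaned


if __name__ == "__main__":
    sample = [
        "Contact me at jane.doe@example.com or +1 (555) 123-4567 for the full "
        "archive of forum posts we saved before the site went offline last year."
    ]
    print(next(build_training_corpus(sample)))
```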
It is entirely possible that carelessly published or poorly secured data, sitting in far corners of the internet where the average user would never find it, has slipped unnoticed into a training set and could be repeated by a chatbot later. And it is by no means impossible for a bot to hand out someone’s real contact information. Bloomberg columnist Dave Lee reported on Twitter that when someone asked ChatGPT to chat on the encrypted messaging platform Signal, the bot shared his exact phone number. He stressed that while such an interaction is likely an edge case, the amount of information these learning models can access is remarkable.
In addition, these chatbots can fold the data you give them back into their training. In other words, when you share information with a bot, it may later surface that information to someone else, just as happened to Samsung employees…
“It is unlikely that OpenAI would want to collect certain information, such as health data, and attribute it to individuals to train their models,” David Hoelzer of the SANS Institute security organization told Engadget. “But could it be there by mistake? Definitely…”
In short, AI chatbots can surface personal information about us that they picked up elsewhere, and they can also use what we share with them for their own training. That means anyone else using these bots could, accidentally or deliberately, end up with access to our data.
While OpenAI has not disclosed what measures it takes to protect data privacy in ChatGPT or how it handles personally identifiable information that may be embedded in its training sets, ChatGPT itself says it is “programmed to follow ethical and legal standards that protect users’ privacy and personal information” and that it “does not have access to personal information unless it is provided (to it)”.
Google says it has built similar “barriers” into Bard to prevent personally identifiable information from being shared during chats. Bard does not have its own privacy policy; instead, it is covered by the comprehensive privacy policy that applies to other Google products.
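Neither company describes how those barriers actually work, but conceptually they amount to screening a model’s reply before it reaches the user. The sketch below is purely hypothetical (the `guard_reply` name, the patterns, and the refusal message are illustrative, not Google’s or OpenAI’s real mechanism) and only shows the general idea of an output-side check:

```python
import re

# Hypothetical output-side guardrail. Neither OpenAI nor Google documents its
# real implementation; this only illustrates screening a generated reply
# for obvious personal identifiers before showing it to the user.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def guard_reply(reply: str) -> str:
    """Refuse to return a reply that appears to contain personal contact details."""
    if any(pattern.search(reply) for pattern in PII_PATTERNS.values()):
        return "I can't share personal contact details."
    return reply


print(guard_reply("Sure, you can reach Dave at +1 415 555 0100 on Signal."))
# Prints the refusal message instead of the generated reply.
```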