In a recent revision of its privacy policy, Google disclosed that its various AI services, such as Bard and Cloud AI, may be trained on publicly available data that the company has gathered from the internet.
Christa Muldoon, a Google spokesperson, stated to The Verge, “Our privacy policy has long been transparent that Google uses publicly available information from the open web to train language models for services like Google Translate. This latest update simply clarifies that newer services like Bard are also included. We incorporate privacy principles and safeguards into the development of our AI technologies, in line with our AI Principles.”
The policy update, dated July 1st, 2023, states that Google uses information to improve its services and to develop new products, features, and technologies that benefit users and the public. The company may also “use publicly available information to help train Google’s AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities.”
The updated policy specifies that Google uses “publicly available information” to train its AI products, but it does not explain how the company intends to keep copyrighted material out of the data pool. Many publicly accessible websites have policies that prohibit data collection or web scraping for the purpose of training large language models and other AI tools. The policy also leaves open how the collected data is processed so that it does not contribute to failures in the resulting AI systems.
Moreover, uncertainty over whether the fair use doctrine extends to this type of application has triggered lawsuits and prompted lawmakers in some countries to propose stricter laws regulating how AI companies gather and use their training data.
Meanwhile, Gannett, the largest newspaper publisher in the US, is suing Google and its parent company, Alphabet, alleging that advancements in AI technology have helped the search giant monopolize the digital ad market. Products like Google's AI search beta have also been criticized as “plagiarism engines” that deprive websites of traffic. At the same time, Twitter and Reddit have taken significant measures to prevent other companies from freely harvesting their data, although these measures have degraded the user experience on their platforms.