Mustafa Suleiman, Microsoft’s chief executive of artificial intelligence, said this week that machine learning companies can extract most content posted online and use it to train neural networks because it is essentially “free software.”
Shortly thereafter, the Center for Investigative Reporting File a lawsuit against OpenAI and its largest investor, Microsoft, “for using the nonprofit news organization’s content without permission or compensation.”
This follows in the footsteps of eight newspapers Filed a lawsuit against OpenAI and Microsoft In April, The New York Times accused Facebook of allegedly misappropriating content, something the same newspaper had done four months earlier.
Then there are the two brilliant authors Filed a lawsuit against OpenAI and Microsoft In January, they claimed they trained AI models on authors’ works without permission. Also, in 2022, several anonymous developers filed a lawsuit against OpenAI and GitHub based on allegations that the organizations used publicly published programming code to train generative models in violation of the terms of their software license.
Asked in Interview Speaking with CNBC’s Andrew Ross Sorkin at the Aspen Ideas Festival on whether AI companies have effectively stolen the world’s intellectual property, Solomon acknowledged the controversy and tried to distinguish between content people put online and content backed by corporate copyright holders.
“I think in terms of content that already exists on the open web, the social contract for that content since the 1990s has been fair use,” he said. “Anyone could copy it, recreate it, reproduce it with it. This was free software, if you wanted to. That was the understanding.”
Suleiman pointed out that there is another category of content, which is material published by companies that have lawyers.
“There’s a separate category where a website, publisher or news organization has explicitly said, ‘Don’t delete or crawl me for any reason other than to index me, so others can find this content,'” he explained. “But that’s a gray area. I think this will make its way through the courts.”
That’s an understatement. While Suleiman’s comments seem certain to anger content creators, he’s not entirely wrong — it’s not clear where the legal lines lie when it comes to training AI models and the models’ output.
Most people who post content online as individuals have infringed on their rights in some way by accepting the terms of service agreements offered by major social media platforms. Reddit’s decision to license its users’ posts to OpenAI wouldn’t happen if the social media giant believed its users had a legitimate right to the memes and data it publishes.
The fact that OpenAI and other companies that make AI models are striking content deals with major publishers shows that a strong brand, deep pockets, and legal team can bring big tech operations to the table.
In other words, those who create and publish content online are making free software unless they retain, or can attract, lawyers willing to challenge Microsoft and its ilk.
in paper In a study published by SSRN last month, Frank Pasquale, a professor of law at Cornell Technology and Cornell Law School in the US, and Hao-chen Sun, an associate professor of law at the University of Hong Kong, explore the legal uncertainty surrounding the use of copyrighted data to train AI and whether courts will find such use fair. They conclude that AI must be addressed at the policy level, because current laws are inadequate to answer the questions that now need to be addressed.
“Given the great uncertainty about the legality of AI providers’ use of copyrighted works, lawmakers will need to formulate a bold new vision to rebalance rights and responsibilities, just as they did in the wake of the development of the Internet (leading to the Digital Millennium Copyright Act of 1998),” they argue.
The authors point out that continued uncompensated harvesting of creative works threatens not only writers, composers, journalists, actors, and other creative professionals, but also AI itself, which will eventually be deprived of training data. The authors predict that people will stop making work available online if it is used only to power AI models that reduce the marginal cost of content creation to zero and deprive creators of any potential reward.
This is the future that Suleiman foresees. “The information economy is about to change radically because we are able to reduce the cost of knowledge production to zero in terms of marginal cost,” he says.
All of that free software you may have helped create can be yours for a small monthly subscription fee. ®