Is Copyright Eating AI?

Marc Andreessen famously said that software is eating the world. But the latest and greatest software trend–generative AI–is in danger of being swallowed up by copyright law. Like a cruise ship heading for a scary iceberg, AI is in trouble, and the problems are mostly below the surface.

We now have a pair of lawsuits claiming that GitHub’s Copilot model is stealing open source code from its authors, and that companies using Stable Diffusion or other models (including Stability AI, DeviantArt, and Midjourney) are stealing images from visual artists. Both lawsuits are being prosecuted by Matthew Butterick (best known as the author of Typography for Lawyers) along with the Joseph Saveri Law Firm, a class action firm.

The Copilot lawsuit is widely touted in the press as a copyright infringement case, but in fact it doesn’t claim copyright infringement. It does claim a litany of other wrongs, such as removal of copyright management information, breach of contract, and fraud. The Stable Diffusion suit, by contrast, is indeed a copyright infringement suit. More importantly, and sadly, these lawsuits are probably a bellwether of more to come.

The Copilot suit is ostensibly being brought in the name of all open source programmers. Yes, that’s right: people crusading in the name of open source–a movement intended to promote freedom to use source code–are now claiming that a neural network designed to save programmers the trouble of reinventing the wheel on routine programming tasks is de facto unlawful. The open source movement is wonderful in many ways, but its tendency toward legal maximalism to “protect” open source is sometimes disappointing.

The Stable Diffusion suit alleges copyright infringement, stating that "The resulting image is necessarily a derivative work, because it is generated exclusively from a combination of the conditioning data and the latent images, all of which are copies of copyrighted images. It is, in short, a 21st-century collage tool." That characterization is both the essence and the conclusion of the lawsuit, and one with which many AI designers would disagree.

So, all neural network developers, get ready for the lawyers, because they are coming to get you.

Fair Use or "Fair & Ethical"?

The crux of the problem is that US copyright law, despite many landmark cases, still gives us little or no guidance on how the fair use defense applies. The Oracle v. Google case, the biggest fair use case of this century, ambled along a lengthy and astonishingly expensive road to a Supreme Court decision. As Larry Lessig famously quipped, "fair use is the right to hire a lawyer," and the Supremes proved that true by issuing an opinion that provided little guidance beyond the specific facts of the case.

However you may feel about Google, we are lucky that it has had the determination and resources to spend astronomical legal fees defending fair use–for books, thumbnail photos, news headlines, and software interface specifications. Users of the web benefit from that. If the AI industry avoids this iceberg, it will be partly because of Google’s historical unwillingness to roll over in fair use cases.

Let’s hope Microsoft (which funded OpenAI and owns GitHub) has Google-like intestinal fortitude and the money to win this battle. But if Oracle v. Google is any measure, the answer might not come for 10 years, by which time the neural network industry may have been litigated out of existence–or, worse yet, limited to the large players who can fund an expensive legal defense. For startups, having a lawsuit hanging over their heads is usually a death knell, between expensive legal bills siphoning off development resources and investors shying away from the risk.

Tell Me What You Want, What You Really, Really Want

One perplexing aspect of these lawsuits–and likely of all that will follow in their footsteps–is what best practices the plaintiffs would actually want the AI industry to adopt going forward. Butterick says his class action cases are "another step toward making AI fair & ethical for everyone." But other than netting a hefty fee for the lawyers who bring the suits, what is the endgame, exactly?

Both lawsuits ask for permanent injunctive relief, which would essentially shut down the use of the accused models, but that is part of the playbook for litigation and probably not the result they would prefer. And even for most people who sympathize with the lawsuits, that is not the preferred endgame. Though there are lots of memes out there about Skynet, most people do not want AI to shut down, and if they do, it’s not because of copyright law.

One possible best practice would be to allow authors to specifically opt out of the use of their works for ML training. (In fact, Stability has suggested this approach.) This kind of approach can work when technical development bumps up against the limits of copyright law. For example, there is a "do not index" mechanism (robots.txt) for web sites that is broadly honored by large-scale search engines. But such a convention would face a prodigious backlog of existing works to tag, and for software authors, prohibiting ML training would be antithetical to the Open Source Definition. So that probably won’t work.
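
For illustration only, an AI-training opt-out might borrow the familiar robots exclusion syntax. The User-agent and Disallow directives below are standard robots.txt format; the AI-specific crawler name is a hypothetical placeholder, since no such convention has been standardized.

```
# robots.txt -- standard robots exclusion syntax.
# "HypotheticalAITrainingBot" is a made-up crawler name used only to
# illustrate what a training opt-out convention might look like.

User-agent: *
Disallow:                    # ordinary search indexing stays allowed

User-agent: HypotheticalAITrainingBot
Disallow: /                  # ask ML-training crawlers to skip this site entirely
```

Like robots.txt itself, such a convention would be purely voluntary: it works only to the extent that the companies doing the crawling choose to honor it.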

Another possibility is compensation for those who wrote the original material used to train the models. Over the years, there have been various attempts to compensate authors for numerous small contributions to copyrightable works. This is primarily an information problem, and those who try to solve it usually propose a blockchain-based approach, lest payment transaction costs outweigh the compensation. None has succeeded yet.

Even if there were a technical solution to the information problem, it would be difficult to allocate compensation fairly across a broad set of creators. In the music business, performing rights organizations like ASCAP and BMI amalgamate the power to grant blanket music performance licenses to consumers of music, such as restaurants that play music over their sound systems. In fact, these rights amalgamation organizations enjoy a limited safe harbor from antitrust law, because they facilitate what would otherwise require millions of small, individual licensing deals.

But this will not work for generative AI. Performing rights organizations reward their members roughly according to the popularity of their songs. For generative AI, it would be functionally impossible to track which works had been used, because the output is not, in fact, a copy of the originals, nor even a collage–but a new work synthesized from a model trained on them. If compensation is not tied to the works actually used, we would likely see a spate of garbage images thrown into the mix to grab some of the proceeds. It would be easier to set up a general grant fund for artists than to track the contributions of millions of artists to a single AI-generated image.

The problem is that neural network models, and their outputs, are not copies of the original works. A model is a set of weights–probabilities, in effect–trained on thousands or even millions of data points. And at least as of now, it is not possible to look at ML output and determine which inputs, nodes, and weights created it. ML, for now, is mostly a "black box" whose inputs and outputs are impossible to connect. In fact, this lack of explainability has already been tagged as a social issue: if you build a model that discriminates in its output, how do you audit it? Eventually, the ML industry may solve this problem, but for now it means there is usually a disconnect between inputs and outputs, and that probably means copying could never be inferred in a way that would reliably allocate compensation to the authors of the inputs. That, in turn, should mean there is no copyright infringement, but the lawsuits posit otherwise.
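
A toy sketch can make that point concrete. The miniature "model" below is purely hypothetical–it has nothing to do with how Copilot or Stable Diffusion actually work–but it shows how training leaves behind only aggregate weights, not copies of the inputs or any record of which input contributed what.

```python
import random

# Toy illustration only: fit a one-parameter "model" y = w * x to many
# training points by gradient descent. The point is not realism, but that
# the finished model is just a number blended from every training example.

random.seed(0)
training_data = [(x, 2.0 * x + random.gauss(0, 0.1)) for x in range(1, 1001)]

w = 0.0                # the entire "model" is this single weight
learning_rate = 1e-7

for x, y in training_data:
    prediction = w * x
    gradient = 2 * (prediction - y) * x   # derivative of squared error w.r.t. w
    w -= learning_rate * gradient         # nudge w, then discard the example

print(f"learned weight: {w:.4f}")         # roughly 2.0, blended from 1,000 points

# Generating an "output" uses only w. Nothing in w records which training
# point shaped it, which is why tracing an output back to specific inputs
# (for attribution or compensation) is so difficult.
print(f"output for input 42: {w * 42:.2f}")
```

Real generative models have millions or billions of such weights rather than one, which only deepens the attribution problem.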

Moreover, there is a notice problem. Each of the lawsuits alleges a claim under the Digital Millennium Copyright Act (DMCA), 17 USC §1202, which prohibits removal of copyright management information ("CMI") such as copyright notices. But even assuming that some license notice, or copyright notice, would have to be communicated whenever an AI output was generated, how exactly would that happen? Would each resulting image require thousands or millions of notices? Even now, conventional users of open source code struggle greatly with the management and delivery of license notices–anyone who has worked on open source compliance knows how difficult that can be. But these lawsuits make that problem look like child’s play.

If AI Dies, Who Wins?

Both of the Butterick suits are being brought as class actions–a type of lawsuit popularized in the US and still relatively unusual elsewhere. You may have gotten notices from class action lawyers asking you to opt in to a settlement class to which you belong. If you’re like me, you toss them out, because your reward for joining the class will probably be a coupon or a princely settlement of $20.

And so, who benefits from class action suits? Well, class action lawyers. When you hear that a class action suit has resulted in $6 million in damages, the lawyers probably get about $2 million (one-third). Because a class can consist of thousands of members (or, in the case of the Butterick suits, probably millions), the damages allocated to individual class members can be tiny indeed. Sometimes, the lawyers actually get bigger payouts than all the plaintiffs combined. The US class action model has been strongly criticized as a vehicle for enriching plaintiffs’ lawyers that provides relatively little real compensation to the plaintiffs they represent. Class action proponents use populist rhetoric and anecdotes to justify their suits, but empirical studies are relatively few and sometimes damning. (See, for example: https://instituteforlegalreform.com/research/do-class-actions-benefit-class-members/ and https://www.tortreform.com/news/study-class-action-lawyers-often-take-more-money-from-settlements-than-class-members/)

Don’t Die

If the AI industry is to survive, we need a clear legal rule that neural networks, and the outputs they produce, are not presumed to be copies of the data used to train them. Otherwise, the entire industry will be plagued with lawsuits that stifle innovation and enrich only plaintiffs’ lawyers. Matthew Butterick has stated that these lawsuits are an attempt to set a precedent in favor of artists, because the law is unclear. Lack of clarity causes people to act conservatively to avoid liability, and that stifles innovation. Given that the courts are unlikely to come up with a common-law rule in this decade, clarity probably needs to come in the form of a legislative amendment to copyright law. Unless it comes soon, the generative AI industry may be in trouble.

It’s unclear whether Butterick’s suits are mostly a publicity stunt and a ploy for plaintiffs’ lawyers to reap a windfall, a selfless attempt to provide equity for authors, or something in between. But one thing is sure: they will spark a cottage industry for plaintiffs’ lawyers, impose crippling expenses on AI developers, and thwart innovation in the generative AI field. As the tech industry celebrates the frothy emergence of machine learning in a time of economic doom and gloom, let’s hope this nascent field doesn’t sink because of the copyright iceberg looming ahead.

Note: Since I started preparing this article for publication yesterday, an additional case has been threatened in London by Getty Images against Stability AI. Because Getty is the single owner of so many images, and because the case is outside the US, it is not a class action and may be more likely to result in a settlement.

Update February 6, 2023: The other shoe drops: Getty filed a complaint in Delaware against Stability AI.

Also, this blog post is a personal opinion, and nothing I have written here should be attributed to any of the parties involved.

Author: heatherjmeeker

Technology licensing lawyer, drummer, dancer

Posted on January 19, 2023
