Our tech overlords are upset that they actually might have to pay people for their labor (boldface mine):
Andreessen Horowitz is warning that billions of dollars in AI investments could be worth a lot less if companies developing the technology are forced to pay for the copyrighted data that makes it work.
The VC firm said AI investments are so huge that any new rules around the content used to train models “will significantly disrupt” the investment community’s plans and expectations around the technology, according to comments submitted to the US Copyright Office.
“The bottom line is this,” the firm, known as a16z, wrote. “Imposing the cost of actual or potential copyright liability on the creators of AI models will either kill or significantly hamper their development.”
…A16z argued that the “only practical way” LLMs can be trained is via huge amounts of copyrighted content and data, including, “something approaching the entire corpus of the written word” and “an enormous cross-section of all of the publicly available information ever published on the internet.”
The VC firm has invested in scores of AI companies and startups based on its “expectation” that all this copyrighted content was and will remain available as training data through “fair use,” with no payment required.
Well, a16z, it sounds like you made a risky bet, and, now, you might lose that bet. I’m sure the system of capitalism that exists in the U.S. in Year 2023 of Our Gritty will rectify that through market mechanisms (WE MAKE THE FUNNY!).
But what this really tells us is that the economic gains of LLMs will be marginal and are dependent on free data.
To unpack that last sentence, let’s consider the burgeoning field of using genome sequencing to predict which antibiotics can kill an infection (really! It is!). To recap for the non-cognoscenti: currently, we determine which antibiotics will kill your infection by culturing the infection and then trying to kill the culture with various antibiotics (‘antibiotic susceptibility testing’). Genomic approaches have the potential to be faster, and they also provide other information about the pathogen’s ability to cause disease–the genomic information can be (and currently is) used to understand the epidemiology of an outbreak. For example: are three people whose infections are killed by the same drugs actually infected by the same thing (an outbreak), or are these three different infections that just happen to be susceptible to the same drugs?
One obvious approach is some form of machine learning or AI–using the genome sequence as input and generating a predicted antibiotic susceptibility profile as output. Most companies that manufacture antibiotic susceptibility testing equipment are exploring this approach: a few years ago, Biomérieux spent $100 million to sequence lots of bacterial pathogens. I’m guessing the antibiotic susceptibility testing side of it cost millions more–and for someone else to replicate the antibiotic susceptibility testing work from scratch might cost $50 million–so let’s, for discussion purposes, assume that the cost of those data (which is not necessarily the same as the worth of the data) is $150 million.
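To make the genome-in, profile-out idea concrete, here’s a deliberately toy sketch: scan a genome for known resistance-marker subsequences and emit a susceptibility profile. The marker sequences and drug pairings below are invented for illustration–real systems use curated resistance databases and trained models, not a hardcoded lookup like this.

```python
# Toy sketch: predict an antibiotic susceptibility profile from a genome
# sequence by scanning for resistance-marker subsequences.
# Marker sequences and drug pairings are made up for illustration only.

RESISTANCE_MARKERS = {
    "ampicillin": "ATGAGTATTCAACAT",    # hypothetical bla-like marker
    "tetracycline": "ATGAAATCTAACAAT",  # hypothetical tet-like marker
}

def predict_susceptibility(genome: str) -> dict:
    """Return {drug: 'resistant' or 'susceptible'} for each known marker."""
    return {
        drug: "resistant" if marker in genome else "susceptible"
        for drug, marker in RESISTANCE_MARKERS.items()
    }

# Example: a genome carrying only the (made-up) ampicillin marker
genome = "GGGG" + RESISTANCE_MARKERS["ampicillin"] + "CCCC"
profile = predict_susceptibility(genome)
```

The point of the sketch is the shape of the product, not the method: whoever holds the paired genome/susceptibility data gets to build the lookup (or the model), which is exactly why those data are worth so much.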
Now imagine Marc Andreessen were to approach Biomérieux and ask them to hand over their $150 million data stash for free. If he were lucky, he would be laughed out of the room, because those data are valuable–if Biomérieux or any other company pulls this off, it could be worth billions–every year.
If LLMs were actually as valuable as everyone claims, then it would be worthwhile to pay authors. Obviously, if you get the data for free, then the expected benefits of LLMs don’t have to be very high. But if (hopefully, when) one factors in the cost of data generation–which is to say, writing–then the gains from LLMs have to be much higher than currently envisioned.
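The arithmetic behind that argument can be made explicit with a back-of-the-envelope sketch. Every dollar figure here is an invented placeholder, not an estimate:

```python
# Back-of-the-envelope sketch of the break-even argument.
# All dollar figures are invented placeholders, not real estimates.

expected_annual_gain = 500  # hypothetical yearly value of an LLM product ($M)
free_data_cost = 0          # the a16z assumption: training text is "fair use," i.e., free
paid_data_cost = 400        # hypothetical cost of actually paying the writers ($M)

net_if_free = expected_annual_gain - free_data_cost
net_if_paid = expected_annual_gain - paid_data_cost

# With free data, even modest expected gains clear the bar;
# once writers are paid, the gains must be much larger to clear the same bar.
```

Whatever the real numbers are, the structure is the same: a business case that only works at a data cost of zero is a claim about the data, not about the product.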
This is an unusual situation for our techbro overlords: it’s not the coders who create most of the value, it’s the data generators who provide the real value and represent the real cost. Unlike much of the data Silicon Valley deals with (consumer information provided for free by customers, or bought very cheaply), this kind of data acquisition is expensive. And if your potential product’s gains can’t cover the costs of the data generators, that’s a bad business model.
On the other hand, having a bunch of LLMs that sound like a bunch of those nineteenth century forsooth and verily reply guy assholes would be kind of hilarious, so maybe the actual available free data does have some utility…