Here is a question that can throw a generative AI company right into a spin: “What content has been used to train your models?” Some prefer to dodge the question and others bullishly front it out entirely, but whether an AI company has scraped content for its own business purposes without permission is a thorny issue.
At best, you're likely to get a mealy-mouthed explanation of “curated datasets”, and at worst, a polemic about whether everything on the internet is basically fair game.
Now a document obtained by 404media appears to show that part of the data used to train Runway’s latest AI video generation tool, Gen-3, may have come from the YouTube channels of thousands of popular media companies, including Pixar, Netflix, Disney and Sony.
404media doesn't go into detail about how the document was obtained, nor could it confirm that every video mentioned within was used to train Gen-3. Even so, it's potentially an insight into the kind of practices that an AI company might use to scrape copyrighted material to train its models.
A former Runway employee spoke to 404media about the methodology involved. The 14 spreadsheets contained within the leaked document are said to feature terms like “beach” or “rain”, with the names of Runway employees next to them.
According to the source, those names belonged to employees tasked with finding videos or channels related to these keywords, who would then use a YouTube video downloader tool via a proxy to scrape them from the site without being blocked by Google.
It isn’t just YouTube content that appears to have been scraped, either. One spreadsheet contains 14 links to non-YouTube sources, including a link to a website dedicated to streaming popular cartoons and animated movies, with thousands of copyright complaints logged against it.
Essentially, pirated media appears to have been at least under consideration as training data, if not directly scraped and used.
404media actually went one step further. It tried using Gen-3 to generate video with prompts containing keywords based on the terms found in the spreadsheet, and was able to create clips that appeared to be very much in the same style as the associated content.
Runway was itself part-funded by Google, among others, so scraping content without permission from creators on its platforms, if true, is likely to land it in significant hot water. Never mind the potential wider legal repercussions.
Still, while the issue of AI content theft is a thorny one, the model does appear to have problems of its own. Ars Technica tried creating some videos recently with Gen-3 Alpha, and it gave a cat a pair of human hands. I'm not sure what content was used to train that particular version of the model, but I'd suggest that whatever the methodology used here, it could do with some work either way.