Contributing text from LLMs to public domain wikis
Published under the IndieWeb category.
There is an ongoing discussion in the IndieWeb community about the extent to which use of LLMs should be permitted on our community wiki. Our wiki, in development for over 10 years, is an authoritative source on many technical topics pertaining to the web. The wiki is licensed under a Creative Commons Zero (CC0, public domain) dedication. Every page that lets you submit content displays a disclaimer informing you that you must have the right to release your contribution into the public domain. That is to say, you cannot reuse copyrighted material; pasting content verbatim from another site whose content you do not own is not allowed.
Our wiki states the following above any buttons to contribute:
Please note that all contributions to IndieWeb are considered to be released under a CC0 public domain dedication (see IndieWeb:Copyrights for details). If you do not want your writing to be edited mercilessly and redistributed at will, then do not submit it here. You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource.
There are quality concerns about using LLMs to contribute text to wikis -- what if the content an LLM generates is wrong? -- and these concerns have generated opposition to using LLMs. We care deeply about the quality of our content.
A new concern recently emerged that I want to discuss further: what if you use an LLM to generate content and publish it to the wiki, but it turns out the LLM has "regurgitated" (reproduced, verbatim) information from another source? OpenAI says the following about regurgitation (also known as "memorisation") in their recent blog post on the lawsuit filed against them by the NY Times:
Memorization is a rare failure of the learning process that we are continually making progress on, but it's more common when particular content appears more than once in training data, like if pieces of it appear on lots of different public websites. So we have measures in place to limit inadvertent memorization and prevent regurgitation in model outputs. We also expect our users to act responsibly; intentionally manipulating our models to regurgitate is not an appropriate use of our technology and is against our terms of use.
Despite investment in reducing memorisation, if there is a chance that an LLM copies copyrighted material verbatim -- or close to verbatim -- I assume that material cannot be contributed to the public domain (I am not a lawyer). You would be contributing the words of another author to the wiki, even if you did not know that to be the case.
The text above pertains to OpenAI's models. But what about other LLMs? What about Bard? Or open-source LLMs?
As far as we know, no community members are contributing LLM-generated text, verbatim, to the wiki. With that said, the above raises an interesting question: if you use an LLM to help you write text and the text turns out to be copyrighted, what happens next?
The public domain licensing of content is important to our community. Readers should know that they can copy any snippet of text or code they want and use, re-use, remix, or re-distribute that text or code on their site. I do not have an answer to any of the questions I have posited in this post, but one thing is for sure: I would love to learn more, and to see more discussion on the topic of how copyrighted material returned by an LLM can be licensed. Please email me at readers [at] jamesg [dot] blog if you can help advance this discussion (or blog on your own personal website and send me a link!).
Tagged in IndieWeb.
Respond to this post by sending a Webmention.
Have a comment? Email me at readers@jamesg.blog.
