Wikipedia:Using neural network language models on Wikipedia

This idea is in the brainstorming stage.
Feel free to add new ideas; improve, clarify and classify the ideas already here; and discuss the merits of these ideas on the talk page.

With the rise of machine learning, discussions about Wikipedia and AI models are becoming more and more heated. As of December 2022, with the release of ChatGPT for free to the public, AI has shown its potential to either massively improve or disrupt Wikipedia. It is clear that research is needed in order to inform discussions surrounding potential AI policies, so I made this page to catalog my observations around ChatGPT and its potential use based on its capabilities. And yes, this page is written entirely by human editors.

NOTICE: Don't use neural networks to generate content; use them to assist you in creating content. Especially in the neural network context, confidence in the result does not mean validity.

Proposed guidelines

Based on my research below, here are my proposed guidelines on how to align neural network models to our purpose of building an encyclopedia. Some of the guidelines are obvious from common sense, but I think it's worth it to write them down.

You may not ask neural networks to write original content or find sources, as these neural networks don't know what is right and wrong. Adding this kind of content would jeopardize compliance with Wikipedia's WP:OR and WP:RS policies. Even if the content is heavily edited by humans, seek alternatives that don't use the neural network's original content.
You may use these neural networks as a writing advisor, i.e. asking for outlines, asking how to improve the paragraph, asking for criticism for the text, etc. However, you should be aware that the information it gives to you can be unreliable and flat out wrong. Use due diligence and common sense when choosing whether to incorporate the neural network's suggestion or not.
You may use these neural networks for copyediting and paraphrasing, but note that they may not properly detect grammatical errors or keep the key information intact. Use due diligence and heavily edit the response from the neural network.
Use due diligence when crafting prompts for neural networks. Prompts designed for Wikipedia should use natural sentences, be as descriptive as possible, and include keywords such as "encyclopedic", "keep the meaning intact", etc. to minimize the risk of the AI adding original content.
You are responsible for making sure that using a neural network will not be disruptive to Wikipedia. Therefore, you must state in the edit summary whether the edit uses a neural network and what it is being used for.

Potential uses

Monitoring and flagging edits

It may be possible to detect at least simple readability problems or troll edits by having a bot patrol articles and their histories and try to spot unusual edits. The prompt does not have to produce plain English; in some cases you can instruct LLMs to output JSON, which would allow for some automation as well.

This means a bot could periodically scan articles of interest and automatically flag an edit if it exceeds a certain threshold of suspicion.
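The flagging idea above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not an existing bot: the JSON shape ({"suspicion": ..., "reason": ...}), the function names, and the 0.8 threshold are all hypothetical, and the actual call to a language model is deliberately left out.

```python
import json

# Hypothetical threshold; a real bot would need to tune this value.
SUSPICION_THRESHOLD = 0.8


def build_prompt(diff_text):
    """Ask the model for a machine-readable verdict on one edit."""
    return (
        "You are reviewing a Wikipedia edit. Reply with JSON only, in the form "
        '{"suspicion": <number from 0 to 1>, "reason": "<short text>"}.\n\n'
        "Edit diff:\n" + diff_text
    )


def parse_verdict(raw_reply):
    """Parse the model's reply; return None if it is not the JSON we asked for."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    score = data.get("suspicion")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    if not 0 <= score <= 1:
        return None
    return {"suspicion": float(score), "reason": str(data.get("reason", ""))}


def should_flag(raw_reply, threshold=SUSPICION_THRESHOLD):
    """Flag for human review only when the reply parses and exceeds the threshold."""
    verdict = parse_verdict(raw_reply)
    return verdict is not None and verdict["suspicion"] > threshold
```

A reply that is not valid JSON is treated as "no verdict" rather than as a flag, since models do not always follow formatting instructions. A flag here is only meant to put an edit in front of a human reviewer, not to revert anything automatically.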

Planning an article

It is no surprise that the bot can give coherent answers since it is based on the earlier GPT-3 model. As many have noted, original content from AI models should not be imported directly to Wikipedia due to sourcing and accuracy concerns. I am very impressed, however, by the fact that the bot knows about our internal policies and gives a reasonable outline of how a Wikipedia article may be structured. It seems ChatGPT uses Wikipedia's policy pages in addition to articles for its dataset.

Based on the results, AI models seem to be a very powerful brainstorming tool, and with prompt engineering they allow an impressive amount of refinement to the plan. AI can also be a great tool as a pointer to potential sources and can remind editors of Wikipedia's content policies (NPOV, RS, etc.). Even though original content from AI is not suitable for import into Wikipedia, editors can use it as inspiration for research ideas. In the future, when Abstract Wikipedia becomes a thing, AI tools can be a massive help for organizing information at the planning stage of an article. This research is somewhat limited because the article SpaceX Starship already existed when the AI was trained.

Copyediting paragraphs

AI copyediting of Wikipedia text as of 2022 can slightly reduce the work copyeditors need to do. However, human supervision is critical when using such tools. This task relies heavily on prompt engineering for the AI to give satisfactory results. I settled on the prompt "Can you copyedit this paragraph from Wikipedia while still keeping the tone and the information as intact as possible:" followed by the paragraph without citations in plain text. There seems to be room for improvement in the prompt, as ChatGPT may occasionally give text that has run-on sentences or grammatical errors, but other than that, the text is usually clearer after a run by the AI.
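Preparing such a prompt could look something like the sketch below. It is only an illustration: the function names are hypothetical, and the pattern assumes citations appear as <ref>...</ref> or self-closing <ref ... /> tags, so other citation styles (for example footnote templates) would pass through untouched.

```python
import re

# The prompt quoted in the paragraph above.
PROMPT_PREFIX = (
    "Can you copyedit this paragraph from Wikipedia while still keeping "
    "the tone and the information as intact as possible:"
)

# Self-closing <ref ... /> tags are tried first, then <ref ...>...</ref> pairs.
REF_PATTERN = re.compile(
    r"<ref\b[^>]*?/>|<ref\b[^>]*>.*?</ref>",
    re.DOTALL | re.IGNORECASE,
)


def strip_citations(wikitext):
    """Remove <ref> citations and tidy up any doubled spaces left behind."""
    text = REF_PATTERN.sub("", wikitext)
    return re.sub(r"[ \t]{2,}", " ", text).strip()


def build_copyedit_prompt(paragraph):
    """Return the copyedit prompt followed by the citation-free paragraph."""
    return PROMPT_PREFIX + "\n\n" + strip_citations(paragraph)
```

Because the citations are removed before the text is sent, they have to be re-attached by hand afterwards, which is one more reason the output should never be pasted back in unreviewed.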

Even though the AI is conservative about removing information and details, the text's length usually decreases by quite a bit as it removes redundant phrases. The AI is also good at reordering phrases to make the text more coherent, but at the cost of grammatical errors and obscured meaning. In more developed articles, the AI seems to make more minor fixes to the text and is less inclined to cut content. In my opinion, ChatGPT can be used on Wikipedia as a coherence checker, as long as care is taken to make sure that no critical information is lost.

I^[who?] have published the following AI-generated texts^[which?] on Wikipedia after heavily modifying them. Overall, I believe that ChatGPT can indeed reduce the amount of copyediting work required, although perhaps not to the extent that some people may think. It's more accurate to view the AI response as a 'second opinion' on what to edit, rather than an authoritative answer.

Shorten a bloated section

Based on User:JPxG's optimistic result using ChatGPT to condense plot summaries, I tried my hand at condensing sections in general, since some articles in Wikipedia:Vital articles have bloated sections. I found ChatGPT to be prone to the "garbage in, garbage out" problem; if the text contains a lot of junk and not enough useful details, then it may try to repackage that junk in the result, despite being explicitly told not to do so.

Translation

A 2023 Translatathon found that chatbots and Bing Translate, which use large language models, could produce rough drafts of articles in another language. Translations should not be published without review.

Potential pitfalls

Requesting citations

Use great caution when asking ChatGPT for specific sources. Its neural model will likely respond with very persuasive looking citations, but they generally should not be relied upon without detailed examination. Sometimes, the bot will list a real author along with a fictitious article or book title that looks authentic but is not, and sometimes both the author and the title are invented. On the other hand, major authors and works are known, so if you ask about Chomskyan linguistics, it's going to know about Aspects of the Theory of Syntax and other works.

This also applies, though for a different reason, to Perplexity AI and other "conversational" AI search engines, such as Microsoft Bing with ChatGPT integration, YouChat, and other models that claim to answer complex questions: they cite unreliable sources (some if not most of which are listed in the perennial sources list), Wikipedia itself, and/or even blacklisted sites.

Further examples

Greater risk

Templates, modules and external software. LLMs can write code that works great, often without any subsequent modification. As with any code (including stuff you found on Stack Exchange), you should make sure you understand what it's doing before you execute it: bugs and errors can cause unintended behavior. Common sense is required; as with all programming, you should not put large chunks of code into production if you haven't tested them beforehand, don't understand how they work, or aren't prepared to quickly reverse your changes.
Copyediting existing article text. The same pitfalls as with LLM-assisted content creation exist here: instead of pure copyediting LLMs may change the meaning and introduce errors. Experienced editors may ask an LLM to improve the grammar, flow, or tone of pre-existing article text. Rather than taking the output and pasting it directly into Wikipedia, you must compare the LLM's suggestions with the original text, and thoroughly review each change for correctness, accuracy, and neutrality.
Summarizing a reliable source. This is inherently risky, due to the likelihood of an LLM introducing original research or bias that was not present in the source, as well as the risk that the summary may be an excessively close paraphrase, which would constitute plagiarism. You must proactively ensure such a summary complies with all policies.
Summarizing the article itself (lead expansion). Lead sections are nothing more than concise overviews, i.e. summaries, of article body content, and text summarization is one of the primary capabilities LLMs were designed for. However, pasting LLM output to expand the lead is still inherently risky because it may introduce errors and bias not present in the body.^[a] It's better to only use an LLM to generate ideas for lead expansion, and create the actual improvements yourself.
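For the copyediting case above, the comparison between the original text and the LLM's suggestion can be made less error-prone with a word-level diff. The sketch below uses Python's standard difflib module; the function name and the choice of comparing whitespace-separated words are assumptions made for illustration, and the result is only a list of things to review, not a judgment on whether a change is acceptable.

```python
import difflib


def word_level_changes(original, suggestion):
    """Return (operation, original words, suggested words) for each change."""
    old_words = original.split()
    new_words = suggestion.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "equal" spans are unchanged text; everything else needs human review.
        if tag != "equal":
            changes.append(
                (tag, " ".join(old_words[i1:i2]), " ".join(new_words[j1:j2]))
            )
    return changes
```

For example, comparing "The rocket is 120 metres tall" with "The rocket is 150 metres tall" yields a single entry, ("replace", "120", "150"), which is exactly the kind of silent change of meaning a reviewer needs to catch.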

Lesser risk

Despite the aforementioned limitations of LLMs, it is assumed that experienced editors may be able to offset LLM deficiencies with a reasonable amount of effort to create compliant edits for some scenarios:

Tables and HTML. Because their training data includes lots of computer code (including wikitext and HTML), LLMs can do things like modify tables (even correctly interpreting verbal descriptions of color schemes into a reasonable set of HTML color codes in fully formatted tables). If you do this, take care to make sure that the code you get actually renders a working table, or template, or whatever you've asked for, and that the numbers in the table actually check out (mistakes happen occasionally).
Generating ideas for article expansion. When asked "what would an encyclopedia entry on XYZ include?", LLMs can come up with subtopics that an article is not currently covering. Not all of these ideas will be valid or have sufficient prominence for inclusion, so thoughtful judgment is required. As stated above, LLM outputs should not be used verbatim to expand an article.
Asking an LLM for feedback on an existing article. Such feedback should never be taken at face value. Just because an LLM says something, does not make it true. But such feedback may be helpful if you apply your own judgment to each suggestion.
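For the "Tables and HTML" item above, one cheap way to check that the numbers still check out is to compare the numbers that appear before and after the LLM's rewrite. This is a rough sketch with a hypothetical function name; it treats any run of digits (optionally with "." or "," separators) as a number, so it will also pick up digits inside things like color codes or years, and it says nothing about whether a number ended up in the right cell.

```python
import re
from collections import Counter

# Digits, optionally grouped with "." or "," (e.g. 1,234 or 3.14).
NUMBER_PATTERN = re.compile(r"\d+(?:[.,]\d+)*")


def number_differences(original, revised):
    """Return (missing, added): numbers lost from and introduced into the text."""
    before = Counter(NUMBER_PATTERN.findall(original))
    after = Counter(NUMBER_PATTERN.findall(revised))
    missing = sorted((before - after).elements())
    added = sorted((after - before).elements())
    return missing, added
```

If both lists come back empty, the same numbers occur the same number of times in both versions; anything else is a prompt to look at the table more closely by hand.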

Countermeasures

One of the main concerns about using these language models is that editors may not be able to tell whether a text is original or was written by AI.

Detection tools

There are various tools available online which promise to detect whether a given text was generated by AI. However, as of February 2023, these were regarded as "definitely not good enough" to use for important decisions, due to frequent false positives and false negatives (and they are often intended for outdated models like 2019's GPT-2).

Notes

^ It should especially not be assumed that a prompt of "write/expand the lead section of X Wikipedia article" will generate a genuine summary; LLM-based applications which can't look things up on the internet (the norm as of early 2023) may not know what the exact content of the article is in order to be able to summarize it, and even if it has been a part of their corpus, they do not seem to function in such a way that they can isolate the whole article from the rest of the corpus, in order to derive the output exclusively from one article's content.