Wikipedia:Wikipedia Signpost/2024-09-04/In the media

File:Shizuki and Jun Kasai on June 3, 2009.jpg
DPS-fan Magyar
CC BY-SA 3.0
In the media

AI is not playing games anymore. Is Wikipedia ready?

Portland pol's publicly-paid profile: Part II

See previous coverage: "Portland politician spends $6,400 in taxpayer dollars to 'spruce up his profile on Wikipedia'" about the article Rene Gonzalez (politician)

Oregon's 2020 Ballot Measure 107 allows campaign finance disclosure regulations in the state. The Gonzalez campaign may have violated such regulations, in addition to Gonzalez authorizing irregular expenditures of taxpayer funds not allocated to campaigning. The alt-weekly Portland Mercury said "It's unclear which fund the money for the Wikipedia edits came from, and why the money didn't instead come from Gonzalez's mayoral campaign funds."

Two Portland-based television stations had stories on an investigation into the expenditures. KOIN, the CBS affiliate, said that Gonzalez claims "the money went to train staff on how to follow Wikipedia standards", not to conduct impermissible campaigning; KGW, the NBC affiliate, also carried a full story about the case, titled "Commissioner Rene Gonzalez now the subject of Portland campaign finance investigation". - B

Is Wikipedia ready to play the game of Jum-AInji?

A transformer might think this image depicts "The Transformer", but it does not (it does, however, depict an instance of Japanese hardcore).

In a recent article for The New Yorker, titled "Was Linguistic A.I. Created by Accident?" (paywalled), Stephen Marche focuses on the role of chance and good luck in the research that led to the landmark 2017 AI paper "Attention Is All You Need", which introduced the transformer architecture. The paper was originally supposed to focus on using the transformer to make English-to-German translations.

Instead, as part of the AI model's training process, the Google team asked the transformer to read Wikipedia entries for two days, covering almost half of the platform's pages. The model was then asked to create five new Wikipedia-style articles from scratch, all about made-up subjects called "The Transformer": a fictitious Japanese hardcore punk band formed in 1968, a fictitious video game, a fictitious 2013 Australian sitcom, a fictitious studio album by an alternative metal group called Acoustic, and even a fictitious science-fiction novel. At first reading, the articles the transformer produced on these made-up topics all looked like real Wikipedia articles: they were almost too good, "filled with inconsistencies, but [...] also strikingly detailed", suggesting that AI had made a jump of twenty or more years of progress:

Why was a neural network designed for translating text capable of writing imaginative prose from scratch? "I was shocked, blown away," (researcher Aidan) Gomez recalled. "I thought we would get to something like this in twenty years, twenty-five years, and then it just showed up." The entries were a kind of magic, and it was unclear how that magic was performed.
— Was Linguistic A.I. Created by Accident?, Stephen Marche

The historical bond between Wikipedia and machine-learning based natural language processing goes back even further. The first attempts to provide the encyclopedia with text generated using artificial neural networks trace back to at least 2009.

But artificial intelligence and large language models are not just derived from Wikipedia; they are important topics for discussion and policy about the platform's future.

The rapid rise of ChatGPT has attracted the most interest, sparking dozens of research efforts to apply LLMs to the creation and improvement of Wikipedia articles, among other tasks; the STORM system prototype is the latest example. The Wikimedia Foundation has taken note of AI's progress, for example by expanding its Machine Learning team and even testing an experimental ChatGPT plugin between July 2023 and February 2024. The Signpost itself has included DALL-E-generated images in various articles. On the other hand, somewhat in the style of Jumanji, the more invested we get in the AI game, the more traps we discover: without proper checks and balances, machine-generated content can threaten the integrity of Wikipedia if the number of unsourced and fictitious articles keeps increasing and causes more problems with COI-related material and disinformation.

The Spanish newspaper El País recently interviewed Wikimedian and Wikimedia España member Miguel Ángel García, along with the WMF's Director of Machine Learning, Chris Albon (in Spanish, free registration might be required). García, who joined Wikipedia in 2006, noted how many newly-registered users introduce themselves by "[pasting] a giant text, apparently well-structured and well-developed", which turns out to be poorly-written and redundant after a closer look. Luckily, the platform is usually able to handle this material through mechanisms such as speedy or proposed deletion, as well as the continuous efforts of its volunteers, which have also been acknowledged by Albon. (Everyone interested can give a helping hand by joining initiatives such as the WikiProject AI Cleanup.)

However, both expressed concerns over the long-term impact of automatic content on the encyclopedia: while García is mainly worried about the incorporation of "pseudo-media" hosting bot-generated articles as sources on Wikipedia - a phenomenon that could actually be mitigated through reports at the noticeboard - Albon took a brief detour from his usually optimistic view on AI tools, explaining that "if there's a detachment between the places where knowledge is created, like Wikipedia, and the places where it is accessed, like ChatGPT, we're at risk of losing a generation of volunteers". He also said that LLMs providing the platform with poorly-sourced or unreferenced content could "introduce an unprecedented amount of disinformation" on the Internet, since "users will not be able to easily distinguish accurate information from [AI] hallucinations"; quite an ironic situation to find ourselves in, considering that chatbots such as ChatGPT and Google Gemini are being fed with thousands of Wikipedia articles as part of their training schedules.

Titled "ENC-AI-CLOPEDIA. AI is mining the sum of human knowledge from Wikipedia. What does that mean for its future?", a separate interview by Sherwood News (the media arm of trading platform Robinhood Markets) also featured Albon, together with his colleague Lane Becker, Senior Director of Earned Revenue at the Wikimedia Foundation and president of its for-profit subsidiary Wikimedia LLC, which runs Wikimedia Enterprise.

The interviewer first confronted them with "Data from Similarweb [which] shows that traffic to Wikipedia has been in decline" since about 2020. In response, Albon pointed to the Foundation's own (presumably more precise) pageview and unique devices data, with Becker asserting that "We have not seen a significant drop in traffic on Wikimedia websites that can directly be attributed to the current surge in AI tools." (This conclusion is somewhat in contrast with two recent academic papers; see our coverage: "ChatGPT did not kill Wikipedia, but might have reduced its growth", "'Impact of Generative AI': A 'significant decrease in Wikipedia page views' after the release of ChatGPT".)

However (similar to Albon in the El País interview), Becker voiced "concern [...] about the potential impact that these AI tools could have on the human motivation to continue creating and sharing knowledge. When people visit Wikipedia directly, they are more likely to become volunteer contributors themselves. If there is a disconnect between where knowledge is generated (e.g. Wikipedia) and where it is consumed (e.g. ChatGPT or Google AI Overview), we run the risk of losing a generation of volunteers." (Not mentioned, but presumably on Becker's mind as well, was the fact that these visitors are also, via Wikipedia's well-known donation banners, the Foundation's most important source of revenue by far.)

Asked "How do you feel about practically every LLM being trained on Wikipedia content?", Becker stressed that "we welcome people and organizations to extend the reach of Wikipedia's knowledge. Wikipedia is freely licensed and its APIs are available for free to everyone, so that people all over the world can use, share, add to, and remix Wikipedia content." However, "We urge AI companies to use Wikimedia's free APIs responsibly and include recognition and reciprocity for the human contributions that they are built on, through clear and consistent attribution. They should also provide pathways for continued growth and maintenance of the human-created knowledge that is used to train them" - such as "Clearly attributing knowledge back to Wikipedia", but also, for "high-volume commercial reusers of Wikipedia content to use our opt-in paid for product, Wikimedia Enterprise." Becker shared that Wikimedia Enterprise's total revenue (i.e. not accounting for its own staffing and other costs) "for FY 2022-23 was $3.2 million - representing 1.8% of the Wikimedia Foundation's total revenue for the period." However, he declined to disclose how much of that came from Google (one of the few publicly known customers, another one being yep.com).

- S, O, H

See also in this issue's News and notes: "AI policy positions of the Wikimedia Foundation"

In brief

Red clover for Clovermoss
See previous Signpost coverage about the controversy surrounding this article, as well as the discussion about the reliability of the Anti-Defamation League on the Israeli-Palestinian conflict, here and here.


