Wikipedia talk:Wikipedia Signpost/2024-10-19/Recent research

Discuss this story

  • "We have therefore been encouraged to omit any identifying information in the specific pages we discuss". A commendable approach to ethics (even if, as noted, not perfect). Unlike some other cases I can think of... --Piotr Konieczny aka Prokonsul Piotrus| reply here 12:21, 19 October 2024 (UTC)Reply
    I would encourage you, though, to look beyond your personal experience (with a controversial paper whose central subject was the long-term impact of your and several other editors' activity on a particular historical topic area) and also to consider the wider impact on open science practices here.
    To be clear, my main problem with the statement quoted in the review is not that they e.g. leave out the specific user name of that editor who created five articles on English Wikipedia, detected by both tools as AI-generated, on contentious moments in Albanian history (btw, the paper goes into further detail about the administrative actions taken against that user). I might have done the same. Rather, it is that they take this as an excuse not to adhere to the good practice (which has become more prevalent in much of quantitative Wikipedia research over the years) of publicly releasing the data that their paper's central conclusion is based on, which would include the output of the detectors for particular articles (without user names).
    This not only prevents Wikipedians from using that data to improve Wikipedia (by reviewing and possibly deleting AI-generated Wikipedia content that the authors spent quite a bit of money on detecting - in the "Limitations" section, they describe their experiments as "costly"). It also makes it impossible for the community to discuss the performance of the AI detection method used by the paper in concrete examples (apart from those very few that were cherry-picked to be presented in the paper). After all, going back to the example of that paper from last year that (understandably) still seems very much on your mind, the fact that it had provided extensive concrete evidence for its claims across many specifically named articles and hundreds of footnotes was also what enabled you to dispute that evidence in lengthy rebuttals.
    Regards, HaeB (talk) 17:39, 19 October 2024 (UTC) (Tilman)Reply
    While I concur that releasing data is a good practice we should encourage, I also believe we need to encourage the good practice of protecting the subjects studied. Here, in all honesty, I think the authors should have replaced terms like "Albanian" with "Fooian" and obscured other identifying content. That said, I understand that we have to weigh the good of the project and of research against the good of a small number of people, and also that most editors identifiable here would most likely not have their real names connected to their accounts; but still, protecting research subjects is an important ethical consideration, and compromising it leads to a slippery slope. Ethical guidelines exist for good reasons, after all (and the fact that they are often ignored is not something that we should be proud of, as a society, IMHO). All I am saying is that the authors tried to do this at least a bit more than in the case we are both familiar with, and that's a plus. Piotr Konieczny aka Prokonsul Piotrus| reply here 02:50, 20 October 2024 (UTC)Reply
  • Regarding "The Rise of AI-Generated Content in Wikipedia" link, which randomly sampled "2,909 English Wikipedia articles created in August 2024", I am puzzled about several things:
    • Why aren't the data pools (table, top of page 2) exactly the same size - say, 2,500 each, since these data pools are samples of larger data sets?
    • The authors say that their August 2024 sample came from Special:NewPages, which - of course - doesn't include deleted pages. But it makes a big difference whether the authors collected data in real time during August or took a snapshot in (say) early September, and this isn't specified. [Footnote 1, the link to "data collection and evaluation code", might provide the answer, but it returns a 404 error message.]
    • Footnote 2 provides the source of the article's set of Wikipedia pages collected before March 2022, which are (per that source) datasets of "cleaned articles" with "markdown and unwanted sections (references, etc.)" stripped out. But the table at the top of page 3 includes "Footnotes per sentence" and "Outgoing links per word" - where did that information come from?
    • And speaking of that table, perhaps it's just me, but I find it extraordinarily hard to believe that new articles in August 2024 (with the sample limited to those over 100 words) contained, on average, 1.77 outgoing links per word. -- John Broughton (♫♫) 18:14, 19 October 2024 (UTC)Reply
    on average, 1.77 outgoing links per word - indeed. This nonsensical claim is one of the things that makes one wonder about the peer review process used by the "NLP for Wikipedia Workshop". (It also doesn't seem to be a mere typo, as the "per word" is reiterated in that table's caption and in different phrasing in footnote 4: We normalize by [...] word count.) Fortunately, for this secondary result the authors have actually released some partial data, providing the raw number of links per word calculated for each article (though, again, withholding the information on how each article was classified by the two detectors; see also the discussion above). It looks like, at least for English, the numbers there are all below 1, as they should be. So the error must have happened later in the process. Again, this also illustrates the value of adhering to open science practices by publishing replication data.
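    For a sense of why the figure is implausible: "outgoing links per word" is just the wikilink count divided by the word count, which for ordinary article prose sits far below 1 (at 1.77, an article would need nearly two wikilinks for every single word). A rough sketch of the metric, not the paper's code and with deliberately simplified link matching:

```python
# Rough sketch (not the paper's code): outgoing wikilinks per word, computed from raw wikitext.
import re

def links_per_word(wikitext: str) -> float:
    """Ratio of [[...]] wikilinks to word tokens in an article's wikitext."""
    links = re.findall(r"\[\[(?!(?:File|Image|Category):)[^\]]+\]\]", wikitext)
    words = re.findall(r"\w+", wikitext)
    return len(links) / max(len(words), 1)

sample = "The '''Example''' is a [[village]] in [[Albania]] with about 500 residents."
print(links_per_word(sample))  # ~0.18, i.e. well below 1, as one would expect
```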
    Another problem with this particular table (which I had left out of the review as too detailed, but which doesn't inspire confidence either): In the text they claim that Table 2 shows how, compared to all articles created in August 2024, AI-generated ones use fewer references. But in the table itself, that is not true for one of the four listed languages: in Italian, that number was actually higher for "AI-Detected Articles". Now, perhaps one could still support the overall claim using something like a multilevel regression analysis on the underlying data. But the authors don't do that, similar to how they hand-wave their way through various other issues in the paper.
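    For illustration only, such a multilevel model could look something like the sketch below, here run on invented numbers (the per-article detector classifications that would be needed for the real analysis were not released):

```python
# Illustrative multilevel (mixed-effects) regression on made-up data: references per sentence
# as a function of the AI-detected flag, with a random intercept per language edition.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "language":      ["en", "en", "de", "de", "fr", "fr", "it", "it"] * 25,
    "ai_detected":   [0, 1] * 100,
    # Invented values mirroring the pattern discussed above (with Italian going the other way):
    "refs_per_sent": [0.45, 0.20, 0.50, 0.25, 0.40, 0.22, 0.30, 0.35] * 25,
})
df["refs_per_sent"] += rng.normal(0, 0.05, len(df))  # add noise so the fit is not degenerate

model = smf.mixedlm("refs_per_sent ~ ai_detected", data=df, groups=df["language"])
print(model.fit().summary())  # the coefficient on ai_detected estimates the overall effect
```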
    where did that information come from? - Note that Table 2 appears to refer only to articles created in August 2024, so the absence of links in the 2022 dataset would not be a problem here. But yes, one could ask why they didn't vet their conclusion that AI-generated [Wikipedia articles] use fewer references and are less integrated into the Wikipedia nexus by calculating the same metrics for their March 2022 comparison articles.
    Why aren't the data pools (table, top of page 2) exactly the same size - I mean, they didn't specify what sampling method they used, so one can't expect the resulting samples to have exactly the same size. But yes, it seems to be one of many unexamined researcher degrees of freedom in this paper. E.g. why did English Wikipedia end up with the smallest sample in the August 2024 dataset and the second-smallest for the pre-March 2022 dataset? Did the German, Italian and French Wikipedias have a higher number of new articles (of >=100 words) in August 2024 than English Wikipedia?
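    For what it's worth, the two obvious sampling designs behave quite differently on this point; a purely illustrative sketch with invented numbers:

```python
# Fixed-size sampling vs. per-article Bernoulli sampling (population size invented).
import random

new_article_ids = list(range(50_000))  # stand-in for all >=100-word articles created that month

# Fixed-size design: every language edition would end up with exactly the same sample size.
fixed_sample = random.sample(new_article_ids, k=2500)

# Bernoulli design: each article is included independently with probability p,
# so the resulting sample sizes naturally differ between language editions.
p = 0.05
bernoulli_sample = [a for a in new_article_ids if random.random() < p]

print(len(fixed_sample), len(bernoulli_sample))  # exactly 2500 vs. roughly 2500
```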
    Footnote 1, the link to "data collection and evaluation code", might provide the answer, but it returns a 404 error message. Does it? The link [1] works for me right now. In an earlier draft of this review as posted here I had linked to [2], a link that has since started returning a 404 because one of the authors renamed the file from "recent_wiki_scraper.py" to "run_wiki_scrape.py" two days ago. The published version of the review uses a permalink (search for "scraping") which still works for me.
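    For readers curious what such a collection script involves, here is a minimal sketch (not the authors' run_wiki_scrape.py) that asks the MediaWiki API for page creations in a given window. Since it queries the live wiki, pages deleted before the script runs never show up, and Wikimedia wikis only retain recent changes for about 30 days, which is exactly why the timing of the data collection matters:

```python
# Minimal sketch of collecting new mainspace page creations via the MediaWiki API
# (pagination via the "continue" parameter omitted for brevity).
import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "list": "recentchanges",
    "rctype": "new",                    # page creations only
    "rcnamespace": 0,                   # mainspace articles
    "rcstart": "2024-08-01T00:00:00Z",  # range start (older end, since rcdir=newer)
    "rcend": "2024-08-31T23:59:59Z",
    "rcdir": "newer",
    "rclimit": 500,
    "rcprop": "title|timestamp|ids",
    "format": "json",
}
resp = requests.get(API, params=params, headers={"User-Agent": "wiki-research-sketch/0.1"}).json()
titles = [rc["title"] for rc in resp["query"]["recentchanges"]]
print(len(titles), titles[:5])  # empty if the window has already aged out of recentchanges
```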
    Regards, HaeB (talk) 01:51, 20 October 2024 (UTC) (Tilman)Reply
  • There needs to be an RFC on the use of Artificial Intelligence formally consigning it to the dustbin and banning those who use it. Even as we speak there are some in the Foundation who think it's a great idea to facilitate the use of AI so that drivebys find it easier to "contribute." Carrite (talk) 19:48, 19 October 2024 (UTC)Reply
    • As already briefly mentioned in the review, such an RfC already happened, see our coverage in the Signpost: "Community rejects proposal to create policy about large language models". It's also worth noting that the use of Artificial Intelligence is a very broad term which includes things that have been widely accepted for many years, like ORES (which many editors including myself have used to revert thousands of vandalism edits), see e.g. Wikipedia:Artificial intelligence. Lastly, we need to keep in mind that AI-generated articles (as well as AI capabilities in general) are a moving target, with recent systems getting more reliable at generating Wikipedia-type articles than a simplistic ChatGPT prompt would achieve, see e.g. the previous "Recent research" issue: "Article-writing AI is less 'prone to reasoning errors (or hallucinations)' than human Wikipedia editors". Regards, HaeB (talk) 00:26, 20 October 2024 (UTC) (Tilman)Reply
      Only speaking for myself, but I would like to see more AI in terms of tools, both to help augment the power, reach, and scope of existing admins (making up for their steep decline) and for use by content editors to help them check, verify, and prepare articles for creation and review. This does not mean that I support AI tools that would write the articles themselves, but I would welcome tools that could help editors check for errors and look for plagiarism. One thing I've been thinking about for a very long time is how most of our articles stand alone within their separate topics and disciplines, without showing how the subjects cross fields and interact with other similar and not so similar ideas. One potential use of a future AI tool would be to help editors unify the collection of all knowledge and show how it all links together. Currently, our primitive category system attempts to do this, but on an almost imperceptible level that isn't expressed as content or as a visualization. How does all of this content link together? That's what I would like to see it used for, and then, if at all possible, to create new knowledge from the unification of all that information. Right now, I can ask various different systems questions, but they don't seem to be able to give me an accurate or insightful answer about anything. As everyone already knows, the weakest link here is our search interface, which doesn't provide 1% of the potential answers that it could. Viriditas (talk) 00:55, 20 October 2024 (UTC)Reply
I think "debunked" is a bit too strong. But yes, there have long been concerns about its accuracy and false positive rates. I have myself advocated early on (February 2023) against relying on GPTZero and related tools for e.g. reviewing new pages, although WikiProject AI Cleanup just today weakened their previous "Automatic AI detectors like GPTZero are unreliable and should not be used" recommendation a little. It's also interesting that GPTZero themselves recently announced their goal [...] to move away from an all-or-nothing paradigm around AI writing towards a more nuanced one. An overall problem is that GenAI is has only been getting better and (presumably) harder to detect, and will quite likely continue to do so for a while.
As mentioned in the review, the authors of the paper seem broadly aware of these problems, but insist that they can work around them for their purposes. And to be fair, for a statistical analysis of the overall frequency of AI-generated articles the concerns are a bit different than when (e.g.) deciding whether to delete an individual article or sanction an individual editor. Still, my overall impression is that they are way too cavalier in dismissing such concerns in the paper (they are not even mentioned in its "Limitations" section, and their sole attempt to validate their approach against a ground truth has too many limitations, some of which I tried to indicate in this review).
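To make the aggregate-versus-individual distinction concrete: if the detectors' error rates were known, the overall share of flagged articles could in principle be corrected for them (the standard Rogan-Gladen adjustment, sketched below with invented numbers), whereas no such correction helps when judging a single flagged article or editor.

```python
# Rogan-Gladen correction of an aggregate prevalence estimate, using invented detector error rates
# (not figures from the paper): 5% of articles flagged, 80% sensitivity, 98% specificity.
def corrected_prevalence(flagged_share: float, sensitivity: float, specificity: float) -> float:
    """Estimate the true share of AI-generated articles from the observed share of flagged ones."""
    return (flagged_share + specificity - 1) / (sensitivity + specificity - 1)

print(round(corrected_prevalence(0.05, sensitivity=0.80, specificity=0.98), 3))  # ~0.038
```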
Regards, HaeB (talk) 00:47, 20 October 2024 (UTC) (Tilman)Reply