Wikipedia:Wikipedia Signpost/2023-08-01/In focus

In focus

Journals cited by Wikipedia

In 2009, I had an idea I thought was pretty neat. What if we looked at all the |journal= parameters of citation templates? Nearly 14 years later, it's more than time to share what that idea became. Here, then, is a historical tour of one of Wikipedia's interesting little secrets.

An idea is born

The idea was born out of a desire to understand which journals were highly cited on Wikipedia, so WikiProject Academic Journals could prioritize work and focus on the in-demand journals and redirects. I figured we had hundreds of thousands, if not millions of |journal= parameters – we might as well compile them and see what comes out of it. This was the basic idea behind Journals cited by Wikipedia, or WP:JCW for short.

After a bot request, ThaddeusB tackled the task with WikiStatsBOT (BRFA), and the idea began to take shape (albeit in a crude form). The bot would download a dump of all Wikipedia articles and process it into three searchable sub-compilations: an alphabetical one, comprehensively listing all the |journal= parameters, a most popular journal listing, and a most popular missing journal listing.

Sample outputs of the first decently-accurate run (30 June 2009):

Most popular journals, covering the top 100 journals
Most popular missing journals, covering the top 100 missing journals
A1, the first page of the alphabetical listings

The initial output was crude and inaccurate (especially before the above date), but it was good enough to get us started. WikiStatsBOT was quickly improved to clean up some entries, have better accuracy, and present things in a more appealing and useful way. After the early kinks were sorted out, about 355,000 citations were covered. The most popular journal entry was J. Biol. Chem. (with 17,543 citations), and the most popular 'missing' journal entry was Genome Res. (with 10,190) – Genome Research actually existed, but the redirect from Genome Res. hadn't been created yet.

The early days

ThaddeusB continued to run the bot every now and then. Its first run was in June 2009, followed by others in July and August of that year. Things were looking good: by the end of 2009, the top 300 most cited entries all had articles (or redirects), and only about 40 entries in the top ~500 remained missing. The topmost missing entry went from 10,190 citations to below 120 citations. That's a reduction of nearly two orders of magnitude in the span of half a year! During that time, we also developed a writing guide for journal articles (which has since been greatly refined).

The next and last run didn't occur until May 2010. ThaddeusB then abruptly became inactive (a reminder that bus factors of 1 are bad), leaving us without both bot and coder. We still worked with what we had, clearing the first 500 journals by the end of December of that year.

A new bot request was made, looking for a bot to take over WikiStatsBOT's old task. I used the opportunity to bring in new ideas, and redesign some of its functionality and visual appearance. After a few days, a coder was found in JLaTondre. A BRFA was filed, and in July 2011, the JL-Bot unleashed its 0s and 1s in service of WikiProject Academic Journals.

Sample output of the first decently-accurate run (10 July 2011):

Again, the bot was quickly improved to clean up some entries, have better accuracy, and present things in a more appealing and useful way.

Modern era

JL-Bot's takeover marked the beginning of what I call the "modern era" of JCW. The compilation became reliably updated, which let us focus on adding functionality and having a more polished presentation. The current update schedule is for major updates twice a month (after data dumps become available) and daily minor updates as needed. The major updates reflect the evolution of Wikipedia since the last dump (e.g. new citations, deleted citations, cleaned-up citations), while the minor updates reflect smaller changes in how things have been organized since the last daily run (e.g. categorization and redirect updates, as well as bot configuration tweaks).

Over time, new sub-compilations were designed to browse the data according to different criteria.

A by-target compilation (July 2017)

A compilation of all redirects pointing to the same target page

The Wikipedia CiteWatch (August 2018)

A compilation of questionable and unreliable sources (see previous Signpost coverage)

A by-publisher compilation (April 2019)

A compilation aiming to group all the journals of a publisher together

Various maintenance compilations (August 2019)

Used to clean up unusual, weird, or known-to-be-wrong stuff

A by-DOI prefix compilation (December 2019)

DOI prefixes can belong to individual journals, to publishers, or to access platforms like AJOL or JSTOR

A list of DOI registrants based on CrossRef (January 2020)

This is not part of the JCW compilation proper, but it is used to create redirects from DOI prefixes used by the compilation.

Those can all be easily accessed through the current JCW mainpage. Browsing the archives of User talk:JL-Bot and WT:JCW will give some insights as to how each feature got implemented over time, but I must warn you that the discussions can get pretty technical. But, perhaps more importantly, the source code is available on GitHub under the MIT License, greatly mitigating the bus factor issue.

As of writing, the compilation covers about 3.3 million citations, with 1.5 million distinct DOIs and 7,290 distinct DOI prefixes. This is nearly ten times the initial coverage we had in 2009, which reflects Wikipedia's expansion since then (both in the number of articles and in the number of citations per article). For posterity, here is where things stand:

Summary of the current compilation, based on the 20 July 2023 dump
Most cited publishers	Citations [n 1]	Most cited journals	Citations [n 2]	Most cited missing journals	Citations [n 3]
Elsevier	360,000	Nature	51,000	The NamesforLife Abstracts	1,974
Springer Science+Business Media	286,000	Proceedings of the National Academy of Sciences of the United States of America	40,000	Cesa News	824
Wiley	255,000	Science	37,000	New Zealand Journal of Geology and Geophysics	534
Nature Research	118,000	Journal of Biological Chemistry	33,000	The Real Estate Record: Real Estate Record and Builders' Guide	509
Informa	112,000	The Astrophysical Journal	23,000	Memoirs of the American Entomological Institute	505
[n 1] Approximate, contains some false positives
[n 2] Approximate, can include some related publications
[n 3] Exact, but overlooks alternative titles

You might say, "but wait, those redlinks contain things that aren't journals!" Well, read on to find out more. I will, however, take a small pause here to thank the various people who helped with the development of the compilation in one way or another over the years.

First, JLaTondre, whose speed and skill at implementing my vision are simply unparalleled and unfathomable, as well as ThaddeusB for giving it a shot ages ago. Then, in no particular order of importance, AManWithNoPlan, the late DGG (obituary), Fgnievinski, Galobtter, IntoThinAir, John Vandenberg, Johnuniq, Mark viking, Markworthen, Randykitty, Rjwilmsi, the late Ronhjones (obituary), Smith609, Steve Quinn, TheSandDoctor, Tokenzero, and Trappist the monk. I'm sure I'm forgetting some people, and I apologize for doing so. But believe me, I have appreciated every bit of help I have ever gotten.

How does it work, exactly?

It is important to understand exactly what the compilation is. As mentioned above, it's a searchable compilation of all |journal= parameters from citation templates on the English Wikipedia, taken from the latest database dump. The |doi= parameters are also used by the compilation to group things by DOI prefixes (the 10.xxxx/... part of DOIs). It is based on citations like:

<ref name=Bloom1969>{{cite journal |last1=Bloom |first1=E. D. |display-authors=etal |year=1969 |title=High-Energy Inelastic e–p Scattering at 6° and 10° |journal=Physical Review Letters |volume=23 |issue=16 |pages=930–934 |doi=10.1103/PhysRevLett.23.930}}</ref>

It will, however, ignore named-reference repeats like <ref name=Bloom1969/>, as well as "manual" citations like

<ref>Bloom, E. D. et al. "High-Energy Inelastic e–p Scattering at 6° and 10°". Physical Review Letters, 23 (16): 930–934. doi:10.1103/PhysRevLett.23.930</ref>

There is also limited support for semi-manual citations involving {{doi}} and {{doi-inline}}, like:

<ref>Bloom, E. D. et al. "High-Energy Inelastic e–p Scattering at 6° and 10°". Physical Review Letters, 23 (16): 930–934. {{doi|10.1103/PhysRevLett.23.930}}</ref>
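As a rough illustration of this extraction step, here is a minimal Python sketch. It is not JL-Bot's actual code (that lives in the GitHub repository mentioned earlier); the function name extract_citations and both regular expressions are assumptions made for this example. It only looks at {{cite journal}} templates, so named-reference repeats and manual citations are skipped as described above, and it leaves out the limited {{doi}} support.

```python
import re

# Matches a {{cite journal ...}} template. Simplification: templates nested
# inside the citation are not supported.
CITE_RE = re.compile(r"\{\{\s*cite journal\s*\|([^{}]*)\}\}", re.IGNORECASE)
# Captures the prefix of a DOI, e.g. "10.1103" in "10.1103/PhysRevLett.23.930".
DOI_PREFIX_RE = re.compile(r"^(10\.\d{4,9})/")


def extract_citations(wikitext):
    """Return (journal, doi_prefix) pairs for each cite journal template.

    doi_prefix is None when the template has no usable |doi= parameter.
    Splitting on "|" is naive: a piped link such as [[Foo|Bar]] inside a
    parameter value would be cut in two.
    """
    results = []
    for match in CITE_RE.finditer(wikitext):
        params = {}
        for part in match.group(1).split("|"):
            if "=" in part:
                key, value = part.split("=", 1)
                params[key.strip()] = value.strip()
        journal = params.get("journal")
        if not journal:
            continue  # nothing to compile without a |journal= value
        doi_match = DOI_PREFIX_RE.match(params.get("doi", ""))
        results.append((journal, doi_match.group(1) if doi_match else None))
    return results
```

Run on the three example references above, this sketch would return a single pair, for the templated citation only.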

Then some cleanup and processing is done:

|journal=[[Foo|Bar]] is treated as |journal=Bar
Markup, whitespace, and certain templates like {{small}} are stripped and normalized
Fuzzy logic is used to match likely typos and likely related entries
For the purpose of matching, common terms are normalized (Bulletin = Bull., Catalogue = Catalog, Journal = J., Proceedings = Proc., etc.) unless an article/redirect exists
For the purpose of matching, supplements and sections are treated as their base publications (Acta Foobarol. Suppl. = Acta Foobarol., MNRAS Letters = MNRAS, J. Phys. A = J. Phys.) unless an article/redirect exists
Matching ignores common articles like an, the, and, &; likewise for other languages (French le, la, l', German für, etc.)
WP:JCW/EXCLUDE is used to unmatch entries that don't belong together. For example, African Journal of Arts will be a fuzzy-logic match for American Journal of Arts, even though nobody with a working brain would think these were the same.
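To make the normalization rules above a little more concrete, here is a small Python sketch that builds a simplified "matching key" for a |journal= value. The names unwrap_links and matching_key, and the short word lists, are hypothetical and only cover the examples given in the list; the bot's real rules are more extensive, and the "unless an article/redirect exists" condition is not modelled here.

```python
import re

# Illustrative subsets only, taken from the examples in the list above.
ABBREVIATIONS = {"bulletin": "bull", "catalogue": "catalog",
                 "journal": "j", "proceedings": "proc"}
IGNORED_WORDS = {"an", "the", "and", "le", "la", "für"}


def unwrap_links(value):
    """Turn [[Foo|Bar]] into Bar and [[Foo]] into Foo."""
    return re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", value)


def matching_key(journal):
    """Build a simplified key so that likely-equivalent titles compare equal."""
    text = unwrap_links(journal)
    text = re.sub(r"'{2,}", "", text)        # strip ''italic'' and '''bold''' markup
    text = text.lower().replace("&", " ")    # "&" is ignored for matching
    text = re.sub(r"\bl'", "", text)         # French elided article l'
    words = re.findall(r"\w+", text)         # also drops periods and extra whitespace
    words = [ABBREVIATIONS.get(w, w) for w in words if w not in IGNORED_WORDS]
    return " ".join(words)
```

With these assumptions, "[[Foo|The Journal of Bar]]" and "J. of Bar" end up with the same key, which is the kind of equivalence the matching step relies on.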

Matching is not perfect, so you'll often find mismatched entries like:

2842	Nature Sustainability	Nature Sustainability (165 in 132) The Future of Sustainability (1 in 1)

When these are found, they can be bypassed in WP:JCW/EXCLUDE, and won't show up in the next daily run. A great deal of energy is spent dealing with bad matches, particularly after new data dumps. But there will always be mismatches.
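The interplay between fuzzy matching and an exclusion list can be sketched as follows. This is only an illustration: it uses Python's standard difflib as a stand-in for whatever similarity measure the bot really uses, and the names EXCLUDED_PAIRS, similarity, and is_match, as well as the 0.85 threshold, are assumptions for the example.

```python
from difflib import SequenceMatcher

# Hypothetical exclusion list, playing the role of WP:JCW/EXCLUDE.
EXCLUDED_PAIRS = {
    frozenset({"African Journal of Arts", "American Journal of Arts"}),
}


def similarity(a, b):
    """Similarity ratio between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(a, b, threshold=0.85):
    """True if two titles look alike and have not been explicitly unmatched."""
    if frozenset({a, b}) in EXCLUDED_PAIRS:
        return False  # a human has already decided these do not belong together
    return similarity(a, b) >= threshold
```

Any purely string-based measure will pair up some titles that a human would never confuse, which is why the exclusion check comes first and why the list needs ongoing curation.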

The |journal= parameter is often misused for books, magazines, newsletters, or websites, or contains wrong or extraneous data such as authors, publisher, volume, or pages. We try to identify what type of publication we're dealing with in the most popular journals and alphabetical sub-compilations, based on categories and keywords, but no filtering is done because we want to be able to clean that stuff up when we come across it! And if we have a highly-cited non-journal, like Cesa News, well, that's still good to know. If it's not notable on its own, maybe we can create an article on its publisher, Centre for Entomological Studies Ankara, and redirect Cesa News there.

Additional information on how to read the compilation can be found at the bottom of each page in the compilation, as well as on the compilation's main page.

How is it used?

The main historical use of the compilation was to find highly cited missing journals. That is still the case today. But so much more can now be done, particularly on cleanup:

Finding common typos, misspellings, miscapitalizations, using WP:JCW/TYPO and WP:JCW/MISCAPS. (See previous Signpost coverage.)
Finding unusual typos, misspellings or miscapitalizations. For example, as of writing, the 10.4401 doi prefix entry lists Annals of Geophysics (97 in 91) and Annals of Geophysics Journal (1 in 1). One might suspect (correctly) that Annals of Geophysics Journal is the wrong name of the journal being cited.
Finding books being cited as journals, with ISBNs in the journal parameter
Finding journals with the wrong DOI, or DOIs with the wrong journal.
Finding former names of journals.
Finding ISO 4, Bluebook, MathSciNet, or US National Library of Medicine abbreviations and other (often incorrect) abbreviations of journals.
Creating redirects from Foobar Journal to The Foobar Journal and vice versa
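The DOI-prefix cleanup described above (the Annals of Geophysics example) boils down to grouping titles by prefix and looking for rarely used variants. Here is a minimal Python sketch of that idea; the function name rare_variants and the 5% cut-off are assumptions for illustration, not how the compilation itself decides anything.

```python
from collections import Counter, defaultdict


def rare_variants(citations, max_share=0.05):
    """Group (journal, doi_prefix) pairs by prefix and flag rarely used titles.

    Returns {prefix: [title, ...]} listing titles that account for at most
    `max_share` of the citations under a prefix shared with other titles.
    """
    by_prefix = defaultdict(Counter)
    for journal, prefix in citations:
        if prefix:  # citations without a DOI cannot be grouped this way
            by_prefix[prefix][journal] += 1

    flagged = {}
    for prefix, counts in by_prefix.items():
        if len(counts) < 2:
            continue  # a single title has nothing to be compared against
        total = sum(counts.values())
        rare = sorted(t for t, n in counts.items() if n / total <= max_share)
        if rare:
            flagged[prefix] = rare
    return flagged
```

A flagged title is only a hint for a human to check. Since DOI prefixes can belong to publishers or access platforms as well as to individual journals, a prefix with many legitimate titles will produce plenty of false alarms.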

Citation bot and JCW-CleanerBot will often be seen doing cleanup based on these compilations.

Where to go from here?

Well, the first natural extension would be WP:MCW – Magazines cited by Wikipedia. But that already exists! It was developed alongside JCW, but given much less attention; historically, {{cite magazine}} redirected to {{cite journal}}, so its adoption was much less widespread. We could also have WP:PCW (Publishers cited by Wikipedia) or WP:BCW (Books cited by Wikipedia). But those would require a great deal of curating, given those would represent entirely new datasets, with their own peculiarities. It doesn't mean they won't ever get done – just that those would represent a big design challenge. The bot code would probably be relatively straightforward to adapt once the design was clear, but I'm sure that would still have its challenges.

But for now, I hope that you'll have fun exploring the compilation, and perhaps decide you want to tackle the many invalid titles, or clean up the many proceedings cited as journals. Feel free to share your experiences with JCW or suggest improvements to the compilations in the comment section!


Discuss this story


Suggestion for a future piece: thoughts on JCW and Wikidata. fgnievinski (talk) 04:19, 1 August 2023 (UTC)
Headbomb, Warmly seconded! Agreed that one or two highly cited (or miscited) articles will skew these stats, but any power law in that would also be interesting to see. – SJ + 15:53, 14 September 2023 (UTC)
Excellent work. Speaking of Wikidata, what about citations using {{Cite Q}}? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 12:47, 1 August 2023 (UTC)
Completely ignored. Headbomb {t · c · p · b} 15:48, 1 August 2023 (UTC)
Yes, but how do you plan to fix that omission? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 13:04, 2 August 2023 (UTC)
I don't. Cite Q lives in Wikidata, not in the data dumps. Headbomb {t · c · p · b} 13:24, 2 August 2023 (UTC)
Interesting experiment. I am probably more interested in individual highly cited journal articles than which journal they were published in. But there is obvious overlap. Highly cited articles will contribute to the highly cited journal count. I think one needs to take care in interpreting the results.^[1] The most highly cited in Wikipedia was a human genome sequencing article.^[2] This citation was added to a lot of Gene Wiki articles by a bot. In many cases, it was the first description of the sequencing of a particular gene. On the other hand, the citation is not a very good one because it is not very specific to a particular gene and also the source is primary. This and similar citations are gradually being replaced by more specific citations. So it would be interesting to see how the most cited scholarly articles and journals change over time. Boghog (talk) 20:29, 1 August 2023 (UTC)
Very interesting article. I would be curious to know if there are similar projects in other languages such as French or German. PAC2 (talk) 05:38, 14 August 2023 (UTC)

Not to my knowledge, though it should be fairly straightforward to port and localize the bot to those projects. Headbomb {t · c · p · b} 01:45, 15 August 2023 (UTC)

References

^ Guglielmi G (May 2018). "Wikipedia's top-cited scholarly articles - revealed". Nature. 557 (7705): 291–292. Bibcode:2018Natur.557..291G. doi:10.1038/d41586-018-05161-6. PMID 29765126. S2CID 256768531.
^ Strausberg RL, Feingold EA, Grouse LH, Derge JG, Klausner RD, Collins FS, et al. (December 2002). "Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences". Proceedings of the National Academy of Sciences of the United States of America. 99 (26): 16899–903. Bibcode:2002PNAS...9916899M. doi:10.1073/pnas.242603899. PMC 139241. PMID 12477932.
