Removes existing parameter checked=yes in this edit:

https://en.wikipedia.org/w/index.php?title=MDPI&curid=13943035&diff=1232637577&oldid=1231081226

Headbomb {t · c · p · b} 20:24, 4 July 2024 (UTC)

@Headbomb Cheers - the bot has been stopped, a fix has been put in locally, and I'll test tomorrow before starting it up again. Mdann52 (talk) 20:30, 4 July 2024 (UTC)

If it's fixed, it's fixed. Unleash the bot!

Do you only process a few articles per day, or otherwise limit the bot? Because I find it strange that it only does 10-20 articles a day. I'd much rather the bot tagged everything it could asap. Headbomb {t · c · p · b} 20:52, 4 July 2024 (UTC)

So it's rate-limited to 5s between API calls (this was 20s but has been reduced; I'm not going below 5s, as I don't consider this an "urgent" task that needs to run quicker than that), and 5s (was 60s) between edits. I'm using the search API to scan for the identifiers, which does risk false positives that slow things down somewhat.
I did look at using the dumps instead, but that actually slows the task down significantly, so I've not taken that option. There are currently ~49,600 retractions I'm scanning for, and I've significantly optimised the code of late, which has further sped things up. I can't think of a better way to capture the data, short of pulling every DOI out of the dump and running it that way. Mdann52 (talk) 06:00, 5 July 2024 (UTC)
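(For illustration only, and not the bot's actual code: a minimal sketch of the setup described above, assuming a Python bot hitting the standard MediaWiki search API. The 5s delays come from the comment above; everything else, including the function names, is hypothetical.)

import time
import requests

API = "https://en.wikipedia.org/w/api.php"
API_DELAY = 5   # seconds between search-API calls (was 20s)
EDIT_DELAY = 5  # seconds between edits (was 60s)

_last = {"api": 0.0, "edit": 0.0}

def throttle(kind: str, delay: float) -> None:
    # Sleep just long enough to keep `delay` seconds between calls of this kind.
    elapsed = time.monotonic() - _last[kind]
    if elapsed < delay:
        time.sleep(delay - elapsed)
    _last[kind] = time.monotonic()

def pages_mentioning(identifier: str, limit: int = 50) -> list[str]:
    # Titles of articles whose wikitext appears to mention the identifier
    # (a DOI or PMID). insource: phrase matching is tokenised, so hits are
    # only candidates - the false positives mentioned above - and must be
    # re-checked against the real page text before editing.
    throttle("api", API_DELAY)
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f'insource:"{identifier}"',
        "srnamespace": 0,
        "srlimit": limit,
        "format": "json",
    }
    r = requests.get(API, params=params, timeout=30)
    return [hit["title"] for hit in r.json()["query"]["search"]]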
How long do you estimate it would take to process the ~50K retractions?
I notice that the majority of retractions concern papers with DOI prefixes of ~10.1400 or under. It might be worth processing all 10.1517/... original DOIs with retractions together, and comparing against [1], so you can do it prefix-by-prefix.
I don't know how well that would scale to big prefixes like 10.1002 (85K hits on Wikipedia) or 10.1007 (125K hits), though. Headbomb {t · c · p · b} 07:49, 5 July 2024 (UTC)
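(If the prefix-by-prefix idea were tried, grouping the retraction list by registrant prefix is the easy part; this sketch assumes the retractions are available as a plain list of DOI strings.)

from collections import defaultdict

def group_by_prefix(dois: list[str]) -> dict[str, list[str]]:
    # The registrant prefix is everything before the first "/",
    # e.g. "10.1517" for "10.1517/some.doi" (made-up example).
    groups: dict[str, list[str]] = defaultdict(list)
    for doi in dois:
        groups[doi.split("/", 1)[0]].append(doi)
    return dict(groups)

# Small prefixes could then be compared against on-wiki hit counts in one
# batch each, while big prefixes like 10.1002 or 10.1007 would still need
# per-DOI handling, as noted above.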
Now that I've sped the bot up significantly and added a bit more code for efficiency, based on this morning's run speeds I estimate around 3.5 days if I were just scanning for all references, and hopefully 10 days to do everything (previously it was looking like 30 days, but I've had some issues with the database driver I'm using on Toolforge crashing). I did consider fetching by prefix, but I don't think this saves time, as it just means more pages to fetch and scan. My current approach is to take all the DOIs on the page, check them against the database and fix any hits; given that a number of pages seem to have multiple retracted DOIs (with different prefixes), I think this strikes a good balance. Mdann52 (talk) 08:08, 5 July 2024 (UTC)
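(The per-page approach described above might look something like this sketch; the DOI regex is illustrative and much looser than what a production bot would use, which would parse citation templates instead.)

import re

# Loose DOI pattern: "10.", 4-9 digits, "/", then anything up to whitespace
# or a wikitext delimiter. Illustrative only.
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s|}<\]]+")

def retracted_on_page(wikitext: str, retracted: set[str]) -> set[str]:
    # Every DOI found on the page that is also in the retraction database,
    # so pages with several retracted DOIs get fixed in a single pass.
    found = {m.group(0).rstrip(".,;") for m in DOI_RE.finditer(wikitext)}
    return found & retracted

As a rough sanity check on the 3.5-day figure: ~49,600 searches at 5s between API calls is ~248,000s of enforced waiting alone, or about 2.9 days before page fetches and edits are added on top.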
Under 10 days for the whole database? That's pretty good! I was worried we'd be looking at months+. Headbomb {t · c · p · b} 08:18, 5 July 2024 (UTC)
I'd also alternate between DOI and PMID after full runs; this way papers without DOIs but with PMIDs get picked up, and vice versa. Headbomb {t · c · p · b} 08:19, 5 July 2024 (UTC)
The bot is already doing "DOI" OR "PMID" in the searches, and should pick up instances of either (but if it's missed one, please shout!). 10 days is an estimate, and I'm not going to be held to that, but it shouldn't take months now! I prefer starting bots slow and sorting out the niggles. Mdann52 (talk) 08:32, 5 July 2024 (UTC)
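(A combined identifier query along the lines described might be built like this; the exact phrase format for matching PMIDs in wikitext is an assumption, not confirmed from the bot.)

def retraction_query(doi: str | None, pmid: str | None) -> str:
    # One search string matching either identifier. CirrusSearch tokenises
    # wikitext such as "|pmid=12345", so a phrase search for "pmid 12345"
    # should match it (assumption).
    parts = []
    if doi:
        parts.append(f'insource:"{doi}"')
    if pmid:
        parts.append(f'insource:"pmid {pmid}"')
    return " OR ".join(parts)

# e.g. retraction_query("10.1234/made.up", "12345678")
#  -> 'insource:"10.1234/made.up" OR insource:"pmid 12345678"'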
Even if it's 12.3 days, that would still be pretty good. The bot could process the entire db twice a month, ish. Headbomb {t · c · p · b} 08:35, 5 July 2024 (UTC)