Billboard charts URL change

Billboard album and singles charts have changed their URLs. There are some complicated cases which we are still figuring out, but the majority are quite simple; see the discussion.

  • Previous URL: https://www.billboard.com/music/Madonna/chart-history/TLP
  • New URL: https://www.billboard.com/artist/Madonna/chart-history/TLP

Or more generally (a small rewrite sketch follows the list below):

  • Previous URL: https://www.billboard.com/music/*/chart-history/*
  • New URL: https://www.billboard.com/artist/*/chart-history/*
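A minimal sketch of that rewrite rule in Python; the regex and function name are illustrative only, not the bot's actual code:

```python
import re
from typing import Optional

# Illustrative only: the /music/ -> /artist/ rewrite described above.
MUSIC_CHART = re.compile(
    r"^https?://www\.billboard\.com/music/([^/]+)/chart-history(/.*)?$"
)

def migrate_billboard_url(url: str) -> Optional[str]:
    """Return the proposed /artist/ URL, or None if the URL does not match."""
    m = MUSIC_CHART.match(url)
    if not m:
        return None
    artist, rest = m.group(1), m.group(2) or ""
    return f"https://www.billboard.com/artist/{artist}/chart-history{rest}"

# migrate_billboard_url("https://www.billboard.com/music/Madonna/chart-history/TLP")
# -> "https://www.billboard.com/artist/Madonna/chart-history/TLP"
```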

The search times out after 2600 of them, so I am not sure how many there are in total. --Muhandes (talk) 14:21, 17 November 2021 (UTC)

Muhandes, thank you for the information. I can work on this. More migration maps like above will help. Billboard is a large and complex site: the URLs exist in over 80,000 unique articles, and the Billboard URLs here (http) and here (https) are of a wide variety. Some will be soft-404s, for example [1] doesn't go where expected but is a live URL. Some will return 404. Some can be migrated, as listed above. Some can't be migrated. We may not solve Billboard all at once, but any problems that can be identified will help move it forward. -- GreenC 16:46, 17 November 2021 (UTC)
Sure thing, one bite at a time. --Muhandes (talk) 10:10, 18 November 2021 (UTC)
Alright, we'll focus on these; it is a lot: 14 thousand pages. BTW it is not a blind move: the bot will test that the proposed new URL exists and, if not, add an archive URL. -- 15:29, 18 November 2021 (UTC)
Muhandes, how would you prefer to handle cases like https://www.billboard.com/music/Taylor-Swift/chart-history/canadian-albums in Taylor Swift? Options are: 1) leave the existing |archive-url= in place, replace the |url= with the new URL, and toggle |url-status=live; 2) same as 1, except replace |archive-url= with an archive of the new URL; 3) same as 1, except delete the archive URL. Consider that chart-history data can be subject to content drift, and citations verify based on the page content on the date the citation was made. It actually raises the question of whether these URLs should all just be treated as dead and archive URLs added for everything, with an archive URL as close as possible to the |access-date=. Similar issue with the template version. Billboard charts are not an area I am familiar with, so I will leave it to you and others to decide the best option. -- GreenC 19:29, 18 November 2021 (UTC)
@GreenC: As expected, this turns out to be more complicated: not all of the charts were moved. I suggest the following (a rough sketch of the decision flow follows the list):
  • Case 1: The chart already has an |archive-url=. While in many cases this URL can be improved, I would still suggest we assume this was done by an editor who tested the archive and verified it includes the information. In that case, leave it be.
  • Case 2: There is no |archive-url=. Sadly, we know some charts were abandoned in this move. We can recognize them by the code after /chart-history/.
  • Case 2a: We know that the chart was abandoned. This includes (table moved below to "Orphan codes not migrated") and we will add more as we find them. I suggest the following:
  • If an archive to the original URL exists, add an |archive-url= with date as close as possible to the |access-date=.
  • If an archive does not exist, an archive without the code may exist, so add an |archive-url= to that. For example instead of an archive to https://www.billboard.com/music/Madonna/chart-history/LUX we may have an archive to https://www.billboard.com/music/Madonna/chart-history.
  • Otherwise, add [dead link]
  • Case 2b: We don't know the chart was abandoned, but it does not have a three letter code. For example, canadian-albums. The chart was either moved to a three letter code (in this case, CNA) or abandoned. I suggest we have an editor look at these, so just log them.
  • Case 2c: The chart has a three letter code. This would be the original case, replace /music/ with /artist/.
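A rough sketch of that decision flow, purely illustrative; the helper name and the abandoned-code set are placeholders, not the bot's actual logic:

```python
# Placeholder data: the "Orphan codes not migrated" table, filled in as codes are found.
ABANDONED_CODES = {"LUX"}

def handle_chart_url(code: str, has_archive_url: bool) -> str:
    """Map a /chart-history/<code> citation to one of the proposed cases."""
    if has_archive_url:
        return "case 1: leave the existing |archive-url= in place"
    if code.upper() in ABANDONED_CODES:
        return "case 2a: archive near |access-date=, else archive without the code, else {{dead link}}"
    if len(code) != 3:
        return "case 2b: log for manual review (may map to a 3-letter code)"
    return "case 2c: replace /music/ with /artist/"
```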
What do you think? I tried not to overcomplicate it. --Muhandes (talk) 10:02, 19 November 2021 (UTC)
Hey Muhandes, trying to understand your post in relation to my bot, I'm pretty confused. The bot has done hundreds of moves before and has a lot of default logic built in to handle many things. I just uploaded 6 diffs; take a look and see what you think: Madonna, Dolly Parton, Miley Cyrus, The Rolling Stones, Aretha Franklin, Neil Young. It checks that the new page at the new site is working; if not, it adds an archive URL. It follows redirects and adds the final URL, so you see in Neil Young the 3-letter code is lower case. For 2a, I'm not sure I want to do all that right now with archives to chart-history when there is a 3-letter code; it's a pretty narrow edge case and complicated. Otherwise, if the chart didn't migrate, the bot will pick it up since the new URL won't work, and it just adds an archive URL or dead link. For 2b, it just adds an archive URL, and if someone later wants to migrate to the new site they can; at least it gets a working URL into the cite for now. These actually appear to be the majority of cases, as in The Rolling Stones. -- GreenC 07:02, 23 November 2021 (UTC)
@GreenC: I reviewed the articles you pointed out. If this is too much, skip to the conclusion.
  • Madonna: This is case 2b, but you handled it like case 2c. The URLs are slightly off, but not useless. I suppose we can live with this approach, log the changes, and interested editors can correct them manually.
  • Miley Cyrus: This is a case I was not aware of: a link to a specific song. These are no longer supported, nor needed, since the information was moved to the main table. The resulting archives are, in most cases, useless. The best approach would be to strip the end and deal with it like case 2. Again, it can be argued that the URL was useless to begin with, so the bot did no harm, and manual resolution could be done later, through case 2b.
  • Case 3: https://www.billboard.com/music/(artist)/chart-history/(chart)/song/*. Strip the end: https://www.billboard.com/music/(artist)/chart-history/(chart), and treat as case 2 (see the small sketch after this post).
  • Dolly Parton: I don't know what the original purpose of the URL was and the archived URL is also useless. The only real solution is a manual correction. On the other hand, the URL was useless to begin with, so no harm done. We can keep these changes in the log and interested editors can correct them later.
  • Case 4: https://www.billboard.com/music/(artist)/chart-history/(chart)/(not "song")/*. Archive if possible.
In conclusion, I don't think this is the correct bot for the job. The resulting archives are useless in most cases and there are too many manual corrections needed. Case 3 needs to be translated to case 2, but more importantly, if case 2b (or case 3 -> 2b) is the majority, we need a bot to do that. The only clear cases the bot does correctly are 2a, 2c and 4. I would argue that we can use this bot to do a quick run of case 2a, 2c and 4 and move onward from there. If you need me to reiterate those rules I would be happy to. Your thoughts? --Muhandes (talk) 09:13, 26 November 2021 (UTC)
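A small sketch of the case 3 normalization mentioned above, illustrative only:

```python
import re

def strip_song_suffix(url: str) -> str:
    """Case 3 sketch: drop a trailing /song/<id> segment so the URL can then be
    handled like case 2. Illustrative only."""
    return re.sub(r"(/chart-history/[^/]+)/song/.*$", r"\1", url)

# strip_song_suffix("https://www.billboard.com/music/leftover-salmon/chart-history/NMT/song/187896")
# -> "https://www.billboard.com/music/leftover-salmon/chart-history/NMT"
```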
Ok. Thank you for the manual corrections. It seems apparent that most of these can't be done safely by bot, or there is not enough information available. For example, with Aretha Franklin, how would the bot know to use the 3-letter code dsi? Same with Madonna. Or in Miley Cyrus, the bot would not be able to determine the new URL automatically. The only case a bot can safely do is 2c (and case 3). That's no problem and can be done. Will post some more diffs soon. -- GreenC 19:13, 27 November 2021 (UTC)
Muhandes: Please see Wikipedia:Link rot/cases/Billboard - do you have thoughts on how things might be improved or changed? Table B has a lot of repeating URLs like "hot-100" that seem like candidates for a migration rule. The bot is not adding archive URLs, unless you think some in Table B could be archived. -- GreenC 05:27, 28 November 2021 (UTC)
Oh I see, "tlp" = "Billboard 200" and "hsi" = "Hot 100". Duh. That opens up new possibilities for such conversions.
Will re-process with these conversions. -- GreenC 05:37, 28 November 2021 (UTC)

@GreenC: Looked at the run, there is one instance I think should have been rule 3 -> 2c:

I thought it should be https://www.billboard.com/artist/Alicia-Keys/chart-history/hsi/

There was also another simple rule I can add, for Bill Cosby:

  • Case 5: https://www.billboard.com/music/*/chart-history/ -> https://www.billboard.com/artist/*/chart-history/

3-letter codes

  • Codes migrated: The long and short code pairs are known, and the short code works at the new site
  • Orphan codes migrated: The short code works. The long code is unknown or never existed
  • Orphan codes not migrated: The code (long or short) does not work at the new site
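A sketch of how the three tables might drive the bot's handling of a code; the example entries are drawn from this discussion, and everything else is a placeholder:

```python
# Example entries only, drawn from this discussion; not the full tables.
CODES_MIGRATED = {"canadian-albums": "cna", "hot-100": "hsi"}   # long code -> short code
ORPHAN_MIGRATED = {"dsi"}       # short code works; long form unknown or never existed
ORPHAN_NOT_MIGRATED = {"lux"}   # does not work at the new site

def classify_code(code: str) -> str:
    """Return the proposed handling for a chart code; unknown codes get flagged."""
    code = code.lower()
    if code in CODES_MIGRATED:
        return f"replace with short code '{CODES_MIGRATED[code]}' and migrate to /artist/"
    if code in CODES_MIGRATED.values() or code in ORPHAN_MIGRATED:
        return "migrate to /artist/ as-is"
    if code in ORPHAN_NOT_MIGRATED:
        return "treat as dead: add an archive URL or {{dead link}}"
    return "unknown code: flag for review"
```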

Break

I can add to it when we find more. --Muhandes (talk) 08:59, 28 November 2021 (UTC)

The 3LC table is very helpful. Wikipedia:Link rot/cases/Billboard is updated with the latest results; looks like a clean sweep? There is Case 5 in Bill Cosby in Table A. I added code for case 2a, which can be seen in Table C for Kurt Cobain. However, it does not implement "If an archive does not exist, an archive without the code may exist" because that is too far outside the bot's procedure flow; it would probably require a new bot run after a {{dead link}} is added. -- GreenC 19:11, 28 November 2021 (UTC)
@GreenC: The only changed item I don't understand is Rush (band): https://www.billboard.com/music/rush/chart-history/HSIP -> https://www.billboard.com/artist/rush/chart-history/hsip/
Since this is not a 3LC and the table does not supply a 3LC for it, I would expect it not to be changed by the bot. By the way, I believe it was a typo so I fixed it. --Muhandes (talk) 13:21, 29 November 2021 (UTC)
The table is only used for replacements (from left column to right column). If a code already exists in the URL, it checks that the new URL is live and, if not, adds an archive. Since I saw this Rush example that "worked", I allowed for 3- or 4-letter codes. You may be right that it's a typo and is not actually working. I'll go back to 3-letter codes only and, to be safe, only those in the table, and log any others. -- GreenC 15:43, 29 November 2021 (UTC)

Muhandes, I processed the next 200 articles and more unknown codes appeared, listed at Wikipedia:Link rot/cases/Billboard. -- GreenC 20:47, 29 November 2021 (UTC)

@GreenC: In the future, can you please include the article name like before? This helps the detective work figuring out where each chart went. --Muhandes (talk) 08:30, 30 November 2021 (UTC)
@GreenC: I added a few more to the table. I also added a few more to case 2a; I'm not sure you are using that list. Archiving of Billboard is very sketchy, so we will need larger sample runs in order to fill in more of the gaps. --Muhandes (talk) 13:03, 30 November 2021 (UTC)
Muhandes: Yes, confirmed I am using the 2a list, thanks. Wikipedia:Link rot/cases/Billboard is updated; this time I kept the duplicates in case that is helpful. The number of new cases should get fewer as it goes along. Will next process 400 articles once these 200 are done (100..200..400..). If the number of problems gets too large, I will reduce the batch size to keep it manageable. The goal is 14,000 articles in total, which seems doable, and it will result in a really good map table at the end, which might be worth publishing on GitHub for other websites or wikis dealing with this migration. -- GreenC 19:09, 30 November 2021 (UTC)
@GreenC: I added some 3LC for which I don't know the longer code and it may have never existed, but I verified that they work. Looking forward to the 400. --Muhandes (talk) 13:31, 1 December 2021 (UTC)
Muhandes, hi. You removed BBR and IDT from table C, but did not add them into table A or B. The bot will only process a URL when its code is in one of the three tables. If it comes across an unknown code, it gets flagged and I come back here to ask which table to put it in. That is the question for BBR and IDT: which table? Or is there a 4th table for a new sort of action? -- GreenC 20:20, 1 December 2021 (UTC)
I think from your edit comment that BBR and IDT are valid codes, but the specific URLs in this case were not migrated and thus return 404. Since they are valid codes, I added them to table B. Two-step process: 1. see if a code is known in one of the tables, otherwise flag it for review; 2. check that the new URL is working, otherwise add an archive URL. -- GreenC 22:35, 1 December 2021 (UTC)
@GreenC: Sorry I wasn't clear, I did it while in a hurry. I removed BBR and IDT simply because I don't know if they are good or not. I suggest for now do nothing about them, we can fix them in a later run. I suspect there are many more where I still don't know. I suggest you run another 400 and we will learn more. --Muhandes (talk) 22:52, 1 December 2021 (UTC)
Alright, now they are in table D "Uncertain". Those URLs will be treated as dead with archives, can return later and reprocess if new info appears. Finishing the 200 tonight, then will start on the 400. -- GreenC 23:17, 1 December 2021 (UTC)
Muhandes, Wikipedia:Link rot/cases/Billboard updated. Surprised how many there are still, must be a long tail distribution. -- GreenC 05:35, 2 December 2021 (UTC)
@GreenC: I am done going over the run and I updated the definition lists, ready for more. --Muhandes (talk) 12:44, 2 December 2021 (UTC)
Reprocessed and it ran clear. Diffs uploaded. -- GreenC 01:27, 3 December 2021 (UTC)
Muhandes, Wikipedia:Link rot/cases/Billboard updated. 800 articles. -- GreenC 01:27, 3 December 2021 (UTC)
@GreenC: GIMMIE MOAR!!! --Muhandes (talk) 11:52, 3 December 2021 (UTC)
Heh. I think next is gimmie shelter. Bad news: one of my HDs started failing, and I can't run without it. Working on a backup; it's a big drive with millions of files, so it might take a few days. No loss for this project other than time. After that, it will take a day to process the 800. -- GreenC 18:42, 3 December 2021 (UTC)
GreenC, No problem, take your time and ping me when ready. Muhandes (talk) 09:38, 5 December 2021 (UTC)
Muhandes, Wikipedia:Link rot/cases/Billboard updated. 1500 articles. -- GreenC 02:52, 6 December 2021 (UTC)
@GreenC: Done, definitions updated. --Muhandes (talk) 15:28, 6 December 2021 (UTC)
After this batch it will be 3000-6000, then 6000-10000, and then 10000-finish (about 14k total). For technical reasons, Elasticsearch is unable to return more than 10,000 results, so I need to process the first 10,000 and then search for the remainder. Unless you want to try 3000-10000 at once. -- GreenC 17:08, 6 December 2021 (UTC)
It's fine this way. --Muhandes (talk) 20:08, 6 December 2021 (UTC)
Muhandes, Wikipedia:Link rot/cases/Billboard updated. 7000 articles. 260 unknowns. -- GreenC 05:03, 7 December 2021 (UTC)
Muhandes, Wikipedia:Link rot/cases/Billboard came in with another 45 from the same batch. Some look like wiki typos or possibly problems with my code, not sure. -- GreenC 07:08, 8 December 2021 (UTC)
@GreenC: I made an error with the table (rap-song should be rap not rsa), can you please share the logs so I can fix the incorrect replacements? --Muhandes (talk) 09:30, 8 December 2021 (UTC)
OK, there were about 50. I just wrote a quick script to fix them; should be good. -- GreenC 17:19, 8 December 2021 (UTC)
Also, done going over those 45 rejects; made another small update to the conversion charts. --Muhandes (talk) 11:01, 8 December 2021 (UTC)
Awesome. Diffs for the 7000 successfully uploaded; that was the bulk. Starting to process another 6,800 articles still in the search results. It won't be that many due to archive URLs, but I need to run through them to find which can be converted. -- GreenC 02:47, 9 December 2021 (UTC)
Muhandes, Wikipedia:Link rot/cases/Billboard updated. It processed 6,808 articles, found matches in 3,944 and logged 37 errors. This is the last batch. -- GreenC 15:26, 9 December 2021 (UTC)
@GreenC: Most of those 37 were errors which I fixed manually, with a few minor additions to the definition tables. Does this mean we are done? --Muhandes (talk) 08:43, 10 December 2021 (UTC)
One didn't get through: Euphoria (Leftover Salmon album): https://www.billboard.com/music/leftover-salmon/chart-history/NMT/song/187896 - can you fix it manually in the wiki? I'm going to start processing the batch for upload. -- GreenC 17:45, 10 December 2021 (UTC)

Results

  • Articles processed: 16,808
  • Articles edited: 13,259
  • Billboard URLs migrated (table a & b): 31,408
  • Billboard URLs not migrated (table c): 1,966
  • Billboard URLs added archive URL: 2,713
  • Billboard URLs added {{dead link}}: 993

Muhandes, I believe it is done. Really great work finding the new codes and cleaning up errors in the wiki text. I'm going to re-tool the bot for other projects now. Should it be needed, it can be restored again. Cheers! -- GreenC 22:01, 10 December 2021 (UTC)

GreenC, great, thanks for all the effort. Muhandes (talk) 15:39, 11 December 2021 (UTC)

www.cia.gov

I found many broken links to this site throughout Wikipedia (here, for example). Jarble (talk) 01:33, 27 February 2022 (UTC)

www.questia.com

I found several broken links to this site that have not yet been repaired. Jarble (talk) 19:30, 28 February 2022 (UTC)

Academic Journals

I represent an academic journal publisher that has legacy URLs appearing as external links on at least 80,000 Wikipedia pages. Due to a change in both domain name and URL structure, all of these URLs are now deprecated, but the citations are still valid and the content is still available. Most URLs result in a 301 or 302 redirect to the correct page. However, redirects are not a permanent solution and may fail in the future.

The affected pages include the most trafficked and most influential pages on Wikipedia including, for example:

What is the best way to inventory and update these external links en masse?

We don't want broken links to be archived. We want the opportunity to fix them before a bot updates the pages. Is there a recommended method for collecting all the current external links from Wikipedia before initiating a bot to update the citations? — Preceding unsigned comment added by SibeliusHicks (talkcontribs)

If you want any action on this, you need to specify the journal publisher that you represent and give examples of specific legacy URLs. The current request is not actionable due to a lack of information. The articles you mention have hundreds of references, so it is not possible to easily determine the publisher that you are talking about. Hemiauchenia (talk) 23:31, 3 March 2022 (UTC)
From their previous edits, they appear to be referring to American Association for the Advancement of Science journals. Hemiauchenia (talk) 01:10, 4 March 2022 (UTC)
Yes, journals of American Association for the Advancement of Science. However, please don't initiate any bot activity until I confirm that all redirects are in place. SibeliusHicks (talk) 13:42, 4 March 2022 (UTC)

My bot can help. It would need a list of domain names, e.g. science.org, etc. If redirects already exist, it's pretty trivial to migrate the URLs. When there are no redirects, we can discuss other methods based on the information available about the new URL structure. The goal is to migrate to the new URL where possible, not to add archives. The bot will attempt to migrate to the new URL and change any existing |url-status=dead to live, or remove any existing {{dead link}}. If it can't find a migration destination, it will add an archive URL and log it. If it can't find an archive URL, it will add a {{dead link}} and log it. It will do these things for citations in template format like cite journal, as well as for square-bracketed and bare links. -- GreenC 01:29, 4 March 2022 (UTC)
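A hedged sketch of the migration check for the redirect case, under the assumption that a clean 301/302 chain ends at the correct new page; the helper name is a placeholder, and a real run would also need the soft-404 content checks mentioned below:

```python
import requests
from typing import Optional

def resolve_migrated_url(old_url: str, timeout: int = 30) -> Optional[str]:
    """Follow any 301/302 redirects from a legacy URL and return the final URL
    if it answers 200, otherwise None (candidate for an archive or {{dead link}})."""
    try:
        resp = requests.get(old_url, allow_redirects=True, timeout=timeout)
    except requests.RequestException:
        return None
    if resp.status_code == 200:
        return resp.url  # may still be a soft-404; a content check is needed on top
    return None
```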

I would be interested in the bot. First I need to inventory and test the currently published citations so I can assure the highest level of success for the bot. Short of a complete, brute force crawl of Wikipedia, is there a recommended method for finding the existing citation URLs? SibeliusHicks (talk) 13:42, 4 March 2022 (UTC)
I've done this kind of work for many sites larger than this, and I am confident it can go well if you can provide guidance on the URL mappings, old to new, and the list of domains. Typically the biggest problem with this kind of work is soft-404s, meaning a page that returns status 200 but is not the intended page, such as a working redirect to a home page or some other info page. To answer your question: for URLs on a given domain, e.g. science.org, see https and http (you need to search twice, once for http and once for https). -- GreenC 14:26, 4 March 2022 (UTC)
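For inventorying links without a brute-force crawl, a sketch using the MediaWiki exturlusage API (the same data behind the external-link search pages linked above); the parameters shown are one reasonable configuration, not the only one:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def external_links_for_domain(domain: str, protocol: str = "https"):
    """Yield (page title, URL) pairs for external links matching a domain.
    Run once with protocol='http' and once with 'https', as noted above."""
    params = {
        "action": "query",
        "list": "exturlusage",
        "euquery": domain,        # e.g. "*.science.org"
        "euprotocol": protocol,
        "eunamespace": 0,         # article space only
        "eulimit": "max",
        "format": "json",
    }
    while True:
        data = requests.get(API, params=params, timeout=30).json()
        for item in data["query"].get("exturlusage", []):
            yield item["title"], item["url"]
        if "continue" not in data:
            break
        params.update(data["continue"])

# for title, url in external_links_for_domain("*.science.org"):
#     print(title, url)
```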

Fixing internal "linkrot"

Can we discuss, and possibly develop, a "bot tasking" for "real-time repair" of internal linkrot caused when page sections are moved by "cut and paste"? Currently, when relisting XFD threads and archiving page sections, the incoming Page#Section links are broken and the backlinks are not subsequently repaired (being left inoperable). In my opinion, this is untenable when corrective measures are well within our technical capabilities. I propose that we undertake correcting the current situation and invite further discussion here (with keen anticipation). Thank you.--John Cline (talk) 16:41, 5 March 2022 (UTC)

User:John Cline: this is not the best forum, because it deals with URLs, i.e. external links. I recommend WP:BOTREQ. -- GreenC 21:10, 10 March 2022 (UTC)
Thank you for your reply; I'll take up my request there. Cheers.--John Cline (talk) 20:38, 12 March 2022 (UTC)

therailwaycentre.com hijacked with malware

This website has been hijacked and is now pushing out malware to visitors. It is cited over 250 times in railway articles. Can it be marked as a dead, usurped link as a matter of urgency? (I have also requested that it be added to the MediaWiki blacklist.) 10mmsocket (talk) 16:38, 26 March 2022 (UTC)

User:10mmsocket: It's done. Edited about 266 pages, including File space, converting to usurped in various ways depending on the situation. Recent diffs at Special:Contributions/GreenC_bot. -- GreenC 22:39, 26 March 2022 (UTC)

C-Span Template IDs - String -> Numeric conversion to fix root of linkrot

In short, the C-Span template is more reliable when it uses a numeric ID, because string IDs are not reliably redirected when they change; see also the Wikidata P2190 discussion. Querying the C-Span URL generated by the template returns a response which includes the new numeric ID for resolving links. For the items in Wikidata, this has been done for the string IDs added prior to 26 Feb 2022 (those added after that date haven't been checked, and those from before with no numeric ID are broken). The ID values are available to swap in and fix the root cause of the link rot, but there are 5 or 6 thousand more uses of C-Span templates on EN Wikipedia that may not have been imported to Wikidata and are at risk of breaking, or are already broken. A bot could: 1. replace existing string IDs with the numeric ID matched on P2190, as a first step in fixing the issue; 2. query and parse the response URL to get the new numeric ID for those not in Wikidata. For the Wikidata batch processing I did, about 10% needed manual resolving due to the URL changing or there being no clear match after a manual search by name variants. Generating a list of these to review, with the associated page/Wikidata Q item, would also be useful for the manual cleanup of broken links. Anyone have a bot that can do this, or willing to take on this fix? Wolfgang8741 says: If not you, then who? (talk) 16:24, 14 March 2022 (UTC)
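A hedged sketch of step 2, assuming the template expands to a URL of the form https://www.c-span.org/person/?<string-id> (check the template's actual formatter URL) and that the numeric ID can be read out of the redirect target; both patterns are assumptions to verify against a few live samples before any bot run:

```python
import re
import requests
from typing import Optional

def resolve_cspan_numeric_id(string_id: str) -> Optional[str]:
    """Follow the redirect from a legacy string-ID URL and pull the numeric ID
    out of the final URL. Both URL patterns below are assumptions, not confirmed."""
    old_url = f"https://www.c-span.org/person/?{string_id}"   # assumed legacy pattern
    try:
        resp = requests.get(old_url, allow_redirects=True, timeout=30)
    except requests.RequestException:
        return None
    m = re.search(r"/person/.*?(\d+)", resp.url)              # assumed new pattern
    return m.group(1) if m else None
```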

Tracking categories from the template are now available to aid in cleanup; focus on mainspace, since these categories also include template- and user-space uses.
Looks like Roccerbot is taking care of this. -- GreenC 21:18, 27 March 2022 (UTC)