Wikipedia:Link rot/URL change requests/Archives/2023/June



www.thecanadianencyclopedia.com → www.thecanadianencyclopedia.ca

Vipz (talk) 22:14, 25 May 2023 (UTC)

Looks like pages need to be checked for the string "Oops! Page not found" and, if it is present, treated as dead links. They return status 200, i.e. they are soft-404s. -- GreenC 03:24, 26 May 2023 (UTC)

Vipz, this is done, edited about 1,850 pages. -- GreenC 19:18, 9 June 2023 (UTC)
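
The soft-404 check described above can be sketched roughly as follows. This is a minimal illustration, not the bot's actual code: the function name `is_dead` and the rule that any status of 400 or higher counts as dead are assumptions; only the marker string and the status 200 behaviour come from the discussion.

```python
# Marker string quoted in the discussion; the site serves it with HTTP 200.
SOFT_404_MARKER = "Oops! Page not found"


def is_dead(status_code: int, body: str) -> bool:
    """Return True for hard errors and for soft-404 pages.

    Hypothetical helper: treats any 4xx/5xx status as dead (an assumption),
    and a 200 response as dead when the body contains the marker string.
    """
    if status_code >= 400:
        return True
    return status_code == 200 and SOFT_404_MARKER in body
```

A caller would fetch the page with whatever HTTP client it already uses and pass in the status code and decoded body.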

The site is long dead, but many links have yet to be given archives. Kailash29792 (talk) 16:22, 7 June 2023 (UTC)

https://www.indolink.com suggests it has been usurped in addition. I don't know if that's only for the home page, or other sub-links. I'll treat it as usurped to be safe. -- GreenC 04:18, 8 June 2023 (UTC)

Done. -- GreenC 16:13, 9 June 2023 (UTC)

WebCite

WebCite archive links no longer work; please fix them. Thanks. Notrealname1234 (talk) 13:24, 28 May 2023 (UTC)

I moved whatever can be moved to other archives about a year ago (not 100%, but close). Whatever is left doesn't have good options. I offered the community to delete the remaining links (Wikipedia:Village_pump_(policy)#RfC:_Delete_all_webcitation.org_links), but they don't trust me or something. I don't consider this my problem anymore; the community has spoken and they don't want me involved. Fine with me, I have lots of other work to do. Good luck. -- GreenC 16:03, 28 May 2023 (UTC)
What about the other deprecated web archives here? Notrealname1234 (talk) 00:24, 29 May 2023 (UTC)
I think they are all migrated except www.bac-lac.gc.ca which has about 5,500 links. -- GreenC 17:08, 29 May 2023 (UTC)
so, atleast can you try changing the links? Notrealname1234 (talk) 23:12, 8 June 2023 (UTC)
At least. -- GreenC 01:51, 9 June 2023 (UTC)
done. -- GreenC 15:59, 10 June 2023 (UTC)

Imageshack

Imageshack image links redirect to a login page and no longer work (img19.imageshack.us, for example); the "img" subdomain at the start of the URL varies. Notrealname1234 (talk) 17:31, 17 June 2023 (UTC)

The main site is still live https://imageshack.us but after processing the URLs on Enwiki, the result is every single one is either dead or a soft404. They're now archived or marked {{dead link}}. -- GreenC 20:04, 18 June 2023 (UTC)

publicartarchive.org/work/

There are about 120 articles with links to "publicartarchive.org" and top directory level of "work", none of which actually work. Also, there do not appear to be working wayback archives.

Finding a working url requires doing a search from the website's home page, searching on the name of the artwork or the artist or the location. It doesn't look feasible to automate this. I have done a handful manually. In most cases, finding the replacement url is not too difficult, but actually updating the links manually seems pretty tedious. Fabrickator (talk) 02:44, 13 June 2023 (UTC)

Here are a couple of examples:
change http://www.publicartarchive.org/work/stepped-tower
to https://publicartarchive.org/art/Stepped-Tower/15e37eb5
change http://www.publicartarchive.org/work/theater-wind
to https://publicartarchive.org/art/Theater-Wind/16509da4
(it appears that the level below "art" is descriptive only and doesn't actually affect the resolution of the url)
Fabrickator (talk) 16:48, 14 June 2023 (UTC)
That's something I might try automating. It will test the proposed new URL is working before making the change. -- GreenC 18:13, 14 June 2023 (UTC)
I'm assuming I will need to do a manual search to determine the key (e.g. 16509da4) unless you have a better idea. Fabrickator (talk) 18:50, 14 June 2023 (UTC)
It's odd: https://publicartarchive.org/art/Stepped-Tower works and https://publicartarchive.org/art/Theater-Wind doesn't (it returns status code 200, i.e. a soft-404). Neither has a redirect in the header, so it's some kind of JavaScript, which might trip up the bot; I'll have to see what happens. I could at least try a dry run that only logs the results, to see how many it could migrate. It's not many pages. -- GreenC 19:18, 14 June 2023 (UTC)
I just realized these seem to have working wayback copies... so I presume it would be simpler just to add the archive link to any of the urls which have the "/work/" directory. Fabrickator (talk) 01:58, 15 June 2023 (UTC)
90 links have archives available. 41 do not. The rest are already archived. Do you want to add archives for the 90, and {{dead link}} for the 41? -- GreenC 03:57, 15 June 2023 (UTC)
Is there a tool I should be learning to use to reduce the effort required to do this? Fabrickator (talk) 06:00, 15 June 2023 (UTC)
This is a very particular case due to how the remote site is configured. Really, to do it correctly, the 131 URLs need to be checked manually and the list of working URLs loaded into my bot, which then makes the move and archives/deadlinks the remainder. -- GreenC 15:32, 15 June 2023 (UTC)
I manually checked the URLs and anything not working at the new URL will have an archive URL added, and any of those without an archive URL available it will add a {{dead link}}. There are 18 of them with a dead link. I went ahead and uploaded the results. If you see any problems let me know. -- GreenC 02:39, 16 June 2023 (UTC)
I'm going through this list manually and I'm finding a fair number of significant issues, but it's quite tedious and due to my personal situation, it's probably going to be a couple of days before I am finished, so for the moment, please don't archive this topic. Thanks. Fabrickator (talk) 15:58, 16 June 2023 (UTC)
I already ran the bot: it added 32 archive URLs, 18 {{dead link}} and moved 83 URLs to the /art/ format. Unless you meant don't archive this discussion. -- GreenC 18:05, 16 June 2023 (UTC)
My original intention was just to track the changes that would still be needed (i.e. to get working links), but I have now surmised what's going on with this website, and why sometimes you can specify just the descriptive label of an object and other times you must specify the object's unique key. (Maybe this is obvious to everybody else, but there's no a priori way to know that's how this particular website works.) I would like to include these details so that others are likely to find them, assuming they think to search the archives. This will also inform us as to what we can do to minimize the likelihood of these problems recurring. So you tell me if you agree this is the best place to document these details. Fabrickator (talk) 17:04, 17 June 2023 (UTC)
You can document it here; there is also a special page, Wikipedia:Link_rot/cases, where certain domains are documented. It's the rule for most websites: when they move to a new URL format, some of the old URLs don't migrate correctly, and it's rare to see a clean migration. The admins lose track of what they have, or make mistakes, or intentionally drop some links. -- GreenC 17:39, 17 June 2023 (UTC)
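
The per-URL decision described in this thread (move the link if the new URL works, otherwise add an archive URL if one exists, otherwise tag it {{dead link}}) can be sketched as below. This is only an illustration of the logic as stated; the function name `triage` and its return strings are hypothetical and are not taken from the bot.

```python
from typing import Optional


def triage(new_url_works: bool, archive_url: Optional[str]) -> str:
    """Pick an action for one old /work/ link.

    Hypothetical helper mirroring the order described in the thread:
    prefer the live /art/ URL, fall back to an archive, else mark dead.
    """
    if new_url_works:
        return "move"        # rewrite the link to the working new URL
    if archive_url:
        return "archive"     # keep the old URL and add the archive URL
    return "dead link"       # nothing available: tag with {{dead link}}
```

The inputs would come from the manually checked list of working URLs and from an archive availability lookup, both of which are outside this sketch.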

I have the explanation for the link for "Theater-Wind". They have evidently adopted a convention in which the "label" is generally derived from the title that's shown when you view it. In this case, that is "Theatre of the Wind". The label (which is case-sensitive) will be the title with spaces changed to hyphens, and non-alphabetics deleted. So this becomes "Theater-of-the-Wind". I'm suggesting that the preferred form is to include both the label and the key. Having the key protects against a change in the label, but having the label is a convenience because it lets you know what is actually expected and would help you to find the updated key if there is a new version. Keep in mind that it's possible they would get an updated version with a new unique key while maintaining the same label, but this is really theoretical on my part. Mostly I think this is a "don't care" situation, whether they serve up the old content or the new content, as long as both are available. Fabrickator (talk) 10:19, 18 June 2023 (UTC)
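
As a rough sketch of the convention described above (spaces become hyphens, non-alphabetic characters are dropped, case is preserved, and the preferred link carries both label and key), something like the following would build a link. The function names are hypothetical, the handling of non-ASCII letters is an assumption, and the site's real rules may differ in edge cases. The key still has to be found by hand.

```python
import re

# Base path taken from the example URLs in this thread.
BASE = "https://publicartarchive.org/art"


def make_label(title: str) -> str:
    """Derive the descriptive label from a displayed title.

    Assumption: only ASCII letters and spaces are kept; everything else
    (digits, punctuation) is deleted, then runs of spaces become hyphens.
    Case is preserved because the label is case-sensitive.
    """
    kept = re.sub(r"[^A-Za-z ]", "", title)
    return "-".join(kept.split())


def make_url(title: str, key: str) -> str:
    """Build the preferred form of the link: label plus unique key."""
    return f"{BASE}/{make_label(title)}/{key}"
```

For instance, `make_url("Stepped Tower", "15e37eb5")` reproduces the replacement URL given earlier in this topic.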

There is no easy way to obtain the key because it's generated by a JavaScript redirect and bots are unable to see the redirect. Even the Wayback Machine is unable to follow the redirect; it only saves a blank page. Archive.today is successful at saving it. In theory I could save it at archive.today and then scrape the snapshot for the redirected URL, but that's a lot of work to program for a few dozen links. -- GreenC 15:40, 18 June 2023 (UTC)
As great as it is to be able to use bots to do the work, we all realize this has limitations, but at the same time, I want to capture as much of the technical details as I can (though I realize this is one of numerous domains, each with its own set of unique issues). There are also alternate use cases, e.g. instead of specifying a single object, the link needs to select a set of objects meeting specified criteria. The progress I'm making is slow in numerical terms, but it's a learning experience with each distinct set of use cases. Fabrickator (talk) 07:42, 19 June 2023 (UTC)
Nah it's always better to do it that way, manually. Bots are a sort of emergency solution that can't get all the details a human can. I wish there was no need for bots. This case has 120 URLs so it's possible, but what if there were 4,000 like below which is not even a large case. There are 100s of thousands of domains, 100s of millions of URLs. Even with bots we can't keep up with the constant changes external sites make. I'm glad you are taking the time to do it right. -- GreenC 14:15, 19 June 2023 (UTC)
At first I was okay with the links to Wayback archive versions, but I realized that this sometimes meant that one or more of the images might not be available, and the images are the main point of linking to this website. Whenever I modified a link, I made sure to include both the label and the unique identifier, but I left alone any pages that had a working live url (e.g. even though the url contained only the label rather than the unique identifier). A small handful of links involved some sort of challenge to determine the label; in most of these cases, I believe the label was simply not unique. There's no actual requirement for a unique label, it just makes it possible to have a link without including the unique identifier. There were also a few cases where the selection returned multiple objects. Altogether I modified links in somewhere around 65 articles, plus another dozen or so articles in non-English wikis (Chinese, French, Italian, Slovenian, Spanish, Swedish, Tagalog). Though this is obviously not a very scalable approach, there is some insight obtained about how some websites work, as well as the satisfaction of providing the most useful result. Fabrickator (talk) 05:10, 22 June 2023 (UTC)

NewspaperSG url change for articles.

The newspaper archive of Singapore, NewspaperSG, has moved its newspaper archive url from eresources.nlb.gov.sg/newspapers/Digitised/Article/straitstimes19331015.2.83 to eresources.nlb.gov.sg/newspapers/digitised/articles/straitstimes19331015.2.83 . You can use the first three references from Lim Yong Liang as an example. While the case change does not affect the url, the introduction of an "s" breaks the url. This should be a quick and easy change. I would appreciate it if anyone could assist with this. Thanks. Justanothersgwikieditor (talk) 04:34, 24 June 2023 (UTC)

User:Justanothersgwikieditor: I started working on this last night and ran into some complications with the remote site blocking bots, and other things: for example, "/Digitised/Article.aspx?articleid=straitstimes19730106-1.2.98" needs to be converted to "/digitised/articles/straitstimes19730106-1.2.98". That was from a test of the first 10 pages, so there are probably other issues to be discovered. I won't have much time today but I will continue to work on this. -- GreenC 14:35, 24 June 2023 (UTC)
@GreenC, thanks for working on this. As far as speed is concerned, take your time. This issue was noted the last time NewspaperSG changed their format. I did a fix using AWB previously, but it seems my fix did not cover all cases (I still find articles in this old format from time to time, as you did now). Thank you. Justanothersgwikieditor (talk) 15:23, 24 June 2023 (UTC)
There's "/Digitised/Page.aspx?pageid=" and "/Digitised/Page/", which convert to "/digitised/pages/". There are also a few search results, but I can't find a working URL for them: [1] (bottom of page). I found the bot block problem (User-agent string). For URLs with a trailing ".aspx", it needs to be removed. Pages like this should be treated as dead links. Some URLs don't convert, i.e. the remote site returns a dead link at the new URL (e.g. [2]), so archive URLs should be added where available. Sometimes the old format URL works [3] and the new format does not [4]. -- GreenC 16:36, 24 June 2023 (UTC)
@GreenC, for /Search, these are some of the working comparisons:
https://eresources.nlb.gov.sg/newspapers/search?q=%22barbara+yu+ling%22+albery
For https://eresources.nlb.gov.sg/newspapers/digitised/articles/straitstimes20060702-1.2.5.2, this is available within the national library's intranet due to copyright agreement between the library and the publisher. Rather than dead link, we should class them with the 'url-access' parameter set with 'subscription' value, since the 'paywall' is the need to travel down to the library to access the articles via paid by time use terminals. – robertsky (talk) 23:24, 24 June 2023 (UTC)
'subscription' could work for CS1|2 templates but not anything else (e.g. bare and square URLs). It's also probable there are URLs that were previously converted by AWB that didn't update the url-status. So on the first pass it will convert /Digitised/Article/ only. Then it will need to check all URLs with /digitised/articles/ and, if in CS1|2, set url-access to subscription. -- GreenC 08:21, 25 June 2023 (UTC)
Justanothersgwikieditor: It looks like the remote site has changed, as of this morning. The old format now works: https://eresources.nlb.gov.sg/newspapers/Digitised/Article/straitstimes19331015.2.83 and the new format no longer works: https://eresources.nlb.gov.sg/newspapers/digitised/articles/straitstimes19331015.2.83 .. the site admins are actively changing things. At some point I may need to check all new form URLs and convert them back to the old form. -- GreenC 16:38, 26 June 2023 (UTC)
@GreenC Sheesh. I am sorry for wasting your time. It seems like they did an update to the new display format and then a switcheroo with the urls in live mode, aka reverting to the old url style.
There was no new maintenance notice as of yesterday (SGT).
Thank you for helping to fix the old-old style. Justanothersgwikieditor (talk) 23:56, 26 June 2023 (UTC)
I think we can consider this request closed and if NewspaperSG decides to change again, I will put in a new request. Thanks so much! Justanothersgwikieditor (talk) 01:14, 27 June 2023 (UTC)
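
For anyone searching the archives later, the old-to-new path mapping discussed in this topic can be sketched as below. This is an illustration only, written before knowing which format the site will settle on (the site reverted to the old style during this discussion). The function name `convert`, the rule table, and the case-insensitive matching are assumptions; the four path patterns and the ".aspx" removal come from the comments above.

```python
import re

# (pattern, replacement) pairs for the old path forms named in this topic.
# Matching is case-insensitive on the assumption that case does not matter.
_RULES = [
    (re.compile(r"/Digitised/Article\.aspx\?articleid=", re.IGNORECASE), "/digitised/articles/"),
    (re.compile(r"/Digitised/Article/", re.IGNORECASE), "/digitised/articles/"),
    (re.compile(r"/Digitised/Page\.aspx\?pageid=", re.IGNORECASE), "/digitised/pages/"),
    (re.compile(r"/Digitised/Page/", re.IGNORECASE), "/digitised/pages/"),
]


def convert(url: str) -> str:
    """Rewrite an old-format NewspaperSG URL to the new format.

    URLs that match none of the old forms are returned unchanged.
    """
    for pattern, replacement in _RULES:
        new_url, count = pattern.subn(replacement, url, count=1)
        if count:
            # Drop a trailing ".aspx" left over from the old format.
            if new_url.lower().endswith(".aspx"):
                new_url = new_url[: -len(".aspx")]
            return new_url
    return url
```

A converted URL would still need a live check before saving, since some articles return a dead page at the new URL and others are only reachable from within the library.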