Wikipedia:Bots/Requests for approval/PrimeBOT 25
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.
Operator: Primefac (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 22:22, Thursday, February 1, 2018 (UTC)
Automatic, Supervised, or Manual: automatic
Programming language(s): AWB
Source code available: WP:AWB
Function overview: Fix broken URLs
Links to relevant discussions (where appropriate): BOTREQ request
Edit period(s): one-time run
Estimated number of pages affected: ~1500
Namespace(s): All
Exclusion compliant (Yes/No): Yes
Function details: www.cwgc.org has changed its URL structure, leaving a lot of pages with broken links. Simple find/replace:
(?<!/)http://www\.cwgc\.org/search/casualty_details\.aspx\?Casualty=([0-9]+)
→ http://www.cwgc.org/find-war-dead/casualty/$1
(?<!/)http://www\.cwgc\.org/search/cemetery_details\.aspx\?cemetery=([0-9]+)&mode=1
→ http://www.cwgc.org/find-a-cemetery/cemetery/$1
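For illustration only, here is a minimal Python sketch of the same two substitutions (the actual run used AWB's .NET regex engine, where the replacement token is $1; Python's re module uses \1 instead). This is an editorial sketch, not the bot's source code:

import re

# Sketch of the find/replace described above; the real edits were made with AWB.
CASUALTY = re.compile(r'(?<!/)http://www\.cwgc\.org/search/casualty_details\.aspx\?Casualty=([0-9]+)')
CEMETERY = re.compile(r'(?<!/)http://www\.cwgc\.org/search/cemetery_details\.aspx\?cemetery=([0-9]+)&mode=1')

def fix_cwgc_links(wikitext):
    """Rewrite old-style cwgc.org search URLs to the new path-based form."""
    wikitext = CASUALTY.sub(r'http://www.cwgc.org/find-war-dead/casualty/\1', wikitext)
    wikitext = CEMETERY.sub(r'http://www.cwgc.org/find-a-cemetery/cemetery/\1', wikitext)
    return wikitext

# Example:
# fix_cwgc_links('|url=http://www.cwgc.org/search/casualty_details.aspx?Casualty=9898')
# returns '|url=http://www.cwgc.org/find-war-dead/casualty/9898'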
Discussion
- Consider archive URLs (e.g. Wayback Machine snapshots and similar archives that embed the original cwgc.org URL); a full list of archive types and URLs is at WP:WEBARCHIVES. -- GreenC 03:13, 2 February 2018 (UTC)
- Good point. I've amended my code to include a lookbehind. Primefac (talk) 03:22, 2 February 2018 (UTC)
- Would it overlap with template instances like |url=http://www.cwgc.org/find-war-dead/casualty/9898? -- GreenC 04:10, 2 February 2018 (UTC)
- There are 5 instances out of over 1000 where the archive URL includes "?url=", so assuming that I remove those five from the list there shouldn't be an issue. I actually thought it would be the other way around... Primefac (talk) 04:15, 2 February 2018 (UTC)
- Yeah, those are WebCite, not too common. The domain was whitelisted by IABot at some point, so it hasn't been auto-archived (in the wiki anyway), which turned out to be a good thing. The 1000 with |url= might also have |deadurl=yes, and ideally it would be set back to |deadurl=no, but understandably that would be a more complex bot and not crucial. IABot might be able to detect and make the change, not sure. @Cyberpower678: -- GreenC 04:55, 2 February 2018 (UTC)
- IABot can, but won't.—CYBERPOWER (Chat) 22:52, 2 February 2018 (UTC)
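As an editorial aside on the lookbehind and overlap questions above, here is a quick Python sketch of how the (?<!/) lookbehind behaves; the archive URLs below are hypothetical examples for illustration, not links taken from the affected articles:

import re

CASUALTY = re.compile(r'(?<!/)http://www\.cwgc\.org/search/casualty_details\.aspx\?Casualty=([0-9]+)')

# Wayback-style archive link: the embedded URL is preceded by '/', so the
# lookbehind blocks the match and the archived snapshot is left alone.
wayback = 'https://web.archive.org/web/20130101000000/http://www.cwgc.org/search/casualty_details.aspx?Casualty=123'
assert CASUALTY.search(wayback) is None

# WebCite-style archive link: the embedded URL follows '?url=', so the
# lookbehind does not exclude it; the handful of such cases were skipped manually.
webcite = 'http://www.webcitation.org/example?url=http://www.cwgc.org/search/casualty_details.aspx?Casualty=123'
assert CASUALTY.search(webcite) is not None

# Links already in the new form never match the old-style pattern at all.
fixed = '|url=http://www.cwgc.org/find-war-dead/casualty/9898'
assert CASUALTY.search(fixed) is None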
- The original request that I made didn't consider web archive URLs. If those instances need fixing manually before any bot run, I am happy to do that. Some URLs are just wrong (for various reasons) and will need manual checking to correct the ID numbers used. If it is possible to output a list of URLs that still return 404 errors even after this correction is applied, that would help immensely. It is important to limit this to the casualty and cemetery URLs only, not the other URLs from the CWGC site (many of which are also broken). The web archive links at Cemeteries and crematoria in Brighton and Hove give good examples of the change in the appearance of the CWGC pages from 2013 to the present: 1 vs 2 and 3 vs 4. For the appearance in 2011, see the archived link at Percy Charles Pickard: 5 Carcharoth (talk) 12:35, 2 February 2018 (UTC)
- A couple more points. For some reason I don't understand, the bottom 44 or so links here are lacking any ID number at all; those will need to be fixed manually. I also checked the 'https' links; the respective numbers are 163, 0, 51, 0, i.e. none of the 'aspx' links are https, and only a few of the correct ones are https. Carcharoth (talk) 21:02, 2 February 2018 (UTC)
- If they don't have a numerical value in the URL, then they won't be picked up; the regex is only looking for digits. Primefac (talk) 15:08, 3 February 2018 (UTC)
- I have checked the list of incorrect URLs and those all have a numerical value in the URL. The ones with the numerical values missing are in the 'correct' form; I will fix them manually. Carcharoth (talk) 14:49, 4 February 2018 (UTC)
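(Editorial sketch of the point about digits, using Python for illustration: the capture group ([0-9]+) requires at least one digit, so a link with the ID number missing is simply not matched and has to be fixed by hand. The example URLs are illustrative.)

import re

CASUALTY = re.compile(r'(?<!/)http://www\.cwgc\.org/search/casualty_details\.aspx\?Casualty=([0-9]+)')

# Old-style link with the ID missing: no digits, no match, left untouched.
assert CASUALTY.search('http://www.cwgc.org/search/casualty_details.aspx?Casualty=') is None

# Old-style link with an ID: matched and rewritten by the bot.
assert CASUALTY.search('http://www.cwgc.org/search/casualty_details.aspx?Casualty=9898') is not None

# New-form link with the ID missing (the case noted above): also untouched.
assert CASUALTY.search('http://www.cwgc.org/find-war-dead/casualty/') is None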
- Gen fixes? Yes or no?—CYBERPOWER (Chat) 22:55, 2 February 2018 (UTC)
- Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete.—CYBERPOWER (Chat) 16:59, 3 February 2018 (UTC)
- Trial complete. - Edits. Note that I did 25 of each URL pattern. Primefac (talk) 18:06, 3 February 2018 (UTC)
- I've looked through most of those and they all look fine to me in terms of fixing the URLs (of course, other fixes may be needed, beyond the scope of the bot). Carcharoth (talk) 14:49, 4 February 2018 (UTC)
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.