Wikipedia:Bots/Requests for approval/WildBot 8
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Request Expired.
Operator: Josh Parris (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 23:47, Friday December 6, 2013 (UTC)
Automatic, Supervised, or Manual: Automatic
Programming language(s): Python, wikitools
Source code available: Sure, prompt me
Function overview: Replace broken urls to *.thecanadianencyclopedia.com with working ones to thecanadianencyclopedia.ca
Links to relevant discussions (where appropriate): Wikipedia:Bot requests/Archive 57#URL updates for The Canadian Encyclopedia
Edit period(s): one run
Estimated number of pages affected: ~3468
Exclusion compliant (Yes/No): No, one-time run
Already has a bot flag (Yes/No): Yes
Function details: Swap broken URLs for tested good ones. I have assembled a mapping of URL updates for The Canadian Encyclopedia based on Wayback Machine lookups of all external URLs that match *.thecanadianencyclopedia.com, and used that to generate and test combinations of URLs against thecanadianencyclopedia.ca until I got a 200 success. Links to the home page of the site will be stripped. URLs where I couldn't get a successful hit will be left unchanged. Variations on the {{dead link}} templates are added to or removed from the article to reflect the status of external links; they're only removed for thecanadianencyclopedia links. The work parameter of the various cite templates is altered to change hyperlinks into domains.
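As a rough illustration of the substitution step described above, the following sketch applies a pre-built mapping of old to new URLs to a page's wikitext and leaves any URL without a tested replacement untouched. The names `URL_RE` and `apply_mapping`, the regular expression, and the example URLs are assumptions made for illustration; this is not the bot's actual source.

```python
import re

# Hypothetical pattern for external links on any *.thecanadianencyclopedia.com host.
# The URL is assumed to end at whitespace or at wiki markup such as ], |, } or <.
URL_RE = re.compile(
    r"https?://(?:[\w-]+\.)*thecanadianencyclopedia\.com[^\s\]|}<]*",
    re.IGNORECASE,
)


def apply_mapping(wikitext, mapping):
    """Replace each old URL that has a tested replacement; leave the rest alone."""

    def repl(match):
        old_url = match.group(0)
        # URLs with no successful hit stay unchanged, as the function details state.
        return mapping.get(old_url, old_url)

    return URL_RE.sub(repl, wikitext)
```

Keeping the mapping as a plain dictionary means the expensive lookups are done once, ahead of the run, and the edit itself is a pure text substitution.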
Discussion
Approved for trial (10 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. — HELLKNOWZ ▎TALK 22:30, 11 December 2013 (UTC)
- Trial complete.
- 10 trial edits in. I made a stupid error: my HTML comment was not an actual comment. I have fixed all the affected edits.
- More interesting errors:
- [1] "corrected" to a 404. The substitution did what it was told. The translation list was mis-populated because of a parsing error on http://web.archive.org/web/20110929060526/http://www.thecanadianencyclopedia.com/index.cfm?PgNm=TCE&Params=U1ARTU0000203 where the "writing" section was determined to be the article title. An inspection of the translation list shows this has not occurred on any other occasion, nor has "author" nor "bibliography" been the target of any translations. It was just good luck catching this one. Fixed on a couple of levels - tighter regex matching, plus those headings have been added to the blacklist.
- [2] shows a replacement within a ref tag where the link is followed by a {{dead link|date=December 2013}}. Ought I be removing these {{dead link}}s? Josh Parris 02:44, 12 December 2013 (UTC)
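The blacklist fix mentioned for the first error could look something like the sketch below: headings known to be section names are skipped when picking an article title out of an archived page. The assumption that titles and section headings both appear in `<h1>` elements, and the names `HEADING_BLACKLIST`, `TITLE_RE` and `extract_title`, are hypothetical; the actual page layout and regex are not given in this discussion.

```python
import re

# Section headings that must never be mistaken for an article title.
HEADING_BLACKLIST = {"writing", "author", "bibliography"}

# Assumed markup: candidate titles are the text of <h1> elements.
TITLE_RE = re.compile(r"<h1[^>]*>\s*([^<]+?)\s*</h1>", re.IGNORECASE)


def extract_title(html):
    """Return the first heading that is not a blacklisted section name, or None."""
    for match in TITLE_RE.finditer(html):
        text = match.group(1)
        if text.lower() not in HEADING_BLACKLIST:
            return text
    return None
```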
- Do I get this right -- you are comparing the actual page content on wayback archived version to find matches? Or title?
- Yes, you should remove {{dead link}}s after the citation or reference tag if you fix them. — HELLKNOWZ ▎TALK 13:16, 13 December 2013 (UTC)
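One way the tag removal could be approached is sketched here: each reference is treated as a unit, and if the old URL inside it is swapped, any {{dead link}} inside the reference or directly after its closing tag is dropped. This is a simplification that assumes one link per reference; a reference holding several links would need the tag matched to its own link. The function and pattern names are illustrative only.

```python
import re

# Matches {{dead link}} with optional parameters, plus any whitespace before it.
DEAD_LINK = re.compile(r"\s*\{\{\s*dead ?link\s*(?:\|[^{}]*)?\}\}", re.IGNORECASE)

# Matches one <ref>...</ref> pair plus a dead link tag placed directly after it.
REF = re.compile(
    r"<ref\b[^>/]*>.*?</ref>(?:\s*\{\{\s*dead ?link[^{}]*\}\})?",
    re.IGNORECASE | re.DOTALL,
)


def fix_refs(wikitext, old_url, new_url):
    """Swap old_url for new_url and drop dead link tags from the refs that changed."""

    def repl(match):
        chunk = match.group(0)
        if old_url not in chunk:
            return chunk  # unrelated reference: leave its dead link tag alone
        chunk = chunk.replace(old_url, new_url)
        return DEAD_LINK.sub("", chunk)

    return REF.sub(repl, wikitext)
```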
- The technique I'm using for translating from the old URL to the new one is:
- Check for a 302 redirect sometime in 2012. The redirect will be to a URL similar to what's used now, with quite a few variations - a trailing slash may or may not be required, the order of words may have changed, parts of the path might have been moved around.
- Failing that, the Wayback Machine's copy will have an article title, which might be transformable in various ways into the corresponding URL in the new website
- All I'm doing is checking for a 200 status code to confirm a match - do you think I ought to be doing something less naive?
- I'll get onto removing deadlink tags; it might be easy, or perhaps not. Josh Parris 11:48, 14 December 2013 (UTC)
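The two-stage technique described above can be sketched as follows. The archive lookup, the title-based guesses and the status check are passed in as plain callables so the logic can be read and tested without network access; `resolve` and its parameter names are hypothetical, and the real bot's handling of word order and path variations is not reproduced here.

```python
def resolve(old_url, archived_redirect, title_candidates, get_status):
    """Return the first candidate URL that answers 200, or None.

    archived_redirect(old_url) -> target of a 302 captured by the archive, or None
    title_candidates(old_url)  -> iterable of URLs guessed from the archived title
    get_status(url)            -> HTTP status code for url
    """
    candidates = []
    target = archived_redirect(old_url)
    if target:
        # A trailing slash may or may not be required, so try both forms.
        stripped = target.rstrip("/")
        candidates += [target, stripped, stripped + "/"]
    # Fall back to URLs derived from the archived article title.
    candidates += list(title_candidates(old_url))
    for url in candidates:
        if get_status(url) == 200:
            return url
    return None
```

Trying the site's own redirect target first keeps the guesswork of title transformation as a fallback only.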
- I think that's good enough -- I don't think there would be obvious false positives, especially if you use their own 302s. — HELLKNOWZ ▎TALK 12:03, 14 December 2013 (UTC)
Approved for extended trial (10 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. — HELLKNOWZ ▎TALK 12:03, 14 December 2013 (UTC)
- Trial complete. Wow, that expanded the source code dramatically. I selected ten articles that had {{dead link}} and canadianencyclopedia urls. Performed 10 trial edits, highlights include:
- So, it seems all went well. Josh Parris 06:55, 17 December 2013 (UTC)
- Okay, these look good, but that's quite a range of functionality. — HELLKNOWZ ▎TALK 13:47, 18 December 2013 (UTC)
Approved for extended trial (20 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. A further trial because of the added code, and simply to get a larger sample. — HELLKNOWZ ▎TALK 13:47, 18 December 2013 (UTC)
- Trial complete, with results here. Points of note:
- [6] has the bot swapping out a dead url for text, which would be fine except this is a url= field. I've removed this functionality from the bot and will leave it to humans to clean up these urls. But [7] shows I removed it wrong; I should have detected those URLs and done nothing, rather than treating them as any other URL. Fixed.
- [8] has the bot making supplemental fixes but not the main fix of swapping dead urls. This was due to a logic bug in the code to detect null edits - fixed.
- I stand ready for another trial. Josh Parris 00:59, 19 December 2013 (UTC)
Approved for extended trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. This time without removing external links from |work=. — HELLKNOWZ ▎TALK 10:28, 19 December 2013 (UTC)
- Functionality altered to reflect this. Josh Parris 10:40, 19 December 2013 (UTC)
- Trial complete after 50 edits. Every edit seems fine.
- I did get a scare from [9], but looking at http://web.archive.org/web/20120315000000*/http://www.thecanadianencyclopedia.com/index.cfm?PgNm=TCE&Params=U1ARTU0002865 I'm reassured that the bot isn't at fault. Josh Parris 11:27, 19 December 2013 (UTC)
- [10] What's up with the first dead link here? Or here, although here it is dead. This happens on many pages; are you now checking unrelated links for 404s (I must have misread this from the ever-changing function details)? — HELLKNOWZ ▎TALK 17:17, 21 December 2013 (UTC)
- Sorry for the delay in responding; my Internet is temperamental right now.
- The problem with that first link is that it was to an HTML anchor, and anchors weren't being stripped (now fixed). No other links were incorrectly marked as dead.
- As an aside, do you have any insight as to why cURL and my browser agree that a 404 is returned for http://www.lethbridge.ca/NR/rdonlyres/D4CEB98B-9F18-4786-870D-84A06E1533FC/310/LethbridgeProfile2003.pdf yet Python's httplib thinks it's a 200?
- Yes, I figured checking all URLs was a value-add while swapping the Canadian URLs, given I had to check the deadurls weren't dead - you know, "free functionality". Josh Parris 04:42, 26 December 2013 (UTC)
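The anchor problem mentioned above comes from the fact that a fragment (the part after `#`) is handled by the browser and should not be sent to the server. A minimal sketch of stripping it before a status check, using the standard library of current Python 3 rather than the Python 2 modules the bot would have used in 2013; `url_to_check` is a hypothetical helper name.

```python
from urllib.parse import urldefrag


def url_to_check(url):
    """Return the URL without its #fragment, suitable for an HTTP request."""
    return urldefrag(url).url
```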
- It says that because "checking 404 is not an easy task". None of our 404-checking and archiving bots (including mine) are running, simply due to all the continuous issues that they have and all the ingenious ways web developers break them. That's why I didn't think you were also checking all dead links. That will extend this BRFA and trials a lot and I really recommend this be a separate task (or I'll go mad). Not to mention, you cannot check a link once; you need to come back in a week or so and check it again, or there will be tons of false positives on temporarily 404ed sites. In your case, there was possibly a redirect (one of several ways to do it) or a different page version served, as there might be an agent or referrer check or cookie requirement or many of the subtle HTTP protocol options. It could just be some broken headers or inconsistencies between curl and your typical browsers. Not setting up a cookie container has led to many sites failing on me. — HELLKNOWZ ▎TALK 20:02, 26 December 2013 (UTC)
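The point about rechecking can be illustrated with a small sketch that only treats a link as dead when 404s have persisted for at least a week with no success in between. It covers the recheck rule only, not the cookie, user-agent or referrer issues also raised above. The seven-day threshold stands in for "a week or so", and `is_confirmed_dead` and the observation format are assumptions for illustration.

```python
from datetime import datetime, timedelta

RECHECK_AFTER = timedelta(days=7)  # assumed reading of "a week or so"


def is_confirmed_dead(observations):
    """observations: iterable of (datetime, status_code) pairs for one URL."""
    streak_start = None  # when the current unbroken run of 404s began
    last_seen = None     # most recent 404 in that run
    for when, status in sorted(observations):
        if status == 404:
            if streak_start is None:
                streak_start = when
            last_seen = when
        else:
            # Any other response resets the run; the site may have recovered.
            streak_start = None
            last_seen = None
    return streak_start is not None and last_seen - streak_start >= RECHECK_AFTER
```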
- Gotcha. Pulled 404 checking for anything other than the target URLs. Josh Parris 03:45, 27 December 2013 (UTC)
Approved for extended trial (20 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. — HELLKNOWZ ▎TALK 10:48, 27 December 2013 (UTC)
- Sorry for the delay; flaky Internet. Trial complete. After 20 edits, every edit seems fine. Josh Parris 09:17, 28 December 2013 (UTC)
- I've got solid Internet under my feet now, so {{BAGAssistanceNeeded}} Josh Parris 20:39, 5 January 2014 (UTC)
- [11] -- [12] doesn't look like the right one
- [13] -- same [14]
- [15] -- same [16] (macleans again)
- [17] -- broken link
Note that I haven't (yet) gone through previous trials link by link. — HELLKNOWZ ▎TALK 21:59, 5 January 2014 (UTC)
- If only there was an exasperated sigh template I could invoke here.
- The Ben Johnson (and Kurdish protest) edits show the "check for a 200 status" rule isn't adequate. I'll work up something more robust in the face of this.
- The Eva Rose York edit is actually fine.
- The The Queensway – Humber Bay edit is particularly galling, as running the list generator against the page today pulls up the 404 and can't resolve it, but going to the URL in the article redirects to a valid article. The site operator has not only 404'd their old URL, they've made the older one work by redirecting it to their new one. I'm going to have to throw away my old translation list and regenerate it.
- I'll ping back once I've made the necessary code changes. Expect a two week delay. Josh Parris 21:40, 6 January 2014 (UTC)
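The discussion does not say what the more robust check ended up being. One possible approach, shown purely as an assumption, is to require that the fetched page's `<title>` contains the expected article title in addition to the 200 status, so that a generic landing page served with 200 is not accepted as a match. The names `looks_like_match`, `normalise` and `TITLE_TAG` are hypothetical.

```python
import re
from html import unescape

TITLE_TAG = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def normalise(text):
    """Lower-case the text and keep only letters and digits, separated by spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", unescape(text).lower()))


def looks_like_match(status, html, expected_title):
    """Accept a page only if it returned 200 and its <title> names the article."""
    if status != 200:
        return False
    wanted = normalise(expected_title)
    if not wanted:
        return False  # nothing to compare against, so do not claim a match
    found = TITLE_TAG.search(html)
    if not found:
        return False
    return wanted in normalise(found.group(1))
```

A substring comparison is deliberately loose and can still produce false positives for very short titles; a stricter comparison would trade those for more unresolved URLs.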
- That fix was easier than I thought.
- It seems something similar to the macleans thing happened with French articles, so I already had code to simply strip it out.
- I've coded up a fix to the 404.
- I'm going to review all the edits the bot made since the start of time and confirm they correlate to what the bot would now do, and repair anything that's wrong. Josh Parris 14:49, 8 January 2014 (UTC)
A user has requested the attention of the operator. Once the operator has seen this message and replied, please deactivate this tag. (user notified) Anything new about this task? 46.107.88.236 (talk) 16:45, 24 January 2014 (UTC)
- @Josh Parris: Any progress? (t) Josve05a (c) 13:12, 2 April 2014 (UTC)
Request Expired. --slakr\ talk / 07:03, 12 April 2014 (UTC)[reply]
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.