Wikipedia:Bots/Requests for approval/WaybackBot
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Withdrawn by operator.
Operator: Tim1357
Automatic or Manually assisted: The bot would run closely supervised until it has enough "experience" to run by itself.
Programming language(s): Python, using pywikipedia
Source code available: here is a link to the code (it updates automatically every time I change it). It still needs work; I keep getting little format errors that I need some programmers to help me with.
Function overview:
WaybackBot would (intelligently) check the Internet Archive for archives of dead pages.
Links to relevant discussions (where appropriate): There are a lot. Some are
Edit period(s):
At first, I will babysit the bot and check every edit it makes, until I feel confident enough to let it run free.
Estimated number of pages affected: An estimated 10% of all links on Wikipedia are, in some way, dead. If there are 2.5 million links on Wikipedia (there were in 2006), then roughly 250,000 are dead. That's a lot of pages.
Exclusion compliant (?): I'm not sure; is pywikipedia automatically exclusion compliant?
Already has a bot flag (Y/N):
Function details: The bot's workflow looks like this:
- Load a page (from an XML dump)
- Extract all the external links
- Check each link; treat it as dead if it returns HTTP error 404 or 401
- If a link is dead, look for its corresponding accessdate; if none exists, use WikiBlame to find one
- Create a range of acceptable dates (for right now, an acceptable archive is within 2 months of the original accessdate; I am willing to change that. Remember that a larger range means an archive is more likely to be found.)
- If the URL is referenced using {{citeweb}} and does not already have an archive, add archive-url and archive-date
- If there is no {{citeweb}}, append the reference with {{wayback}}, using parameters |date and |url
- If there is no Internet Archive copy, mark the reference with {{Dead link}}, using parameters |date and |bot
- Start over, caching links already checked as either dead or alive so I don't have to check them again. I will add a function to the script to clear the cache.
Whew, I think that's it. If you want a more nitty-gritty explanation of what the bot does, look at the source code; pretty much every line has a comment. Note that the source is hosted from my home computer, so it might not be up when the computer is off.
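To illustrate the checking and lookup steps above, here is a minimal sketch. It leans on the requests library and the Wayback Machine availability API (https://archive.org/wayback/available) purely as stand-ins; the function names, the cache layout, and the API choice are mine and not necessarily what the bot's source actually does.
<syntaxhighlight lang="python">
# Minimal sketch of the dead-link check and archive lookup described above.
# The real bot runs on pywikipedia; this just shows the shape of the logic.
import requests

link_cache = {}  # url -> True (alive) or False (dead), per the caching step

def is_dead(url):
    """Treat HTTP 404 and 401 as dead, as in the function details."""
    if url in link_cache:
        return not link_cache[url]
    try:
        status = requests.head(url, allow_redirects=True, timeout=30).status_code
    except requests.RequestException:
        status = None
    dead = status in (404, 401)
    link_cache[url] = not dead
    return dead

def find_archive(url, accessdate):
    """Ask the Wayback Machine for the snapshot closest to accessdate (YYYYMMDD).

    Returns (archive_url, timestamp), or None if nothing usable exists.
    """
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url, "timestamp": accessdate},
        timeout=30,
    )
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"], closest["timestamp"]
    return None
</syntaxhighlight>
Whatever find_archive returns would then drive the template step: fill in archive-url and archive-date inside {{citeweb}}, append {{wayback}} otherwise, or tag the reference with {{Dead link}} when no archive comes back.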
Discussion
Some Stuff You Should Know:
- See this screw-up I made (still very sorry).
- I am pretty new to Python; this was my first big project, so I need some help.
- The Internet Archive does not show archives until six months after they are grabbed (right now they are still processing archives from June), so if I request an archive for a page that was accessed today, the bot will not get any archives (see the date-check sketch below this list).
- I support a larger archive range, but I will leave it up to consensus here.
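To make the two-month window and the six-month lag concrete, one way the date test could look is sketched below; the window width in days and the date handling are my assumptions, not the bot's actual code.
<syntaxhighlight lang="python">
# Sketch of the "acceptable archive" test: a Wayback snapshot is usable only
# if it falls within roughly two months of the cited accessdate. Because the
# Internet Archive only exposes snapshots about six months after capture,
# very recent accessdates will simply produce no usable candidates.
from datetime import datetime, timedelta

RANGE_DAYS = 61  # roughly the two-month window discussed above

def snapshot_acceptable(snapshot_timestamp, accessdate):
    """snapshot_timestamp is Wayback-style YYYYMMDDhhmmss; accessdate is YYYYMMDD."""
    snap = datetime.strptime(snapshot_timestamp[:8], "%Y%m%d")
    accessed = datetime.strptime(accessdate, "%Y%m%d")
    return abs(snap - accessed) <= timedelta(days=RANGE_DAYS)
</syntaxhighlight>
Widening RANGE_DAYS is all it would take if consensus here favours a larger archive range.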
Things to do
- add logging similar to the logs of User:WebCiteBOT (still need to write the code that uploads the log)
- make bot exclusion compliant (a rough sketch is below this list)
- auto-clear cached links
- add synonyms for templates (citeweb = Citeweb = cite web), etc.
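For the exclusion-compliance and template-synonym items, something along these lines could work; the {{bots}}/{{nobots}} matching here is deliberately simplified, and the synonym set is only an example, not an exhaustive list.
<syntaxhighlight lang="python">
# Simplified sketches for two of the to-do items above.
import re

def edit_allowed(page_text, botname="WaybackBot"):
    """Very rough {{bots}}/{{nobots}} check; proper exclusion compliance
    also has to honour allow= lists and the "all"/"none" keywords."""
    if re.search(r"\{\{\s*nobots\s*\}\}", page_text, re.IGNORECASE):
        return False
    deny = re.search(r"\{\{\s*bots\s*\|[^}]*deny\s*=([^}|]*)", page_text, re.IGNORECASE)
    if deny and botname.lower() in deny.group(1).lower():
        return False
    return True

# Fold template-name variants onto one canonical spelling before parsing,
# so citeweb, Citeweb, and cite web are all treated the same way.
CITE_WEB_SYNONYMS = {"citeweb", "cite web", "cite-web"}

def is_cite_web(template_name):
    return template_name.strip().lower().replace("_", " ") in CITE_WEB_SYNONYMS
</syntaxhighlight>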
I'd like to put this on hold for a while. User:Dispenser gave me some points about the bot's concept that I hadn't thought about. I am going to tweak the code to make it more fail-safe, and so that the bot gives a dead link two tries before it looks for the archive (as some links are only dead for a bit and then are live again). Thanks Tim1357 (talk) 02:05, 10 December 2009 (UTC)
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.