Wikipedia:Bots/Requests for approval/DFRBot
- The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Withdrawn by operator.
Operator: DFRussia
Automatic or Manually Assisted:
- Manually assisted (it makes no changes; it only data-mines and gives the results back to the user)
Programming Language(s):
- Python + PyWikipedia
Function Summary:
- It checks each article title listed in one or more files to see whether it exists on Wikipedia, then tells the operator which titles exist and which do not.
Edit period(s) (e.g. Continuous, daily, one time run):
- The algorithm is run whenever I need to see whether any of a list of people are notable (during standard editing activities). As a good estimate: daily to weekly.
Edit rate requested: X edits per TIME
- This bot makes no edits
Already has a bot flag (Y/N):
Function Details:
- opens one or more files on the local computer, as specified by the operator
- checks every line to see whether it is the title of a Wikipedia article
- if the title is not a Wikipedia article, reports immediately that it is not
- if it is, adds the title to a list of articles and writes that list to a file at the end of processing
- makes no edits to Wikipedia; it only checks whether pages exist. It is the operator's responsibility to then check the links returned to him, to see whether they are what he was looking for (a rough sketch of this workflow follows below)
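A minimal sketch of the workflow described above, using pywikibot (the current successor to the pywikipedia framework named in this request). The file names, output format, and helper function are illustrative assumptions, not the operator's actual script:

```python
# Illustrative sketch only: check each title listed in a local file against
# English Wikipedia and report which exist. Uses pywikibot, the successor to
# the pywikipedia framework; names and output format are assumptions.
import pywikibot

def check_titles_in_file(path):
    site = pywikibot.Site("en", "wikipedia")
    existing = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            title = line.strip()
            if not title:
                continue
            page = pywikibot.Page(site, title)
            if page.exists():
                existing.append(title)        # keep for the end-of-run report
            else:
                print("Not found:", title)    # report missing titles immediately
    # write the titles that do exist to a results file once processing ends
    with open(path + ".existing.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(existing) + "\n")

check_titles_in_file("people_to_check.txt")   # hypothetical input file
```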
Discussion
This is the first algorithm I have for this bot. I plan to make more as time goes on. I will seek approval for each algorithm before running it. For me, this is mostly an exercise in programming, a contribution to open source, and a way to simplify editing of Wikipedia for myself and others --DFRussia 20:40, 1 November 2007 (UTC)
- Have you considered using database dumps instead? It would use a lot less bandwidth and be considerably less harsh on the servers for data mining operations. However, some applications are unfeasible for that; so, if you do not intend to use the database dumps, could you please elaborate more on why? Also, have you considered using the API if the database dumps are not feasible? --slakr\ talk / 00:10, 2 November 2007 (UTC)
- Hmm, I don't plan to use it for extensive data mining operations. Setting up a dump is a bit of a hassle for me, since I only really have one laptop and limited resources. I will look into setting up something using the API, but that will take more work than the current Python script. If you are worried about my program eating up bandwidth too quickly, I can stick a timer in there to wait some arbitrary amount of time between requests. In general, however, I am only hoping to use this for rather small lists (no more than about 100 items at any given point), since big lists would produce too many results for me to review anyway. Also, I never fetch the page content (as far as I understand pywikipedia), so it shouldn't be requesting that much over the network. If you still think the current algorithm is a problem, even for small lists, then I will see what I can do --DFRussia 00:28, 2 November 2007 (UTC)
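For the timer mentioned above, a minimal hedged sketch of spacing out the existence checks; the delay value is an arbitrary assumption (pywikibot also applies its own built-in throttling, so an explicit delay like this may be redundant):

```python
# Illustrative only: pause between successive page-existence lookups to keep
# the request rate low. The delay value is an arbitrary assumption.
import time

DELAY_SECONDS = 5  # arbitrary; tune to whatever rate is considered polite

def polite_exists_checks(pages):
    """Yield (page, exists) pairs, sleeping between requests."""
    for page in pages:
        yield page, page.exists()
        time.sleep(DELAY_SECONDS)
```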
- Based on my experience with ImageRemovalBot, switching from screen-scraping to the API is worth the effort: ImageRemovalBot is running four times faster than it used to. --Carnildo 01:41, 2 November 2007 (UTC)
- Does pywikipedia use screen scraping or API? I thought it used API --DFRussia 02:28, 2 November 2007 (UTC)
- It uses screen scraping, but I can write a script for you that uses the API. βcommand 02:31, 2 November 2007 (UTC)
Do not try to do this with screen-scraping. That would take multiple database accesses to do the task you're suggesting. But it can be done in one query via the API: look at http://en.wikipedia.org/w/api.php?action=query&prop=info&titles=Main%20Page%7CMain%20page%7CMain%20pagd&format=xml, for instance. The results show that Main pagd is a redlink, Main page is a redirect, and Main Page is a non-redirect article (the names are URL-encoded in the query). And that was accomplished in one database access; putting more titles there would work and still only do one database query. With this API function that's there to do what you're requesting, I think this task is so trivial it doesn't need bot approval (or a bot, for that matter); generally speaking bots aren't needed when there's a software feature that does the same thing. --ais523 18:58, 2 November 2007 (UTC)
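A minimal sketch of the single-request lookup described above, using the same action=query&prop=info call (here with format=json for easier parsing); the function name and User-Agent string are illustrative assumptions:

```python
# Illustrative sketch only: one API request checks many titles at once and
# reports which are missing (redlinks), which are redirects, and which exist.
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"

def check_titles(titles):
    """Map each title to 'missing', 'redirect', or 'exists' via one API call."""
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "info",
        "titles": "|".join(titles),   # many titles per request, one query
        "format": "json",
    })
    request = urllib.request.Request(
        API_URL + "?" + params,
        headers={"User-Agent": "title-existence-check sketch"},  # assumed UA
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    results = {}
    for page in data["query"]["pages"].values():
        if "missing" in page:
            results[page["title"]] = "missing"    # redlink
        elif "redirect" in page:
            results[page["title"]] = "redirect"
        else:
            results[page["title"]] = "exists"
    return results

# The example titles from the query linked above:
print(check_titles(["Main Page", "Main page", "Main pagd"]))
```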
- I see; thank you for the offer, Betacommand, and thanks for the info, ais523. I will rewrite this extension and then ask for approval again. --DFRussia 02:23, 3 November 2007 (UTC)
- A member of the BAG (including myself) can archive this request until you are ready to proceed if required. Please let us know what you want. — E talkBAG 06:24, 4 November 2007 (UTC)
Withdrawn by operator. – until the script uses the MediaWiki API. It can be reactivated at that time. — madman bum and angel 19:25, 6 November 2007 (UTC)[reply]
- The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.