Wikipedia:Bots/Requests for approval/HairBot 2
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Withdrawn by operator.
Operator: Hair (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 18:10, Sunday November 18, 2012 (UTC)
Automatic, Supervised, or Manual: Automatic
Programming language(s): Python
Source code available: Yes
Function overview: Removes invalid section links on pages ([[Foo#Bar]], where Bar is a nonexistent header/id).
Links to relevant discussions (where appropriate): None.
Edit period(s): Every 2 weeks
Estimated number of pages affected: 500 to ∞; as pages change (formatting/maintenance), their headers/ids can change too
Exclusion compliant (Yes/No): Yes
Already has a bot flag (Yes/No): No
Function details: Generates a list of every main-namespace article to process. For each page, it extracts the links; for each link, it checks whether it points to a section (#). If so, it goes to the linked page and obtains all of its headers to match the section against. If there is no match, it removes the section fragment from the link on the original page. If there is a match but it is capitalized incorrectly, the capitalization is fixed. If it is an exact match, nothing is done to the link. Headers and ids are ignored if they are inside nowiki, pre or source tags.
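Below is a minimal sketch of those matching rules, assuming a simplified link pattern and a hypothetical anchors_of helper that supplies the headers/ids of the target page; it is only an illustration, not HairBot's actual source, and it omits the nowiki/pre/source handling mentioned above.
<syntaxhighlight lang="python">
import re

# Simplified pattern for [[Target#Section]] or [[Target#Section|label]] links;
# an assumption for illustration, not the exact pattern HairBot uses.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)#([^\[\]|]+)(\|[^\[\]]*)?\]\]")

def fix_section_links(wikitext, anchors_of):
    """Apply the three rules described above: exact match kept, case-only
    mismatch fixed, missing header stripped. Text inside nowiki/pre/source
    tags would need to be excluded first; that step is omitted here."""
    def repair(match):
        target, section, label = match.group(1), match.group(2), match.group(3) or ""
        anchors = anchors_of(target)
        if section in anchors:                       # exact match: leave the link alone
            return match.group(0)
        for anchor in anchors:                       # matches apart from capitalization: fix it
            if anchor.lower() == section.lower():
                return "[[%s#%s%s]]" % (target, anchor, label)
        return "[[%s%s]]" % (target, label)          # no such header: drop the fragment
    return LINK_RE.sub(repair, wikitext)

# The Foo/Bar example from the test-wiki diff below (header names as given there):
anchors = {"Foo": {"bar", "Foobar"}}
text = "See [[Foo#Bar]] and [[Foo#Baz|details]]."
print(fix_section_links(text, lambda t: anchors.get(t, set())))
# -> See [[Foo#bar]] and [[Foo|details]].
</syntaxhighlight>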
Example (from my test wiki on Wikia): This diff is on Bar, with two links to Foo. The Foo page contains two headers: bar and Foobar. In the diff, HairBot fixed the section link to bar and removed the section from the other link, whose header doesn't exist. HairTalk 18:10, 18 November 2012 (UTC)
Discussion
Can the bot modify the section title if it gets renamed in the history rather than simply removing it? Legoktm (talk) 18:20, 18 November 2012 (UTC)
- Well, the problem would be: what if the content is different, resulting in the header being changed? As some pages undergo a complete rewrite, it would be impossible to know whether or not the section would still apply. It would take more than a machine's ability to associate words to know if the section still applies to the page/link. HairTalk 18:34, 18 November 2012 (UTC)
- True, but the situation happens the same way in reverse. If "Section Name" --> "Section name" happens, why should a bot strip away the anchor when it would be trivial for a human to notice and fix? Legoktm (talk) 22:07, 18 November 2012 (UTC)
- Sorry, but I guess I didn't explain well! In the function details, I included that "if there is a match, but it is capitalized incorrectly, it'll be fixed", meaning Section Name will be changed to Section name accordingly. Again, sorry for not explaining well enough. HairTalk 04:47, 19 November 2012 (UTC)
This does not seem to be well thought through.
- Is there any consensus at all that these links should be removed by bot?
- Why does your code null-edit every page that doesn't have any broken section links?
- Did you consider any design that does not involve loading the content via API for every mainspace page on Wikipedia?
- How do you intend to run this every two weeks when, as written, the initial run will take over 142 days to complete? (And that's a best-case estimate; in practice it will probably take far longer!)
Anomie⚔ 18:49, 18 November 2012 (UTC)
- No. 3 especially. I assumed this was done via a database dump (which would probably be the best way to go about it). Legoktm (talk) 19:36, 18 November 2012 (UTC)
- IMO, #1 is actually the most serious problem here. Anomie⚔ 03:16, 19 November 2012 (UTC)
- #2: It wouldn't; if the text is changed, a save happens, otherwise the page is skipped over.
- #3: Perhaps via index.php, but I don't go through the API for every page. The contents would also be stored in a database, associating each revision id with its contents; this does mean the contents initially have to be obtained from the API, but they can then be read from the database.
- #4: 142 days... oh my, I guess I didn't think this through. Perhaps the code should (and will) be changed so it works in queries of 5000: obtain 5000 pages, process them, then fetch another 5000.
- Regarding #1 (which is probably a biggie), I must have misunderstood the instructions at the top of the page. It says to gain consensus if the task is controversial, and comparing this task to the examples, it didn't occur to me that this is something that needs consensus too. I'm greatly sorry about this; perhaps this can be put on hold until consensus can be reached at the Village Pump? HairTalk 04:47, 19 November 2012 (UTC)
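As an illustration of the caching idea in #2 and #3 above (not HairBot's actual code), a local SQLite table keyed by title could hold each page's last-seen revision id and text, so that unchanged pages are skipped and never re-fetched or null-edited; the fetch_content callback here is only a placeholder for whatever API call supplies the live text.
<syntaxhighlight lang="python">
# Sketch of a local revision cache (an assumption, not taken from HairBot's
# source). Pages whose revision id hasn't changed since the last run are
# skipped entirely, so no re-fetch and no save happens for them.
import sqlite3

def open_cache(path="hairbot_cache.db"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS pages "
               "(title TEXT PRIMARY KEY, revid INTEGER, content TEXT)")
    return db

def content_for(db, title, live_revid, fetch_content):
    """Return the page text, re-fetching only when the revision id changed."""
    row = db.execute("SELECT revid, content FROM pages WHERE title = ?",
                     (title,)).fetchone()
    if row and row[0] == live_revid:
        return row[1]                    # unchanged since last run: reuse cached text
    text = fetch_content(title)          # placeholder for the real API fetch
    db.execute("REPLACE INTO pages (title, revid, content) VALUES (?, ?, ?)",
               (title, live_revid, text))
    db.commit()
    return text
</syntaxhighlight>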
There are two separate issues here, technical and consensus; I'll start with the technical:
Put bluntly, the current proposal is technically impractical and a really bad idea. There are currently 4 million articles. Your suggestion of requesting articles in queries of 5000 is not possible: the API has a limit on the maximum number of titles you can request per query (see this query to see what I'm talking about); the limit is 50 for normal users and 500 for bots. At 4 million articles, that's 8000 requests. Per Tim, you cannot make API requests concurrently. Each query takes between 2 and 6 minutes; I didn't have a huge sample size, so we'll take a conservative estimate of 3 minutes a query. 8000 × 3 = 24,000 minutes = 400 hours, or over 16 days. And that is the best case; worst case (6 minutes) you're looking at about 33 days. On top of all this is the unnecessary server load all these queries would create.
The only feasible way I see of doing this is with a database dump. If the proposal isn't changed to fix the above, I will seriously consider denying it.
Secondly, there isn't an established consensus for the task. My guess is this task will prove to be controversial, so yes, starting a thread on the Village Pump is a good idea. At the moment there is not enough support to warrant a trial.
--Chris 08:59, 28 November 2012 (UTC)
- The technical solution must be based around a dump and some offline assessment. Also, if a section doesn't exist, I don't think the bot should simply remove the section link. It should either perform more sophisticated checking to find the new section (beyond a case check), which would be a valuable activity, or produce reports of broken section links; wikiproject editors might be interested in fixing the links to/from their articles. Some analysis would be useful: how many section links are currently broken, how many can be fixed by capitalisation, and how many could be fixed by revision-history analysis (perhaps do stats on a random sample of 100 for the last one)? Rjwilmsi 19:13, 10 December 2012 (UTC)
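A rough sketch of how such an offline assessment could work against a pages-articles dump follows; the file name is an assumption, and it uses the simplification that a section "exists" when a heading with that exact text appears in the target page's wikitext (real anchors also involve underscore and encoding rules), so the counts would only be indicative.
<syntaxhighlight lang="python">
# Sketch of a dump-based report (an illustration of the suggestion above,
# not an agreed design). It streams pages-articles.xml, records the headings
# of every page, then classifies section links as OK, fixable by
# capitalisation, or broken.
import re
import xml.etree.ElementTree as ET
from collections import Counter

HEADING_RE = re.compile(r"^=+\s*(.*?)\s*=+\s*$", re.M)
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)#([^\[\]|]+)")

def pages(dump_path):
    """Yield (title, wikitext) pairs from a pages-articles dump."""
    title, text = None, ""
    for event, elem in ET.iterparse(dump_path):
        tag = elem.tag.rsplit("}", 1)[-1]       # drop the XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            elem.clear()                        # keep memory bounded

def report(dump_path):
    # First pass: collect every page's headings (memory-heavy; a sketch only).
    headings = {t: set(HEADING_RE.findall(w)) for t, w in pages(dump_path)}
    counts = Counter()
    # Second pass: classify every section link found in the dump.
    for _, wikitext in pages(dump_path):
        for target, section in LINK_RE.findall(wikitext):
            have = headings.get(target, set())
            if section in have:
                counts["ok"] += 1
            elif section.lower() in {h.lower() for h in have}:
                counts["fixable by capitalisation"] += 1
            else:
                counts["broken"] += 1
    return counts

# e.g. print(report("enwiki-pages-articles.xml"))   # file name is an assumption
</syntaxhighlight>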
- Will those links be changed if a topic was archived?--Justincheng12345 (talk) (urgent news here) 10:58, 11 December 2012 (UTC)
- If something happens between dump time and real time, the links can get broken. The dump would really be used to find potential pages to fix, then get the live versions and work on those if needed. — HELLKNOWZ ▎TALK 11:00, 11 December 2012 (UTC)
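A sketch of that dump-then-verify step: candidates come from the (possibly stale) dump, but the section list is re-checked against the live page, via the standard action=parse with prop=sections API, before anything would be changed; the helper names themselves are illustrative assumptions.
<syntaxhighlight lang="python">
# Sketch of re-verifying a dump-derived candidate against the live page
# right before an edit would be made (an illustration, not a bot design).
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def live_anchors(title):
    """Fetch the current section anchors of a page via action=parse."""
    query = urllib.parse.urlencode({
        "action": "parse", "page": title,
        "prop": "sections", "format": "json",
    })
    with urllib.request.urlopen(API + "?" + query) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    return {s["anchor"] for s in data.get("parse", {}).get("sections", [])}

def still_broken(candidate_title, section):
    """True only if the section link is still broken on the live page."""
    return section not in live_anchors(candidate_title)
</syntaxhighlight>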
Withdrawn by operator. - I would like to say, I know I haven't replied here in a while, as my life involves depression, a busy lifestyle and other commitments. Knowing this, I'd rather not continue with this bot, because if an issue arises I wouldn't be able to get to it in a timely fashion. Whether or not this will change in the future, I hope it will, but the odds aren't in my favor. Sorry for wasting everyone's time, but I'd like to withdraw this bot request. HairTalk 23:41, 23 December 2012 (UTC)
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.