User talk:Rich Smith/Archive89
This is an archive of past discussions about User:Rich Smith. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
The Signpost: 31 October 2021
- From the editor: Different stories, same place
- News and notes: The sockpuppet who ran for adminship and almost succeeded
- Discussion report: Editors brainstorm and propose changes to the Requests for adminship process
- Recent research: Welcome messages fail to improve newbie retention
- Community view: Reflections on the Chinese Wikipedia
- Traffic report: James Bond and the Giant Squid Game
- Technology report: Wikimedia Toolhub, winners of the Coolest Tool Award, and more
- Serendipity: How Wikipedia helped create a Serbian stamp
- Book review: Wikipedia and the Representation of Reality
- WikiProject report: Redirection
- Humour: A very Wiki crossword
Latest tech news from the Wikimedia technical community. Please tell other users about these changes. Not all changes will affect you. Translations are available.
Recent changes
- There is a limit on the amount of emails a user can send each day. This limit is now global instead of per-wiki. This change is to prevent abuse. [1]
Changes later this week
- The new version of MediaWiki will be on test wikis and MediaWiki.org from 2 November. It will be on non-Wikipedia wikis and some Wikipedias from 3 November. It will be on all wikis from 4 November (calendar).
Tech news prepared by Tech News writers and posted by bot • Contribute • Translate • Get help • Give feedback • Subscribe or unsubscribe.
20:27, 1 November 2021 (UTC)
Recent changes
- Mobile IP editors are now able to receive warning notices indicating they have a talk page message on the mobile website (similar to the orange banners available on desktop). These notices will be displayed on every page outside of the main namespace and every time the user attempts to edit. The notice on desktop now has a slightly different colour. [2][3]
Changes later this week
- Wikidata will be read-only for a few minutes on 11 November. This will happen around 06:00 UTC. This is for database maintenance. [4]
- There is no new MediaWiki version this week.
Future changes
- In the future, unregistered editors will be given an identity that is not their IP address. This is for legal reasons. A new user right will let editors who need to know the IPs of unregistered accounts to fight vandalism, spam, and harassment, see the IP. You can read the suggestions for how that identity could work and discuss on the talk page.
20:35, 8 November 2021 (UTC)
ClueBot NG on SqWiki
Hey Rich!
I'm a crat from SqWiki. These days I was shown ClueBot NG from a user when I asked him advice in fighting vandalism. Would it be possible to make ClueBot NG work in wikis other than EnWiki? We (and I believe a lot of other wikis as well) would be really grateful to benefit from it if it was possible. - Klein Muçi (talk) 00:49, 1 November 2021 (UTC)
- @Klein Muçi: it can, however it needs a lot of training data. Pinging @DamianZaremba: to see if he can provide more input on what is required - RichT|C|E-Mail 07:06, 1 November 2021 (UTC)
- Yeah, I understand that because I read how it worked. I was thinking to maybe keep it on a kind of a "simulation" mode while it learned (maybe just don't give it the bot flag yet?) and later unleash it in full power. - Klein Muçi (talk) 11:32, 1 November 2021 (UTC)
- I don't think it quite works like that, the bot flag is irrelevant. @Cobi: could maybe assist as well? - RichT|C|E-Mail 11:33, 1 November 2021 (UTC)
- At the very least, the bot needs several tens of thousands of randomly sampled main-space edits categorized as good or bad to even have a chance of being reasonably accurate, but ideally more. I also do not speak Albanian, so I couldn't reasonably offer support for false positives or anything like that. The bot itself is open source, and most of the tooling should be in the repo.
- It seems that DamianZaremba's been reworking some of the training tooling, but the original training tooling is mostly here. It's a bit of a mess since it is mostly a snapshot of some of our working directories back when we were originally training the bot. The basic idea was there was a MySQL database called EditDB, and it had a table called editset.
- Tools like editClassificationToEditDB.php took data in on stdin in the form of "123456 V" or "234567 C" to mark revid 123456 as vandalism and revid 234567 as constructive. Tools like generateXML.php would then emit XML suitable for training the bot's core from the edits in the EditDB. Tools like autodatasetgen.go were built to find other ways of generating classifications like by checking if someone reverted real-world edits. This was not as effective as the smaller (but still large) hand-curated datasets.
- Finally, after using generateXML.php to generate train.xml, trial.xml, and bayestrain.xml in the editsets directory (we used limit clauses to split the files, with 0-16000 in bayestrain.xml, 16000-60000 in train.xml, and the rest in trial.xml), we then ran trainandtrial.sh to train the bot and then get metrics on the efficacy of the bot. There are also tools like autotraintrial.php, which attempts to explore reasonable ANN parameters. These are stored in localtoolconfig, along with what we believe to be reasonable values for training datasets between 50,000 and 100,000 edits.
- If any of that made some sort of sense, you may wish to give it a go. If not, maybe find a bot dev on SqWiki that has time and desire to curate and run a SqWiki version? -- Cobi(t|c|b) 03:54, 5 November 2021 (UTC)
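- As an illustration only, the classification input format and the three-way split described above can be sketched in Python. This is not the original PHP tooling: the function names parse_classifications and split_editset are hypothetical, and only the "revid V" / "revid C" line format, the three file names, and the 16000 / 60000 boundaries come from the comment above.

```python
from typing import Dict, Iterable, List, Tuple

# Single-letter codes taken from the comment above: V = vandalism, C = constructive.
LABELS = {"V": "vandalism", "C": "constructive"}


def parse_classifications(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Parse lines such as '123456 V' into (revid, label) pairs (hypothetical helper)."""
    edits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        revid_text, code = line.split()
        if code not in LABELS:
            raise ValueError(f"unknown classification {code!r}")
        edits.append((int(revid_text), LABELS[code]))
    return edits


def split_editset(edits: list, bayes_end: int = 16000, train_end: int = 60000) -> Dict[str, list]:
    """Split an ordered edit list the way the limit clauses are described above."""
    return {
        "bayestrain.xml": edits[:bayes_end],
        "train.xml": edits[bayes_end:train_end],
        "trial.xml": edits[train_end:],
    }


if __name__ == "__main__":
    parsed = parse_classifications(["123456 V", "234567 C"])
    print(parsed)
    print({name: len(part) for name, part in split_editset(parsed).items()})
```

- With fewer than 16,000 classified edits everything lands in bayestrain.xml and the other two sets stay empty, which is one way to see why a few thousand hand-reviewed edits would not be enough to train and then evaluate the bot.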
- @Cobi, thanks a lot for taking the time to explain the details! I followed every provided link along with your explanations. I saw that there hadn't been any changes for the last decade almost so I do understand that it may appear as an "old project" for you. I have a naïve question I couldn't understand from your explanation though: You say that the bot should use around 50k results (just an example) divided into C and V type to start its training which then gets information added by reverts and more. Then you also mention "hand-curated datasets". Should I understand that those initial 50k results (again, just an example) were divided into C and V type manually? If I'm misunderstanding that, how was that initial division made?
- The reason I ask is that if there's one thing we (and all the small wikis) lack, it is a large active userbase. We struggle so much with keeping an active workforce that this was actually what brought me here. Even after setting up strict edit filters and trying to block vandals fast, the number of pages and changes pending review is still so large that it's unmanageable for us. (We lowered it to 0 some time ago but still...) Therefore it's unfortunately very common for changes to await review for months, if not years, before someone actually comes to do that. Lately we have been attacked by some IP vandals who come and change just small, trivial information in articles, for example the name of the city where someone was born, the date when someone died, or the number of works published by someone. These edits are undetectable by the filters, and the vandals can't be blocked for very long periods of time because they're IPs (and more than one) and they're not on the same IP range. This not only lowers the project's overall integrity but also increases the workload for the already non-existent patrollers, which starts a vicious cycle: new patrollers/reviewers may become interested in helping, see the extremely large number of pending changes, feel like their work won't matter, and leave, which only makes the number grow more. When I asked for help here in dealing with this situation, Xaosflux showed me your bot. It is crucial for us in automating vandalism fighting so we can have a chance of reviewing the remaining constructive edits, which may or may not be acceptable by SqWiki standards.
- Currently I'm the only active one dealing with bot developing in SqWiki. I run a bot myself which operates in SqWiki, SqQuote and LaWiki but it's a rather simple one working on the Pywikibot framework and the occasional AWB changes. I haven't had a chance to work on GitHub yet even though I have an account there, if I'm not wrong. I can try starting that journey (even though I'm an autodidactic coder) but I'd need a lot of guidance along the way. To be honest what I was expecting was to work towards some localization "tables", like I've done with the other imported bots in the past (maybe, most notably, IABot), not duplicate the code. I highly expected Cluebot's functionality to have been requested by many Wikis during its existence and i18n infrastructure to be already implemented in it. I was surprised to understand that I may be one of the few (if my understanding is correct) users who's going on with a request like this. - Klein Muçi (talk) 10:52, 5 November 2021 (UTC)
- @Klein Muçi The links I posted are to the original versions of the files since the original training hasn't changed in the decade or so. The bot itself has been updated more regularly in the bot repo and the core repo. But, yeah, we collected and categorized some of the edits ourselves, and some had been collected by open research projects that have analyzed vandalism on enwiki, and some were crowd-sourced by using a web-interface that let others we trusted categorize edits.
- Essentially at a high level, the bot takes the edits and generates hundreds of statistics about each edit and then compares them against the known good and known bad edits' statistics using an Artificial Neural Network. If it looks like good edits more than bad edits, it leaves it alone, otherwise it reverts it. This is essentially what machine learning is.
- This does, of course, lead to why the bot hasn't been localized yet. It needs a completely new data-set for each new wiki it operates on, and no one has taken on that challenge yet. It's also why ClueBot NG does not operate on English wikis other than en.wikipedia, because the data-set actually needs to be made for the wiki in question, not just the language. For example, an article on the English Wikipedia would look totally different than one on the English Wikinews or Wiktionary, and because the bot works by looking at an edit and trying to determine whether or not it belongs based on its data-set, it would notice the differences. The actual strings tables used for messages themselves are trivial to update for localization in comparison with the data-set.
- Other projects have asked for ClueBot NG before, but not that often. I've told them essentially what I've told you: The bot is open source, but you have to collect a data-set for it to work. There is also the old version of ClueBot that could potentially be used and updated, but its functionality was limited and largely eclipsed by the Edit Filter, and much less effective than the machine learning approach that ClueBot NG uses. -- Cobi(t|c|b) 13:53, 5 November 2021 (UTC)
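- To make the high-level description above a little more concrete, here is a toy Python sketch of the "compute statistics, score, then revert or leave alone" flow. Everything in it is an assumption made for illustration: the three statistics, the weights, the bias, and the 0.5 threshold are hand-picked placeholders, and a single logistic unit stands in for the trained Artificial Neural Network. The real bot computes hundreds of statistics and learns its parameters from a classified data-set.

```python
import math
from typing import Dict


def edit_statistics(old_text: str, new_text: str) -> Dict[str, float]:
    """Compute a few illustrative statistics about an edit (not the bot's real feature set)."""
    letters = [ch for ch in new_text if ch.isalpha()]
    upper_ratio = sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
    # Length of the longest run of one repeated character, e.g. "aaaa" -> 4.
    longest_run, run, prev = 0, 0, None
    for ch in new_text:
        run = run + 1 if ch == prev else 1
        longest_run = max(longest_run, run)
        prev = ch
    return {
        "size_change": float(len(new_text) - len(old_text)),
        "upper_ratio": upper_ratio,
        "longest_char_run": float(longest_run),
    }


# Placeholder parameters chosen by hand for this sketch; a trained model would learn these.
PLACEHOLDER_WEIGHTS = {"size_change": -0.001, "upper_ratio": 4.0, "longest_char_run": 0.5}
PLACEHOLDER_BIAS = -3.0


def vandalism_score(stats: Dict[str, float]) -> float:
    """Single logistic unit: weighted sum of the statistics squashed into the range 0..1."""
    z = PLACEHOLDER_BIAS + sum(PLACEHOLDER_WEIGHTS[name] * value for name, value in stats.items())
    return 1.0 / (1.0 + math.exp(-z))


def decide(stats: Dict[str, float], threshold: float = 0.5) -> str:
    """Revert when the edit looks more like known-bad than known-good edits."""
    return "revert" if vandalism_score(stats) >= threshold else "leave alone"


if __name__ == "__main__":
    old = "Tirana is the capital of Albania."
    print(decide(edit_statistics(old, old + " It lies in the centre of the country.")))
    print(decide(edit_statistics(old, "AAAAAAAAAAAA LOL")))
```

- The sketch also hints at why the data-set is wiki-specific: what counts as a "normal" value for each statistic depends on what good edits look like on that particular wiki, so the learned parameters do not carry over.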
- @Cobi, I see now. My initial expectation was that you could "load it on background", like a third party app, and it would collect information in regard to our community's reviews (what we accepted and reverted), eventually constructing the needed dataset and when it was sufficiently trained, we could release it into the wild. You say that that thing does happen but first you need to feed it a lot of premade datasets before you can come at that phase. How wrong am I? - Klein Muçi (talk) 01:15, 6 November 2021 (UTC)
- For reference, the work I did basically consumes the reviewed edit set (which right now only includes historical entries), effectively meaning there is not nearly enough data to actually re-train the bot, let alone verify it's within tolerance. As far as I know we do not have the original training set used for the production datasets, and the review interface effectively died, so a rewrite was started in a form that could work on Toolforge. You can see the current training logic on GitHub, the output of which is calculated each day under trained-datasets. Given the current (historical) community interest in reviewing reported edits, I don't foresee being in a position to re-train en.wiki without substantial work, let alone support another wiki. - Damian Zaremba (talk • contribs) 17:02, 15 November 2021 (UTC)
- @DamianZaremba, I see... Well, if nothing else, this has been informative. Even though I wasn't able to get the hoped-for results, thanks for taking the time to reply to my questions. :) - Klein Muçi (talk) 18:22, 15 November 2021 (UTC)
Recent changes
- Most large file upload errors that had messages like "stashfailed" or "DBQueryError" have now been fixed. An incident report is available.
Problems
- Sometimes, edits made on iOS using the visual editor save groups of numbers as telephone number links, because of a feature in the operating system. This problem is under investigation. [5]
- There was a problem with search last week. Many search requests did not work for 2 hours because of a configuration error. [6]
Changes later this week
- The new version of MediaWiki will be on test wikis and MediaWiki.org from 16 November. It will be on non-Wikipedia wikis and some Wikipedias from 17 November. It will be on all wikis from 18 November (calendar).
22:05, 15 November 2021 (UTC)
Thanks for removing the template. I usually do that, but this one escaped my attention. I appreciate your help. Eddie Blick (talk) 03:23, 21 November 2021 (UTC)
Changes later this week
- There is no new MediaWiki version this week.
- The template dialog in VisualEditor and in the new wikitext mode Beta feature will be heavily improved on a few wikis. Your feedback is welcome.
20:01, 22 November 2021 (UTC)
ArbCom 2021 Elections voter message
The Signpost: 29 November 2021
- In the media: Denial: climate change, mass killings and pornography
- WikiCup report: The WikiCup 2021
- Deletion report: What we lost, what we gained
- From a Wikipedia reader: What's Matt Amodio?
- Arbitration report: ArbCom in 2021
- Discussion report: On the brink of change – RFA reforms appear imminent
- Technology report: What does it take to upload a file?
- WikiProject report: Interview with contributors to WikiProject Actors and Filmmakers
- Recent research: Vandalizing Wikipedia as rational behavior
- Humour: A very new very Wiki crossword
Articles you might like to edit, from SuggestBot
Note: All columns in this table are sortable, allowing you to rearrange the table so the articles most interesting to you are shown at the top. All images have mouse-over popups with more information. For more information about the columns and categories, please consult the documentation and please get in touch on SuggestBot's talk page with any questions you might have.
SuggestBot picks articles in a number of ways based on other articles you've edited, including straight text similarity, following wikilinks, and matching your editing patterns against those of other Wikipedians. It tries to recommend only articles that other Wikipedians have marked as needing work. We appreciate that you have signed up to receive suggestions regularly; your contributions make Wikipedia better — thanks for helping!
If you have feedback on how to make SuggestBot better, please let us know on SuggestBot's talk page. -- SuggestBot (talk) 11:24, 29 November 2021 (UTC)