Wikipedia talk:WikiProject Vandalism studies/Study2/Archive2
This is an archive of past discussions on Wikipedia:WikiProject Vandalism studies. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
Proposed table
To get things rolling I have made the following table to gather data. I know it is cumbersome, but I figured others can propose revisions. I made the table sortable, so that should be interesting for analysis purposes. Remember 03:05, 13 April 2007 (UTC)
Table
ATOE means "at time of edit"
Data # | Edit #, Link and article name | Date of Edit | Edit type | Notes on edit | Change in article | Time of ten edits before | Time of ten edits later | Size (ATOE) | Number of cats (ATOE) | Links to article | Create date | Editor Name | # of edits (ATOE) | First edit time | Revert date | Reverter name | Date of collection |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 87310774, [1], Rockford, Illinois | 09:43, 12 November 2006 | Legit edit | None | ??? | 20:07, 29 Oct 2006 | 18:43, 17 Nov 2006 | ?? | 3 | ?? | 10:08, 3 Jul 2005 | User: Jjb | ?? | 09:18, 1 Feb 2007 | NA | NA | 07:29, 12 Apr 2007 |
2 | 87310766, [2], Ateneo de Manila University | 09:43, 12 Nov 2006 | Legit edit | None | ??? | 15:31, 9 Nov 2006 | 13:14, 22 Nov 2006 | ?? | 5 | ?? | 11:03, 28 Sep 2004 | User: Vdejesus | ?? | 14:03, 20 Oct 2006 | NA | NA | 07:43, 12 Apr 2007 |
3 | 87310050, [3], Aerodynamics | 09:34, 12 Nov 2006 | Vandal | added gibberish "amir kuamr shrestha" | ?? | 18:39, 7 Nov 2006 | 16:41, 23 Nov 2006 | ?? | 3 | ?? | 15:43, 25 Feb 2002 | User: 202.71.180.82 | 2 | 09:33, 12 Nov 2006 | 09:35, 12 Nov 2006 | User: Thatperson | 08:60, 12 Apr 2007 |
- Table looks great! I think we should probably remove the columns for "change in time" and other derivative stats. Those will be useful for crunching data, but because they can be gleaned from data in the table, I'm not sure they need to be included in the data-gathering table. I imagine all serious crunching will actually be done outside of Wikipedia in another spreadsheet program. But great job with the table. Now we just need to solve how to get those tricky data points. Martschink 13:48, 14 April 2007 (UTC)
- And how did you get the number of categories at the time of the edit? Did you use that tool at the bottom of this page? Martschink 14:36, 14 April 2007 (UTC)
- No I just counted them up. Remember 15:34, 14 April 2007 (UTC)
- I've reduced the width of the table, as well as the number of attributes per edit to be gathered, by doing the following:
- Tweaking the column titles
- Removing "Editor type" (both for editor and reverter); this is redundant because the user name is also listed; the editor type can be derived (on a spreadhseet, outside of the sampling)
- Removing "change in time"; as noted by someone else, this can be calculated
- Removing the edit number of the first edit of the editor; seems not that useful (just there to "prove" something?)
- Removing low-value information about the reverter: # of edits at time of edit, date of first edit, edit number. The critical issues are whether the revert is by an IP address or not, and how long it was before the revert occurred. If more information about reverters is needed, a relatively small sample should suffice.
- Changing months from full spelling to three-letter abbreviation
- Changing edit type to shorter descriptions
- Before putting back any columns, I strongly urge a review of (a) the cost of gathering the additional data and (b) the value of the data. (For example, asking those working on the project to calculate elapsed times between edits obviously will take time, but doing this in an Excel spreadsheet does the same thing, so the value of human beings doing the calculations is nil, or negative, given the likelihood of human mistakes while doing arithmetic.) -- John Broughton (♫♫) 21:56, 14 April 2007 (UTC)
- I like the revisions, John. At least one thing comes to mind. We need the column for edit number because we have to know what edit the data relates to. As we've viewed this so far, once we start collecting data, the only full column from the start will be the edit number column. That lets volunteers know for which edits data needs gathering. I'm not sure I agree about the low value of the information on the reverter, but I'm not sure I disagree either. I suppose that as long as we have the reverter's ID and time of reversion, a subsequent extension of this study could go and pick that data up. I'll ponder that one. And I agree we need to shrink the form of the data. In my list (way) above, I put a list of several different types of edit. I think we can get those down to two-letter entries (e.g., NV = Not vandalism, LS = Link Spam). Martschink 23:28, 14 April 2007 (UTC)
- And I think we can 86 the first column, data number. Given that this is all going to be sortable, I'm not sure we need to keep track of that (especially since we have a column for the date the data was collected). We still have the open problem of how to get some of the data ATOE, such as links to article. One last thought: edit number and article name need to be separate b/c we may end up with more than one edit examined for one article. Martschink 23:34, 14 April 2007 (UTC)
Check out Wikipedia:Village pump (technical)#Questions for Vandalism study, it contains some answers to our questions, but it is pretty technical. Remember 13:36, 15 April 2007 (UTC)
Random Edits study, formulation and structure
Alright, as it looks like the Random Edits study won the vote above by 2:1, let's get the blueprints laid out. I was thinking this time we should follow a breakdown using the scientific method. Let's start filling in areas we can and set things up to start work. (omitted until work is done: Abstract, Data, Results and Discussion, Conclusion, References). JoeSmack Talk 20:06, 6 April 2007 (UTC)
- I went ahead and added this to the main page. I think we can start working on stuff there now unless anyone objects. Remember 16:45, 7 April 2007 (UTC)
- I was WP:BOLD myself and proposed an outline procedure. Make any changes and comments as needed, including scrapping the whole thing. This is just one idea. --Jayron32|talk|contribs 17:51, 7 April 2007 (UTC)
- Some have talked about a recent changes study, but I think the consensus was that to get a truly random sampling we need to look at randomly selected edits (which would be selected by a number generator and then going straight to that edit). So I have moved your suggestions to below for further comments. Any other feelings? Remember 18:19, 7 April 2007 (UTC)
- My understanding is that we're going to go forward with the random edit study, not the recent changes version. That said, I'd like the scope of this study to be massive. We need to look at far more edits than we will be able to look at ourselves. I think we should proceed by (1) figuring out what data we want to look at for each edit (see the working list above); (2) spelling out a step-by-step procedure for examining an edit, so that we can easily harness the work of new volunteers; (3) generating the list of random edit numbers that we need data for; and (4) filling in the data. I've talked this over with another contributor, so I'm going to make two further specific proposals. First, that we examine 5000 edits. That sounds like a huge number, but it will give the study credibility and attention. That, coupled with a significant amount of information about each edit, will also make this study useful. Economists dig massive amounts of information, and our study can help provide that (along with whatever our initial conclusions are). Which leads me to my second proposal, that we collect the data into a comma-separated value (CSV) chart. CSV is a common denominator for spreadsheets and databases and will make it easy to load for sophisticated analysis. I believe that is the way to add additional value to the study. Martschink 00:29, 8 April 2007 (UTC)
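To make the CSV idea concrete, here is a minimal sketch (in JavaScript, the language already used for tooling on this page) of building one CSV row per edit; the field list and helper names are illustrative, not part of any existing tool:

// Hypothetical sketch: turn one edit's data points into a CSV row.
// Fields containing commas, quotes or newlines must be quoted, with
// embedded quotes doubled, so spreadsheets parse them correctly.
function escapeCsv(field) {
  var s = String(field);
  if (/[",\n]/.test(s)) {
    s = '"' + s.replace(/"/g, '""') + '"';
  }
  return s;
}

function toCsvRow(fields) {
  var out = [];
  for (var i = 0; i < fields.length; i++) {
    out.push(escapeCsv(fields[i]));
  }
  return out.join(',');
}

// toCsvRow([87310774, 'Rockford, Illinois', '09:43, 12 November 2006', 'Legit edit'])
// -> 87310774,"Rockford, Illinois","09:43, 12 November 2006",Legit edit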
- I concur. Pick a date range, generate a large set of random edit numbers within that range (after determining what edit #s correspond to that date range). The set should include more than the x edits we want to study, because not all edits will be in article space. Once the measurements are decided on, then decide on a format for recording the results, ensure all contributors understand what their task is (do a test round perhaps), divide the set of random edits among those involved, and off we go. (Still not sure how much I will participate, but thought I'd contribute here.) –Outriggr § 23:39, 20 April 2007 (UTC)
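A sketch of the random draw described above, again in JavaScript; the oldid range endpoints below are placeholders for whatever the chosen date range turns out to correspond to:

// Hypothetical sketch: draw 'count' distinct random oldids from [lo, hi].
// The set is oversampled so that hits outside article space can be
// discarded later, per the suggestion above.
function randomOldids(lo, hi, count) {
  var seen = {};
  var ids = [];
  while (ids.length < count) {
    var id = lo + Math.floor(Math.random() * (hi - lo + 1));
    if (!seen[id]) {
      seen[id] = true;
      ids.push(id);
    }
  }
  return ids;
}

// e.g. var sample = randomOldids(33000000, 94000000, 7500);
// 7500 draws for a 5000-edit target; the oversample factor is a guess.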
Coder has volunteered help
I asked for help on the Bot request page and User:Autocracy has graciously volunteered to help so when we figure out what data we want to pull, he may be able to write a bot to make it easier. Remember 14:40, 8 April 2007 (UTC)
- This looks interesting - if you need anything, I can help as well. ST47Talk 16:07, 8 April 2007 (UTC)
- OK, so here are my thoughts from what you've written up so far (using the table above as a reference):
- I can create a table like the one above as output, and have an extra column at the end for signing off reviews.
- To help with determining edit type, I can link to the point in the history list where the change is the mid-point.
- I can have a program populate all fields except "Edit type," "Notes on edit," "Revert Date," and "Reverter Name"
- I can output results into the same format as that wiki table, and we can go from there. Further ideas or comments, anybody? --Auto(talk / contribs) 14:39, 30 April 2007 (UTC)
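As a rough idea of that output step, a sketch of formatting one edit's fields as a row of sortable wikitable markup, with a trailing empty cell for the sign-off column mentioned above (field order illustrative):

// Hypothetical sketch: one edit's fields as a wikitable row, plus an
// empty final cell for a reviewer to sign off in.
function wikiTableRow(fields) {
  return '|-\n| ' + fields.join(' || ') + ' ||\n';
}

// wikiTableRow([1, 84878131, 'Zorro', '2006-10-31 18:26:46'])
// -> '|-\n| 1 || 84878131 || Zorro || 2006-10-31 18:26:46 ||\n'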
Possible tools
I thought I would create a section for possible tools we could use to collect the relevant data. Remember 17:56, 9 April 2007 (UTC)
Feeling slow today
I must be slow today.... can someone walk me through from the beginning what we would track and how we would actually track it? I am probably just missing something super obvious. Alex Jackl 03:45, 13 April 2007 (UTC)
- The plan is to generate random numbers in a certain range and that will give us the edit number to investigate. We will then check out that edit and gather information about it.
- We are currently debating what information to gather about that edit. Martschink believes that we should gather as much useful information about each edit as is possible so we can learn things like whether newer articles are vandalised more or whether new registered users vandalize more, etc. The debate is taking place above (but no one has really offered any opinions yet). Feel free to jump in and add your thoughts. Remember 13:07, 13 April 2007 (UTC)
- I think I'm with Mart, let's get as much information per data point as possible. Measuring something is easy, but knowing what it is you are measuring is the hard part - interpretation seems to be the strong suit of this project and with good reason. The more paints we have to work with the more complete the picture will be in the end. :) JoeSmack Talk 16:01, 14 April 2007 (UTC)
- Joe, how do you feel about the current table and data points? Remember 16:23, 14 April 2007 (UTC)
- I think we need to make it look smaller or find a way to fit all the data on one screen and not have it drag off to the right so much (my browser has to sidescroll and I have a widescreen)! Any way to do that? It is a lot of info, and having it all in front of your eyes at once as opposed to having to scroll around would be important. JoeSmack Talk 16:48, 14 April 2007 (UTC)
- That would be nice, but I don't know how to do it and have so much information in one row. Remember 17:16, 14 April 2007 (UTC)
- Would it be weird if it was in columns? We could prolly fit like 10 data points wide doing that, and then just make a new table for the next 10. JoeSmack Talk 17:26, 14 April 2007 (UTC)
- That's an idea but I like the ability to do sortable rows and that wouldn't be an ability with splitting up the table. I have revised the table to try to make it smaller. Any other suggestions would be welcome. Remember 18:26, 14 April 2007 (UTC)
(undent) I've been bold and done more revisions, as described above; it now fits (at least on my screen). -- John Broughton (♫♫) 21:58, 14 April 2007 (UTC)
Just a few ideas
I saw you guys were doing this, and some things I'd personally thought would be interesting for a study would be: the repeat rate of vandalism by a single person (how many warnings within a time, was the user blocked after some time, etc.) and another thing which interested me from the previous study... if 25% of the reverts are done by "anonymous" users, then how many of those "anonymous" users are "not logged in" users. I think a decent estimation of such a fact can be made with CheckUser?? I'm not sure. The same can be said of the reverse, of course: how many of the anonymous vandalism actions are by not-logged-in users. There are considerable privacy concerns here of course; I'm not sure how to deal with that, but I'm sure it's possible. Good luck with the study! --TheDJ (talk • contribs • WikiProject Television) 00:27, 16 April 2007 (UTC)
- The repeat rate of vandalism, and the impact of warnings, is really another separate study in itself.
- As far as the extent to which reverts by IP editors (or vandalism) might in fact be done by registered editors who simply aren't logged in, that's an interesting question, but it's unclear how to answer it. CheckUser is really the only way to find out, I believe, but (a) that's pretty much limited to checking on sock puppets, because of privacy concerns, and (b) I'm not sure that CheckUser is used the way you think it is - that is, "Here's an IP address, tell me if it matches ANY registered user." Rather, it's my impression that CheckUser goes something like this: "I think X and Y and Z are really the same person; do the IP addresses match?" (The difference is critical; we have no idea who an anonymous reverter or vandal might be, so we're really asking "does this IP address match any of the 3.7 million registered users?" That's a pretty challenging question to answer.) -- John Broughton (♫♫) 02:55, 20 April 2007 (UTC)
Clarification
Is the decision to analyze "random edits" in a given time frame, not "recent changes"?
If the goal is a large data set, we need specifically designed programmed tools to gather data faster. Without that help, 5000 edits is too much for 8 or 9 volunteers. It would be easy to feed in random numbers for the edits to programs which could plug in fields like article name, hyperlink, edit date, user name, +/- 10 edit dates, and creation date. Maybe size of edit too. If we work from such a generated data set, we can click the link to each edit and code its edit type and edit notes, plus other data I do not know could be gathered well by programs because of the judgement involved (such as category count, # links to article, and revert information), in a fraction of the time. Also we can put more time on coding atypical edits, for example edits filtered because there are not 10 edits prior. Venado 15:43, 23 April 2007 (UTC)
- Yes the idea is to do random edits and not recent changes. I totally agree with the second idea that a program would do this much better than we could and we should probably just focus on doing the stuff that humans can do well (like saying what is vandalism and what isn't). Any help in anyone figuring out how to set this up would be most appreciated. Remember 16:39, 23 April 2007 (UTC)
- We have two bot coder volunteers up there, perhaps tap them on the shoulder via their talk pages? JoeSmack Talk 17:30, 23 April 2007 (UTC)
- I did that with one, feel free to tap the other. Remember 17:51, 23 April 2007 (UTC)
- Finito! :) JoeSmack Talk 18:06, 23 April 2007 (UTC)
- Yes, if you have a set of edits we can probably get you some information - user, edit summary, time, article name, net difference in size. Do you have a list of edits? I can give you a list of random numbers as oldids, and use that as the bot's input. LMK. ST47Talk 14:22, 24 April 2007 (UTC)
- Thank you for doing this. I can develop random number list if needed. We have not chosen the time frame yet but I can get the list when we do. Venado 17:06, 25 April 2007 (UTC)
- I would suggest we study a recent discrete time period (e.g., all the edits in 2006 or, if that is too big, all the edits from January through March 2007). User ST47, could you provide an example of what information your bot could pull for a random edit? For example, could you give a demonstration of all the info your bot can pull for edit number 87309971, which can be found here [4]. Remember 17:22, 26 April 2007 (UTC)
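For reference, a sketch of the kind of api.php call a bot could use for that edit; prop=revisions accepts a revids parameter, though the exact rvprop set here is an assumption about what the study needs:

// Hypothetical sketch: fetch basic data for one revision by oldid.
var url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions'
        + '&revids=87309971&rvprop=ids|user|timestamp|comment|size&format=json';

var req = new XMLHttpRequest();
req.open('GET', url, false); // synchronous, for brevity only
req.send(null);
var pages = eval('(' + req.responseText + ')').query.pages; // JSON.parse today
// pages[pageid].revisions[0] then holds user, timestamp, edit summary and size.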
Think big
First, you guys are doing a wonderful job. Results of your first study will be cited in many places, I am sure. Second - I'd like to invite all to meta:General User Survey. Studying vandalism is important, but studying all Wikipedians would be extremely useful... the project, unfortunately, is stalled due to lack of interest among developers (the question part is mostly done, but nobody has a workable idea of how to carry out the survey, this being the best one...). -- Piotr Konieczny aka Prokonsul Piotrus | talk 07:30, 24 April 2007 (UTC)
New revision to planned study proposed
I was thinking: now that an article's history displays how large the article is at each point in its history (by stating how many bytes it is, in parentheses, on the history page), we could limit our study to a time period after this feature was incorporated into the wiki software, and this might add a lot of useful, easily gatherable data to our study. For instance, this would let us know the average size that a vandalized article was at the time of vandalism and how much information was added by a vandal when he vandalizes it. So I was thinking for this next study, we should just do a random sampling of all of the edits from May 1 to May 31, 2007 (the change to the wiki software took place on April 19). We could use the rest of this month to prepare and start the study on June 1, 2007. What do people think of this? Remember 20:46, 6 May 2007 (UTC)
- There's certainly a strong argument for doing it that way. We'd lose seasonal effects (like April Fools' Day) and the ability to show changes over long periods of time, but otherwise I don't think the overall quality of the study would be compromised. Good plan. Martschink 15:44, 7 May 2007 (UTC)
- Yes, I don't see any reason to use older dates that contain less data. So we'd be gathering data on random edits from May 1 - May 31, and start analyzing data in June? If so, let's focus right now on how we'll gather the data. JoeSmack Talk 16:25, 7 May 2007 (UTC)
- We would have to wait until May ends so that we could randomly select edits from all of May. Therefore, we can't start the study until after May 31. But we could get everything ready for the study. Remember 17:02, 7 May 2007 (UTC)
- Agreed, we'd have to wait until the end of May to gather the data. Do we need to talk about numbers? Also, do we know how we're going to generate the non-repeating list of random edit numbers? And I'm still of the opinion that we should do a trial run. Maybe we should try a 100-edit micro-study of April so we can work any kinks out of the system? Martschink 19:03, 7 May 2007 (UTC)
- Test run sounds good to me. As for a random number generator, how about this: [5]
Wait... this is all very silly... I already collect the size of the article at the time of the edit regardless... I shall go ahead and add in the difference in size between the selected revision and the prior one (if it exists) for the 2006 samples which I collected the first 1,000 of last night. --Auto(talk / contribs) 13:15, 4 June 2007 (UTC)
Small script
I made a JavaScript script that does a whole bunch of queries to output these results:
page: WIHT rev: 32782407 namespace: 0 user: NetBot editsum: Robot fixing template calls timestamp: 2005-12-26T18:12:00Z anon: false minor: false pageLength: 1962 pageHistoryIndex: 295 pageHistoryLeng: 311 tenBefore: 2005-09-14T04:14:50Z tenAfter: 2006-04-05T03:55:01Z
This information was received with the function call revInfo.getData('32782409');
(referring to this revision), and what this data reveals is the page, the revision number, the namespace number, the user who made the edit, the timestamp of the edit (in UTC), whether the user is an anon, whether the edit is minor, the page length in characters, which number the selected revision is out of the total, the timestamp of the edit ten revisions prior to the queried one, and the timestamp of the edit ten revisions after the queried one. The script is 4310 characters long, and it could do other stuff, depending entirely on what the wikiproject wants in terms of data. Does anyone have any suggestions? GracenotesT § 00:15, 13 May 2007 (UTC)
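This is not the script's actual code, but the tenBefore/tenAfter lookup it describes reduces to something like the following, given the page's revisions sorted oldest-first (names hypothetical):

// Hypothetical sketch: timestamps ten edits either side of the sampled
// revision, clamped to the ends of the history if fewer than ten exist.
function tenAround(revisions, index) {
  var before = revisions[Math.max(0, index - 10)];
  var after = revisions[Math.min(revisions.length - 1, index + 10)];
  return { tenBefore: before.timestamp, tenAfter: after.timestamp };
}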
- First let me say: wow, this is awesome! Thanks for keeping with this and putting the time behind this code. Some questions: how long does this take to run for each data point? Can it be done for random data points, as previously discussed, in the date range May 1 - May 31? In terms of suggestions: can we have a PageLengthDif, i.e., how much was added or subtracted in the revision? Would it be too taxing in terms of processing time to capture the timestamps of edits 1-10 before and after the data point? Can it be checked whether any of those 1-10 edits after the data point contain 'revert(ing)', 'rv', 'rvt', 'rvv', 'vdl' or 'vandal(ism)'? Would it be possible to check whether the contribution that was added or subtracted is still present ('jhonny is STUPID') or not present ('"The Grell" is an episode of The Outer Limits television show.') in any of the 1-10 edits after the data point? These later ones could give us more data on reverting vandalism. Again, thanks for all your hard work Gracenotes! :D JoeSmack Talk 18:18, 13 May 2007 (UTC)
- The amount of time it takes to get the above data depends on the length of the history—I create an array whose items are revisions in the page history, including the edit summary, revision number, time stamp, and other information for each revision. So the edit summary checking is doable; I'd assume that we might want to check the number of times rv, rvt, rvv, Undo, etc. appear, but also take note of whether such a string is included in the edit after the examined one.
- The script is a bit slower than might be expected, because I can only grab the information for 50 revisions at a time. It would be possible to get only the (maximum) twenty-one revisions I needed, but I can't get the length of the page history without counting the number of items in the above-mentioned array.
- As for the page length (in characters), I get that by getting the wikitext of an article and measuring the length. So to get the difference, I'd have to get the wikitext of two revisions; no problem.
- By "content added or subtracted", do you mean compared to the current revision of the article, or compared to next revision not by the user? The latter seems to make a bit more sense, I think.
- Or I could just compare the revisions before and after the user's. This only indicates whether the edit was immediately reverted. As for edits that are reverted in the long term, there would probably be too many false positives/negatives for sensing those. GracenotesT § 00:10, 14 May 2007 (UTC)
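The edit-summary check under discussion could be as simple as the following sketch; the marker list is taken from the comments above and is surely incomplete:

// Hypothetical sketch: flag edit summaries that look like reverts.
var revertMarker = /\b(rv[tv]?|vdl|undo|revert(ing|ed)?|vandal(ism)?)\b/i;

function looksLikeRevert(editSummary) {
  return revertMarker.test(editSummary || '');
}

// looksLikeRevert('rvv') -> true
// looksLikeRevert('Robot fixing template calls') -> false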
- As for categories mentioned above, I can see how many categories appear in the article by scanning through the wikitext. This does not take care of categories included in templates, although I can get the wikitext of an article with all of the templates fully expanded, but those would be the current revision of the template. Backlinks do not look possible. Hope this helps in deciding which information to include! GracenotesT § 00:07, 14 May 2007 (UTC)
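A sketch of that category scan, with the same caveat about templates; the regular expression only sees explicit [[Category:...]] links in the wikitext:

// Hypothetical sketch: count explicit category links in raw wikitext.
// Categories added by templates are not seen, as noted above.
function countCategories(wikitext) {
  var matches = wikitext.match(/\[\[\s*Category\s*:[^\]]*\]\]/gi);
  return matches ? matches.length : 0;
}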
- This looks awesome! Thanks for all your hard work! Could you show me how to use this script? Remember 02:00, 14 May 2007 (UTC)
- My thanks as well. That is a good go. If the program catches next-edit reverts that is a big help, also recording the reverting editor and time of revert. That would reduce the number of reverts we have to hunt down data for personally. Venado 02:42, 17 May 2007 (UTC)
I see that autocracy has done some nifty work below. Meanwhile, though, anyone interested can probably include
if (wgPageName == 'Wikipedia:WikiProject_Vandalism_studies/Query') importScript('User:Gracenotes/vandinfo.js');
in their monobook.js, and go to Wikipedia:WikiProject Vandalism studies/Query to run the script. (I'll get it to work in IE as soon as possible.) GracenotesT § 19:48, 17 May 2007 (UTC)
- Man, these bot things look awesome. So I will ask for some more magic to be built into them. You guys are saying you can detect "next edit reverts", right? The major task in this study is to tell if an edit is vandalism or not. If we find a subsequent edit that reverts the text exactly back to the text found in the edit in question, then there are three possibilities (as I see it):
- 1. The edit was vandalism
- 2. The edit was reverting vandalism
- 3. The edit is part of a content dispute
- To distinguish between these possibilities will require a judgement call by the experimenter. And to make this judgement call they will need to look at the diffs between the edit in question and the edit that did the revert, and probably any intermediate edits too. So ...
- Would it be possible to have the bot search the history after the edit to find any edits that completely revert the edit (not just test if the next edit is a reversion)
- and
- To display links that will bring up the diffs between the edits?
- Another very useful link to have would be one that displays the diff between the previous edit and the edit in question.
- Ttguy 22:21, 23 May 2007 (UTC)
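A sketch of that complete-revert search: walk forward through later revisions and flag any whose full wikitext matches the version as it stood just before the edit in question (all names hypothetical; comparing hashes rather than full text would be the obvious optimisation):

// Hypothetical sketch: find the first later revision whose wikitext is
// identical to the version just before the edit in question.
function findCompleteRevert(beforeText, laterRevisions) {
  for (var i = 0; i < laterRevisions.length; i++) {
    if (laterRevisions[i].wikitext === beforeText) {
      return laterRevisions[i]; // still needs the human judgement call above
    }
  }
  return null; // no exact revert in the window examined
}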
- Example
- On the [[Genetically modified food]] page we have the following series of edits:
- 131399491 - b4 version in question
- 132294711 - version in question
- 132294778 - intermediate version
- 132294870 - reverting version
- and let's say our random edit picker chose edit number 132294711.
- Then it would be good for the bot to come up with a series of links like this...:
Diff Previous -> Edit in Question
Sanity Check (Diff version b4 Edit in Question -> Edit that Reverts - should show no differences)
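Those links are just ordinary index.php diff URLs; a sketch using the revision numbers from the example:

// Hypothetical sketch: build the diff links described above.
function diffUrl(oldid, diff) {
  return 'http://en.wikipedia.org/w/index.php?oldid=' + oldid + '&diff=' + diff;
}

// Diff Previous -> Edit in Question: diffUrl(131399491, 132294711)
// Sanity check (version b4 -> reverting version, should be empty):
//   diffUrl(131399491, 132294870)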
Bot Progress
So, finals are over... and I'm finally taking a night to crank through most of this. I spent some time enhancing Perlwikipedia.pm, and now I'm working on writing up a full module. I see Gracenotes already has something in JS, but I'm going to code mine anyway :) Once it is working, I'll hopefully have a DB with 5,000 edits in it. One outstanding question: how do we handle a revid that has been deleted? --Auto(talk / contribs) 01:09, 17 May 2007 (UTC)
- Good question. The deleted article list is by article name only. There are many speedy deletes per day, leaving little trail. We can probably get how many WP article edits there are per deletion, and can see the general category of reason for each given in the deletion table, and do analysis from that. But we won't know data on the editor responsible or any other data on the article. Admins with access rights can still see much of that data, but that would be a big hassle on admins to get if there are many in the random sample. Question: can we assume deleted edits associate with a fixed edit number, and that if a deleted edit came up in the sample it would be a #null# hit? Can we presume non-article edit deletions (#null# hits) would be very rare, so any disappeared edit is probably to an article? I have seen talk page edits permanently blanked too, but I don't know if that is rare. Venado 02:32, 17 May 2007 (UTC)
- Well, the thing I've run into is I can pick a revid that comes back "bad." There's no way that I'm aware of to track down the article or edit it relates to. Oh, and the bot is working awesomely right now. I have everything but creation, first edit, and number of edits operating. --Auto(talk / contribs) 03:44, 17 May 2007 (UTC)
- At this unholy hour, I have now greatly expanded the capability of Perlwikipedia.pm, and made great progress.
Current Status
- Edit Number: Done
- Edit Date: Done
- Edit Type: manual
- Text Character Change: not implemented, but possible
- Date of Ten Edits before: Done, or the last edit (e.g., if only 5 prior, then the 5th)
- Date of Ten Edits later: Done, same as 10 prior
- Editor Type: not implemented, but possible
- Editor Name: Done
- Number of Edits: Done -- this is the number of edits to the editor's credit at the time of the edit, right?
- That is correct.
- Date of First Edit: Done -- this is the date the editor did his first edit ever, anywhere on the wiki, right?
- That is correct.
- All Reversion Info: Not implemented, intended to be manual
- Article Size: Done
- Number of Categories: Requested (ATOE) count not possible; implemented as current count instead
- Number of Links: Requested (ATOE) count not possible; implemented as current count instead
- Creation Date: Done, follows full history of edits
- Date of Collection: Done
Proposed possible additional metrics:
- date of previous semi-protection of article
- date of previous un-protection of article
I believe from the above two we can determine whether the article was semi-protected or not at the time of the edit. Ttguy 12:19, 23 May 2007 (UTC)
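One way a bot might check this is the protection log via api.php's logevents list; a sketch, with the title taken from the example above:

// Hypothetical sketch: fetch the protection log for one article. Each
// returned event carries an action (protect/unprotect/modify) and a
// timestamp; an edit falls under semi-protection if it lands between a
// semi-protect event and the next unprotect.
var logUrl = 'http://en.wikipedia.org/w/api.php?action=query&list=logevents'
           + '&letype=protect&letitle=Genetically_modified_food'
           + '&lelimit=50&format=json';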
Api.php broken
At the moment, I have to hold off until I can get somebody to fix this. Specifically, whatever revisions I query for, it always returns the latest one for that page. This is a Wikipedia issue. --Auto(talk / contribs) 18:01, 17 May 2007 (UTC)
- I coded up a workaround on my end. I have one last outstanding issue that is related to api.php, and that is not being able to query the contrib history for an anonymous IP user. Thoughts please?
- That's not possible with api.php, I believe. query.php might be more suitable, e.g. here. GracenotesT § 20:30, 17 May 2007 (UTC)
Sample Output
Notes: Some sampled revision numbers are deleted revisions... I decided that sampling those would be valuable, so I keep the data (examples #5, #12). Redirects behave weirdly (#9). I don't have contrib history for IP addresses (#1, #3, #4). I can format in columns to edit later... it is probably best if input for determining whether an edit is vandalism is put into an actual SQL database as we go along, to prevent conflicts and allow us to read the data easily. It's probably more likely, though, that I'll just end up breaking everything into 100-sample sections and recombining with a bot. --Auto(talk / contribs) 21:16, 17 May 2007 (UTC)
Data # | Edit # | Page Name | Date of Edit | Number of edits back (<=10) | Number of edits forward (<=10) | Time of ten edits before | Time of ten edits later | Size | Number of cats | Links to article | Create date | Editor Name | # of edits (ATOE) | First edit time | Date of collection |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 84878131 | Zorro | 2006-10-31 18:26:46 | 10 | 10 | 2006-10-19 17:08:49 | 2006-11-14 19:50:33 | 20679 | 14 | 258 | 2002-12-20 22:18:38 | 24.155.243.76 | -1 | 2007-05-17 16:44:18 | |
2 | 56772518 | Animated Series Batman | 2006-06-04 04:08:26 | 10 | 10 | 2006-06-03 18:08:16 | 2006-06-04 21:46:21 | 33088 | 0 | 10 | 2006-05-27 21:52:03 | WesleyDodds | 2930 | 2005-09-04 13:34:11 | 2007-05-17 16:44:39 |
3 | 39919430 | Babylon | 2006-02-16 20:53:17 | 10 | 10 | 2006-02-03 21:12:39 | 2006-02-17 06:20:47 | 10518 | 11 | 1132 | 2002-02-25 15:43:11 | 209.242.174.159 | -1 | 2007-05-17 16:45:05 | |
4 | 41785028 | Spanish-American War | 2006-03-01 19:02:15 | 10 | 10 | 2006-02-26 16:21:07 | 2006-03-05 06:46:17 | 40016 | 3 | 1917 | 2002-02-25 15:51:15 | 159.191.12.24 | -1 | 2007-05-17 16:45:31 | |
5 | 93078328 | [[:]] | 2007-05-17 16:45:32 | ||||||||||||
6 | 41571970 | Baraqish | 2006-02-28 05:44:36 | 1 | 2 | 2006-02-06 22:22:39 | 2007-01-12 11:59:11 | 365 | 2 | 3 | 2006-02-06 22:22:39 | Grutness | 2497 | 2006-01-24 01:31:20 | 2007-05-17 16:45:47 |
7 | 33451482 | Miami-Dade Transit | 2006-01-01 05:19:20 | 10 | 10 | 2005-10-25 21:09:04 | 2006-01-31 00:11:21 | 16250 | 3 | 126 | 2004-04-03 16:59:29 | B.Wind | 1380 | 2005-10-16 22:43:54 | 2007-05-17 16:45:55 |
8 | 71348843 | Little Big Horn College | 2006-08-23 09:31:12 | 8 | 1 | 2006-01-23 00:45:18 | 2006-11-10 08:49:10 | 1244 | 5 | 5 | 2006-01-23 00:45:18 | Bluebot | 273575 | 2005-10-26 11:57:55 | 2007-05-17 17:04:39 |
9 | 81497099 | Feltham and Heston | 2006-10-15 00:28:49 | 0 | 0 | 2006-10-15 00:28:49 | 2006-10-15 00:28:49 | 62 | 0 | 0 | BrownHairedGirl | 497 | 2006-10-13 16:45:14 | 2007-05-17 17:04:43 | |
10 | 84645174 | Roads of Charleston, West Virginia | 2006-10-30 16:41:45 | 10 | 10 | 2006-10-30 00:47:14 | 2006-12-19 01:34:40 | 7484 | 4 | 18 | 2006-10-25 21:06:07 | Seicer | 1498 | 2006-09-07 19:36:06 | 2007-05-17 17:04:54 |
11 | 74820016 | Beyoncé Knowles | 2006-09-10 01:21:44 | 10 | 10 | 2006-09-09 13:21:26 | 2006-09-10 18:22:49 | 30377 | 26 | 1386 | 2002-09-11 02:45:32 | Peecee1978 | 889 | 2006-02-18 01:29:26 | 2007-05-17 17:05:49 |
12 | 93330194 | [[:]] | 2007-05-17 17:05:50 | ||||||||||||
13 | 65243568 | Tojolabal | 2006-07-22 18:19:14 | 1 | 10 | 2006-07-22 18:13:33 | 2006-12-13 02:17:31 | 297 | 1 | 4 | 2006-07-22 18:13:33 | Juan de Vojníkov | 156 | 2005-12-02 18:29:42 | 2007-05-17 17:05:56 |
14 | 78927440 | Post-normal science | 2006-10-01 20:20:06 | 10 | 10 | 2005-05-31 10:58:54 | 2007-04-16 11:51:41 | 2921 | 1 | 2 | 2005-05-31 10:53:43 | Bluebot | 320189 | 2005-10-26 11:57:55 | 2007-05-17 17:16:09 |
15 | 74940077 | Image:Aker Kværner logo.jpg | 2006-09-10 18:11:59 | 0 | 1 | 2006-09-10 18:11:59 | 2007-04-24 04:21:59 | 116 | 2 | 0 | Arsenikk | 475 | 2006-07-27 16:53:37 | 2007-05-17 17:16:13 | |
16 | 68234545 | Minnewiki | 2006-08-07 18:23:09 | 7 | 0 | 2005-10-19 02:12:16 | 2006-08-07 18:23:09 | 36 | 0 | 0 | 2005-10-19 02:12:16 | Xyzzyplugh | 1998 | 2006-03-04 09:03:49 | 2007-05-17 17:16:19 |
17 | 41663034 | Glover Park | 2006-02-28 22:37:08 | 10 | 10 | 2005-04-13 11:09:21 | 2006-09-21 17:17:24 | 1880 | 1 | 147 | 2005-03-09 20:26:51 | 168.158.224.29 | -1 | 2007-05-17 17:16:23 | |
18 | 42814089 | Manchester United F.C. | 2006-03-08 14:50:31 | 10 | 10 | 2006-03-08 12:26:11 | 2006-03-08 21:09:49 | 34557 | 7 | 3075 | 2001-09-28 15:39:23 | 213.130.128.197 | -1 | 2007-05-17 17:16:53 | |
19 | 40967212 | Yair Lapid | 2006-02-24 04:15:04 | 5 | 10 | 2005-08-13 23:23:31 | 2006-05-25 17:02:31 | 170 | 4 | 9 | 2005-08-13 23:23:31 | 72.224.95.121 | -1 | 2007-05-17 17:16:56 | |
20 | 68811659 | User talk:LactoseTI | 2006-08-10 13:24:34 | 3 | 10 | 2006-08-10 05:20:36 | 2006-08-11 04:50:41 | 6115 | 0 | 103 | 2006-08-10 05:20:36 | J Di | 497 | 2006-08-08 20:04:37 | 2007-05-17 17:17:02 |
21 | 93696786 | Sigma Phi Epsilon | 2006-12-12 00:12:12 | 10 | 10 | 2006-12-08 21:08:39 | 2006-12-14 06:00:37 | 19545 | 5 | 299 | 2003-08-26 22:57:50 | Treyt021 | 710 | 2006-01-21 01:19:40 | 2007-05-17 17:17:09 |
22 | 47749590 | User talk:68.85.237.95 | 2006-04-10 03:48:20 | 1 | 0 | 2006-01-19 07:24:46 | 2006-04-10 03:48:20 | 273 | 0 | 1 | 2006-01-19 07:24:46 | 70.95.59.46 | -1 | 2007-05-17 17:17:12 | |
23 | 80524199 | Wikipedia:Requests for adminship/P.B. Pilhet | 2006-10-10 00:47:09 | 10 | 10 | 2006-10-09 23:23:48 | 2006-10-10 04:24:14 | 7328 | 1 | 9 | 2006-10-09 22:42:15 | Mike1 | 998 | 2006-07-31 13:13:45 | 2007-05-17 17:17:17 |
24 | 51345072 | User:Sdoroudi | 2006-05-03 10:24:29 | 10 | 10 | 2006-05-02 02:33:41 | 2006-06-04 01:37:20 | 4695 | 16 | 15 | 2006-03-22 02:49:19 | Sdoroudi | 165 | 2006-01-16 01:40:51 | 2007-05-17 17:17:23 |
25 | 50247800 | Tricksta | 2006-04-26 12:46:18 | 10 | 7 | 2006-04-26 02:42:26 | 2007-02-19 19:34:50 | 2491 | 2 | 7 | 2006-04-26 02:15:50 | Yeanold Viskersenn | 96 | 2006-04-21 16:07:07 | 2007-05-17 17:17:29 |
26 | 40103478 | Mahdi | 2006-02-18 03:51:43 | 10 | 10 | 2006-02-15 00:32:59 | 2006-02-19 16:52:37 | 6938 | 7 | 324 | 2003-02-27 00:32:49 | Jeremygbyrne | 998 | 2005-09-28 14:32:08 | 2007-05-17 17:17:42 |
27 | 71075002 | Russell Peters | 2006-08-22 00:47:09 | 10 | 10 | 2006-08-08 13:32:33 | 2006-08-27 04:12:13 | 3593 | 15 | 60 | 2005-07-05 19:45:00 | 68.147.239.62 | -1 | 2007-05-17 17:17:46 | |
28 | 48019959 | Underworld (band) | 2006-04-11 21:56:47 | 10 | 10 | 2006-03-14 10:22:23 | 2006-04-30 09:15:17 | 10609 | 15 | 297 | 2002-09-25 13:32:33 | Night Gyr | 996 | 2006-01-24 23:27:54 | 2007-05-17 17:17:53 |
29 | 68543813 | Joe Lieberman | 2006-08-09 04:20:35 | 10 | 10 | 2006-08-09 04:14:04 | 2006-08-09 04:27:00 | 57932 | 17 | 758 | 2002-08-21 20:44:26 | Drjayphd | 37 | 2006-04-07 06:46:15 | 2007-05-17 17:18:17 |
30 | 37390177 | User talk:Smartech | 2006-01-30 18:50:44 | 5 | 6 | 2005-03-22 23:51:37 | 2006-08-11 22:31:44 | 4997 | 0 | 0 | 2005-03-22 23:51:37 | Bitola | 225 | 2005-11-23 17:31:42 | 2007-05-17 17:18:25 |
31 | 69900405 | Before Dark | 2006-08-15 22:20:39 | 10 | 2 | 2005-12-26 04:44:14 | 2006-09-08 05:37:49 | 1018 | 5 | 12 | 2005-12-22 04:56:00 | 204.167.92.26 | -1 | 2007-05-17 17:18:28 | |
32 | 82199959 | Treehouse of Horror IV | 2006-10-18 12:57:39 | 10 | 10 | 2006-09-13 21:34:03 | 2006-10-20 13:13:41 | 10469 | 5 | 875 | 2005-02-20 17:10:43 | L0b0t | 1310 | 2006-06-23 15:37:20 | 2007-05-17 17:18:37 |
33 | 45035643 | [[:]] | 2007-05-17 17:18:37 | ||||||||||||
34 | 74626402 | Wikipedia talk:Protecting children's privacy | 2006-09-09 00:33:59 | 10 | 10 | 2006-09-08 23:54:36 | 2006-09-09 03:54:02 | 73155 | 0 | 59 | 2006-08-26 14:27:48 | JayW | 496 | 2006-06-02 00:55:35 | 2007-05-17 17:18:43 |
35 | 68832744 | Dundee United FC Season 1997-98 | 2006-08-10 15:43:45 | 4 | 10 | 2006-08-10 13:26:22 | 2006-12-31 15:46:54 | 945 | 1 | 22 | 2006-08-10 13:26:22 | Fedgin | 376 | 2005-08-13 20:52:11 | 2007-05-17 17:18:49 |
36 | 89397260 | Talk:Iroh | 2006-11-22 06:19:53 | 10 | 10 | 2006-10-15 01:40:02 | 2006-12-10 00:58:36 | 10958 | 6 | 4 | 2005-10-05 23:46:29 | 72.11.219.246 | -1 | 2007-05-17 17:18:56 | |
37 | 80042684 | Disc image emulator | 2006-10-07 15:56:31 | 0 | 0 | 2006-10-07 15:56:31 | 2006-10-07 15:56:31 | 33 | 0 | 1 | PeFi | 12 | 2006-10-05 22:15:34 | 2007-05-17 17:19:00 | |
38 | 64974001 | [[:]] | 2007-05-17 17:19:00 | ||||||||||||
39 | 90551502 | David Fromkin | 2006-11-28 00:25:50 | 10 | 10 | 2006-11-21 21:26:59 | 2006-12-12 09:03:06 | 2943 | 5 | 25 | 2005-09-05 14:30:33 | 63.164.145.198 | -1 | 2007-05-17 17:19:04 | |
40 | 42338287 | User talk:Ruidh | 2006-03-05 13:29:49 | 0 | 0 | 2006-03-05 13:29:49 | 2006-03-05 13:29:49 | 1691 | 0 | 0 | Xy7 | 12487 | 2005-07-30 19:44:19 | 2007-05-17 17:19:32 | |
41 | 44983806 | Jeopardy! | 2006-03-22 19:19:27 | 10 | 10 | 2006-03-19 05:48:50 | 2006-03-25 04:39:40 | 54928 | 18 | 1129 | 2002-12-11 05:08:37 | 65.215.131.2 | -1 | 2007-05-17 17:19:45 | |
42 | 85463926 | Live Free or Die Hard | 2006-11-03 13:10:44 | 10 | 10 | 2006-10-25 22:48:13 | 2006-11-05 17:36:05 | 7396 | 11 | 95 | 2005-04-17 04:05:10 | 87.250.172.175 | -1 | 2007-05-17 17:19:50 |
- This is great! Finally we're going to have some raw chunks of data to work with! Thanks for all the work on this! :D JoeSmack Talk 16:43, 18 May 2007 (UTC)