User talk:West.andrew.g/Popular pages/Archive 1


I think it would be cool if we could see 1) which entries weren't included in the previous cycle 2) how many 10 day cycles they've been on the top 5000 and 3) how much they changed from the previous cycle (up or down). Biosthmors (talk) 14:41, 3 December 2012 (UTC)

Not on my immediate agenda; but I did find wikitrends. It doesn't seem to go to great depth, nor can I vouch for its quality or methodologies -- but maybe it's a good start for what you are looking for. Thanks, West.andrew.g (talk) 14:50, 3 December 2012 (UTC)
Thanks for the link. Biosthmors (talk) 14:52, 3 December 2012 (UTC)

Exactly how many days

Could we get a line that says how many days each count includes? I'm guessing it is 10 plus or minus. Thanks! Biosthmors (talk) 23:48, 14 December 2012 (UTC)

Per the page header: "The third column is the number of views the corresponding article received in the 10 days prior to the last list update (non-inclusive)". If you don't find this prominent enough, update accordingly. Thanks also for your copy edit of the page. Thanks, West.andrew.g (talk) 15:59, 15 December 2012 (UTC)
You're welcome! So it's a guaranteed 10 day count, just that sometimes a day or two might be skipped? Thanks. Biosthmors (talk) 20:04, 15 December 2012 (UTC)
Based on that understanding, I made another copy edit. Best. Biosthmors (talk) 20:10, 15 December 2012 (UTC)
No skipping. Every 10 days. Thanks, West.andrew.g (talk) 21:31, 15 December 2012 (UTC)
I apologize for the confusion; I now see why one might deduce it is not exactly every 10 days: (1) I forced an update to annotate good/featured articles, which occurred outside of the automatic schedule, and (2) In doing so, I broke the automatic update code. Another forced update is happening now, and things will soon operate in an orderly (*exactly* every 10 days) schedule. Thanks, West.andrew.g (talk) 05:49, 17 December 2012 (UTC)

I was wondering if we could get a gold star by the articles on this list that are featured (and perhaps good ones too?). (I posted on a related subject at WT:FA.) It would also be great to see month by month statistics to identify trends in these counts month by month. Maybe even a number of how many peer reviews each article in this list has had. Thanks. Biosthmors (talk) 20:44, 2 December 2012 (UTC)

Doesn't look like article quality is an immediately accessible field via the MediaWiki API. Where is the most convenient listing of currently featured/good articles? I assume there is an explicit category in place, though, that I could list out to achieve this? About how many featured/good articles are there? Thanks, West.andrew.g (talk) 21:30, 2 December 2012 (UTC)
Thanks for the reply. Category:Featured articles (~3750 per WP:FA) and Category:Good articles (~16,200 per WP:GA). Biosthmors (talk) 21:57, 2 December 2012 (UTC)
FYI, an audit was performed recently on FAs (see User:Maralia/FAC monitoring). I would imagine GAs need a similar audit. Biosthmors (talk) 21:59, 2 December 2012 (UTC)
For the sake of...posterity, I guess...just wanted to note here that the userpage of mine which was linked immediately above was something of a catch-all sandbox, and has since been broken out into several more specific pages. The relevant portion is now at User:Maralia/FA audits. Maralia (talk) 19:07, 17 January 2013 (UTC)
I'll see what I can do over the next couple of days. If/once I succeed in this effort, I'd appreciate if you'd publicize this feature in whatever locations might consider it relevant. I'll talkback you when that happens or if I have any other questions. Thanks, West.andrew.g (talk) 23:37, 2 December 2012 (UTC)
Absolutely. I think this kind of data is critical. Thanks so much. Biosthmors (talk) 23:45, 2 December 2012 (UTC)
Implemented! Should be pretty obvious. Thanks, West.andrew.g (talk) 13:11, 3 December 2012 (UTC)
That's fantastic! I'll go award a barnstar to the person who has worked on 2012 phenomenon. Is there a way to generate a count of the number of FAs and GAs each cycle? I'd like to start keeping track of this. Maybe we could even eventually have an automatic sentence at WP:FA and WP:GA that incorporates these statistics right after they mention the overall statistics. Biosthmors (talk) 14:31, 3 December 2012 (UTC)
I went ahead and added a counter for featured/good articles. It will appear just below the massive table. It doesn't appear there now, however, because I don't think it's worth a couple hours of compute time and database thrashing just to produce these minor fields. They will appear at the next automatic update. Thanks, West.andrew.g (talk) 14:47, 3 December 2012 (UTC)
Thank you. Biosthmors (talk) 14:52, 3 December 2012 (UTC)
Thanks again. I think this is great. Would it also be possible to add a counter at the bottom for "estimated non-articles"? It could subtract out the red links, Main page, undefined, Undefined, and index.html so we could get a more accurate estimate of the percent of FAs or GAs each cycle. Biosthmors (talk) 14:56, 20 December 2012 (UTC)
And then maybe we could see a % for each of those values given at the bottom? That's what I'm most interested in. I want to keep tabs on the trend of GAs and FAs in the most popular content over time. Biosthmors (talk) 14:59, 20 December 2012 (UTC)

Hi again Biosthmors. Due to rapidly evolving real life considerations, this isn't a feature I'll be able to integrate any time in the immediate future. Fortunately, however, it isn't one that really needs to be addressed with serious back-end processing. Notice that the Mediawiki parser does most of the work for us when coloring links red/blue. This is annotated in the source code by the presence of "redlink=1", and one would just need to count those instances. Surely the "find" functionality of some text editors would make immediate work of such a sum. Alternatively, I wonder if this is something that could be handled with template coding (not my expertise)? Thanks, West.andrew.g (talk) 05:02, 22 December 2012 (UTC)
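The counting approach described above can be sketched in a few lines of Python. This is a minimal sketch, assuming the rendered page HTML has already been saved to a string; the function name `count_red_links` and the sample markup are illustrative assumptions, not code used for the list itself.

```python
import re


def count_red_links(html: str) -> int:
    """Count href attributes that carry the redlink=1 marker."""
    # MediaWiki renders a link to a missing page with "redlink=1" in its URL,
    # so counting href attributes containing that marker counts red links.
    return len(re.findall(r'href="[^"]*redlink=1[^"]*"', html))


# Illustrative sample only: two red links and one ordinary (blue) link.
sample = (
    '<a href="/w/index.php?title=Kevin_Gates&amp;action=edit&amp;redlink=1" '
    'class="new">Kevin Gates</a> '
    '<a href="/wiki/Main_Page">Main Page</a> '
    '<a href="/w/index.php?title=18k_Gold_Watch&amp;action=edit&amp;redlink=1" '
    'class="new">18k Gold Watch</a>'
)

print(count_red_links(sample))
```

Matching on the `href` attribute rather than the bare string avoids counting a stray "redlink=1" that might appear in visible page text.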

Well I regret that some of your advice went over my head, but I did get an answer:[1]. Now I just need to implement it. Biosthmors (talk) 04:00, 29 December 2012 (UTC)
I added to User:Biosthmors/common.js and I now see there are 46 red links. Very cool. Biosthmors (talk) 22:05, 30 December 2012 (UTC)
Glad to hear it worked for you. When life calms down in a couple months(?), I will be more than receptive to implement these types of things -- just can't find the free cycles at the current moment. Thanks, West.andrew.g (talk) 22:10, 30 December 2012 (UTC)
No problem. Thanks for all of your help. Biosthmors (talk) 23:24, 30 December 2012 (UTC)
  Done An automated "red link count" should appear when the statistics are next automatically generated (about 9 days from now). Please do a sanity check when this happens, and ping me if anything goes wrong. Thanks, West.andrew.g (talk) 14:32, 17 January 2013 (UTC)
FWIW, the script says there are 43 red links and the counter says 40. Biosthmors (talk) 04:39, 20 January 2013 (UTC)
This morning's update says "41" when computed automatically. I scanned it at a million miles per hour and counted "40". My scheme uses the API in a quite simple fashion, so I am inclined to believe it's correct. Moreover, this shouldn't be quibbled over. The notion of a "red link" is dynamic: any of these pages could be getting deleted or re-created at any point, causing the actual number to shift, while the hard number at bottom stays constant. Thanks, West.andrew.g (talk) 15:25, 20 January 2013 (UTC)

Andrew, when you get a chance, could you add a counter for featured lists? There are at least 100 hits for "List of" in the list. It would be great to keep track of this too. Best. Biosthmors (talk) 21:01, 15 January 2013 (UTC)

When everything else happens, this will too. Thanks, West.andrew.g (talk) 23:35, 15 January 2013 (UTC)
Great. Thanks again for your work here. Biosthmors (talk) 23:56, 15 January 2013 (UTC)
  Done An automated "# of lists " should appear when the statistics are next automatically generated (about 9 days from now). Please do a sanity check when this happens, and ping me if anything goes wrong. Thanks, West.andrew.g (talk) 14:32, 17 January 2013 (UTC)
Some lists are featured (see WP:FL). Could they get a gold star and a counter too? Biosthmors (talk) 04:26, 20 January 2013 (UTC)
  Done Might even compute tonight if I squeezed it in under the deadline. Else, expect to see next week. Thanks, West.andrew.g (talk) 06:10, 20 January 2013 (UTC)

Total hit count

Could you also add a counter for the sum of the 5000 hit count, when you get a chance? Thanks. Biosthmors (talk) 21:01, 15 January 2013 (UTC)

As in, how many views the top 5000 pages receive, in total? Thanks, West.andrew.g (talk) 21:42, 15 January 2013 (UTC)
Yes, thanks. Biosthmors (talk) 21:58, 15 January 2013 (UTC)
  Done An automated "top 5k articles view sum" should appear when the statistics are next automatically generated (about 9 days from now). Please do a sanity check when this happens, and ping me if anything goes wrong. Thanks, West.andrew.g (talk) 14:33, 17 January 2013 (UTC)
  • The total hit count number is interesting -- do we know what % of all wikipedia views that represents, I presume a fair amount depending how the power law distribution of views applies?--Milowenthasspoken 01:21, 23 January 2013 (UTC)
I have these statistics. I don't have the time at the current moment to spew them out here, but as we collaborate on our Signpost article, this result should become apparent. Thanks, West.andrew.g (talk) 04:56, 23 January 2013 (UTC)

Random statistical ramblings; towards a WP:Signpost article

I showed your beta list to some admins at the London Wikipedia meetup and they say that if you could find a way to adjust your algorithm to get rid of the "undefined" and multiple main page listings, it could be incorporated into the site. :) Serendipodous 20:33, 15 January 2013 (UTC)

I like them being listed, FWIW. Biosthmors (talk) 21:01, 15 January 2013 (UTC)
Besides hard-coding exclusions, the best I will likely be able to do is exclude red-links. What does everyone think about that proposal? What are some examples of things you "like seeing" Biosthmors? Also what is meant by "incorporated into the site"? Thanks, West.andrew.g (talk) 21:41, 15 January 2013 (UTC)
Well it would be nice to see it linked on the history pages, for example. Serendipodous 21:45, 15 January 2013 (UTC)
I like seeing that index.html and index.php are both massive redirects to another page. I like seeing Undefined/undefined listed even though it is a disambiguation page. I like seeing Main page/ listed even though it is also a redirect. I like seeing the total main page count too. In fact, I like seeing all the red link hits, in case these are signs of DDoS attacks or whatever they are. It's just interesting to know where all the web traffic is coming in from. I think these "wild cards" give the list a sense of unvarnished big picture stats, which is refreshing and actionable. Biosthmors (talk) 22:07, 15 January 2013 (UTC)
Just as a matter of record, I think that "undefined/Undefined" are *not* actually visits to that disambiguation page. Rather, this is some type of internal error that occurs on the WMF servers (someone screwing up link syntax, or trying to point to a directory that doesn't exist, etc. etc.). When the statistics are compiled this is just the label that is applied. It is merely a coincidence that a Wikipedia page exists of the same name. Similarly, you could create a page called "WSearch.api". The statistics would still be computing usages of the search function, although there would be a green link to a Wikipedia page which is not getting that quantity of hits. Thanks, West.andrew.g (talk) 23:34, 15 January 2013 (UTC)
If we're not getting the actual visits, then might this be something the WMF could fix, to make Wikipedia run faster or at least more efficiently? Biosthmors (talk) 17:55, 17 January 2013 (UTC)
I have no idea what "undefined/Undefined" actually represents. It's likely not terribly interesting. West.andrew.g (talk) 18:41, 17 January 2013 (UTC)
Speaking of DDoS attacks, might that explain the massive page views for curiously recurring template pattern lately? It's a weird sustained jump. Biosthmors (talk) 22:11, 15 January 2013 (UTC)
  • Chiming in, I also see a value in including the redlinked information. The DDoS attacks or whatever causes huge bumps in random articles are odd and probably deserve more scrutiny. Redlinks like 18k Gold Watch and Hollow Body Guitar are consistently on the list as well; I wonder who is paying for that to happen! For public consumption, it would be nice to have a Top 25 list derived from this report which excludes the screwy entries, because that is what the casual person is interested in. I assume that would require manual creation, and could include commentary, like this quick example I drew up: User:Milowent/Top25.--Milowenthasspoken 23:09, 15 January 2013 (UTC)
I support the creation of such a "top 25 digest". Something like this would be extremely ripe for WP:Signpost inclusion and could drive considerable traffic for that digest and this larger list. Of course, the burden would be on you (or someone) else to put in the manual effort. Thanks, West.andrew.g (talk) 13:48, 17 January 2013 (UTC)
Andrew, how hard would it be to set the list to update every 7 days and cover the prior 7 days' views instead of 10? I've been thinking about this since you started the WP:5000 project. As a matter of advertising and press interest, I think having lists for "the week of January x, 201x" is something people would really identify with more. Most of the world thinks in weekly increments (like they have for TV ratings) and not 10 day increments. So if the list was updated on the same day every weekend, we'd have a snapshot of the prior week for people to look back at. (Other editors please chime in if you agree/disagree.) (Alternatively we could do 14 day periods and have a bi-weekly report if 7 days is too burdensome.) I am willing to work on a top 25 digest and am sure I can find some other editors to share that load. The Signpost publishes on Mondays, so if the report was run on Saturday, that would give us a day to get the digest prepared.--Milowenthasspoken 14:01, 18 January 2013 (UTC)
10 days was chosen because it has some very nice scalability properties with my system; primarily the fact that each 10 day period is contained within a single database table. A query period of 7 days will require spanning two tables, and a massive JOIN operation on the PAGE_TITLE field (millions of entries). Technically this is easy code to write, but I will need to investigate whether or not it blows up in the scalability and memory management departments. I will try to get around to that this weekend and report back accordingly. Let's see also if we can get a commitment from the Signpost that they are interested and will publish this. Thanks, West.andrew.g (talk) 17:48, 18 January 2013 (UTC)
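The two-table problem described above can be illustrated with a small in-memory sketch. The table layout here (a mapping from `(date, title)` to views) and the name `weekly_totals` are hypothetical stand-ins; the real database schema is not given in this discussion, and a real JOIN over millions of titles raises the memory concerns the comment mentions.

```python
from collections import Counter
from datetime import date, timedelta


def weekly_totals(tables, start, days=7):
    """Sum views per title over a window that may span several tables."""
    window = {start + timedelta(days=i) for i in range(days)}
    totals = Counter()
    for table in tables:
        for (day, title), views in table.items():
            # Only rows inside the 7-day window contribute; rows for the same
            # title from different tables are merged by the Counter.
            if day in window:
                totals[title] += views
    return totals


# Hypothetical data: two adjacent 10-day tables, window of Jan 10 to Jan 16.
table_a = {
    (date(2013, 1, 9), "Main_Page"): 100,
    (date(2013, 1, 10), "Main_Page"): 120,
    (date(2013, 1, 10), "Computer_virus"): 5,
}
table_b = {
    (date(2013, 1, 11), "Main_Page"): 90,
    (date(2013, 1, 11), "Computer_virus"): 7,
    (date(2013, 1, 20), "Main_Page"): 80,
}

totals = weekly_totals([table_a, table_b], date(2013, 1, 10))
print(totals.most_common(2))
```

The merge by title is the step that corresponds to the JOIN on PAGE_TITLE: a title that appears in both tables must end up as one row in the weekly result.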
The report will now be generated once every 7 days, on Sunday mornings. I'm about to change the header text to reflect this change. The version for "last Sunday" is processing now and will appear momentarily. Then, tonight, a fresh version will be computed. Thanks, West.andrew.g (talk) 13:54, 19 January 2013 (UTC)
  • Andrew, two quick questions. Do you think Bryan Cranston is an anomaly? 1.09 million views on January 8, and otherwise much lower views. 2nd, if the current report is for Jan 12-18, he shouldn't be so high on the report? It seems the 1.09 million is being included though it's over a week old?--Milowenthasspoken 05:36, 20 January 2013 (UTC)

As of your writing, the version which is displayed covers the week of Jan. 6--12. In the next several minutes/hours a new version will appear that will cover Jan. 13--19. The last day included is always a Saturday. Reports are generated Sunday morning (EST, in case there is any time zone confusion between us). The new report is currently in the processing pipeline. Hopefully this also answers your question regarding the specific article count? Thanks, West.andrew.g (talk) 05:55, 20 January 2013 (UTC)

High quality stuff! Now let's lean on the WP:Signpost to make sure as many people as possible see our work. As I have previously shown in my research, page views on Wikipedia follow a power-law distribution (very few of the articles receive the vast percentage of Wikipedia's traffic). That means these articles, with respect to (1) their overall quality, and (2) our ability to keep them free of vandalism (and perhaps semi-protecting all of them) have tremendous ramifications on how the general public views the project. I'm going to suggest we try to get: (a) A roll out article in the Signpost that introduces our work and perhaps includes some of my research statistics and (b) push for weekly inclusion in the Signpost. Thanks, West.andrew.g (talk) 05:54, 21 January 2013 (UTC)

We will be drafting this at User:West.andrew.g/Popular_pages/Signpost. Thanks, West.andrew.g (talk) 17:58, 24 January 2013 (UTC)

Differences with Henrik's tool

The following discussion is closed and will soon be archived: Henrik's tool has a bug of some kind
For some reason, I decided to download the nearly 5GB of data needed to check on this. Below is the raw data from Jan. 17. You'll see that it sums to nowhere near 800,000 hits. I can't speculate on what might be wrong with that tool, but I will note that its maintainer has been inactive for quite some time and does not respond to talk page messages. I also checked the results on each adjacent day (with similar results), but won't reproduce them here. Thanks, West.andrew.g (talk) 04:56, 25 January 2013 (UTC)
> grep "en Computer_virus" *
pagecounts-20130117-000000:en Computer_virus 134
pagecounts-20130117-010000:en Computer_virus 167
pagecounts-20130117-020000:en Computer_virus 190
pagecounts-20130117-030001:en Computer_virus 171
pagecounts-20130117-040000:en Computer_virus 667
pagecounts-20130117-050000:en Computer_virus 603
pagecounts-20130117-060000:en Computer_virus 302
pagecounts-20130117-070000:en Computer_virus 2630
pagecounts-20130117-080000:en Computer_virus 612
pagecounts-20130117-090000:en Computer_virus 1263
pagecounts-20130117-100000:en Computer_virus 2548
pagecounts-20130117-110000:en Computer_virus 965
pagecounts-20130117-120000:en Computer_virus 478
pagecounts-20130117-130001:en Computer_virus 820
pagecounts-20130117-140000:en Computer_virus 2518
pagecounts-20130117-150000:en Computer_virus 1135
pagecounts-20130117-160000:en Computer_virus 775
pagecounts-20130117-170000:en Computer_virus 381
pagecounts-20130117-180000:en Computer_virus 474
pagecounts-20130117-190000:en Computer_virus 295
pagecounts-20130117-200001:en Computer_virus 325
pagecounts-20130117-210000:en Computer_virus 284
pagecounts-20130117-220000:en Computer_virus 259
pagecounts-20130117-230000:en Computer_virus 281
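The hourly lines above can be totalled mechanically. This is a minimal sketch, assuming the grep output has been pasted into a string in the three-field form shown (project, title, count); the function name `daily_total` is an illustrative assumption.

```python
GREP_OUTPUT = """\
pagecounts-20130117-000000:en Computer_virus 134
pagecounts-20130117-010000:en Computer_virus 167
pagecounts-20130117-020000:en Computer_virus 190
pagecounts-20130117-030001:en Computer_virus 171
pagecounts-20130117-040000:en Computer_virus 667
pagecounts-20130117-050000:en Computer_virus 603
pagecounts-20130117-060000:en Computer_virus 302
pagecounts-20130117-070000:en Computer_virus 2630
pagecounts-20130117-080000:en Computer_virus 612
pagecounts-20130117-090000:en Computer_virus 1263
pagecounts-20130117-100000:en Computer_virus 2548
pagecounts-20130117-110000:en Computer_virus 965
pagecounts-20130117-120000:en Computer_virus 478
pagecounts-20130117-130001:en Computer_virus 820
pagecounts-20130117-140000:en Computer_virus 2518
pagecounts-20130117-150000:en Computer_virus 1135
pagecounts-20130117-160000:en Computer_virus 775
pagecounts-20130117-170000:en Computer_virus 381
pagecounts-20130117-180000:en Computer_virus 474
pagecounts-20130117-190000:en Computer_virus 295
pagecounts-20130117-200001:en Computer_virus 325
pagecounts-20130117-210000:en Computer_virus 284
pagecounts-20130117-220000:en Computer_virus 259
pagecounts-20130117-230000:en Computer_virus 281
"""


def daily_total(grep_output: str, title: str) -> int:
    """Sum the hourly counts for lines whose title matches exactly."""
    total = 0
    for line in grep_output.splitlines():
        if not line.strip():
            continue
        # Drop the "filename:" prefix that grep adds, then split the record.
        _filename, record = line.split(":", 1)
        fields = record.split(" ")
        # Exact title match, so variants such as "Computer_virus%23History"
        # are not folded into the total.
        if len(fields) == 3 and fields[1] == title:
            total += int(fields[2])
    return total


print(daily_total(GREP_OUTPUT, "Computer_virus"))  # prints 18277
```

The total for the day is 18,277, which is consistent with the statement that the raw data sums to nowhere near 800,000 hits.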

Interesting. I see that the other tool gives the same results regardless of capitalization, so it presumably sums the results for all the redirects that differ only in capitalization. (This is not the case when there are redirects to or from accented version of names; in that case they are counted separately and so need to be manually aggregated.) LittleBen (talk) 09:34, 25 January 2013 (UTC)

I'll do this one more time just to make it painfully clear. That tool is broken in some way:

grep "en Comparison_of_Android_devices" *
pagecounts-20130121-000000:en Comparison_of_Android_devices 41
pagecounts-20130121-000000:en Comparison_of_Android_devices%091055702 1
pagecounts-20130121-010000:en Comparison_of_Android_devices 2144
pagecounts-20130121-020000:en Comparison_of_Android_devices 82
pagecounts-20130121-030000:en Comparison_of_Android_devices 55
pagecounts-20130121-030000:en Comparison_of_Android_devices%23Smartphones 1
pagecounts-20130121-030000:en Comparison_of_Android_devices%23Tablet_computers 1
pagecounts-20130121-040000:en Comparison_of_Android_devices 2151
pagecounts-20130121-050001:en Comparison_of_Android_devices 113
pagecounts-20130121-060000:en Comparison_of_Android_devices 40
pagecounts-20130121-070000:en Comparison_of_Android_devices 2011
pagecounts-20130121-080000:en Comparison_of_Android_devices 94
pagecounts-20130121-090000:en Comparison_of_Android_devices 61
pagecounts-20130121-100000:en Comparison_of_Android_devices 2561
pagecounts-20130121-110000:en Comparison_of_Android_devices 105
pagecounts-20130121-130000:en Comparison_of_Android_devices 2514
pagecounts-20130121-140000:en Comparison_of_Android_devices 108
pagecounts-20130121-150000:en Comparison_of_Android_devices 77
pagecounts-20130121-160000:en Comparison_of_Android_devices 2134
pagecounts-20130121-170001:en Comparison_of_Android_devices 157
pagecounts-20130121-180000:en Comparison_of_Android_devices 92
pagecounts-20130121-190000:en Comparison_of_Android_devices 2350
pagecounts-20130121-200000:en Comparison_of_Android_devices 112
pagecounts-20130121-210000:en Comparison_of_Android_devices 77
pagecounts-20130121-210000:en Comparison_of_Android_devices& 1 
pagecounts-20130121-220000:en Comparison_of_Android_devices 2418
pagecounts-20130121-230000:en Comparison_of_Android_devices 127

Once again, the sum nowhere approaches the total. Milowent already de-bunked the redirection theory. If anything I would pay attention to the "Comparison_of_Android_devices%091055702" line. Someone tried to use this section link (or whatever it is) to access the page. It would seem Henrik's tool mis-parses this line with "%09" as a character code, and then believes the remainder of the number (~1 million) to be the number of article views. This is incorrect, that section/item only got 1 view. Matter closed. Take this up with the other tool maintainers. Thanks, West.andrew.g (talk) 01:10, 28 January 2013 (UTC)
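The suspected mis-parse can be reproduced in a few lines. This is only a sketch of the failure mode hypothesized above, not the other tool's actual code, which is not shown in this discussion; `parse_correct` and `parse_buggy` are hypothetical names.

```python
from urllib.parse import unquote

# The suspicious record, in the three-field form shown in the grep output.
raw_line = "en Comparison_of_Android_devices%091055702 1"


def parse_correct(line: str):
    """Split on single spaces first; the title stays percent-encoded."""
    _project, title, count = line.split(" ")
    return title, int(count)


def parse_buggy(line: str):
    """Hypothetical bug: decode first, then split on any whitespace."""
    # "%09" decodes to a tab, which str.split() treats as a field separator,
    # so the digits after it are mistaken for the view count.
    fields = unquote(line).split()
    return fields[1], int(fields[2])


print(parse_correct(raw_line))
print(parse_buggy(raw_line))
```

Under this assumption the buggy parser reports roughly one million views for a record that was requested once, which matches the size of the discrepancy described.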

Possible WMF labs support for your good work

I mentioned your good work here; hope you don't mind. LittleBen (talk) 01:43, 29 January 2013 (UTC)

Thanks for the mention and consideration, LittleBen. I am currently not resource-constrained, my "personal computers" that do this work are a small cluster of machines hosted at the University of Pennsylvania and paid for with research grants. If this ever becomes insufficient, I also have an account on the Wikimedia "Labs" platform. So for the time being, I think I am all set! Thanks again for the thought (and good luck with getting Henrik's tool fixed). Thanks, West.andrew.g (talk) 19:49, 29 January 2013 (UTC)
  • I was thinking that tacking a graphical display frontend onto what you have already done, and using it instead of Henrik's tool, might turn out to be the only long-term "solution" for Wikipedia. Ranking could be displayed only for items on your "Top 5000" list—but if this were on Wikimedia "Labs", then maybe the top 10,000 could be calculated; official support would also make it easier for other people to add the same feature to other-language Wikipedias (cf. the earlier request). To improve a "Top 10,000" article so that it moves up into the top 5,000 would be an incentive for some people. (A small article that I wrote a year ago is slightly below the bottom of your "Top 5,000" list.) ;-) LittleBen (talk) 00:36, 30 January 2013 (UTC)
To be blunt, I have no intentions of expanding this service. My on-Wikipedia offerings are all spurred by my academic/dissertation research (I only began collecting statistics for anti-damage purposes) -- and I desire to keep pressing towards novelty and more challenging problems. Simple statistical tabulation is neither of these; and is probably a service that Wikipedia WMF should be providing its editors. I have provided the source code I use to process and aggregate these lists; as it would not be hard for someone else to do these types of tasks. I'm not "above the project", but I believe my skills are best invested elsewhere. West.andrew.g (talk) 01:23, 30 January 2013 (UTC)
P.S. I am willing to share my data and mentor a younger student who wants to work with it. I know Wikipedia participates heavily in the Google Summer of Code for instance and a really sophisticated visualization tool might be a worthy "Senior Design" project for CIS students. West.andrew.g (talk) 01:31, 30 January 2013 (UTC)
Toolserver already has WikiTop for the Top 100 viewed monthly articles (http://toolserver.org/~johang/wikitop/english.html) and WikiTrends (http://toolserver.org/~johang/wikitrends/) tools but few know of it. Regards, Sun Creator(talk) 01:35, 30 January 2013 (UTC)
  • Sun Creator, thanks for the info. I think too few people are aware of Andrew's Top 5000 as well. If more people were aware of these, and were clamoring to see a fixed version of the broken stats tool (and see it linked from other language Wikipedias), then maybe WMF would move on getting this implemented—as part of Google Summer of Code, for example. (I agree that this "is probably a service that Wikipedia WMF should be providing its editors").
  • It seems quite likely that the existing graphical frontend of the Pageview stats is something like Gnuplot "mouseable web display formats". If so, that part (adding a graphical frontend) would require a trivial time investment, so I might become interested in doing it if nobody else can be persuaded. ;-) LittleBen (talk) 02:56, 30 January 2013 (UTC)
Looks like WikiTop hasn't been updated for a year, last file was updated January 2012. (Note that #3 and #4 on that list are anomalies, the data seems to match Henrik's tool data for Dec 2011, but don't know if that's a data error or DOS attack reflected.)--Milowenthasspoken 03:42, 30 January 2013 (UTC)

Just a little delayed this week

Wikipedia:Village_pump_(technical)#Format_Change_of_Page_View_Stats -- Why things change around here with no notification is incredibly frustrating. Regardless, I caught and fixed the error last night, and this week's fresh report is now in the pipeline. Thanks, West.andrew.g (talk) 18:08, 3 February 2013 (UTC)

  • Thanks. Am preparing Top 25 now. I assume this did not change the time period covered by the report (e.g., it is still for 7 days, which would be Jan 27 - Feb 2 in the current listing), unless you indicate otherwise. Thanks.--Milowenthasspoken 14:38, 4 February 2013 (UTC)

Super Bowl entertainment

The biggest hour for this year's Super Bowl entertainment was 02:00UTC of Feb. 4. Note that the raw statistics do not resolve redirects, therefore we have: Beyoncé Knowles at 378,923, Beyonce at 12,877, and Beyonce_knowles at 4,149. This means that during that hour, the article was averaging 100-110 hits per second. These redirect cases also make some suggestions about traffic sources. It's safe to assume that no one would actually accent the "e" in Beyoncé Knowles when doing a casual search. Thus, those arriving directly at that page are likely via Google or another search engine that handles the redirect logic (i.e., the vast majority of visitors). Meanwhile, Beyonce and Beyonce_knowles are more plausible to be direct Mediawiki searches. Thanks, West.andrew.g (talk) 17:25, 4 February 2013 (UTC)
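The per-second figure follows from summing the three redirect variants and dividing by the 3,600 seconds in the hour. A short sketch of that arithmetic, using only the counts quoted above (the variable names are illustrative):

```python
# Hourly counts for the 02:00 UTC hour, as quoted above.
hourly_hits = {
    "Beyoncé Knowles": 378_923,
    "Beyonce": 12_877,
    "Beyonce_knowles": 4_149,
}

# The raw statistics do not resolve redirects, so the article's real
# traffic is the sum over all three titles.
total = sum(hourly_hits.values())
per_second = total / 3600

# Share of traffic arriving at the accented title, which the comment
# attributes mostly to search engines that resolve the redirect.
accented_share = hourly_hits["Beyoncé Knowles"] / total

print(total, round(per_second, 1), round(accented_share, 3))
```

The combined total is 395,949 hits, just under 110 per second, and about 96% of it lands on the accented title.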

2MASS J04414489

It's definitely in the data. Every single hit for that article occurred in a one hour block: "Fri, 01 Feb 2013 22:00:00 GMT" certainly suggesting automated views (and presumably, there was a typo where they failed to include the "+" sign. Maybe the missing "+" and resulting red link is why whatever script kept on retrying, and retrying, to fetch the page). Given recently discovered issues with Henrik's tool (and the more recent one with the data in general, which I just brought up at WP:VPT) -- I don't think we should be trusting or troubled over what his output is. Thanks, West.andrew.g (talk) 18:43, 4 February 2013 (UTC)

WikiProjects

I would like to know (along the lines of this) if there is a way to pick out which articles belong to which WikiProject. Specifically, I am interested in finding out which WP:ANAT articles are listed here. Any ideas? I thought I'd post here before taking it to WP:VPT. Thanks. Biosthmors (talk) 19:06, 6 February 2013 (UTC)

This is a bit outside my scope. Is there an explicit (perhaps even hidden) category membership? Does the project page contain a list? Or is this a case where someone just drops a template on an article talk page and it is considered "included" (and why doesn't that template addition trigger an implicit category addition)? If none of these things happens (and the question would be, "why not?"), one could crawl all talk pages looking for the template in question, but that is messy and inefficient business. Thanks, West.andrew.g (talk) 19:47, 6 February 2013 (UTC)
Good points. Thanks. Biosthmors (talk) 19:48, 7 February 2013 (UTC)
Well, FWIW, I did find the complementary categories of Category:Medicine articles by importance and Category:Unknown-importance medicine articles. Also Category:Medicine articles by quality and Category:Unassessed medicine articles exist. That should cover everything. Biosthmors (talk) 19:53, 7 February 2013 (UTC)
And Category:Anatomy articles by importance, etc. appears to exist too. Biosthmors (talk) 19:57, 7 February 2013 (UTC)

Undefined?

So why is the undefined disambiguation drawing such a massive number of hits? And where are they being dabbed to? The articles linked from that dab don't seem to be drawing much traffic. Doesn't make any sense. Wbm1058 (talk) 23:57, 7 February 2013 (UTC)

Coding error in a script, where a programmer didn't properly distinguish between an undefined value and the name "undefined". --Carnildo (talk) 00:59, 8 February 2013 (UTC)
To clarify, the statistics report all requests to the WMF servers (including pages which don't exist; errors; and much else). Thus the "undefined" value doesn't actually refer to the disambiguation page undefined, but is the result of binning a common program/script output. The fact it gets "blue linked" is purely coincidental based on the fact that an article exists of the same name. Thanks, West.andrew.g (talk)

thanks

Just wanted to say thanks for making this list -- it's really interesting & useful. -- phoebe / (talk to me) 03:55, 8 February 2013 (UTC)

Reply on your talk page. Thanks! West.andrew.g (talk) 06:30, 8 February 2013 (UTC)

It would be useful to add the following:

Biosthmors has done some extensive copy editing of that section, so I will leave this for him to interpret. I will, however, note that the monthly statistics have not been updated for some time and should not be included. Thanks, West.andrew.g (talk) 05:08, 9 February 2013 (UTC)
I'm not sure it would be useful to have these... Biosthmors (talk) 18:23, 10 February 2013 (UTC)
I thought Wikitrends was good so I added it. One interesting thing is that Lists of deaths by year was down by 93.62%. This fuels my suspicion that Deaths in 2013 may have substantial non-human hits. Biosthmors (talk) 18:32, 10 February 2013 (UTC)
Are there any concerns over the statistics at Wikitrends being contradictory or possibly invalid when compared to the ones here? Biosthmors (talk) 19:09, 10 February 2013 (UTC)

Kevin Gates!

Wow, one of the top 5000 articles (#4999) is one that doesn't even exist, but I did a google news search and he's a rapper. Wow. Looks like we aren't "serving our customers" as well as we could. I assume Kevin Gates is notable. We need to track more of these "most desired" not existent articles. Can we? We have to adapt to reader demand to stay relevant. Biosthmors (talk) 21:49, 10 February 2013 (UTC)

I just created it. He had been getting hundreds of hits daily until a recent spike, I guess in line with a recent release of his. http://stats.grok.se/en/latest/Kevin_Gates How can we find these sooner? Biosthmors (talk) 22:00, 10 February 2013 (UTC)

Maybe we can get a top-500 list for the most popular non-existent pages? Biosthmors (talk) 22:01, 10 February 2013 (UTC)

I'll see what I can produce. This will be a hard and bandwidth-intensive problem, though. I first need to compute a list MUCH longer than just the top 5000. Then, I must query to see if every page in that list exists. Of course, red-links should be quite rare amongst popular articles, making the search depth incredibly deep. There will also be the usual issues in trying to distinguish what might be scripts/process/etc. versus actual page titles. West.andrew.g (talk) 22:27, 10 February 2013 (UTC)
I've written the code to do it. Due to some research stuff I'm running, it'll be a couple of hours (or more) before I can see what it outputs. Based on this we'll see if some refinement is needed and if it is viable as a public-facing list. Thanks, West.andrew.g (talk) 22:59, 10 February 2013 (UTC)
I'd be happy to see a simple list of red links and leave it to humans to decipher what is notable (worthy of becoming blue). Biosthmors (talk) 23:23, 10 February 2013 (UTC)
  • It's pretty easy to scan through the WP:5000 for redlinks, though such a tool could be useful. As of now, I see Kevin Gates was speedy deleted; that's a shame, a 7 day PROD would have been fine. That's the risk of getting past new page patrol sometimes with a very bare stub. Anyhow, I suspect there are going to be short-term popular red-links from time to time which may be hard to show are notable.--Milowenthasspoken 05:26, 11 February 2013 (UTC)
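The two-step approach described in this thread (rank far more than 5000 titles, then check each one for existence) could be sketched roughly as follows. This is only an illustration, not the code that was actually written for the report: the function name, the `page_exists` callback, the thresholds, and the sample data are all hypothetical. In practice the existence check would have to go to the wiki itself, which is where the bandwidth cost comes from.

```python
def popular_redlinks(ranked, page_exists, limit=500, min_views=1000):
    """Return up to `limit` (title, views) pairs for titles that do not exist.

    `ranked` must already be sorted by views, highest first.
    `page_exists` is a caller-supplied function (hypothetical here) that
    answers whether a title exists on the wiki.
    """
    redlinks = []
    for title, views in ranked:
        if views < min_views:
            break  # everything after this point is even less popular
        if not page_exists(title):
            redlinks.append((title, views))
            if len(redlinks) == limit:
                break
    return redlinks


# Made-up sample data, purely for illustration.
existing = {"Main Page", "Malaria"}
ranked = [("Main Page", 90000), ("Kevin Gates", 5000),
          ("Malaria", 4000), ("Some script output", 1500), ("Rare page", 10)]
result = popular_redlinks(ranked, lambda t: t in existing)
```

The output still needs a human pass, since script and bot requests will show up alongside genuine article titles.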

As a matter of personal policy, I don't get too involved in content disputes (nor am I terribly familiar with the subtleties of the processes). However, I'm curious if you could push back on the deletion process using our statistical evidence? I mean, don't thousands of attempted views somehow imply "notability"? Thanks, West.andrew.g (talk) 05:40, 11 February 2013 (UTC)

  • It's been restored now. On a speedy deletion, a request normally leads to restoration; it can be subjected to a 7-day PROD or AFD (allowing for discussion) if necessary. Biosthmors' creation was a bit vulnerable to deletion because it was so short and the claim to notability not clear on its face.--Milowenthasspoken 05:53, 11 February 2013 (UTC)

Page ratings

In the future, I think it would be helpful to also add page ratings (A-class, B-class, etc.) to the chart as well, but I don't know how difficult the coding for this would be. Remember (talk) 14:48, 8 February 2013 (UTC)

There isn't a single over-arching category of "all A/B/C class articles", is there? Is it done at the subject level, i.e., categories of "A class sports articles", "B class sports articles", and so forth? I shouldn't have to look at the talk page to get this, should I? I assume that the text format of "A class" or "A-class" is reasonably consistent and could be captured with a regex. This is 5,000 pings of the API in order to get every article's category memberships, but that is do-able. Any suggestions on what might be appropriate icons to use? Thanks, West.andrew.g (talk) 19:20, 8 February 2013 (UTC)
On the icons, User:Senra has A, B, C and start icons listed. I think the stub icon is more red. I'll look. Biosthmors (talk) 19:24, 8 February 2013 (UTC)
So {{icon|FA}}, {{icon|GA}}, {{icon|A}}, {{icon|B}}, {{icon|C}}, {{icon|START}} and {{icon|STUB}} yield the corresponding assessment icons. Biosthmors (talk) 19:29, 8 February 2013 (UTC)
And it would be really cool to know which articles contribute to which Category:Wikipedia backlogs. Biosthmors (talk) 19:32, 8 February 2013 (UTC)
;-) How about this, while I'm dreaming? If any cleanup templates like {{citation needed}} exist in the article could we include them in the list like this, for example: Malaria[citation needed]. That way it would look a little messy, but it would prompt people to actually fix the issues and improve the quality of articles, in my opinion. =) Thanks for listening! Biosthmors (talk) 20:08, 8 February 2013 (UTC)

I know how wishlists can grow, and I reserve the right to delay/ignore accordingly :-). This is a pretty large bandwidth penalty (I need to actually go and parse 5000 articles that are several kb in size... but it may be do-able). You'll need to list any/all templates that would be of use and how they might be represented in the list (I am thinking a dedicated column might be needed if we keep adding stuff). Could you also help with some of my above questions regarding "A/B/C" class and how I would go about doing that? Thanks, West.andrew.g (talk) 20:19, 8 February 2013 (UTC)

Well, Category:A-Class articles, Category:B-Class articles, Category:C-Class articles, Category:Start-Class articles, Category:Stub-Class articles, and Category:Unassessed-Class articles (with its own {{icon}}) all exist. There are sometimes articles ranked as both A and B class, by different projects, by the way. The complications begin! I hope that helps. Biosthmors (talk) 20:35, 8 February 2013 (UTC)
I think it would be pretty much identical to how you added FA and GA statuses. Biosthmors (talk) 20:36, 8 February 2013 (UTC)
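As a rough illustration of the regex idea raised above, the sketch below pulls assessment classes out of a list of category names. It assumes the category names follow the "X-Class ... articles" pattern of the categories just listed; the per-project example name, the pattern, and the function name are assumptions made for illustration, not the report's actual code. A set is returned because different projects may rate the same article differently.

```python
import re

# Matches e.g. "Category:A-Class articles" and, by assumption,
# per-project names such as "Category:B-Class medicine articles".
CLASS_PATTERN = re.compile(
    r"^Category:(A|B|C|Start|Stub|Unassessed)-Class .*articles$"
)


def assessment_classes(categories):
    """Return the set of assessment classes found among category names."""
    found = set()
    for name in categories:
        match = CLASS_PATTERN.match(name)
        if match:
            found.add(match.group(1))
    return found


# Hypothetical category memberships for one article.
sample = [
    "Category:A-Class articles",
    "Category:B-Class medicine articles",
    "Category:Living people",
]
classes = assessment_classes(sample)
```

An article whose set holds more than one class would simply get more than one icon.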
And are there other templates besides {{citation needed}} that you would be interested in? Thanks, West.andrew.g (talk) 00:38, 9 February 2013 (UTC)
Well, I could list too many from Category:Wikipedia backlogs. How many would be a good number to start with for consideration? Biosthmors (talk) 00:54, 9 February 2013 (UTC)

Okay, so the report will soon notate the A/B/C/stub/start/unassessed classifications. Could someone please update the lede/header section so it has the description for all of this? Once you do so, report here, and revert it back to normal -- and then I will restore your version once I actually run the report. Note that a page can have multiple icons next to its name, i.e., "A and B class". An icon simply means that "1 or more projects have classified this article at level "x"".

I am going to temporarily hold off on Biosthmors' suggestion. If you continue to want it, Biosthmors, you need to come up with a list of category memberships of interest (it must be categories (even if hidden ones) -- as I don't want to obtain and parse actual page content. Note that virtually all clean-up and maintenance templates implicitly add some form of category). You also need to come up with an icon or other terse system for representing membership. We could imagine a second version of the top-5000 with tons of clean-up details -- and that is fine -- but I am quite busy and need you to handle all the non-technical portions. Thanks, West.andrew.g (talk) 01:36, 10 February 2013 (UTC)

Never mind, I actually went ahead and took care of the lede/header update (though anyone is free to tweak). Thanks, West.andrew.g (talk) 14:59, 10 February 2013 (UTC)
Done and now things are back where you put them. Biosthmors (talk) 18:22, 10 February 2013 (UTC)
Actually I have a better(?) and hopefully more actionable idea for now. I'll put the other idea on hold. Could we please list Category:Delisted good articles and Category:Wikipedia former featured articles, for the articles that were but are not now either GA or FA quality? (Some former featured articles are also currently featured articles, though I don't know if the delisted good articles category contains some that have been re-promoted.) This could prompt some efforts to get top-viewed articles that might be close to their former status upgraded after some polishing. =) Biosthmors (talk) 20:28, 10 February 2013 (UTC)
And could we count them too? Thanks again for all your work here. Biosthmors (talk) 20:29, 10 February 2013 (UTC)
Will do it. Could you add some appropriate description to the header? Thanks West.andrew.g (talk) 21:17, 10 February 2013 (UTC)
Done and reverted back again. Biosthmors (talk) 01:39, 12 February 2013 (UTC)

New Top 25 coming

New WP:TOP25 list should be up late evening (Pacific time) when I get back to the Internet. One odd highly-placed entry I can't explain at the moment is Ernst Litfaß - anybody have any ideas about that one? Mary Leakey, subject of a Google Doodle, will be #1, with a whopping 2.7 million views. And Illuminati will be an amusing entry, with popularity fueled by the ridiculous tabloid claim that the fabled group was related to the Superbowl power outage and Beyonce.--Milowenthasspoken 14:09, 10 February 2013 (UTC)

Puzzling. Averaged 10 views a day during January, but has huge (200k+) sustained popularity beginning February 1. Can't find anything internally or externally to justify this type of popularity jump. Even odder due to the presence of a special character in the title -- i.e., this isn't what I would use to test a script (unless I was testing for unusual character compatibility). Unlikely to be a DOS candidate, as well. Thanks, West.andrew.g (talk) 14:29, 10 February 2013 (UTC)
New list is up, and I decided to remove Litfass from the rankings; even his German Wikipedia article saw no spike in views.[4] Litfass invented the "advertising column", so it would be an amusing spambot choice.--Milowenthasspoken 05:13, 11 February 2013 (UTC)
Great version, I smiled a few times as the Super Bowl humor continued onward. The WMF folks will never let me see it, but the "purely raw" server logs could help us make sense of this "Litfass" business and other un-explained spikes -- i.e., are all these hits coming from one IP address, or an obvious network/geographical cluster of addresses? My guess is "yes"; no one rents a global bot-net without some underlying economic or strong social motivation. These are likely mis-configured bots or crawlers (even if someone does have a sense of humor, the exposure these types of things get is beyond trivial)... Thanks, West.andrew.g (talk)
Awesome! Maybe next time you can mention why the Illuminati did what they did to catch the internet's fancy. ;-) Biosthmors (talk) 20:45, 12 February 2013 (UTC)

Martin Rycak, who was posting last-24 hour trend charts on the German Wikipedia for a few months I believe, has expanded to wikipedia.trending.eu/en/index.html, which covers 11 different language wikipedias, allowing you to see the most popular articles on those wikipedias over the past day (and smaller slices). It also has a twitter feed @trending_eu.--Milowenthasspoken 06:35, 11 February 2013 (UTC)

Interesting. This is now linked to in the version of the header that should go live after the next update. Biosthmors (talk) 01:42, 12 February 2013 (UTC)

Next week's Top 25

Due to work travel, I will likely not be able to get the next WP:TOP25 report live until Wednesday morning UTC. If someone wants to help do it, I am welcoming volunteers! Essentially it's done manually at this point: I create the new version using the latest version as my template, and I move the old version to its archive URL.--Milowenthasspoken 14:49, 14 February 2013 (UTC)

I agree. I am seeing single hours that exceed the weekly count that WP:5000 is currently displaying. I am looking into it. Thanks, West.andrew.g (talk) 05:08, 21 February 2013 (UTC)

Fixed. There were basically two entries for "Valentine's Day" in the raw statistics; one where the apostrophe was ASCII encoded and a second where it was not (and why the latter escaped the encoding is a bit puzzling). Regardless, the far less popular latter case was overwriting the former when it was encountered in processing. The software now accommodates the possibility of a title appearing twice by summing the views. An updated report will appear in a couple of minutes. Thanks, West.andrew.g (talk) 13:00, 21 February 2013 (UTC)
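The summing fix described above can be illustrated with a small sketch. It assumes the "ASCII encoded" apostrophe means percent-encoding (`%27`) and that titles use underscores for spaces; the function name and the view counts are made up for illustration and are not taken from the actual software.

```python
from collections import defaultdict
from urllib.parse import unquote


def merge_title_counts(rows):
    """Sum view counts for titles that differ only in their encoding."""
    totals = defaultdict(int)
    for raw_title, views in rows:
        # Decode percent-escapes and normalize underscores so both
        # spellings of the same title map to one key.
        title = unquote(raw_title).replace("_", " ")
        totals[title] += views  # add to, rather than overwrite, the count
    return dict(totals)


# Made-up counts: one encoded entry, one unencoded entry for the same page.
rows = [("Valentine%27s_Day", 900000), ("Valentine's_Day", 1200)]
merged = merge_title_counts(rows)
```

Accumulating into a single key means the order in which the two entries are encountered no longer matters.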

New format/list available

The exciting new annotated format is up at WP:5000 -- it's explained in the header. We need to be watchful that this report gets done early Sunday mornings (per UTC). While everything is fine on my end, Mediawiki and its API seem to get a little moody at times over having to parse a 5000 line table that includes 10,000+ images. I've tried doing it as one big edit (as I am now) -- and putting it together piece-wise -- but it seems to get a little testy regardless.

There is also a page up at User:West.andrew.g/Popular_redlinks that lists the most popular redlinks (showing any that had 1000+ views in the past week). Lots of scripts and spam bot requests it would seem. I'll try to add some heuristics so we can toss out the things that are obviously script/computationally based. I'd also appreciate if someone could author a simple header that describes the page and maybe give it a shortcut from the Wikipedia namespace.

I am going to reject/decline/waitlist all new feature requests for the next two months or so. I apologize, but I have a dissertation to complete. I will entertain them after that point in time. Thanks, West.andrew.g (talk) 04:44, 15 February 2013 (UTC)

I think things look great (thanks again), and I started a header and added shortcuts. Good luck on the dissertation. Biosthmors (talk) 19:29, 15 February 2013 (UTC)

Porn spike

If anyone is wondering why cum shot and a number of similar porn articles are in the Top 100 this week, I think it's due to an article on cracked.com, The 6 Most Terrifying Sex Illustrations on Wikipedia (nsfw).--Milowenthasspoken 15:45, 4 March 2013 (UTC)

Aho–Corasick string matching algorithm?

Why the massive number of hits on Aho–Corasick string matching algorithm, currently at #3 with 1,105,039 hits? Over a million hits and third place (!) in the past week on a small article on a subject that few people other than programmers and computer scientists would be interested in? It's not a new article either, having been around for over 10 years. —Lowellian (reply) 06:30, 4 April 2013 (UTC)

BTW, note that the #3 position effectively makes it the most visited article of all, since #1 is the main page and #2 is for "undefined" (non-existent page requests). —Lowellian (reply) 04:33, 7 April 2013 (UTC)
It's most likely an automated user bot. There are several of those on the list. Serendipodous 08:42, 4 April 2013 (UTC)
Why would anyone program a bot to retrieve "Aho–Corasick string matching algorithm" a million times?! —Lowellian (reply) 04:29, 7 April 2013 (UTC)
I don't know. Why would someone program a bot to retrieve Seether a million times? Serendipodous 14:27, 7 April 2013 (UTC)

These are scripts or bots that in all likelihood have some sort of bug or mis-configuration. When these crazy spikes occur without cause, I don't think we should be so quick to assume there is malice involved, or that the targeting of that particular page was intentional. West.andrew.g (talk) 21:58, 7 April 2013 (UTC)

You have been found to be involved in trafficking at Wikipedia

Hello. The title is made in jest, but if you had to look twice at it, you can imagine that I did the same when reading "These 5000 pages were the most heavily trafficked on the English Wikipedia". Trafficking is always said when referring to "trade or business, especially of an illicit kind", for example, trafficking in drugs or human beings. The use of the word in your sentence is not standard English, as far as I am aware. SomeFreakOnTheInternet (talk) 23:07, 21 April 2013 (UTC)

Trafficking is not "always said when referring to trade or business, especially of an illicit kind" - trafficking is the past tense or participle of 'traffic', "The exchange or flux of information, messages or data, as in a computer or telephone network". The word is being used entirely correctly. 78.149.172.10 (talk) 23:33, 21 April 2013 (UTC)
Could you cite a better authority than Wiktionary, please? A recent dictionary in print would be good. Thanks. SomeFreakOnTheInternet (talk) 23:40, 21 April 2013 (UTC)
Here are some entries in respected dictionaries online: Chambers, Oxford, Collins. SomeFreakOnTheInternet (talk) 23:46, 21 April 2013 (UTC)
Can I suggest that, to avoid being misunderstood, you amend the opening of the sentence to read, "These 5000 pages had the most heavy traffic on the English Wikipedia", or simply, "These 5000 pages had the most traffic on the English Wikipedia". This avoids the verb, which in current standard English usually has an illicit connotation, and instead settles for the safer noun. SomeFreakOnTheInternet (talk) 23:54, 21 April 2013 (UTC)
It's been changed to "accessed": [5]. Biosthmors (talk) 11:18, 22 April 2013 (UTC)
Thanks, Biosthmors. SomeFreakOnTheInternet (talk) 11:41, 22 April 2013 (UTC)

Mark Linn-Baker

Any idea why Mark Linn-Baker spiked this week? I find it hard to believe that a relatively minor actor got 4x the views of the Boston Marathon or Chechnya articles. -- phoebe / (talk to me) 04:18, 23 April 2013 (UTC)

  • Here's some more proof that folks can spam most any article into the most viewed list, though I personally don't know their mechanisms. I saw that Daniel Luke Barth was listed on Trending articles, but also saw that the article had been speedily deleted as vandalism/hoax. Yet it has gotten 1.7 million hits in two days, see [7]. A user page version remains at User:CreepyChanBrony/Daniel_Luke_Barth, which boasts how many views it got, so obviously it was a planned stunt. Someone could have just as easily been joking online about Perfect Strangers and decided to do the same to Mark Linn-Baker. Oh, and here is Daniel tweeting about the accomplishment:[8]--Milowenthasspoken 17:47, 25 April 2013 (UTC)
We should ask the analytics folks if they can query whether this is the work of a single or narrowly defined range of IP addresses. A simple script to keep re-pinging the same page is a piece of cake to write, but I don't know how high a single machine could drive the view count. Broadly distributed views are a more tricky matter. West.andrew.g (talk) 18:06, 25 April 2013 (UTC)

WP:5000 Submission for Wikimania Hong Kong!

Greetings fellow wiki statistic fans,

I am writing to inform you that I have submitted a proposal to Wikimania 2013 entitled Examining the Popularity of Wikipedia Articles: Catalysts, Trends, and Applications which is based on the earlier Signpost article of the same name. I am hoping this opportunity will provide the impetus to dig even deeper into statistical data, find some more fascinating examples, and make even more community members aware of our contributions at WP:5000 and WP:5000/Top25Report.

I don't want to unnecessarily canvass, but I will note that the page does provide space for community members to indicate their interest in the proposal (and past Wikimanias have streamed/recorded the event to provide access to those not in attendance). Regardless of the outcome of that submission, I will be in Hong Kong and would love to meet those who I interact with on this page -- or anyone who has ideas for how these statistics might be re-purposed to improve the project(s). Thanks, West.andrew.g (talk) 21:40, 27 April 2013 (UTC)

Real declines in popularity?

It appears the English Wikipedia might be getting less popular, judging from the view count of the last article on the list over the past several months (it has gone down from around 28K to 24K). Or maybe we're just receiving fewer non-human views, or maybe the construction of the list -- by excluding more non-results -- has also reduced the number. But an approximate 4K difference is fairly large.

I've also noticed an article I've worked on used to never get into the WP:5000 but now it does. It's a medical article and I wonder to what extent some topics are increasing/decreasing in popularity in relation to each other. Biosthmors (talk) 04:59, 4 June 2013 (UTC)

The authoritative source on such aggregate statistics is Erik Zachte's official site. A steady decline since January, it would seem. This will only be exacerbated soon as more and more elementary/middle/secondary schools begin Summer holiday/vacation. Thanks, West.andrew.g (talk) 23:00, 5 June 2013 (UTC)

Article counts and traffic by category

Thanks for your efforts here. Hope you don't mind I fixed a link on one of your user pages.

I'm wondering if you know of any source of statistics about Wikipedia articles (number of articles and typical traffic) that is categorized, for example: entertainment, sports, politics, science, finance, technology, etc. etc. The categorization method doesn't matter so much to me as long as it gives some kind of indication of what it is that people are interested in and looking for when they come here. Thank you. RenniePet (talk) 01:52, 5 June 2013 (UTC)

I am aware of no such resource. I know Wikipedia has some high level "portals" (e.g., Mathematics), but I am not sure that the category structure is well-defined up into these highest levels (if it were, one could begin to measure normalized category popularity -- but this would be a significant computational undertaking). As it stands, manually browsing the WP:5000 month after month should give you some indication of what people look at. Hint: Popular culture dominates, other popular websites (Google, Reddit) drive huge amounts of traffic, and certain academic topics get more exposure than they might elsewhere (e.g., articles for classic American literature). See also our Signpost article, although it still probably lacks the sociological depth you're looking for. Thanks, West.andrew.g (talk) 23:06, 5 June 2013 (UTC)
Andrew, thanks for your reply.
I can remember, many years ago (Internet time), seeing a chart that purportedly showed the content of Wikipedia, with "The Simpsons" being around 70% if I remember right. What I'm looking for, and think would be extremely interesting, is a chart like that, for real, not a joke. Or better yet, two charts, one for content and one for traffic, to see if there is any significant difference.
And I'm not really interested in doing it "month after month", just once, to get a picture of Wikipedia and what the human species is interested in anno 2013. (I'm guessing that what the human species is interested in doesn't change much over time.)
Thanks again. RenniePet (talk) 00:11, 6 June 2013 (UTC)
OK, I've found the old joke chart I was thinking of:
http://en.wikipedia.org/wiki/File:Size_of_English_Wikipedia_broken_down.png
and some alternative versions:
http://commons.wikimedia.org/wiki/File:Size_of_Wikipedia_broken_down.svg
http://commons.wikimedia.org/wiki/Category:Wikipedia_humor
More interesting, I've found a chart that is exactly what I was looking for, except
- it's five years old
- it only shows content, not traffic
http://sibelkutlucan.wordpress.com/2011/06/08/when-publishing-changes-so-does-society-and-education/
It's actually shown on Wikipedia's article about Wikipedia, and the numbers apparently come from a little scientific paper:
http://en.wikipedia.org/wiki/Wikipedia
http://www-users.cs.umn.edu/~echi/papers/2009-CHI2009/p1509.pdf
Assuming this is correct, I'm surprised "Health" is so low - but maybe not so many articles but higher traffic?
Anyway, thanks again for answering, and if you have any more insights I'd appreciate hearing them.
RenniePet (talk) 05:02, 6 June 2013 (UTC)
RenniePet (talk) 05:19, 6 June 2013 (UTC)
Were {{WPMED}} tagged articles counted as health? Biosthmors (talk) 18:10, 7 June 2013 (UTC)