Wikipedia talk:Category intersection/Archive 2

No interest?

Aren't there any sysadmins interested in implementing this? Why are there no actions to move to Semantic MediaWiki? --Chricho ∀ (talk) 11:05, 10 May 2011 (UTC)

ditto... Meclee (talk) 18:36, 7 July 2012 (UTC)

Level of complexity for our volunteer editors would be a bit of a stumbling block, I would think. - jc37 20:32, 2 September 2012 (UTC)

American women novelists

There is currently a collection of discussions going on, following some negative press in the New York Times and on Salon. A good place on-wiki to find a lot of the discussion is at Wikipedia:Categories for discussion/Log/2013 April 24#Category:American women novelists. Essentially, the problem is this: some editors were populating Category:American women novelists and simultaneously removing those articles from Category:American novelists, a project which, if carried to its conclusion, would leave the parent category populated only by men. This was upsetting to a lot of people, so now there's a great big argument about what to do about it. In the course of that discussion, a few people have pointed to category intersection as an approach that would completely obviate many such problems.

However this particular situation is addressed within the current system, it seems inevitable that similar problems will continue to arise. This is especially true since there is a kind of dogma in place that large categories are a problem, and that they must be diffused into smaller subcategories, in order to make categories more useful. However, it's not clear how useful it is for Harper Lee, an important American novelist, not to be in the category "American Novelists" simply because she's in "American Women Novelists", too. Thus, people are asking questions like "Why are large categories a problem, anyway?", and "Why should parent categories be diffused into sub-categories, instead of keeping redundant listings?"

Again, even if we address these issues within the current system, it's not clear that the current system is sustainable. Therefore, I'm very interested in category intersection as a way forward. Where does this idea stand, as far as getting the attention of the developers who would actually have the power to do something about it? -GTBacchus(talk) 04:06, 28 April 2013 (UTC)

  • The summary above is incorrect. It should read "as editors", since multiple editors were involved in these moves. I was a significant contributor to the project, but I neither created the category nor was I the first to make the moves. John Pack Lambert (talk) 04:55, 28 April 2013 (UTC)
    • Corrected, with my apologies. You were the only one I knew about who was doing mass category moves. My mistake. I've never claimed you created the category.

      I believe the rest of what I wrote above is fine, n'est-ce pas? -GTBacchus(talk) 12:13, 28 April 2013 (UTC)

  • Regardless of what's happening at the moment and who is doing what, yes, I think this is good proposal and the way to address the current system. I've placed this on my watch and would honestly be thrilled to see this implemented. Truthkeeper (talk) 13:59, 28 April 2013 (UTC)
  • To get more specific, a question for any developers out there - what would be the performance implications if we started adding all bios to Category:Men or Category:American men, and then running category intersections against sexuality, job, religion, etc. on a daily basis? These top level cats would be massive, obviously, so the searches would be trawling over large amounts of information - for example Category:Living people has 615,000 people in it - by the time we added other bios, Category:Men and Category:Women could have perhaps millions? (not sure if the dead outnumber the living here) --Obi-Wan Kenobi (talk) 14:30, 28 April 2013 (UTC)
    • Using the CatScan tool http://toolserver.org/~magnus/catscan_rewrite.php to look for intersections of woman and poet in the German Wikipedia, to a depth of 28 subcategory levels, took five seconds just now. The category "woman" on de:WP contains 71,536 articles. Looking for intersections of man and poet to the same depth took 32 seconds. The category "man" on de:WP contains 411,690 articles. Andreas JN466 02:35, 29 April 2013 (UTC)
Thanks - useful data. Now what happens if we start multiplying those searches by 2x (assuming 1M men), and those 32-second searches are happening 10,000 times/day. Can the servers handle the load? This is where we need to bring the devs into the discussion - and consider linkage with wikidata (see below). --Obi-Wan Kenobi (talk) 03:26, 29 April 2013 (UTC)
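The load question above can be put as a rough back-of-envelope calculation. This is only a sketch: it takes the 32-second figure and the 10,000 searches/day from the comments above, and it assumes that search time scales linearly with category size, which nobody in this thread has measured. The function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_concurrent_queries(seconds_per_query: float, queries_per_day: int) -> float:
    """Average number of queries in flight at once (arrival rate x duration)."""
    arrival_rate = queries_per_day / SECONDS_PER_DAY  # queries per second
    return arrival_rate * seconds_per_query


# Figures from the discussion: a 32-second search, doubled for a category
# assumed to be roughly twice as large, run 10,000 times per day.
measured_seconds = 32
scale_factor = 2  # assumption: linear scaling with category size
load = average_concurrent_queries(measured_seconds * scale_factor, 10_000)
print(f"{load:.1f} searches running at once, on average")
```

Under those assumptions the server would be running about seven such searches at any given moment, before accounting for peaks or for any caching.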
Would love to learn more about how wikidata could help here. Can we get someone from there to put together a demo of how to tag up a bio with wikidata and then somehow use that for categorization purposes?--Obi-Wan Kenobi (talk) 02:29, 29 April 2013 (UTC)
Here is Marissa Mayer's page on Wikidata, and since this discussion developed from questions about categorizing authors, here is J. K. Rowling's. From what I understand, each Wikidata entity has associated claims, which consist of a property and a value for that property, as well as qualifiers that can, for example, limit the scope of the claim to a particular time period. So in theory a person's occupation could be included (with an associated period of time), ethnicity, place of residence, and so forth. I don't know if any development work is planned on providing search capability based on a set of property values, but I assume Wikidata exists precisely to make this type of metadata management and browsing easier. isaacl (talk) 03:12, 29 April 2013 (UTC)
Ok, that's a great start. So how do we pull up all female authors? If we can do that today, we're halfway there. I love the fact that there are claims, but the claims must be sourced - nice stuff...--Obi-Wan Kenobi (talk) 03:24, 29 April 2013 (UTC)
It seems like more advanced queries are being planned for phase 3. There is a proposal for a lists task force to guide phase 3, but I'm not sure of its state. However, perhaps a conversation with the primary author of the proposal would be useful? isaacl (talk) 03:59, 29 April 2013 (UTC)
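As an illustration of the claim model described above (a property, a value, and optional qualifiers), here is a minimal sketch of how "pull up all female authors" might look over such data. The class names, the plain-string property names, and the sample entities are all hypothetical; this is not the Wikidata API or its real property identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    prop: str    # hypothetical plain-text property name, e.g. "occupation"
    value: str   # e.g. "author"
    qualifiers: dict = field(default_factory=dict)  # e.g. {"start": 1997}


@dataclass
class Entity:
    label: str
    claims: list


def has_claim(entity: Entity, prop: str, value: str) -> bool:
    """True if the entity carries at least one claim with this property and value."""
    return any(c.prop == prop and c.value == value for c in entity.claims)


def find_entities(entities, **wanted):
    """Labels of entities that satisfy every requested property=value pair."""
    return [e.label for e in entities
            if all(has_claim(e, p, v) for p, v in wanted.items())]


# Placeholder entities, not real Wikidata content.
entities = [
    Entity("Person A", [Claim("gender", "female"),
                        Claim("occupation", "author", {"start": 1997})]),
    Entity("Person B", [Claim("gender", "female"),
                        Claim("occupation", "executive")]),
    Entity("Person C", [Claim("gender", "male"),
                        Claim("occupation", "author")]),
]

female_authors = find_entities(entities, gender="female", occupation="author")
print(female_authors)
```

The qualifiers are stored but not used in the filter; a fuller version could restrict a claim to a time period, as the comment above suggests.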

Wider view: ethnicity- and gender-based categories are not for diffusion, or "thinning out" categories

It should be noted that we have the same problems with ethnicity as we have with gender. For example, James Baldwin, one of the greatest American novelists of the 20th century, is currently not listed in Category:American novelists because he is in Category:African-American novelists instead. Maya Angelou similarly is not represented in Category:American poets, because she is in Category:African-American women poets and Category:American women poets – admitted to two sub-ghettoes, but not the main banquet hall where people like Walt Whitman sit.

So Wikipedia should be doing two things:

  1. It should be made absolutely clear in policy that no one should ever, because of their gender, ethnicity, sexuality or religion, be removed from a generic, gender-, ethnicity-, sexuality- or religion-neutral category that others not belonging to their group remain in. In other words, these categories are not for diffusing or thinning out categories, but to provide researchers with additional options they might want.
  2. Long-term, we should just apply simple categories or tags like "Man/Woman", "American", "African-American", "Poet/Novelist/..." to articles and avoid categories like "African-American women poets" altogether: users who want to see a list of African-American women poets should be enabled to use the CatScan and search for articles that have the tags "Woman" + "African-American" + "Poet" applied to them.

Now, if and when we have agreement on what approach to take, we need to think about where to raise that for community discussion. Thoughts? Andreas JN466 01:41, 29 April 2013 (UTC)

  • Thanks Andreas for putting this together. How broadly would this extend? For instance, would "nationality" also be considered the sort of category appropriate for CatScan? Or perhaps my question is really, does CatScan apply to all categories, or does CatScan apply only to a selected set of categories? And if the latter, what are the selection criteria? --Lquilter (talk) 13:42, 29 April 2013 (UTC)
    • I think it should extend across the entire categorisation system. So nationality would be a category/tag that users could combine with other tags (and there would be a fairly stable list of those). The page listings produced as a result of user-defined search queries would be the functional equivalent of present-day categories combining multiple attributes.
    • CatScan queries could actually be more granular than any category available today. For example, a user could search for LGBT + man + 20th century + American + essayist. Andreas JN466 14:58, 29 April 2013 (UTC)
      • I think it sounds great. It will be very flexible and allow WP readers to both search for new concepts as they become popular, and to search for intersecting topics that are not "defining identities". AND it would help us avoid all these tedious category delete discussions based on taking offense at the presence or absence of various ethnic/gender/religion/sexuality categories.
So what's needed to make this happen? --Lquilter (talk) 17:19, 29 April 2013 (UTC)
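The tag-combination idea discussed in this thread amounts to a set intersection. A minimal sketch, assuming each simple tag maps to the set of articles carrying it; the index, the function name, and the article titles are placeholders rather than real category data.

```python
# Hypothetical tag index: tag name -> set of article titles carrying that tag.
TAG_INDEX = {
    "LGBT": {"Writer A", "Writer C"},
    "Man": {"Writer A", "Writer B", "Writer C"},
    "20th century": {"Writer A", "Writer B"},
    "American": {"Writer A", "Writer B", "Writer D"},
    "Essayist": {"Writer A", "Writer D"},
}


def intersect_tags(tag_index, *tag_names):
    """Return the sorted titles that carry every one of the requested tags."""
    if not tag_names:
        return []
    # An unknown tag behaves like an empty tag, so the result is empty.
    member_sets = [tag_index.get(name, set()) for name in tag_names]
    return sorted(set.intersection(*member_sets))


result = intersect_tags(TAG_INDEX, "LGBT", "Man", "20th century", "American", "Essayist")
print(result)
```

Because any combination of tags can be requested, no "African-American women poets" style category needs to exist in advance; the cost moves from maintaining categories to running the query.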

A working category intersection today

I wanted to share a new development from the Obi-wan labs. Take a look at Category:Singaporean poets. What I've done is create an easy way for users to do category intersections. The steps were as follows:

  1. I've taken all of the poets in Category:Singaporean poets
  2. I added them to Category:Singaporean men or Category:Singaporean women - top level, generic cat
  3. Then I created a pre-populated link to do category intersections on "Singaporean poet" + "Singaporean woman" or "Singaporean poet" + "Singaporean people of Chinese descent" + "LGBT people from Singapore" - using the WP:CATSCAN tool developed by Magnus Manske.
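A pre-populated link of the kind described in step 3 could be generated rather than typed by hand. The sketch below is hedged: the base URL is the CatScan address quoted earlier on this page, but the query parameter names (`language`, `project`, `depth`, `categories`, `doit`) are assumptions about that tool's interface and should be checked against the tool itself; `catscan_link` is a made-up helper name.

```python
from urllib.parse import urlencode

# Address quoted earlier in this discussion; the tool may since have moved.
CATSCAN_BASE = "http://toolserver.org/~magnus/catscan_rewrite.php"


def catscan_link(categories, depth=0, language="en", project="wikipedia"):
    """Build a pre-populated intersection link (parameter names are assumed)."""
    params = {
        "language": language,
        "project": project,
        "depth": depth,                       # how many subcategory levels to descend
        "categories": "\n".join(categories),  # one category per line
        "doit": 1,                            # assumed flag to run the query immediately
    }
    return f"{CATSCAN_BASE}?{urlencode(params)}"


link = catscan_link(["Singaporean poets", "Singaporean women"], depth=3)
print(link)
```

If the tool moves to a new host, only `CATSCAN_BASE` (or a template wrapping it) would need to change.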


Using this technique, we could make a big chunk of this problem go away, at least in the short term, while we're waiting for wikidata to get fully set up.

For any given category, we could just move people up to top-level national men/women/gay/straight/black/white categories (I'm thinking by-country, since that would reduce the search space somewhat), and then stick them all in non-gendered, non-ethnic, non-religious jobs, cities, what have you. And at the top of each category, editors could create pre-populated links to their favorite intersections, essentially replicating the categories that used to exist below. The cats on each bio would become vastly simplified.

Researchers would be able to intersect to their heart's content - something which is actually hard right now with hard-coded ethnic/gendered categories.

In the meantime, we would be bit by bit getting rid of *all* of the ethnic/gendered/sexuality categories, except at the very top level, because they aren't really necessary in the wikidata world - in a way we'd be priming the pump for wikidata by simplifying our bio categorization structure entirely. For now this is v0.1, but please take a look and let me know your thoughts as a possible approach to fix this mess. If all of those commenting just focused on being gnomes on this issue, we could change the whole world in a few weeks I bet, and show the outside world that we're changing, we're doing something about it, immediately.

In cases of more complex categories, we could also add links to recursively enumerate all subcategories, or even to enumerate all subcategories with a particular gender/ethnicity/etc. For example, show me all African-American men athletes, no matter where they are categorized in the Category:American track and field athletes tree - this would be trivial if you have the right categories set up to start with.
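The recursive enumeration described above can be sketched as a depth-limited walk over the category tree, followed by an intersection with a top-level category. Everything in the sample data is hypothetical (the subcategory names and the article titles are placeholders), and a real tool would read the tree from the wiki database instead of from dictionaries.

```python
from collections import deque


def articles_in_tree(root, subcats, members, max_depth):
    """Collect articles in `root` and its subcategories, down to `max_depth` levels."""
    seen = {root}                 # guards against category loops
    queue = deque([(root, 0)])
    found = set()
    while queue:
        category, depth = queue.popleft()
        found |= members.get(category, set())
        if depth < max_depth:
            for child in subcats.get(category, []):
                if child not in seen:
                    seen.add(child)
                    queue.append((child, depth + 1))
    return found


# Placeholder tree: category -> list of subcategories.
SUBCATS = {
    "American track and field athletes": ["American sprinters", "American hurdlers"],
    "American sprinters": ["American 100 m sprinters"],
}
# Placeholder membership: category -> set of article titles.
MEMBERS = {
    "American track and field athletes": {"Athlete A"},
    "American sprinters": {"Athlete B"},
    "American hurdlers": {"Athlete C"},
    "American 100 m sprinters": {"Athlete D"},
    "African-American men": {"Athlete B", "Athlete D", "Person E"},
}

tree = articles_in_tree("American track and field athletes", SUBCATS, MEMBERS, max_depth=2)
matches = sorted(tree & MEMBERS["African-American men"])
print(matches)
```

The depth limit matters for cost: each extra level can pull in many more subcategories before the intersection is even computed.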

Finally, many thanks to Magnus, who wrote the catscan tool. If we did this, we'd want to reach out to him, to make the interface/display a little nicer for the newbies. But as a hack, I think it's not a bad start.

----> Category:Singaporean poets <----

cheers, --Obi-Wan Kenobi (talk) 00:38, 2 May 2013 (UTC)

References

A place to store links to other relevant conversations:

  • [1] Posting which suggests tagging/etc with wikidata in the works, but no concrete plans on the horizon.
  • [2] nice posting on the challenges of categorization

Known issues

  • The search is slow. We need to find out what the performance might be like for much larger categories. Any search time > 10 seconds is probably way too long.
  • The UI doesn't look like wikipedia. I am chatting with the developer to see if he can address this.

Bandaid category intersection discussion

  1. Set up a template - one of the first tasks to make this workable, would be to create a template, {{cat-intersect}}, such that authors could easily add the template to the top of a category, specify the cats they want intersected (and whether they want to recurse through subcats, or not, and if so, how deep) - that way the user wouldn't need to mess with catscan URLs. So we need someone who knows how to build templates. The template should allow multiple different intersections to be added, and perhaps have collapsible sections for less frequently used intersections.
  2. Refine the catscan UI - so that results come back looking a bit friendlier, and a bit more like a wikipedia page. I plan to get in touch with the developer on this point.
  3. Choose a tree - we should choose a category tree of biographies (perhaps the American novelists tree since it's gotten so much attention? :) ), but I rather think a much smaller tree like Category:American poets would be a better start, and then implement the final solution fully, as a full-blown prototype, and see what feedback we get.
  4. Decide on the top-level categorizations - Should we categorize by nationality + gender / nationality + sexuality, nationality + job, etc, e.g. Category:American men, Category:American catholics, Category:African-Americans, Category:European Americans, Category:American politicians, Category:American journalists, etc? It would reduce the size of the search space and perhaps increase performance. I believe on the German wiki, they put all men into the "Man" category - but with such a big database as enwiki that might have major performance implications. The other advantage is, we could keep many bios right where they are - they'd just need to be added to some of the top level ethnic/gender/religion/sexuality cats (e.g. Category:American men, Category:American women, Category:American intersex people, etc.) We won't even have to do LGBT anymore - that can be split out if needed.
  5. Just do it - Then we need to go and start re-categorizing articles, and decide on the first set of intersections - will there be a somewhat 'standard' set of intersections proposed (this could even be baked into the template)? For example, always intersecting cat + men, cat + women? One challenge will be that, since we have these sorts of pre-formatted intersections, the old rules of categorization are now gone - anyone will be able to propose any intersection. So epic debates may still continue...but even if someone loses, they can still do it, they just have to do it manually.
  6. Write a bot - Once we're comfortable with the approach, someone could write a bot which would automatically de-populate gendered/ethnic cats and stick the bios in the appropriate high-level categories.
  7. Delete the old - Once we've cleaned out the lower-level gendered/ethnic cats, we could delete them, and dance on their grave, celebrating the new dawn of category intersections. Won't it be fun when we delete Category:American women novelists, and yet still have the ability to pull up the list with a click, and also see all of the American women novelists sitting next to their male colleagues in the same exact non-gendered category? We should invite Amanda to that party.
  8. Get feedback - See what the broader community thinks. Tweak it, and then roll it out more.
  9. Change the guidance - Once we have a workable pilot, several guidance pages will need to be changed to reflect the new approach.

The best part is, I believe we can do all of this ourselves - and we can salvage much of the existing category tree (we would just delete intersects like Category:African-American women poets and Category:LGBT writers from the United States), and it probably doesn't require much special technical skill or back-end database hacking. Wikidata is coming, but this hack can be here tomorrow. --Obi-Wan Kenobi (talk) 05:54, 2 May 2013 (UTC)

Comment. I like the idea in general, but: the toolserver (which runs catscan) has been quite shaky over the last months, and we should probably not expect it to become reliably stable again until everything is ported over to Wikimedia Labs. Now the problem with this proposal is that anything except for the top level categories will only be accessible if and when the toolserver is available. I'm not saying that kills the idea, but we should be aware of it when using the toolserver to replace rather than complement a functionality we now use directly on mediawiki. — HHHIPPO 07:09, 2 May 2013 (UTC)
Thanks, it's a good point - and I fully agree this would only be a complement. That's why I'm proposing to use it first and foremost just for the specific ethnic/gender/etc categories - the rest of the structure would remain, and I wasn't suggesting we'd replace *all* of the cats - it would really be an enhancement to the cats, especially gender/ethnicity/sexuality biographic cats (we'd delete most of those) - you've probably seen the recent drama around Category:American women novelists. So if the toolserver was down, people for example wouldn't be able to easily find African-American women poets or American women novelists, but they could find all American poets easily for example. I'm hoping this tradeoff is worth the benefits.--Obi-Wan Kenobi (talk) 07:17, 2 May 2013 (UTC)
Ah, OK. Yes, I've seen the drama, but I try to stay out of it :-). I didn't know the expected outcome is to delete those intersection categories. In that case this sounds indeed like a good test case for your suggestion. Of course in the long run it would be preferable to have that function integrated in MediaWiki itself, so one has more familiar formatting, internal links, related changes and all that, but until then this seems like a nice addition. — HHHIPPO 18:27, 2 May 2013 (UTC)
A technical note: Catscan should be reimplemented on labs. This can't happen yet, but it will be possible later in the year I think. Labs is much more stable than toolserver ever was, although it is more complex for the people developing the tools. I think this would work even better with a MediaWiki extension ... — This, that and the other (talk) 11:32, 3 May 2013 (UTC)
Ok - can you give more details? what does this mean? And more importantly, do we have to worry about it? If we create a template with a URL for catscan, could we just update the template once a newer/better version is in Labs? --Obi-Wan Kenobi (talk) 21:18, 3 May 2013 (UTC)
You can look at mw:Wikimedia Labs, but that page is quite technical in nature. The Labs project badly needs some simple, accessible, clear documentation, but I am not going to be the one to write it!
You shouldn't have to worry about Labs, but if you are making something that depends on the flaky Toolserver, my advice would be to wait until Labs is a viable choice. If the only way of viewing certain "categories" as we now know them is via a semi-broken server, people will not be terribly happy about it. — This, that and the other (talk) 02:07, 4 May 2013 (UTC)
Ok, thanks. I do think it's worth trying, at least at small scale within part of the tree, so we can prototype and get a feel for whether people like it in general or not. What would it take to move the cat scan tool over? Should we ask the developer if he's planning on moving it over to labs?--Obi-Wan Kenobi (talk) 05:18, 4 May 2013 (UTC)
That would be a good start. — This, that and the other (talk) 03:47, 5 May 2013 (UTC)
note I have refined the prototype, am now using Category:Singaporean poets as a better/more complex example. --Obi-Wan Kenobi (talk) 20:47, 5 May 2013 (UTC)

As the co-author of this page, I spent many months of my life on this topic, and it is nice to see that interest is still there in making it happen. Something like this proposal was tried in the past, and if I recollect correctly, more than once. We tried to populate the larger categories to illustrate how intersections would work, but also because there is a value in having large categories (they function as an index). Each time it was tried, it was quickly reverted, even if the categories were labeled as being temporary or demonstrations. Each time it got harder to implement, as the categorizations of the articles in the large categories were removed almost as fast as they were added by well meaning editors recognizing that the categorizations were incompatible with current best practice. I talked at one point with Brion Vibber to create invisible categories so we could try to do this experiment under the radar. Invisible categories were created quickly, but then all attempts to use them this way were quashed by the masses.

If you want this to happen, I'd suggest that the developers add a parallel categorization scheme in a new namespace, which I'd call tags. They would be created and populated exactly the same way that categories are, but their function would be different. The developers wouldn't have to create the means to create tag intersections from the outset, that could come later. The tags should be extremely broad -- Men, Women, British, French, Writers, Actors, etc... The guidelines for creating these tags should be well thought out in advance, along with the criteria for applying them, with the understanding that intersection would be implemented at some point in the near future. Then the developers can announce that this new namespace has been created, along with the guidelines for how they should work.

If it is not possible to create dynamic tag intersections on the fly, then the tag intersections could either be created manually or with a bot. Perhaps there is some way to cache the intersections so that they are only updated once a day, or once a week. This might greatly reduce the server load. The update timing could be dynamically adjusted as needed.
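The caching idea above (recompute an intersection at most once a day or once a week) can be sketched as a small time-based cache. This is a sketch under assumptions: `CachedIntersection` and its parameters are made-up names, the stored results live only in memory, and the `compute` callable stands in for whatever expensive intersection query the servers would actually run.

```python
import time


class CachedIntersection:
    """Serve stored intersection results until they are older than max_age_seconds."""

    def __init__(self, compute, max_age_seconds=24 * 60 * 60, clock=time.time):
        self._compute = compute          # expensive function: tuple of tags -> result
        self._max_age = max_age_seconds  # adjustable update interval
        self._clock = clock              # injectable so the behaviour can be tested
        self._entries = {}               # key -> (timestamp, result)

    def get(self, *tags):
        key = tuple(sorted(tags))        # "Woman"+"Poet" and "Poet"+"Woman" share an entry
        now = self._clock()
        entry = self._entries.get(key)
        if entry is None or now - entry[0] >= self._max_age:
            entry = (now, self._compute(key))
            self._entries[key] = entry
        return entry[1]


# Demonstration with a fake clock and a counter in place of a real query.
calls = []
fake_now = [0.0]


def slow_intersection(tags):
    calls.append(tags)
    return ["Article A"]


cache = CachedIntersection(slow_intersection, clock=lambda: fake_now[0])
cache.get("Woman", "Poet")
cache.get("Poet", "Woman")      # served from the cache
fake_now[0] = 24 * 60 * 60      # one day later
cache.get("Woman", "Poet")      # recomputed
print(len(calls))
```

The trade-off is freshness: a newly tagged article would not appear in an intersection until the next refresh, which is why the interval is left adjustable.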

I don't think it is possible to get this to work using the current categorization system. It will be too frustrating, and ultimately fail. It might work the way I described. I'd be happy to help in the effort. --SamuelWantman 01:45, 8 May 2013 (UTC)

Thanks Sam - I appreciate your insights and would love your help. I think trying to get new significant dev work started on this might be a non-starter, as the future is apparently wikidata, which has some thoughts of doing something like this (but much more sophisticated) in the future - but I have no idea how far off that is - but in any case I'm not sure if WMF would be willing to throw dev resources behind it. I've asked at several places, but have been told to go to the mailing list, which I may do next. For me, one question is, what can we do that doesn't require major changes to MediaWiki software, today? Do you think, if we piloted this in a significant tree - say British novelists for example - and got buy-in from the users there, we could do a community-wide RFC to get approval to do this at scale wherever editors felt it would be useful (esp on bios, I mean)? We could update the WP:Categorization guidelines to reflect this consensus. Then, you could point to that RFC + guidance as evidence of community consensus to de-genderize/de-ethnicize various categories. I've also just been reading a bit about how they deal with this in Italian Wikipedia - it seems they have a bio template, and this template generates categories automatically for the people. I've also been speaking with the developer of the catscan tool, and he has some ideas on UI improvements that might make it faster and feel more integrated vs going out to a separate page that looks different. In any case, I see your point about reverts - I wonder if the hidden category trick would work again? If you could also point me to past debates/discussions on this issue, that would be useful. Cheers, Obi-Wan Kenobi (talk) 02:59, 8 May 2013 (UTC)

IEG proposal on the category system in the English Wikipedia

I have submitted a proposal for an Individual Engagement Grant for the first phase of a project looking at the category systems in Wikimedia wikis. In this first phase I will research the nature of the English Wikipedia's category system, as the first step in designing ways to optimize category systems throughout WMF wikis. In later phases, I plan to

  • Research how readers and editors utilize the category system in the English Wikipedia.
  • Investigate the category systems in other language Wikipedias and in other WMF projects.
  • Explore the value and feasibility of using Wikidata as the basis for the category system across WMF wikis. If deemed appropriate by the community, work with the community to develop and implement this.
  • Utilize user-centered design methodologies to prototype various enhancements to the category system, including category intersection, to improve the user experience. If deemed appropriate by the community, work with the community to develop and implement such enhancements.

If you would like to endorse this proposal, you can do so here. I would also appreciate any other feedback, pro or con, which can be posted here. Thanks! Libcub (talk) 06:21, 7 April 2014 (UTC)

MediaWiki search + WhatLinksHere

Is there a way to combine these functions?

Like for example Wikipedia:Category_intersection#Using_MediaWiki_search_to_find_category_intersections shows how to find articles which are included in a certain category. So say I linked to Category:Time_travel_television_series

But then what if I wanted to intersect that with all articles linked from List of animated television series of 2012 ?

There is not a corresponding category for this list. The closest is Category:2012 in animation which isn't populated.

I hope it will be able to do subcategories since then something like Category:2012_television_series_debuts could be used more freely. Ranze (talk) 17:17, 8 January 2017 (UTC)

If you use WP:AWB, the "list comparer" function in that app could generate exactly that list (it includes both "articles in category X" and "links on page" as set options) — and you could then use the resulting list as either the batch-set for a processing job, or save it as a text file on your hard drive if you needed to do that for some reason. But there isn't anything at the server level that could generate that list for you as a Wikipedia page. Bearcat (talk) 21:40, 8 January 2017 (UTC)
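The comparison described above (articles in category X versus links on a page) reduces to intersecting two title lists. A minimal sketch, not AWB's actual implementation: the two input lists are assumed to have been collected already (for instance by hand, or through the MediaWiki API's `list=categorymembers` and `prop=links` query modules), and the titles here are placeholders.

```python
def normalize(title: str) -> str:
    """Treat underscores and spaces as the same, as wiki titles do."""
    return title.replace("_", " ").strip()


def compare_lists(category_members, page_links):
    """Titles that appear both in the category and among the page's links."""
    in_category = {normalize(t) for t in category_members}
    linked = {normalize(t) for t in page_links}
    return sorted(in_category & linked)


# Placeholder data standing in for the two real lists.
time_travel_series = ["Series_A", "Series B", "Series C"]
animated_2012_links = ["Series B", "Series_C", "Series D"]

both = compare_lists(time_travel_series, animated_2012_links)
print(both)
```

As noted above, this produces a list on the user's side only; nothing here creates a Wikipedia page from the result.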

Integration with Wikidata

Could I use a category intersection to create a list of topics, then feed that list into a wikidata search and retrieve additional data? Could I then create a (external) Web page that displays this info in the form of a table? Could this be done automatically? Thanks. SharkD  Talk  13:30, 12 March 2017 (UTC)