Wikipedia talk:Category intersection/Archive 1


Old discussion

--Copied from user talk:SamuelWantman#Subcategories--
Hi Rick, Haven't talked to you for a while. I'd like your opinion about something I've been thinking about quite a bit recently. As you probably know, I've been pushing for wider acceptance of duplication of articles in parents and their children for quite a while. Here's my present thinking about this, and what I'd like to propose to the developers:

  • Categories are a tool for browsing.
  • Categories are sometimes useful as an index of a subject, but often are not available as an index because they have been broken into subcategories and depopulated.
  • Many of these subcategories are in essence intersections of larger categories. For example, Category:American film directors can be thought of as the intersection of Category:Film directors and Category:American people.
  • It would be useful to have categories fully populated at the "level of notability", by which I mean that directors are much more likely to be notable as "film directors" than as "American film directors".
  • There are many category intersections that do not exist that some people might find useful. Adding more and more intersections clutters up the category listings for articles.

To address all of these things I propose the following:

  • Categories be fully populated at the level of notability.
  • The software be modified so that category intersections get created on the fly.

Here's how it would work:

  • All the categories that are intersections would be deleted and their members moved to the larger categories at the level of notability. Some of these categories would be rather large (like Category:American people).
  • New wiki-markup would be added to the software to create dynamically created subcategories. Here's how it might look:

[[Subcategory:American people:Film directors]]

This markup would be added to the page Category:American film directors. The markup would initiate a database comparison of the categories listed to find the articles and subcategories listed in both categories. The page would be displayed as a "Sub-category" instead of as a "Category" which would indicate that it was dynamically created. There might be automatically generated text that would say something like, "This sub-category contains all the articles in Category:American people that are also in Category:Film directors." Additional text for the page could be created as normal, and the subcategory could be categorized as normal.

Articles could be placed in the category directly. For example List of American film directors could still be put in the category. There should be some visual indicator of the articles that are in the category directly and those that were from the intersection of the parents to help alert editors of miscategorized articles.
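A minimal sketch of the comparison described above, assuming the category memberships are available as plain Python sets. The dictionary `category_members`, the function `dynamic_subcategory`, and the sample articles are illustrative assumptions, not anything taken from the MediaWiki code. It also marks which entries come from the intersection and which were placed in the category directly, as suggested for the visual indicator.

```python
# Hypothetical in-memory data; real software would query the category tables.
category_members = {
    "American people": {"Woody Allen", "Spike Lee", "Abraham Lincoln"},
    "Film directors": {"Woody Allen", "Spike Lee", "Akira Kurosawa"},
}


def dynamic_subcategory(parents, members, direct=frozenset()):
    """Return {article: origin} for a dynamically created subcategory.

    origin is "intersection" for articles found in every parent category,
    and "direct" for articles that were placed in the subcategory by hand.
    """
    derived = set.intersection(*(members[p] for p in parents))
    listing = {article: "intersection" for article in derived}
    for article in direct:
        # Do not overwrite an article already found through the intersection.
        listing.setdefault(article, "direct")
    return listing


listing = dynamic_subcategory(
    ["American people", "Film directors"],
    category_members,
    direct={"List of American film directors"},
)
for article in sorted(listing):
    print(f"{article} ({listing[article]})")
```

Keeping the origin of each entry is what would let an editor spot an article that was added directly but probably belongs in the parent categories instead.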

Articles would only list Categories on the bottom and not list all the Subcategories that they may be found in (unless they have been put in these categories directly by mistake). Perhaps, each category listed might have a check box, by clicking on some of the check boxes and then clicking on a link to "display subcategory" the user could go directly from the article to the dynamically created subcategory.

Does this sound like a good idea to you? Comments? Suggestions? Thanks. -- Samuel Wantman 10:26, 22 July 2006 (UTC)

Hi Samuel - yes, it's been a while. I hope things with you are going well. I haven't spent much time on categories lately, except for adding some comments about the naming conventions just recently. So, on the fly intersections? First, the general notion has been around for quite some time. Looking through the wikitech-l mailing list archives, someone even wrote the code implementing a version of category intersection ("category:Film directors/American people" would be the intersection of these two categories). The ensuing discussion pointed out that "/" was not a great choice, and brought up concerns that without including subcategories in the results this would be of only limited use. Looking at requests currently open in bugzilla, there's bugzilla:5244 and bugzilla:2285. user:Hiding and I had a discussion a while ago about using flickr style tags (which I think exists nowhere except in the VPT archives) which I think is at least similar to what you're thinking. Let me turn this around - what do you think about the flickr sort of idea (and, if you've never visited flickr, give it a try)? -- Rick Block (talk) 14:35, 22 July 2006 (UTC)
So, what do you think? -- Rick Block (talk) 04:12, 27 July 2006 (UTC)
I have been thinking about this quite a bit. I think the current system is a mess and needs changing. Have you seen the latest about category duplication at Wikipedia talk:Categorization? After months of work on this, I feel like I'm starting all over. The reality seems to be that it doesn't matter what decisions are reached through discussion, what matters is what common practice is and who is the most insistent.
So if there is going to be a change, I think it should try and hold onto the aspects of the current system that are good, and enhance it. I have looked at the flickr system, and I find it to be very disorganized. I suspect that unless handled well it could be a real mess here. But there are some good things about it, and I have been thinking about how to incorporate it into what we have. I don't want to abandon what we have, and I don't want people adding oodles of meaningless tags to thousands of articles. I'd like to keep the multiple taxonomies that we currently have, and encourage additional taxonomies. So if you take the ideas I mentioned above, I think it could work like this: Sort of a cross between what we have and flickr and apparently already doable and most of the code written according to the links you posted above.
As I mentioned, there would be no subcategories posted with articles, only primary categories, so instead of this (using Laurence Fishburne as an example):
Categories: 1961 births | African-American actors | American film actors | American television actors | Best Actor Academy Award nominees | Living people | M*A*S*H actors | Miami Vice actors | A Nightmare on Elm Street actors | People from Augusta, Georgia
You would have this:
Categories: 1961 births [ ] | Living people [ ] | American people [•] | People from Georgia (United States) [ ] | People from Augusta, Georgia [ ] | People of African descent [•] | Film people [ ] | Television people [ ] | Actors [•] | Best Actor Academy Award nominees [ ] | M*A*S*H [ ] | Miami Vice [ ] | A Nightmare on Elm Street [ ]
Show Sub-category matching all checked boxes
These categories are sort of like Flickr. None of them are intersections of other traits. Each listing would have a check box next to each category. You could check off whatever category you'd want and then click below to get the sub-category. In this case it is Category:African-American actors. It is interesting to me that most of these categories already exist. This adds a small amount of category "clutter". There are a few more categories than originally. But with this set up, ALL of the primary categories listed here would be fully populated and so would all the possible intersections of these categories. You would be able to see the intersection categories even if nobody had created the page for it, such as Category:African-American film actors from Augusta, Georgia who appeared on M*A*S*H. This is just like when somebody puts an article in a category without creating the page. The sub-category would be created dynamically by finding the intersection. If the page hadn't been created yet, it would list all the articles and also have links to the primary categories used for the intersection. Editors could continue to create pages for these intersections and structure them however they want, just as done now. There would probably need to be a new way to indicate how to code an intersection, as I mentioned above.
I also think some of the process for categorization could be automated. For example, if someone just created the Laurence Fishburne article, and put him in Category:African-American actors, perhaps the system could look at the page for the category and see that it is an intersection sub-category that has three parents. The software could make the changes to the article so that it gets categorized in all three parents.
In reading through the links you posted above, I notice that this proposal might not have the problems that were discussed. Perhaps it might actually be easy to implement. -- Samuel Wantman 07:28, 27 July 2006 (UTC)
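The automated categorization suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: the lookup table `intersection_parents` and the function `expand_categories` are hypothetical names, and the three parents are the ones checked in the example listing above.

```python
# Hypothetical table: intersection sub-category -> its primary (parent) categories.
intersection_parents = {
    "African-American actors": [
        "American people",
        "People of African descent",
        "Actors",
    ],
}


def expand_categories(requested, parents_of):
    """Replace each intersection sub-category with its parent categories.

    Order is preserved and duplicates are dropped, so an article that is
    already in one of the parents does not get listed there twice.
    """
    result = []
    for category in requested:
        # A category that is not an intersection expands to itself.
        for expanded in parents_of.get(category, [category]):
            if expanded not in result:
                result.append(expanded)
    return result


cats = expand_categories(
    ["1961 births", "African-American actors", "Actors"],
    intersection_parents,
)
print(" | ".join(cats))
```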
At first reading (and I'll read it again) this sounds almost exactly like the flickr setup, but using "category" as the name rather than "tag". I agree it would be good to keep the name as category. As it stands, categories are internally a page tied to a database search with a combination of user entered text (the "prologue" bit you enter when you edit a category) and dynamically generated content based on the database search (the list of articles or subcategories). Allowing "intersection categories" be able to be explicitly created seems like a reasonable idea as well (it's the same setup as an existing category, but with a more complicated search than a simple "all articles in this one category"). As you suggest, if these are only intersections the software could allow adding articles to such categories (by actually adding the article to all the categories that are intersected).
A couple of the Fishburne example categories bother me a bit, but I'm not quite sure what to do about it. Specifically, the "born in" hierarchy (if he was born in Augusta, Georgia, he was obviously born in Georgia and obviously born in the U.S., so Category:People from Georgia (U.S. state) and Americans seem to be implied), similarly Actors and Best Actor Academy Award nominees seem to overlap. These are strict subsets rather than intersections which means the "parent" category could theoretically be done as a union, but I'm not sure if most people would immediately understand the difference.
I think it might fundamentally be a quibble, but I'd prefer the intersection UI to be on the category page, rather than the article page. Perhaps the categories are all listed (on the article page) and you can click on any of them individually or click on the "categories" header (which takes you to the intersection of all of them). Then the "category page" shows the current list of "intersection" categories, each clickable to show all the articles in that category and with something to click (trailing "[-]"?) to remove the category from the current "intersection set"). The available intersection categories would be in a separate spot in the display (like underneath), again each individually clickable but also with something to click (trailing "[+]"?) to add them to the current intersection set. In any event, however the exact UI details get worked out I think the operations of refining or expanding the current intersection set would need to be available. -- Rick Block (talk) 14:07, 27 July 2006 (UTC)
I too am a little bothered by the Fishburne example and don't exactly know what to do with it. One criterion in designing this is that it should remain easy to get to the categories that you can now get to. Since there are many nationality sub-categories currently (Category:American actors), and virtually none by state or city, it seems useful to make them easy to create. The Oscar one does not bother me, because it is already acceptable as a place to duplicate listing people. I also think that all people categories should be populated using an ALL or NONE rule. By this I mean, if you are going to have a few people in Category:People from Georgia (U.S. state) (which I just fixed in the examples above), it should contain EVERYONE from Georgia or NOBODY from Georgia because they are all in subcategories. The reason for this is so categories can be used as subject indexes. So having the multiple categorizations shows that there is community acceptance of having this duplication.
I have reservations about taking the intersection off of the article page. If I'm understanding you, you'd first go to the category which is the intersection of everything and then remove categories from the intersection. I suspect for most articles there will only be one article listed in the intersection category. I doubt there are any other Oscar nominated African-American actors from Augusta, Georgia that appeared on those TV series. So what you are in essence suggesting is that you go to another page to make the intersection selection. If we can come up with a good interface for doing it on the article page, I think that would be better than doing it on a separate page.
Here's another idea I've been kicking around. What if there are some built-in categories for all articles? The set I'm thinking of is PEOPLE, PLACE, THING, TOPIC, LIST, EVENT. Every article would have to be classified as one of these things. Perhaps there is a name-space for each of these things, and the first thing you have to do when you create a page is decide which name-space it belongs in. For example, Suspension bridge would be a topic, Golden Gate Bridge a thing, and San Francisco a place, Battle of Gettysburg an event, World War Two a topic, etc. Along with this, when you create or edit a category there would be a checkbox that would say what namespaces would be allowed in the category. There would be a checkbox for CATEGORIES, IMAGES, TEMPLATES, WIKIPEDIA PAGES, TALK PAGES, PORTALS, PEOPLE, PLACES, THINGS, TOPICS, LISTS, EVENTS. So if Category:Entertainers does not have PEOPLE checked, you would not be able to put a PEOPLE article in the category. Perhaps, the Entertainers would show up in grey to indicate that it was not put in the category. If you clicked on the grey link you'd get a message that explained that you could not put PEOPLE articles in Category:Entertainers and to look in the subcategories of Category:Entertainers for categories where PEOPLE belong. There could also be separate sections for each of these namespaces for the category listings.
It is a clear consensus to not put people into Category:Entertainers, yet I think it would be useful to be able to see a complete index of what is in Category:Entertainers. So I've been wondering about having the ability to turn any Category into an INDEX. Perhaps there is a link at the top of each category that says "View as an Index". When you clicked on the link, you'd see the category presented as an outline. All the subcategories and articles would be combined into a single alphabetical list. The subcategories would be formatted differently from the articles. There'd also be another option that said "Show contents of all subcategories". Clicking on this would add the contents of the subcategories to the category or list. If both options are selected the subcategory contents would be indented and listed directly under the subcategory heading. Indexes would only go a set number of levels deep. Perhaps the depth of the index could be a user preference. -- Samuel Wantman 22:29, 27 July 2006 (UTC)
In rereading what I just wrote, I noticed that my new idea could change things a little. If there are separate namespaces as described, and if each is shown in a different section on category pages, then you could redo Fishburne's categorization like this:
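The "View as an Index" idea with both options selected could look something like the sketch below. It assumes a small hand-made category tree; the dictionaries `subcategories` and `articles`, the function `build_index`, the bracket notation for subcategories, and the sample entries are all hypothetical. The `max_depth` argument stands in for the suggested user preference.

```python
# Hypothetical category tree and article lists, for illustration only.
subcategories = {
    "Entertainers": ["Actors", "Musicians"],
    "Actors": ["Film actors"],
}
articles = {
    "Entertainers": ["Entertainment"],
    "Actors": ["Acting"],
    "Musicians": ["Music"],
    "Film actors": ["Laurence Fishburne"],
}


def build_index(category, max_depth, depth=0):
    """Return the index of a category as a list of indented lines.

    Subcategories and articles are merged into one alphabetical list.
    Subcategories are shown in brackets, and their contents are indented
    beneath them until max_depth levels have been shown.
    """
    entries = [(name, True) for name in subcategories.get(category, [])]
    entries += [(name, False) for name in articles.get(category, [])]
    lines = []
    for name, is_category in sorted(entries):
        label = f"[{name}]" if is_category else name
        lines.append("  " * depth + label)
        if is_category and depth + 1 < max_depth:
            lines.extend(build_index(name, max_depth, depth + 1))
    return lines


print("\n".join(build_index("Entertainers", max_depth=2)))
```

The depth limit also keeps the walk finite if two categories happen to contain each other.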
Categories: 1961 [ ] | Living people [ ] | United States [•] | Georgia (United States) [ ] | Augusta, Georgia [ ] | African descent [•] | Film [ ] | Television [ ] | Acting [•] | Best Actor Academy Award nominees [ ] | M*A*S*H [ ] | Miami Vice [ ] | A Nightmare on Elm Street [ ]
[GO]
This scheme would combine many categories together. For example, American people would be part of United States. To make this work, perhaps each section of a category could have a show/hide button. By default, perhaps any section with more than 50 entries starts out hidden. If not, then the categories would probably be too huge. -- Samuel Wantman 22:46, 27 July 2006 (UTC)
From the point of view of a general software package, I'm not sure I like the people/place/thing classification. Showing a category as an index is interesting, but I suspect it only works for subset hierarchies. It might be possible to have both explicit "intersection" categories (e.g. American actors) and "subset" categories (people born in Augusta, GA), although this might get pretty complicated pretty fast. I think since there is an example (flickr) that shows a way to deal with intersections, it might be worth keeping these notions separate and address only one (at first).
OK. So where would you like to go with this? We could enter it as a bugzilla request, or write something up as a proposal in wikipedia space to solicit more input, or post it to the Wikitech-l mailing list. Do you have a strong preference between these, or any other ideas for what to do next? -- Rick Block (talk) 17:55, 29 July 2006 (UTC)
I think we should come up with as good a proposal as we can and then invite people to come and discuss it, especially the developers. I have not been involved with the mailing lists or the irc channels, so I have no opinion about them. I notice that virtually all the old-timers who used to hang out at Wikipedia:Categorization and WP:CFD are no longer around. Things seem broken. There have been two discussions just today at Wikipedia talk:Categorization about this problem. One involves breaking up categories into English, Scottish, Welsh, etc... vs. just using British. Another is about Category:Board games.
Would you mind copying what you think makes the most sense from what we have written and starting a proposal? That way I could understand better where you are, and see if we are close to being in the same place. -- Samuel Wantman 08:51, 30 July 2006 (UTC)
Sure. I'll draft something up today. -- Rick Block (talk) 14:54, 30 July 2006 (UTC)
I'm working on it, but not done yet. I'll let you know when I have something that I think is reasonable (might be a few days even). It's harder than I thought to come up with something that's easy to use (and playing around with Flickr I can't figure out how to make it do intersections - I could have sworn this at least used to be possible). -- Rick Block (talk) 04:08, 31 July 2006 (UTC)

Start of proposal

{proposal} The current state of categories in Wikipedia is somewhat chaotic due at least in part to the lack of a category intersection feature. Many categories are in essence intersections of larger categories. For example, Category:American film directors can be thought of as the intersection of Category:Film directors and Category:American people. Use of these "subset" categories makes it difficult to find all members of a "higher level" category; either articles have to be added to both the "subset" and "higher level" categories or the members of the "subcategories" (and, recursively, their subcategories) have to be enumerated. Precisely defining the circumstances in which articles should be added to both "lower level" and "higher level" categories, and even whether this is ever appropriate, remains a source of continuing discussion among editors (see, for example, Wikipedia talk:Categorization/Archive 11).

Category intersection has been a desired feature for quite some time. Looking through the wikitech-l mailing list archives, someone even wrote the code implementing a version of category intersection. With this change, "category:Film directors/American people" would be the intersection of these two categories. The ensuing discussion pointed out that "/" was not a great choice, and brought up concerns that without including subcategories in the results this would be of only limited use. Looking at requests currently open in bugzilla, there's bugzilla:5244 and bugzilla:2285. The CatScan tool on the toolserver machine is used as a current workaround, although a feature implemented directly in the MediaWiki software itself would have applicability to all users of the software.

How should such a feature work?

First, any existing "intersection categories" would be decomposed into primary categories. So instead of this (using Laurence Fishburne as an example):

Categories: 1961 births | African-American actors | American film actors | American television actors | Best Actor Academy Award nominees | Living people | M*A*S*H actors | Miami Vice actors | A Nightmare on Elm Street actors | People from Augusta, Georgia

we would have this:

Categories: 1961 births | People of African descent | American people | Film actors | Television actors | Best Actor Academy Award nominees | Living people | M*A*S*H actors | Miami Vice actors | A Nightmare on Elm Street actors | People from Augusta, Georgia

Clicking on any of the categories would act very much like a category does today (more on this below). However, note the categories link. Today, this link goes to Special:Categories (which is a relatively useless list of all categories that exist). With this proposal, this link would go to Categories, which would be interpreted as the dynamically created "intersection category" of all categories the article Laurence Fishburne is in. From any category listing, the total number of articles in the category would be displayed (truncated to some reasonable number, like 999) and, instead of "subcategories", an interface would be provided to reduce the number of matching articles (by adding a category to the current intersection set) or to increase the number of matching articles (by removing a category from the current intersection set).

Staying with Laurence Fishburne, clicking on the "categories" link (not a specific category) would show the articles that are in all the same categories Laurence Fishburne is in (likely, just the one article). The list of categories comprising the intersection would be displayed, perhaps near the top of the category listing. Clicking any of these would remove the category from the current intersection set, and recompute the intersection (resulting in more articles being displayed). To add a category to the current intersection set (reducing the number of articles being displayed), a list of "subset categories" would be displayed plus an input box for entering an arbitrary category. The "subset categories" would be manually added as meta-data to each category. The list displayed would be the union of the subset categories added to all categories in the current intersection set.
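A rough sketch of the refine/expand operations described above, assuming fully populated categories held as Python sets. The article names "A", "B", "C", the `subset_categories` metadata, and both function names are placeholders invented for the illustration.

```python
# Hypothetical, fully populated categories (article titles shortened to letters).
members = {
    "American people": {"A", "B", "C"},
    "Film actors": {"A", "B"},
    "Living people": {"A", "C"},
}

# Hypothetical manually maintained "subset categories" metadata.
subset_categories = {
    "American people": {"People from Georgia (U.S. state)"},
    "Film actors": {"Best Actor Academy Award nominees"},
}


def matching_articles(intersection_set):
    """Articles that are in every category of the current intersection set."""
    return set.intersection(*(members[c] for c in intersection_set))


def suggested_refinements(intersection_set):
    """Union of the subset categories of all categories in the set."""
    suggestions = set()
    for category in intersection_set:
        suggestions |= subset_categories.get(category, set())
    return suggestions - set(intersection_set)


current = {"American people", "Film actors", "Living people"}
narrow = matching_articles(current)

# Clicking a category in the set removes it, so more articles match.
wider = matching_articles(current - {"Living people"})

print(sorted(narrow), sorted(wider), sorted(suggested_refinements(current)))
```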

What would the user interface for all this look like?

<need to work out more details>

New discussion

My big question to you is what is wrong with putting the interface on the article page. Working from my previous example it could look something like this:

Categories: People | 1961 births | Living | United States | Georgia (United States) | Augusta, Georgia | African descent | Film | Television | Acting | Best Actor Academy Award nominees | M*A*S*H | Miami Vice | A Nightmare on Elm Street
[Create subcategory from the selected categories]

The check boxes should look better (I don't know how to code them), and "|" is probably not the best way to separate them. This is almost a full flickr implementation. I'm trying to think of a way to implement my primary categories (People, Places, Things, etc...) I'm thinking that these would ALWAYS be the first categories, and that the software would require all articles to be placed in a primary category. Also, for the interface, the People category would always be checked and could not be unchecked. It might be possible to make it a pull down list that would let you select the other primary categories, but this seems to be a complication that is not required. -- Samuel Wantman 09:16, 1 August 2006 (UTC)

In reading through the discussions about the patch that was created to do intersections I found this explanation about why the patch would not be useful:

"I don't see how this can be more than marginally useful unless it also searches all subcategories to infinite depth (with recursion checks?!)."

This assumes the current system of putting articles into the lowest level of subcategories and removing the articles from the parents. If this were no longer the case, then this would not be a problem. Since we are discussing a system where each category is fully populated, the code will work just fine without having to search through any subcategories. -- Samuel Wantman 09:37, 1 August 2006 (UTC)

I've changed the checkboxes to Unicode check box characters (you can't check and uncheck, but the look is probably closer). The concern I have is that putting them on the article page makes the software change bigger. I think whatever we do, we're affecting the code that generates a category listing. Adding the selection mechanism to an article page affects the basic page presentation code as well. Not that this can't be done (and, I agree that it might be nice to be able to directly select the categories you're interested in), but the magnitude of this change is a little daunting. I'm OK with writing it up this way (selection from article pages).
I don't think adding a mandatory primary category is very feasible. There are over a million articles in en.wikipedia, none of which currently have a primary category. I think we have to propose a change that doesn't require touching all articles at the point the change becomes "live". We're talking about recategorizing probably every single article, but this doesn't have to be done immediately. Even if we did have primary categories, I don't understand why "people" could not be unchecked. Wouldn't you want to be able to show all articles related to, say, Augusta from the Fishburne article?
I agree with your comment about the prior implementation limitation. I think we're talking about completely flattening many of the existing category hierarchies, although I don't think we have a solution yet for the augusta/georgia/u.s. sort of issue (other than add to all). "Television actor" vs. "film actor" is another one of these, although I see in your example above you split these into "television", "film", and "actor" (acting). Without some sort of additional semantics, doing intersections with these are likely to have non-obvious results (for example, if you're looking for folks in the movie MASH "film x actor x MASH" will likely include anyone associated with the TV show as well as long as they were also in at least one movie). So long as "category" is just one dimensional (a single value), there isn't any way to fix this (we could have categories be a type/value pair, but this would be a MUCH bigger change than we've been talking about). -- Rick Block (talk) 19:18, 1 August 2006 (UTC)

I see what you are getting at. I hadn't really thought about this much, but this is one of the problems of the Flickr system. If you put people in an actor category and you also put people in a film category, the intersection of these is not just film actors. The intersection is film actors plus actors who worked on a film but never acted in one (like a stage actor who is also a film director). My first take on this is that a pure Flickr system will not work well for what we are trying to do.
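The problem with one-dimensional tags can be shown with a tiny example. The three people and their tags below are invented placeholders; the point is only that intersecting "Actors" with "Film" cannot tell a film actor from a stage actor who directed a film.

```python
# Hypothetical flickr-style tags; the people are placeholders.
tags = {
    "Person A": {"Actors", "Film"},  # acted in films
    "Person B": {"Actors", "Film"},  # stage actor who only directed a film
    "Person C": {"Actors"},          # stage actor with no film work
}

# What a dedicated "Film actors" category would contain.
film_actors = {"Person A"}

# Tag intersection: everyone carrying both tags.
wanted = {"Actors", "Film"}
by_tags = {name for name, person_tags in tags.items() if wanted <= person_tags}

# Articles the intersection returns that are not actually film actors.
false_positives = by_tags - film_actors
print(sorted(by_tags), sorted(false_positives))
```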

So one way around this is to leave the categorization system pretty much the way it is, but remove those categories which clearly are the intersections of other categories.

So this would change the category structure to this:

Categories: | 1961 births | Living People | American people | People from Georgia (United States) | People from Augusta, Georgia | People of African descent | Film actors | Television actors | Best Actor Academy Award nominees | M*A*S*H actors | Miami Vice actors | A Nightmare on Elm Street actors
[Create subcategory from the selected categories]

The general rule would be: If a category can be completely and totally determined by finding the intersection of a single set of a small number of other categories it should not be populated. If there are articles that relate to the topic they can get linked manually to an intersection category by adding a "See also" comment. For example there might be a comment to see List of American film directors in Category:American film directors which would be populated with the intersection category of Category:Film directors and Category:American people.

I'd reword the general rule a bit: If a category can be completely and totally expressed as the intersection of other categories, it should be defined only as this intersection.

This isn't as much of an overhaul as I was hoping for, but perhaps that is a good thing.

To make this system work the software would need the following upgrades:

  1. An interface to allow a user to easily choose categories to intersect.
    Both on the fly and "statically" (e.g. a "precreated" intersection category, that can have intro text). Category intersections need a URL and wikilink syntax as well. While we're at it, I'd like to see the search interface extended to include the ability to find articles in specific categories as well.
  2. Mark-up code to add the display of intersections to categories. (I'm wondering about using double colons to delineate between categories.) I'm assuming that once defined as an intersection, no articles will remain in the category. It would still be possible for the category to have subcategories. The procedure to make subcategories does not need to change.
    Hmm. Seems like there are three topics here. One is how you precreate an intersection category (so it can include a text intro and "see also" links). This includes the issues of how do you get to the create this page interface from an intersection display, what the syntax is for specifying the intersection set, and whether an intersection category can have a name (other than the intersection syntax). Another is how to display the intersection set when you're viewing an intersection category, either a precreated intersection or one done on the fly. I think this is much like the checkbox interface we're presuming is on the articles (right?). The third topic is how intersection categories relate to subcategories. I think intersection categories should probably be treated like subcategories of every category in the "intersection set". Beyond that, I suppose they should be able to be explicitly added as a subcategory to any other category. However, I don't think you can make a category a subcategory of an intersection category except by adding the category to each of the categories in the intersection set.
  3. A database to match category pages to their intersected categories. This is needed so that when someone checks off three categories for an intersection, the page to display can be found. This way every intersection category can have the same names they now have, following normal category naming conventions.
    I've been thinking more about how this is done on the fly, but if we're going to have precreated intersection categories as well then there needs to be a way to find the precreated one. I think this gets a little tricky since an "intersection" is not order dependent, i.e. [American people, Directors] is the same as [Directors, American people]. When we get 3 or 4 or 5 categories in an intersection the number of combinations that are simply different by order grows pretty fast (it's n!). I think to support this the software would have to store the intersection set in some canonical order (alphabetically sorted, perhaps) and then when looking for a precreated intersection put the desired intersection set in the same canonical order before searching. Per below, the pages for these intersection categories probably have to be in a new namespace as well, since if they're in the category namespace there wouldn't be any way to prevent someone from explicitly adding articles or categories to intersection categories.
  4. If someone tries to put an article in a category defined as an intersection, perhaps the software automatically puts the article in the parent categories. This would also have the added benefit of recategorizing the entire database of articles as they get edited. Without this feature, this will be a very difficult system to maintain.
    If intersection categories are in a different namespace, adding articles or categories to them can simply be disallowed. I think this is a better idea, but does create a transition issue. I guess if the names are distinct enough (like include "::" as you suggest), they could effectively be treated like they're in a different namespace.
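A minimal sketch of the canonical-order idea from the reply to item 3, in Python. The function name `canonical_key` and the choice of a case-insensitive alphabetical sort are assumptions made for illustration, not part of any existing software. It shows that every ordering of the same selected categories collapses to a single lookup key.

```python
from itertools import permutations


def canonical_key(categories):
    """Return an order-independent key for an intersection set (hypothetical helper)."""
    # Drop duplicates, then sort alphabetically; ties between names that differ
    # only by case are broken by the exact text so the result is deterministic.
    return tuple(sorted(set(categories), key=lambda name: (name.casefold(), name)))


selected = ["People of African descent", "Actors", "American people"]

# Three categories can be listed in 3! = 6 different orders...
orderings = list(permutations(selected))
# ...but all of them map to the same canonical key.
keys = {canonical_key(order) for order in orderings}

print(len(orderings))
print(len(keys))
print(canonical_key(["Directors", "American people"]))
```

With a key like this, a precreated intersection only needs to be stored once, under its sorted form, rather than once per ordering.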

Some sort of protection scheme will be needed to keep people from wreaking havoc on the system by turning existing categories into intersection categories that contain categorization errors or vandalism. For example, someone could go into Category:Living people and change the code so that it becomes the intersection of Category:LGBT Wikipedians and Category:Gay actors. This would vandalize 1000s of articles at once. Intersection code should probably only be added by admins. Perhaps all recategorizations as intersections need to be agreed to by the community when this proposal gets implemented. Once underway, I would think that anybody could create the page to go with an intersection as long as they could not change the intersection. -- Samuel Wantman 08:11, 2 August 2006 (UTC)

If intersection categories are in a different namespace, you can't turn an existing category into an intersection category except by creating the new intersection category and removing the old category. You could change the intersection set for an existing intersection category (e.g. change "American Directors" to the intersection of Americans and Murderers). Maybe this is a reason to have the intersection set only be in the name and not as editable metadata. -- Rick Block (talk) 14:21, 2 August 2006 (UTC)

Putting this all together

I had to start a new section. This was just getting too long.

I've been thinking about a set of criteria for what we are trying to do. Would you agree to the following?

  1. Many topic level categories that now only hold subcategories should be fully populated.
  2. Many subcategories can be created automatically by finding the intersection of their parents.
  3. The current categorization structure should not be affected by this proposal. The only perceived differences might be:
    • The ability to create category intersections on the fly. Many of these subcategories do not currently exist. All users should be able to create categories using these intersections if possible.
    • Articles will show only primary (topic level) categories on their pages.
  4. An interface will be needed for users to create intersection categories.
    • Preferably, this will be possible from any article page.
    • If possible, articles miscategorized into categories that are intersections should be automatically fixed.
  5. The system needs to be protected from vandalism.

So how about this:

  • The mark-up for creating a category intersection will just be the automatic transclusion of a page from a new namespace (sub-category? intersection? I'll use "Subcat" for the examples). Pages in this new namespace will just be lists of the categories to be used for an intersection. They will have the same name as the category page that uses them. There won't be any markup for transcluding. If a subcat page exists with the same name as a category page it will be automatically transcluded. For example Category:African American actors would have the corresponding page Subcat:African American actors which would have the following editable text:
Actors
American people
People of African descent
  • This list would only appear when the subcat page is being edited, and should always be in alphabetical order. The software can alphabetize any lists not entered in order. When not being edited the page will look like a list of links to all the articles that are the intersection of all the categories listed, so when it is transcluded it will be the contents of a category. It might also have a header with links to the categories that were used for the intersection.
  • This page can be created several ways:
    1. Administrators can create or edit the page manually.
    2. Anyone can create a category intersection by selecting categories listed under an article by checking off the desired categories and then clicking on a link to view the intersection set. The user would then see the subcat page, but would not be able to edit it. It would look like a blank category page, without a title, just displaying the categories being intersected and links to articles that are the results of the intersection. If there was already a category page created for this intersection it will be displayed, so to the user it will appear that they have just moved to that category. If this is a new subcat page the user will be able to save it and create a new category using it by selecting an option that says something like, "Wikipedia does not currently have a category like this. If you would like to create a new subcategory that is the intersection of these categories: {list of selected categories}, enter a name for the category here _______ and select 'create'". Etc... with instructions and links to the relevant policies. If the user enters the name of a category page that already exists they would be informed and asked to enter a different name, or abort. Once a valid name is entered the category page would be created, and so would the subcat page with the intersections. The user would be able to edit and categorize this page just like any other category. The only difference is that the subcat page gets transcluded as well.
    3. Using the procedure above, it would be possible to create a subcat page by adding the desired categories to a sandbox page, previewing it, checking off all the categories and saving it as a category.
  • It would be possible with this system to have the software or a bot automatically move miscategorized articles to their parents. I would suggest that this be done by adding a tag or flag to the subcat page that does not get displayed. Only admins would be able to set this flag. The flag would be needed because it would be possible to create intersections of categories that were not meant to be populated with articles. It would not be possible to vandalize existing categories, because once created the categories used for the intersection would only be editable by admins. Admins would also be able to rename and delete subcat pages if necessary.
  • The subcat pages will simply be a database of lists of category names along with the name of the page that contains the lists. When a user creates a category intersection on the fly, the categories selected will be matched with the lists. Since the lists are in alphabetical order, the selected categories can be compared in the same order to quickly find a match. If there is a match, the category that uses the subcat page will be displayed.
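The lookup described in the last bullet can be sketched in Python as follows. This is only an illustration under stated assumptions: `subcat_index` stands in for the proposed subcat namespace, `members` stands in for the category membership tables, and the article names are placeholders rather than real data.

```python
def canonical_key(categories):
    """Sort the selected categories so that order does not matter (hypothetical helper)."""
    return tuple(sorted(set(categories)))


# Hypothetical stand-in for the subcat pages: sorted category list -> category page name.
subcat_index = {
    canonical_key(["Actors", "American people", "People of African descent"]):
        "Category:African American actors",
}

# Hypothetical membership of fully populated primary categories (placeholder articles).
members = {
    "Actors": {"Article A", "Article B", "Article C"},
    "American people": {"Article A", "Article B", "Article D"},
    "People of African descent": {"Article A", "Article C", "Article E"},
}


def find_precreated(selected):
    """Return the named category for this selection, or None if it is unnamed."""
    return subcat_index.get(canonical_key(selected))


def intersect(selected):
    """Return the articles that are in every selected category."""
    sets = [members.get(name, set()) for name in selected]
    return set.intersection(*sets) if sets else set()


checked = ["People of African descent", "Actors", "American people"]
print(find_precreated(checked))
print(sorted(intersect(checked)))
print(find_precreated(["Actors", "American people"]))
```

A selection with no entry in the index would be shown as an unnamed on-the-fly intersection, which is the case where the user is offered the option to name it.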

Does this address all your concerns? -- Samuel Wantman 08:37, 5 August 2006 (UTC)

I'm not sure if this is "stop the presses" or not, but I just ran into m:DynamicPageList (a MediaWiki extension that is not currently installed here). Hmmm. I think I need some time to think about this. -- Rick Block (talk) 18:53, 5 August 2006 (UTC)
I would not be surprised if everything we need to implement this already exists. --Samuel Wantman 20:59, 5 August 2006 (UTC)

I generally agree with the criteria, although I think I might tweak the wording a little bit. For example, your first two are related and could be combined into:

  1. Many existing categories are logically the intersection of attributes for which "primary" categories exist, for example Category:American actors is logically the intersection of Category:Actors and Category:American people. Although these "primary" categories are today generally subdivided into subcategories, if they were directly (fully) populated the "intersection categories" could be automatically generated.

I like explicitly listing the criteria for the solution. One more, perhaps implied by your #3, is that the software change to implement the new solution must fundamentally be an "add-on" not requiring wholesale changes to existing articles or categories. Other additional ones might be:

  • articles should not be permitted to be directly added to intersection categories
  • both "on the fly" and "static" intersection categories must have a URL syntax, and both should have a wikilink syntax

My understanding of the substance of your proposal is that an admin creates a "static" intersection category by editing the intersection list maintained in a parallel, protected, namespace ("subcat"), while non-admins could create new intersection categories but not edit the intersection list, right? So, for example, to create Category:African American actors as an intersection an admin would edit Subcat:African American actors and include in it the intersection list. If a non-admin user was currently viewing a "non-existent" (equivalent to red-linked) intersection category, he/she could "save" this intersection as a new pseudo-category by giving it a name.

Our understandings are exactly the same.

Following this through a bit, the previously created intersection categories would have a URL and wikilink syntax exactly like existing categories, so when the software generates a category listing it has to check if the parallel name exists and, if so, then treat the category as an intersection rather than as a "regular" category. Part of treating a category as an intersection might be to disallow adding articles to the category. For statically created intersections, I think this could clearly work, although the ability to turn an existing category into an intersection category seems problematic (what happens to the articles that are already in the existing category when this is done?).

The ability to turn an existing category into an intersection category is part of the elegance of the system. First of all, only an admin would be able to do this because on-the-fly subcats cannot be created with a name that already exists. So the process is that there has to be agreement to change the nature of the category, perhaps occurring at WP:CFD, and then an admin creates the subcat page and sets the flag to move the articles. The software will move all the articles to the categories listed in the subcat page as the converted category is depopulated. So part of the process of changing a category to a subcat would be checking to see that all the articles in the category will actually belong in the parents, and creating a "See also:" section for articles that should remain associated with the category. In the examples we've been talking about it might be an eponymous article or list (List of African-American actors). If someone miscategorizes an article into a category that has a subcat list, the article would be in the category until the software or bot moved it. Perhaps the category would be listed in grey to show that it will not be there for long. This way of implementing subcats will make it possible to undertake the massive repopulation of categories that will be needed.
There should be a new section in category listings for the subcats. I'm thinking the new section would be after the display of subcategories and before the display of articles. It might say "Articles that are in category:xxx, category:yyy and category:zzz". I can see some possible uses for not setting the flag to depopulate the category. If there are a fair number of articles about the topic they could remain in the article section. But, I suspect this won't happen much and perhaps this type of category should be discouraged. The more likely use I see for this is a way for Admins to preview what the category will look like without implementing any changes that would be hard to reverse.

For "on the fly" intersections, I don't quite see how this works. I think the list of categories being intersected has to be provided in the URL, which means an "on the fly" intersection would have to have a different URL syntax than a statically created one. I think I like using the new namespace for this rather than using it to parallel the existing category namespace, so perhaps something like "Intersection:" rather than "Category:" could be used for intersection categories. Then, intersection:American people::Actors could mean the intersection of these two categories as a wikilink, leading to the URL http://en.wikipedia.org/wiki/Intersection:American_people::Actors (which might or might not lead to a "previously created" intersection). The ordering issue, where intersection:American people::Actors should be the same as intersection:Actors::American people could be addressed using a completely hidden intersection list like you suggest. In fact, the actual internal name could be in canonical (sorted) order and all other permutations effectively treated as redirects (more like synonyms) to this name (before doing the lookup, the software would parse the URL and then sort the category list, and then do the lookup). Doing it this way would keep intersections completely separate from existing categories (which I think would be a good thing).
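A rough Python sketch of the "parse the URL, sort the category list, then do the lookup" step. The "Intersection:" prefix and "::" separator come from the suggestion above; the function names and the decision to treat underscores as spaces and the namespace as case-insensitive are assumptions for illustration, and this is not a description of how MediaWiki actually resolves titles.

```python
from urllib.parse import unquote, urlsplit

NAMESPACE = "Intersection:"  # proposed namespace prefix
SEPARATOR = "::"             # proposed separator between category names


def title_from_url(url):
    """Extract the page title from a /wiki/ style URL (hypothetical helper)."""
    path = unquote(urlsplit(url).path)
    return path.split("/wiki/", 1)[1]


def parse_intersection_title(title):
    """Return the sorted list of category names encoded in an intersection title."""
    prefix, body = title[:len(NAMESPACE)], title[len(NAMESPACE):]
    if prefix.casefold() != NAMESPACE.casefold():
        raise ValueError(f"not an intersection title: {title!r}")
    # Underscores in URLs stand for spaces; empty parts and duplicates are dropped.
    names = {part.replace("_", " ").strip() for part in body.split(SEPARATOR)}
    names.discard("")
    return sorted(names)


def canonical_title(title):
    """Return the internal (sorted) name that every permutation resolves to."""
    return NAMESPACE + SEPARATOR.join(parse_intersection_title(title))


url = "http://en.wikipedia.org/wiki/Intersection:American_people::Actors"
print(canonical_title(title_from_url(url)))
print(canonical_title("intersection:Actors::American people"))
```

Both spellings resolve to the same internal name, so the other permutations behave like the synonyms described above without needing stored redirect pages.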

Using "intersection" as a separate namespace resolves the vandalism issue as well. The categories being intersected are embedded in the name, and can't be changed (by anyone). This would mean it wouldn't require any special permissions to create or edit intersections. -- Rick Block (talk) 17:28, 7 August 2006 (UTC)

I think we are on the same page. -- Samuel Wantman 20:13, 7 August 2006 (UTC)
I'm not sure if you picked up on this, but I'm suggesting using the "Intersection:" namespace all the time, even for statically created intersection categories. I think there are enough issues with recasting an existing category as an intersection that we should avoid this. Perhaps an existing category could be turned into a REDIRECT to an intersection, but manipulating all the articles when this is done seems like a pretty big deal. What I've suggested makes intersections truly an add-on feature, related to, but without any direct impact on the existing category feature. One issue we haven't talked about is sort order. If an article (or category) has two different sort keys for the categories that constitute an intersection, what happens? This has to be algorithmically specified, and should not be too complicated. I don't know exactly how this is done with regular categories, but I suspect the article's sort key is stored in some database record associated with the category (in addition to the source category reference, which is in the article). -- Rick Block (talk) 21:01, 7 August 2006 (UTC)
I'm not sure if YOU picked up on this but what I was proposing all along in this section was that the subcat: or intersection: (or whatever it is called) namespace would be used all the time, even for statically created intersection categories. There doesn't need to be a redirect. We can use the existing category pages and simply transclude the subcat page into a new section. The first thing I proposed at the top of this section was "The mark-up for creating a category intersection will just be the automatic transclusion of a page from a new namespace (sub-category? intersection? I'll use "Subcat" for the examples)." and then later said; "Anyone can create a category intersection by selecting categories listed under an article by checking off the desired categories and then clicking on a link to view the intersection set. The user would then see the subcat page, but would not be able to edit it." But perhaps I'm not understanding what you are getting at. I am saying that all the intersections happen in the new namespace, and if there is a category page with the same name the subcat page gets automatically transcluded. I think we are in agreement. This would truly be an add on feature of the existing category structure.
As for the sort keys, it is possible to add some parameters for the categories listed in a subcat page so that the software or bot can decide how to sort the articles when doing an intersection. This sort of thing already exists in WP:AWB. I suspect the easiest way to implement this is to simply select which category's sort key will be used. It might be as easy as adding an empty pipe to the category you want to use (e.g. [[Category:American people|]]). Going in the other direction, when the bot or software moves articles to the parent categories the piping can just be copied to all of the parents. Since most of the intersected categories deal with people, most of them will all be piped the same way, so I'm guessing this won't be a big problem. -- Samuel Wantman 01:23, 8 August 2006 (UTC)
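One way the "pick which category's sort key to use" idea might look, sketched in Python. Everything here is hypothetical: the subcat list is assumed to be plain lines where a trailing pipe marks the category whose sort key wins, and `sort_keys` is an invented stand-in for wherever per-category sort keys are stored.

```python
def parse_subcat_list(lines):
    """Return (categories, sort_source) from a subcat list.

    A line ending in '|' marks the category whose sort key should be used
    (an assumed simplification of the empty-pipe idea).
    """
    categories, sort_source = [], None
    for line in lines:
        name = line.strip()
        if not name:
            continue
        if name.endswith("|"):
            name = name[:-1].strip()
            sort_source = name
        categories.append(name)
    return categories, sort_source


def sort_members(articles, sort_keys, sort_source):
    """Order articles by the sort key they have in the chosen category.

    Articles without a key in that category fall back to their own title.
    """
    return sorted(articles, key=lambda a: sort_keys.get(a, {}).get(sort_source, a))


subcat_page = ["Actors", "American people|", "People of African descent"]
categories, sort_source = parse_subcat_list(subcat_page)

# Placeholder articles with per-category sort keys (invented data).
sort_keys = {
    "John Zed": {"Actors": "John Zed", "American people": "Zed, John"},
    "Mary Abel": {"American people": "Abel, Mary"},
}

print(sort_source)
print(sort_members(["John Zed", "Mary Abel"], sort_keys, sort_source))
```

If no category is marked, `sort_source` is None and every article falls back to sorting by title, which is one possible default rather than a settled rule.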
If names in the new namespace are "tied" to names in the existing category namespace there has to be some other mechanism to specify (and manipulate, etc.) the list of categories that are intersected. If the name itself specifies the categories to be intersected, this avoids the issue of permission to change the "intersection list" and I think would lead to a simpler implementation. Creating an indirection between the name and the intersected categories works well for existing categories that could be done via intersection (and, likely, for most statically created intersections), but doesn't provide a solution for "on the fly" intersections. Assuming there has to be some solution to "on the fly", I'd reuse the same mechanism for static intersections if there's a way to make it work. On the flip side, this means there wouldn't be a convenient way to do something like the sort key trick you suggest. -- Rick Block (talk) 03:21, 8 August 2006 (UTC)
If the name of the subcat page specifies the categories used to find the intersection set how do you specify which category uses the subcat page? I think we have to set it up that everything that needs to be specified (which also needs to be protected) happens in the subcat pages. One of our criteria is that average users should be able to do as much as possible. If we have to create a redirect to a subcat page, it has the potential to be abused. That was why I proposed transcluding from a protected namespace. I don't think we want to display the category as being "Category:Actors::American people::People of African descent". This is not going to mean anything to most people. We want to call it "Category:African-American actors" and allow people to edit the page. So I don't see how we can avoid some way of pointing from the category page to the list of categories. If that pointer is not in a protected namespace, we could be creating the potential for widespread vandalism.
The only alternative I can think of at the moment is to switch the name of the subcat page with the list of categories. Perhaps this is what you are suggesting? So instead of the page containing a list of categories, it would contain a link to the category page. Since the name is not identical to the category page, you'd have to create a database to know which subcat page gets displayed. If a category used to create the subcat gets renamed, the page would have to be moved. The subcat page contents could still specify which category to use for piping and still have the switch for depopulating and moving articles. So in essence this is pretty much the same -- we'd just be switching the subcat page name (and URL) with the list of categories. One disadvantage I see is that an admin would need to know the categories used for the intersection to be able to edit the page, or there would have to be a "subcat" tab that only appeared for admins when looking at categories. But the biggest problem is that it becomes much more complicated to prevent the situation where a category ends up with two different subcat pages. It seems more natural to have a one to one mapping. I'm not seeing an advantage to doing it this way.
I guess I don't understand the problem you are trying to solve. It seems you are concerned about what the URL for the on-the-fly intersection set would be. It could get some temporary name, perhaps generated from the category intersected as you suggest. When the user decides to save it as a category, it would get renamed to the category name (except in the subcat namespace) and the user would then be able to edit the category page. Why is this a problem? -- Samuel Wantman 04:47, 8 August 2006 (UTC)
This is a problem because the fundamental way a wiki works is that there's a URL corresponding to the page the user wants to see. However you pick the categories comprising an (on the fly) intersection, the result is a URL. The URL must include the categories since it's the only thing the web server sees. There basically can't be a "hidden" file (of any sort) that stores the list of categories to be intersected for the "on the fly" case. It could have an invented name, but the list of categories has to be included in the URL. Hence, just invent a namespace (I'm suggesting "intersection:") for this purpose where the name, in this namespace, is a list of categories separated by some specific separation string ("::"). Once we do this, I think we're done. I think this means the name doesn't match any name in "category" namespace. It seems you're trying to ensure there's a correspondence between names in "category" namespace and "subcat" namespace. I think this is not a fruitful approach. This clearly can't be the case for an "on the fly" intersection. So, why not just let the names in the new namespace simply be different? -- Rick Block (talk) 05:16, 8 August 2006 (UTC)
OK. I understand what you are getting at. So I'll write it up as the intersection pages will contain the link to the category page. Functionally it is the same. -- Samuel Wantman 05:49, 8 August 2006 (UTC)

Interesting discussion guys. I happened to find it because I am repeatedly frustrated by the non-existence of the intersection feature! But I wonder how useful it is to actually discuss what the categorization should be if we had that feature. I think it's premature to discuss the topic (which will inevitably be controversial) before the feature actually exists... Pascal.Tesson 01:48, 12 August 2006 (UTC)

Hello! I was very surprised to see another person commenting and wondered how you found it until I realized that by putting the proposal tag on top it was automatically categorized. I've removed the tag.
I'm coming at this from the background of being a software designer. This is a feature that has been talked about for a long time, but there hasn't been much discussion about how to do it, and what it would look like. The assumption previously was that it could not be done effectively because of how categories were structured (parent categories were depopulated into their children). I have found the depopulating of parents to be a very annoying feature of categorization and with Rick, we've been looking into coming up with an alternate view of things. It looks like most of the code for doing what we are proposing is already available, and that this proposal is probably doable. We have been talking about inviting more eyes to look at this when we had a proposal together, and if we can garner support, present it to the larger community. -- Samuel Wantman 02:32, 12 August 2006 (UTC)

Differences

OK. So now we get to the point where I'm pretty sure we disagree. Starting at Wikipedia:Category intersection#Defining categories as intersections I have a very different idea for how this should work. Rather than tie intersections to what might look like an existing category in any way, I'd prefer if intersections were essentially completely separate. Once you're on an intersection page, all that needs to happen is you click edit and you're editing the "description" (which I think simply defaults to the null string). You seem to want to tie the intersection to a "regular" category, which requires creating the correspondence, resolving the permission issues, resolving the issue of what happens if a user "adds" an article to a category/intersection, auto-transcluding an "intersection" page when the associated category page is viewed, etc., etc. This section and the whole next section on "Managing categories" go away if the intersection is an independent entity (perhaps there should be something very simple saying how to redirect a category to an intersection). Intersections have to be fairly independent of existing categories because of the "on the fly" issue.

Yes, I think we do disagree. I've read over what you have written and I have some questions and comments. I can see many disadvantages to what you are proposing, and few if any advantages. My primary concern is that what you are talking about turns a significant portion of what had been a community edited categorization system of organized multiple taxonomies into a more technical and hard to understand system of intersections. I think most users of Wikipedia will have no idea what an intersection is and how to create one. What I'm proposing is that when a user gets to Category:African American actors (now linked to my mockup), the only thing that will look any different from any other category is the comment that says something about being "Pages having the following categories: Actors, American people, and People of African descent". What you are proposing is that they get to Intersection::Actors::American people::People of African descent. In your scheme, all the names of the pages will only be understood by wiki-geeks. I don't think that would be user friendly. If you abolish all the subcategorization categories and all the intermediate categories such as Category:Occupations by nationality and the like, you have created a much less friendly environment for browsing. People have spent quite a bit of effort (yourself included) coming up with naming conventions for all these categories. Perhaps I'm not understanding you, but it sounds like you want to throw all of this away. I don't.
I DO want to tie the intersection to a "regular" category, which requires creating the correspondence. I think I have resolved the permission issues: all of the dangerous editing happens in the new namespace, which is restricted to admins. The issue of what happens if a user "adds" an article to a category/intersection is also resolved, and I think it is a benefit to have the software move it. Auto-transcluding an "intersection" page when the associated category page is viewed is no more complicated than adding articles to categories. In this case we have added an intersection to a category. I am not seeing the complete picture for what you are proposing. If there is "something very simple saying how to redirect a category to an intersection" then how does it work? What happens when people add articles to that category? I don't understand the concern you are talking about when you say "Intersections have to be fairly independent of existing categories because of the "on the fly" issue." The "corresponding" category WILL already exist for all the currently existing categories that will need to be replaced by intersections. I'm sure that it's a lot simpler to not even try to create a mechanism to tie a name in category space to intersection space, but would it be a user friendly system? If Category:American film directors basically goes away (and ultimately becomes a redirect to Intersection:American people::Film directors) and if, as you say, Category:American film directors won't show up on any articles, then it matters whether it exists as a "category" because Category:American film directors also appears in several categories including being a subcategory of Category:American people by occupation, Category:Cinema of the United States and Category:Film directors by nationality and being linked from Category:English-language film directors. These categories help the non-technical types find what they are looking for.
I'm hoping we can resolve this. So explain to me how this would work without category names that make sense in English?
Let's start with some things we can agree on. Primary categories won't change (except for becoming fully populated). Intersections, however they're named, won't show up on articles. "On the fly" intersections (ones that are not precreated) have to look different from "primary" categories (they don't have a description, they aren't subcategories of any higher level categories except the categories in their intersection set, they don't have a name in "category" namespace, etc.). The taxonomies that exist are incomplete and inconsistent. A replacement system should be easy to use. When looking at an article, a user will be able to easily get to any primary category or any combination of the primary categories listed. From an article if a user clicks a "primary" category they'll get to a category like we know today. If they pick and choose some categories (from the checkboxes, that you suggested) and then click "show articles in all selected categories" they get to an "intersection" category, which may or may not be a "precreated" intersection.
So far so good.
I assert the checkbox system on articles will be the primary mechanism used to get to "intersections" (90% of wikipedia's users are browsers, not editors - they get here from google hits - categories are rarely visited compared to articles - hence, traversal from articles is the primary mechanism a user will get to anywhere). I think we agree an "on the fly" intersection will look different from a "primary" category, and the issue is whether a "precreated" intersection should look more like a "primary" category or more like an "on the fly" intersection. To get to a "primary" category a user clicks an individual category. To get to an "intersection" (precreated or not) they click "show articles in all selected categories". If we distinguish "precreated" intersections from "on the fly" intersections (as you want), there are two (more or less noticeably) different result pages that show up when they click "show articles in all selected categories". One looks very much like a "primary" category. One looks different. How will a user know which sort of page they should expect? If, instead, all "intersection" categories look essentially the same (but unlike "primary" categories), I suspect they'll easily understand the difference. The trick is to make the equivalent of the existing "intersection" categories easily accessible from the relevant "parent" categories. I'm not sure I know all the answers, but I think I'd rather have precreated intersections look like "on the fly" intersections and find a way to make them accessible from parent categories than make precreated intersections look like primary categories. It seems like most users won't know what to expect if we sometimes show them a "category" and sometimes show them an "intersection" when they've done approximately the same thing (picked a set of categories from an article and clicked "show articles in all selected categories"). -- Rick Block (talk) 04:39, 15 August 2006 (UTC)
Perhaps you are not understanding what I am proposing. I am expecting that all of the categories that currently exist that can become intersections will be converted quite quickly, and they will look like this, which is hardly any different from what they look like now. For the moment, let's ignore the broader/finer feature you've come up with because it could be added to either way of doing this, yours or mine.
As new intersections are created by checking off categories from article pages they would look very similar to category pages, but stripped down, and they would see instructions for how to turn the intersection into a category. Imagining the first time someone tries the intersection of actors, Afr-Amers, and People of Afr descent, it would look like this. So they enter the page name at the bottom, click the button to save the name and they'd go directly to the category editing page. So I suspect that very quickly, huge numbers of intersections will be named and categorized.
Most users would very rarely see an uncategorized intersection. When they do, I don't think they would think it strange because it will look very much like a category and there will be an explanation about why it doesn't have a name. The un-named intersection will be viewed as the way into the category system, just like when you search for an article that doesn't exist.
It occurs to me that if the categories selected for an intersection do not generate any articles besides the one you started at, there should probably be a message that says "No other articles found" and you would not have the option to save the intersection as a category. There is no point in cluttering up wikipedia with empty categories or categories with just one member.
I'm hoping you will put some thought into how your system would work. I had many of the same questions, and thought about it for quite some time. You might come up with a better solution than I have. I think I have answered the questions.
I'm considering your finer/broader stuff because it might add some needed power and flexibility. Perhaps it could just list primary categories nearby, or even have a way to navigate up and down the structure to ANY category without leaving the page. Another possibility is to edit the intersection page and add categories that could be added to the display. Yet another possibility is showing recently displayed or visited primary categories. --Samuel Wantman 10:53, 15 August 2006 (UTC)
Yes, this is the point at which we disagree. Reading through everything you've written it seems you want to generally hide intersections behind what mostly look like ordinary categories (close enough that a casual observer would not realize the difference), expect that most intersections will be precreated, and (I suspect) want to shield most users from even knowing about intersections as opposed to "regular" categories. My view is that we want users to know about this feature and hiding it as you suggest would tend to make it unknown. I doubt we'll ever have "most" or even "most of the interesting" intersections precreated (I think the number of possible combinations is something like 2**N, which grows very large, very fast). The taxonomies we have now are incomplete and many of them are contentious (e.g. the recently deleted Category:Academy Award Nominated Black Performers). I'd like to essentially get rid of lots and lots of "intersection style" categories, not preserve them, and encourage users to use the intersection feature to slice and dice articles however they'd like (mostly "on the fly"). Yes, it's useful to be able to traverse from Category:People by occupation to Category:American people by occupation (and vice versa), but I'd vastly prefer we invent a new way to do this, likely involving moving between category space and "intersection" space, than hide the existence of intersections from most users.
One reason I'd like to keep categories and intersections obviously different is to help prevent the occurrence of users mistakenly adding articles to intersection categories. I know you've attempted to address this, but it distinctly has a price. If a user adds an article to an intersection category you're suggesting the article automatically (or by a bot) be added to the primary categories comprising the intersection. I think users who do not understand the difference between "primary" categories and "intersection" categories will be confused by this (they well may be attempting to add an article that's already in the primary categories to an intersection). IMO, deliberately masking the intersection feature will lead to a tremendous amount of confusion of this nature. Cleanly and obviously separating intersections from categories introduces a certain amount of complexity to the mental model of what's going on, but I think the basic concept is a familiar one to most people and if it's "in your face" (so to speak) they'll readily understand it. I share your concern about the usability aspects, but I'm still hopeful that we can resolve them.
I notice in your proposal there's no particular way once you're at an intersection category to traverse to a different but perhaps related intersection category (specifically, any "upward" category tags will lead to primary categories and any "downward" intersections have to be precreated and have to belong to all categories in the current intersection). For example, from Category:American film directors you'd be able to get to Category:American people and Category:Film directors (and possibly Category:People by occupation) since they're all upward primary categories, but not Category:American people by occupation (upward, but not primary) or Category:Film directors by nationality (also upward but not primary) or Category:Academy Award winning American film directors (downward, but doesn't exist). I think you would be able to get to category:African-American/Black film directors because it exists and would be categorized in both Category:American people and Category:Film directors, so it would show up as a subcategory. I think this means there has to be a checkbox sort of interface available on categories (both "primary" and "intersection"), mirroring the one on articles, allowing "direct" traversal to other intersection categories.
You've misunderstood this part of the proposal, and perhaps I need to make this clearer. The categories would not be categorized any differently than they are now. I don't want to clutter up primary categories with scores of subcategories. Let's keep the structure we've evolved. You'd go from Directors to Directors by nationality to American directors to American directors by ethnicity to African american directors etc... The only change I see down the line is that some primary categories may need to be split (due to size) and there will be more "xxx by yyy" categories. So the page displayed for Category:American film directors would have TWO sets of category links. The first set appears with the transcluded intersection which has links to the primary categories. The second set would have the same categories as present. This would still be at the bottom of the page. Also all subcategories would remain, and there would probably be many more of them. So the upward Category:American people by occupation and Category:Film directors by nationality would still be the parents of the category. Category:Academy Award winning American film directors would likely be a subcategory. This is in my mockup. -- Samuel Wantman 06:58, 16 August 2006 (UTC)
You're saying "intersection categories" (the ones in category space) can be a parent category for categories, but not any articles? I think this would lead to even more confusion about whether these are "real" categories or not. How can a category be a member, but not an article? I suspect this gets back to different goals that we have. I think maybe I'd summarize your goal as "do something to make the current intersection style categories work better". My goal is to introduce a new feature that would make the current intersection style categories completely unnecessary, dramatically simplifying the current category structure (flattening the static hierarchy we've developed while replacing its function with a more powerful searching mechanism). Perhaps we disagree about how broken the current hierarchy is. IMO, it's basically a rat's nest. I have an approach/avoidance thing with CFD, occasionally thinking it's worth paying attention to, but any time I do I quickly realize what a complete and utter mess the current categorization "structure" is - at least part of this feeling comes from these static intersection categories (which leads to the issue of whether an article should be in both a "precise" descendant category and various "parent" categories). If we can make intersection categories go away, I think we've accomplished something very significant. My thought all along has been that intersections should be essentially orthogonal to categories, i.e. "categories" become used only for "primary" attributes, while any combinations can be found (dynamically) with intersections. -- Rick Block (talk) 14:11, 16 August 2006 (UTC)
I'm saying that categories can have whatever we decide to put into them, subcategories, articles, or intersections. We'd decide through discussion and consensus what is appropriate. It may be that you are correct, and we'll get rid of all the categories and just have a simple way to make intersections. It might be that we keep the complete system of traversing categories the way we do now and simply repopulate some categories with intersections instead of individual articles. There might be some categories where it makes sense to have both intersections and articles in them. It might be that we just use categories as a way to give intersections a better name in English instead of in Wiki-markup, and add descriptive text. We could just use my system this way, and to help maintain categories. The intersections transcluded can be emphasized or de-emphasized as things evolve. All of these would be possible in my system, and we can see how it evolves. It would also allow for this evolution from where we are today. I think it would be hard for us, discussing this here, to know what the best solution is. So why not create a system that allows for enough variation so that we can evolve the best policies and creative solutions? (I'm also wondering if you saw my most recent comment below) -- Samuel Wantman 19:52, 16 August 2006 (UTC)
I'll think more about traversing between categories and intersections. I think I'm seeing a hint that you'd be OK with creating a distinction if the traversal issues can be solved. I didn't spend any serious time on the broader/finer checkbox thing I came up with, but it seems like perhaps something in this vein might work. -- Rick Block (talk) 01:23, 16 August 2006 (UTC)
Yes, spend some time thinking about the traversing question. Here is what I think our differences are: We both have the intersections. They work the same. How they appear is similar, and that can be adjusted. What is different is that I don't like the names they have, and I want them to be able to be reached by the categorization structure we already have. If they are to be reached from other categories they should have a name that is intelligible, and I end up back where I started. I can see the value of having category pages for adding comments, links, some introductory text, etc... As you propose, it is possible to have this information on the intersection page, and perhaps we can give the intersection page a secondary title that is used when it is put in categories. There could be a separate section in category pages that is just for showing intersections that are in the category. I can see this as a possibility. But then I think, you'd have to put intersections into other intersections so you could traverse to even finer intersection levels. So in essence intersections would be categories of intersections. We'd also probably still want the bots to move articles from the categories that will be removed and replaced with intersections. Then there is another big problem: You won't be able to delete these categories because they may be needed as placeholders in the intersection scheme. For instance, you would have to keep Category:African-American actors because it will be the parent of Category:African-American actors by state and similar "African-American actors by xxx" which will all have to be categories because they are not intersections of anything. So either you have to allow categories to be put into intersections so you have a structure where you go back and forth between categories and intersections as you traverse up and down the categorization structure, or you have to keep everything as categories, and categorize the intersections.
This is essentially the system I'm proposing, except that the intersections are automatically transcluded. The intersection page could just be listed as a link, but I don't see that as an advantage. So I'm back where I started.
But let's go back to the alternate version where you go back and forth between intersections and categories. In this version, the intersections work like categories because finer level categories can be put into the intersections. What does the mark-up for that look like? This seems like a mess of confusion. You would put a link to the intersection page in a finer category which would be an instruction to categorize the page in an intersection page (are you confused yet?). Then you have to deal with the problem that people will try to create category pages that are essentially the same as the intersection pages, so we will be spending quite a bit of time tagging and deleting categories because they are duplicates of intersection pages. So what is better about this? I just find it confusing.
I can see your point about helping educate users about intersections, so how about if I just add a line to my categories in the section that is transcluded that says "This category is defined as the intersection of the following categories." Users could follow the link and understand how to use and create intersections. -- Samuel Wantman 02:24, 16 August 2006 (UTC)

Is the issue the traversal from "primary" categories? If so, then perhaps adding a [[category:xxx]] line to an intersection's description causes it to be listed in some special way in the "parent" category listing (or perhaps this should happen automatically, since we know the "parent" categories of an intersection).

Yes, this is the crux of the issue. What will this look like for users -- especially new users. What you suggest here is essentially the same as having this happen in category space without the added benefit of the category being readable in English. Why make a distinction between these two spaces that most users will not understand?

We haven't talked much about non-primary categories (categories whose members are exclusively categories) - maybe we need a way to provide a hint that certain intersections may be useful, like category:Film directors and any direct subcategory of category:people by nationality. Maybe creating a description for intersection:American people::Film directors causes it to show up as a selectable intersection when viewing either category:Film directors or category:American people. It'd be cool to have a way to provide a hint that this category and any member of another category might be a useful intersection (like category:Film directors and subcats of category:people by nationality). It seems like this should be done from category:Film directors and category:People by nationality, but I'm not quite sure what the syntax should be. -- Rick Block (talk) 05:05, 14 August 2006 (UTC)

What you are suggesting here is what I have tried to design. It seems to me that the best way to get to these intersections is to leave everything looking like it does now and have the category contents include the intersection. Until we can agree, let's work on this in parallel. I've added links to my mockups. -- Samuel Wantman 09:13, 14 August 2006 (UTC)

Laurence Fishburne

I haven't read all of this, but shouldn't the Laurence Fishburne article in your example end up with categories such as "MASH", and not "MASH Actors", "Miami Vice", and not "Miami Vice Actors"? If someone wants to know the actors in Miami Vice, they should be able to check the intersection of those two. Just a thought. --Kbdank71 11:01, 30 August 2006 (UTC)

I think the counter-example here is an actor who later becomes associated with Miami Vice, but not in an acting role. For example, if a famous actor produced or directed Miami Vice. Looking at the "American people" example, you might similarly ask why you can't intersect "American" and "people". I think these sorts of cases will make it difficult to pin down the concept of a primary category, but I do like this proposal. I just haven't had time to read it again and decide which bits I like best. Carcharoth 11:09, 30 August 2006 (UTC)
It is possible to have the category "Actors" and the category "MASH". People would just need to realize that the intersection is not necessarily "people who acted in MASH", but is instead "People involved with MASH who also are Actors". I suspect this will get to be a commonly understood distinction. CFD will probably have discussions about whether a category should be replaced with an intersection or vice versa. -- Samuel Wantman 19:54, 30 August 2006 (UTC)

Already existing extensions or tools

Some of these may have been mentioned already, but just in case....

Existing software that's good to be aware of:

  1. The intersection mediawiki extension - aka DynamicPageList2.
    This extension now (February 2007) has some new features. See the new homepage at mediawiki.org and the special DPL demo website.
    I (mw:User:Algorithmix) continued the work of Unendlich, Dangerman, Haykinson and Amgine. I am interested in making DynamicPageList even more powerful and eventually positioning it as an official part of Mediawiki software.
    -- mw:User:Algorithmix 14:03, 17 February 2007
    I don't have experience with this... you can try pinging the contributors (n:en:User:IlyaHaykinson, n:en:User:Amgine, w:de:Benutzer:Unendlich, m:User:Dangerman) and seeing if they have comments about real-world use (eg. is it slow? what sort of categorization schemes do they find work well?)
    Pong. Two comments:
    1. Real world, server load and speed. Logical operations on categories generate complex database requests. We tried to keep the SQL queries to a minimum in DPL2 and I am not aware of any complaint on speed or load. But that doesn't mean there won't be issues with big wiki communities like Wikipedia. However, for comparison, the DPL (not yet DPL2...but they have the same load and speed concerns) seems to be used successfully on Wikinews (see n:Wikinews:DynamicPageList). For other real-world info, there is a stub list of DPL2-powered wikis on the extension page. I hope this helps a little bit.
    2. User interface. I don't think the user interface - or I'd say the way users would do category intersection as mentioned in the proposal - is very flexible, the separate Intersection space in particular. One day, users may want the same thing for Union (you mix up articles from 2 categories together - different from showing articles in category 1, and then in category 2) or Negation (articles not in a category) or all operations together. That's why, in the long term, offering the user syntax of Category_math_feature seems like a more logical and flexible solution. --(m:User:Dangerman) Dangerville 13:21, 3 September 2006 (UTC)
    There are two distinct use cases we're attempting to address and I think a DPL (or DPL2) technique only addresses one of them. The two are 1) where there is an "intersection category" a user creates and adds to the category "hierarchy" (which DPL would address) and 2) where a user effectively "browses" (by some mechanism) to a dynamically created "intersection" that no one has previously set up in any way. CatScan is more like the latter case. To make it more general, the new namespace could have some name other than "intersection" (perhaps "operation" or something). The real benefit of using a namespace is that it comes with a URL and wikilink syntax almost automatically. -- Rick Block (talk) 19:46, 4 September 2006 (UTC)
  2. Duesentrieb's CatScan on the toolserver
    Lets you experiment with category intersection using the current categorization scheme. Allows searching to some limited depth, but it can get pretty slow.
  3. Semantic MediaWiki extension
    The ultimate point of where this proposal is going... lets you answer questions like "list all female mayors who were in office after 1950". If you're serious about wanting complicated UIs and such, you may want to look into when it might be feasible for it to be used on something like Wikipedia. (mp3 from wikimania 2006)
    They also mentioned that their queries can be pretty slow.
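To make the intersection, union and negation operations mentioned under item 1 concrete, here is a minimal sketch using plain Python sets. The category names come from this discussion, but the member lists are invented sample data, and this is not how DPL2 or CatScan actually evaluates anything.

```python
# Hypothetical category memberships (sample data, not real category contents).
film_directors = {"Spike Lee", "Martin Scorsese", "Akira Kurosawa"}
american_people = {"Spike Lee", "Martin Scorsese", "Mark Twain"}

# Intersection: articles that are in both categories.
intersection = film_directors & american_people

# Union: articles from the two categories mixed together into one list.
union = film_directors | american_people

# Negation: film directors that are NOT in "American people".
negation = film_directors - american_people

print(sorted(intersection))
print(sorted(union))
print(sorted(negation))
```

A general "category math" syntax would amount to letting users compose these three operators freely, which is the flexibility argument made above.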
This proposal calls for reworking most of the categorization system. I wanted to note that at least some reworking is required to make categories useful for automated tools, because the current categorization system is set up to only support human use. Automated use of current categories (eg. CatScan) can work up to a point, but currently, if you try to search as far as possible in an effort to get complete results, you get increasingly unrelated results the deeper you go. --Interiot 12:08, 30 August 2006 (UTC)
Thanks for the link to the Category Ladder tool. I'd been looking for something like that for ages! I was having to manually construct category trees, and this does it for you! Carcharoth 19:37, 30 August 2006 (UTC)
Yes, this proposal calls for reworking the categorization system. The German language Wikipedia has already repopulated primary categories and has removed intersection categories. I think it would be a difficult battle to get that to happen here before category intersection is implemented. Once implemented, more and more people will climb on board, and categories can be quickly depopulated, repopulated and redefined. -- Samuel Wantman 20:07, 30 August 2006 (UTC)

This is very interesting

I need time to think about it, but it looks like a great extension to the category functionality (with possible caveats of "people could be resistant to change" and "I wonder what the processor load would be?"). Syrthiss 12:13, 30 August 2006 (UTC)

Categories are currently implemented with a database, and the "large category" performance issue was addressed by limiting the search result. The intersection search results would be similarly limited, so I think the question boils down to how efficiently the underlying database can do intersection queries and how much such a feature might be used. In a real time sense, I think an intersection query could be quite fast (google seems to do OK, for example). It might be interesting to know what percentage of database queries are currently category related. I don't know, but I'd guess not much (I'd guess there are way more history, watchlist, and previous version related queries than category queries). -- Rick Block (talk) 13:55, 30 August 2006 (UTC)
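As a rough illustration of what a size-limited intersection query could look like, here is a sketch against an in-memory SQLite database. The table and column names (categorylinks, cl_from, cl_to) are modelled on MediaWiki's category link table but simplified (page titles stand in for page ids), and the intersect helper, the sample rows and the default limit are all assumptions made for this example. It says nothing about how the production database would actually perform.

```python
import sqlite3

def intersect(conn, categories, limit=200):
    """Return up to `limit` pages that belong to every category listed."""
    placeholders = ",".join("?" for _ in categories)
    sql = (
        "SELECT cl_from FROM categorylinks "
        f"WHERE cl_to IN ({placeholders}) "
        "GROUP BY cl_from "
        # A page qualifies only if it matched all of the requested categories.
        "HAVING COUNT(DISTINCT cl_to) = ? "
        "ORDER BY cl_from LIMIT ?"
    )
    rows = conn.execute(sql, [*categories, len(categories), limit])
    return [row[0] for row in rows]

# Simplified, hypothetical schema and sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE categorylinks (cl_from TEXT, cl_to TEXT)")
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, ?)",
    [
        ("Spike Lee", "American people"),
        ("Spike Lee", "Film directors"),
        ("Martin Scorsese", "American people"),
        ("Martin Scorsese", "Film directors"),
        ("Akira Kurosawa", "Film directors"),
        ("Mark Twain", "American people"),
    ],
)

print(intersect(conn, ["American people", "Film directors"]))
```

The LIMIT clause plays the same role as the existing cap on large category listings: the result page stays bounded no matter how large the underlying categories are.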

Support

I will have to check the implementation, but I wholeheartedly support this concept. Subcategorization constitutes, I dare say, 50% of the controversy at WP:CFD ... and boolean categories constitute the other 50%. But more importantly than making CFD easier, this will actually make categories useful to researchers -- which must be the guiding purpose of any project on Wikipedia. Right now, our category system is barely an improvement over traditional text searches. --M@rēino 14:03, 30 August 2006 (UTC)

I add my full support! -- SatyrTN (talk | contribs) 15:28, 30 August 2006 (UTC)
I want this as soon as possible. please implement Chendy 21:38, 2 November 2007 (UTC)

Query on intersecting

You have in your Laurence Fishburne example the categories "People from Georgia (US State)" and "People from Augusta, Georgia". Now forgive me if my US geography is wrong, but isn't it possible to extrapolate the former from the latter? If we put someone into "People from Augusta, Georgia", shouldn't it be possible to make the database populate Laurence Fishburne into "People from Georgia (US State)", and even "American people"?

Other than that, yes, it looks like a good idea, and would solve a lot of issues with categorisation. Be interested in a developer's comments. Well done! Hiding Talk 21:55, 30 August 2006 (UTC)

The issue about these categories, "People from Georgia (US State)" and "People from Augusta, Georgia" is how they are defined, and how they will be used. We're assuming that people will want to find "Actors from Georgia" or "Politicians from Augusta" etc... So you need to have an intersection that arrives with the correct population of articles. The intersection will depend upon how the categories are defined. Right now, the subcategories of "American people" have people who are American citizens, but the "People from Georgia" are not necessarily citizens. Politicians from Augusta may be people who are neither citizens nor people originally from Georgia. The distinctions between these categories can and should be made clearer, and they might change when category intersection is implemented. It would be possible to categorize any person closely associated with a location with the location. So ex-mayor Willie Brown could be put in category:San Francisco not because he was born there, but because he is a person very much connected with the city. He might also be put in the categories "People born in Texas" and "American people". There are other ways to do this as well. These sorts of conversations need to happen with quite a few categories. The criteria for making decisions about this should be, "What will people be looking for?" and "What are the primary distinctions?". -- Samuel Wantman 23:38, 30 August 2006 (UTC)
Hmm. How would you stop Willie Brown showing up as a mayor of Texas? And I'm still unclear as to why it wouldn't be possible for the database to extrapolate back and see someone categorised as a mayor of San Francisco as being categorised as a mayor. Hiding Talk 00:00, 31 August 2006 (UTC)
The extrapolation is much more involved. We have not put that into this proposal. I'm thinking Willie might be categorized as "Mayors of San Francisco", (perhaps duplicated in "Mayors"), "People born in Texas", "San Francisco", "California Assemblyman", "American people", "Mineola, Texas", "Members of the California State Assembly", "People of African descent", "Politicians", "San Francisco State University alumni", "Alpha Phi Alpha brothers", "Freemasons", "1934 births", "Living people". This is more categories than currently (my mistake using an undercategorized article as an example!) I don't think you can say that "San Francisco mayors" is the intersection of "San Francisco" and "Mayor". He is a San Franciscan and a Mineola, Texan who was a mayor. That doesn't make him mayor of both places. So "San Francisco Mayors" is not an intersection and would be a primary category. The value in this proposal is not that it is going to lower the number of categories of all articles. For many articles, like this example, there will be more categories. The value is that you will be able to find the intersection of these categories. "San Franciscan African-American members of the CA State Assembly", "Freemasons from Texas", "San Francisco St. Univ. alumni born in 1934", "Politician Alpha Phi Alpha brothers", etc... It might be possible to add "Category unions" as the next step in the process. Unions would remove the need for duplications. I don't think duplication between parent and children primary categories will be a big problem. Perhaps only 2 or 3 levels in the hierarchy if at all. To make this work, categories must be fully populated. The first option describes methods of keeping categories fully populated that are the union of other categories. This system could be used in the other options as well. -- Samuel Wantman 02:08, 31 August 2006 (UTC)
It might be possible to automatically "upward populate", but I think this would be a little tricky to get right. The focus of this proposal is intersections, and basically doesn't address the issue of primary categories that are subsets of other primary categories (e.g. mayors of San Francisco and mayors, or people from Augusta, Georgia and People from Georgia). It's fairly clear for intersections to work, these categories have to be fully populated. This proposal just doesn't include a solution. -- Rick Block (talk) 02:14, 31 August 2006 (UTC)
Okay, so you're proposing that only specific categories become union level categories, for example category:born in foo. It seems like this is mostly going to work for geographic categories and top level occupation categories. I can see a use in this, yes. It's going to be interesting working out the union level cats, it would certainly allow the cat system to develop a better structure. It's a shame we can't also work on getting the database to extrapolate, since this would reduce the number of categories that would need to be on a page at any given time. My hope for the structure was that we could see Willie in Mayors of San Francisco and this would automatically place him and display in Mayors of U.S. places and Mayors. Hiding Talk 09:12, 31 August 2006 (UTC)
It is possible to have software or bots take care of this. This is from the first option, but something similar could be added to any of the options. I'm not sure it is as tricky as Rick thinks. -- Samuel Wantman 09:58, 31 August 2006 (UTC)
I'm not saying we can't do this, just that I'd prefer to consider it to be outside the scope of the current proposal. If we decide on an approach that results in some namespace having permanently protected "metadata" content about categories, it likely makes sense to keep this in the back of our minds as another use of such metadata. And, BTW, I think it's tricky because any time the software automatically does something in addition to what you've specifically asked there's an issue of whether the automatic thing should be undone if you undo what you originally did. For example, if (through metadata or any other mechanism) adding an article to "Mayors of San Francisco" causes the article to automatically be added to "Mayors" (and, probably, "Mayors in the U.S." and "Mayors of California cities"), what happens if the article is removed from "Mayors". Do we automatically populate but not automatically remove? How many upward lines do we allow, only one or multiple? Isn't anyone who's a mayor also a politician? What happens if we introduce a new level of hierarchy affecting existing articles? Again, I'm not saying this can't be done but there are a fair number of issues that would need to be resolved and I'd rather not have this discussion sidetracked. -- Rick Block (talk) 14:33, 31 August 2006 (UTC)
Sorry, I don't mean to sidetrack the discussion. I'm happy to support the proposal. Hiding Talk 15:05, 31 August 2006 (UTC)
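For readers following the "upward populate" exchange above, here is one hypothetical way it could be sketched: derive the ancestor categories at read time from a parent map instead of writing them into the article. The PARENTS map and the effective_categories helper are invented for illustration (the category names are taken from the comments above, the parent links are assumed), and this is not part of the proposal.

```python
# Hypothetical parent links between primary categories (assumed hierarchy).
PARENTS = {
    "Mayors of San Francisco": ["Mayors of California cities"],
    "Mayors of California cities": ["Mayors in the U.S."],
    "Mayors in the U.S.": ["Mayors"],
    "Mayors": ["Politicians"],
}

def effective_categories(explicit):
    """Return the explicit categories plus every ancestor reachable via PARENTS."""
    seen = set()
    stack = list(explicit)
    while stack:
        cat = stack.pop()
        if cat in seen:
            continue  # already visited; also guards against cycles
        seen.add(cat)
        stack.extend(PARENTS.get(cat, []))
    return seen

# The article is only tagged with the most specific category.
print(sorted(effective_categories(["Mayors of San Francisco"])))
```

Because nothing derived is stored on the article in this sketch, removing the explicit tag removes the derived memberships too. It does not settle the other questions raised above, such as how many upward lines to allow or what happens when a new level is added to the hierarchy.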

Wikipedia:Category math feature

Have you seen Wikipedia:Category math feature? Not sure how similar this is, but there's a good consensus on the talk page Wikipedia talk:Category math feature#Straw poll, although drawn from a small pool which has grown over time. Hiding Talk 22:38, 30 August 2006 (UTC)

Prima facie support with question

Having just scanned the (seemingly very well-prepared) proposal, I think I'd support its development; the Laurence Fishburne examples suggest it might be very useful. I've printed it out and will read more carefully anon. One question for now, with apologies if I've scanned past the answer: Is there an intention to provide folk with (say) an option on their preference page to use or ignore the system...?
Looks like some sterling work!  Best wishes, David Kernow 01:12, 31 August 2006 (UTC)

We're not thinking of a user preference setting to use or ignore the system. The basic point is to replace manually updated "intersection" categories with automatically dynamically generated ones. -- Rick Block (talk) 01:58, 31 August 2006 (UTC)
Ah, okay; thanks for clarification. Regards, David 02:45, 31 August 2006 (UTC)

automatic generation vs the old hype for xml

In the dark ages when XML was first coming out, it was proposed by some as a way to search and get good hits. So you could search for, say, duck as recipe=duck to only get hits on duck when it was in a recipe. I always thought that was a needed improvement. So I see this proposal as a step in the right direction. It would be nice if it was dynamic rather than something generated. But I guess that will depend on performance and something is better than nothing. Vegaswikian 03:02, 31 August 2006 (UTC)

I'm not sure I understand the distinction you're making between dynamic vs. generated. The intent is that these intersections reflect a database query executed at the time the page is requested to be viewed, i.e. the list of matching articles is generated dynamically from the current contents of the intersected categories. -- Rick Block (talk) 04:14, 31 August 2006 (UTC)
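A minimal sketch of what "a database query executed at the time the page is requested" could look like. It uses Python's built-in sqlite3 module and a made-up, simplified table called cat_members with placeholder article titles; this is not the real MediaWiki schema or the proposed implementation, only an illustration of the idea that nothing is stored between page views.

```python
import sqlite3

# Hypothetical, simplified membership table; not the real MediaWiki schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cat_members (page_title TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO cat_members VALUES (?, ?)",
    [
        ("Article A", "Film directors"),
        ("Article A", "American people"),
        ("Article B", "Film directors"),
        ("Article C", "American people"),
    ],
)


def intersect_categories(conn, cat_a, cat_b):
    """Return the titles found in both categories, computed at request time."""
    rows = conn.execute(
        """
        SELECT DISTINCT a.page_title
        FROM cat_members AS a
        JOIN cat_members AS b ON a.page_title = b.page_title
        WHERE a.category = ? AND b.category = ?
        ORDER BY a.page_title
        """,
        (cat_a, cat_b),
    ).fetchall()
    return [title for (title,) in rows]


# First "page view": only Article A is in both categories.
print(intersect_categories(conn, "Film directors", "American people"))

# Someone adds Article B to the second category; the next page view reflects
# it because the list is recomputed from the current category contents.
conn.execute(
    "INSERT INTO cat_members VALUES (?, ?)", ("Article B", "American people")
)
print(intersect_categories(conn, "Film directors", "American people"))
```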
We are both looking for the same result then. I was a little worried by the use of automatically generated above. As far as I'm concerned, the result is dynamic. Vegaswikian 04:17, 31 August 2006 (UTC)

Can't wait...

Good work. I hope people won't bicker too long about the details of this proposal because such a feature is way overdue. I think it makes sense to fast-track this as soon as possible and then reflect in a few months on how well the proposed interface works so that we can make improvements on it. In that spirit, I support the proposal as is. Pascal.Tesson 07:53, 31 August 2006 (UTC)

Locations

How will locations work? I'm thinking that you would need to define locations with a great deal of precision for this to work. You could not simply list someone in a category of Valencia but you would need to provide exact details: Valencia, Spain or city=Valencia and country=Spain for the Spanish city, and Valencia, California or city=Valencia, state=California, country=United States, to be able to extract the intersections correctly. I'm assuming that United States would map to North America and Western Hemisphere somehow. You would also need to have multiple entries for birth, death and other significant events. Being from foo, as pointed out above, is not of much use in many cases. This would require more work and a lot of thought, but would make it easier to build intersections that are virtually 100% accurate. As pointed out above, choosing the primary categories would be critical here. Vegaswikian 00:43, 1 September 2006 (UTC)

I've been mulling something over that directly relates to the Location question. There's a slight mod that might make things a whole lot easier. It may have already been covered here, so forgive me if I'm stating the obvious.
What I'm thinking is that if taxonomies are more rigorously applied, the concern over searching through sub-categories is alleviated. From the example above, it's a fact that Valencia,_California is a subcat of California, and that Valencia,_Spain is a subcat of Spain. If the taxonomy is "hardwired" or something, then when the article on Tom Smith is put in the Valencia,_Spain category, the "super-categories" (those that his cat is a member of, not relative worth) are pre-determined. Thus a search for "people from Spain" knows what cats/sub-cats/sub-sub-cats... to look in.
As an added bonus, a "stricter" taxonomy would make the queries from the database much easier and more efficient.
As was just stated, though, determining primary cats is key. -- SatyrTN (talk | contribs) 01:11, 1 September 2006 (UTC)
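A rough sketch of the "hardwired taxonomy" idea, assuming the subcategory links are available as a plain dictionary. The function names and the sample data (apart from Tom Smith and the two Valencias mentioned above) are made up for illustration. Note that it treats every subcategory link as "is a subset of", which is exactly the assumption questioned in the reply that follows.

```python
def all_subcategories(root, children):
    """Return root plus every category reachable below it in the taxonomy."""
    seen = set()
    stack = [root]
    while stack:
        cat = stack.pop()
        if cat in seen:
            continue  # guards against cycles in the category graph
        seen.add(cat)
        stack.extend(children.get(cat, []))
    return seen


def articles_under(root, children, members):
    """Collect the articles in root and in all of its sub-categories."""
    found = set()
    for cat in all_subcategories(root, children):
        found |= members.get(cat, set())
    return found


# Hypothetical taxonomy: category -> list of direct subcategories.
children = {
    "Spain": ["Valencia, Spain"],
    "California": ["Valencia, California"],
}
# Hypothetical membership: category -> set of article titles.
members = {
    "Valencia, Spain": {"Tom Smith"},
    "Valencia, California": {"Jane Doe"},
}

print(articles_under("Spain", children, members))
print(articles_under("California", children, members))
```

A search for people from Spain then only has to walk the Spain subtree, so the two Valencias never get mixed up.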
The problem here is the semantic meaning of category membership is not defined. "X is in category Y" might mean "X is a subset of Y", or "X is a type of Y", or (really) nearly anything. The software has no clue what the universe of meanings is or which meaning any particular categorization carries. -- Rick Block (talk) 02:33, 1 September 2006 (UTC)

What's up with...

So I visit a category this afternoon and there's a new toggle for "Show as tree | Show as list". Is this new? Any info, anyone? -- SatyrTN (talk | contribs) 00:29, 2 September 2006 (UTC)

See Category Tree !   Regards, David Kernow 03:01, 2 September 2006 (UTC)
Or maybe the next link along: Meta:CategoryTree extension. Carcharoth 02:17, 3 September 2006 (UTC)
See also Special:CategoryTree! Carcharoth 02:20, 3 September 2006 (UTC)

It looks different now. I guess when they are done deciding how it will work, I'll update the mockups for this proposal to include the feature with intersections. -- Samuel Wantman 09:52, 3 September 2006 (UTC)

So what's next?

Where do we go from here? I've put requests on the talk pages of several developers, including the author of meta:DynamicPageList2 and Brion Vibber. I put a copy of the proposal in Jimmy Wales's hand when I saw him at a meetup two weeks ago. His response (via e-mail) was:

I read it. I support it, it seems really powerful. It brings up a lot of interesting questions, but one of the things I realized while reading it is that too much "a priori" thinking is not needed. The value of it is clear, and the actual day to day epistemic questions that I start having really will have to be sorted out in the old fashioned wiki style of negotiation and compromise.

So what is the next step to make this happen? Rick and I have not publicized this widely. It seems premature to do so before we can get confirmation that this is not going to cause problems with the server load. Any suggestions as to how to proceed? -- Samuel Wantman 08:35, 2 September 2006 (UTC)

I think the next logical step is to make more people aware of it. It is going to need to have many more questions asked/answered before it is adopted. Having Jimbo's support is certainly a good start. The technical feasibility questions need to be evaluated by the MetaWiki Developers before getting too far along. Cacophony 09:43, 2 September 2006 (UTC)

One thing that needs to happen is that categories need to be changed from "multiple overlapping hierarchies" to "non-overlapping hierarchies". The current data isn't in a usable form. For instance, just try to simply "list all company names" or "list the names of all actors", never mind the intersection part... it can't be done right now. The data is a mess and needs a lot of fixing, which isn't necessarily surprising, but before you can start fixing data, you have to get agreement that policy can be changed to allow us to start changing the data in that direction. I brought it up before at Wikipedia talk:Categorization#Category relationships, and current consensus seems to be "categories aren't meant to be used like that". --Interiot 10:10, 5 September 2006 (UTC)

That might be contentious to do (as it is quite philosophical) unless the implementation is about to be done, and done for sure. I think intersecting categories would be very cool and would really like to see it, though... I foresee a lot of the category reshuffling as amenable to bots (put everything in a sub cat in both supers seems automatable to me), but probably not all. The bots should take notes on what they did, I suspect (more than just their contribs log, notes on why they moved something from X to Y) so humans can check their work easily. ++Lar: t/c 14:33, 5 September 2006 (UTC)
This is exactly the reason why Rick and I have not been posting links to this page from large community forums. We first wanted to see if the people who spend a bit of time working with categories would approve of the proposal, then we need to get developers on board. At that point we can present this as the new software feature that is on the way. Does anyone read German? It might be useful to know the history of categorization at German Wikipedia. Jimmy mentioned that they had already made the decision to move articles into primary categories and deleted intersection categories. I'm wondering how the decision making process went. --Samuel Wantman 17:38, 5 September 2006 (UTC)

Simplification

Many ideas in the proposal sound great to me, but much of it sounds too complicated. I think that adding new pages to the wiki namespace seems unnecessary. "User created category intersection" (the Flickr-like checkboxes) hits the nail on the head. This could be a very powerful tool for browsing categories. The user would just select the desired categories, then click a "go" button to open a new search that displays all articles that are contained within all the selected categories. This would require less work to implement, lead to fewer categories being required, and it wouldn't have "holes" (any category could be cross-referenced with any other category, or with multiple categories, without editors having to create intersection pages for every possible combination).

The other big issue here is fully populated "Primary" categories. The real trick is to make (some/all?) subcategories appear automatically in supercategories. For example, Laurence Fishburne should not need to be included in both Cat:People from Augusta, Georgia and Cat:American people. Being a member of Cat:People from Georgia should automatically include him in Cat:American people. OK, I guess that is where it starts getting complicated. What would be really cool would be to have a simplified category tree appear when scrolling over the category name and then be able to broaden/narrow the intersection "search" that way. Cacophony 09:29, 2 September 2006 (UTC)
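A small sketch of the checkbox-style "select categories, then go" search described above, assuming category membership is available in memory as sets. The function name pages_in_all and the sample data are hypothetical. Starting from the smallest selected category keeps the working set small, and any combination of categories can be intersected without anyone creating an intersection page first.

```python
def pages_in_all(selected, members):
    """Return the articles that appear in every selected category."""
    # Start from the smallest category so the working set shrinks quickly.
    sets = sorted((members.get(cat, set()) for cat in selected), key=len)
    if not sets:
        return set()
    result = set(sets[0])
    for other in sets[1:]:
        result &= other
        if not result:
            break  # nothing can match any more, so stop early
    return result


# Hypothetical membership data: category -> set of article titles.
members = {
    "Film directors": {"Article A", "Article B", "Article D"},
    "American people": {"Article A", "Article C", "Article D"},
    "Living people": {"Article D", "Article E"},
}

# Two boxes ticked, then three.
print(pages_in_all(["Film directors", "American people"], members))
print(pages_in_all(["Film directors", "American people", "Living people"], members))
```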

It's not quite as easy as that because, as I understand it, People from Augusta is loosely based on residency, and American people is more based on nationality. I think there may need to be some sort of tag on the parent that says what kind of relationship is implied. -- ProveIt (talk) 02:08, 8 September 2006 (UTC)
I came across the User:Duesentrieb/CatScan tool which does basically what I was talking about. It looks like this has been around since the beginning of this year. I've had limited success with it, but it is there. Cacophony 05:32, 14 September 2006 (UTC)

Too long

Page is too long and has too much extra prose and too many options. Trim it down a bit? --Brion 14:01, 4 September 2006 (UTC)

I think it's important for it to be long enough to expose the various possible options. The idea is not for this to become a policy (in which case conciseness would be an issue) but rather a proposal for major changes in the way categories are handled. Once the talk page has generated a consensus on what options are preferred by users and developers it will be possible to cut it down. Pascal.Tesson 17:00, 4 September 2006 (UTC)
It's nine pages printed. That's really too long for a feature request. :) --Brion 10:06, 6 September 2006 (UTC)
The interface could certainly use special pages for intersections, although that is not currently in the proposal. Until we know from the developers what is possible, we don't need to make decisions about specifics. -- Samuel Wantman 20:12, 5 September 2006 (UTC)
  • (unthreading) I saw that, but a namespace implies (1) being able to write things in it, and (2) having to create pages for all conceivable combinations. Neither, imho, is true. The intent seems closer to a query like Special:Shortpages. But yeah, technical details are less important than the idea. >Radiant< 20:43, 5 September 2006 (UTC)
If the intent is to replace a largish number of existing categories, using a namespace is convenient since it allows things to be written into it but does not require creating pages for all conceivable combinations. You can't edit a special: to annotate it or put it in a category. Per my response to Dangerman (above), we could make it more general by using a namespace that does not imply the operation (perhaps "operation" or "catquery"), but I do think it would be useful to use a (writable) namespace. -- Rick Block (talk) 14:14, 6 September 2006 (UTC)
That's all very nice, but doesn't this just not say anything new? Everybody loves the idea of category intersections; implementation details are what need to be proposed if you want to get anything done. --Brion 10:07, 6 September 2006 (UTC)
I'm confused. Is it too long or not detailed enough? Or is it just too many words without enough code? -- Rick Block (talk) 14:14, 6 September 2006 (UTC)
Technical advice about the "implementation details" was what I was hoping you (Brion) would bring to the discussion. From what I'm reading on the wikitech list, it seems that the technical issue is speed. Are there any other technical considerations we should be concerned about? I'm assuming that the technical issue can be solved, either with clever code, or advancing technologies. Flickr seems to be able to do what we are talking about, so it seems just a matter of time before this is doable. As we are a group of people who spend quite a bit of time dealing with categorization issues, it seemed reasonable to start envisioning the type of interface we would find useful and helpful. If we have a sense of where we are going, we can start moving the masses and categorization policy in that direction. I began work on this proposal out of the frustration of trying to facilitate categorization policies that would make sense, be useful and reduce conflicts. As the project grows this becomes more and more difficult if not impossible. What do you suggest we do to help fix things? -- Samuel Wantman 18:15, 6 September 2006 (UTC)
The question I'd like to ask Brion is what the cost is in terms of database queries. For example, how expensive would it be to find the intersection of all mathematicians (approx. 2300 entries, Category:Mathematicians and its subcategories) with Category:Living people? How would the cost change if the intersection of three categories was used? Is it feasible to have live queries (uncached, allowing a general search) or will they need to be cached beforehand? Or basically, would Wikipedia grind to a halt if this was implemented? --Salix alba (talk) 20:02, 6 September 2006 (UTC)

Question

Hello. I already posed this question on the stub project page. The question was, in order to see which philosophy articles need expert attention, i.e. to search articles BOTH in the category 'Philosophy' and 'Needs Expert Attention'. I'm gathering from this page that it may not be possible. If not, add my name to the list of people who would like it. You can see the enormous power of this for those of us trying to make this place better. Dbuckner 15:49, 4 September 2006 (UTC)

Amen to that. I often try to cleanup the backlog of articles with unclear importance and I often would love to be able to group articles together. This would help for instance when asking help, say, to the Wikiproject chemistry for the numerous chemical compound pages tagged as unimportant. Pascal.Tesson 17:03, 4 September 2006 (UTC)
CatScan may be sufficient for purposes like this (and I believe it is well used by a variety of folks at the stub sorting project). My real intent for this proposal is to replace many (all?) of the existing manually maintained "intersection categories" with dynamically generated content. I'd prefer if a side effect were that many of these categories completely ceased to exist as well. -- Rick Block (talk) 19:35, 4 September 2006 (UTC)

Simetrical's response from Wikitech-l mailing list

From: Simetrical

On 9/4/06, Samuel Wantman wrote:
I'm writing to ask a developer to take a look at a proposal being developed at English Wikipedia at:
http://en.wikipedia.org/wiki/Wikipedia:Category_intersection...

See http://bugs.wikimedia.org/show_bug.cgi?id=5244 and the various things duped to it. I'm pretty sure performance would be a major issue here; for instance, finding the first 200 pages in a category is limited to iterating over 200 members of the category, and likewise for all other operations currently supported by categories (as well as unions), but finding the first 200 pages in the intersection of two categories has no upper bound on the number of iterations required: you have to go through every page in each category in the event that they have fewer than 200 shared pages and neither is a subset of the other.

Has anyone written code that can handle this efficiently? Is such code even possible? Storing and updating expensive and often-used intersections as sort of "virtual" categories would probably be a good idea to begin with, but I'm not exactly knowledgeable on either databases or caching. When I mentioned it on IRC, Domas (database person, works for MySQL) was pessimistic. In addition to what I noted above about an unbounded number of checks, he also pointed out that intersection tends to make categories larger, which also affects performance.

So, as someone who has little personal knowledge of the issue, I'd hazard a guess that if one of the few devs who are knowledgeable enough about efficiency and databases and the MediaWiki schema (most likely Tim, I'd imagine) were willing to write the code, it could maybe be good enough to be acceptable, but otherwise I doubt this will be implemented.

(copied from e-mail -- Samuel Wantman 01:37, 5 September 2006 (UTC))
I think this all depends on how the database works. SQL syntax certainly supports "AND" queries, but MySQL might not have a tremendously efficient intersection algorithm. I think a "perfect" algorithm for a x2 intersection results in at most 2N database fetches where N is the number of elements to be displayed in the intersection. The database fetches are likely of blocks of some number of consecutive elements, so fetching N consecutive elements results in N/blocksize fetches. I think this means an intersection might be considerably more expensive, potentially involving a factor of (blocksize*2) more fetches than a simple list of N members of a category. Assuming a reasonable blocksize (maybe 4K bytes), this factor might total anywhere from 20 to something like 1000. Note that this is worst case, but assuming a perfect intersection algorithm. Less than perfect and it gets worse. Worst case is no overlap at all, and all database records for each category being intersected must be fetched. Consider intersection:even numbers::odd numbers. On the other hand, my guess is that "intersection categories" are accessed infrequently enough that this factor really doesn't matter unless it makes it perceptibly slow. -- Rick Block (talk) 04:05, 5 September 2006 (UTC) (update -- Rick Block (talk) 17:36, 5 September 2006 (UTC))
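To make the "no upper bound" concern concrete, here is a toy sketch that walks two sorted member lists in step and counts comparisons until it has found the requested number of shared entries. It is only an in-memory model under simple assumptions (sorted lists, one comparison per step); it says nothing about how MySQL would actually execute such a query, and the function name is made up.

```python
def first_n_shared(cat_a, cat_b, limit):
    """Walk two sorted member lists in step; return (matches, steps taken)."""
    i = j = steps = 0
    matches = []
    while i < len(cat_a) and j < len(cat_b) and len(matches) < limit:
        steps += 1
        if cat_a[i] == cat_b[j]:
            matches.append(cat_a[i])
            i += 1
            j += 1
        elif cat_a[i] < cat_b[j]:
            i += 1
        else:
            j += 1
    return matches, steps


evens = list(range(0, 2000, 2))  # 1000 members
odds = list(range(1, 2000, 2))   # 1000 members, none shared with evens

# Identical categories: the first 200 shared entries cost 200 steps.
same_matches, same_steps = first_n_shared(evens, evens, 200)
print(len(same_matches), same_steps)

# Disjoint categories: no entry is ever found, so the walk only stops
# when one of the lists is exhausted.
none_matches, none_steps = first_n_shared(evens, odds, 200)
print(len(none_matches), none_steps)
```

With identical lists the work is bounded by the page size, as for an ordinary category listing. With the even/odd case the walk has to go through essentially all members of both lists before it can report that the intersection is empty.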
Of course efficient implementation is a big problem but isn't there some hope that the structure of the categories and the fact that the same intersection queries will likely be asked over and over can be exploited here? Pascal.Tesson 13:58, 5 September 2006 (UTC)
It might be possible to cache the results of intersection queries and before executing the query again check if any of the categories being intersected have changed. In general, this doesn't help the worst case analysis because there must always be a first time any given query is executed. Caching the database records (in memory) in the Apache servers would help offload the database and, in practice, might be sufficient. I think there's some caching of database records that is already done, but I don't know the details of this. -- Rick Block (talk) 17:36, 5 September 2006 (UTC)
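A sketch of the caching idea, assuming each category carries a counter that is bumped whenever its membership changes. The class and method names are invented for illustration, and nothing here reflects how MediaWiki's existing caches work. As noted above, the first request for any given pair still pays the full cost.

```python
class IntersectionCache:
    """Cache two-category intersections, recomputing only after a change."""

    def __init__(self, members):
        self.members = members                       # category -> set of titles
        self.versions = {cat: 0 for cat in members}  # bumped on every edit
        self._cache = {}
        self.recomputations = 0

    def add(self, category, title):
        """Add an article to a category and mark the category as changed."""
        self.members.setdefault(category, set()).add(title)
        self.versions[category] = self.versions.get(category, 0) + 1

    def intersect(self, cat_a, cat_b):
        key = tuple(sorted((cat_a, cat_b)))
        stamp = tuple(self.versions.get(cat, 0) for cat in key)
        cached = self._cache.get(key)
        if cached is not None and cached[0] == stamp:
            return cached[1]  # neither category changed since last time
        self.recomputations += 1
        result = frozenset(
            self.members.get(key[0], set()) & self.members.get(key[1], set())
        )
        self._cache[key] = (stamp, result)
        return result


cache = IntersectionCache({
    "Philosophy": {"Article A", "Article B"},
    "Needs expert attention": {"Article B"},
})
print(sorted(cache.intersect("Philosophy", "Needs expert attention")))
print(sorted(cache.intersect("Needs expert attention", "Philosophy")))  # cache hit
cache.add("Needs expert attention", "Article A")
print(sorted(cache.intersect("Philosophy", "Needs expert attention")))  # recomputed
print(cache.recomputations)
```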

Looks nice

I've given it a cursory glance, and I like what I see. I'll write a detailed opinion later. Dread Lord CyberSkull ✎☠ 03:38, 5 September 2006 (UTC)

Interim step?

I'm wondering about the possibility of creating an interim step on the way to category intersection. According to meta, DynamicPageList can be turned on for any wiki. Wikinews is already using it (see n:Wikinews:DynamicPageList). If we turned it on for English Wikipedia, we could replace intersection categories with dynamic lists. This would keep things looking pretty much the way they are now, as primary categories get repopulated. Many of these categories already use templates (like Template:Fooian fooers) which could be modified to include a dynamic page list. The page list would be in the top section of the category, and miscategorized articles would still appear at the bottom. The next step would be to create bots to recategorize the articles. I think this is all doable now. How does DynamicPageList get turned on? How does a decision like that get made? -- Samuel Wantman 17:51, 5 September 2006 (UTC)

DynamicPageList has a lot of advantages. First I'll describe some problems with a general intersection as proposed here.
A general intersection across the whole of en-wiki involves a lot of refactoring of already existing category trees. Take Category:Mathematicians, with three main subtrees: by nationality, by century and by sub-field. To make this work all 2000 mathematicians would have to be placed in a Category:All mathematicians category, as I suspect it would be very inefficient to try to query all the 100+ subcategories, which go about 4 levels deep. Then there are cases where subcategories are not strict subsets of the parent: say Category:Donald Knuth, part of Mathematicians, has as a subcategory Category:TeX, whose articles are no longer mathematicians. We will find a lot of boundary cases like this. Moreover, a general scheme does not follow the incremental approach which Wikipedia tends to follow. It becomes one large change, all or nothing to make it work.
DynamicPageList on the other hand could be introduced incrementally: if someone creates a page using DPL then it's their responsibility to check the results are accurate. There are a lot of useful things which could be done with DPL today; for instance, it would be quite easy to implement a project-specific AfD watching page, similar to what the mathematics wikiproject does using a bot at Wikipedia:WikiProject Mathematics/Current activity. This would allow us to monitor performance; as the introduction would be gradual, it would not be a problem to switch it off if things grind to a halt. --Salix alba (talk) 20:47, 6 September 2006 (UTC)
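A toy illustration of the boundary case mentioned above, assuming a query simply gathers articles from a category and everything below it. The category names come from the example above; the article titles and the function name are placeholders. Walking the whole tree picks up the TeX article even though it is not about a mathematician, while cutting the walk off early misses deeper subcategories.

```python
from collections import deque


def collect_articles(root, subcats, articles, max_depth):
    """Gather articles from root and its subcategories, max_depth levels down."""
    found = set()
    seen = set()
    queue = deque([(root, 0)])
    while queue:
        cat, depth = queue.popleft()  # breadth-first: shallowest visit first
        if cat in seen:
            continue
        seen.add(cat)
        found |= articles.get(cat, set())
        if depth < max_depth:
            queue.extend((sub, depth + 1) for sub in subcats.get(cat, []))
    return found


# Category names from the discussion; article titles are placeholders.
subcats = {
    "Mathematicians": ["Donald Knuth"],
    "Donald Knuth": ["TeX"],
}
articles = {
    "Mathematicians": {"Mathematician A"},
    "Donald Knuth": {"Donald Knuth"},
    "TeX": {"TeX"},
}

print(sorted(collect_articles("Mathematicians", subcats, articles, max_depth=4)))
print(sorted(collect_articles("Mathematicians", subcats, articles, max_depth=1)))
```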
  • The problem with DPL is that if it's an interim step, it creates a lot of work that we'd have to undo later. Recatting 2000 mathematicians is also a lot of work, of course, but it could be automated by bot or by suggestion #3 above. Imho. >Radiant< 21:40, 6 September 2006 (UTC)
    • A DPL to category intersection transition might not be so difficult. The labour-intensive part is reorganising all the categories. Doing DPL first might actually make the heavy lifting easier. Once there is a sufficient number of DPL'd categories, it should not be too hard to transform a DPL-enabled category to whatever namespace is chosen. DPL might also help in ironing out the ontological problems, which I don't think have been fully enumerated yet. --Salix alba (talk) 22:30, 6 September 2006 (UTC)

Category TOCs

The statement was made that since category TOCs were created there is no longer any problem with large categories. I strongly disagree. There is still the problem that the subcategories and the pages are forced to use the same TOC. For example, if cat:American people were to be fully occupied then the first page might have articles for people with names going from Aa to Ab. Chances are there will be no subcategories with names within that range. This makes it look to the casual user like the category has no subcategories, and even if the user does realize what is going on, it would be impossible to find all the subcategories (as visiting each page just wouldn't be feasible).

This problem must be fixed before really large categories are made common. --JeffW 20:55, 5 September 2006 (UTC)

A workaround (not a permanent solution) is to pipe-sort the subcategories so they appear before "A" (ie. sort by * or " " or something), and hence appear on the first page of the many pages for the articles (in large categories). Carcharoth 22:57, 5 September 2006 (UTC)
That's ok for a work-around. But I'd like to see a real solution in place before we start to fully populate large numbers of primary categories with articles. --JeffW 15:44, 6 September 2006 (UTC)
Right. I agree. Who would be best to ask? Carcharoth 18:36, 7 September 2006 (UTC)
This problem is already specifically addressed in the proposal. If a category like American People gets really big, it would make sense to put all the subcategories in a different category (for example "American people by type", or rename "American people" to "Americans" or something similar). This already happens in many categories, usually by having one category singular and the other plural, like "Opera" and "Operas". I don't know the best way to name these categories, but I'm sure we'll work out a good solution. -- Samuel Wantman 05:56, 6 September 2006 (UTC)
I just don't see how this addresses what I wrote at all. --JeffW 15:44, 6 September 2006 (UTC)
Try and find an example to show what you mean. Find a large category and ask people to find all the subcategories in it. Hmm. Actually, the recently implemented "unified tree view" of categories partially solves this problem. But some subcategories can still be "hidden" further down the alphabet. Carcharoth 18:43, 7 September 2006 (UTC)
I guess the proposal is not clear enough about this. Here is what it says in this section:
Some categories may need to be split because they are both primary categories and navigational categories. A navigational category is a category which contains subdirectories. An example of this is Category:American people by occupation. Navigational categories should not contain any articles. Currently, Category:American people functions as both a primary category and a navigational category. It probably should be split into Category:American people (which would be fully populated with articles about Americans) and Category:American people by type or something similar (which would have all or most of the subcategories). Category:American people by type would then be a subcategory of Category:American people. This will make it easier to navigate through the subcategories, especially when primary categories are very large and have many subcategories.
There are other ways this could be structured, and I'm not sure what the best way is. It would also be possible to have a fully populated primary category called "American". "American" would contain all things American -- people, things, etc... "Americans" would be the intersection of "American" and "People". "American People" could remain as a navigational category that only contains subcategories and links to intersections. The intersection to "Americans" would be one such link. --Samuel Wantman 19:42, 7 September 2006 (UTC)
OK, I can see how that helps alleviate the problem. However, if you have Category:People that contains all of the biography articles, I think under your plan it would also contain Category:People by nationality, Category:People by occupation and all the rest of the navigational categories. The navigational categories are still going to get lost, spread out among the thousands of pages needed to hold all the articles. Perhaps navigational and primary categories should be split off into separate hierarchies? You could then have cat:People (navigation) contain cat:People (primary) and a link could be added in the other direction that would be guaranteed to show up on the first page. --JeffW 20:03, 7 September 2006 (UTC)
All the categories you mention are intersection categories (Rick likes to call these index categories). If you intersect "People" with "Nationalities" you end up with a set of intersections which would be what we now call "People by Nationality". In the first option this intersection gets transcluded into a category. In the second and third options this intersection page is either linked with a "See also" comment or is categorized. If it is a "See also" it remains prominently displayed on the top of the page. If it is categorized there would be a separate section for intersections (I don't think this is in any of the mockups). The third option also has a way to navigate directly to this intersection. Important links can always be edited into a page and they will always appear. Things that are categorized can be piped to appear at the top of the heap.
There's another possible way to handle this problem. Since there are now the plus signs that expand subcategories, perhaps what is needed is a plus sign to expand or collapse the entire subcategory section and article section. If you collapse the articles, you'd see all the subcategories. -- Samuel Wantman 06:47, 8 September 2006 (UTC)

American people a topic level category

Another thing that struck me as I read the proposal is why would the American people category be a primary or topic level category? Surely cat:People would be the primary category with cat:People by nationality as subset of that? Under your proposal should both American people and People be fully populated and is there a mechanism to ensure that is done? --JeffW 21:01, 5 September 2006 (UTC)

There are a couple of issues around this question:
First issue, what makes "American people" a primary category. I think there are two reasons why it is primary. The first is that it does not fit the general rule for an intersection: "If a category can be completely and totally expressed as the intersection of other categories, it is not a primary category and should be defined only as this intersection." If you intersect People with United States, you might not end up with Americans. There may be many people whose biographies are significant to the US without them being American. This is similar to how the intersection of "Mayor" and "San Francisco" is not necessarily a mayor of San Francisco. What seems to be implied in the "nationality" categories is citizenship. This is the primary distinction being made about "People by nationality". The second reason has to do with notability. If people are noted for being "American people" it is worth having a category for them. For some categories, like "film directors", nationality seems somewhat irrelevant. For others, like "Politicians" it is very relevant. Since there are so many relevant intersection categories that use intersections of people by nationality, having this category is essential. Intersecting "United States" with "Politics" and "People" is ambiguous. It could mean "American people" intersected with "Politics" which would include Americans who write about the politics of other countries. It could mean "Politicians" intersected with "United States", which wouldn't necessarily be Americans. It could also mean "US Politics" intersected with "People" which would include all non-Americans who write about US Politics. So "Politicians" and "American people" are both needed.
I think what this implies is that a person should have categories for the city, state or province, and country they were born in. Perhaps we also need categories for the cities, states, and countries that they are strongly associated with. I suppose they should also be placed in categories for their occupation and their class of occupation (news anchor and media occupations) as both of those would be useful to intersect on. This is beginning to look like each article will be in a lot of categories. I think that for this proposal to be useful you will also need to propose a mechanism to allow a person to be placed only in the most restrictive category and automatically be propagated up the chain of super-categories (but without the super-categories actually being added to the article). I'm thinking of some syntax, put on the category page, that means to propagate all the articles in the category to a specific super-category. --JeffW 15:36, 6 September 2006 (UTC)
Careful! :-) You are sliding down the slippery slope towards categorising an article based on nearly everything that could be linked from it instead. ie. "Y is a person from the USA and from the state of Georgia and the city of Augusta. They have worked as a film actor and as a roadsweeper..." Carcharoth 01:29, 7 September 2006 (UTC)
I think that just illustrates my point since if you thought it important to categorize a person into People from Augusta then you will be required (or a bot will do it) to also place the biography into the People from Georgia, American people, and People. Otherwise it wouldn't appear in the intersection of Plumbers and People from Georgia, for example. --JeffW 16:06, 7 September 2006 (UTC)
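A sketch of the propagation mechanism being asked for here, assuming each category page can name the super-categories its articles should be propagated to. The parents mapping, the function names and "Article X" are all hypothetical. The article itself lists only its most restrictive categories; the super-categories are derived when the intersection is computed.

```python
def effective_categories(direct, parents):
    """Expand directly assigned categories with all of their super-categories."""
    result = set()
    stack = list(direct)
    while stack:
        cat = stack.pop()
        if cat in result:
            continue
        result.add(cat)
        stack.extend(parents.get(cat, []))
    return result


def intersection(cat_a, cat_b, assigned, parents):
    """Articles whose effective categories include both cat_a and cat_b."""
    return {
        title
        for title, cats in assigned.items()
        if {cat_a, cat_b} <= effective_categories(cats, parents)
    }


# Hypothetical "propagate to" declarations placed on the category pages.
parents = {
    "People from Augusta": ["People from Georgia"],
    "People from Georgia": ["American people"],
    "American people": ["People"],
}
# The article lists only its most restrictive categories.
assigned = {"Article X": {"People from Augusta", "Plumbers"}}

print(intersection("Plumbers", "People from Georgia", assigned, parents))
print(intersection("Plumbers", "People from Georgia", assigned, {}))  # no propagation
```

With the declarations in place the article turns up in the Plumbers and People from Georgia intersection; without them it does not, which is the gap described above.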
What this illustrates is that all categories need to be well defined. They all have to have clear distinctions about what belongs and what does not belong. Once we have made those distinctions, categories should be fully populated with every article that would be helpful for people looking in that category or the intersections that use the category. One of the reasons the categorization system is a mess is because many categories get populated before they are defined. -- Samuel Wantman 07:09, 8 September 2006 (UTC)
Second issue, should "People" be fully populated. Rick and I actually discussed having some basic category tags, such as PEOPLE, PLACE, THING, TOPIC, LIST, EVENT. We could perhaps even require that at least one be assigned for every article. (This discussion is in Archive 1) This is sort of like how libraries had card catalogs for Subject/Title/Author. In the best of all worlds, I think these categories would be useful, but for our purposes they also have some problems. The biggest problem is they would be quite huge, bigger than the biggest categories we now have (Living people?) In print, a master index is needed to find something, but we have the search box for that. If you know a name you look it up. These categories would not be very useful for browsing. If you are browsing, you probably want to browse through similar things, not similarly spelled things. However, there's great utility to having these categories available to create intersections. By having "List" as a category, we can get rid of ALL the list categories because they will be intersections. For some topics (like World War II) it would be very useful to have intersections using these categories such as "World War II events", "World War II people", etc... So there is a utility to having these categories fully populated, and there is a down-side. If possible, I'd say have them fully populated.
Third issue: Keeping categories fully populated. I think this proposal will require that bots or software help maintain things. Option one describes how this might be done. Rick mentioned the problem of undoing categorization if larger categories are automatically populated from subcategories. I think this could also be handled somewhat automatically. For example, when an article page is saved, the software already looks at the categories and adds them or removes them from category pages. It could also look at the flags or bots set to see if any of the other categories in the page will end up duplicating the category being removed. The user could then be warned of this duplication and have the option of keeping the removed category or removing the child categories as well. Even without the technology, people can be warned to remove parents and children together. --Samuel Wantman 07:09, 6 September 2006 (UTC)
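A rough sketch of the save-time check described above, assuming a hypothetical `PARENT_OF` map that records which parent each category is propagated into. The category names and function names are illustrative only.

```python
# Hypothetical map from a category to the parent it is propagated into.
PARENT_OF = {
    "News anchors": "Media occupations",
    "People from Augusta": "People from Georgia",
}

def ancestors(category):
    """List the categories above `category`, nearest first."""
    chain = []
    seen = {category}
    current = PARENT_OF.get(category)
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = PARENT_OF.get(current)
    return chain

def duplication_warnings(removed, remaining):
    """Return the remaining categories that would re-add `removed` by propagation."""
    return sorted(cat for cat in remaining if removed in ancestors(cat))

warnings = duplication_warnings(
    "Media occupations", {"News anchors", "People from Augusta"}
)
```

If `warnings` is non-empty, the editor could be offered the choice of keeping the removed category or removing the listed child categories as well.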
Bots can solve the problem of keeping parent categories fully populated, but they don't solve the problem of massive numbers of categories on each article once you have to put it in every category up and down the hierarchy.--JeffW 15:39, 6 September 2006 (UTC)
It will add some categories, but many will also go away since they are intersections. Many categories will only need one or two levels in a hierarchy. Perhaps articles with 10 categories now will have 15 in this new system. But from those 15 you will be able to generate hundreds of intersections. So even if we double the clutter there will be a much larger increase in utility, by several orders of magnitude. Ultimately this is a huge reduction in clutter. Many categories are now routinely deleted at CfD because they contribute to clutter. This is a reality of the current system because if left unabated, articles would accumulate scores of intersection categories. With the proposed system adding a single category yields multiple taxonomies with minimal additional clutter. -- Samuel Wantman 07:09, 8 September 2006 (UTC)

Namespace vs Specialpage

As to the namespace/specialpage debate, let me just point out that

    1. There really isn't anything you could state on an Intersect:Foo:Bar page that isn't already on Category:Foo and Category:Bar
    2. Most category pages (and the corresponding talk pages) are relatively unused; that's what {{mainarticle}} is for
    3. If we were to put text on Intersect:Foo:Bar, the same text should be on Intersect:Bar:Foo. This gets worse if people intersect three or four categories
    4. And also, we shouldn't have to manually create pages for each plausible intersection - and if a bot were to do it, it might just as easily be done on a specialpage.
  • In other words, as I see it, it's a lot more work for virtually no extra gain. The intersection is kind of like Special:Shortpages in that it dynamically lists a number of pages with a certain characteristic. Rebuttals are, of course, welcome. >Radiant< 19:55, 6 September 2006 (UTC)
Multipoint rebuttals get awkward, but:
  1. Consider an article List of American film directors. Probably not in both category:Film directors and category:American people, but likely to be a useful link from Intersect:Film directors:American people.
  2. I completely agree (except I think you mean template:catmain).
  3. Yes, and the proposal addresses this. In the intersect namespace the "article" content has to be accessed indirectly based on the name. The "real" name (name in the database) is in canonical (sorted) order. So, before fetching intersect:D::E::C::F::A::B the pagename is manipulated into (canonical) A::B::C::D::E::F format, so any order matches the same "page".
  4. I completely agree. The proposal includes three versions of how this might work. All include the ability to traverse to an arbitrary intersection, with no manually created anything.
In other words, other than point #1, I don't know that I care whether it's a page in a new namespace or a special: page. I can see some value in allowing annotations, but it does indeed boil down to generating a list of pages that meet a certain characteristic. The "intersection selection table" concept (from the 3rd version of the UI) is something that could be added to the output of a special: page. -- Rick Block (talk) 02:46, 7 September 2006 (UTC)
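The canonical-name idea in point 3 can be illustrated as follows. This is a sketch under assumptions: the `Intersect:` prefix and `::` separator follow the examples in this thread, while the function name and the choice to drop duplicate category names are mine.

```python
def canonical_intersection_name(title, prefix="Intersect:", sep="::"):
    """Rewrite an intersection page name so its category parts are in sorted order."""
    if not title.lower().startswith(prefix.lower()):
        raise ValueError("not an intersection page name: " + title)
    parts = [p.strip() for p in title[len(prefix):].split(sep) if p.strip()]
    # Sorting (and dropping repeats) means any ordering maps to the same page.
    return prefix + sep.join(sorted(set(parts)))

name_1 = canonical_intersection_name("intersect:D::E::C::F::A::B")
name_2 = canonical_intersection_name("Intersect:Film directors::American people")
name_3 = canonical_intersection_name("Intersect:American people::Film directors")
```

Here `name_2` and `name_3` come out identical, so any annotation text would only need to be stored once.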
To add to what Rick said, I think there are some potentially important uses for the intersection pages. In option one, they are used for category maintenance parameters. Things like how categories should be linked, duplicated, etc... (This could be done in other ways.) In option two and three, the pages are used to create links to other intersection pages. Option three has the ability to go pretty much anywhere in "intersection space" but that doesn't mean it would be easy or obvious for the user to know where to go and how to get there. By adding links to intersections like "Nationalities by occupation", you could help direct people through the space. There will also be cases where you are at the intersection of two categories and there are useful intersections with a third. This requires links in all three options. Links are more important in Option two where full navigation is more constrained. This option relies on links to make useful connections between intersection pages. There are examples of this in the mockups. -- Samuel Wantman 06:13, 7 September 2006 (UTC)
  • I see... however, if the intersects were a specialpage, they would not need maintenance parameters. I'm somewhat worried about having the software automatically convert C:A:D:B into A:B:C:D because no other namespace works like that (KISS principle). If you're at the intersection of two categories, any intersection with a third would be a subset thereof... I think an easier way to cover this is to create a few menu screens with pulldowns where you can select e.g. country, profession and birth year. It might also help to create redirects from (current) categories to the specialpage (note that contrary to popular belief, redirects to specialpages work fine). Anyway I'm not yet convinced, and I think neither are you, so let's think and talk on this some more. >Radiant< 21:32, 7 September 2006 (UTC)

Another example (Disasters)

Is it possible to have another set of examples to see how this would work? For no other reason than I am familiar with it, I propose Category:Disasters. It is not fully fleshed out, but there are categories by location (mostly country), by date (mostly year) and by type (many of these). How would this change with this proposed system? Carcharoth 10:06, 7 September 2006 (UTC)

Here's my take on what would change:
  • category:Disasters would be fully populated as a primary category.
  • The "by-type" subcategories would be fully populated (as primary categories).
  • The "by-country" and "by-year" "index" subcategories could be turned into intersections, by adding individual disasters to an "events in <country>" category and an "events in <year>" category (this could be done by a bot).
  • Then, for example, Category:1971 disasters becomes Intersection:Events in 1971::Disasters, and the category either goes away or is left as a category linked to the intersection (depending on which version of the UI we're talking about).
So, how is this better than what we have today? For disasters, at least superficially, I don't think it makes much difference. An individual disaster would be categorized into category:Disasters (which would be an addition), [[:category:<some type of disaster>]] (just like it is today), [[:category:events in <some country>]] (rather than [[:category:disasters in <country>]]), and [[:category:events in <some year>]] (rather than [[:category:disasters in <some year>]]).
This immediately makes some things available that currently aren't like "fires by year" or "fires by country" (and please don't go create these!). I don't think this is an overly huge deal, but changing to this sort of structure would also include disasters in the event-by-year-by-country categories/intersections (which they aren't today). We could fix this by fully populating the Category:Years by country category hierarchy (don't do this either). As it stands today, I think we have a lot of "x by y by z" index categories that aren't complete and would take a huge amount of manual effort to create and maintain. -- Rick Block (talk) 15:13, 7 September 2006 (UTC)
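To make the disasters example concrete, here is a small sketch. The article titles are invented placeholders and the data structure is hypothetical; it only shows how the "by-year" index categories fall out of primary categories.

```python
from collections import defaultdict

# Hypothetical articles, each tagged only with primary categories.
ARTICLES = {
    "Example fire of 1971": {"Disasters", "Fires", "Events in 1971", "Events in Canada"},
    "Example flood of 1971": {"Disasters", "Floods", "Events in 1971", "Events in India"},
    "Example fire of 1985": {"Disasters", "Fires", "Events in 1985", "Events in Canada"},
    "Example festival of 1971": {"Festivals", "Events in 1971", "Events in Canada"},
}

def intersection(*categories):
    """Articles that carry every one of the given categories."""
    wanted = set(categories)
    return {title for title, cats in ARTICLES.items() if wanted <= cats}

def by_year(category):
    """Group the members of `category` by their 'Events in <year>' category."""
    prefix = "Events in "
    groups = defaultdict(set)
    for title, cats in ARTICLES.items():
        if category not in cats:
            continue
        for cat in cats:
            year = cat[len(prefix):]
            if cat.startswith(prefix) and year.isdigit():
                groups[int(year)].add(title)
    return dict(groups)

disasters_1971 = intersection("Events in 1971", "Disasters")
fires_by_year = by_year("Fires")
```

`disasters_1971` plays the role of Category:1971 disasters, and `fires_by_year` is the "fires by year" index that nobody had to create by hand.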
Thanks for explaining all this. It sounds great, and if this can all be implemented fairly simply and painlessly, then I think it is a great idea. I started to create lots of "<year> disasters" categories, but it was taking a long time. The problem is actually that not all the disasters are tagged with their year yet. That would still need to be done manually, right? Also, when the bot fully populates the primary disasters category, it will show some miscategorisation. An example is the eponymous categories like Category:Columbine_High_School_massacre and Category:Beslan_school_hostage_crisis and Category:2004 Indian Ocean earthquake and Category:September 11, 2001 attacks. These all contain articles that appear under the "Disasters" hierarchy, but are not disasters - rather, they are related to a specific disaster. Also, the term disaster is used in a rather broad and vague sense in this category, so I'm not entirely sure how that would work. Articles that happily sit together under "Disasters" as examples of "damage to life and property" are currently labelled accurately by their immediate designation (eg. earthquake or terrorist bombing), but applying the broad term "disaster" to all of them would not work so well. I think that is more a problem with the semantics of this particular area though, rather than the system you are proposing. But it is something to think about. I've also noticed things like Category:Giant monster films creeping in! :-) Carcharoth 18:28, 7 September 2006 (UTC)

Tossing some thoughts out

So what we could well be looking at is a complete overhaul of categories, for example having categories that are defined as born in foo and naturalised citizen of foo, which sit inside nationality people and would also be top level categories? I'm not sure how to catch the overlap though. What sort of definition do we place on our nationality categories? Are we going to have to completely overhaul the category structure? Are we going to have to be more rigid in our categorisation policy? Hiding Talk 09:11, 8 September 2006 (UTC)

I'm not sure we need to plan every detail about categorization policy before implementing this. There will be a natural desire to make categories that create useful and meaningful intersections. Perhaps "useful and meaningful" is all we need to have as a policy. Things can evolve. I do wonder about the issue about the "x is a y" and "x is related to y" duality to our categorization structure. Clearly, "x is a y" is not a problem when creating intersections. I don't think "x is related to y" would be a problem either. For example, if you take Category:George W. Bush which is a "related to" category and intersect it with Category:Children's books you end up with an intersection that contains The Pet Goat. If the categories are "related to" the intersections will be as well. But as this is a source of confusion and conflict with our current system, it will remain so with an intersection system. -- Samuel Wantman 09:56, 9 September 2006 (UTC)

Some technical bits

Ok, let's forestall the namespace vs specialpage bit and see what we do agree on...

  • What we want is a way of getting the intersection of two or more categories.
  • Obviously, intersection of A and B equals intersection of B and A.
  • Because of performance reasons, it might be useful to restrict this to the intersection of, say, five or six categories maximum.
  • It would be nice to have some browser page, that makes it easy to get from "actors" to "italian actors" or "actors born in 1950".
  • It would be nice if this functionality was available from the category list at the bottom of each page; for example, we could add a checkbox to each cat, and an intersect button that, if clicked, shows the intersection of all 'checked' cats.
  • Thoughts please? >Radiant< 10:20, 9 September 2006 (UTC)
  • Also, I've written this script as a possible addition to MediaWiki. It currently uses Specialpages but it would not be problematic to rewrite it using the Category routines, to create a namespace. >Radiant< 16:08, 9 September 2006 (UTC)
    • And it might be interesting to look at the current query code for categories, which limits the search to 200 hits at a time. I assume running a "null" result intersection query against a very large category (like Category:Living people intersected with Category:Ancient Greeks) would have roughly the same database impact as the unlimited category search used to have. The unlimited category searches presumably had other problems as well, but I suspect Tim might have some insight here. -- Rick Block (talk) 20:13, 9 September 2006 (UTC)
Looks like Tim's insight might have to wait a bit. See User:Tim Starling/Get Well Soon. :-( Carcharoth 22:14, 19 September 2006 (UTC)
  • I spoke to Brion yesterday and he likes the idea. The main problem seems to be server load: if two huge (>10000) categories are intersected but have no actual matches, that would be problematic. Suggestions on how to alleviate this would be appreciated. >Radiant< 22:27, 19 September 2006 (UTC)
Without prebuilding indices in the database for any conceivable intersection search, I don't think there is a simple solution. A ∩ B for large sets A and B that have no intersection requires reading all elements of A and all elements of B (even if A and B are just indices). The only alternatives I can think of are:
  1. run such queries in a non-blocking manner at a lower priority than other work going on in the database (but it looks like MySQL doesn't support this, see [1])
  2. slough them off to a different physical server (which would likely need to have its own copy of the category database)
  3. expand the query into multiple queries that could be run in series to obtain the result (for example run 26 queries restricting the intersection by first character of the category name).
None of these are particularly easy, and #2 even requires throwing hardware at the problem. -- Rick Block (talk) 01:43, 20 September 2006 (UTC)
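A sketch of why an empty intersection is the worst case, and of alternative 3. It assumes both categories are available as sorted lists of page titles, and it partitions by the first character of the page title, which is my reading of the suggestion. Function names are hypothetical.

```python
import string

def intersect_sorted(a, b):
    """Merge-walk two sorted title lists; with no matches it reads every element."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_in_chunks(a, b):
    """Yield (letter, matches) one initial letter at a time: 26 small queries."""
    # Titles that do not start with A-Z are ignored in this simplified sketch.
    for letter in string.ascii_uppercase:
        chunk_a = [t for t in a if t[:1].upper() == letter]
        chunk_b = [t for t in b if t[:1].upper() == letter]
        yield letter, intersect_sorted(chunk_a, chunk_b)

cat_a = sorted(["Alpha", "Bravo", "Charlie", "Delta"])
cat_b = sorted(["Bravo", "Delta", "Echo"])
whole = intersect_sorted(cat_a, cat_b)
pieces = [t for _, matches in intersect_in_chunks(cat_a, cat_b) for t in matches]
```

The chunked version does the same total work, but each piece is short, so other database activity could run between pieces.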
Is it possible to prebuild indices for intersection categories on a daily basis, or the first time an intersection is requested each 24 hours, or whenever the servers have low loads? This would mean that the information would not be totally current, but that is better than non-existent. -- Samuel Wantman 08:57, 21 September 2006 (UTC)
You're basically asking for (number of categories) squared tables to be built every 24 hours (and that's only for two way intersections) which doesn't seem feasible. Also, the vast majority of those category intersections will never be requested. Perhaps, since real-time intersections of categories with a low number of entries (less than 200?) wouldn't be a load on the server, you could limit the tables built to just the intersections of large categories. Or maybe something like creating a daily list of categories that don't intersect would be useful as that information could be stored in one simple list for all such intersections. --JeffW 15:57, 21 September 2006 (UTC)
  • I figured it might be possible to cut off the query after a second or so, or to cache a list of queries that are known not to yield results so that they need not be recalculated (of course this cache should be timestamped). >Radiant< 16:54, 21 September 2006 (UTC)
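The timestamped cache of known-empty intersections could look roughly like this. It is a sketch with in-memory sets standing in for the database; the class name, the 24-hour default, and the injectable clock are assumptions for illustration.

```python
import time

class EmptyIntersectionCache:
    """Remember which category combinations recently produced no results."""

    def __init__(self, ttl_seconds=86400, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self._stamps = {}

    def mark_empty(self, categories):
        # frozenset makes the key independent of the order of the categories.
        self._stamps[frozenset(categories)] = self.clock()

    def known_empty(self, categories):
        key = frozenset(categories)
        stamp = self._stamps.get(key)
        if stamp is None:
            return False
        if self.clock() - stamp > self.ttl:
            del self._stamps[key]  # Entry is stale; force a fresh query.
            return False
        return True

def intersect(categories, members, cache):
    """Intersect category member sets, skipping work for known-empty combinations."""
    if cache.known_empty(categories):
        return set()
    result = set.intersection(*(members[c] for c in categories))
    if not result:
        cache.mark_empty(categories)
    return result
```

Only the expensive empty results are cached here; small or non-empty intersections are always recomputed, so they stay current.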

From the discussions on wikitech-l and some personal experimentation, I think the issues are exactly 1) almost entirely technical - how to actually get the pages at the intersection of two or more categories and 2) demonstrated in the worst case by those very large categories mentioned (Category:Living_people for example). There are a few different ways to implement using existing tables, but so far it looks like, in the worst cases, these all take at least several seconds to complete. So, that means either new tables (likely with millions and millions of records - it's not exponential, but it is big) or perhaps a way to identify unreasonable queries or cut them off as Radiant mentioned. I'm going to try a few more experiments... --Aerik 00:06, 12 October 2006 (UTC)

Anime and Manga case

I find this proposal the right approach to solve supercategorization and long discussions about where and how to put an article. On Italian Wikipedia, I have made a simple SVG that can be useful to explain the differences in these two categorization approaches in the Anime and Manga case ((in Italian) read the discussion), where genres are the same for both and things like shounen anime and shounen manga are redundant, confusing and create a huge number of subcategories. Since I've put that SVG in Public Domain, you can copy and modify it without restriction (using an SVG editor like Inkscape) to represent other supercategorization cases. --Dzag 10:03, 4 November 2006 (UTC)

Also check out Faceted classification

Much of what you (and I) are looking for is described as Faceted classification, which is fundamentally about intersections of semantically orthogonal facets.

Of course, we could have lots of arguments about what constitutes a "proper" facet as well as whether a "faceted approach" should include multiple picks from the same facet. -- PhilMurray46 16:44, 14 November 2006 (UTC)

DynamicPageList

I see it mentioned a few times on this page. Is there a reason why meta:DynamicPageList or meta:DynamicPageList2 are not turned on in en-wikipedia? It seems like it would be very useful. Carcharoth 10:41, 12 January 2007 (UTC)

Maybe DynamicPageList could be of some help. I put a lot of effort in improving it over the last weeks. Some of the features discussed here are already available, others could be added I think. You may want to have a look at the DPL demo website. -- Algorithmix 13:49, 18 February 2007

Temporary flat category solution?

I like the proposal at Wikipedia:Category_intersection#Category_viewed_as_an_outline. I was vaguely aware of this, and there is currently a way to rather kludgingly do this in a manual and non-updateable way. See what I did at Category:Earthquakes, Category:School massacres and Category:Anti-communists. What I want is category union. Intersection seems to talk about category A AND category B, when I am also interested in category A OR category B, with tweaks to allow simple union to a specified depth (ie. specify a depth of 3 go down three levels into the subcats, and unify and list all the articles in the subcats and the original parent category), as opposed to, say, unifying two unrelated categories. What I really want is feedback on whether the linking to page history that I did is acceptable, and whether such lists of all articles in a set of categories are useful. Carcharoth 10:53, 12 January 2007 (UTC)
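The depth-limited union described above can be sketched as a breadth-first walk. The category tree and article titles below are hypothetical placeholders, and depth 0 is taken to mean the parent category alone.

```python
def union_to_depth(category, subcats, articles, depth):
    """Collect articles from `category` and its subcategories down to `depth` levels."""
    seen = set()
    result = set()
    frontier = [category]
    for _ in range(depth + 1):
        next_frontier = []
        for cat in frontier:
            if cat in seen:
                continue  # Guards against category loops.
            seen.add(cat)
            result |= articles.get(cat, set())
            next_frontier.extend(subcats.get(cat, []))
        frontier = next_frontier
    return sorted(result)  # Alphabetical, like a flat index.

# Hypothetical category tree and members.
SUBCATS = {
    "Earthquakes": ["Earthquakes by country"],
    "Earthquakes by country": ["Earthquakes in Japan"],
}
ARTICLES = {
    "Earthquakes": {"Earthquake"},
    "Earthquakes by country": {"List of earthquakes by country"},
    "Earthquakes in Japan": {"Example Japanese earthquake"},
}

flat_index = union_to_depth("Earthquakes", SUBCATS, ARTICLES, depth=2)
```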

Some test code for intersections implemented in a fulltext index

I have a page up for testing some code which does category intersections using a boolean search on a MySQL fulltext index of a table containing all the categories for each page. My database is copied from November and only has the tables I needed to test this code.

http://aerik.com/wikintersections.php

I know the interface is no good, but I am trying to collect statistics about query performance. Please take a look and run some intersection queries. Thanks, --Aerik 18:37, 16 January 2007 (UTC)
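For readers unfamiliar with the approach, a boolean fulltext search of this kind might be built as below. This is not the test page's actual code: the table name `page_categories`, the column names, and the underscore encoding of category names are all assumptions. Only the `MATCH ... AGAINST (... IN BOOLEAN MODE)` syntax and the `+` operator are standard MySQL.

```python
# Hypothetical schema: one row per page, with all its categories in one text column.
SQL = (
    "SELECT page_title FROM page_categories "
    "WHERE MATCH (category_list) AGAINST (%s IN BOOLEAN MODE) "
    "LIMIT %s"
)

def boolean_query(categories, mode="and"):
    """Build a MySQL boolean-mode search string from category names."""
    # Spaces become underscores so each category is a single fulltext word.
    # Names containing other punctuation would need more careful encoding.
    tokens = [c.strip().replace(" ", "_") for c in categories]
    # '+' marks a word as required (AND); bare words are optional (OR).
    prefix = "+" if mode == "and" else ""
    return " ".join(prefix + t for t in tokens)

and_query = boolean_query(["Living people", "Film directors"])
or_query = boolean_query(["Living people", "Film directors"], mode="or")
# A database driver would then run something like: cursor.execute(SQL, (and_query, 30))
```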

Looks good. Would you be able to comment on the above two sections as well? I would dearly love to unify any number of categories I selected and create alphabetical indices for large numbers of fragmented categories. Would your system do that as well as intersection (ie. would it do OR as well as AND)? Carcharoth 11:48, 17 January 2007 (UTC)

I've put some links at the top of Category:Unassessed Tolkien articles. Unfortunately, those intersections are only helpful if you are using current data, but that is an example of the reasons why category intersection would be very helpful. Carcharoth 16:20, 21 January 2007 (UTC)

Just to try some worst case queries, I intersected category:Living people with category:album covers (no intersection, query took 4.8575 seconds) and category:Living people with category:Disambiguation (one intersection, query took 7.2953 seconds). I don't know if these are elapsed or CPU times (seem to be elapsed). My understanding is that in at least some cases MySQL processes queries serially, so while one of these is running other queries would wait. Running some more tests, specifically intersecting category:Unassessed_Tolkien_articles with category:Top-importance_Tolkien_articles at the same time as one of the long running queries, it appears reads involving different specific database records are not serialized (and, with this interface there's no way to try simultaneous writes, so I have no idea how writes would interact with these long running reads). The smaller intersection finished quite quickly when starting it during a long query, but repeating the same long running query while one is already active seems to slow down both (implying multiple reads using the same index record are processed concurrently). Intersecting category:Living people with category:Living people (!) takes 15 seconds or so of elapsed time, but the query result says 0.0045 seconds. I think the bottom line is that with read-only access it seems to not be too bad but we should have considerably more benchmarking involving simultaneous read and write operations before imagining we might be able to deploy anything like this to the live en database server(s). It seems like it might be worth getting a MySQL performance expert to comment about this. -- Rick Block (talk) 18:02, 21 January 2007 (UTC)
Thanks for the comments. Over here someone asked when Category Intersection is likely to happen. I said it was anyone's guess, really, but if anyone here can say something more informative, that would be great. Carcharoth 12:51, 23 January 2007 (UTC)

Hi, as an update, I'd call the result of the tests so far, "mixed". Yes, the "Query completed " time is elapsed time that I calculate using a microtime right before the query is passed to MySQL and getting it again right after the result set is returned. This is always for a statement limited to 30 records. Rick, thanks for your in-depth observations and tests. I'm going to go ahead and create a Lucene index (I need to go educate myself about Lucene) of the same data so we can play with that and make some comparisons. Carcharoth, I think we're maybe a couple of months away from having a working solution, but I will likely need a more senior developer's help to integrate the agreed upon solution into the main body of code as efficiently as possible... so there is always the question of politics/bureaucracy... --Aerik 18:20, 23 January 2007 (UTC)

Great. Thanks for the update. Let us know what is happening and if we can do anything to help, like run more tests. I'll try and keep an eye on this page. Carcharoth 23:29, 23 January 2007 (UTC)
Carcharoth, thank you! More tests would be great! I'll post more as soon as I have any data/updates. --Aerik 19:57, 24 January 2007 (UTC)

I had to change hosts and upgraded to a VPS (too cheap for a dedicated server) and in the process deleted some stuff I shouldn't have. To make a long story short, I'm working on a new download (from Apr 7, I believe) and I am also playing with Zend_Search_Lucene and *trying* to create a Lucene index. I hope to put up a new MySQL fulltext index in a day or two, and the Lucene index in maybe a couple of weeks (still working on technical problems and even after that it's going to take days to index everything - I'm planning to go ahead and index through all the namespaces to really see what the performance is like).--Aerik 16:28, 19 April 2007 (UTC)

If this is implemented it would make it a lot easier to

look for octogenarians, for example. Just search for an intersection between category:living people and category:1918 births until category:1927 births. --Wirosableng 20:33, 24 February 2007 (UTC)
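As a sketch, this search is a union of ten birth-year categories intersected with one more category. The member sets below are hypothetical stand-ins for the real categories.

```python
def octogenarians(members, first_year=1918, last_year=1927):
    """Living people whose birth-year category falls in the given range."""
    born_in_range = set()
    for year in range(first_year, last_year + 1):
        born_in_range |= members.get(f"{year} births", set())
    return members.get("Living people", set()) & born_in_range

# Hypothetical category members.
MEMBERS = {
    "Living people": {"Person A", "Person C", "Person D"},
    "1918 births": {"Person A", "Person B"},
    "1927 births": {"Person C"},
    "1950 births": {"Person D"},
}

found = octogenarians(MEMBERS)
```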

Yes I agree, I think we should start a voting process for this proposal and if the majority agrees, the developers could begin the work. --Drgerard 17:06, 25 February 2007 (UTC)

This proposal would wreck the utility of the categories for browsing articles

This system misunderstands what categories are primarily for, which is browsing, not searching. If implemented people might now and again find it useful as a search function, but they would miss out on all the detailed connections that one comes across serendipitously with the current detailed system. Therefore it should be abandoned. Casperonline 15:26, 27 March 2007 (UTC)

I think it would make it easier to browse, which was part of the inspiration for writing the proposal. All the current categories would still exist in some form, and many more would be possible using intersections. Every topic would have fully populated categories and you could navigate to fully populated sub-topics or intersections. Of the three ways we mocked this up, the first would look almost identical to what we already have, so why is the serendipity lost? -- Samuel Wantman 09:03, 28 March 2007 (UTC)
After a lot of problems in WP Films categorization and taking into consideration all views presented, I also think this system would help rather than wreck browsing. Actually it doesn't wreck the browsing of articles at all, it just involves having the primary category additionally populated, offering multiple possibilities, whereas our present conventions limit them. I was also scared when I first learned about it, but as problems continue to accumulate unanswered, I start seeing its huge merits. Hoverfish Talk 14:22, 10 April 2007 (UTC)
I agree as well it would help browsing. For example, recently I was browsing Category:FA-Class biography (arts and entertainment) articles. But I didn't just want them all, I specifically wanted the subset that were also BLPs. Many subcategories are so large that browsing them may not be of interest. However browsing a subset may very well be. Nil Einne (talk) 19:13, 31 March 2008 (UTC)

So what's going on?

I am interested in doing what I can (as a non-dev) to help push this along. Please tell me that at least one person is still tinkering around in the background... :) Girolamo Savonarola 04:50, 21 August 2007 (UTC)

I'm still watching the talk page. Well, sort of. If you want to get people interested in this again, the best thing to do is get a developer interested, and then get a group of people developing the proposals and putting them into workable formats for wider discussion and, well, in a format that developers can work on. Carcharoth 15:59, 3 October 2007 (UTC)
Okay, fair enough. Where can I hunt for a dev? :) Girolamo Savonarola 17:42, 3 October 2007 (UTC)
I too, keep an eye on this page, and also scan the Wikitech mailing list. I put a copy of this proposal in Jimmy Wales' hands last year, I e-mailed the tech list, e-mailed developers that seemed interested. The problem, it turns out, is not getting people interested. Pretty much everyone I've ever talked to about this thinks it should happen. The problem is technical. It involves creating a system that will be able to handle category intersections without bringing the entire site to its knees. Developer Gmaxwell is working on this using Lucene, and in an e-mail to wikitech on September 9th said:
I have a new search system running externally right now which is much more powerful, for example it is able to apply category intersections, fuzzy title matches, text regular expressions, page link constraints and geographic filtering. Most of the functionality is fast enough to make available to everyone.
If you'd like to be a beta-tester please let me know.
I e-mailed him and offered my help in trying to coordinate efforts between the developers and the people who had been discussing the user interface here. I got no response. -- SamuelWantman 20:26, 3 October 2007 (UTC)
That sounds like an external system. Did he give any indication of how it would be integrated to work in the system? I'd try e-mailing him again, or dropping a note off on his talk page. It sounds promising. Did he try it out on the very largest categories? Carcharoth 23:04, 3 October 2007 (UTC)
For what it's worth, I watch this page as well. I think the most promising approach involves using a database for category searches that would allow these searches to be run at a lower priority than other work, although AFAICT MySQL does not currently support this (browser timeouts are another problem, but I think there are potential solutions for this issue). I would think the MySQL folks would be interested in working with the MediaWiki development team to come up with a solution - it would make a great entry on their case studies promo page. Has anyone approached the MySQL folks? -- Rick Block (talk) 00:40, 4 October 2007 (UTC)
How about a well-advertised, open competition to see which system or approach can come up with the goods? Or does that sort of thing not really work any more? Carcharoth 00:44, 4 October 2007 (UTC)
I don't know if it's official, but it's at least an unofficial policy of the board to use free software. An open competition restricted to open source databases might work, but I suspect winning a competition of this nature might be more valuable as advertising to a commercial vendor. MySQL exists in both commercial and free forms (same code). I'm not sure how many others are like this. Asking MySQL if they'd work with the development team on this doesn't seem like it could hurt, and since we use MySQL now seems like the easiest approach. For them to list Wikipedia as a case study would probably require the board's approval. -- Rick Block (talk) 04:26, 4 October 2007 (UTC)

Is "American film directors" a good example?

"American film directors" can be interpreted as "Film directors who are American" or as "Directors of films first published in America". Is this important? --Damian Yerrick (talk | stalk) 22:33, 26 November 2007 (UTC)

The former is the only correct interpretation. UnitedStatesian (talk) 22:35, 26 November 2007 (UTC)

Semantic MediaWiki extension

I am excited by the potential of this extension. I understand that these fancy features load servers and complicate Wikipedia editing, but they will take Wikipedia to the next level. This feature is very scalable. We could initially only give it a limited scope, say dates or birth dates; why have category: 1976 births when just as easily we could have semantic data: "birth::27 may 1976" meaning the semantic tag which goes beyond generic category info - it understands the birthday in question is 27 may 1976. This could be applied to lots of things, and I feel it will (after an adjustment period) enhance and even simplify the categorization element of Wikipedia. Chendy (talk) 14:38, 1 February 2008 (UTC)

Indeed, SMW does category intersection, and it lets you do much more precise querying than nebulous things like "Child actors".

{{#ask:
 [[Category:Actors]]
 [[Category:Directors]]
}}

with no changes to wiki markup will find the intersection of those two categories and their subcategories (to a configurable subcategory depth). Then if you've gone on to annotate with properties like [[Date of birth::2004-05-27]], you can ask for

{{#ask:
 [[Category:Actors]]
 [[date of birth::>1992-01-01]]
 |?Date of birth
 |?Category
 |sort=date of birth
 |?
}}

and that will display all actors 16 years old or less, ordered by date of birth, together with any other categories they've been assigned. -- Skierpage (talk) 05:14, 16 April 2008 (UTC)

I certainly think we should be working towards implementing SMW somehow on Wikipedia. It would be useful for categorization and much more besides. In fact I would like to see an SMW database developed somehow outside any specific language version of Wikipedia (something like commons), to enable the sharing of key information between Wikipedias. (People might like to check out my partially-formed idea at User:Kotniski/wdb, which I wrote before I discovered SMW - I believe SMW would be a perfect way to implement something like this.)--Kotniski (talk) 07:45, 16 April 2008 (UTC)

When can we get ?

I would like to use it now. People are creating manual intersection categories all over the place. Perhaps the overhead of the dynamic processing will be offset by dramatically reducing the number of manual intersection categories . . . ? Pee Tern (talk) 00:23, 22 February 2008 (UTC)

PS: They are also creating (non-encyclopedic) intersection list articles all over the place. Pee Tern (talk) 00:27, 22 February 2008 (UTC)

I would also be curious to know what exactly is going on. It seems that every few months there's a clamor, we're given a quick update and told it should be around shortly, and then suddenly...nothing for months. At the very least, it would be nice to have some regular biweekly or monthly updates as to the progress being made. This would also increase involvement with this page and (hopefully) editorial advocacy for prioritizing this development work. Thanks, Girolamo Savonarola (talk) 06:06, 22 February 2008 (UTC)
I am thinking of just starting to create semantically primitive categories anyway, to try to get some momentum up. For example, instead of (actually, as well as, for the time being) having a category for law enforcement agency, the article would have three semantic primitive categories: law, enforcement, and organisation . . . Pee Tern (talk) 21:22, 24 February 2008 (UTC)

Initial implementation option

Rather than trying to go for the whole solution immediately, perhaps in stage 1 we go for binary intersections only, with a not and a groupby clause. In stage 2 we go for nested binaries . . . Pee Tern (talk) 21:53, 24 February 2008 (UTC)

Alternative syntax

The syntax proposed to date seems to introduce another way of "linking" to information in Wikipedia. I would like to propose that we use the existing syntax as follows:

Expression Effect
{{Category:Intersection |category1 |category2}} To produce a category page of those pages in both category1 and category2.
{{Category:Intersection |category1 |category2 |not=category1}} To produce a page of those pages in category2 but not in category1.
{{Category:Intersection |category1 |category2 |not=category2}} To produce a page of those pages in category1 but not in category2.
{{Category:Intersection |category1 |category2 |groupby=category2}} To produce a page of articles in both category1 and category2 where category2 is a parent category. For example {{Category:Intersection |countries |organisation |groupby=countries}} would produce a page of organisations grouped by country (there being a category for each country which is in a parent category of countries).
[[:Category:Intersection |category1 |category2]] To produce a link on a page to the corresponding intersection category.
[[Category:Intersection |category1 |category2 ]] Is by definition not defined, or, it could put the page into both categories at once and be equivalent to [[Category:category1]] [[Category:category2]]?
{{Category:Intersection |{{Category:Intersection |category1 |category2 }} |category2 }} For nesting?

Pee Tern (talk) 21:53, 24 February 2008 (UTC)
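
To make the intended semantics of the table above concrete, here is a minimal Python sketch of the three stage 1 operations (plain intersection, not, and groupby). It is only an illustration of the set logic, not MediaWiki code: the dictionaries `CATEGORIES` and `SUBCATEGORIES`, the sample page titles, and both function names are hypothetical.

```python
# Hypothetical sample data: category name -> set of page titles.
CATEGORIES = {
    "organisation": {"Interpol", "FBI", "Scotland Yard", "Red Cross"},
    "law enforcement": {"Interpol", "FBI", "Scotland Yard"},
    "United States": {"FBI"},
    "United Kingdom": {"Scotland Yard"},
}

# Hypothetical parent category -> list of its child categories.
SUBCATEGORIES = {"countries": ["United States", "United Kingdom"]}


def intersection(cat1, cat2, exclude=None):
    """Pages in both categories, or in one but not the other when `exclude` is given."""
    a, b = CATEGORIES[cat1], CATEGORIES[cat2]
    if exclude is None:
        return a & b          # in category1 and category2
    if exclude == cat1:
        return b - a          # in category2 but not in category1
    if exclude == cat2:
        return a - b          # in category1 but not in category2
    raise ValueError("exclude must name one of the two categories")


def intersection_grouped(parent, cat):
    """Pages of `cat`, grouped by each child category of `parent`."""
    groups = {}
    for child in SUBCATEGORIES[parent]:
        members = CATEGORIES[child] & CATEGORIES[cat]
        if members:
            groups[child] = sorted(members)
    return groups
```

With the sample data, `intersection_grouped("countries", "organisation")` plays the role of the groupby=countries example: one group per country category, each listing the organisations in it.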

This is basically creating a 2-level namespace where "Intersection" is a qualifier within the category namespace. I don't actually know the code, but I'd expect that creating "intersection" as a namespace would be a whole lot easier and cleaner. Note that the main problem is (and has always been) the potential performance hit of a sparse intersection between two very large categories. It can take multiple CPU seconds (!) to figure out which members of category A are also in category B (worst case is two very large categories with very little intersection). Allowing individual users to invoke operations that make the DB server(s) consume multiple CPU seconds at wikipedia.org is simply not acceptable (invites a denial of service attack). -- Rick Block (talk) 05:20, 27 February 2008 (UTC)
I guess I am not too hung up on the exact syntax. The currently suggested one seems to be going too far away from what a category looks like, and I think that if set logic on categories is going to be successful then it needs to look and feel like a category all the time.
Yep. Performance could be an issue. Perhaps category set logic is only routinely processed by bots at certain times and they can check for malsetups and the only people who can dynamically reevaluate a specific category set logic expression are fully fledged administrators . . . ?

Pee Tern (talk) 10:01, 27 February 2008 (UTC)

A non-wikipedia example

It seems important to note that this idea is important beyond the particular wiki implementation that we call Wikipedia, and would be incredibly useful to anyone using the underlying software. A case in point is a project that I'm working on now. We are using MediaWiki to organize and categorize all of the projects that my organization is working on. We want to browse by multiple categories, and also by the intersection of those categories. Example: Show me all the present projects. Show me all the future projects. Show me all the projects in the IT department. Show me all the past IT department projects. We chose a wiki over a formal database since it was so easy to use and flexible; it would be really wonderful if it could also provide more powerful reporting functions. Consider this a vote in favor of devising a scheme for allowing for union and intersection functions that will make browsing much more powerful. -- mike roy, kenyon college, gambier, OH, USA —Preceding unsigned comment added by 76.181.107.71 (talk) 04:54, 27 February 2008 (UTC)

The main branch of MediaWiki development is typically focused on the needs of wikipedia.org, however (as you note) MediaWiki is actually separate from Wikipedia and there are extensions that can be deployed on sites other than wikipedia.org. I think category intersection isn't being pursued by the developers basically because of potential performance impacts at wikipedia.org (I think several of the suggestions here could clearly be implemented). If I ever end up with a significant amount of time to devote to this (not likely in the short term), I may actually implement one of these. I'd invite anyone with development skills to consider implementing one of these suggestions. -- Rick Block (talk) 05:37, 27 February 2008 (UTC)
I know zip about the "backend"s of wiki*, either from a technical perspective or a protocol perspective. If no one gets to this in the next few months, I might have some time after midyear to do a bit more than just muse about it. Where would I start? Pee Tern (talk) 10:08, 27 February 2008 (UTC)
Start at the MediaWiki homepage, http://www.mediawiki.org/wiki/MediaWiki (it's written in PHP, with an SQL database backend). -- Rick Block (talk) 14:48, 27 February 2008 (UTC)

Category Union

I posted this to the Wikitech list...

With the advent of __HIDDENCAT__, I've been wondering about using hidden categories to create indexes. My initial hope with Wikipedia was that we could reorganize categories so that categories could function as broad indexes of single attributes such as "People", "Films", "Bridges", etc... and hide all the intersection categories of parents. Later, if and when category intersection was implemented, all the hidden categories would no longer be needed. However, implementing major changes seems to be near impossible in a project as large and set in its ways as Wikipedia. There is just too much resistance to change. If category intersection was implemented there would be a compelling technical reason to make the change, but short of that upgrade, it seems like a very difficult -- if not impossible -- sell.

It really bothers me (and others, especially librarians), that Wikipedia is not indexed. You cannot find a master index of people, places, books, films, etc... To find anything you have to know in advance where it is subcategorized. This only works if you know where to browse, and it is your desire to only browse in a small well-defined place. One of the big joys of libraries is the ability to find things you didn't know about in broad swaths of knowledge. This ability is often lacking in Wikipedia because categories are constantly broken into smaller pieces. For example, if I want to browse through the bridges in Europe, I have to look at a category for each country separately, and in some countries (like the UK) I have to look at one for each county. It is just too difficult and time-consuming a task to be a pleasurable leisurely browse.

So I've been thinking of alternative approaches. One possibility is to use hidden categories to create index categories. For instance, Category:Index-Films could contain all films, Category:Index-People could contain all people, etc... However, this would be difficult to maintain because the categories would be hidden, and it would take a tremendous amount of work to populate these categories. It seems crazy to have people doing all the mindless busywork necessary to create categories like these. That is why we have computers.

This is where developers come in...

I'm wondering about creating a new namespace, called (you guessed it) INDEX. Any category of people could be put in an index by adding Index:People on the category page. The "People" INDEX page, into which the category gets put, would have links to all the articles and subcategories from the categories in the INDEX. The contents of the subcategories of those categories would NOT be added automatically. Each would have to be manually added to the index if appropriate. Just like a category there would be text that could be edited for each INDEX page. So in essence, an INDEX is a way to do category unions. This would be much, much easier than trying to create and maintain these indexes manually using categories.

It would be great if an INDEX page could be viewed two different ways (and easily switched). The first way would look similar to current categories, showing a category tree at the top, and all the articles below arranged alphabetically. It would also be great to see categories viewed hierarchically, like an index in a book. So the categories would be listed alphabetically and then all the subcategories and articles in the categories would be listed together alphabetically and indented. The categories could be differentiated by either making them bold, italic, or by labeling them as categories. If the subcategories have also been included in the index, their contents would also appear indented in one more level (this could be closed at first and opened using a "+", the same way category trees look. Users might also be able to set the default number of levels that appear -- perhaps two?).

I don't think there is any need to be able to add anything but categories to an INDEX. Adding anything else would probably make it harder to maintain the INDEX, and would probably confuse newbies. Of course, you should be able to create a link to an index page by typing [[:Index:People|Index of people]].

If you think this idea has merit and is a possibility, would it be difficult to implement? It has long been my understanding that category unions would be much less server intensive than category intersections. Perhaps each INDEX display process could be done dynamically?

Thanks, -- SamuelWantman 08:38, 29 February 2008 (UTC)

From the initial response from Symetrical, it looks like this has MySQL limitation problems as well... -- SamuelWantman 08:38, 29 February 2008 (UTC)
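
As a rough illustration of the INDEX idea described above (a union of the categories tagged into the index, without pulling in the contents of their subcategories), here is a minimal Python sketch. The function name `build_index`, its arguments and the sample data are all hypothetical; nothing here reflects how MediaWiki actually stores categories.

```python
def build_index(index_members, category_pages, category_subcats):
    """Union the direct contents of every category tagged into an index.

    index_members    -- categories that carry the (hypothetical) Index:X tag
    category_pages   -- category name -> set of article titles directly in it
    category_subcats -- category name -> list of direct subcategory names
    """
    articles = set()
    subcats = set()
    for cat in index_members:
        # Only direct members are added; a subcategory's own articles appear
        # only if that subcategory has itself been tagged into the index.
        articles |= category_pages.get(cat, set())
        subcats |= set(category_subcats.get(cat, ()))
    return sorted(articles), sorted(subcats)


# Hypothetical sample data.
category_pages = {
    "Film directors": {"Akira Kurosawa", "Agnes Varda"},
    "Actors": {"Laurence Fishburne"},
    "American film directors": {"Orson Welles"},
}
category_subcats = {"Film directors": ["American film directors"]}

articles, subcats = build_index(
    ["Film directors", "Actors"], category_pages, category_subcats
)
```

In this example "American film directors" is listed as a subcategory of the index, but Orson Welles is not listed as an article, because that subcategory was not itself added to the index.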

Property categories and other thoughts

I've noticed an occasional convention on English wikipedia that Category:Cities (plural) is for things that are cities, while Category:City (singular) is for stuff about the topic "City". -- Skierpage (talk) 05:18, 16 April 2008 (UTC)

Database memory X Server performance/ Wait time Trade-off

I've read most of the project page about the category intersection project, but I see that it lacks a mid-term step. It gives an example that intersects 'American People' with 'Film Directors'. I think this could be a bit overwhelming to do 'on the fly', as the page suggests, since both categories are relatively large, 'American People' being especially huge. This kind of approach saves the most database memory for category pages, as it would spare one stored category: you would just 'request' a fusion of categories while navigating. But isn't this case a little extreme? A more convenient solution, IMO, would be to keep the biggest intersection categories (saving performance) while allowing only small categories to be intersected. This way we would, over a transition period, keep categories as they are: no massive relocation in Wikipedia's core structure and no extra work for users; instead, just one new possibility: intersecting two or more categories on request (possibly with a system like the one mentioned on the project page). BUT (and here's the change) you may only intersect categories with under X articles. This way we would keep the major intersections (they would work as normal intersections, only they would already be stored in the DB), while removing the smaller unnecessary ones (many were removed for lacking notability, which wouldn't happen anymore). Let me give one example. Someone is looking for a real-time strategy game, but only platform ones, specifically PS2. Also, he wants the ones released in 2006.

So, according to what was proposed on the project page, we would have this:

 Real-time strategy games - PS2 games - Games released in 2006

You may notice that all of these categories are huge, so making such an intersection would cost the server time and load. Now look at my proposal:

 Real-time strategy games - PS2 games released in 2006 (1) OR
 Real-time strategy games - PS2 games - US Platform Games released in 2006 (2)

Notice that the categories mentioned in suggestion 1 already exist, and both are of medium size. So here intersection comes as a very useful and not very server-expensive feature, removing the need for categories of uncertain notability (PS2 real-time strategy games of 2006, in the example), while keeping most of the categories we have now, with a gradual deletion of the already existing lesser intersections. (And by doing so decreasing database memory use a bit, while having an occasionally used intersection tool.) Frequent intersections could be saved as a category by the server, but they would still be seen as a regular intersection.

Feel free to add anything as you please; or to disagree. --Almighty11 (talk) 02:20, 4 March 2008 (UTC)
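
The "only intersect categories with under X articles" rule can be sketched in a few lines of Python. This is just an illustration of the proposal under stated assumptions: the limit value, the exception name `CategoryTooLarge`, the function name and the sample page titles are all hypothetical.

```python
class CategoryTooLarge(Exception):
    """Raised when a category is too big to be intersected on request."""


def dynamic_intersection(categories, names, limit):
    """Intersect the named categories, refusing any with `limit` or more pages.

    categories -- category name -> set of page titles
    names      -- category names requested by the reader
    limit      -- the "X" from the proposal (hypothetical value chosen by the site)
    """
    if not names:
        raise ValueError("at least one category is required")
    for name in names:
        if len(categories[name]) >= limit:
            raise CategoryTooLarge(name)
    result = set(categories[names[0]])
    for name in names[1:]:
        result &= categories[name]
    return result


# Hypothetical sample data: two medium categories and one large one.
categories = {
    "Real-time strategy games": {"Game A", "Game B", "Game C"},
    "PS2 games released in 2006": {"Game B", "Game C", "Game D"},
    "PS2 games": {"Game B", "Game C", "Game D", "Game E", "Game F", "Game G"},
}

allowed = dynamic_intersection(
    categories,
    ["Real-time strategy games", "PS2 games released in 2006"],
    limit=5,
)
```

With a limit of 5, the request in suggestion (1) goes through, while a request that includes the large "PS2 games" category is refused and would have to rely on a stored intersection category instead.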

The main benefit of category intersection is that it allows users to intersect categories in ways that just are not possible in the current system. For example, intersecting Disasters, Events in China, and Events in 1935, or Films from 1963 and Comedy films. -- SamuelWantman 19:51, 8 March 2008 (UTC)
Forget about performance. Category intersection is a fast operation. There's a sort-based algorithm where you can do category intersection in 'n log(n)' operations or better, where n is the number of articles in the biggest category. This is what Google does all the time: when you type in search terms, you're searching for the intersection of those search terms.- (User) Wolfkeeper (Talk) 23:36, 8 August 2009 (UTC)
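
For readers unfamiliar with the sort-based approach mentioned above, here is a minimal Python sketch: sort both membership lists (the n log(n) part), then walk them in step with two pointers (a linear pass). The function names are hypothetical, and this makes no claim about what MediaWiki, MySQL or any search engine actually does internally.

```python
def intersect_sorted(a, b):
    """Intersect two sorted lists of unique page ids with a two-pointer merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])   # page is in both categories
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1             # advance whichever list is behind
        else:
            j += 1
    return out


def intersect_categories(a, b):
    """Sort both membership lists, then merge them in a single linear pass."""
    return intersect_sorted(sorted(set(a)), sorted(set(b)))
```

If the membership lists are already kept in sorted order (for example by an index on page id), the sorting step disappears and only the linear merge remains.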

Redundancy and hidden categories

First, I'd like to know how this proposal relates to Wikipedia talk:Categorization#Hidden categories. I think hidden categories are a tremendously hideous idea, though the intersection is very interesting.

Second, the example listed is redundant. The example shown is that Laurence Fishburne's categories would be

Categories: 1961 births | American people | People of African descent | Actors |
Child actors | Film actors | Television actors | Living people |
People from Georgia (US State) | People from Augusta, Georgia | Tony Award winners

However, this is redundant. All People from Augusta, Georgia are consequentially People from Georgia (US State), and all Child actors are consequentially Actors. Surely you can easily see the redundancy. Often I will remove redundant categories such as these from articles, and I refuse to include a main category when an article is already in its subcategory. I like the checkbox format to intersect categories, but there must be a way to do it without putting an article in both a subcategory and the main category. Reywas92Talk 19:05, 8 March 2008 (UTC)

First: This proposal predates the hidden category proposal. There is widespread support for implementing some sort of category intersection by both developers and long-time editors. The reason it hasn't happened yet is technical, involving managing the server load that will result. Once implemented, intersection categories will most likely be deleted because they will be able to be created dynamically. If the hidden category proposal were implemented, the result would look very similar to this category intersection proposal.
Second: It is the removal of "redundant" categories that is one of the biggest problems with the current categorization system. I think you are thinking of categories as a way of classifying articles, so any additional categorization in the same taxonomy would be redundant. However, classifying articles is just one of the functions of categories, and in my view, not the most important one. Categories help users browse through similar articles. When you remove a redundant categorization, you make it much harder to browse through similar articles. To use your example of child actors, when you remove the redundancy of including them in the parent category, you make the parent much less useful because a large portion of the members will be missing. This case is the reason why we have WP:SUBCAT -- a guideline NOT to remove the redundancy which I hope you will read. Just because smaller subunits are created does not mean that the parent category is no longer useful for browsing, or that the children will be more useful than the parents.
Category intersection requires some of these redundancies, and their necessity is described in the proposal. To be redundant, a category must be fully described by the intersection of two other categories. In the case of "Child actors", what two categories would you intersect to have a result that only includes child actors? You could not intersect "Children" and "Actors" (this would result in people who are currently children and actors, but not former child actors), and you could not intersect "Child performers" and "Actors" (this would mistakenly include children who were other types of performers and became actors later in life). See the problem? -- SamuelWantman 19:30, 8 March 2008 (UTC)
In most instances, I still don't think both the parent and child categories should be listed, but fine. About hidden categories again, if this is implemented, then hidden cats for normal categories won't be needed, right? I like how it hides maintenance categories, but the rest must be shown. I didn't read everything, but Intersection seems simpler than hidden categories. Reywas92Talk 20:13, 8 March 2008 (UTC)

Problem with redundancy and better proposal

The more I think about this, the more I don't like it. Above, you say that in many cases redundancy of categories would be necessary (Child actors and Actors) to allow for better intersection selections. Although in a few cases (which I disagree with) it may be preferred to have both the child and parent categories, it can really get out of hand.

For example, Warren Buffett. If Laurence Fishburne should be in both Child actors and Actors and in both People from Augusta, Georgia and People from Georgia (US State), then what about going a little further? For the most possible intersections available on one article, shouldn't Warren Buffett be categorized in People from Omaha, Nebraska, People from Harrison County, Iowa, People from Iowa, American people, North American people, AND People? If the redundancy of two lower-level cats is advocated, why not all of them? North American people DOES apply to Warren Buffett, and someone may want to intersect that. What about listing him in 1930 births, 1930s births, 20th century births, AND Births of the last 123 years? They all apply, and someone may want to intersect them, but that's just too much redundancy on the article.

Here's a better idea: Rather than having checkboxes on every page next to every category with an intersection button, why not have a special page with the tool? It could be Special:Category intersection, and it would simply have spaces for the categories to be intersected. It is much simpler than affecting every page, and is surely no harder to implement. Reywas92Talk 20:23, 10 March 2008 (UTC)

What I'd really like to see is some well defined hierarchies that are integrated into how we do category intersection. So for example, if as you say -- Warren Buffett can be categorized in People from Omaha, Nebraska, People from Harrison County, Iowa, People from Iowa, American people, North American people, and People -- Perhaps only one of these categories is displayed on his page (probably People from Omaha) but there'd be a pull down list that would have all the other layers of the hierarchy. Currently, hierarchy, intersection, and related attributes are all combined into the categorization system and all look the same -- they are all subcategories. -- SamuelWantman 23:40, 11 March 2008 (UTC)
I actually like the pull-down idea, though only if it was small. My biggest problem is big categories. Some say that large categories are okay and make searchability easier, but when there are more than 200 members, it gets ridiculous. Also, as some categories are sub-cats of many others, which one would go in the box? Many categories are not for articles, anyway. Reywas92Talk 23:47, 11 March 2008 (UTC)

Category diffusion overview

I made a simple overview of diffusion in some high level categories at Wikipedia talk:Categorization#Category diffusion overview a few days ago. Then I found this proposal which more clearly explains the benefits I was trying to express, and goes further in helping the user browse articles. Therefore I added a link from Wikipedia:Village pump (policy)#Category diffusion overview to this Wikipedia talk:Category intersection, as I believe the more editors read it the more may understand what categories can be used for ("Categories help users browse through similar articles."). -84user (talk) 03:31, 11 July 2008 (UTC)

Proof of concept implementation

I hope this is not dead yet, I just created a proof of concept implementation at

This actually does both, Category intersection and Link intersection, plus it allows specifying unions of categories and/or links prior to intersecting. Feedback welcome! --Dschwen 19:58, 11 July 2008 (UTC)

It works for me. I like it. It is even more up to date than the Wikimedia search (which takes time before it sees newly categorised articles). Let us hope the social barriers are not too high. (I mean the required changes in the mindsets of editors) -84user (talk) 14:01, 15 July 2008 (UTC)
I also did an implementation on my wiki (wikidweb.com), where there are links on every article that go to a special page showing either pages at that intersection, or "related categories". It's rough still, but a start. Roan (I think it was Roan) has some code for the current MediaWiki version that I intend to start working into something similar, and seeing how it's received. --Aerik (talk) 07:19, 12 December 2008 (UTC)
This does not work for finding "Talk:..." pages. I.e. B-class... | Unknown-importance... Tavatar (talk) 18:01, 16 September 2009 (UTC)

I like how Danbooru does tag intersections

If anyone's seen it on imageboards hosted on the Danbooru CMS, looking on a gallery of images/content filed under at least one tag shows a list (on the lower left side) of other tags which are found on those images; each "related tag" then comes with links to add or subtract such tags to narrow down the number of images.

I don't know if Danbooru-like category intersection lists, which are potentially large, can fit within category pages on MediaWiki sites like this, but finding a way to narrow such intersection lists to fit within the side of any category page could help with exposing the benefits of category intersections to a much wider audience.

Furthermore, like Danbooru's tags and tag intersections, I think that web feeds of recent changes to articles linked to the intersections should be provided for. At least, I think that such would be a suitable complement. --Toussaint (talk) 19:06, 11 April 2009 (UTC)

intersection sequences of sortable category groups

I would really like to be able to harness the power of categories to search for something like atheism activists born from 1940 to 1985. For this we'd need to have a new class of category, something like OrderedCategory which could handle and sort a category of categories based on dates for starters but also anything else that can be sorted/ordered. A flippant example: find me all the South Park episodes from season 3 to season 12. Basically it is one big enormous intersection of categories that would rely on sorted indexes in the underlying database to execute in a reasonable amount of time. Any thoughts on this? I couldn't find mention of the concept anywhere else yet. Donama (talk) 12:26, 30 June 2010 (UTC)
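
One way to picture the "ordered category" idea above is as a sorted list of (key, members) pairs, where a range query picks a contiguous slice by binary search and the result is then intersected with an ordinary category. The Python sketch below is only an illustration; the function name, the data layout and the sample names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def pages_in_range(ordered, low, high):
    """Union of the members of every category whose key lies in [low, high].

    ordered -- list of (key, set_of_pages) pairs, sorted by key
               (for example, one entry per "YYYY births" category)
    """
    keys = [key for key, _ in ordered]
    start = bisect_left(keys, low)    # first key >= low
    stop = bisect_right(keys, high)   # one past the last key <= high
    pages = set()
    for _, members in ordered[start:stop]:
        pages |= members
    return pages


# Hypothetical sample data.
births = [
    (1935, {"Person A"}),
    (1941, {"Person B"}),
    (1960, {"Person C", "Person D"}),
    (1990, {"Person E"}),
]
activists = {"Person A", "Person C", "Person E"}

# "Activists born from 1940 to 1985": range query, then a plain intersection.
matches = pages_in_range(births, 1940, 1985) & activists
```

The same shape would cover the season 3 to season 12 example, with season numbers as the sort keys instead of birth years.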

Categories added by templates

I'm not sure if I'm doing something wrong or if this feature isn't able to find articles that have categories added through templates, such as the stub-templates. An example is incategory:"1995 films" incategory:"Film stubs" should find for instance Sahara (1995 film) but doesn't. If I'm doing something wrong please let me know, and if this type of search isn't possible (yet) then perhaps it should state so on the Project page. GameOn (talk) 18:23, 4 May 2011 (UTC)