Looking at a recent article on wp:refdesk and other online services, I'd like to see a decent cross-comparison of hunch, y!, g! (historically), refdesk, &c. that addresses how well they contribute to a lasting body of reference knowledge. From a chat with psa:

hand wavy or not, as long as it's similar across the var. services

the trouble is there's nothing comparable in trad ref services
y!, g!, etc. you pick your answer to rate
and wp is collaborative
real life reference desk interactions tend to be different from the kind of random factual questions you get on the internet 
a lot of "can you help me with my paper topic" questions irl; not sure how you would quantify the "accuracy" of my answers to those

but answering that question is a solvable problem

it's tough to quantify judgment calls across services. Your answer for a 12 year old at the ref desk is not the same as for an adult necessarily, etc. etc. 

it's also tough to describe what makes a bridge stable. all sorts of bridges... if it's a question worth answering, we should take it seriously and answer it. if you take it seriously, what was once a custom service hand-tailored to every pairing of practitioner and client becomes a craft, then a science, eventually perhaps a complete body of work.

is it a question worth answering? 
reference education doesn't really help matters, at least anecdotally 
afaik the question hasn't been answered, and that answer applied, to any great effect to date
unlike bridges... and people disagree on how to measure quality.


all areas of knowledge involve formulating concepts one can use to describe a question or topic. "what sort of bridge will span this gap and provide specified support over time" is somewhat defined, but you have to bootstrap meaningful, specific language to get better answers. the original idea behind that question is more likely "how can I transport things quickly between these two areas"; then someone identifies "bridging gaps" as an awkward-to-construct but potentially much better and longer-lasting answer to the question. that process of identifying a concept cluster around bridges as a useful area in which to develop more knowledge and specific language is exactly what underlies reference work.

service metrics in Pnina's paper: "was this cited", "was the response fast", "was it accurate for the current question". the elements that would be part of a shared science of information organization, by contrast, would involve things like "was this a fundamental aspect of the {question and answer} which would be useful to other similar questions" and "what avenues might be pursued to further refine the answer over time". she touches on the latter in noting the benefits of collaborative / iterative answers, but it's not one of the performance metrics. and "a good answer" for one questioner is very different from "an answer good in fundamental ways that improve the body of knowledge around how to answer similar questions".

what is "good"? that's the crux of the problem
no matter what, you have to break quality down into whatever you think its component parts are
and analyze the answer against them
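one way to make that concrete, as a minimal sketch in Python; the component names and weights here are illustrative assumptions, not anything settled:

    # hypothetical rubric: break "quality" into named components with weights,
    # then score an answer against them. components and weights are assumptions.
    COMPONENTS = {
        "factual_accuracy": 0.4,   # is the stated fact correct?
        "relevance": 0.3,          # does it address what the questioner meant?
        "sourcing": 0.2,           # are sources cited, and are they good ones?
        "reusability": 0.1,        # would it help similar future questions?
    }

    def score_answer(ratings: dict) -> float:
        """combine per-component ratings (0.0 - 1.0) into one weighted score."""
        return sum(w * ratings.get(name, 0.0) for name, w in COMPONENTS.items())

    # a perfectly accurate but irrelevant answer is penalized on the relevance component:
    score_answer({"factual_accuracy": 1.0, "relevance": 0.1, "sourcing": 0.8, "reusability": 0.2})
    # -> 0.61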

in one case, quality inheres in each answer. in the other, it inheres in the framework of answering; the precision of metrics used to understand if a correct answer has been found; the collection of questions answerable by the aggregate result.

as long as 'reference' is viewed as an ephemeral service provided on top of something lasting, it will be hard to think of the quality of anything related to it as more than "was a single person satisfied?"

reference is fundamentally viewed as a service transaction 
example from the last page, science reference desk 
"user 1: whites of eyes: is it true that humans are the only animals to have whites eyes 
the answers I could provide to this person range from 
"yes/no with citation"... "let me show you where to look this up" ...  
"let me show you a bunch of places where you could look this up" 
(listen to their answer, refine my answer based on their response) 
e.g. if they say they are writing an animal science paper about vision, show them the animal science databases. 
as librarians, we're taught to use reference as education 
on wikipedia, the answer is more like option 2 
how do you compare these goals - info vs. underlying need?

imagine a translation process across three layers: your question, as originally framed and written; a set of answerable questions that have been deduped and considered; and a body of data providing the latest information about various topics.

a good reference system should give visitors feedback as they are asking questions: to help them match their need with known questions and topics; to help them ask new questions in the context of an existing question or topic; and then to provide an answer in the context of existing answers and information.
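a minimal sketch of those three layers as data, plus that feedback step, in Python; all class and field names are assumptions made up for illustration:

    from dataclasses import dataclass, field

    @dataclass
    class RawQuestion:            # layer 1: the question as originally framed and written
        text: str
        asker: str

    @dataclass
    class ReducedQuestion:        # layer 2: a deduped, canonical form of an answerable question
        canonical_text: str
        related: list = field(default_factory=list)   # other reduced questions it touches
        topics: list = field(default_factory=list)

    @dataclass
    class TopicData:              # layer 3: the underlying body of information
        topic: str
        articles: list = field(default_factory=list)  # article titles, datasets, or sources

    def suggest_existing(raw: RawQuestion, corpus: list) -> list:
        """crude feedback step: as the visitor types, surface reduced questions
        that share words with theirs, so they can attach to an existing
        question or topic instead of starting from scratch."""
        words = set(raw.text.lower().split())
        return [q for q in corpus if words & set(q.canonical_text.lower().split())]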

This is a system, not a transaction; it can support various transactions - either of those roles (education vs. quick need).

the refdesk provides maybe 5% of this system. A casual classification of questions and answers, as yahoo or other services provide, with basic browsing support, might provide a different 5%.

re: Pnina's metrics, 'responsiveness' is a normal service metric. could be applied to anything. one may as well throw in 'satisfaction' if you have time to survey people; it just needs measuring. and user satisfaction (service quality) should be separated from information accuracy & quality. 'assurance' --> listing sources: that should be applied to the data

right. satisfaction != accuracy or quality.   "satisficing" 
people rate inaccurate answers highly on sites like yahoo!, too 
sources are a great ex: not all sources are created equal, obviously 
I have given A+ service to students without giving them a single piece of factual information. 

and perhaps to the next layer of translation (the set of answerable questions and where they are/have been addressed; here the references should be to discussions of the topic or question). if you want to separate "accuracy in satisfying the questioner" from "accuracy of information provided" (which is tough -- I can answer every question with a perfectly accurate piece of irrelevant data; once you factor in 'relevance' to what the questioner had in mind, they are no longer independent), then you should have two separate terms for quality: quality of service, and quality / integrity of the underlying body of information. your system has to care about the same needs for knowledge that its participants do.
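keeping the two terms apart might look something like this; the field names are illustrative assumptions, and note that 'relevance' ends up on the service side, since it depends on what the questioner had in mind:

    from dataclasses import dataclass

    @dataclass
    class ServiceQuality:                    # did this transaction serve this questioner?
        responsiveness_hours: float
        questioner_satisfied: bool
        relevance: float                     # judged against the questioner's intent, 0.0 - 1.0

    @dataclass
    class InformationQuality:                # did the underlying body of information improve?
        factually_accurate: bool
        sources_cited: int
        reusable_for_similar_questions: bool

    # the same exchange gets recorded twice, once per quality term:
    service = ServiceQuality(responsiveness_hours=2.0, questioner_satisfied=True, relevance=0.9)
    info = InformationQuality(factually_accurate=True, sources_cited=2,
                              reusable_for_similar_questions=True)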

and it's easier to separate those two measures when you identify the different layers of information. the query/response process should be contributing to each part of the system:

  • certainly the layer of capturing queries and identifying them with what is already known (services like autocompleting search are an excellent example; see the sketch after this list)
  • and the layer of reduced key questions (something that wp does not currently have, and refdesk does not currently think in terms of) [but that sites like Hunch might develop terminology for]
  • and over time, the underlying information: when someone asks a question that hasn't been clearly answered, or a question is so popular that it clearly should be addressed in that form within articles on the subject, articles (or datasets, or sources used in either) are improved by the process
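as a sketch of that first (capture) layer, here is roughly what an autocompleting search over the reduced questions could do; the question list and the matching rule are toy assumptions:

    # toy capture layer: match a partially typed query against known reduced
    # questions, the way autocompleting search does. the data here is invented.
    KNOWN_QUESTIONS = [
        "do any animals other than humans have white sclera?",
        "why do humans have visible whites of the eyes?",
        "how do bridges stay stable under load?",
    ]

    def autocomplete(partial: str, known=KNOWN_QUESTIONS, limit: int = 5) -> list:
        """return known questions containing every word typed so far."""
        typed = set(partial.lower().split())
        return [q for q in known if typed <= set(q.lower().rstrip("?").split())][:limit]

    autocomplete("whites of eyes")      # -> ['why do humans have visible whites of the eyes?']
    autocomplete("white of the eye")    # -> [] ; the vocabulary gap ("sclera") shows up below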

I suppose the division into major topics counts as a tiny bit of work on the layer of 'reduced key questions'. But the process of asking should lead people to browse existing core questions, and the process of answering should involve merging and splitting submitted questions, so that the whole works towards a lasting, valuable resource. redirects are useful here.

so this is also a thesaurus/vocabulary problem; search helps 
the part where you don't know white of the eye = sclera, let alone different languages etc etc. 
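a tiny sketch of that thesaurus step, expanding a query with synonym phrases before matching; the synonym table is an invented example, not a real vocabulary:

    # expand a query with known synonym phrases before searching, so a question
    # about "whites of the eyes" can find material filed under "sclera".
    SYNONYMS = {
        "whites of the eyes": ["sclera"],
        "white of the eye": ["sclera"],
        "sclera": ["white of the eye"],
    }

    def expand(query: str) -> list:
        """return the query plus variants with synonym phrases substituted."""
        lowered = query.lower()
        variants = [lowered]
        for phrase, alts in SYNONYMS.items():
            if phrase in lowered:
                variants += [lowered.replace(phrase, alt) for alt in alts]
        return variants

    expand("is it true that only humans have whites of the eyes")
    # -> ['is it true that only humans have whites of the eyes',
    #     'is it true that only humans have sclera']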

To your example (a rough code sketch of these steps follows the list):

  • capture the original question "whites of eyes: is it true that humans are the only animals to have whites eyes "
  • identify clusters within the question that are prominent in the reduced q/a corpus: "whites of eyes", "humans ... animals"
  • link the question to existing related questions (can be done by people who don't themselves like to answer anything)
  • classify the question [into the top-7 categories; though something more like a thousand-category system could be easily remembered by participants]
  • separately: name topics that have clusters of questions around them, and classify those. so "whites of eyes" and sclera would be in a specific biology category, and "humans as a special subset of animals" would be its own. topics that are sufficiently popular should have shorthand names that can be used to talk about them [that's how you create language with the power to say meaningful things about larger transactions]
  • help the questioner revise/improve the question in terms of what information is already available [not so relevant in this case, but necessary elsewhere]
  • add a new reduced question if this one is truly new [use, say, an interface like digg's with casual pattern matching to see whether the questioner wouldn't rather add a twist to an existing q]
  • identify 'closeness' of other questions; humans can assist the casual pattern matching that search provides by saying "these other questions really are related". again, this can be done by people who would not generally answer q's directly
  • finally, the answers themselves. basically what we have today, with citations to articles. [citations outside of articles should probably be pushed down a layer, and added to articles directly, and only transcluded or called out in the answer; perhaps with a cross-article footnote that lets you say "see cite [Voss93] on this page".]
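putting the steps above together as one rough pipeline sketch; every function name, category label, and data item here is an illustrative assumption, not an existing system:

    from dataclasses import dataclass, field

    @dataclass
    class SubmittedQuestion:
        raw_text: str
        clusters: list = field(default_factory=list)    # prominent phrases, e.g. "whites of eyes"
        related: list = field(default_factory=list)     # ids of existing reduced questions
        categories: list = field(default_factory=list)  # named topics / categories
        answers: list = field(default_factory=list)     # each answer carries its citations

    def capture(text: str) -> SubmittedQuestion:
        """step 1: record the question exactly as originally asked."""
        return SubmittedQuestion(raw_text=text)

    def identify_clusters(q, known_phrases):
        """step 2: mark phrases in the question that are prominent in the reduced q/a corpus."""
        q.clusters = [p for p in known_phrases if p in q.raw_text.lower()]

    def link_related(q, reduced_corpus):
        """steps 3 and 8: link to existing reduced questions that share a cluster;
        doable by people who never answer anything themselves."""
        q.related = [qid for qid, phrases in reduced_corpus.items() if phrases & set(q.clusters)]

    def classify(q, topic_index):
        """steps 4 and 5: classify the question and its clusters into named topics."""
        q.categories = sorted({topic_index[c] for c in q.clusters if c in topic_index})

    def answer(q, text, citations):
        """step 9: attach an answer whose citations point back into the article layer."""
        q.answers.append({"text": text, "citations": citations})

    # walking the "whites of eyes" question through the pipeline:
    q = capture("whites of eyes: is it true that humans are the only animals to have whites eyes")
    identify_clusters(q, ["whites of eyes", "humans", "animals"])
    link_related(q, {"Q101-sclera-visibility": {"whites of eyes"}})
    classify(q, {"whites of eyes": "biology / eye anatomy", "humans": "humans vs. other animals"})
    answer(q, "see the cited discussion for the comparative evidence", ["[Voss93] (cross-article footnote)"])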