Talk:Inverted index
This article is rated Start-class on Wikipedia's content assessment scale. It is of interest to the following WikiProjects: | ||||||||||||||||||
|
Maybe the content of Full inverted index should be moved here? This page could also be expanded with some small examples for a full inverted index and an inverted file index, as seen in [1] -- Nils Grimsmo 22:12, 18 November 2005 (UTC)
- This has now been done. -- Nils Grimsmo 20:25, 13 December 2005 (UTC)
Why is it called "Inverted"
editI've been wondering, why is this data structure called an inverted index? After all, it's exactly like a regular index in a book, where you look up a word and find a list of pages that mention this word. It's not "inverted" in any sense from the normal definition of an index...
[2] tries to explain that
- The index is "inverted" in the sense that the key value is used to find the record rather than the other way round
but I don't understand this explanation, because this is how all indices work...
I could understand a different name: "inverted text". Whereas a text (treated as a bag of words) is defined by whether or not each word in the dictionary is in it, we can "invert" this information, and define for each word whether or not each file contains it. Instead of a file structure, we have an "inverted file" structure, which allows for quicker searching of words. Could it be that the phrase "inverted index" was coined by starting with the phrase "inverted-text index" and then dropping the word text? Or something of this sort?
Or is there a different explanation for the word "inverted" that appears in the name of the concept "inverted index"?
Nyh 14:04, 26 December 2005 (UTC)
- I have never read anything about the naming, but I would guess that the phrase comes from the term "inverted files". As the concept of the term "file" narrowed over time, and the concept of such an index broadened, I guess some new term was needed, and "inverted index" was chosen because it was close to "inverted files". -- Nils Grimsmo 17:31, 26 December 2005 (UTC)
- Consider the contents of the related article on search engine indexing, where I describe one reason for why it is called inverted. Josh Froelich 16:37, 15 December 2006 (UTC)
External links
editThe link to "inverted index for search engine" does not appear to be related or useful. Perhaps it should be removed. --131.107.65.116 23:08, 8 September 2006 (UTC)
Current state of the article
editI think this article is rather unstructured and messy. It looks like to separate articles concatenated. A full rewrite would perhaps be nice? Klem fra Nils Grimsmo 20:55, 18 October 2006 (UTC)
- Agree, it looks more like an excerpt from a textbook. I was planning to propose a rewrite when time permits (which will be after the weekend). --airborne 20:46, 19 October 2006 (UTC)
inverted list/file database
editThis article should also mention, even before the webcrawling stuff, that the technique was used in some database management systems such as ADABAS, DATACOM/DB, and Infodata's Inquire. According to Pratt, Philip J. (1987). DATABASE SYSTEMS: Management and Design. Boston: Boyd & Fraser Publishing Company. pp. p. 465. ISBN 0-87835-227-9. {{cite book}}
: |pages=
has extra text (help); Unknown parameter |coauthors=
ignored (|author=
suggested) (help), many Relational database management systems use this structure for physically storing secondary keys.
I've gathered some more stuff:
http://www.pcmag.com/encyclopedia_term/0,2542,t=inverted+list&i=45332,00.asp "inverted list" points to "inverted file", the def of which follows:
« In data management, a file that is indexed on many of the attributes of the data itself. For example, in an employee file, an index could be maintained for all secretaries, another for managers. It is faster to search the indexes than every record. Also known as "inverted lists," inverted file indexes use a lot of disk space; searching is fast, updating is slower. »
And:
- Zhang, Hong-Chao (1994). "Impact of computer integrated manufacturing". Computerized Manufacturing Process Planning Systems. Springer. pp. p. 55. ISBN 0-412-41300-0.
An inverted list database is similar to a relational database, but a relational database is at a low level of abstraction, at which the stored tables themselves, and also certain access paths to those stored tables, are directly visible to the user. In general, the inverted list database has the following three significant differences from the relational database:
* The rows of an inverted list table are considered to be ordered in some physical sequence
* An ordering may also be defined for the total database in the form of a database sequence
* For a given table, any number of search keys can be defined in an arbitrary field or combination of fields{{cite book}}
:|pages=
has extra text (help); Unknown parameter|coauthors=
ignored (|author=
suggested) (help)
Then:
- Kaufmann, Helmut (1995). "Text Search Using Database Systems Revisited -- Some Experiments". In Carole Goble & John Keane (eds.) (ed.). Advances in Databases: 13th British National Conference on Databases (BNCOD 13) July 12 - 14, 1995. Proceedings (Lecture Notes in Computer Science). Manchester, United Kingdom: Springer. pp. p. 206. ISBN 3540601007.
...The set of all document identifiers which contain a certain descriptor is called an inverted list or synonymously a posting list. The individual identifiers are called postings. The cardinality of a posting list is called the descriptor's cardinal frequency. Documents can be retrieved using a boolean retrieval method or a non-boolean (Salton 1975) . While the former delivers a set of matching documents, the latter delivers an ordered ranking of documents, from which some prefix must be selected as the query's final result. In the current study, we only address boolean retrieval.
2.1 Indexing Documents using Inverted Lists
Today, inverted lists are the de facto standard for word-based indexing of full text documents [[[#Knu73|Knuth 1973]] ;Harman 1992 ], which support fast search for individual words in textual documents. In principle, an inverted list works as follows:
Assume each document is represented as a tuple of a relationDocument(Document-ID1, Content)
Each document is assigned a number of keywords (e.g. by extracting the individual words) describing the contents of the document. Hence, a document is approximated by a tuple of a relation with a set-valued attribute DescriptorDocumentApproximation(Document-ID, {Descriptor})
When we are searching for documents, we are interested in all documents containing a certain descriptor. Therefore, we introduce a relation with a set-valued attribute Document-IDInvertedList(Desriptor, {Document-ID})
Here, we store for each descriptor all references to those documents which contain this descriptor, i.e. its posting list.
Notes:
#In the following we assume that all documents are identified by a unique number (1, 2, 3, ...){{cite book}}
:|editor=
has generic name (help);|pages=
has extra text (help); Unknown parameter|coauthors=
ignored (|author=
suggested) (help)
(with references to:
- Harman, D. (1992). "Inverted Files". Information Retrieval (Data Structures and Algorithms). Prentice Hall. pp. pp. 28-43.
{{cite book}}
:|pages=
has extra text (help); Unknown parameter|Ref=
ignored (|ref=
suggested) (help) - Knuth, D. E. (1973). The Art of Computer Programming. Reading: Addison-Wesley.
{{cite book}}
: Cite has empty unknown parameter:|coauthors=
(help); Unknown parameter|Ref=
ignored (|ref=
suggested) (help) - Salton, G. (1975). Dynamic Information and Library Processing. Reading: Prentice Hall.
{{cite book}}
: Cite has empty unknown parameter:|coauthors=
(help); Unknown parameter|Ref=
ignored (|ref=
suggested) (help)
Textbook cited without other change?
editWhy was ISBN 978-0262026512 added to the bibliography without any other edit or association to it within the article body? If it were a newly-cited work for the article, I would have expected an edit or piece of information in the article that was sourced from it. Otherwise, wouldn't simply be a "see also" type of reference? --Michael Mol (talk) 16:20, 7 September 2010 (UTC)
External link to exuberant ctags
editI think this article should not have an external link to ctags. First, ctags's site does not provide any information relevant to this article, so WP:ELNO#1 applies. Second, there are several similar code indexing tools: etags, cscope, GNU Global, OpenGrok, LXR to name a few; it isn't clear to me why ctags is to be preferred here. -- X7q (talk) 23:42, 27 August 2011 (UTC)
Example
editWhy are we using an intersection instead of a union? — Preceding unsigned comment added by Dappermuis (talk • contribs) 01:15, 27 September 2012 (UTC)
External links modified
editHello fellow Wikipedians,
I have just modified one external link on Inverted index. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:
- Added archive https://web.archive.org/web/20101203074412/http://www.vision.caltech.edu/malaa/software/research/image-search/ to http://www.vision.caltech.edu/malaa/software/research/image-search/
When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.
This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}}
(last update: 5 June 2024).
- If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
- If you found an error with any archives or the URLs themselves, you can fix them with this tool.
Cheers.—InternetArchiveBot (Report bug) 23:26, 15 November 2017 (UTC)