Talk:Inverted index

Latest comment: 7 years ago by InternetArchiveBot in topic External links modified

Maybe the content of Full inverted index should be moved here? This page could also be expanded with some small examples for a full inverted index and an inverted file index, as seen in [1] -- Nils Grimsmo 22:12, 18 November 2005 (UTC)Reply

This has now been done. -- Nils Grimsmo 20:25, 13 December 2005 (UTC)Reply

Why is it called "Inverted"

edit

I've been wondering, why is this data structure called an inverted index? After all, it's exactly like a regular index in a book, where you look up a word and find a list of pages that mention this word. It's not "inverted" in any sense from the normal definition of an index...

[2] tries to explain that

The index is "inverted" in the sense that the key value is used to find the record rather than the other way round

but I don't understand this explanation, because this is how all indices work...

I could understand a different name: "inverted text". Whereas a text (treated as a bag of words) is defined by whether or not each word in the dictionary is in it, we can "invert" this information, and define for each word whether or not each file contains it. Instead of a file structure, we have an "inverted file" structure, which allows for quicker searching of words. Could it be that the phrase "inverted index" was coined by starting with the phrase "inverted-text index" and then dropping the word text? Or something of this sort?

Or is there a different explanation for the word "inverted" that appears in the name of the concept "inverted index"?

Nyh 14:04, 26 December 2005 (UTC)Reply

I have never read anything about the naming, but I would guess that the phrase comes from the term "inverted files". As the concept of the term "file" narrowed over time, and the concept of such an index broadened, I guess some new term was needed, and "inverted index" was chosen because it was close to "inverted files". -- Nils Grimsmo 17:31, 26 December 2005 (UTC)Reply
edit

The link to "inverted index for search engine" does not appear to be related or useful. Perhaps it should be removed. --131.107.65.116 23:08, 8 September 2006 (UTC)Reply

Current state of the article

edit

I think this article is rather unstructured and messy. It looks like to separate articles concatenated. A full rewrite would perhaps be nice? Klem fra Nils Grimsmo 20:55, 18 October 2006 (UTC)Reply

inverted list/file database

edit

This article should also mention, even before the webcrawling stuff, that the technique was used in some database management systems such as ADABAS, DATACOM/DB, and Infodata's Inquire. According to Pratt, Philip J. (1987). DATABASE SYSTEMS: Management and Design. Boston: Boyd & Fraser Publishing Company. pp. p. 465. ISBN 0-87835-227-9. {{cite book}}: |pages= has extra text (help); Unknown parameter |coauthors= ignored (|author= suggested) (help), many Relational database management systems use this structure for physically storing secondary keys.

I've gathered some more stuff:

http://www.pcmag.com/encyclopedia_term/0,2542,t=inverted+list&i=45332,00.asp "inverted list" points to "inverted file", the def of which follows:

« In data management, a file that is indexed on many of the attributes of the data itself. For example, in an employee file, an index could be maintained for all secretaries, another for managers. It is faster to search the indexes than every record. Also known as "inverted lists," inverted file indexes use a lot of disk space; searching is fast, updating is slower. »

And:

  • Zhang, Hong-Chao (1994). "Impact of computer integrated manufacturing". Computerized Manufacturing Process Planning Systems. Springer. pp. p. 55. ISBN 0-412-41300-0. An inverted list database is similar to a relational database, but a relational database is at a low level of abstraction, at which the stored tables themselves, and also certain access paths to those stored tables, are directly visible to the user. In general, the inverted list database has the following three significant differences from the relational database:
    * The rows of an inverted list table are considered to be ordered in some physical sequence
    * An ordering may also be defined for the total database in the form of a database sequence
    * For a given table, any number of search keys can be defined in an arbitrary field or combination of fields
    {{cite book}}: |pages= has extra text (help); Unknown parameter |coauthors= ignored (|author= suggested) (help)

Then:

  • Kaufmann, Helmut (1995). "Text Search Using Database Systems Revisited -- Some Experiments". In Carole Goble & John Keane (eds.) (ed.). Advances in Databases: 13th British National Conference on Databases (BNCOD 13) July 12 - 14, 1995. Proceedings (Lecture Notes in Computer Science). Manchester, United Kingdom: Springer. pp. p. 206. ISBN 3540601007. ...The set of all document identifiers which contain a certain descriptor is called an inverted list or synonymously a posting list. The individual identifiers are called postings. The cardinality of a posting list is called the descriptor's cardinal frequency. Documents can be retrieved using a boolean retrieval method or a non-boolean (Salton 1975). While the former delivers a set of matching documents, the latter delivers an ordered ranking of documents, from which some prefix must be selected as the query's final result. In the current study, we only address boolean retrieval.
    2.1 Indexing Documents using Inverted Lists
    Today, inverted lists are the de facto standard for word-based indexing of full text documents [[[#Knu73|Knuth 1973]];Harman 1992], which support fast search for individual words in textual documents. In principle, an inverted list works as follows:
    Assume each document is represented as a tuple of a relation
    Document(Document-ID1, Content)

    Each document is assigned a number of keywords (e.g. by extracting the individual words) describing the contents of the document. Hence, a document is approximated by a tuple of a relation with a set-valued attribute Descriptor
    DocumentApproximation(Document-ID, {Descriptor})

    When we are searching for documents, we are interested in all documents containing a certain descriptor. Therefore, we introduce a relation with a set-valued attribute Document-ID
    InvertedList(Desriptor, {Document-ID})

    Here, we store for each descriptor all references to those documents which contain this descriptor, i.e. its posting list.
    Notes:
    #In the following we assume that all documents are identified by a unique number (1, 2, 3, ...)
    {{cite book}}: |editor= has generic name (help); |pages= has extra text (help); Unknown parameter |coauthors= ignored (|author= suggested) (help)

(with references to:


--Jerome Potts (talk) 09:00, 8 January 2008 (UTC)Reply

Textbook cited without other change?

edit

Why was ISBN 978-0262026512 added to the bibliography without any other edit or association to it within the article body? If it were a newly-cited work for the article, I would have expected an edit or piece of information in the article that was sourced from it. Otherwise, wouldn't simply be a "see also" type of reference? --Michael Mol (talk) 16:20, 7 September 2010 (UTC)Reply

edit

I think this article should not have an external link to ctags. First, ctags's site does not provide any information relevant to this article, so WP:ELNO#1 applies. Second, there are several similar code indexing tools: etags, cscope, GNU Global, OpenGrok, LXR to name a few; it isn't clear to me why ctags is to be preferred here. -- X7q (talk) 23:42, 27 August 2011 (UTC)Reply

Example

edit

Why are we using an intersection instead of a union? — Preceding unsigned comment added by Dappermuis (talkcontribs) 01:15, 27 September 2012 (UTC)Reply

edit

Hello fellow Wikipedians,

I have just modified one external link on Inverted index. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 5 June 2024).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—InternetArchiveBot (Report bug) 23:26, 15 November 2017 (UTC)Reply