Talk:Vector database
This article is rated Start-class on Wikipedia's content assessment scale.
Why was the original content removed from the article?
It seemed that the history at:
- https://en.wikipedia.org/w/index.php?title=Vector_database&oldid=1159882107
- https://en.wikipedia.org/w/index.php?title=Vector_database&oldid=1176100870
was a good start for the article, but it was stripped down to a one-sentence introduction? Oneequalsequalsone (talk) 23:02, 4 October 2023 (UTC)
- The first one uses bad styling for what is needed by Wikipedia, the second one does not use WP:RS. I propose that we start rewriting it based on good sources. PhotographyEdits (talk) 13:05, 5 October 2023 (UTC)
- The first one was adequate for a stub article. Proposals of rewriting in the future are no substitute for even mediocre content; the article as it stands is worse than useless. 2001:5A8:460E:5E00:79EF:AD12:BBB8:D535 (talk) 22:29, 26 October 2023 (UTC)
List notability
I just stumbled into this article and noticed the "list of vector databases" mostly consists of entries with only primary sources. Entries here really should be established as notable for inclusion. A good rule of thumb on this is having at least one reliable secondary source discussing it (and not in passing). For more information, see NLIST. StereoFolic (talk) 21:25, 29 October 2023 (UTC)
- Okay, I've finished going through them and verifying notability of existing entries. The list should be in better shape now, but further eyes are appreciated. The list is also definitely missing a bunch of entries. StereoFolic (talk) 21:53, 29 October 2023 (UTC)
- Thanks, appreciated! PhotographyEdits (talk) 00:20, 30 October 2023 (UTC)
Add OpenSearch
This edit request by an editor with a conflict of interest has now been answered.
Here is a draft entry for OpenSearch including both primary and secondary sources. Since I work on OpenSearch at AWS, I am mindful of WP:COI, and would appreciate it if someone else would insert this. --Macrakis (talk) 21:07, 7 February 2024 (UTC)
Name | License |
---|---|
OpenSearch[1][2][3] | Apache License 2.0[4] |
- ^ "Using OpenSearch as a Vector Database". OpenSearch.org. 2023-08-02. Retrieved 2024-02-07.
- ^ Pan, James Jie; Wang, Jianguo; Li, Guoliang (2023-10-21), Survey of Vector Database Management Systems, doi:10.48550/arXiv.2310.14021, retrieved 2024-02-07
- ^ "AWS debuts new AI-powered data management and analysis tools". SiliconANGLE. 2023-07-26. Retrieved 2024-02-07.
- ^ "OpenSearch license". github.
@StereoFolic and PhotographyEdits: Please take a look at the above proposed edit. Thanks, --Macrakis (talk) 21:20, 7 February 2024 (UTC)
By the way, Zilliz's managed service offering Cardinal supposedly now includes proprietary enhancements to Milvus, so it probably should have a separate line item again. And (COI alert!) perhaps the Amazon OpenSearch Service should also have a line item as the managed service version of OpenSearch. --Macrakis (talk) 23:08, 7 February 2024 (UTC)
- I've just added the edit, thanks for going through the proper disclosure and request process. I'm hesitant about adding entries for hosted options, since to me this list is more about database systems themselves, not vendors offering hosted versions of them (even with fairly minor changes). Anyone interested in finding hosted services for these databases can easily find that information with an online search. StereoFolic (talk) 15:42, 10 February 2024 (UTC)
- "Anyone interested in finding hosted services for these databases can easily find that information with an online search." Well, there is also such a list for VPN service. It just depends on the coverage in third party reliable sources. If there is extensive coverage, we should definitely cover it in Wikipedia too. But I don't think it is there yet. PhotographyEdits (talk) 22:09, 11 February 2024 (UTC)
Faiss
Dtunkelang, I just reviewed your source for Faiss being a database, not a library, but I'm still unsure about this one. The blog post lists it in its '5 best vector databases', but the description provided within simply describes it as a library, and in any case I'm unsure this blog would qualify as a reliable source. We could potentially consider the library an in-memory database, but it seems like a stretch and a description the developers themselves don't take on. StereoFolic (talk) 12:02, 14 February 2024 (UTC)
- I am unpersuaded, since I don't see any way in which Faiss fails to satisfy the definition of a vector database in the post. What criterion, in your view, does it fail to meet? Dtunkelang (talk) 18:53, 14 March 2024 (UTC)
- My understanding is that databases, pretty much by definition, are responsible for storing data. If I understand correctly, Faiss doesn't store data, it only queries data provided to it. If we consider any library that loads vectors and manipulates them to be a database, we get lots of obviously wrong labels like calling tensorflow and numpy vector databases. StereoFolic (talk) 22:31, 14 March 2024 (UTC)
- Faiss stores data -- you have to populate a Faiss index in order to query it. I've stored data in a Faiss index myself -- that's the only way to use it. From https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/: "Faiss (both C++ and Python) provides instances of Index. Each Index subclass implements an indexing structure, to which vectors can be added and searched." Dtunkelang (talk) 18:32, 17 March 2024 (UTC)
- That's helpful context, but even still I don't see why we should be overriding the description the developers themselves give it. I have seen no evidence of the Faiss developers describing it as a database. It's always a "library". Databases typically have much broader functionality, like server interfaces, transactional semantics, consistency measures, and replication methods. Meanwhile I can find multiple sources (1 and 2) explicitly calling out that Faiss is a library, not a database. I don't see why one listicle including Faiss (but, again, in prose only describing it as a library!) should override this. StereoFolic (talk) 19:22, 17 March 2024 (UTC)
- I think that it would be useful for this discussion to define what the difference between a database and a library is. I'm currently leaning more towards agreeing with StereoFolic, since the sources seem to agree on calling it a library and not a database, including the source that Dtunkelang provided. However, I don't see why it couldn't be considered both a database and a library if there were sources that called it a database. --MtPenguinMonster (talk) 08:42, 23 March 2024 (UTC)
- It's difficult to create a comprehensive definition since the concepts have a lot of overlap and exceptions. Most databases are standalone server applications, but then there are embedded databases. Most databases primarily read and write data on disk, but then there are in-memory databases. In general, getting perfect definitions of things is often not possible, so extrapolating imperfect definitions is risky, so I think we should defer to the labels used by reliable sources. In this case I think the sources overwhelmingly lean toward "library". StereoFolic (talk) 13:34, 23 March 2024 (UTC)
- I agree that FAISS calls itself a library. I just don't see that being a library is exclusive of being a database. For example, Berkeley DB is considered both a database and library, even on Wikipedia. I'll bow to the majority if I'm a lone dissenter here, but I do think that many developers looking at this page would expect and benefit from FAISS's inclusion. And, for whatever it's worth, I have no personal stake -- I'm not even remotely associated with Meta or the project. Dtunkelang (talk) 20:39, 24 March 2024 (UTC)
- If we are going to distinguish "vector databases" from "ANN libraries", then do we at least agree that there should be another entry that lists tools like FAISS, Annoy, and other notable ANN tools (cf. https://ann-benchmarks.com/)? I was surprised not to see FAISS, which as far as I know is the most notable tool in this space, listed here, and I believe that is a disservice to Wikipedia users. Again, I don't have a personal agenda. I am just trying to improve this page and technical Wikipedia generally. Dtunkelang (talk) 20:48, 24 March 2024 (UTC)
- Yes, I definitely think Faiss fits in somewhere on Wikipedia. I think a list of vector search libraries would meet WP:NLIST. That list would certainly fit in the "See also" section here. Faiss itself probably qualifies for a standalone article of its own too. I just don't think it belongs in a list of databases is all. If you agree, I will let you remove it here so the consensus is clear. Cheers - StereoFolic (talk) 02:15, 25 March 2024 (UTC)
- I agree with the idea. I propose we use this space to establish a clear distinction between a "vector database" and an "ANN library" -- possibly establishing better names or at least synonyms for these -- and then coming up with lists of notable examples of each. That way there isn't too much drama when we introduce the new page and update this one. Though I'm sure we will see a bunch of incremental updates. After all, this is Wikipedia. :-) Dtunkelang (talk) 21:07, 30 March 2024 (UTC)
- How do you and other folks here feel about this explanation of the distinction between vector databases and vector libraries? https://weaviate.io/blog/vector-library-vs-vector-database It's hardly an impartial source, but I'm not sure that we will find an impartial source that defines these terms. Dtunkelang (talk) 13:56, 2 April 2024 (UTC)
- I think that's a decent enough source to use it in an article. Definitions are hard, especially for new concepts, so I still don't think we should be trying to come up with a universally applicable definition which we then apply ourselves to distinguish things like this. I think we should prefer to categorize things by the way they describe themselves, or the way most reliable sources do. StereoFolic (talk) 00:30, 5 April 2024 (UTC)
- I've gone ahead and removed FAISS. Hopefully one of us or someone else will soon get around to creating an entry for vector libraries. Thank you for persuading me. Dtunkelang (talk) 18:11, 5 April 2024 (UTC)
Adding references to every license is overkill. If you are going to use any of these databases/libraries, you had better check the license of the files you download instead of relying on the Wikipedia link (which may be outdated, or apply to a different version). It is completely sufficient to only have a link to the software itself (and when it has a Wikipedia article, do we need a reference at all? Apache Lucene is notable without a reference on every link to the article), as long as the license can trivially be found from the software's main page, in my opinion. Too many "references" hinder readability because of visual clutter. 129.217.30.137 (talk) 17:12, 19 March 2024 (UTC)
- As stated here, WP:OVERCITE can be a problem, but this is not the case here. This article needs more citations, not less. List and comparison articles/sections on Wikipedia are notoriously prone to inaccuracies and maintenance difficulties, because they tend to be under-cited and over-inclusive. Even if a claim is verifiable through an article linked nearby, it must still be independently cited at the point of inclusion. StereoFolic (talk) 17:35, 19 March 2024 (UTC)
An example would help
This article could be improved by adding a simple example with a plausible use case. FreeFlow99 (talk) 16:37, 7 April 2024 (UTC)
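- As a possible starting point, here is a minimal sketch of the core operation a vector database performs: storing items as vectors and returning the stored items most similar to a query vector. Everything in it is an assumption made up for illustration, including the three-dimensional "embeddings", the document titles, and the `TinyVectorStore` class. Real systems typically use model-generated embeddings with far more dimensions, and approximate indexes rather than the brute-force scan shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Hypothetical in-memory store: brute-force scan, no persistence."""

    def __init__(self):
        self._items = []  # list of (key, vector) pairs

    def add(self, key, vector):
        self._items.append((key, list(vector)))

    def search(self, query, k=1):
        """Return the k stored keys most similar to the query vector."""
        scored = [(cosine_similarity(query, vec), key) for key, vec in self._items]
        scored.sort(reverse=True)  # highest similarity first
        return [key for _, key in scored[:k]]


# Made-up 3-dimensional "embeddings"; the axes loosely mean
# (animals, cooking, finance). A real embedding model is not used here.
store = TinyVectorStore()
store.add("Caring for a kitten", [0.9, 0.1, 0.0])
store.add("Quick pasta recipes", [0.1, 0.9, 0.1])
store.add("Index fund basics", [0.0, 0.1, 0.9])

# A query vector near the "animals" axis, standing in for a question
# such as "how do I feed my cat".
results = store.search([0.8, 0.2, 0.0], k=2)
print(results)  # ['Caring for a kitten', 'Quick pasta recipes']
```

The plausible use case is semantic search: documents and queries are mapped into the same vector space, and the closest stored vectors are returned even when they share no keywords with the query. Whether an example like this belongs in the article would still depend on finding a reliable source to base it on.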