Talk:HyperLogLog

Latest comment: 6 months ago by Retimuko in topic Algorithm description

2620:0:1CFE:B:6A5B:35FF:FEC2:9ADB (talk) 01:26, 2 August 2014 (UTC) I'm just learning about HLL, but I guess it should be made clear that the number of zeros is actually the number of zeros in the base-2 representation of the numbers.Reply

Question about context

edit

What is meant by "For instance, to achieve 99% accuracy, it needs only 16 KB." This clearly needs more context. Is the memory used by the algorithm really constant or are we assuming a certain type of data is being analyzed. Olleicua (talk) 15:11, 29 September 2014 (UTC)Reply

2% accuracy

edit

What does the figure 2% accuracy mean exactly? Does it mean that the estimated value is at most 2% away from the actual value? Destrielve (talk) 15:57, 19 October 2015 (UTC) It does mean that the estimation error is asymptotically normally distributed with standard deviation of 2%. — Preceding unsigned comment added by 90.146.84.192 (talk) 19:44, 25 April 2016 (UTC)Reply

edit

Hello fellow Wikipedians,

I have just added archive links to one external link on HyperLogLog. Please take a moment to review my edit. If necessary, add {{cbignore}} after the link to keep me from modifying it. Alternatively, you can add {{nobots|deny=InternetArchiveBot}} to keep me off the page altogether. I made the following changes:

When you have finished reviewing my changes, please set the checked parameter below to true or failed to let others know (documentation at {{Sourcecheck}}).

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 5 June 2024).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—cyberbot IITalk to my owner:Online 06:09, 10 March 2016 (UTC)Reply

Incorrect Usage of "Cardinality" in Article

edit

Hi Wikipedians,

I noticed that the definition of "cardinality" for a multiset to mean "the number of distinct elements in the multiset" differs from the definition in the article for Multiset. The correct definition for a multiset's cardinality is the one presented in the Multiset article. The disparity stems from incorrect usage in the original paper, where the authors define "cardinality" for a multiset to mean the number of distinct elements in the multiset. I would edit the page to not use "cardinality," but repeating "the number of distinct elements" everywhere "cardinality" appears seems clunky. Any suggestions for the edit? — Preceding unsigned comment added by 2601:647:4201:C7A8:D56F:DF3F:2538:BF (talk) 06:16, 11 November 2017 (UTC)Reply

Terminology

edit

KingSupernova insists on using terms "dimension" (in this edit) or "multiplicity" (in this edit and this edit) instead of "cardinality". All cited sources use the term "cardinality". Perhaps, this is not ideal, but "dimension" and "multiplicity" are not correct terms either. The term "cardinality" for "number of distinct elements" is widely known and widely used in the industry. Regarding other problems with grammar, I would suggest making separate edits for grammar and terminology. Retimuko (talk) 08:57, 26 November 2017 (UTC)Reply

Regardless of what the cited sources say, the term "cardinality" is simply wrong. Please refer to https://en.wikipedia.org/wiki/Cardinality and https://en.wikipedia.org/wiki/Multiset. You could also look at the talk page entry immediately above this one. ("Dimension" and "multiplicity" can both be used to mean the size of the set constructed from the multiset. Multiplicity is more common, which I why I'm now using that one.) KingSupernova (talk) 19:05, 26 November 2017 (UTC)Reply
I agree that cardinality seems to be not quite right, but Wikipedia is not about our opinion as editors. The sources use this term. It is a standard term in the literature on the subject. Do you have any reliable source talking about how this usage of the term is wrong?
Regarding multiplicity. This term wold be simply wrong indeed in this context. There is no such thing as multiplicity of a multiset. Multiplicity is about members of a multiset, not about multiset itself. Please have a closer look at the multiset article.
Using the term dimension would be just odd. Have you seen it in the context of sets? I don't think I have. Retimuko (talk) 19:29, 26 November 2017 (UTC)Reply
KingSupernova, you are not supposed to keep reverting reverts. Explaining your reasons here is not enough. We need to reach some sort of an agreement. Your behavior very much seems like an edit war. I would like to ask you to undo your last revert and continue discussion here please. Thank you. Retimuko (talk) 06:38, 27 November 2017 (UTC)Reply
I'm sure you could find hundreds of sources that are wrong, it is the internet after all. I provided you with links to the Wikipedia page on Cardinality- if you don't consider that to be a reputable source, then I suggest you look at the sources of that page.
I have heard both "dimension" and "multiplicity" used in that way, but looking into it it seems that neither are mainstream. Perhaps "underlying set" would be a better term, as that is what's used on the Wikipedia page on Multisets.
I agree, we should reach an agreement before any more edits are made. It takes two people to cause an edit war, so let's just leave the page as it is until we reach a consensus. KingSupernova (talk) 07:20, 27 November 2017 (UTC)Reply
> "I'm sure you could find hundreds of sources that are wrong"
Please show a reliable source, which you consider right.
> "I provided you with links to the Wikipedia page on Cardinality- if you don't consider that to be a reputable source, then I suggest you look at the sources of that page."
There are several things wrong with this statement: 1. The article on cardinality doesn't mention multiplicity at all, so I don't see how it could possibly support your point. 2. Wikipedia itself is not a reputable source. 3. It is not my task to look through the sources. Please cite sources supporting your point.
The primary source for this article is the 2007 paper by Flajolet et al. Here is a quote: "The purpose of this note is to present and analyse an efficient algorithm for estimating the number of distinct elements, known as the cardinality, of large data ensembles, which are referred to here as multisets". Retimuko (talk) 16:18, 27 November 2017 (UTC)Reply
Since you appear to be unwilling to do any work yourself, here are some non-wikipedia sources as to the definition of "cardinality":
https://oeis.org/wiki/Multisets#Cardinality
https://en.oxforddictionaries.com/definition/cardinality
https://www.thefreedictionary.com/cardinality
https://brilliant.org/wiki/cardinality/
https://projecteuclid.org/download/pdf_1/euclid.ndjfl/1093634995
As I am now saying for the second or third time, I admit that my usage of "dimension" and "multiplicity" were not advisable. I'm not sure why you're choosing to focus on that rather than arriving at a correct article.
I notice that even after agreeing to leave the article alone, you have yet again reverted my edit. I don't want the article to contain incorrect information, and it doesn't seem like you have any interest in reaching a consensus. I have therefore fixed the article once more (using the better terminology of "underlying set" this time) and escalated this dispute to the dispute resolution noticeboard. KingSupernova (talk) 19:37, 27 November 2017 (UTC)Reply
I am afraid you misunderstand the rules of Wikipedia. We did not agree to leave the article with your contentious edit in place. The article is supposed to stay as it was before (WP:STATUSQUO) until we reach consensus. Please learn the rules: WP:CONSENSUS, WP:BRD
The links you gave don't support the terminology you propose, and most of them are not reliable sources. Could you cite a reliable source, which defines the number of distinct elements in a multiset as a cardinality of its "underlying set"? I gave you a quote from the 2007 paper by Flajolet et al, which is the basis of this article. There are many reputable and peer reviewed papers citing Flajolet and using the same terminology. I have not seen any of them criticizing this terminology or proposing a new one. The consensus should be within basic rules of Wikipedia, which means that the article must be based on reliable sources, not on personal opinions of editors. Retimuko (talk) 20:21, 27 November 2017 (UTC)Reply
I did: https://en.wikipedia.org/wiki/Multiset#Formal_definition. I'm going to go ahead and predict your response of "Wikipedia is not an accepted source" and respond with "Then look at the sources of that article. If you want to dispute that terminology, that page is where to do it. As of now, that terminology is accepted on Wikipedia, so let's be consistent."
The quote you gave me is wrong. I don't know what more you want me to say. "Cardinality" is a very widely-used term in mathematics, and it simply doesn't mean what that article says it does. A source having one piece of incorrect information does not mean it's a bad source, it just means you need to use some common sense and realize that it's wrong. I have provided you with plenty of sources of the definition of "cardinality", it's used that way in hundreds of Wikipedia articles, and I'm tired of arguing about the meaning of basic English words. At this point you are simply being obstinate and not contributing anything to the discussion. KingSupernova (talk) 01:42, 30 November 2017 (UTC)Reply
I don't think we can just dismiss Flajolet as being wrong. He proposed this HyperLogLog algorithm after all, so it would be very misleading for the reader to use different terminology in the article. There is also a large body of research on this topic and related topics. For example, see count-distinct problem. I quote: "In computer science, the count-distinct problem (also known in applied mathematics as the cardinality estimation problem) is the problem of finding the number of distinct elements in a data stream with repeated elements." I believe I have seen research papers published in reputable journals and presented on reputable conferences, which more or less explicitly repeat or implicitly assume this definition. Those journals and conferences have quite strict review processes, and therefore meet Wikipedia standards for reliable sources. I can dig up some links if needed. I believe we have a case of some disagreement between reputable sources. So we should present both views. Perhaps, we can, as a compromise, introduce a section called "Terminology", which would say something like: "In the original paper by Flajolet et al and in related literature on the count-distinct problem the term cardinality is used to mean the number of distinct elements in a data stream with repeated elements. However, in the theory of multisets the term cardinality is defined as a sum of multiplicities of each member of a set." Would you support this approach? Retimuko (talk) 07:11, 1 December 2017 (UTC)Reply
My apologies for the delay in responding, I've been busy. It's true that there are several peer-reviewed papers that use "cardinality" the way it's used in the Flajolet article. However, those papers are a drop in the bucket compared to its mainstream usage in set theory. As was mentioned on the dispute resolution noticeboard, we should cover topics with WP:WEIGHT appropriate to their prevalence, and the usage in the Flajolet paper is simply not the accepted usage in normal mathematics. I think a mention in the Wikipedia article of the inconsistent terminology from the cited papers would be a good idea, but I maintain that the Wikipedia article should use the usual definition when describing the algorithm. KingSupernova (talk) 17:42, 18 December 2017 (UTC)Reply
I added a section with the proposed wording. Is this something we can agree on? Thanks. Retimuko (talk) 19:26, 20 December 2017 (UTC)Reply
I strongly dislike that this article uses terminology that is directly contrary to the entire rest of mathematics, but I'm not interested in continuing to fight it. I edited the section to be a bit more clear, I'll leave it at that. — Preceding unsigned comment added by KingSupernova (talkcontribs) 07:05, 29 December 2017 (UTC)Reply

I would suggest to avoid editorializing in articles ("simply", "directly contrary", "usual"). What does it mean for definition to be directly contrary? You also removed the "sum of multiplicities" that is important, and added unnecessary mentioning of sets. There is no disagreement for sets. If elements don't repeat the number of distinct elements equals the number of elements. So I doubt it was a clarification. Retimuko (talk) 07:32, 29 December 2017 (UTC)Reply

An example?

edit

Speaking as a working software developer, I think this article could really use an example. Perhaps as its own section, but perhaps just a sentence in the intro. E.g., "For example, a HyperLogLog approach is a good choice to estimate the FOO of BAR." Or, "HyperLogLog algorithms are commonly used for problems A and B in domain C." I don't know enough to offer one, but I'd be somebody reading this does. -- William Pietri (talk) 20:47, 9 July 2020 (UTC)Reply

That's a good point. One common usage is to estimate the number of concurrent users in a large distributed system. 2603:8000:D001:5008:DD39:7A2F:8510:612E (talk) 19:07, 14 July 2023 (UTC)Reply

Algorithm description

edit

the observation that the cardinality of a multiset of uniformly distributed random numbers can be estimated by calculating the maximum number of leading zeros in the binary representation of each number in the set.

Leading zeroes in any number representation do not have any value: 0001 = 0000001. So the quoted sentence does not make sense. — Preceding unsigned comment added by 158.140.235.213 (talk) 20:29, 12 November 2020 (UTC)Reply

I agree and I suggest adding "fixed-size" to the relevant sentence, which would yield " maximum number of leading zeros in a fixed-sized binary representation of each number in the set". I googled "hyperloglog example" and found illustrations such as phone numbers or credit cards (for example counting crowds or my favorite data structure), which supports this idea of fixed-size. BTW these examples don't neccessarily take the "leading" zeros, but can be the trailing zeros or the total number of zeros. Also, data is first hashed into a fixed-length binary hash, so a more accurate description (and my final suggestion :-) ) would be "the maximum number of leading zeros in a fixed-size binary hashed representation of each member of the set". Is that still readable? Jrob kiwi (talk) 16:06, 26 February 2021 (UTC)Reply
I still find this confusing; the number of leading zeros in a particular computer hardware representation seems more a characteristic of the hardware than the data.
E.g. switch from 32 bit to 64 (or 128) bit: that doesn't change the original data in any way, however now you have more leading (or trailing) zeros. 2601:601:9100:72A0:6D6E:F003:AD78:B05A (talk) 01:17, 12 May 2024 (UTC)Reply
Uniform random distribution implies a fixed range. Retimuko (talk) 02:37, 12 May 2024 (UTC)Reply

Clarify approximate formula for large multi-sets

edit

In Practical Considerations the following formula is reported:

 

that could be used for very large cardinalities approaching the limit of the size of the registers (  for 32-bit registers). I think we need a reference to the source of the formula, and also a clarification of the meaning of the size of the registers: since a counter contains m registers, the formula applies when each of such m registers is 32-bit long?