Talk:Index of coincidence
This article is rated Start-class on Wikipedia's content assessment scale. It is of interest to the following WikiProjects: | |||||||||||
|
this article needs formulas
- It now has one; I improved it and added text about generalization of the concept. — DAGwyn 15:45, 20 September 2006 (UTC)
- A while back I added a second formula, for kappa I.C. — DAGwyn 23:08, 25 November 2006 (UTC)
I reckon it's pretty good to give one a basic understanding of the IC.
We should have a link to an IC calculator, such as http://www.central.edu/homepages/lintont/classes/spring01/cryptography/java/indexofcoin.html. Dianelowe 09:21, 21 May 2006 (UTC)
- Since IC computation depends on context, it is better to give the general rule than just a specific computation. — DAGwyn 15:45, 20 September 2006 (UTC)
Formulae Corrections
editAre the first three formulae in this article correct to say '1/(1/c)' (or some form thereof)? Doesn't this just cancel to 'c' in the numerators, or is this perhaps just a style issue rather than syntactic?
94.193.216.167 (talk) 04:13, 24 January 2010 (UTC) Steve
- I restored the original formulas. The person who changed them to the form you saw was evidently trying to reorganize them in terms of probabilities rather than number of coincidences. — DAGwyn (talk) 20:57, 23 July 2010 (UTC)
New formula recommendation
editI would humbly recommend the following formula + descriptive text:
Where is some text, is each letter in that text, is the number of times appears in the text (for some ), and is the length of the text. For English texts, the result is multiplied by 26 by convention, to "normalize" the result.
I believe this could help eliminate an ambiguity in the current formula:
Seeing that a previous formula was reverted, I wanted to explain this before editing.
How 'c' can be ambiguous with the current formula
The current formula uses 'c' both for the number of character counts to sum and the normalization coefficient, two terms that are not necessarily related.
It's not clear in the current formula how to process a text that doesn't contain every letter of the alphabet, or a text that includes whitespace/punctuation/multiple cases, or an enciphered text of an unknown source language.
One easy solution in all of these situations would be to just iterate through every available character and not normalize at all. This works perfectly, yielding something near 0.067 for English, regardless of punctuation and case. This was actually William Friedman's original design (see note on p14 of THE INDEX OF COINCIDENCE AND ITS APPLICATIONS IN CRYPTANALYSIS where he claims 0.066 is the expected result for English, not 1.703.)
Removing the coefficient might not be preferred by everyone. But we can at least move the coefficient out in front of the calculation so as to:
- 1. visually simplify the formula
- 2. make the calculation more efficient
- 3. highlight its role, make it more explicit
- 4. make it obvious how to normalize or de-normalize a result for quick conversions
- 5. separate it from the count of letters being summed, 'c', to which it might not correspond
(I hope it's obvious that this formula is equivalent. Dividing the denominator is the same thing as multiplication, and then multiplicands can be freely moved outside of a summation.)
After moving the coefficient out in front, I recommend using 26 instead of a variable, primarily to prevent confusion with all of the other letter counts in the formula (and because 26 here is not actually a count of any relevant letters, as I attempt to demonstrate below).
Pride and Prejudice and Indices
For example, consider the following passage from Pride and Prejudice:
- 'Do you hear the snow against the window-panes, Kitty? How nice and soft
- it sounds! Just as if some one was kissing the window all over outside.
- I wonder if the snow LOVES the trees and fields, that it kisses them so
- gently? And then it covers them up snug, you know, with a white quilt;
- and perhaps it says, "Go to sleep, darlings, till the summer comes
- again." And when they wake up in the summer, Kitty, they dress
- themselves all in green, and dance about--whenever the wind blows--oh,
- that's very pretty!' cried Alice, dropping the ball of worsted to clap
- her hands. 'And I do so WISH it was true! I'm sure the woods look sleepy
- in the autumn, when the leaves are getting brown.
Reading every character:
- The unnormalized IC is 0.0655.
- Normalized to 26, the result is 1.7027.
- Normalized to 46 (the characters in the text), the result is 3.0125.
- Normalized to 52 (the actual number of letters in this alphabet), the result is 3.4054.
Unnormalized, and normalized to 26 is right on track. But where'd we get 26 from? There are more than 26 different characters here. There are even more than 26 different types of letters here, since it's not monocase.
Converting to monospace first and dropping punctuation and whitespace: (There are now only 24 letters, due to the lack of an x or z). Then:
- The unnormalized IC is 0.0654
- Normalized to 24, the result is 1.5705
The stripped text provides results still recognizable as English when unnormalized (but moving slowly in the wrong direction, because we threw away some information). Also note that in both cases normalizing against 'number of characters found' produces somewhat arbitrary results. But also consider, what if these were enciphered by bytes? There'd be no way to tell which bytes to strip out to pare down to 26.
My main point is that 'c' can cause confusion in the current formula. "Characters found" makes a poor 'c' for normalization. But if we're not using "characters found" for 'c', it's no longer clear what characters to iterate over for the sum in the formula as written, which takes a sum of counts for exactly 'c' characters. Maybe we're taking 'c' from some source language, but we might not know what that even is.
(In fact, even if you know the source language, it's sometimes unclear what convention to use. The Portuguese alphabet consists of either 23 or 26 letters depending on what country you happen to be in. Is the German ß a 27th letter? Polish has 32, or 35? Policies on diacritics vary...)
If the normalization coefficient is not from the source text or source language, what else is left? I think we're normalizing to 26 mostly by convention. But even if you don't agree with that, it's probably a good idea to make the coefficient stand out a bit, instead of tucking it under the denominator, for the other reasons listed above.
Element of notation
Finally, I also use the 'element of' notation beneath the sigma, to make it clear we aren't counting upwards, but choosing letters from a set. Thomas B♘talk 06:37, 18 August 2013 (UTC)
- I left the formula as is (except for the stray comma that snuck in there), but added some introductory remarks that I hope will clarify what's going on in the formula. I still recommend moving the coefficient out in front, and using the 'element of' notation, but I'll wait for a second opinion. I also reorganized the page a bit, to bring the calculation right up front, since that's probably what most people want. --Thomas B♘talk 16:30, 18 August 2013 (UTC)
- I think the current equation (with the strange looking double fraction) actually makes sense. The numerator is the number of observed coincidences in an actual text and the denominator is the expected number of coincidences of a text that consists of randomly chosen letters from the same alphabet. 2A02:1205:C6BB:6880:2966:7C40:B6F7:B74E (talk) 18:23, 18 August 2013 (UTC)
- It may make sense to you, but I only recommended the correction after the second time I used this page as a reference and banged my head against the formula for two days until I realized my error in normalization. As I mentioned above, basing 'c' on the observed alphabet will yield incoherent results for any CT containing anything other than exactly 26 different letters, unless you're just arbitrarily normalizing to 26. It also makes the formula useless for determining the language of an enciphered text, something the IoC should be able to accomplish. On top of all that, it seems to recommend a division operation at each step of the sum when you could just do this operation once. It's a bit like that joke about the Welsh restaurant with terrible food in such small portions... the current formula gives incorrect results, and slowly. --Thomas B♘talk 19:12, 24 August 2013 (UTC)
- I think the current equation (with the strange looking double fraction) actually makes sense. The numerator is the number of observed coincidences in an actual text and the denominator is the expected number of coincidences of a text that consists of randomly chosen letters from the same alphabet. 2A02:1205:C6BB:6880:2966:7C40:B6F7:B74E (talk) 18:23, 18 August 2013 (UTC)
- Obviously, c=26 should be used only when the analyst thinks that it is safe to assume that the underlying plaintext is English, so that the number of symbols used is 26. This does occur for many elementary examples found in English-language textbooks. In the computer era, c might be 256 or 2, for example.
- The issue of normalization is dealt with in the explanation and formulas as they were when I last edited this article (2010?). Normalization not only allows comparisons between different situations (see the column-width determination in the article) but also supports combining several independent measures of coincidence into a single overall measure (the "bar" statistics in Mountjoy's paper.)
- To normalize, one must have 'some' model in order to compute the expected correlation for the null hypothesis. Often the null model is just a uniform distribution over the symbol set. If there is no reasonable default model at hand, then unnormalized measures are used, but we lose the comparability and combinability advantages I mentioned above.
- I have no doubt that improvements in the explanations are possible; perhaps some of what I stated here should be incorporated into the article. — DAGwyn (talk) 20:06, 16 February 2015 (UTC)
- Agreed that it's wrong to use c=26 if you aren't sure there are 26 symbols in the PT. But (a) that extra knowledge should be treated as a special case, not the general formula, and, (b) even when you are confident there are 26 symbols, you don't actually gain anything from normalization. You cite two possible advantages: column width analysis and comparability. I'd humbly propose that each of these operations works just fine without any normalization whatsoever. That will probably sound like a strange claim, but consider: Column widths can be identified by seeking the highest IoC. Normalization is just a multiplication operation, it doesn't affect the search for the MAX() at all. MAX(a, b, c) gives the exact same result whether you multiply each of a, b, and c by 26 first or not. Comparability is trickier. Yes, if you're using the normalized 1.73 value for English, you won't know what to make of the 0.0667 your calculation returns. However, comparison is completely preserved so long as you're baselining against the non-normalized value for English: 0.0667. Friedman gives an IC of 0.0667 for English, 0.0778 for French, 0.0762 for German, 0.0738 for Italian and 0.0775 for Spanish. He doesn't bother normalizing, because it's not necessary. Moreover, if you're still trying to discover the language of the PT, there's no way you know how many characters to normalize against. On the other hand, without normalizing, if you get 0.0778, you're probably looking at French. You don't need a division operation on both sides of 26, or maybe 42 if you consider every diacritic and ligature as producing a new character, or 84 for both cases... Even if you told me something was in French I wouldn't know what the divisor is, but supposing I did, dividing on both sides of the comparison equation doesn't help my comparison at all. Comparison actually cuts against normalization. There's no way you can normalize if you're still comparing your text to different language bases, because that implies you don't know the nature of your PT, so you'll have no idea what the divisor should be. Probably moot since no one seems to agree that normalization is just a rabbit hole, just wanted to weigh in for the record. Respectfully, --Thomas B♘talk 21:34, 8 May 2015 (UTC)
- The article really ought to start with a good definition of "index of coincidence". There are two possibilities, (a) the relative frequency of "hits" or (b) the ratio of observed hits to expected hits. (a) was used in Friedman's original article, and (b) became the definition most commonly used by later NSA cryptanalysts, as in Military Cryptanalytics. Mountjoy's article shows the utility of (b). As Thomas B says, (a) must be used when there is no "null" model against which to compare the observed value. In many computational situations, all normalizers have the same value, in which case coincidence values can be compared without first normalizing,. The foregoing could be integrated into the article if somebody thinks it is helpful. —DAGwyn (talk) 09:13, 29 May 2015 (UTC)
- I reread the older Military Cryptanalysis, which contains a great alternate term for the index of coincidence: "the Phi test for monoalphabetieity!" Alternate coefficients are discussed even there in vol. IV, p171, but only for solving bifid ciphers. I'm curious to read the Mountjoy, but I can't seem to track it down using the cite. I've been digging through NSA archives and cryptome for Mountjoy articles or other stuff from 1963, but no luck so far. Any chance there's a link, or a more complete citation? Thanks for your note, interesting history here! --Thomas B♘talk 05:55, 26 August 2015 (UTC)
- I'm sorry to report that Ms. Mountjoy died a few years ago, but she most likely wouldn't have responded even if you had contacted her during her retirement. The "bar statistics" articles were still considered classified when I last had a copy of them. You could try a FOIA request, but although the classification criterion (that a disclosure would likely damage the national security) is not currently met, the Agency automatically applies a set of rules for declassification that tend to treat every bit of information about methodology as exempt from declassification, unless it has already been published (which is why they released Friedman MilCryp Parts 3 and 4 after I found copies in public libraries). The technical essence of the articles is incorporated into the Wikipedia article; the example involves unequal column lengths, and the summing of "hits" is done before normalizing, which was the main point. The definition of IC as a ratio of total observed hits to expected hits was emphasized. And the reason for the article was said to be in response to questions from users of some software package that apparently Mountjoy had been involved with. — DAGwyn (talk) 12:30, 7 July 2017 (UTC)
- I reread the older Military Cryptanalysis, which contains a great alternate term for the index of coincidence: "the Phi test for monoalphabetieity!" Alternate coefficients are discussed even there in vol. IV, p171, but only for solving bifid ciphers. I'm curious to read the Mountjoy, but I can't seem to track it down using the cite. I've been digging through NSA archives and cryptome for Mountjoy articles or other stuff from 1963, but no luck so far. Any chance there's a link, or a more complete citation? Thanks for your note, interesting history here! --Thomas B♘talk 05:55, 26 August 2015 (UTC)
Current formula doesn't give 1.0 for equally distributed texts. Consider omiting the '-1'.
editThe current version of this article says:
"If all c letters of an alphabet were equally distributed, the expected index would be 1.0. The actual monographic IC for telegraphic English text is around 1.73, reflecting the unevenness of natural-language letter distributions."
Unfortunately, this statement is not true for the formulas given. For example, the following text uses a uniform distribution of letters:
"abcdefghijklmnopqrstuvwxyz abcdefghijklmnopqrstuvwxyz"
The existing formula yields an index of coincidence of 0.5098 for the above text. But since the letters are uniformly distributed (each letter is used exactly twice), we should compute an index of coincidence of 1.0.
The formula approaches 1.0 as the length of the text increases: 2x alphabet -> 0.5098, 4x -> 0.7573, 16x -> 0.9398.
If the formula did not use the '-1' for both numerator and denominator, then we would properly compute an index of coincidence of 1.0 for equally distributed text. So instead of this idea:
"The chance of drawing that same letter again (without replacement) is (appearances - 1 / text length - 1)"
We'd compute the chance of drawing the same letter again (after replacing the original letter).
This article is fairly lengthy, and I'm not an expert in this subject. So I hesitate to make such an edit. I suspect the formula was intended as an introduction to the idea, and not intended to be the final formula (as the Generalization section suggests). But I know the current article to be inconsistent. For short texts, the expected index is not 1.0 for a uniform distribution of letters, given the current formula. — Preceding unsigned comment added by Mbb25 (talk • contribs) 22:19, 6 July 2017 (UTC)
- There is nothing wrong with the formula. The expected value is the limit of the IC as the message length increases. The value for any specific message will in general not be the expected value. The IC can be wildly off the mark for short messages, as this is a statistical concept.--Bill Cherowitzo (talk) 02:42, 7 July 2017 (UTC)
- I just fixed the actual error; it should have said equiprobable, not equidistributed. Then the formula is correct. Note that the paragraph after the derivation of the formula restates the meaning of the formula using "uniform random distribution" in the sense of "probability distribution". The is needed for the delta I.C. because we don't count a character as matching itself; it is not needed for kappa I.C. — DAGwyn (talk) 12:47, 7 July 2017 (UTC)