Talk:Null-terminated string

Latest comment: 1 year ago by Kpratter in topic "CString" listed at Redirects for discussion

NULL char in UTF-8

edit

In response to this edit: A NULL char is not a part of any valid UTF8 sequence; all characters in all multibyte sequences start with a 1 bit. However, you can encode a NULL char into a UTF8 stream with a 0xC0, 0x80 sequence, which then becomes a 0x0000 when converted to UTF16. - Richfife (talk) 16:09, 13 September 2013 (UTC)Reply

Zero is a valid code point, and the UTF-8 encoding of it is a NUL (\0) byte. An 0xC0, 0x80 sequence in a UTF-8 string is an invalid overlong encoding. In fact, the overlong encoding of NUL is used as an example of a security issue from an incorrect UTF-8 implementation in the RFC. strcat (talk) 17:40, 13 September 2013 (UTC)Reply
See the encoding table provided by the Unicode consortium as a resource. strcat (talk) 17:46, 13 September 2013 (UTC)Reply
That is incorrect, you are describing "modified UTF-8" as used by tcl and some other systems. These also incorrectly encode non-BMP as 6 bytes. Officially the encoding of U+0000 in UTF-8 is a single 0 byte.
However I have reverted this, as the 0 character also exists in ASCII and yet it claims that ASCII works. Thus "works" is defined as "works for all characters other than the zero one". The edit implies that UTF-8 is not supported as well as ASCII, which is false, it is supported equally well with exactly one character not represented.Spitzak (talk) 23:57, 13 September 2013 (UTC)Reply
I am not describing modified UTF-8, but I think you may just be responding to the parent comment. strcat (talk) 03:54, 19 September 2013 (UTC)Reply
Can't we say that NUL (\0) is supported by ASCII, extended ASCII and UTF-8, but NUL is not supported by Null-terminated strings, since NUL has an internal meaning to null-terminated strings. But as long as NUL is not part of encoded text (only TAB, CR and LF is among control chars) both ASCII and UTF-8 can be stored in Null-terminated strings. Both ASCII and UTF-8 might sometimes demand storage of NUL, but if you want to store binary data, why not store them as binary data, not as ASCII, UTF-8 or UTF-16 or null-terminated strings.--BIL (talk) 09:09, 14 September 2013 (UTC)Reply
UTF-8 considers NUL to be totally valid text data. It's the encoding of a valid code point - unlike the explicitly forbidden range of surrogates. I added sources for this, and it was reverted to the previous inaccurate claim of \x00 not being valid UTF-8. strcat (talk) 03:56, 19 September 2013 (UTC)Reply
There is no difference between UTF-8 and ASCII with respect to null terminated strings. Actually UTF-8 preserves backwards compatibility with traditional null terminated strings, enabling more reliable information processing and the chaining of multilingual string data with Unix pipes between multiple processes. Using a single UTF-8 encoding with characters for all cultures and regions eliminates the need for switching between code sets. See Lunde, Ken (1999). CJKV information processing. O'Reilly Media. p. 466. ISBN 978-1-56592-224-2. Retrieved 2011-12-23. {{cite book}}: Unknown parameter |month= ignored (help). So current wording should be fixed. AgadaUrbanit (talk) 07:02, 19 September 2013 (UTC)Reply
Regardless of whether there's a difference between ASCII and UTF-8, not all UTF-8 can be stored in a null-terminated string per the Unicode and UTF-8 standards (given as a source). They are the only authoritative sources here because they define the encoding. 99.231.135.5 (talk) 01:39, 20 September 2013 (UTC)Reply

Do many small strings imply duplicates?

edit

My concern is with the statement, "On modern systems memory usage is less of a concern, so a multi-byte length is acceptable (if you have so many small strings that the space used by this length is a concern, you will have enough duplicates that a hash table will use even less memory)." I can write a program that generates many small strings that are not duplicates. Therefore I do not believe that if I have many small strings then there will be duplicates. There might be many programs where many small strings are duplicates and a hash table will use less memory (e.g. the symbol table in a compiler), but I cannot see that this is always true of all programs. — Preceding unsigned comment added by 80.195.2.190 (talk) 12:56, 6 September 2016 (UTC)Reply

If "many" is considered to mean tending towards infinity then there will be duplicates strings after all permutation of small strings are generated. So in that sense many strings implies duplicates. However, consider 8 bit byte characters where there are 2^8=256 different characters, of which the zero character '\0' NUL can be used as a terminator, leaving 255 other characters from which to form strings. Now consider the number of bytes required to store all permutations of strings of short lenth for a NUL terminated string representation and a string representation having a 4-byte length:

length L permutations P size NUL terminated = P * (L+1) size 4-byte length = P * (L+4)
0 255^0 = 1 1 4
1 255^1 = 255 510 1275
2 255^2 =65,025 195,075 390,150
3 255^3 = 16,581,375 66,325,500 116,069,625
4 255^4 = 4,228,250,625 21,141,253,125 33,826,005,000

To store all permutations of strings up to length 4 requires 20 gigabytes in a NUL terminated string representation and 32 gigabytes in a string representation with a 4 byte length. So a hash table storing such short strings will exhaust current memory sizes before all permutations of short strings can be generated. Therefore for hash tables, stored in current memory sizes, many short strings does not necessarily imply there will be duplicate strings. — Preceding unsigned comment added by 80.195.2.190 (talk) 10:10, 12 January 2017 (UTC)Reply

Second column seems off, I get 4, 1275, 390150, 116069625, 33826005000.
However storing these as independent strings would overflow memory just as much as the hash table. The assumption is that the set of strings fits in available memory, and that there are collisions because some strings are used much more often than others ("index" is probably used much more than "X&*v@" in a programming language). Though the length adds 3 bytes (vs nul-terminated) and the hash table adds H more bytes (where H ~= 16), each collision saves length+4+H bytes. So if there are N collisions out of M total strings then the hash table costs M*(3+H)-N*(length+4+H) extra bytes, which could be negative. However I have no idea how to test this on real data or prove whether it is positive or negative.Spitzak (talk) 18:18, 12 January 2017 (UTC)Reply
Thank you for calculating the correct values. I've now fixed the numbers in that column. I basically agree with your analysis, but to be precise it is duplicate strings rather than collisions that save space in the hash table, because an imperfect hash function can generate the same index, and hence a collision, for different strings, and each different string needs to be stored, so space is saved only for duplicates. I share your intuition that there are many real-world data sets where a hash table (or trie, or other data structure) will save memory by exploiting the frequency of duplication. For a particular application I guess we could get a sufficiently large & representative set of data, and use some kind of statistics-keeping memory management to compare different data structures.

India Education Program course assignment

edit

  This article was the subject of an educational assignment supported by Wikipedia Ambassadors through the India Education Program.

The above message was substituted from {{IEP assignment}} by PrimeBOT (talk) on 20:00, 1 February 2023 (UTC)Reply

"CString" listed at Redirects for discussion

edit

  The redirect CString has been listed at redirects for discussion to determine whether its use and function meets the redirect guidelines. Readers of this page are welcome to comment on this redirect at Wikipedia:Redirects for discussion/Log/2023 October 2 § CString until a consensus is reached. Kpratter (talk) 13:40, 2 October 2023 (UTC)Reply