Talk:C++ string handling
This is the talk page for discussing improvements to the C++ string handling article. This is not a forum for general discussion of the article's subject.
This article is rated Start-class on Wikipedia's content assessment scale.
State of auto strings
If a string is defined as an auto, i.e. a local variable in a function, and not initialised, has it a defined state, e.g. it's a zero-length string, or is it garbage as other simpler auto variables are, e.g. int and char *? The article should say. -- Ralph Corderoy (talk) 11:52, 12 September 2009 (UTC)
- It's the former. Yes, the article could use some improvement. Regards, decltype (talk) 12:22, 12 September 2009 (UTC)
- The article is incomplete, not factually wrong, so it needs {{stub}} template, rather than {{disputed}}
- Thanks for reinstating the template. I promise I will fix the issues known to me as soon as possible. decltype (talk) 20:06, 3 October 2009 (UTC)
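As a small illustration of the answer above, here is a minimal sketch showing that a local std::string declared without an initializer is still default-constructed. The function name make_uninitialized_local is hypothetical and exists only for this example.

```cpp
#include <cassert>
#include <string>

// Returns a local std::string that was declared without an initializer.
// Unlike an uninitialized int or char*, it is default-constructed, so it
// holds a well-defined empty string rather than garbage.
std::string make_uninitialized_local() {
    std::string s;  // no initializer: the default constructor still runs
    assert(s.empty());
    return s;
}
```

Calling make_uninitialized_local() should therefore always yield a string of size 0 that compares equal to "".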
mem usage?
I just had a young programmer tell me that an uninitialized std::string uses less memory than an initialized one. Is that true? (I guess that would depend on the implementation; but consider, for example, gcc.) The code I found suspect was:
    class blah {
    private:
        std::string name;
    public:
        blah(std::string in) {
            if (!in.empty())
                name = in;  // claimed savings of memory
        }
    };
linas (talk) 03:31, 27 January 2008 (UTC)
- That sounds bogus to me. Even if you don't touch name, it gets initialized at the beginning of the constructor. You can always look at the source, though. -- Ben FrantzDale (talk) 06:10, 27 January 2008 (UTC)
- Looking at the library source (libstdc++, in the case of gcc) is easier said than done. But I did run an experiment with sbrk(0) and the result was no effect. Wonder why he thought that ... linas (talk) 21:07, 27 January 2008 (UTC)
- A std::string basically looks like this:
    struct string {
        size_t length;
        char* contents;
    };
- With an uninitialized instance the char* is just 0. Otherwise it points to a memory block (allocated with new char[] or malloc). A string of length 0 can be represented with no memory block, or a memory block containing just the terminating zero. Since heap management operates with granularity (e.g. in units of 16 bytes) you will waste 16 bytes in the latter case.
- Note that there is no direct mapping between malloc and sbrk. The run-time library typically acquires memory in huge chunks from the OS.
--Alba7 (talk) 19:48, 30 August 2008 (UTC)
- The contents pointer is probably not NULL, as that would require if statements in many other methods. However, it may point to a shared buffer containing only a 0 (that byte is there so c_str() can return a null-terminated string). It is also possible that std::string("") does not result in a pointer to this shared buffer but to a private 1-byte buffer, because the constructor did not check for this. Thus the uninitialized version can be faster than the initialized one. I do not know which was done. Small-string optimization would, I think, give identical results for both. Spitzak (talk) 02:51, 7 January 2016 (UTC)
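The point made in this thread, that the member is constructed whether or not the constructor body assigns to it, can be sketched as follows. Guarded and Unguarded are hypothetical stand-ins for the class in the question; the sketch only compares observable state and object size, and does not measure heap usage, which depends on the implementation.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the class in the question: assigns only when
// the argument is non-empty.
struct Guarded {
    std::string name;  // default-constructed before the constructor body runs
    explicit Guarded(const std::string& in) {
        if (!in.empty()) name = in;
    }
};

// Hypothetical variant that always initializes the member from the argument.
struct Unguarded {
    std::string name;
    explicit Unguarded(const std::string& in) : name(in) {}
};

// The string object itself occupies the same space in both classes.
static_assert(sizeof(Guarded) == sizeof(Unguarded), "same object size");

// Returns true if both classes end up with an equal, empty name for "".
bool same_state_for_empty_input() {
    Guarded g("");
    Unguarded u("");
    assert(g.name.empty() && u.name.empty());
    return g.name == u.name;
}
```

Whether either variant touches the heap for an empty string is left open here, as it was in the discussion; comparing name.capacity() on a given implementation is one way to investigate.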
null characters
Just curious if the string class accepts null characters. I would assume it does. -- Preceding unsigned comment added by 66.102.196.17 (talk) 00:56, 28 February 2008 (UTC)
- I dug around in the gcc header files and found the following in basic_string.h: 1. String really contains _M_length + 1 characters: due to 21.3.4 must be kept null-terminated. But I am still not sure what that means fully. Guess I will have to test it. Kind of a lot to go through for a curiosity. I am starting to think it would have to be possible though, or how else would someone do binary file i/o. —Preceding unsigned comment added by 66.102.196.44 (talk) 03:03, 7 March 2008 (UTC)
- It appears to. It's not easy to add them, though, because string foo = "asdf\0asdf"; just sets foo to "asdf": the null terminator means the string constructor never sees the second half of the string. But you can do str.push_back('\0') and the length will increase, and you can put non-null characters after the null terminator. -- Ben FrantzDale (talk) 03:19, 8 March 2008 (UTC)
- No need to check anything or experiment. std::string does support \0. It's in the standard. Of course, C strings still do not. 194.237.142.20 (talk) 15:05, 19 March 2010 (UTC)
- Also, string.assign("asdf\0asdf", 9) will make the string contain the null byte. Spitzak (talk) 19:49, 19 March 2010 (UTC)
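The three ways of building the string mentioned in this thread can be put side by side in a short sketch. The function names are hypothetical; the calls themselves (the const char* constructor, assign with a count, push_back) are the ones named above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Built from a plain C string: the constructor stops at the first '\0'.
std::string from_c_string() {
    return std::string("asdf\0asdf");
}

// Built with an explicit length: all 9 bytes, including the '\0', are stored.
std::string from_counted() {
    std::string s;
    s.assign("asdf\0asdf", 9);
    return s;
}

// Built by appending: push_back accepts '\0' like any other char.
std::string from_push_back() {
    std::string s = "asdf";
    s.push_back('\0');
    s += "asdf";
    return s;
}

// C functions still stop at the embedded '\0' when given c_str().
std::size_t c_length_of_counted() {
    return std::strlen(from_counted().c_str());
}
```

Here size() reports 4 for the first string and 9 for the other two, while strlen on c_str() reports 4 in every case, which is the difference between std::string and C strings noted above.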
character sets
Does the C++ standard define what character sets the string class stores? I would assume it only does ASCII (or perhaps you can do UTF-8, but it won't guarantee correct operation with some types of manipulation), but I can't recall ever seeing any mention of this in the docs. I was just looking at GLib and I was wondering why they bothered reimplementing a lot of STL, then I figured proper UTF-8 support might be the reason. If it is a major difference, perhaps the article should be expanded to compare/contrast std::string with other libraries' string classes. Yanroy (talk) 20:17, 18 July 2008 (UTC)
- Class std::string is actually just an instantiation of a template.
typedef basic_string<char> string;
- You can also use wchar_t instead of char to get UTF16/UTF32 support.
- --Alba7 (talk) 16:54, 23 October 2008 (UTC)
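A minimal sketch of the template relationship described above. The helper code_units is hypothetical; note also that the width of wchar_t is implementation-defined (commonly 16 bits on Windows and 32 bits on Unix-like systems), so which encoding a std::wstring holds is an assumption about the platform, not something the sketch checks.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// std::string and std::wstring are instantiations of the same template.
static_assert(std::is_same<std::string, std::basic_string<char>>::value,
              "std::string is basic_string<char>");
static_assert(std::is_same<std::wstring, std::basic_string<wchar_t>>::value,
              "std::wstring is basic_string<wchar_t>");

// Works for any instantiation: size() counts elements of type CharT,
// not user-perceived characters.
template <class CharT>
std::size_t code_units(const std::basic_string<CharT>& s) {
    return s.size();
}

// Small self-check using plain ASCII text, where units and characters agree.
void check_code_units() {
    assert(code_units(std::string("abc")) == 3);
    assert(code_units(std::wstring(L"abc")) == 3);
}
```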
It can handle UTF-8 or any other byte-based encoding. You have copied the typical fallacy of defining "correct operation" as "different than treating the string as bytes". In fact you cannot handle UTF-8 correctly unless you treat the string as bytes. For instance, it is impossible to reserve enough space for a string unless you know how many bytes are in it, and it is impossible to quickly and reliably locate a position in the string unless that position is defined in bytes. There is a lot of obsolete documentation that says "character" where it means "byte"; this erroneous documentation is what needs to be fixed, not some perceived need to turn string manipulation into an impossibly complex attempt to use some other metric (often called "characters" but usually meaning "UTF-16 code units") to measure strings. Actually looking at characters is never done except in iterative processes from the start of a string, and due to the combining rules of Unicode it is impossible even in UTF-32. Spitzak (talk) 19:55, 19 March 2010 (UTC)
- Nonetheless the question is good. C was first developed to work with legacy ISO 646 (the era of digraphs and trigraphs). With encodings such as UTF-8 you need both to see the string as bytes for memory allocation and to see it as characters for other features such as Unicode equivalence. The fact is that the standard C/C++ library only provides some of those features (byte handling); the others are only available in third-party Unicode libraries such as International Components for Unicode.
- C and C++ have always been blind (some say agnostic) to those issues, which causes them to be addressed late in development, through the legacy locale mechanism, or left as a technical interoperability limitation.
- "Agnostic" is a misnomer for this language, as C provides ISO 646 features (as in <iso646.h>).
- C and C++ are so blind that they do not offer any encoding conversion mechanism. Bad old language! 86.75.160.141 (talk) 20:34, 5 November 2012 (UTC)
- In addition, C++ is not so agnostic, because it includes some specific UTF-8, UTF-16 and UTF-32 features in codecvt and the conversion facilities [1]. 86.75.160.141 (talk) 21:23, 5 November 2012 (UTC)
- What I am protesting is anybody saying "strlen does not return the number of characters and therefore you can't use UTF-8 in C!". This is false and incredibly destructive, and this pov has perhaps been the greatest impediment to getting working internationalization and Unicode support into software. Spitzak (talk) 23:12, 5 November 2012 (UTC)
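The distinction argued over in this thread, bytes versus code points in a UTF-8 std::string, can be made concrete with a short sketch. The helper count_code_points is hypothetical and assumes its input is valid UTF-8; it does not handle combining sequences, which is the limitation raised above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Assumes s holds valid UTF-8; it does not validate the input and does not
// group combining sequences into user-perceived characters.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;  // count only lead bytes and ASCII
    }
    return n;
}

// "naive" spelled with U+00EF, which UTF-8 encodes as the two bytes C3 AF.
// Adjacent literals are used so the hex escapes cannot absorb later letters.
std::string sample_utf8() {
    return std::string("na" "\xC3\xAF" "ve");
}

void check_sample() {
    assert(sample_utf8().size() == 6);               // bytes: what storage needs
    assert(count_code_points(sample_utf8()) == 5);   // code points
}
```

size() is the number needed for allocation and indexing, and the code point count is derived by iterating from the start, which matches the two views described in the discussion.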
incorrect
"when two c-strings are compared, it is implementation defined as to whether the contents or addresses are compared."
Huh? No, it's not implementation defined, it's definitely an address compare. The only freedom is that in:
char *p1 = "hello"; char *p2 = "hello";
... the compiler is allowed to share the two, i.e. p1==p2 *maybe*. — Preceding unsigned comment added by 204.15.64.190 (talk) 22:48, 3 December 2008
- The compiler would never be allowed to do so unless p1 and p2 are never modified in the later code. Otherwise changing p2 would inevitably change the content of p1 at the same time. --LunarShaddowღIvy (talk) 12:12, 2 November 2013 (UTC)
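To illustrate the address-versus-contents point without relying on how a compiler stores string literals, here is a minimal sketch that uses two separate char arrays. The function names are hypothetical; operator== on pointers and std::strcmp are standard.

```cpp
#include <cassert>
#include <cstring>

// operator== on two char pointers compares the addresses.
bool same_address(const char* a, const char* b) {
    return a == b;
}

// std::strcmp compares the characters up to the terminating '\0'.
bool same_contents(const char* a, const char* b) {
    return std::strcmp(a, b) == 0;
}

// Two separate arrays with equal text: distinct addresses, equal contents.
bool arrays_differ_by_address_only() {
    char a[] = "hello";
    char b[] = "hello";
    return !same_address(a, b) && same_contents(a, b);
}
```

With two pointers to identical string literals instead of arrays, same_contents would still return true, while the result of same_address would depend on whether the compiler shares the literals, which is the case discussed above.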
Renaming this article to follow a consistent convention
Hi, I am currently considering renaming this article to conform to a common convention for C++ Standard Library components. The full discussion can be found here. decltype 09:47, 6 March 2009 (UTC)
bytes, char16_t and char32_t
Most operations are described as handling bytes.
Nonetheless, from my understanding, strings can also possibly be made from char16_t and char32_t. So we might write:
* string::at – Accesses specified code unit with bounds checking.
* string::operator[] – Accesses specified code unit
* string::front – Accesses the first code unit
* string::back – Accesses the last code unit
* string::data – Accesses the underlying array
— Preceding unsigned comment added by 86.75.160.141 (talk) 20:19, 5 November 2012 (UTC)
- The object is then a "std::basic_string<T>" and not what the C++ headers call "string". I think it is enormously clearer to describe byte strings first and then point out that the template can be reused for other types. Spitzak (talk) 23:08, 5 November 2012 (UTC)
- Now we got u16string and u32string in C++11, cheers. --LunarShaddowღIvy (talk) 12:14, 2 November 2013 (UTC)
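A short sketch of why "code unit" is the more general term for the accessors listed in this thread. It uses the C++11 types std::u16string and std::u32string with U+1F600, a code point outside the Basic Multilingual Plane; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// In UTF-16 the code point U+1F600 is a surrogate pair: two code units.
std::size_t utf16_units() {
    std::u16string s = u"\U0001F600";
    return s.size();
}

// In UTF-32 the same code point is a single code unit.
std::size_t utf32_units() {
    std::u32string s = U"\U0001F600";
    return s.size();
}

// at() is bounds-checked in terms of code units and throws when out of range.
bool at_checks_bounds() {
    std::u16string s = u"ab";
    try {
        (void)s.at(2);  // valid indices are 0 and 1
    } catch (const std::out_of_range&) {
        return true;
    }
    return false;
}
```

The same member functions therefore index 16-bit or 32-bit units in these instantiations, just as they index bytes in std::string.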
COW history
The fact that std::string cannot use copy-on-write, according to the standard, is interesting. But I am having a hard time finding references and details.
Though I can find many statements that it is not allowed, I cannot find a pointer to a document saying "the C++ standard was changed to disallow cow strings", including perhaps a date and/or C++ standard version number.
Also I clearly remember learning about this as a difference and big argument between users of MSVC and G++. One of them used COW and the other didn't. I thought it was G++ that did not use COW, but looking at papers it appears this may be incorrect: there are many references to G++ being "late in implementing conforming std::string without COW". I cannot find anything about MSVC++, but I find it believable that I had this backwards and it was MSVC that was slower. Anybody know the details? Spitzak (talk) 02:58, 7 January 2016 (UTC)
string literals
This article would benefit from some text mentioning the new (C++14?) string literals that use a normal string literal suffixed with an "s", e.g.
    using namespace std::literals;
    auto mystring = "mystring"s;
-- Preceding unsigned comment added by Raybellis (talk • contribs) 10:33, 15 January 2016 (UTC)
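For reference, a minimal sketch of the "s" suffix, which was added in C++14 as operator""s in namespace std::literals. The function names are hypothetical. The sketch also shows one practical difference from the const char* constructor: the literal operator receives the full length, so an embedded '\0' is kept.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

using namespace std::literals;

// The suffixed literal is a std::string, not a const char array.
static_assert(std::is_same<decltype("mystring"s), std::string>::value,
              "\"...\"s yields std::string");

// The literal operator is passed the literal's length, so the '\0' survives.
std::size_t suffixed_literal_size() {
    return "asdf\0asdf"s.size();
}

// The const char* constructor stops at the first '\0'.
std::size_t plain_literal_size() {
    return std::string("asdf\0asdf").size();
}

void check_literal_sizes() {
    assert(suffixed_literal_size() == 9);
    assert(plain_literal_size() == 4);
}
```

This requires compiling as C++14 or later.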