This is the mail archive of the
mailing list for the Cygwin project.
Re: "C" character set (again)
Corinna Vinschen wrote:
On Jan 8 12:12, Thomas Wolff wrote:
I couldn't reproduce this for an hour until I noticed why, and suddenly
all arguments seem to blend well together:
Andy Koppe wrote:
It does! ...
And byte-transparent, right?
There's an important distinction here between the C locale and the
default locale. The C locale is what you get if you don't call
setlocale at all, whereas the default locale is what you get if you
call setlocale(LC_FOO, "") and the relevant environment variables are
all unset or empty.
The default locale uses UTF-8, and I most certainly agree that this
should stay as is. The charset of the filesystem and the console are
both controlled by the default locale (unless overridden in the
environment). They are independent of the C locale's charset or
whether an application calls setlocale.
No, this is about the C locale only. Lots of people and programs make
assumptions about the C locale which may not be valid according to
POSIX, but which nevertheless hold true for Linux and most (if not
all) other Unices, including Cygwin 1.5. The most important assumption
is that the C locale is 8-bit clean.
Which gets me back to this printf issue; actually your point here
seems to support my arguments there, if only I had explicitly
restricted them to the C locale.
Could you agree that functions like sprintf should handle their char
* arguments byte-transparently if acting in the C locale?
My sample program (attached) as well as the sample from the other thread
do not even work if cygwin runs in an 8-bit locale.
This is surprising - a user cannot rectify the problem using the locale
mechanism although it is supposed to provide the feature of proper
The program can be convinced to do what's expected only if
setlocale is invoked to explicitly set an 8-bit locale
(included in a comment of my program).
The reason is probably that programs always start in the "C" locale (I
think that's something claimed by POSIX?). If that's UTF-8, however,
behaviour of locale-agnostic programs is not as expected. This actively
breaks legacy compatibility.
So, actually, reconsidering your response above, no it does not. If
running in the C locale, whether explicitly or implicitly,
sprintf is not byte-transparent in 1 of 3 cases (of my sample program),
and printf is not byte-transparent in 2 of 3 cases (which is another
surprising inconsistency, between printf and sprintf).
Some of the details have been noted before (sorry), but for me, this
summary results in a clearer picture now,
and the best and easiest solution IMHO would be to indeed change the C
locale back to 8 bit, byte-transparent, and not even plan to rechange
(That's why I'm discussing it here, not in the sprintf thread.)
The problem occurs in the *format* string. ...
[Maybe this should be discussed in the other thread but let's keep it
together for now.]
Yes, and I doubted (in the other thread) that it should occur, putting
it more precisely now, because in
the condition that "a wide-character code that does not correspond to a
valid character has been detected" is only mentioned as a condition for
the EILSEQ error.
While Andy had a valid point in finding *format* to be described as a
"character string" and relating that to a generic POSIX definition of
this certainly does not justify the current behaviour of silent dropping
and reporting partial success because that is not one of the options in
the "RETURN VALUE" section;
also I don't see what Andy's claim "Including invalid bytes in the
format string is undefined behaviour." is based on.
So I'd like to encourage you to apply your patch to vprintf (I don't see
a need to feel uneasy about it) in any case - whether or not the C
locale gets changed;
there is an additional consideration in favour of it:
The printf functions, especially fprintf and sprintf, are not
necessarily preparing text output, esp. to a terminal. They can also be
used to prepare binary data for output into a file which is totally
locale-agnostic and shouldn't be broken.