
case insensitive wchars


2005-08-02 03:52:01 AM
cppbuilder26
I'm writing a platform-independent function that compares two wchar_t
strings together, with and without case sensitivity.
But, I've run into a problem with trying to make the two strings lowercase
so that I can compare them.
How can I convert wchar_t strings to lowercase ?
I'm using std::wstring to handle the strings.
Jonathan
 
 

Re:case insensitive wchars

"Jonathan Benedicto" < XXXX@XXXXX.COM >writes:
Quote
I'm writing a platform-independent function that compares two
wchar_t strings together, with and without case sensitivity.
How are they encoded?
Quote
But, I've run into a problem with trying to make the two strings
lowercase so that I can compare them.

How can I convert wchar_t strings to lowercase?
Not at all. No software is able to do that correctly, because doing it
would require the software to understand the text.
 

Re:case insensitive wchars

"Thomas Maeder [TeamB]" < XXXX@XXXXX.COM >wrote in message
Quote
How are they encoded?
I don't know.
Quote
Not at all. No software is able to do that correctly, because doing it
would require the software to understand the text.
So, would it be the best idea for me to write the function to only perform
the lowercase conversion on english characters ?
I'm very sorry for being so ignorant about this. I'm just learning about
Unicode.
Jonathan
 


Re:case insensitive wchars

Jonathan Benedicto < XXXX@XXXXX.COM >wrote:
Quote
I'm very sorry for being so ignorant about this. I'm just learning about
Unicode.

I am in the same situation. So take this reply for what it's worth.
Win32 API lstrcmpi() can perform a comparison with no case sensitivity.
But as the documentation said:
For some locales, the lstrcmpi function may be insufficient. If this
occurs, use CompareString to ensure proper comparison.
--
JF Jolin
 

Re:case insensitive wchars

"JF Jolin" < XXXX@XXXXX.COM >wrote in message
Quote
I am in the same situation. So take this reply for what it's worth.
Thank you for replying.
Quote
Win32 API lstrcmpi() can perform a comparison with no case sensitivity.
But as the documentation said:

For some locales, the lstrcmpi function may be insufficient. If this
occurs, use CompareString to ensure proper comparison.
The problem I have is that the function must be platform-independent.
Jonathan
 

Re:case insensitive wchars

"Jonathan Benedicto" < XXXX@XXXXX.COM >writes:
Quote
>How are they encoded?

I don't know.
It's hard to do anything useful with data whose meaning you don't
know.
Quote
>Not at all. No software is able to do that correctly, because doing
>it would require the software to understand the text.

So, would it be the best idea for me to write the function to only
perform the lowercase conversion on english characters?
I can't tell you what the best idea for you is.
There is no such thing as "english characters" (well, the term
"character" is overloaded, but you know what I mean :-) ). English
texts are commonly written in the Latin alphabet. Which is one of the
few alphabets used on this planet that distinguish between uppercase
and lowercase characters (the other one I know is the cyrillic
alphabet).
To correctly perform the conversion to lowercase in German and French
text (both typically written in the Latin alphabet), your software
needs to understand the text. I'd assume that other languages would
apply as well, but I can't tell for sure.
So if you *know* that a certain text is in English, it may be safe to
do the conversion to lowercase character per character, as typical
functions do it.
Quote
I'm very sorry for being so ignorant about this. I'm just learning about
Unicode.
So the wchar_t objects contain Unicode code points? That would at
least mean that you know how the text is encoded.
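
To illustrate the "character per character" conversion mentioned above for text known to be English, here is a minimal sketch. It assumes the wchar_t values are Unicode code points (or at least share the ASCII range), and the names ascii_to_lower and equal_strings are made up for this example. It only folds A-Z, so accented letters stay case-sensitive.

```cpp
#include <cstddef>
#include <string>

// Lowercase only the 26 basic Latin letters A-Z; every other value is
// returned unchanged. Assumption: the wchar_t values are Unicode code
// points, so A-Z and a-z are contiguous ranges.
inline wchar_t ascii_to_lower(wchar_t c)
{
    return (c >= L'A' && c <= L'Z')
               ? static_cast<wchar_t>(c - L'A' + L'a')
               : c;
}

// Hypothetical helper: compare two strings element by element,
// optionally ignoring case for the ASCII letters only.
inline bool equal_strings(const std::wstring& a, const std::wstring& b,
                          bool case_sensitive)
{
    if (a.size() != b.size())
        return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        wchar_t x = a[i];
        wchar_t y = b[i];
        if (!case_sensitive) {
            x = ascii_to_lower(x);
            y = ascii_to_lower(y);
        }
        if (x != y)
            return false;
    }
    return true;
}
```

With this sketch, L"Hello" and L"hELLO" compare equal when case is ignored, while É (U+00C9) and é (U+00E9) do not, because nothing outside A-Z is folded.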
 

Re:case insensitive wchars

JF Jolin < XXXX@XXXXX.COM >writes:
Quote
For some locales, the lstrcmpi function may be insufficient. If this
occurs, use CompareString to ensure proper comparison.
I doubt that this function CompareString does a proper comparison.
E.g. to tell if MASSE and Maße should compare equal, understanding of
the text is required.
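
A small sketch of why the MASSE / Maße example defeats per-character lowering: a one-to-one mapping such as std::towlower can never change the length of a string, so a 5-element string and a 4-element string cannot become equal. The helper name lower_per_char is an assumption for this example, and the lambda needs C++11 or later.

```cpp
#include <algorithm>
#include <cwctype>
#include <string>

// Lowercase a string one element at a time with std::towlower.
// The mapping is one-to-one, so the result always has the same
// length as the input.
inline std::wstring lower_per_char(std::wstring s)
{
    std::transform(s.begin(), s.end(), s.begin(), [](wchar_t c) {
        return static_cast<wchar_t>(
            std::towlower(static_cast<std::wint_t>(c)));
    });
    return s;
}
```

Whatever the locale, lower_per_char(L"MASSE") keeps five elements and lower_per_char(L"Ma\u00DFe") keeps four, so comparing the two results still reports a difference.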
 

Re:case insensitive wchars

"Thomas Maeder [TeamB]" < XXXX@XXXXX.COM >wrote in message
Quote
It's hard to do anything useful with data whose meaning you don't
know.
Basically I'm trying to write a wstring based string class. So, I don't
think that I'll ever be able to determine what data is placed into the
class.
Quote
I can't tell you what the best idea for you is.

There is no such thing as "english characters" (well, the term
"character" is overloaded, but you know what I mean :-) ). English
texts are commonly written in the Latin alphabet. Which is one of the
few alphabets used on this planet that distinguish between uppercase
and lowercase characters (the other one I know is the cyrillic
alphabet).

To correctly perform the conversion to lowercase in German and French
text (both typically written in the Latin alphabet), your software
needs to understand the text. I'd assume that other languages would
apply as well, but I can't tell for sure.

So if you *know* that a certain text is in English, it may be safe to
do the conversion to lowercase character per character, as typical
functions do it.
This I don't know, so I guess I'd better drop out the case-insensitivity.
I think that maybe I should use that open-source ICU library instead of
trying to handle the wchar myself.
Jonathan
 

Re:case insensitive wchars

"Thomas Maeder [TeamB]" < XXXX@XXXXX.COM >wrote in message
Quote
JF Jolin < XXXX@XXXXX.COM >writes:

>For some locales, the lstrcmpi function may be insufficient. If this
>occurs, use CompareString to ensure proper comparison.

I doubt that this function CompareString does a proper comparison.

E.g. to tell if MASSE and Maße should compare equal, understanding of
the text is required.
Same for église, EGLISE and ÉGLISE in French. You're
allowed to drop the accent or not in the capitalized format.
We're currently working with this same concept and it's quite complex.
The best we've found so far is to find a cross platform library (QString
in our case) that handles Unicode and do the comparisons with these objects.
But even this isn't perfect due to examples like the above.
I have no idea how you would do this with anything from std c++.
Dealing with std::string or wstring doesn't really work. You basically
have to use something like UTF8 as an intermediate and then you
have problems with normalization and such. As well as the fact that
there can be more than one valid UTF8 encoding for the same
unicode character. Then on top of that, not everyone uses UTF8.
You have to deal with UTF16 and UCS2 for example.
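
As a sketch of the normalization problem mentioned above: the same visible "é" can be stored as one precomposed code point or as a base letter plus a combining accent, and a plain element-wise comparison treats the two as different. This uses std::u32string, which only exists since C++11, and the function names are made up for the example.

```cpp
#include <string>

// "é" as a single precomposed code point: U+00E9.
inline std::u32string precomposed_e_acute()
{
    return U"\u00E9";
}

// "é" as two code points: U+0065 (e) followed by
// U+0301 (combining acute accent).
inline std::u32string decomposed_e_acute()
{
    return U"e\u0301";
}

// Element-wise comparison, which is all std::basic_string offers.
inline bool same_code_points(const std::u32string& a,
                             const std::u32string& b)
{
    return a == b;
}
```

Here same_code_points(precomposed_e_acute(), decomposed_e_acute()) is false even though both strings render as the same letter. Making them compare equal needs a normalization step, which the standard library does not provide.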
 

Re:case insensitive wchars

"Duane Hebert" < XXXX@XXXXX.COM >wrote in message
Quote
I have no idea how you would do this with anything from std c++.
Dealing with std::string or wstring doesn't really work. You basically
have to use something like UTF8 as an intermediate and then you
have problems with normalization and such. As well as the fact that
there can be more than one valid UTF8 encoding for the same
unicode character. Then on top of that, not everyone uses UTF8.
You have to deal with UTF16 and UCS2 for example.
I guess then that the best idea is just to leave out case conversion. In my
case, that would mean making the function not support case-insensitive
comparison, or as I have it now, make the case-insensitive option use the
tolower function.
Jonathan
 

Re:case insensitive wchars

"Jonathan Benedicto" < XXXX@XXXXX.COM >writes:
Quote
I think that maybe I should use that open-source ICU library instead
of trying to handle the wchar myself.
Probably. Buying is typically cheaper than building. And "buying" at
that price is very hard to beat :-)
But this library deals with Unicode, not necessarily with wchar_t
strings, if I understand
www-306.ibm.com/software/globalization/icu/index.jsp correctly.
I have the feeling that you are seeing an equivalence between wchar_t
and Unicode that isn't there. wchar_t objects can be used to represent
Unicode characters; but they can be used for other things as
well. OTOH, Unicode characters can be represented by wchar_t objects,
but there are other representations.
On platforms where sizeof(wchar_t)==2, a representation different from
wchar_t is likely to be more useful since 21 bits are required to
represent all Unicode code points; if the set of Unicode characters to
be represented doesn't exclude these characters (I think they are used
in Thailand), you're probably better off with a 32-bit character type.
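
To make the 16-bit limitation concrete, here is a sketch of how UTF-16 stores a code point above U+FFFF as two 16-bit units (a surrogate pair). The function name to_utf16 is hypothetical, the input is assumed to be a valid code point (at most 0x10FFFF and not itself a surrogate), and char16_t / char32_t require C++11.

```cpp
#include <string>

// Encode one Unicode code point as UTF-16.
// Assumption: cp is valid (<= 0x10FFFF and not in 0xD800-0xDFFF).
inline std::u16string to_utf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        // Fits in a single 16-bit unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Subtract 0x10000, leaving a 20-bit value, then split it
        // into two 10-bit halves.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return out;
}
```

For example, U+20AC fits in one unit, while U+1D11E becomes the pair 0xD834, 0xDD1E. Code that indexes a 16-bit string by element therefore cannot assume one element per character.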
 

Re:case insensitive wchars

Thomas Maeder [TeamB] < XXXX@XXXXX.COM >wrote:
Quote
E.g. to tell if MASSE and Maße should compare equal, understanding of
the text is required.

I agree (clé vs clef) have equivalent meanings.
Do we start a debate over equivalence and comparison ?
What about synonym ?
Fruit and color orange are two different realities.
This is endless...
--
JF Jolin
 

Re:case insensitive wchars

"Thomas Maeder [TeamB]" < XXXX@XXXXX.COM >wrote in message
Quote
I have the feeling that you are seeing an equivalence between wchar_t
and Unicode that isn't there. wchar_t objects can be used to represent
Unicode characters; but they can be used for other things as
well. OTOH, Unicode characters can be represented by wchar_t objects,
but there are other representations.
Yes, thank you, I was thinking they were the same. Now I know better.
Quote
On platforms where sizeof(wchar_t)==2, a representation different from
wchar_t is likely to be more useful since 21 bits are required to
represent all Unicode code points; if the set of Unicode characters to
be represented doesn't exclude these characters (I think they are used
in Thailand), you're probably better off with a 32-bit character type.
Do you think that instead of using wstring, basic_string<int,
char_traits<int>, allocator<int> > might be better, because it would
provide 32-bit character sizes?
Jonathan
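
On the basic_string<int, ...> idea above: the standard does not require a char_traits<int> specialization, so that instantiation is not portable. Since C++11 (which postdates this thread) the library has char32_t and std::u32string for this purpose. A minimal sketch, with unicode_string as a made-up alias:

```cpp
#include <cstddef>
#include <string>

// Hypothetical alias: a string with one element per Unicode code point.
// char32_t is at least 32 bits wide, and std::char_traits<char32_t>
// is guaranteed to exist.
using unicode_string = std::u32string;

// With one element per code point, size() is the code point count.
inline std::size_t code_point_count(const unicode_string& s)
{
    return s.size();
}
```

A string holding U+1D11E, which needs two units in UTF-16, has a code_point_count of 1 here. This still counts code points rather than user-perceived characters, so combining sequences remain a separate problem.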
 

Re:case insensitive wchars

Duane Hebert < XXXX@XXXXX.COM >wrote:
Quote
[...]
I have no idea how you would do this with anything from std c++.
Dealing with std::string or wstring doesn't really work. You basically
have to use something like UTF8 as an intermediate and then you
have problems with normalization and such. As well as the fact that
there can be more than one valid UTF8 encoding for the same
unicode character. Then on top of that, not everyone uses UTF8.
You have to deal with UTF16 and UCS2 for example.
We do Unicode in cross-platform code using
'std::basic_string<>'. Basically, we determine
'wchar_t's size at compile-time and decide
what to use for character types.
On Windows, the main internal representation
is UCS-2 (i.e. those Unicode chars that need
only 16bit) because that's what Windows does.
On OS X the same type (but with a 'wchar_t'
of 32bit) carries UTF-32 (which I think is
safe so far, as there aren't any UTF-32 chars
needing more than 32bit), because it's (AFAIK)
what OS X uses internally.
Schobi
--
XXXX@XXXXX.COM is never read
I'm Schobi at suespammers dot org
"Coming back to where you started is not the same as never leaving"
Terry Pratchett
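
One possible way to sketch the compile-time choice described in the post above (not necessarily how it was actually done there): check the width of wchar_t and fall back to char32_t when it is too narrow for all code points. The names below are assumptions for this example, and it needs C++11.

```cpp
#include <climits>
#include <string>
#include <type_traits>

// True when wchar_t is wide enough for any Unicode code point (21 bits).
constexpr bool wchar_holds_any_code_point =
    sizeof(wchar_t) * CHAR_BIT >= 21;

// Use wchar_t where it is wide enough (typically 32 bits on Unix-like
// systems), otherwise char32_t (for example where wchar_t is 16 bits).
using code_point_char =
    std::conditional<wchar_holds_any_code_point, wchar_t, char32_t>::type;

using code_point_string = std::basic_string<code_point_char>;
```

Either way, code_point_string can hold a value such as 0x1D11E in a single element. The approach in the post differs in that it keeps wchar_t on both platforms and accepts UCS-2 on Windows.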
 

Re:case insensitive wchars

"Hendrik Schober" < XXXX@XXXXX.COM >wrote in message
Quote
We do Unicode in cross-platform code using
'std::basic_string<>'. Basically, we determine
'wchar_t's size at compile-time and decide
what to use for character types.
We use a lot of config files that are "shared" via
unc to different boxes. So far, it's mostly linux and
windows. We've been using std::string
with UTF8 for the config stuff. The classes that are
non-gui deal with them straight. The gui classes that
allow user I/O with some of this data use QString which
has to/from utf8 functions.
So far it's been working well
but the OP asked for cross platform standard way of
doing things. Your answer may be more suitable
to him.
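
One thing to keep in mind when holding UTF-8 in std::string as described above: size() counts bytes, not characters. A minimal sketch of counting code points instead, assuming the input is valid UTF-8 (the function name utf8_code_points is made up, and the range-based for loop needs C++11):

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 string by skipping continuation
// bytes, which always have the bit pattern 10xxxxxx.
// Assumption: s holds valid UTF-8; no validation is done here.
inline std::size_t utf8_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For "église" encoded as UTF-8, size() is 7 because é takes two bytes, while utf8_code_points gives 6.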
Quote
On Windows, the main internal representation
is UCS-2 (i.e. those Unicode chars that need
only 16bit) because that's what Windows does.
On OS X the same type (but with a 'wchar_t'
of 32bit) carries UTF-32 (which I think is
safe so far, as there aren't any UTF-32 chars
needing more than 32bit), because it's (AFAIK)
what OS X uses internally.
How are you handling things like normalization?