String vs Text

@Garry

String counts code points

Text counts “user perceived characters” more or less

Dim a As Text = "☺️"
Var aLength As Integer = a.Length
Var b As Text = "😀"
Var bLength As Integer = b.Length

You get the right answer 🙂

I don’t know much about encodings. Would String.CountFields("") always return what Text.Length does?

I honestly don’t know.
It may.

Strings count “code points”
And several code points can be combined to result in one “user perceived character”

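For instance, a ü can be created in several ways in an encoding like UTF-8 (pretty sure all 3 are valid, though in Unicode a combining mark attaches to the character before it, so only 2 and 3 reliably render as ü):

  1. a “combining umlaut” + u (2 code points)
  2. a u + combining umlaut (2 code points)
  3. the single code point for ü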

A combining umlaut is one that gets put together with the other character to form ONE “character” we see on screen, on a printed page etc

String would count 1 and 2 as 2 “characters” because it doesn’t count “user perceived characters” - it counts code points
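To make the code point counting concrete, here is a minimal sketch in Python rather than Xojo. It assumes only that Python's `len` on a `str` counts code points, which is the behaviour described above for String. The variable names are just for illustration.

```python
import unicodedata

decomposed = "u\u0308"   # u followed by COMBINING DIAERESIS: 2 code points
precomposed = "\u00fc"   # LATIN SMALL LETTER U WITH DIAERESIS: 1 code point

# Both render as one user-perceived character, but the code point counts differ.
code_point_counts = (len(decomposed), len(precomposed))

# NFC normalization composes the pair into the single code point form.
composed = unicodedata.normalize("NFC", decomposed)
same_after_nfc = composed == precomposed
```

Normalizing first only helps when a precomposed form exists. It does nothing for things like emoji sequences, which is where cluster counting comes in.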

TEXT was literally designed to solve these issues and
THERE is NO reason to remove TEXT
It’s just another datatype like Date, DateTime, Integer, UInt64 etc

OR JUST USE TEXT !!!
this IS what it does already

Yes, but text is deprecated (rightly or wrongly) so I imagine people will be nervous to use it.

Xojo should add these functions into String.

What do they do with “Length”?
Either it counts code points or grapheme clusters - not both

And if they add a new function “Clusters” that returns a count of clusters, then explaining the difference between “code points” and “user perceived characters” is on par with explaining the difference between Text and String

Making String handle both classic “string” functions and “text” functions will be messy

And probably break code along the way

It’s WHY Text was originally created
NO breakage - move to it when you need it etc etc etc

But …

Should, wish, hope. Or another workaround.

Having String contain both code point & grapheme cluster functions wouldn’t be any messier than two different data types. I’m not sure how it could break existing code if the grapheme cluster methods were named differently. This is sort of how it is done with other languages such as Google Dart.

I don’t think I would call it a workaround.

Some people need String to work the way it does today, other people (like Garry) need it to work differently.

Creating the extension methods isn’t hard but it does require the user to know that the problem exists in the first place.

The workaround is having to use the iterator to get the length.

Even if Xojo implemented the ICU / OS functions there would still be character iteration going on somewhere, as I’m pretty sure it’s the only way you can count the length of a string when taking grapheme clusters into account.
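As a rough illustration of why some iteration is unavoidable, here is a simplified Python sketch that walks the code points and starts a new cluster only when the current one is not a combining mark and does not follow a zero-width joiner. This is an approximation and not the full Unicode segmentation rules that ICU implements (it ignores things like skin tone modifiers, flags and Hangul), and `approx_cluster_count` is a made-up name.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER, used to glue emoji sequences together


def approx_cluster_count(s: str) -> int:
    """Approximate count of user-perceived characters (NOT full UAX #29)."""
    count = 0
    glue_next = False  # True right after a zero-width joiner
    for ch in s:
        if ch == ZWJ:
            glue_next = True
            continue
        # Categories Mn, Mc and Me are combining marks; variation
        # selectors such as U+FE0F fall under Mn as well.
        is_mark = unicodedata.category(ch).startswith("M")
        # The first code point always opens a cluster, even if it is a mark.
        if count == 0 or (not is_mark and not glue_next):
            count += 1
        glue_next = False
    return count


smiley = "\u263a\ufe0f"   # ☺ plus variation selector: 2 code points
umlaut = "u\u0308"        # u plus combining diaeresis: 2 code points
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"  # 5 code points
```

Every code point has to be inspected once, so the count is linear in the length of the string however it is packaged.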

I’d split it into two classes and eventually actually remove “string”

That way data that is supposed to be “textual” eventually is all handled via “text”
And data that is buckets of bytes is handled in MemoryBlock

String munging both those into a single data type has long caused confusion

How would that work if you had a bucket of bytes from somewhere that was actually text? Today you can just define the encoding but the method you are describing would more than likely involve cloning the bytes which could be a performance killer in some situations.

Being able to use byte functions on strings is useful in some situations when your data is correctly formed UTF-8 / ASCII, as you can perform quick searches using InStrB rather than InStr.
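A small Python sketch of the same idea, assuming well-formed UTF-8 on both sides: searching the raw bytes finds the same match as searching the text, because a valid UTF-8 needle cannot match in the middle of another character. The only difference is that the offset comes back in bytes rather than characters. The sample text is made up.

```python
text = "für Xojo"
data = text.encode("utf-8")   # ü takes two bytes in UTF-8

char_index = text.find("Xojo")                   # offset in code points
byte_index = data.find("Xojo".encode("utf-8"))   # offset in bytes

# Slicing the bytes at the byte offset lands on a character boundary.
tail = data[byte_index:].decode("utf-8")
```

The two offsets differ as soon as anything before the match is outside ASCII, so a byte offset should not be fed to a character-based function.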

If the current Xojo text data type hadn’t been a performance killer and also supported code point based functions then we would have probably used it. Unfortunately, it never really got improved past the original implementation so was DOA as far as we were concerned.

Same way it works in Xojo for iOS NOW

You convert the bucket of bytes to a TEXT AND IF the conversion cannot work you get an error
Something that doesn’t happen with DefineEncoding
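The "convert and fail loudly" behaviour can be sketched in Python, where decoding bytes is strict by default. This is an analogy and not the Xojo API, and `bytes_to_text` is a hypothetical helper name.

```python
def bytes_to_text(data: bytes) -> str:
    """Strictly decode UTF-8; raises UnicodeDecodeError on malformed input."""
    return data.decode("utf-8")  # errors="strict" is the default


good = b"\xc3\xbc"   # valid UTF-8 for ü
bad = b"\xc3\x28"    # lead byte followed by a non-continuation byte

converted = bytes_to_text(good)

try:
    bytes_to_text(bad)
    bad_rejected = False
except UnicodeDecodeError:
    bad_rejected = True
```

Merely labelling bytes with an encoding would let the malformed sequence through, and the problem would only show up later.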

I must admit I don’t use the iOS framework so haven’t looked into how strings work but if conversion means cloning the data I can see that causing performance issues.

As far as the guts go, it’s hard to know

But iOS has the clean split between MemoryBlock & Text, so you couldn’t accidentally manipulate text as bytes and bytes as if they were text

But you don’t have to use iOS for this - Text worked everywhere