String vs Text

npalardy · 26 July 2021 14:41

String counts code points

Text counts “user perceived characters” more or less

Dim a As Text = "☺️"
Var aLength As Integer = a.Length
Var b As Text = "😀"
Var bLength As Integer = b.Length

you get the right answer

BillG · 26 July 2021 15:13

I don’t know too much about encodings, would String.CountFields("") always return what Text.Length does?

npalardy · 26 July 2021 15:20

I honestly don’t know
It may

Strings count “code points”
And several code points can be combined to result in one “user perceived character”

For instance, a ü can be created in several ways in an encoding like UTF-8 (pretty sure all 3 are valid)

a “combining umlaut” + u (2 code points)
a u + combining umlaut
the code point for ü

a combining umlaut is one that gets put together with the other character to form ONE “character” we see on screen, on a printed page etc

String would count 1 and 2 as 2 “characters” because it doesn’t count “user perceived characters” - it counts code points
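To make the ü example concrete, here is a small sketch in Python rather than Xojo (Python's `len` on a `str` also counts code points, so it behaves like String does here). It only uses the "u + combining umlaut" ordering, since Unicode places a combining mark after its base character. The variable names are made up for illustration.

```python
import unicodedata

# Two ways to spell the same user-perceived character "ü".
precomposed = "\u00fc"   # single code point: LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"   # "u" followed by COMBINING DIAERESIS

# len() counts code points, not user-perceived characters.
print(len(precomposed))  # 1
print(len(decomposed))   # 2

# NFC normalization composes the pair into the single code point.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == precomposed)  # True
```

Normalizing helps for this particular case, but it does not turn a code point count into a grapheme cluster count in general (many emoji sequences have no precomposed form).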

TEXT was literally designed to solve these issues and
THERE is NO reason to remove TEXT
It’s just another datatype like Date, DateTime, Integer, UInt64 etc

npalardy · 26 July 2021 15:50

OR JUST USE TEXT !!!
this IS what it does already

s7g2vp2 · 26 July 2021 15:56

Yes, but text is deprecated (rightly or wrongly) so I imagine people will be nervous to use it.

Xojo should add these functions into String.

npalardy · 26 July 2021 16:21

What do they do with “Length” ?
Either it counts code points or grapheme clusters - not both

And if they have a new function “Clusters” that returns a count of clusters then explaining the difference between “code points” and “user perceived characters” is on par with the difference between text and string

Making string handle classic “string” functions, and “text” functions will be messy

And probably break code along the way

It’s WHY Text was originally created
NO breakage - move to it when you need it etc etc etc

But …

HalGumbert · 26 July 2021 16:28

Should, wish, hope. Or another workaround.

s7g2vp2 · 26 July 2021 16:31

Having String contain both code point & grapheme cluster functions wouldn’t be any messier than two different data types. I’m not sure how it could break existing code if the grapheme cluster methods were named differently. This is sort of how it is done with other languages such as Google Dart.

s7g2vp2 · 26 July 2021 16:34

I don’t think I would call it a workaround.

Some people need String to work the way it does today, other people (like Gary) need it to work differently.

Creating the extension methods isn’t hard but it does require the user to know that the problem exists in the first place.

HalGumbert · 26 July 2021 16:39

The workaround is having to use the iterator to get the length.

s7g2vp2 · 26 July 2021 16:43

Even if Xojo implemented the ICU / OS functions there would still be character iteration going on somewhere, as I’m pretty sure it’s the only way you can count the length of a string when taking grapheme clusters into account.
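A rough sketch of why counting clusters means walking the string, written in Python for illustration. This is a deliberately simplified approximation, not the full Unicode segmentation rules that ICU implements: it only glues combining marks and zero-width-joiner sequences onto the previous character. The function name `approx_cluster_count` is hypothetical.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER, used to build emoji sequences

def approx_cluster_count(s: str) -> int:
    """Approximate count of user-perceived characters (simplified rules)."""
    count = 0
    glue = False  # True when the previous code point was a ZWJ
    for ch in s:
        is_zwj = ch == ZWJ
        # Mn/Me/Mc are the combining mark categories; this includes
        # U+0308 COMBINING DIAERESIS and U+FE0F VARIATION SELECTOR-16.
        is_mark = unicodedata.category(ch) in ("Mn", "Me", "Mc")
        # Start a new cluster unless this code point extends the previous one.
        if count == 0 or not (is_zwj or is_mark or glue):
            count += 1
        glue = is_zwj
    return count

print(approx_cluster_count("u\u0308"))   # 1 (len() would say 2)
print(approx_cluster_count("\u263a\ufe0f"))  # 1 (len() would say 2)
print(approx_cluster_count("abc"))       # 3
```

Every code point has to be inspected once, so the count is linear in the length of the string no matter which library does the work.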

npalardy · 26 July 2021 21:16

I’d split it into two classes and eventually actually remove “string”

That way data that is supposed to be “textual” eventually is all handled via “text”
And data that is buckets of bytes is handled in MemoryBlock

String munging both those into a single data type has long caused confusion

s7g2vp2 · 26 July 2021 21:36

How would that work if you had a bucket of bytes from somewhere that was actually text? Today you can just define the encoding but the method you are describing would more than likely involve cloning the bytes which could be a performance killer in some situations.

Being able to use byte functions on strings is useful in some situations when your data is correctly formed UTF-8 / ASCII, as you can perform quick searches using InStrB rather than InStr.
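As an illustration of the byte versus character distinction (in Python, not Xojo), the same search gives different offsets depending on whether it runs over the encoded bytes or the decoded text. Python offsets are 0-based, unlike Xojo's InStr; the sample string is made up.

```python
text = "ü=1"
data = text.encode("utf-8")  # ü takes two bytes in UTF-8: C3 BC

# Searching the decoded text gives an offset in code points.
char_index = text.find("=")
# Searching the raw bytes gives an offset in bytes.
byte_index = data.find(b"=")

print(char_index)  # 1
print(byte_index)  # 2
```

The byte search never has to decode anything, which is where the speed comes from, but its result is only meaningful as a byte offset.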

If the current Xojo text data type hadn’t been a performance killer and also supported code point based functions then we would have probably used it. Unfortunately, it never really got improved past the original implementation so was DOA as far as we were concerned.

npalardy · 26 July 2021 22:03

Same way it works in Xojo for iOS NOW

You convert the bucket of bytes to a TEXT AND IF the conversion cannot work you get an error
Something that doesn’t happen with DefineEncoding
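The same idea sketched in Python as an analogy (this is not the Xojo API): a strict decode validates the bytes and fails on malformed input, rather than just labelling the bytes with an encoding. The helper name `bytes_to_text` is hypothetical.

```python
from typing import Optional

def bytes_to_text(data: bytes) -> Optional[str]:
    """Convert bytes to text, returning None if they are not valid UTF-8."""
    try:
        # Strict decoding checks every byte sequence and raises on bad input.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None

print(bytes_to_text(b"\xc3\xbc"))  # ü
print(bytes_to_text(b"\xff\xfe"))  # None, because 0xFF is never valid in UTF-8
```

The validation pass does mean reading every byte, which is the cost being discussed in the replies.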

s7g2vp2 · 26 July 2021 22:37

I must admit I don’t use the iOS framework so haven’t looked into how strings work but if conversion means cloning the data I can see that causing performance issues.

npalardy · 26 July 2021 22:41

As far as the guts go, it’s hard to know

but iOS has the clean split between MemoryBlock & Text so you couldn’t accidentally manipulate text as bytes and bytes as if they were text

but you don’t have to use iOS for this - Text worked everywhere