UTF-8 Character output

Torsten_B · 13 June 2023 17:38

Compared to Swift, Xojo is light-years behind in text/string handling.

prodman · 13 June 2023 17:47

Compared to Swift, the Xojo language is as unchanging as Stonehenge, which is both a good thing and a bad thing.

npalardy · 13 June 2023 17:54

Unfortunately
if you look at API 2, much of what has been added hasn’t added capability; it just shifted it from functions like VAL to ToInteger, ToDouble etc
It shifted event names from “open” to “opening” - but that didn’t add capabilities

And in some respects it removed capability

Much of it is like Y2K - rewrite tons of code to get exactly what you already had

bkeeney · 13 June 2023 18:17

Worse than Y2K. At least with Y2K you had multiple developers looking at how date times were being used. The API2 code has bugs in it that are still being found years later. QA has never been Xojo’s strong point.

thorstenstueker · 13 June 2023 18:23

I participated once. No questions were answered, and at the end they were talking about the weather. No time for stuff like that. Mr. Perlman has absolute control. That’s professional.

anic297 · 13 June 2023 19:17

Can’t they just add a[n optional] parameter to existing string functions telling the function to treat the string as a text stream? Since text and string were so close, why not accept the current single-datatype way (always strings) and adapt the functions to treat “streams of characters” as either one? It would be the best of both worlds.

Torsten_B · 13 June 2023 19:34

Oops:

So either you deal with the clunky Xojo Mobile Platform or you do things in Xcode. Both together won’t happen I suppose.

https://forum.xojo.com/t/xojo-to-swiftui/76150/4

npalardy · 13 June 2023 19:50

I would say no, they can’t do that
The guts are different for text vs bytes, which is why STRING still gets the counts of “characters” wrong in some cases
TEXT works in the right way for TEXTUAL data - it’s not meant for BYTES
It’s based on the Unicode Consortium’s libraries on each platform so it’s quite consistent

STRING is mostly ancient and for the longest time assumed that 1 byte = 1 character
And then encodings were added to it to handle UTF-8 etc
But it counts some things wrong - like ü as 2 characters
It’s home-grown code that handles some things fundamentally wrong

And they already HAVE text - they need to just NOT throw it out

anic297 · 13 June 2023 20:44

But they are the same thing under the hood, no? A stream of characters.

What does the text type have that the string type doesn’t? Or, in other words, how was the text type defined?

npalardy · 13 June 2023 23:27

NO
see the Xojo documentation (the linked page now shows “Page Not Found”)

String is “a sequence of bytes that you CAN interpret using an encoding” or not

TEXT is a sequence of characters
And “character” is a VERY deliberate and well-defined term in Xojo TEXT

The documentation is very deliberate in its use of these terms: character, code point, and scalar value. A character, in this context, refers to an extended grapheme cluster (also known as a user-perceived character). The terms code point and scalar value retain the meaning defined in the Unicode standard.

which, in part, is why STRING gets it WRONG
in something like ü there are at least 2 parts to it: the u and the diaeresis (the two dots)
this might be Unicode code point U+00FC LATIN SMALL LETTER U WITH DIAERESIS, which is ONE character (multiple bytes)
OR it can be u with a combining diaeresis (U+0308)
BOTH give the same appearance on screen (or the same user-perceived character)
String counts the first as 1 “character”, the second as 2
The second is flat-out WRONG, as it’s not counting user-perceived characters

just try this & you’ll see why STRING is wrong !


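// String version
Dim s As String = &u00fc          // precomposed ü (U+00FC)
Dim s1 As String = "u" + &u0308   // u + combining diaeresis (U+0308)

Dim slen As Integer = s.Len       // reported as 1
Dim s1len As Integer = s1.Len     // reported as 2

// Text version
Dim t As Text = &u00fc
Dim t1 As Text = "u" + &u0308

Dim tlen As Integer = t.Length    // 1
Dim t1len As Integer = t1.Length  // 1

// Stop here and inspect the values in the debugger
Break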

and look at what you see in the debugger
you will see both s and s1 look like they hold ü
but one says it is one “character” long and the other 2

but the Text versions are right in either case

this is why for international usage you should use TEXT - not string
the Xojo framework got a lot of things right and Xojo is moving AWAY from them
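
For anyone without Xojo at hand, the same effect can be reproduced in Python, whose `str` also counts code points rather than user-perceived characters. This is only a sketch: the helper name `approx_grapheme_count` is made up for illustration, and skipping combining marks is a rough stand-in for real extended grapheme cluster segmentation, not an equivalent of what Text does.

```python
import unicodedata

composed = "\u00fc"        # U+00FC LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"     # "u" followed by U+0308 COMBINING DIAERESIS


def approx_grapheme_count(s: str) -> int:
    """Count code points that are not combining marks.

    Rough approximation of user-perceived characters (hypothetical helper);
    the real extended grapheme cluster rules cover many more cases.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


# len() counts code points, so the two spellings disagree.
print(len(composed), len(decomposed))  # 1 2

# Skipping combining marks gives one user-perceived character for both.
print(approx_grapheme_count(composed), approx_grapheme_count(decomposed))  # 1 1

# NFC normalization folds the decomposed form into the single code point.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Both spellings render as ü, yet only a count that works at the level of user-perceived characters reports 1 for each.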

thorstenstueker · 13 June 2023 23:49

But using one encoding like UTF-8 through your entire chain ensures that nothing goes wrong. So I don’t get why so many people don’t care about it.

Torsten_B · 14 June 2023 03:45

The thread has been removed from TOF.

npalardy · 14 June 2023 04:48

Of course it has

s7g2vp2 · 14 June 2023 07:03

They are both streams of bytes but use different methods to interpret them.

STRING in most cases interprets the bytes as Unicode code points.

TEXT interprets the bytes at a higher level, which on MS-Windows and Linux uses the ICU library. Not sure on macOS but it might be an NSString.

The Xojo String type isn’t wrong (or buggy) as people might say, and its functionality matches several other programming languages. Unfortunately, as text composition has become more sophisticated, the Xojo documentation hasn’t been updated to reflect this, so it incorrectly says “characters”.

Both types have their place depending on how you want to manipulate strings (we actually rely on the way that STRING works).

One of the biggest problems with TEXT is that the atrocious performance was never really improved ever since it was introduced. I would also say that Xojo didn’t do a good job explaining its benefits over STRING so people were reluctant to use it.

anic297 · 14 June 2023 11:09

If the only differences are that strings can have no encoding and texts may not have unprintable characters, my idea would stand (read below why not).

I’m fairly aware of this. It doesn’t yet defeat my suggestion.

That’s why I was saying: if they’ve removed the text type (and certainly won’t agree to undo the change), they could instead add a parameter to various functions to consider the input as text rather than string.
But that’s already what the “B” versions do (LeftB, LenB, etc.) to differentiate themselves from the non-“B” ones, no?

Ah, that’s interesting and I’m starting to see the actual difference. So my idea of adding parameters to string functions to make up for the loss of the text data type would basically mean converting from Unicode to NSString (or ICU) each time, which would be inefficient. Fine, I now see why my idea wouldn’t work. Thank you.

Granted.

s7g2vp2 · 14 June 2023 13:15

My understanding is that both NSString and ICU require UTF-16 so you would probably take a big hit in performance.
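
To picture where that cost would come from: a UTF-8 string has to be decoded and written out again as a separate UTF-16 buffer before a UTF-16 API can work on it. A small Python sketch of that conversion follows. The sample text is arbitrary, and this says nothing about how Xojo, NSString or ICU perform the conversion internally.

```python
# Arbitrary sample: ASCII "u", a precomposed ü, then "u" + combining diaeresis.
sample = "u\u00fcu\u0308"

utf8_bytes = sample.encode("utf-8")

# Converting means decoding every byte and writing a second buffer.
utf16_bytes = utf8_bytes.decode("utf-8").encode("utf-16-le")

print(len(utf8_bytes))        # 6 bytes: 1 + 2 + 1 + 2
print(len(utf16_bytes))       # 8 bytes: four 2-byte code units
print(len(utf16_bytes) // 2)  # 4 UTF-16 code units
```

Doing that on every call to a string function is the overhead being described.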

npalardy · 14 June 2023 14:06

Text is NOT a sequence of bytes
https://docs.xojo.com/Text
Much of this doc was written BY the person that implemented TEXT (Joe Ranieri)

it CAN be converted INTO a sequence of bytes (aka old style string)

bkeeney · 14 June 2023 14:22

I miss Joe. But then I miss many of the former Xojo developers like Aaron and Mars. I currently work with several as well.

A lot of talented people have left Xojo. Not just employees but users as well. The brain drain is real.

s7g2vp2 · 14 June 2023 14:48

How are those scalar values stored in memory?

npalardy · 14 June 2023 15:09

OK you want to split hairs
yes it’s a run of “bytes” in one sense - every last thing in a computer is

What TEXT IS NOT is a “sequence of bytes that represent characters”, i.e. like ASCII where each BYTE represents one char, or like UTF-8 where each byte is one of a sequence for a code point.

It’s a list of integers, each one being a Unicode code point - but NOT the BYTES that Unicode code point is represented by in an encoding like UTF-8

In string when you add &u0308 in UTF-8 you get the bytes 0xCC 0x88 in a string
In text when you add &u0308 you get - &u0308
And to get the BYTES from that TEXT you have to ask that TEXT to turn itself into bytes
That then gives you, in UTF-8, 0xCC 0x88

So what they store IS different in INTENT and usage
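
The same split between “a list of code points” and “the bytes an encoding produces for them” can be seen in Python, used here only as a neutral illustration. It is not a claim about how Xojo lays out Text in memory.

```python
mark = "\u0308"  # COMBINING DIAERESIS as a single code point

# The code point view: one integer, 0x0308.
code_points = [ord(ch) for ch in mark]
print([hex(cp) for cp in code_points])  # ['0x308']

# The byte view only exists once you ask for an encoding.
utf8 = mark.encode("utf-8")
print(utf8.hex(" "))  # cc 88

# Decoding those bytes gives back the same code point.
print(utf8.decode("utf-8") == mark)  # True
```

The value held is the code point itself. The bytes 0xCC 0x88 only appear once it is encoded as UTF-8.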