File names with umlauts in the name not handled by Xojo

I’ve tried a few different things from TOF
And I’ve tried converting the encoding as well, and making sure I decompose things to the right form
So far nothing works
The really weird part is that in simple tests the name of the file is actually NOT as it seems
In Finder lists it shows as

 Hauptmen.frm

but when I do ls -al in Terminal it’s

 -rwxr-xr-x@ 1 npalardy  staff  16373  8 Oct  2020 Hauptmen?.frm

and if I just run some code to grab the parent and then list all the names, it appears not to have those special characters … unless I check the binary, and then the ü turns out to be the two bytes &hC2 &h81 (hex dump image not preserved)
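One way to see the exact bytes of each name, independent of how Finder or Terminal choose to render them, is to ask for the directory listing as raw bytes. A minimal sketch in Python (used here only because it makes byte-level checks easy; `raw_names` is a hypothetical helper, not something from the thread):

```python
import os

def raw_names(directory):
    """Return (name_bytes, hex_string) for every entry in `directory`."""
    # Passing a bytes path makes os.listdir return the names as undecoded bytes.
    entries = os.listdir(os.fsencode(directory))
    return [(name, name.hex(" ")) for name in sorted(entries)]

if __name__ == "__main__":
    for name, hex_dump in raw_names("."):
        print(name, hex_dump)
```

An invisible character such as the one in this file name would show up in the hex column as `c2 81` even though the printed name looks clean.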

Sucky part is that I originally get this name from the manifest of a VB project, and in there it’s

Form=Hauptmenü.frm

in a hex editor the ü is the single byte &hFC (hex dump image not preserved)

Pretty sure I need to convert this to UCS-2 (UTF-16)
The mechanisms I have tried don’t seem to work correctly and the file is not found

Can anyone offer any assistance on this?

Ran into this trying to transmit binary from a language that insists strings be UTF-8.
I had to go out of my way to prevent binary data from being converted. A hex dump was full of extra 0xC2 bytes, each followed by another wrong byte.
In the first hex dump, 0xC2 is the UTF-8 lead byte of the two-byte form used for values over 0x7F (it covers 0x80 to 0xBF).
UTF-8 0xC2 0x81 decodes to code point 0x81 (U+0081, an invisible control character).
The second dump is ‘latin1’ / ISO-8859-1 (8-bit binary): 0xFC = ü
I’m not sure if this gives you any more idea about where to look for the ‘helpful translation’ section.
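The "extra 0xC2" effect can be reproduced in a few lines. This is only a sketch of the general mechanism (some layer treating each raw byte as a Latin-1 character and then emitting UTF-8), written in Python for easy checking; `latin1_bytes_as_utf8` is a made-up name and nothing here claims to be what macOS or the unzip tool actually runs:

```python
def latin1_bytes_as_utf8(data: bytes) -> bytes:
    """Simulate a layer that reads raw bytes as Latin-1 text and writes UTF-8."""
    return data.decode("latin-1").encode("utf-8")

if __name__ == "__main__":
    print(latin1_bytes_as_utf8(b"\x81").hex(" "))   # c2 81
    print(latin1_bytes_as_utf8(b"\xfc").hex(" "))   # c3 bc
    # Going the other way: the two bytes C2 81 are one code point, U+0081.
    print(b"\xc2\x81".decode("utf-8") == "\u0081")  # True
```

Every byte from 0x80 to 0xBF picks up a leading 0xC2 this way, and every byte from 0xC0 to 0xFF turns into 0xC3 plus a changed second byte.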

yeah this is a puzzler
the zip file when unzipped gives me those odd names
and the one thing that is in there is a VB6 project manifest file which contains a line like

Form=Hauptmenü.frm

in WindowsANSI
But so far I can’t figure out how to take that ANSI string and actually get the file which IS on disk
It’s like macOS doesn’t see it, or tries to turn the ANSI-encoded string into a UTF-16 one which won’t match, and so it says “no such file”

Such fun !

Isn’t this a two-characters vs. single-character issue?
Like you have é on its own but can also have e and ´ separated (proven by the fact I just wrote them). IIRC, e followed by a combining accent is displayed as a “single character” but uses different code points than the all-in-one version.
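For reference, the composed/decomposed difference looks like this at the byte level. A small Python sketch (Python only for illustration; `utf8_hex` is a hypothetical helper) using the standard `unicodedata` module:

```python
import unicodedata

def utf8_hex(text: str) -> str:
    """UTF-8 bytes of `text` as a space-separated hex string."""
    return text.encode("utf-8").hex(" ")

composed = "\u00fc"                                   # ü as a single code point
decomposed = unicodedata.normalize("NFD", composed)   # u + COMBINING DIAERESIS

if __name__ == "__main__":
    print(utf8_hex(composed))     # c3 bc
    print(utf8_hex(decomposed))   # 75 cc 88
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Neither form produces the `c2 81` mentioned earlier in the thread, so normalization by itself would not account for that name.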

I wish it was
The VB6 manifest has it one way - and the file system actually has it unzipped another way
Still trying to sort out why and how I can deal with this

MS and their way of making universally compatible stuff… :thinking:

Well this is really screwy
The manifest is encoded in WindowsAnsi
This manifest contains the name of the file
And in that the string contains

    Hauptmenü.frm

and the ü is encoded as a single byte &hFC
The zip file, when uncompressed on macOS, gives me a file whose name has that single &hFC encoded in an encoding I don’t recognize, and it’s 2 bytes: &hC2 &h81
So far I haven’t found any combination of convert encoding / compose / decompose etc. that turns that &hFC byte into &hC2 &h81

What I’ve landed on that DOES work is along these lines

f is the dir that contains the unzipped manifest & vb files

Dim value As String = ConvertEncoding("Hauptmenü.frm", Encodings.WindowsANSI)
Dim newvalue As String = value.ReplaceAll(ChrB(&hFC), ChrB(&hC2)+ChrB(&h81))
Dim fl1 As folderitem = f.child(newvalue)
Dim newUrl As String = f.URLPath + "/" + EncodeURLComponent(newvalue)
Dim fl2 As folderitem = GetFolderItem(newUrl, FolderItem.PathTypeURL)

Break

With this, fl1 will not be Nil but fl1.Exists will be False :frowning:
But fl2, fetched by URL, will not be Nil AND fl2.Exists will be True
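To make the URL route concrete, this is roughly what the percent-encoded last path component looks like for those bytes. It is a Python sketch in which `urllib.parse.quote` merely stands in for `EncodeURLComponent`; that the two agree for this particular name is an assumption, and they may treat reserved characters differently:

```python
from urllib.parse import quote, unquote_to_bytes

# Name as it exists on disk according to the thread: "Hauptmen" + C2 81 + ".frm"
name_on_disk = b"Hauptmen\xc2\x81.frm"

# Percent-encode every byte that is not an unreserved URL character.
encoded = quote(name_on_disk, safe="")

if __name__ == "__main__":
    print(encoded)  # Hauptmen%C2%81.frm
    # Decoding the component gives back exactly the original bytes.
    print(unquote_to_bytes(encoded) == name_on_disk)  # True
```

The percent-encoded form carries the exact bytes, which may be why the URL lookup finds the file while the lookup by string name does not.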

The 0xC2 0x81 is a UTF-8 encoded 0x81. In the DOSLatinUS encoding, byte 0x81 is ü.

Public Function utf8escape(ss as String) as String
  Dim ret As String
  Dim s() As String 
  Dim i,c As Integer
  
  s = SplitB(ss,"")
  
  While i <= UBound(s)
    c = AscB(s(i))
    If c < 128 Then
      ret = ret + s(i)
    Else
      // '0xC0 + c >> 6' + '0x80 + c & 0x7f'
      ret = ret + ChrB(192 + (Bitwise.ShiftRight(c,6) And 127)) + ChrB(128 + (c And 127))
    End If
    i = i + 1
  Wend
  
  Return ret
End Function

Dim value As String = ConvertEncoding("Hauptmenü.frm", Encodings.DOSLatinUS)
value = utf8escape(value)
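The whole chain from the manifest byte to the bytes on disk can be checked end to end. The sketch below is Python rather than Xojo, and it assumes that Python's `cp1252` codec corresponds to WindowsANSI and `cp437` to DOSLatinUS; `manifest_name_to_disk_bytes` is a hypothetical helper name:

```python
def manifest_name_to_disk_bytes(manifest_bytes: bytes) -> bytes:
    """Map a WindowsANSI file name to the byte sequence observed on disk."""
    text = manifest_bytes.decode("cp1252")   # WindowsANSI: 0xFC -> ü
    dos = text.encode("cp437")               # DOSLatinUS:  ü -> 0x81
    # Treat each DOS byte as a code point and write it out as UTF-8,
    # which is what utf8escape does by hand: 0x81 -> C2 81.
    return dos.decode("latin-1").encode("utf-8")

if __name__ == "__main__":
    print(manifest_name_to_disk_bytes(b"Hauptmen\xfc.frm").hex(" "))
```

Characters that have no code page 437 equivalent would raise `UnicodeEncodeError` in this sketch, so real manifests may need a fallback. The result would also be consistent with the ZIP format's legacy default of code page 437 for entry names, though nothing in the thread confirms which tool created the archive.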

oooooo !!! this might work even better than the hacky way I have it now !

EDIT - once I figure out what the right conversion from WindowsAnsi to DosLatin is, this could help
it’s that part that’s tripping this whole mess up

value1 is what I start with - it is windows ansi as best I can tell
and the file system seems to have it in that weird dos latin 1 form
so I need to figure the right conversion from one to the other
so far I’m not quite there

Dim value1 As String = DefineEncoding( chrb(&hFC) + ".frm" , Encodings.WindowsANSI) 
Dim newvalue1 As String = utf8escape(value1)
Dim value2 As String = ConvertEncoding(value1, Encodings.DOSLatinUS) 
Dim newvalue2 As String = utf8escape(value2)

Break

I need to do more hunting

Re-examined the encoding. I didn’t strip enough lower bits:

Else
  // '0xC0 + c >> 6' + '0x80 + c & 0x7f'
  ret = ret + ChrB(192 + (Bitwise.ShiftRight(c,6) And 127)) + ChrB(128 + (c And 127))
End If

should be:

Else
  // '0xC0 + c >> 6' + '0x80 + c & 0x3f'
  ret = ret + ChrB(192 + (Bitwise.ShiftRight(c,6) And 127)) + ChrB(128 + (c And 63))
End If

This should match the 2-byte encoding from https://en.wikipedia.org/wiki/UTF-8
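As a cross-check of the corrected masks, here is the same two-byte logic as a Python sketch (a port for verification only, not the Xojo function itself), compared against Python's built-in UTF-8 encoder for every possible byte value:

```python
def utf8_escape(data: bytes) -> bytes:
    """Encode each byte as the code point of the same value, in UTF-8."""
    out = bytearray()
    for c in data:
        if c < 0x80:
            out.append(c)                   # ASCII passes through unchanged
        else:
            out.append(0xC0 | (c >> 6))     # lead byte: 110000xx
            out.append(0x80 | (c & 0x3F))   # continuation byte: 10xxxxxx
    return bytes(out)

if __name__ == "__main__":
    for value in range(256):
        expected = chr(value).encode("utf-8")
        assert utf8_escape(bytes([value])) == expected
    print(utf8_escape(b"\x81").hex(" "))  # c2 81
```

With the earlier mask (And 127), byte &hFC would come out as C3 FC rather than C3 BC; byte &h81 happens to give C2 81 with either mask, which is why the first version appeared to work for this file.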