I have the following method that converts a string to RTF code. It works fine as long as the string contains no emojis built from grapheme clusters (code points above 65535).
The following string should be converted into valid RTF code:
Hello World 🧛🏽♂️😀
My method gives me this back:
Hello World 1F9DB 1F3FD\uc0\u8205 \uc0\u9794 \uc0\u-497 1F6009
That's not right. Looking at the RTF output for this string in Apple Pages and Microsoft Word, I get different results. You can find the RTF files generated by Microsoft Word and Apple Pages here:
Does anyone recognize a pattern for correctly converting these grapheme-cluster emojis, or know how to write the correct code?
Function ToRTF(value As String) As String
  Var iCharacterCount As Integer = value.Length - 1
  Var sCharacter As String
  Var iCodePoint As Integer
  Var arsRTF() As String
  
  For i As Integer = 0 To iCharacterCount
    sCharacter = value.Middle(i, 1)
    iCodePoint = sCharacter.Asc
    
    Select Case iCodePoint
    Case 92, 123, 125 ' backslash, { and } must be escaped
      arsRTF.AddRow("\" + Encodings.UTF8.Chr(iCodePoint))
    Case Is <= 127
      arsRTF.AddRow(Encodings.UTF8.Chr(iCodePoint))
    Case 128 To 255
      arsRTF.AddRow("\'" + Hex(iCodePoint))
    Case 256 To 32767
      arsRTF.AddRow("\uc0\u" + iCodePoint.ToString + " ")
    Case 32768 To 65535 ' \u takes a signed 16-bit value, so wrap
      iCodePoint = iCodePoint - 65536
      arsRTF.AddRow("\uc0\u" + iCodePoint.ToString + " ")
    Else
      ' I'm not sure this is right. Actually, this is where the
      ' remaining grapheme cluster emojis should be processed.
      arsRTF.AddRow(" " + Hex(iCodePoint))
    End Select
  Next
  
  Return String.FromArray(arsRTF, "")
End Function
just a guess … but the 1F3FD (skin tone modifier) did not enter the Unicode world until 2015… and I don’t think the RTF spec has changed since about 1902
Surely there are several other emojis that are composed not with 1F3FD but with other joining characters. Do you have an idea how to write universal code for this?
YES, but if you look at the two RTF files you will see that the emojis are encoded differently. Apple does it differently than Microsoft, and yet the Apple RTF file can be read into Microsoft Word without errors. I have no idea how Apple and Microsoft come up with this encoding.
There is a Perl code that does the emoji conversion. Unfortunately I don’t know anything about Perl. What would this code look like when converted to Xojo?
my $c = substr( $ARGV[0], 0, 1 );
say join( "\t⇒ ", $c, sprintf( "U+%X", ord($c) ), emoji2rtf($c) );
exit;

sub emoji2rtf($) {
    my $n = ord( substr( shift, 0, 1 ) );
    die "emoji2rtf: code must be >= 65536\n" if ( $n < 0x10000 );
    return sprintf( "\\u%d?\\u%d?",
        0xd800 + ( ( $n - 0x10000 ) & 0xffc00 ) / 0x400 - 0x10000,
        0xdc00 + ( ( $n - 0x10000 ) & 0x3ff ) - 0x10000 );
}
I’d have expected the return statement to be something like
Function emoji2rtf(n As Integer) As String
  // die "emoji2rtf: code must be >= 65536\n" if ( $n < 0x10000 );
  If n < 65536 Then
    Break
    Return "" // probably not right but ...
  End If
  
  // \ is Xojo's integer division operator
  Dim d1 As Integer = &hD800 + ((n - &h10000) And &hFFC00) \ &h400 - &h10000
  Dim d2 As Integer = &hDC00 + ((n - &h10000) And &h3FF) - &h10000
  Return "\u" + Str(d1) + "?\u" + Str(d2) + "?"
End Function
BUT this doesn't seem to give either result from the two RTF files posted before.
What I don't know about Perl could fill books.
I'm not sure about its numeric types and default sizes, and I'm sure that plays into why the results vary.
Thanks, Norman. Yeah, what can I say? The output doesn't exactly match the RTF code from Apple Pages or Microsoft Word, but Word's RTF is similar to the function's result, and if I embed this result in an RTF file, the emoji loads correctly in both Word and Pages. If you compare the RTF output of Xojo, Word and Pages, you will see that Word outputs the string \'5f instead of ?. According to the RTF documentation, that means an underscore.
What I notice about my modified function is that the output is a few characters longer than the original RTF code. Why is that?
And one more question about the syntax: what exactly happens with a construct like (iCodePoint And 1047552)? I have never seen an integer combined with And before.
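In case it helps later readers: And between two integers in Xojo is a bitwise AND, i.e. a mask that keeps only the bits set in both operands. The two magic numbers are just masks in disguise: 1047552 = &hFFC00 (bits 10-19) and 1023 = &h3FF (bits 0-9). A quick Python sketch of what the masking does (example values are mine, using 😀):

```python
n = 0x1F600 - 0x10000         # 0xF600, the offset of U+1F600 above Plane 0

high10 = (n & 0xFFC00) >> 10  # keep bits 10-19, then shift them down
low10 = n & 0x3FF             # keep bits 0-9

print(hex(high10), hex(low10))  # 0x3d 0x200
```

Dividing by 1024 (as the Xojo code does) is the same as shifting right by 10 bits here, because the mask guarantees the low 10 bits are already zero.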
Function ToRTF(value As String) As String
  Var iCharacterCount As Integer = value.Length - 1
  Var sCharacter As String
  Var iCodePoint As Integer
  Var arsRTF() As String
  Var d1, d2 As Integer
  
  For i As Integer = 0 To iCharacterCount
    sCharacter = value.Middle(i, 1)
    iCodePoint = sCharacter.Asc
    
    Select Case iCodePoint
    Case 92, 123, 125 ' backslash, { and } must be escaped
      arsRTF.AddRow("\" + Encodings.UTF8.Chr(iCodePoint))
    Case Is <= 127
      arsRTF.AddRow(Encodings.UTF8.Chr(iCodePoint))
    Case 128 To 255
      arsRTF.AddRow("\'" + Hex(iCodePoint))
    Case 256 To 32767
      arsRTF.AddRow("\uc0\u" + iCodePoint.ToString + " ")
    Case 32768 To 65535 ' \u takes a signed 16-bit value, so wrap
      iCodePoint = iCodePoint - 65536
      arsRTF.AddRow("\uc0\u" + iCodePoint.ToString + " ")
    Else
      ' Unicode Plane 1 and up: encode as a UTF-16 surrogate pair
      iCodePoint = iCodePoint - 65536
      d1 = 55296 + (iCodePoint And 1047552) \ 1024 - 65536
      d2 = 56320 + (iCodePoint And 1023) - 65536
      arsRTF.AddRow("\u" + d1.ToString + "?\u" + d2.ToString + "?")
    End Select
  Next
  
  Return String.FromArray(arsRTF, "")
End Function
My algorithm seems to match the wrong cases a few times for this emoji: \uc0\u8205 \uc0\u9794 \uc0\u-497. That was the point of my question, if there is any way around this.
No, I don’t know anything about it. I’m gonna have to read into these.
The Perl code is basically converting a Unicode value to UTF-16 and formatting it as an RTF sequence.
I have supplied some Xojo code below which should do the same, as well as handle ASCII and Unicode Plane 0 characters. It doesn't handle tabs or carriage returns, though.
It looks like RTF can accept the \u sequences either as signed or unsigned 16 bit values.
so… \u-497 in your example is the same as \u65039 (-497 + 65536)
The character after the \u (? or '5f) is to tell the RTF parser what character to display if it doesn’t understand the \u sequence.
The number of fallback characters you have to supply is set by the \uc command, which defaults to 1 (\uc0 means don't display anything).
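The whole scheme described above (ASCII passes through, Plane 0 uses one signed \u value, higher planes use a surrogate pair, and \uc0 says no fallback character follows) can be sketched end-to-end in Python. The function name is mine, and I haven't compared the output against Word's or Pages' files:

```python
def to_rtf(text):
    out = []
    for ch in text:
        cp = ord(ch)
        if cp in (0x5C, 0x7B, 0x7D):   # \ { } must be escaped
            out.append("\\" + ch)
        elif cp <= 0x7F:               # plain ASCII
            out.append(ch)
        elif cp <= 0xFFFF:             # Plane 0: one signed 16-bit \u value
            v = cp - 0x10000 if cp >= 0x8000 else cp
            out.append("\\uc0\\u%d " % v)
        else:                          # Plane 1+: UTF-16 surrogate pair
            n = cp - 0x10000
            hi = 0xD800 + (n >> 10) - 0x10000
            lo = 0xDC00 + (n & 0x3FF) - 0x10000
            out.append("\\uc0\\u%d \\u%d " % (hi, lo))
    return "".join(out)

print(to_rtf("Hi \u200D\uFE0F\U0001F600"))
```

The signed wrap in the Plane 0 branch is exactly where the \u-497 for U+FE0F (65039 - 65536) comes from.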
Const kEndOfASCII = &h7F
Const kMaxSigned16BitValue = &h7FFF
Const kStartOfUnicodePlane1 = &h10000
Const kUTF16HighSurrogateStart = &hD800
Const kUTF16LowSurrogateStart = &hDC00
Const kHigh10Bits = &hFFC00
Const kHigh10BitsShiftToLow10Bits = &h400
Const kLow10Bits = &h3FF
Const kUnsigned16BitValueIntoSigned16BitValueRange = &h10000

Dim r As String
Dim s As String
Dim u As UInt32
Dim w1, w2 As Int16

s = "😀"
r = ""
u = Asc(s)

' note: the following is based on \uc being 1
If u <= kEndOfASCII Then
  ' ASCII
  r = s
ElseIf u <= kMaxSigned16BitValue Then
  ' Unicode Plane 0 (values 128 - 32767)
  r = "\u" + Str(u) + "\'5f"
ElseIf u < kStartOfUnicodePlane1 Then
  ' Unicode Plane 0 (values 32768 - 65535), wrapped into the signed range
  r = "\u" + Str(u - kUnsigned16BitValueIntoSigned16BitValueRange) + "\'5f"
Else
  ' Unicode Plane 1 upwards: convert to a UTF-16 surrogate pair
  u = u - kStartOfUnicodePlane1
  w1 = kUTF16HighSurrogateStart + ((u And kHigh10Bits) \ kHigh10BitsShiftToLow10Bits) - kUnsigned16BitValueIntoSigned16BitValueRange
  w2 = kUTF16LowSurrogateStart + (u And kLow10Bits) - kUnsigned16BitValueIntoSigned16BitValueRange
  ' add the high & low surrogates
  r = "\u" + Str(w1) + "\'5f" + "\u" + Str(w2) + "\'5f"
End If

Break ' debugger breakpoint so you can inspect r
I don’t think there will be any difference in the result for the two methods.
The second method might be more efficient, as it is implemented using math rather than by converting strings to different encodings and using a MemoryBlock.
re> Your revised code not working.
I see you are using \uc0 for Unicode Plane 0.
However, when you encode for above Unicode Plane 0 you have included ? but don't switch to \uc1, so maybe you need to prefix with \uc1. You might possibly also need a space afterwards.
You might actually be able to just use \uc0 all of the time and use a space instead of ? like you do for Unicode Plane 0.
NOTE: I don't think you have to prefix every single Unicode sequence with \uc. From what I can remember, the \uc value behaves like other commands: once set, it applies to the current group and any sub-groups, and if you set it in a sub-group, the previous value is restored when the sub-group is closed.
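To make the group scoping concrete, here is a minimal hand-written RTF fragment (an untested sketch, so treat the details with caution). Inside the first group, \uc1 means one fallback character (here "e") follows each \u keyword; inside the second, \uc0 suppresses fallbacks entirely. Each setting dies with its group:

```rtf
{\rtf1\ansi
Plane 0 with a fallback: {\uc1 \u8364 e}\par
Plane 1 without fallbacks: {\uc0 \u-10179 \u-8704 }\par
}
```

An old reader that doesn't understand \u would show "e" for the euro sign in the first group and nothing for the emoji in the second.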