RTF Emoji Code

I have the following method that converts a string to RTF code. It works fine as long as the string contains no emojis with code points above 65535 or emoji grapheme clusters.

The following string should be converted into valid RTF code:

Hello World 🧛🏽‍♂️😀

My method gives me this back:

Hello World 1F9DB 1F3FD\uc0\u8205 \uc0\u9794 \uc0\u-497  1F6009

That’s not right. Looking at the RTF output for this string in Apple Pages and Microsoft Word, I get different results. You can find the RTF files generated by Microsoft Word and Apple Pages here:

Does anyone recognize a pattern for converting these grapheme cluster emojis correctly, or know how to produce the correct code?

Function ToRTF(value As String) As String
  Var iCharacterCount As Integer = value.Length - 1
  Var sCharacter As String
  Var iCodePoint As Integer
  Var arsRTF() As String

  For i As Integer = 0 To iCharacterCount

    sCharacter = value.Middle(i, 1)
    iCodePoint = sCharacter.Asc

    Select Case iCodePoint
    Case 92, 123, 125 ' backslash, { and } must be escaped in RTF
      arsRTF.AddRow("\" + Encodings.UTF8.Chr(iCodePoint))
    Case Is <= 127
      arsRTF.AddRow(Encodings.UTF8.Chr(iCodePoint))
    Case 128 To 255
      arsRTF.AddRow("\'" + Hex(iCodePoint))
    Case 256 To 32767
      arsRTF.AddRow("\uc0\u" + iCodePoint.ToString + " ")
    Case 32768 To 65535
      iCodePoint = iCodePoint - 65536
      arsRTF.AddRow("\uc0\u" + iCodePoint.ToString + " ")
    Else
      ' I'm not sure this is right. Actually, this is where the 
      ' remaining grapheme cluster emojis should be processed.
      arsRTF.AddRow(" " + Hex(iCodePoint))
    End Select

  Next

  Return String.FromArray(arsRTF, "")
End Function

Dracula is 1F9DB for sure
Smiley Face should be 1F600

So not sure what all the stuff between them is… nor the trailing “9” which will most likely kill the Smiley :smiley:

It is not the normal Dracula, but a dyed one.

just a guess … but the 1F3FD (skin tone modifier) did not enter the Unicode world until 2015… and I don’t think the RTF spec has changed since about 1902

Well, what can I do with this information? :wink:

Surely there are several other emojis that are not composed with 1F3FD but with other connection characters. Do you have an idea how to create a universal code?
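To see what sits between the two emoji, the cluster can be split into its individual code points. This is a small Python sketch (not Xojo), used here only to list the pieces; the role comments are my reading of the Unicode charts.

```python
# The "dyed Dracula" written with escapes so the source stays plain ASCII.
cluster = "\U0001F9DB\U0001F3FD\u200D\u2642\uFE0F"

# One entry per code point in the grapheme cluster.
code_points = ["U+%04X" % ord(ch) for ch in cluster]
print(code_points)
# ['U+1F9DB', 'U+1F3FD', 'U+200D', 'U+2642', 'U+FE0F']

# U+1F9DB  vampire
# U+1F3FD  skin tone modifier
# U+200D   zero width joiner (decimal 8205)
# U+2642   male sign (decimal 9794)
# U+FE0F   variation selector-16 (decimal 65039)
```

So the 8205, 9794 and -497 values in the output belong to the emoji itself. Other clusters use other joined pieces, which is why a converter has to handle each code point on its own rather than know about specific emoji.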

you just need to be sure your calculations are compatible with the RTF standard I guess

YES, but if you look at the two RTF files you will see that the emojis are encoded differently. Apple does it differently than Microsoft, and yet the Apple RTF file can be read into Microsoft Word without errors. I have no idea how Apple and Microsoft come up with this encoding. :frowning:

the difference is Unicode code point vs. UTF-8 encoding

U+1F600 😀 f0 9f 98 80 GRINNING FACE

what you have is the 1st column above, what Pages has is the 2nd

https://www.utf8-chartable.de/unicode-utf8-table.pl?start=128512
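The columns of that table can be reproduced with a few lines of Python (a sketch, not Xojo). For comparison it also prints the UTF-16 code units of the same character, since the decimal values in the Pages output (for example 55358, which is 0xD83E) fall in the UTF-16 surrogate range rather than looking like UTF-8 bytes.

```python
# Three views of the same character, U+1F600 GRINNING FACE.
ch = "\U0001F600"

code_point = ord(ch)                    # Unicode code point
utf8_bytes = ch.encode("utf-8")         # UTF-8 encoding, as in the linked table
raw16 = ch.encode("utf-16-be")          # UTF-16 encoding, big-endian bytes
utf16_units = [int.from_bytes(raw16[i:i + 2], "big")
               for i in range(0, len(raw16), 2)]

print(hex(code_point))                  # 0x1f600
print(utf8_bytes.hex(" "))              # f0 9f 98 80
print([hex(u) for u in utf16_units])    # ['0xd83d', '0xde00']
```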

Okay, I’m not getting anywhere with Dave’s leads.

I have found the following blog post: In the unlikely event you need to represent Emoji in RTF using Perl …

It contains Perl code that does the emoji conversion. Unfortunately I don’t know anything about Perl. What would this code look like when converted to Xojo?

my $c = substr( $ARGV[0], 0, 1 );
say join( "\t⇒ ", $c, sprintf( "U+%X", ord($c) ), emoji2rtf($c) );
exit;
 
sub emoji2rtf($) {
    my $n = ord( substr( shift, 0, 1 ) );
    die "emoji2rtf: code must be >= 65536\n" if ( $n < 0x10000 );
    return sprintf( "\\u%d?\\u%d?",
        0xd800 + ( ( $n - 0x10000 ) & 0xffc00 ) / 0x400 - 0x10000,
        0xdC00 + ( ( $n - 0x10000 ) & 0x3ff ) - 0x10000 );
}

https://perldoc.perl.org/functions/sprintf.html

I’d have expected the return statement to be something like

Function emoji2rtf(n As Integer) As String
  // die "emoji2rtf: code must be >= 65536\n" if ( $n < 0x10000 );
  If n < 65536 Then
    Break
    Return "" // probably not right but ...
  End If

  Dim d1 As Integer = &hD800 + ((n - &h10000) And &hFFC00) / &h400 - &h10000
  Dim d2 As Integer = &hDC00 + ((n - &h10000) And &h3FF) - &h10000

  Return "\u" + Str(d1) + "?\u" + Str(d2) + "?"
End Function

BUT this doesn’t seem to give either result from the two RTF files posted before.
What I don’t know about Perl could fill books.
I’m not sure about its numeric types and default sizes, and I’m sure that plays into why the results vary.
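As a cross-check on the arithmetic, here is the same calculation as a Python sketch (not Xojo, and not the blog author's code). The function name is kept from the Perl; everything else is an assumption for illustration. The Perl `/` is a floating-point division, but the masked value is always a multiple of 0x400, so the result is exact and a right shift by 10 bits gives the same number.

```python
def emoji2rtf(code_point: int) -> str:
    """Turn one code point >= 0x10000 into two signed RTF \\u words."""
    if code_point < 0x10000:
        raise ValueError("emoji2rtf: code must be >= 65536")
    offset = code_point - 0x10000
    # High surrogate: top 10 bits of the offset, shifted into the signed range.
    high = 0xD800 + ((offset & 0xFFC00) >> 10) - 0x10000
    # Low surrogate: bottom 10 bits of the offset, shifted into the signed range.
    low = 0xDC00 + (offset & 0x3FF) - 0x10000
    return "\\u%d?\\u%d?" % (high, low)


print(emoji2rtf(0x1F9DB))  # \u-10178?\u-8741?
print(emoji2rtf(0x1F600))  # \u-10179?\u-8704?
```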


Thanks, Norman. Yeah, what can I say? In fact the output does not match the RTF code of Apple Pages and Microsoft Word, but the RTF code of Microsoft Word is similar to the result of the function and if I embed this result in an RTF file, the emoji will load correctly in Word and Pages. Now compare the RTF output of Xojo, Word and Pages and you will see that Word outputs the string \'5f instead of ?. According to the RTF documentation this means an underscore.

Input: 🧛🏽‍♂️

  • Xojo: \u-10178?\u-8741?\u-10180?\u-8195?\uc0\u8205 \uc0\u9794 \uc0\u-497
  • Word: \u-10178\'5f\u-8741\'5f\u-10180\'5f\u-8195\'5f
  • Pages: \uc0\u55358 \u56795 \u55356 \u57341 \u8205 \u9794 \u65039

What I notice about my modified function is that the output is a few characters longer than the original RTF code. Why is that?

And one more question about the syntax: what exactly happens in a construct like (iCodePoint And 1047552)? I have never seen And applied to integers.

Function ToRTF(value As String) As String
  Var iCharacterCount As Integer = value.Length - 1
  Var sCharacter As String
  Var iCodePoint As Integer
  Var arsRTF() As String
  Var d1, d2 As Integer

  For i As Integer = 0 To iCharacterCount
  
    sCharacter = value.Middle(i, 1)
    iCodePoint = sCharacter.Asc
 
    Select Case iCodePoint
    Case 92, 123, 125 ' backslash, { and } must be escaped in RTF
      arsRTF.AddRow("\" + Encodings.UTF8.Chr(iCodePoint))
    Case Is <= 127
      arsRTF.AddRow(Encodings.UTF8.Chr(iCodePoint))
    Case 128 To 255
      arsRTF.AddRow("\'" + Hex(iCodePoint))
    Case 256 To 32767
      arsRTF.AddRow("\uc0\u" + iCodePoint.ToString + " ")
    Case 32768 To 65535
      iCodepoint = iCodepoint - 65536
      arsRTF.AddRow("\uc0\u" + iCodePoint.ToString + " ")
    Else
      iCodePoint = iCodePoint - 65536
      d1 = 55296 + (iCodePoint And 1047552) / 1024 - 65536
      d2 = 56320 + (iCodePoint And 1023) - 65536
    
      arsRTF.AddRow("\u" + d1.ToString + "?\u" + d2.ToString + "?")
    End Select
  
  Next

  Return String.FromArray(arsRTF, "")
End Function

I’m honestly not sure about the RTF and why MS uses one form and Apple another, yet both work.
I haven’t looked at the RTF spec in many years.

This is a bitwise And
http://docs.xojo.com/Bitwise.BitAnd
Unfortunately the docs are really weak here as to what this means

I’ll just assume you know about bitwise functions unless you need / want more explanation ?
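For readers who have not met bitwise operations, this Python sketch (not Xojo) shows what the two masks from the question do. The constant names are made up for the illustration; 1047552 is 0xFFC00 and 1023 is 0x3FF.

```python
offset = 0x1F9DB - 0x10000     # 0xF9DB, fits in 20 bits

HIGH_10_BITS = 1047552         # 0xFFC00 = 1111111111 0000000000 in binary
LOW_10_BITS = 1023             # 0x003FF = 0000000000 1111111111 in binary

# And keeps only the bits that are 1 in both operands.
high = (offset & HIGH_10_BITS) >> 10   # top 10 bits, moved down 10 places
low = offset & LOW_10_BITS             # bottom 10 bits

print(f"{offset:020b}")                # 00001111100111011011
print(f"{high:010b} {low:010b}")       # 0000111110 0111011011
print(high, low)                       # 62 475
```

Dividing by 1024, as the Xojo code does, has the same effect as the `>> 10` shift here because the masked value has its low 10 bits cleared.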

My algorithm seems to match the wrong cases a few times for this emoji: \uc0\u8205 \uc0\u9794 \uc0\u-497. That was the point of my question, if there is any way around this.

No, I don’t know anything about it. I’m gonna have to read into these.

https://www.great-white-software.com/blog/2020/02/11/bit-twiddling-in-a-css-age/

I think this will convert Unicode values > 65535 to what you want:

Dim m As MemoryBlock
Dim v1, v2 As Int32

m = ConvertEncoding("😀", Encodings.UTF16LE)
m.LittleEndian = True

v1 = m.UInt16Value(0) - 65536
v2 = m.UInt16Value(2) - 65536
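The same idea as a Python sketch (not Xojo), useful for checking the numbers: encode the character as little-endian UTF-16, read the two 16-bit code units, and subtract 65536 to get the signed values. The variable names s1 and s2 are assumptions for the illustration.

```python
import struct

text = "\U0001F600"                  # grinning face
raw = text.encode("utf-16-le")       # 4 bytes: two little-endian 16-bit units

# "<HH" reads two unsigned 16-bit integers in little-endian order.
v1, v2 = struct.unpack("<HH", raw)

# Shift both surrogates into the signed 16-bit range.
s1, s2 = v1 - 65536, v2 - 65536

print(v1, v2)   # 55357 56832
print(s1, s2)   # -10179 -8704
```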


The Perl code is basically converting a Unicode value to UTF-16 and formatting it as an RTF sequence.
I have supplied some Xojo code below which should do the same, as well as handle ASCII and Unicode Plane 0 characters. It doesn’t handle tabs or carriage returns though.

It looks like RTF can accept the \u sequences either as signed or unsigned 16-bit values.
So \u-497 in your example is the same as \u65039 (-497 + 65536).
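The mapping between the two forms is a fixed offset of 65536. A Python sketch (not Xojo) with two hypothetical helper names:

```python
def to_signed16(value: int) -> int:
    """Map an unsigned 16-bit value (0..65535) to the signed form."""
    return value - 0x10000 if value > 0x7FFF else value


def to_unsigned16(value: int) -> int:
    """Map a signed 16-bit value (-32768..32767) to the unsigned form."""
    return value & 0xFFFF


print(to_unsigned16(-497))    # 65039
print(to_signed16(65039))     # -497
print(to_signed16(55358))     # -10178
print(to_signed16(8205))      # 8205, values up to 32767 are unchanged
```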

The character after the \u (? or '5f) is to tell the RTF parser what character to display if it doesn’t understand the \u sequence.
The number of characters you have to specify is based on the \uc command which defaults to 1 (\uc0 means don’t display anything).

  Const kEndOfASCII = &h7F
  Const kMaxSigned16BitValue = &h7FFF
  Const kStartOfUnicodePlane1 = &h10000
  Const kUTF16HighSurrogateStart = &hD800
  Const kUTF16LowSurrogateStart = &hDC00
  Const kHigh10Bits = &hFFC00
  Const kHigh10BitsShiftToLow10Bits = &h400
  Const kLow10Bits = &h3FF
  Const kUnsigned16BitValueIntoSigned16BitValueRange = &h10000
  
  Dim r As String
  Dim s As String
  Dim u As UInt32
  Dim w1, w2 As Int16
  
  s = "😀"
  
  r = ""
  
  u = Asc(s)
  
  'note. the following is based on \uc being 1
  If u <= kEndOfASCII Then
    'ascii
    r = s
  ElseIf u <= kMaxSigned16BitValue Then
    'unicode plane 0 (values 0 - 32767)
    r = "\u" + Str(u) + "\'5f"
  ElseIf u < kStartOfUnicodePlane1 Then
    'unicode plane 0 (values 32768 - 65535)
    r = "\u" + Str(u - kUnsigned16BitValueIntoSigned16BitValueRange) + "\'5f"
  Else
    'unicode plane 1 upwards
    
    'convert unicode plane 1 to utf16
    u = u - kStartOfUnicodePlane1
    
    w1 = kUTF16HighSurrogateStart + ((u And kHigh10Bits) / kHigh10BitsShiftToLow10Bits) - kUnsigned16BitValueIntoSigned16BitValueRange
    w2 = kUTF16LowSurrogateStart + (u And kLow10Bits) - kUnsigned16BitValueIntoSigned16BitValueRange
    
    
    'add the high & low surrogates
    r = "\u" + Str(w1) + "\'5f" + "\u" + Str(w2) + "\'5f"
  End If
  
  
  Break

Great. Thank you for the full explanation.

Can you see why in my above (corrected) code, after the Emoji encoding, other RTF codes are appended? Where is the error?

Do I understand correctly that there is no difference in results between the MemoryBlock method and your second method?

I don’t think there will be any difference in the result for the two methods.
The second method might be more efficient as it is implemented using math rather than converting strings to different encodings and using a memory block.

Re: your revised code not working.
I see you are using \uc0 for Unicode Plane 0.
However, when you encode characters above Unicode Plane 0 you have included ? but don’t switch to \uc1, so maybe you need to prefix with \uc1. You might also need a space afterwards.
You might actually be able to just use \uc0 all of the time and use a space instead of ?, like you do for Unicode Plane 0.

NOTE. I don’t think you have to prefix every single Unicode sequence with \uc. From what I can remember, the \uc value is like other commands which means once set it applies to the current group and any sub-groups. If you apply it to a sub-group the previous value is restored when the sub-group is closed.
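Putting those suggestions together, here is a minimal Python sketch (not Xojo) of a whole-string converter. It assumes, as suggested above, that one \uc0 at the start is enough, that every \u word can be followed by a single delimiting space with no fallback character, and that signed values are emitted. The function name is hypothetical, and tabs and line breaks are not handled.

```python
def to_rtf(text: str) -> str:
    """Convert text to an RTF fragment using \\uc0 and signed \\u words."""
    parts = ["\\uc0 "]                 # set once: no fallback chars after \uN
    for ch in text:
        if ch in "\\{}":
            parts.append("\\" + ch)    # RTF special characters are escaped
        elif ord(ch) <= 0x7F:
            parts.append(ch)           # plain ASCII passes through
        else:
            # Split into UTF-16 code units: one for Plane 0, two above it.
            raw = ch.encode("utf-16-be")
            for i in range(0, len(raw), 2):
                unit = int.from_bytes(raw[i:i + 2], "big")
                if unit > 0x7FFF:
                    unit -= 0x10000    # express as a signed 16-bit value
                parts.append("\\u%d " % unit)
    return "".join(parts)


vampire = "\U0001F9DB\U0001F3FD\u200D\u2642\uFE0F"
print(to_rtf(vampire))
# \uc0 \u-10178 \u-8741 \u-10180 \u-8195 \u8205 \u9794 \u-497
```

The numbers match the values quoted earlier in the thread for this emoji, with the joiner, male sign and variation selector handled by the same branch as any other non-ASCII character.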

@s7g2vp2 Kev, your method is much faster than mine. Don’t know why, but it’s impressive. Thanks so much.

Line Wraps and the Zero-Width Joiner is an interesting blog post on exactly this topic.