String concatenation: mixing encodings

Clothears · 3 June 2020 14:29

Suppose I have:

Var a, b, c As String

Now I initialise c to some string. Therefore it’s UTF-8. Then b gets the data from some socket, so it has Nil TextEncoding, thus:

b = mySock.ReadAll ()
c = "qwerty"

Next I’m gonna do:

a = b + c

Question: what TextEncoding does a have now?

Worse than that. I’m building a up in several steps involving such concatenations, and only at the end will I apply an encoding to the final result. Essentially, I need the concatenation to take one bag-o-bytes and join it to another without any backchat or argument. Is that what will happen?

npalardy · 3 June 2020 14:51

It should be nil

You can use DefineEncoding at the very end BUT, if there are byte sequences that are not valid in that encoding, you can end up with a nil encoding and not know why.
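To illustrate the byte-level idea outside Xojo, here is a minimal sketch in Python. The two chunks are hypothetical stand-ins for data read from a socket, and `is_valid_utf8` is a helper invented for this example. It only shows that joining raw bytes and applying the encoding once at the end works even when the pieces are not valid on their own. It says nothing about how Xojo tracks encodings internally.

```python
# The three UTF-8 bytes of one Chinese character, split across two reads
# (hypothetical chunks standing in for socket data).
first_chunk = b"\xe6\x8a"
second_chunk = b"\xa5"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Neither piece is valid UTF-8 on its own.
print(is_valid_utf8(first_chunk), is_valid_utf8(second_chunk))

# Plain byte concatenation, decoded once at the end, is valid.
joined = first_chunk + second_chunk
print(is_valid_utf8(joined), len(joined.decode("utf-8")))
```

The same reasoning is why defining the encoding on each fragment as it arrives can go wrong: a fragment boundary may fall in the middle of a multi-byte character.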

Is there a reason why you wouldn’t define the encodings as you go ?

Clothears · 3 June 2020 15:47

I had a mail with these lines in the header:

Sender: =?utf-8?Q?=E9=98=B2=E7=96=AB=E7=89=A9=E8=B5=84=E6=8A?=
  =?utf-8?Q?=A5=E5=85=B3?=
  <joe@example.com>

The first and second lines contain encoded-words: the items starting with =? and ending with ?= . I have to decode these, so I had a method to do that. The Q indicates that the meat is encoded as quoted-printable, so that =E9 is one byte with value &hE9, etc. So we have a stream of bytes in UTF-8. The characters are each 3 bytes long and are in fact Chinese characters.

The problem here is that one byte that should be in the first encoded-word, on the first line, has been moved to the second one (=A5), probably to keep within a certain line-length. This is invalid, since the first encoded word now has an incomplete UTF-8 byte-cluster. Each encoded-word has its own encoding definition, with both being UTF-8 in this example.

My original method was treating each encoded word entirely separately, and thus thinking that it had a number of invalid characters in the overall string. My new method handles this, but has to assume that the encoding of the first encoded-word applies to them all.
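The approach described above can be sketched roughly as follows. This is a Python illustration, not the poster's Xojo method: the names `ENCODED_WORD`, `payload_bytes` and `decode_adjacent_words` are made up for the example. It makes the same assumption stated above, that the charset of the first encoded-word applies to all of them. It also returns only the decoded text of the encoded-words and ignores anything else in the header value.

```python
import base64
import re

# =?charset?Q-or-B?payload?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([QqBb])\?([^?\s]*)\?=")


def payload_bytes(kind: str, text: str) -> bytes:
    """Turn one encoded-word payload into raw bytes, without decoding text."""
    if kind.upper() == "B":
        return base64.b64decode(text)
    # Q encoding: "_" stands for a space, "=XX" is one byte given in hex.
    data = text.replace("_", " ").encode("ascii")
    return re.sub(
        rb"=([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        data,
    )


def decode_adjacent_words(header_value: str) -> str:
    """Join the bytes of all encoded-words, then decode them once."""
    words = ENCODED_WORD.findall(header_value)
    if not words:
        return ""
    # Assumption: the first encoded-word's charset applies to all of them.
    charset = words[0][0]
    raw = b"".join(payload_bytes(kind, text) for _, kind, text in words)
    return raw.decode(charset)


sender = (
    "=?utf-8?Q?=E9=98=B2=E7=96=AB=E7=89=A9=E8=B5=84=E6=8A?=\r\n"
    "  =?utf-8?Q?=A5=E5=85=B3?=\r\n"
    "  <joe@example.com>"
)
print(decode_adjacent_words(sender))
```

Decoding each encoded-word separately would fail here, because the first one ends two bytes into a three-byte character.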

This is an example of the ways email clients can screw up; in this case the sending client is doing line-length logic in the wrong place.

Anyway I tried it and so far it works fine - thanks.

npalardy · 3 June 2020 16:22

ah so in this case you need to grab all the individual bytes from those 2 lines of the “sender” and only when they are all accumulated then define the encoding

makes sense

@beatrixwillius is probably VERY familiar with all this stuff

beatrixwillius · 3 June 2020 16:59

Yup. I have a full email parser for such lovelies. The whitespace in a header needs to be treated gently. And then you can parse the data. Here is the beginning of my code to remove whitespace:

Dim HeaderString As String = hasHeader
Dim theRegex As RegEx

theRegex = New RegEx
theRegex.Options.ReplaceAllMatches = True ' replace every fold, not just the first
theRegex.Options.Greedy = True

' Remove folded white space: a line break with white space on both sides
theRegex.SearchPattern = "(\s|\t)+(\r|\n)+(\s|\t)+"
theRegex.ReplacementPattern = " "
HeaderString = theRegex.Replace(HeaderString)

' Then handle a line break followed by white space, with nothing before the break
theRegex.SearchPattern = "(\r|\n)+(\s|\t)+"
theRegex.ReplacementPattern = " "
HeaderString = theRegex.Replace(HeaderString)
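For comparison, the unfolding step might look like this in Python. This is only a sketch: `unfold` is a hypothetical helper, and it uses a narrower pattern than the Xojo code above. It treats a fold as a line break followed by spaces or tabs and replaces it with a single space.

```python
import re


def unfold(header: str) -> str:
    """Replace each fold (line break plus leading white space) with one space."""
    return re.sub(r"(?:\r\n|\r|\n)[ \t]+", " ", header)


folded = (
    "Sender: =?utf-8?Q?=E6=8A?=\r\n"
    "  =?utf-8?Q?=A5?=\r\n"
    "  <joe@example.com>"
)
print(unfold(folded))
```

After this step the encoded-words sit on one line, separated by a single space, ready for the byte-level decoding discussed earlier in the thread.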