String concatenation: mixing encodings

Suppose I have:

Var a, b, c As String

Now I initialise c from a string literal, so it’s UTF-8. Then b gets its data from some socket, so it has a Nil TextEncoding, thus:

b = mySock.ReadAll ()
c = "qwerty"

Next I’m gonna do:

a = b + c

Question: what TextEncoding does a have now?

Worse than that. I’m building a up in several steps involving such concatenations, and only at the end will I apply an encoding to the final result. Essentially, I need the concatenation to take one bag-o-bytes and join it to another without any backchat or argument. Is that what will happen?

It should be nil

You can use DefineEncoding at the very end, BUT if the data contains byte sequences that are not valid in that encoding, you can end up with a nil encoding and not know why.

Is there a reason why you wouldn’t define the encodings as you go?
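For anyone who wants to try the "join the bytes, label them once at the end" idea in isolation, here is a minimal sketch in Python rather than Xojo. It is only an analogue: Python has no string with a Nil encoding, so raw chunks stay as bytes and a failed decode stands in for the "nil encoding and no idea why" case. The function name join_then_decode is made up for this illustration.

```python
def join_then_decode(chunks, encoding="utf-8"):
    """Join raw byte chunks untouched, then interpret them exactly once.

    Returns None when the joined bytes are not valid in `encoding`,
    which is the situation the reply above warns about.
    """
    raw = b"".join(chunks)  # plain byte concatenation, no encoding involved
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# A chunk boundary may fall inside a multi-byte character; that is fine
# as long as nothing tries to interpret the pieces separately.
socket_part = b"caf\xc3"        # ends halfway through a 2-byte character
literal_part = b"\xa9 qwerty"   # starts with the second half
result = join_then_decode([socket_part, literal_part])
```

Checking the result for None right after the final decode is the cheap way to find out that some chunk was never in the expected encoding.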

I had a mail with these lines in the header:

Sender: =?utf-8?Q?=E9=98=B2=E7=96=AB=E7=89=A9=E8=B5=84=E6=8A?=
  =?utf-8?Q?=A5=E5=85=B3?=
  <joe@example.com>

The first and second lines contain encoded-words - the items starting with =? and ending with ?= . Now, I have to decode these, so I had a method to do that. The Q indicates that the meat is encoded as quoted-printable, so that =E9 is one byte with value &hE9, etc. So, we have a stream of bytes in UTF-8. Each character is 3 bytes long, and they are in fact Chinese characters.

The problem here is that one byte that should be in the first encoded-word, on the first line, has been moved to the second one (=A5), probably to keep within a certain line-length. This is invalid, since the first encoded-word now ends with an incomplete UTF-8 byte sequence (=E6=8A without its third byte). Each encoded-word has its own encoding definition, with both being UTF-8 in this example.

My original method was treating each encoded-word entirely separately, and thus thinking that it had a number of invalid characters in the overall string. My new method handles this by joining the decoded bytes first, but has to assume that the encoding of the first encoded-word applies to them all.
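The accumulate-then-decode approach described above can be sketched roughly as follows. This is Python, not Xojo, and it is not the poster's actual method: the function name decode_adjacent_words and the regular expression are assumptions made for this illustration, and anything outside the encoded-words (such as the address in angle brackets) is simply ignored.

```python
import base64
import quopri
import re

# Matches one encoded-word: =?charset?Q-or-B?payload?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([QqBb])\?([^?]*)\?=")


def decode_adjacent_words(header_value):
    """Decode a run of encoded-words as ONE byte stream.

    Assumption (same as in the post): the charset named in the first
    encoded-word applies to all the others.
    """
    raw = bytearray()
    charset = None
    for match in ENCODED_WORD.finditer(header_value):
        word_charset, kind, payload = match.groups()
        if charset is None:
            charset = word_charset
        if kind.upper() == "Q":
            # header=True also turns "_" into a space, as the Q encoding requires
            raw += quopri.decodestring(payload.encode("ascii"), header=True)
        else:
            raw += base64.b64decode(payload)
    # Only now, with all bytes collected, are they interpreted as text.
    return bytes(raw).decode(charset or "ascii", errors="replace")


sender = (
    "=?utf-8?Q?=E9=98=B2=E7=96=AB=E7=89=A9=E8=B5=84=E6=8A?=\n"
    "  =?utf-8?Q?=A5=E5=85=B3?="
)
decoded = decode_adjacent_words(sender)
```

Decoding the first encoded-word on its own would fail at the trailing =E6=8A, because the byte that completes that character (=A5) only arrives with the second word.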

This is an example of the ways email clients can screw up; in this case the sending client is doing line-length logic in the wrong place.

Anyway I tried it and so far it works fine - thanks.

ah so in this case you need to grab all the individual bytes from those 2 lines of the “sender” and only when they are all accumulated then define the encoding

makes sense

@beatrixwillius is probably VERY familiar with all this stuff :slight_smile:

Yup. I have a full email parser for such lovelies. The whitespace in a header needs to be treated gently. And then you can parse the data. Here is the beginning of my code to remove whitespace:

' hasHeader is assumed to hold the raw, still-folded header text
Dim HeaderString As String = hasHeader
Dim theRegex As RegEx

theRegex = New RegEx
theRegex.Options.ReplaceAllMatches = True
theRegex.Options.Greedy = True

' remove folded white space: whitespace, line break(s), whitespace becomes one space
theRegex.SearchPattern = "(\s|\t)+(\r|\n)+(\s|\t)+"
theRegex.ReplacementPattern = " "
HeaderString = theRegex.Replace(HeaderString)
' then catch line breaks that had no whitespace in front of them
theRegex.SearchPattern = "(\r|\n)+(\s|\t)+"
theRegex.ReplacementPattern = " "
HeaderString = theRegex.Replace(HeaderString)