String Interning -- an oft-overlooked way to reduce the memory footprint of cached data

A caching strategy I have used to great effect is to load tables of static or at least seldom-changed data into memory rather than look things up in the database on-demand. On modern hardware, preloading even tens of thousands of rows in memory can be a great performance tradeoff. Obviously, though, you want to minimize the memory footprint.

Any time cached data has a lot of repeating string values, a simple strategy that can save a LOT of memory is to only load each unique string value once. This is easily accomplished via string interning.

String interning is a means of keeping just one instance of each unique string value in memory. An easily understood example is names and addresses. The country and region/state/province code, and to a slightly lesser extent, postal codes and city names, will not have many unique values relative to the total number of records you might load into memory, for any non-trivial number of records. For example, if you are only loading US data, there will be just over 50 unique state / territory codes (50 states plus Puerto Rico, US Virgin Islands, Guam, etc). If you have 20,000 names and addresses in memory, interning these repetitive values saves a LOT of memory.

And doing it is drop-dead simple.

// Assuming all rows have a 2-character state code, and the following code is
// called for each row, this will load 4 bytes times the number of rows loaded 
// (.NET strings are UTF-16).
Rules.Region = dbRegionValue;

// .... but this will store 4 bytes times the number of unique values,
// such that whenever the Region property of each instance of the Rules
// class has a particular value, say, "NY", it will point to the SAME instance
// in memory as all the other instances with the value "NY".
Rules.Region = string.Intern(dbRegionValue); // 99% memory savings!

The trade off? It will load a tad slower, because each value is tracked in a string interning pool, which is basically a hash table, and there’s a lookup to see if a string already exists or not. The system takes care of making sure the string is a singleton; if the string already exists, the existing instance is returned.

In practice, I haven’t really seen a subjective difference in speed with or without interning, even though the native interning implementation isn’t particularly fast unless you’re running as a native AOT app. However if this turns out to be an issue, you can accomplish the same thing as native interning with a [Concurrent]Dictionary<string,string> implementation … here’s a URL discussing the relevant benchmarks:

However, I think this would be premature optimization for my use case. I note a hesitation on startup of 2 or 3 seconds … I could cut that down to a fraction of a second, but these are system processes and console apps; it’s not worth the extra effort.

Note that the .NET runtime normally just interns string constants in your program, as a way to minimize the memory footprint of the assembly. This is a particularly effective use of interning because it’s all upside: the interning overhead happens at compile time, and the hash table is “baked in” to the assembly.

You would not intern dynamically manipulated strings, such as strings concatenated or created with StringBuilders, as there’d be no point. You would have the hashtable lookup without the memory savings.

7 Likes

Incidentally I forgot to mention that the overhead of each string instance is not just the char array for the string value but the length property, the pointer to the char array itself and other object overhead. So when you avoid having another instance of, e.g., “NY” you are saving more than the 4 bytes used to store those 2 chars. Obviously this is particularly dramatic with shorter strings, less consequential with big ones.

Uh… isn’t that how a String Variable was stored originally (not today, but like VB3 or before)
and itsn’t that a Xojo C-String today?

Xojo does this as far as I recall
Two variable with the same contents will refer to the same “string”
But once you alter either they get split up & refer to their own instance
Would have to poke around in C with Xojo strings to be 100% certain but that was my understanding of them way back when

I looked into this when I was getting my bearings with Xojo and got that impression also, but it’s not like the docs are clear on the issue.

In .NET, for any two interned strings, value and reference equality are the same. For any two non-interned strings, they are likely different. This also raises the possibility of comparing two strings by reference rather than by value, if you know they’re both interned; reference comparisons are much faster than value comparisons since they don’t have to examine each character (worst case, all of them, if no differences are found).

I think the design philosophy behind .NET strings was that they would be immutable / thread safe in all circumstances, therefore, by default creating a string creates a new instance without regard to value.

many of the same design considerations for Xojo strings