In this post I will show how to do a memory-efficient string compression. C# and .NET in general has a managed memory model which means that the memory cannot be accessed directly. In some cases, such as this one it might be too restrictive. Fortunately .NET enables, for many reasons, direct memory access like in non-managed runtimes such as C and C++. While compression is a good example for the technique presented, any byte oriented processing of strings may benefit from it.

I will use .NET’s GZipStream for the example, but any compression stream can be used (I have actually used the 7ZIP LZMA library with it). Before we get to the efficient implementation lets check the naive approach:

byte[] CompressStringNaive(string str, Encoding encoding) {   //Get the string's bytes byte[] strBytes = encoding.GetBytes(str);   //Wrap the string's bytes with a memory stream MemoryStream memoryStream = new MemoryStream(strBytes);   //Init compression GZipStream compressionStream = new GZipStream(memoryStream, CompressionLevel.Fastest);   //Copy to a memory stream MemoryStream compressedBytesStream = new MemoryStream(); compressionStream.CopyTo(compressedBytesStream);   //Voila! the compressed bytes byte[] compressedBytes = compressedBytesStream.ToArray();   return compressedBytes; }

We have two copies of the string bytes, once in the string and once in the byte array. This is a waste of memory. .NET encodes its strings with UTF16. The following will compress the string’s UTF16 bytes directly from memory:

unsafe byte[] CompressString(string str) { fixed (char* strPtr = str) { long length = Encoding.Unicode.GetByteCount(strPtr, str.Length); //Get stream of string bytes byte* bytePtr = (byte*)strPtr; var strStream = new UnmanagedMemoryStream(bytePtr, length);   MemoryStream outStream = new MemoryStream((int)(length / 50));   //Compress GZipStream compressionStream = new GZipStream(strStream, CompressionLevel.Fastest); compressionStream.CopyTo(outStream);   //Close str stream strStream.Close();   //To Array outStream.Seek(0, SeekOrigin.Begin); return outStream.ToArray(); } }

Notice that the method is marked as unsafe, this tells .NET that the method is using unsafe (unmanaged operations) and it is necessary for the code to compile. The fixed keyword tell the garbage collector not to move the string around the memory because we are accessing it by address. UnmanagedMemoryStream enables us to view the string bytes as a stream very similar to the more common MemoryStream. Notice that we are also initializing the outStream so it will at least be initialized with 2% of the original size – this can be configurable and tuned. It prevents reallocations while compressing.

We lost control of the encoding and that’s a shame. We can wrap the UnmanagedMemoryStream with an encoding conversion stream (I should cover the implemantation of such a thing in a future post – stay tuned). In any case we have cut the memory requirements by half! This will have an impact on both time and space (think of LOH collections for long strings)

Happy Compression

EDIT
I got some constructive comments on HN (here) suggesting that I’ll explain the benefits of the second version of the code in more depth (thank you tkellogg for your constructive critisism). So I will! Well, two points which arose in the comments are (1) why are you using unsafe code? it is not a recommended practice (2) Why should we bother with unsafe code and pointers at all, the byte array is going to be collected very quickly either way. Both of these are excellent points. First, we must understand that compression is a thing we do on large strings and not on a few KBs. Saying that, you should recall that every object larger than 85K does not go into the first generation of the .NET’s memory heaps – it allocated in a separate heap called the Large Object Heap (LOH). Why you ask? because if an objects survives a first generation collection, it moves to the second and then to the third. These moves are very expensive and should be avoided for large objects, even by the garbage collector. The LOH collections are expensive in both time (only in .NET 4.5 you have a server GC with background mode) that is you application halts and in CPU (check the % time in GC counter with long strings and high throughput – it is nasty). So the case for compression is large objects and large objects are expensive to collect so we should try to avoid their allocation in the first place. The improved version of the code does just that and it actually uses less memory which is always nice. Secondly, unsafe code is not a good practice primarily because the generations shifts I mentioned earlier are prevented and defragmentation of the heaps (something the GC is doing as well) is also interrupted – this is the result of fixed keyword used. In our case however, these operations does not happen so often as they are expensive on the LOH and are actually less frequent because we are keep the LOH smaller – all in all a good practice. Also if you check out the implementation of Encoding.GetByteCount or Encoding.GetBytes with a decomplier (such as dotPeek which I blogged about before) you will see that the same use of pointers – so that’s MS code for you. I hope that helps to understand the rationalizations further.