Stream Classes For Encoding Binary Data
Email is a text-based mechanism and cannot contain raw binary data. However, email can contain MIME attachments, which are binary attachments encoded to a text format. Encoding expands the size of the attachment because the printable character set is much smaller than the 256 possible values that an 8-bit byte can hold. Some encoding schemes use a variable number of bytes; others use a fixed number. For example, base64 and uuencoding convert 3 bytes of binary data into 4 octets of printable characters, whereas Yenc encodes one byte to one octet for most values and uses 'escaped' octets for the rest. Some encoding schemes produce line-delimited data; others just produce a block of data.
The .NET framework provides code to encode and decode base64. The Convert class provides ToBase64String to convert a byte array to a base64 encoded string and FromBase64String to convert a string back to a binary byte array. The framework also provides the FromBase64Transform and ToBase64Transform classes, but these merely use Convert.
There are many problems with the Convert class's implementation. For a start, the Convert methods are static, which means that they cannot cache data for a subsequent call. As a result, only whole packets of data (3 bytes for raw data, 4 characters for encoded data) can be converted at a time. If you read binary data from a stream, you must make sure that you read an exact multiple of three bytes and pass this to ToBase64String. If you pass a count that is not a multiple of three, the routine pads the data with zeros before converting it. The base64 routine treats each three-byte input block as a 24-bit number that it splits into four 6-bit groups. Each group is then converted to a character. The groups that contain bits entirely from the padding are converted to an = character. This character can only appear at the end of the data (although the end of the data will not have an = character when the input length is an exact multiple of three).
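The block arithmetic described above can be sketched in a few lines. This is only an illustration of the 24-bit split for one complete block; the class and method names (Base64BlockDemo, EncodeBlock) are made up for this example and are not part of the framework or of the library described here.

```csharp
using System;

public static class Base64BlockDemo
{
    // The standard base64 alphabet: index 0..63 maps to one printable character.
    private const string Alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Encodes exactly one complete three-byte block into four characters.
    // (Hypothetical helper for illustration; it does not handle padding.)
    public static string EncodeBlock(byte b0, byte b1, byte b2)
    {
        // Join the three bytes into a single 24-bit number.
        int block = (b0 << 16) | (b1 << 8) | b2;

        // Split the number into four 6-bit groups, most significant first.
        char[] result = new char[4];
        result[0] = Alphabet[(block >> 18) & 0x3F];
        result[1] = Alphabet[(block >> 12) & 0x3F];
        result[2] = Alphabet[(block >> 6) & 0x3F];
        result[3] = Alphabet[block & 0x3F];
        return new string(result);
    }
}
```

For the bytes 0x4D, 0x61, 0x6E (the ASCII text "Man"), EncodeBlock gives TWFu, which is also what Convert.ToBase64String returns. Passing only the first two bytes to Convert.ToBase64String gives TWE=, and passing only the first byte gives TQ==, which shows the padding character standing in for the groups that hold no real data.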
If you are converting data from a stream you may be tempted to read in blocks from the stream, pass them to ToBase64String and then add together the output strings. However, if you are part of the way through a data stream and are not stringent about passing an exact multiple of three bytes to ToBase64String, you will end up with = padding characters in the middle of the combined string, which makes it invalid. Furthermore, you will have the overhead of concatenating strings. Correspondingly, if you read encoded data you must read an exact multiple of four characters and pass this to FromBase64String. If you pass another count then an exception will be thrown.
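To make the chunking problem concrete, here is a minimal sketch that encodes a stream one chunk at a time with Convert.ToBase64String. The names ChunkedBase64, EncodeInChunks and FillBuffer are invented for this example. The output is only valid base64 when chunkSize is a multiple of three.

```csharp
using System;
using System.IO;
using System.Text;

public static class ChunkedBase64
{
    // Reads chunkSize bytes at a time and appends each encoded chunk.
    // (Hypothetical helper; correct only if chunkSize is a multiple of three.)
    public static string EncodeInChunks(Stream input, int chunkSize)
    {
        StringBuilder output = new StringBuilder();
        byte[] buffer = new byte[chunkSize];
        int filled;
        while ((filled = FillBuffer(input, buffer)) > 0)
        {
            // Every call pads its own output, so a short chunk in the
            // middle of the stream leaves '=' characters mid-string.
            output.Append(Convert.ToBase64String(buffer, 0, filled));
        }
        return output.ToString();
    }

    // Stream.Read may return fewer bytes than requested, so keep reading
    // until the buffer is full or the stream ends.
    private static int FillBuffer(Stream input, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = input.Read(buffer, total, buffer.Length - total);
            if (read == 0) break;
            total += read;
        }
        return total;
    }
}
```

With the six ASCII bytes of "ManMan" as input, a chunk size of 3 produces TWFuTWFu, the same as encoding the whole array at once. A chunk size of 4 produces TWFuTQ==YW4=, which has padding in the middle and is a different (and invalid) string.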
This brings me to another problem with the methods in Convert. RFC 2045, which defines MIME attachments, states that base64 data should be represented as lines of no more than 76 characters. FromBase64String will ignore whitespace and so can convert a MIME attachment that includes newlines, but ToBase64String does not allow you to specify that you want the data split into lines. If you want base64 data split over lines you have to call ToBase64String and split the lines yourself. Again, this involves creating additional strings.
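Splitting the lines yourself looks something like the following sketch. The names Base64Lines and ToBase64Lines are hypothetical, and the CRLF line ending is an assumption based on the usual MIME convention. Note the extra StringBuilder and the second string that this approach needs.

```csharp
using System;
using System.Text;

public static class Base64Lines
{
    // Encodes the whole array, then copies it into lines of at most
    // lineLen characters, each terminated with CRLF.
    // (Hypothetical helper for illustration.)
    public static string ToBase64Lines(byte[] data, int lineLen)
    {
        if (lineLen <= 0) throw new ArgumentOutOfRangeException("lineLen");

        string encoded = Convert.ToBase64String(data);   // first string
        StringBuilder lines = new StringBuilder();
        for (int i = 0; i < encoded.Length; i += lineLen)
        {
            int len = Math.Min(lineLen, encoded.Length - i);
            lines.Append(encoded, i, len);
            lines.Append("\r\n");
        }
        return lines.ToString();                         // second string
    }
}
```

Calling ToBase64Lines(data, 76) gives output that Convert.FromBase64String can read back directly, because the decoder skips the line breaks.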
The FromBase64Transform and ToBase64Transform classes have instance methods, so it is possible for them to cache partial blocks of data. However, both of these classes use the methods in Convert to convert to and from base64 and hence inherit their problems. FromBase64Transform strips out whitespace from the encoded stream, which is a waste of processing time because Convert.FromBase64CharArray will ignore whitespace anyway. ToBase64Transform does not have the facility to split the resulting encoded data into lines.
Closer inspection of the Convert methods shows other issues, and these occur largely because of code reuse. The first issue is that the transform methods of these classes perform lots of allocations. Now, I know that in .NET memory allocation is cheap, but it is more expensive than not doing it at all. In some cases memory allocations can occur even when the allocated arrays will not be used. A further issue is that FromBase64CharArray takes an array of Char and ToBase64CharArray fills a Char array (doh!) but the methods that use them (TransformBlock and TransformFinalBlock on the FromBase64Transform and ToBase64Transform classes) handle byte arrays. This means that there will always be a call to Encoding.ASCII to convert between these two array types. This involves more array allocation and more iterations through the values in the various input buffers: yet more CPU cycles are burned.
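The extra hop between Char and byte arrays can be illustrated as follows. This is a sketch of the pattern, not the framework's internal code, and the names CharByteHop and EncodeToAsciiBytes are invented for the example.

```csharp
using System;
using System.Text;

public static class CharByteHop
{
    // Produces base64 as bytes by going through an intermediate Char array.
    // (Hypothetical helper showing the two buffers and two passes involved.)
    public static byte[] EncodeToAsciiBytes(byte[] input)
    {
        // Four output characters for every (possibly padded) three-byte block.
        char[] chars = new char[((input.Length + 2) / 3) * 4];
        int written = Convert.ToBase64CharArray(input, 0, input.Length, chars, 0);

        // A second allocation and a second pass over the data,
        // only to change the element type from Char to byte.
        return Encoding.ASCII.GetBytes(chars, 0, written);
    }
}
```

An encoder that writes bytes directly would need neither the Char array nor the ASCII conversion.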
In addition, since so many temporary buffers are used, a lot of copying must occur between all of these buffers. The library code does make a concession to optimisation here because instead of using the generic Array.Copy routine the library methods use the Buffer class. Array.Copy and Buffer.BlockCopy are internalcall, which means that they are implemented in unmanaged C++, and essentially involve a call to memmove.
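For reference, the two copy routines are called like this. The wrapper names (CopyDemo, CopyWithArrayCopy, CopyWithBlockCopy) are hypothetical. For byte arrays the results are identical; the difference is that Array.Copy counts elements while Buffer.BlockCopy counts bytes.

```csharp
using System;

public static class CopyDemo
{
    // Copies count elements starting at offset using the generic routine.
    public static byte[] CopyWithArrayCopy(byte[] source, int offset, int count)
    {
        byte[] result = new byte[count];
        Array.Copy(source, offset, result, 0, count);        // offsets and count in elements
        return result;
    }

    // Copies count bytes starting at offset using the Buffer class.
    public static byte[] CopyWithBlockCopy(byte[] source, int offset, int count)
    {
        byte[] result = new byte[count];
        Buffer.BlockCopy(source, offset, result, 0, count);  // offsets and count in bytes
        return result;
    }
}
```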
Version 1.1 of the framework only has code to convert base64 streams, and although this is a popular encoding scheme, it is not the only one in use. I decided that I would create classes that would do base64 encoding, uuencoding and Yenc encoding.
I wanted to fix all of these problems. I reasoned that the data to be
converted would be made available through a stream (for example a
NetworkStream or a FileStream) so it made sense for me to
write my classes as stream classes. The framework's CryptoStream
class is interesting because instances are based on another stream instance, in
effect chaining the streams. I liked this paradigm and decided that I would make
my own stream classes work this way. I wanted to make these streams unbuffered.
The reason is that it should be the developer's choice whether buffering is
used, and in any case, FileStream contains buffering, and the
winsock implementation that is wrapped by NetworkStream also has
buffering, so any buffering in my classes would be unnecessary.
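The chaining paradigm can be seen with the framework's own classes. The sketch below wraps a MemoryStream in a CryptoStream that uses ToBase64Transform, so that bytes written to the outer stream arrive encoded in the inner one. The names ChainedStreamDemo and EncodeThroughChain are invented for the example; the stream and transform classes are the framework's.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

public static class ChainedStreamDemo
{
    // Writes data through a CryptoStream chained onto a MemoryStream
    // and returns the base64 text that reached the inner stream.
    // (Hypothetical helper for illustration.)
    public static string EncodeThroughChain(byte[] data)
    {
        using (MemoryStream sink = new MemoryStream())
        {
            using (CryptoStream encoder =
                new CryptoStream(sink, new ToBase64Transform(), CryptoStreamMode.Write))
            {
                encoder.Write(data, 0, data.Length);
                // Flushes the last partial block, adding any '=' padding.
                encoder.FlushFinalBlock();
            }
            // ToArray still works after the MemoryStream has been closed.
            return Encoding.ASCII.GetString(sink.ToArray());
        }
    }
}
```

The caller only sees a Stream, so the same code works whether the inner stream is a MemoryStream, a FileStream or a NetworkStream.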
The download for this article is a library that contains the four classes shown here:
| Class | Description |
|---|---|
| EncodedStream | Abstract base class containing the common code for all the classes. |
| Base64Stream | Allows you to encode and decode base64 data in a stream. You can tell the class to split the output data into lines. |
| UUStream | Standard Unix file encoding. |
| YencStream | The Yenc encoding. This supports data being split over multiple streams. |
The Base64Stream class has the following constructors:
public Base64Stream(Stream stream);
public Base64Stream(Stream stream, bool read);
public Base64Stream(Stream stream, int lineLen);
The first constructor can create a read or write stream. The first time you
access the stream you determine what type of stream it is. If you call a read
method (Read or ReadByte) then the stream will be a
read-only stream and any attempt to write to it will throw an exception. If the
first call you make is to a write method (Write or WriteByte)
then the stream will be a write-only stream. The second constructor has a
Boolean which you can use to indicate whether the stream is read or write. The
final constructor is a write-only stream that will split the output over lines.
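To show the general idea of a write-only encoding stream that remembers partial blocks between calls and splits its output into lines, here is a minimal sketch. It is not the library's Base64Stream: the class name SketchBase64Writer and the FinishEncoding method are invented for this example, and for brevity it calls Convert.ToBase64String on each block, which is exactly the sort of allocation a production class would avoid.

```csharp
using System;
using System.IO;

// Hypothetical sketch only; not the Base64Stream class from the library.
public class SketchBase64Writer : Stream
{
    private readonly Stream inner;                  // the chained output stream
    private readonly int lineLen;                   // 0 means no line splitting
    private readonly byte[] pending = new byte[3];  // bytes carried over between writes
    private int pendingCount;
    private int column;                             // characters on the current line

    public SketchBase64Writer(Stream inner, int lineLen)
    {
        this.inner = inner;
        this.lineLen = lineLen;
    }

    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { inner.Flush(); }
    public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }

    public override void Write(byte[] buffer, int offset, int count)
    {
        for (int i = 0; i < count; i++)
        {
            pending[pendingCount++] = buffer[offset + i];
            if (pendingCount == 3) EmitPending();   // only whole blocks are encoded here
        }
    }

    // Call once after the last Write to emit the padded final block, if any.
    public void FinishEncoding()
    {
        if (pendingCount > 0) EmitPending();
    }

    private void EmitPending()
    {
        string text = Convert.ToBase64String(pending, 0, pendingCount);
        pendingCount = 0;
        foreach (char c in text)
        {
            if (lineLen > 0 && column == lineLen)
            {
                inner.WriteByte((byte)'\r');
                inner.WriteByte((byte)'\n');
                column = 0;
            }
            inner.WriteByte((byte)c);
            column++;
        }
    }
}
```

Because the pending bytes survive between calls, writing two bytes and then two more gives the same output as writing all four at once. With a line length of 4, the bytes 0x4D, 0x61, 0x6E, 0x4D come out as TWFu, a line break, then TQ==.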
The UUStream class performs standard Unix encoding and decoding.
It has the following constructors:
public UUStream(Stream stream);
public UUStream(Stream stream, string name, string mode);
The first constructor is read-only. The stream will extract the header information and provide that through two read-only string properties called FileName and Mode. The second constructor is write-only and the user provides the name of the file and mode that will be placed at the beginning of the output stream.
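For readers unfamiliar with the format, the following sketch shows the traditional uuencode layout that the name and mode feed into: a header line of the form "begin mode name", followed by data lines that each start with a length character. It is an illustration of the general format, not the library's code. The names UUSketch, Header, EncodeLine and EncodeSixBits are invented, and writing a zero group as a backquote rather than a space is one common convention that the library may or may not follow.

```csharp
using System;
using System.Text;

public static class UUSketch
{
    // Maps a 6-bit value to a printable character by adding 32.
    // Zero is written as '`' here (an assumed convention; some encoders use a space).
    private static char EncodeSixBits(int value)
    {
        return value == 0 ? '`' : (char)(32 + value);
    }

    // Builds the header line that precedes the encoded data.
    public static string Header(string mode, string name)
    {
        return "begin " + mode + " " + name;
    }

    // Encodes up to 45 bytes as one line: a length character followed by
    // four characters for every three input bytes (zero padded at the end).
    public static string EncodeLine(byte[] data, int offset, int count)
    {
        if (count < 1 || count > 45) throw new ArgumentOutOfRangeException("count");

        StringBuilder line = new StringBuilder();
        line.Append((char)(32 + count));
        for (int i = 0; i < count; i += 3)
        {
            int b0 = data[offset + i];
            int b1 = i + 1 < count ? data[offset + i + 1] : 0;
            int b2 = i + 2 < count ? data[offset + i + 2] : 0;
            line.Append(EncodeSixBits(b0 >> 2));
            line.Append(EncodeSixBits(((b0 & 0x03) << 4) | (b1 >> 4)));
            line.Append(EncodeSixBits(((b1 & 0x0F) << 2) | (b2 >> 6)));
            line.Append(EncodeSixBits(b2 & 0x3F));
        }
        return line.ToString();
    }
}
```

For example, the three ASCII bytes of "Cat" encode to the line #0V%T, where # is the length character for three bytes. A complete file would also end with a zero-length line and the word "end", which this sketch leaves out.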
Finally, the YencStream class performs Yenc encoding. This
mechanism allows you to convert a binary stream to one or more output streams.
This is reflected in the constructors:
public YencStream(Stream stream, int byteCount, uint crc);
public YencStream(Stream stream, string name, int size, int part,
int pbegin, int totalSize, int totalParts, uint crc);
The first constructor creates a write-only stream that outputs a single part. The size of the input data and the name of the file are parameters because they have to be written to the header of the output stream. The second constructor creates a read-only stream that can be part of a multi-part set of data. The name of the file is read from the stream and made available through the Name read-only string property. The byteCount parameter is a count of the data in this part; the Size read-only property gives the total size of all the data in all the parts of the file. The final constructor creates a write-only stream that can take multiple parts. You need to create a YencStream for each part that you want to create (and therefore it is your responsibility to calculate the size of each part). The name of the file is passed in the name parameter, the size of the entire file (i.e. the size of all the parts) is passed in totalSize, and the number of parts is passed in totalParts. For each part you pass the size of the part in size, the part number in part, and the start position in the file in pbegin. Yenc ensures data integrity by providing cyclic redundancy checks. Each part has a CRC. If you have a multi-part file then you must provide the CRC from the previous part as an initialization parameter. When you have written data to the stream, you can get its CRC from the read-only CRC property.
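The two mechanisms mentioned here, the escaped octets and a CRC that is carried from one part to the next, can be sketched as follows. This follows the publicly documented yEnc rules (add 42 to each byte, escape NUL, LF, CR and = by writing = and adding a further 64) and the standard CRC-32 polynomial; it is not the library's code. The names YencSketch, EncodeBytes and UpdateCrc are invented, and line wrapping and the header and trailer lines are left out.

```csharp
using System;
using System.IO;

public static class YencSketch
{
    private static readonly uint[] Table = BuildTable();

    // Builds the lookup table for the standard (reflected) CRC-32 polynomial.
    private static uint[] BuildTable()
    {
        uint[] table = new uint[256];
        for (uint n = 0; n < 256; n++)
        {
            uint c = n;
            for (int k = 0; k < 8; k++)
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            table[n] = c;
        }
        return table;
    }

    // Continues a CRC-32 from a previous value; pass 0 for the first part.
    // (Hypothetical helper; how the library chains its CRC is an assumption.)
    public static uint UpdateCrc(uint previous, byte[] data)
    {
        uint crc = ~previous;
        foreach (byte b in data)
            crc = Table[(crc ^ b) & 0xFF] ^ (crc >> 8);
        return ~crc;
    }

    // Encodes bytes one to one, escaping the four critical values.
    public static byte[] EncodeBytes(byte[] data)
    {
        MemoryStream output = new MemoryStream();
        foreach (byte b in data)
        {
            byte e = (byte)((b + 42) & 0xFF);
            if (e == 0x00 || e == 0x0A || e == 0x0D || e == 0x3D)
            {
                output.WriteByte((byte)'=');        // escape marker
                e = (byte)((e + 64) & 0xFF);
            }
            output.WriteByte(e);
        }
        return output.ToArray();
    }
}
```

Feeding the CRC of one part into UpdateCrc for the next part gives the same final value as computing the CRC over all the data in one go, which is the property a multi-part writer relies on.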
The download for this page is provided as a binary file only. I do not have the time to document the source code, so I will not provide it. If you use this library you must acknowledge me in your product's documentation and in your product's About box.