Encoded Streams
Binary data can take any of 256 different values for each byte (octet). Most internet messages are based on 7 bit characters, so you cannot attach binary data directly to internet messages like email or news. There are several encoding schemes that will allow you to convert binary data to 7 bit characters. On this page I will give several stream classes that implement these encoding schemes. The advantage of using a stream class, rather than a simple method that encodes a block of data, is twofold. Firstly, the implementation is more efficient in terms of memory allocations, and secondly the stream classes presented here can be created from other stream objects, which allows you to chain streams.
Problems With The Framework Classes
On the cryptography pages of the Security Workshop I explained that there were issues with the .NET framework library classes provided to encode and decode base64. In this section I want to go into greater depth about those issues.
The main problem is that the FromBase64Transform and ToBase64Transform classes in System.Security.Cryptography use the FromBase64CharArray and ToBase64CharArray methods of the Convert class. These methods encode to a single block of characters without line breaks, and yet they will typically be used to read and write MIME data, which has a maximum line length of 76 characters. If you use either ToBase64Transform or ToBase64CharArray to convert a block of data you will get back a char array. You will then have to create another array and copy the data into it, adding a newline after every 76th character from the encoded array. The code that consumes this data will typically send it over a socket, and sockets transmit arrays of bytes, so you will have to convert the char array to a byte array. As you can see, there are several extra allocations that you have to do just to use the data from the .NET framework classes.
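To make those extra allocations concrete, here is a minimal sketch of the steps just described. It uses only Convert.ToBase64CharArray and Encoding.ASCII from the framework. The class and method names (FrameworkBase64Demo, EncodeForMime) are hypothetical and exist only for this illustration, and the choice of CRLF as the newline is an assumption based on MIME usage.

```csharp
using System;
using System.Text;

// Hypothetical illustration only; these names are not part of the library.
public static class FrameworkBase64Demo
{
    // Encodes raw bytes as base64 split into 76 character lines, ready for a socket.
    public static byte[] EncodeForMime(byte[] raw)
    {
        const int LineLength = 76;

        // Allocation 1: a char array holding the unbroken base64 text.
        int encodedLength = ((raw.Length + 2) / 3) * 4;
        char[] encoded = new char[encodedLength];
        Convert.ToBase64CharArray(raw, 0, raw.Length, encoded, 0);

        // Allocation 2: a second char array with room for CRLF after each full line.
        int lineBreaks = encodedLength / LineLength;
        char[] wrapped = new char[encodedLength + lineBreaks * 2];
        int dst = 0;
        for (int src = 0; src < encodedLength; src++)
        {
            wrapped[dst++] = encoded[src];
            if ((src + 1) % LineLength == 0)
            {
                wrapped[dst++] = '\r';
                wrapped[dst++] = '\n';
            }
        }

        // Allocation 3: a byte array, because sockets transmit bytes rather than chars.
        return Encoding.ASCII.GetBytes(wrapped);
    }
}
```

Three arrays are allocated for one block of input, and none of them can be reused for the next block without extra bookkeeping by the caller.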
In addition, the Convert methods and the transform methods on the FromBase64Transform and ToBase64Transform classes also perform many allocations of small buffers, and although allocations are cheap, they are more expensive than not allocating at all. Further, FromBase64CharArray and ToBase64CharArray act on char arrays, but the FromBase64Transform and ToBase64Transform classes act on byte arrays, so the transform methods have to convert between these two types, which again involves more allocations. Finally, to provide a stream interface the FromBase64Transform and ToBase64Transform classes must be wrapped in a CryptoStream class. All of these issues mean that the framework classes are not as efficient as they could be in terms of memory and performance. Thus I wanted to write my own stream class that addresses all of these issues.
It is also worth pointing out that base64 is just one encoding scheme. Unix has UU encoding and there is now a new encoding scheme called yEnc which is intended to be more efficient than the other two while also providing a cyclic redundancy check. The .NET framework does not provide classes for either of these encoding schemes.
EncodedStream
In the implementation I have factored the code into a base class, EncodedStream, and then provided several subclasses that extend the code to give base64, uuencode and yEnc encoding. The EncodedStream class derives from Stream and I allow this base class to do much of the work. This means that I don’t have to implement the asynchronous methods, and I don’t need to implement the single byte methods. I have chosen not to allow a stream to read and write at the same time, which means that once a stream has been used to read data (decode to raw data) the same instance will not be allowed to write data (encode raw data), and vice versa. I have also decided to disallow seeking.
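The design choices above can be sketched roughly as follows. This is not the actual EncodedStream source: OneWayStream, PassThroughStream, EnterReadMode and EnterWriteMode are hypothetical names, and the use of InvalidOperationException for a change of direction is an assumption. The sketch only shows how deriving from Stream leaves the single byte and asynchronous methods to the base class while the subclass fixes the direction and refuses to seek.

```csharp
using System;
using System.IO;

// Hypothetical sketch, not the library's EncodedStream.
public abstract class OneWayStream : Stream
{
    private enum Mode { Unset, Reading, Writing }

    private readonly Stream inner;
    private Mode mode = Mode.Unset;

    protected OneWayStream(Stream inner)
    {
        this.inner = inner ?? throw new ArgumentNullException(nameof(inner));
    }

    protected Stream Inner { get { return inner; } }

    // Seeking is never supported.
    public override bool CanSeek { get { return false; } }
    public override bool CanRead { get { return mode != Mode.Writing && inner.CanRead; } }
    public override bool CanWrite { get { return mode != Mode.Reading && inner.CanWrite; } }

    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }

    // Subclasses call these at the top of Read and Write to fix the direction.
    protected void EnterReadMode() { Lock(Mode.Reading); }
    protected void EnterWriteMode() { Lock(Mode.Writing); }

    private void Lock(Mode requested)
    {
        if (mode == Mode.Unset) mode = requested;
        else if (mode != requested)
            throw new InvalidOperationException("The stream is already used in the other direction.");
    }
}

// A trivial subclass that copies bytes unchanged, just to exercise the base class.
public sealed class PassThroughStream : OneWayStream
{
    public PassThroughStream(Stream inner) : base(inner) { }

    public override int Read(byte[] buffer, int offset, int count)
    {
        EnterReadMode();
        return Inner.Read(buffer, offset, count);
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        EnterWriteMode();
        Inner.Write(buffer, offset, count);
    }

    public override void Flush() { Inner.Flush(); }
}
```

Note that ReadByte and WriteByte are not overridden here: the default implementations in Stream call Read and Write with a one byte array, so the direction check applies to them as well.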
The EncodedStream class provides an almost unbuffered implementation. I say 'almost' because in the case of base64 and uuencoding, blocks of three bytes are converted to blocks of four characters, so a small amount of buffering has to be used to accommodate this. Further, uuencoded data is split into lines where the first character indicates the number of bytes that were converted. This means that the uuencode class must have a buffer for each line that is read or written. However, I pre-allocate these buffers and reuse them to minimize the number of memory allocations that are performed.
Base64Stream
The base64 process follows these steps:
- The data is split into groups of three 8 bit numbers, which together represent a 24 bit number. This is then split into four 6 bit numbers.
- Each 6 bit number is encoded as a printable character using Table 1.
- Each line of encoded data has no more than 76 characters.
- If there are fewer than 24 bits remaining at the end of the data, zero bits are added to the right to make up the next 6 bit number with the available data. The encoded data is then padded with = characters to make up for the missing 6 bit numbers.
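Table 1

| 6 bit value | Character |
|---|---|
| 0 | A |
| 8 | I |
| 16 | Q |
| 24 | Y |
| 32 | g |
| 40 | o |
| 48 | w |
| 56 | 4 |

The values in between follow in order: A to Z for 0 to 25, a to z for 26 to 51, 0 to 9 for 52 to 61, then + for 62 and / for 63.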
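The steps above can be sketched as a small helper that turns one group of up to three bytes into four characters. This is an illustration of the scheme rather than the code in Base64Stream: Base64Sketch, EncodeGroup and Encode are hypothetical names, and the sketch leaves out the 76 character line splitting.

```csharp
using System;

// Hypothetical sketch of the base64 steps; not the Base64Stream source.
public static class Base64Sketch
{
    // The 64 characters of Table 1, indexed by 6 bit value.
    private const string Alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Encodes 1 to 3 raw bytes into exactly 4 output characters.
    public static void EncodeGroup(byte[] input, int offset, int count, char[] output, int outOffset)
    {
        if (count < 1 || count > 3) throw new ArgumentOutOfRangeException(nameof(count));

        // Pack the available bytes into a 24 bit number; missing bytes stay as zero bits.
        int bits = input[offset] << 16;
        if (count > 1) bits |= input[offset + 1] << 8;
        if (count > 2) bits |= input[offset + 2];

        // Split into four 6 bit numbers; positions with no data at all become '='.
        output[outOffset] = Alphabet[(bits >> 18) & 0x3F];
        output[outOffset + 1] = Alphabet[(bits >> 12) & 0x3F];
        output[outOffset + 2] = count > 1 ? Alphabet[(bits >> 6) & 0x3F] : '=';
        output[outOffset + 3] = count > 2 ? Alphabet[bits & 0x3F] : '=';
    }

    // Encodes a whole array, one group at a time, without line breaks.
    public static string Encode(byte[] data)
    {
        char[] result = new char[((data.Length + 2) / 3) * 4];
        int outPos = 0;
        for (int i = 0; i < data.Length; i += 3, outPos += 4)
            EncodeGroup(data, i, Math.Min(3, data.Length - i), result, outPos);
        return new string(result);
    }
}
```

Because EncodeGroup writes into a caller supplied array, a stream can keep one output buffer and reuse it for every group.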
This means that when encoding data a minimum of 3 bytes is required, so the stream maintains a 3 byte buffer which is filled by the Write and WriteByte methods. Once this buffer is filled the conversion to base64 occurs. The exception is when the input does not contain a multiple of 3 bytes, in which case the last block of data to be converted must be padded with zeros. This last block is converted when the stream is flushed, which occurs when the stream is closed or when the Flush method is called.
When decoding data the input must be a multiple of 4 characters long, so no padding need occur. In this case the stream maintains a three byte buffer for the raw data already converted from base64, and when all items in this buffer have been read it will read 4 bytes from the input data (base64 always encodes to 7-bit ASCII characters, so each character is one byte) and convert them to raw data.
The Base64Stream class defines the following
constructors:
public Base64Stream(Stream stream);
public Base64Stream(Stream stream, bool read);
public Base64Stream(Stream stream, int lineLen);
The object must be based on another stream, which is used to read the encoded data to be decoded or to write the data once it has been encoded. The first constructor can be used for a read-only stream or a write-only stream: the first time Read or Write is called an internal flag is set, and from then on the mode of the stream is fixed. You can specifically indicate whether the stream is read-only or write-only with the second constructor; a read-only stream has the second parameter set to true. The final constructor is only used for write-only streams and it specifies the line length of the encoded data that is written. If you use either of the first two constructors to create a write-only stream the data will not be split into lines. If you use the third constructor and pass a value of 0 as the line length then the default RFC 2045 line length of 76 will be used.
UUStream
The Unix UU encoding differs from base64 in a couple of ways. The most
obvious way is that it has a prolog and an epilog. The prolog is begin
<mode> <filename> where <mode> is the Unix file
permissions for the attached file and <filename> is the name of
the file. The epilog is just the text end on a line by itself
followed by an empty line. In addition, each line starts with a character that
specifies the number of bytes that have been encoded on the line. (Note
that this is the number of raw bytes, not the number of the encoded
characters.) The rest of the encoding is similar to base64 except that
the encoding uses a base character of 0x20 rather than A
used in base64. Here are the steps:
- The data is split into groups of three 8 bit numbers, which together represent a 24 bit number. This is then split into four 6 bit numbers.
- Each 6 bit number is encoded as a printable character using Table 2.
- The prolog is begin <mode> <filename>, where <mode> is the Unix file permission (usually 644).
- Each line is a maximum of 62 characters (including the 0D 0A newline). Each line begins with a character that represents the number of raw bytes encoded on the line, encoded using Table 2.
- The last line of encoded data is followed by a space on a single line.
- The epilog is the text end on a single line followed by a newline.
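Table 2

| 6 bit value | Character |
|---|---|
| 0 | space |
| 8 | ( |
| 16 | 0 |
| 24 | 8 |
| 32 | @ |
| 40 | H |
| 48 | P |
| 56 | X |

Every character is the 6 bit value plus 0x20, so the values in between follow in ASCII order.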
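A single line of uuencoded data can be sketched as below. UUSketch, Enc and EncodeLine are hypothetical names, not members of UUStream, and the limit of 45 raw bytes per line is the conventional uuencode value, which is an assumption here rather than something taken from the library. The sketch follows Table 2 in encoding the value 0 as a space, and it leaves out the prolog, the epilog and the newlines.

```csharp
using System;
using System.Text;

// Hypothetical sketch of one uuencoded line; not the UUStream source.
public static class UUSketch
{
    // Assumed conventional limit on raw bytes per line.
    public const int MaxBytesPerLine = 45;

    // Maps a 6 bit value to its character: the value plus 0x20 (Table 2).
    private static char Enc(int sixBits)
    {
        return (char)(0x20 + (sixBits & 0x3F));
    }

    // Encodes 'count' raw bytes as one line, without the trailing newline.
    public static string EncodeLine(byte[] data, int offset, int count)
    {
        if (count < 0 || count > MaxBytesPerLine)
            throw new ArgumentOutOfRangeException(nameof(count));

        var line = new StringBuilder(1 + ((count + 2) / 3) * 4);

        // The first character is the number of raw bytes on the line.
        line.Append(Enc(count));

        for (int i = 0; i < count; i += 3)
        {
            // Missing bytes in the last group are treated as zero.
            int b0 = data[offset + i];
            int b1 = i + 1 < count ? data[offset + i + 1] : 0;
            int b2 = i + 2 < count ? data[offset + i + 2] : 0;
            int bits = (b0 << 16) | (b1 << 8) | b2;

            line.Append(Enc(bits >> 18));
            line.Append(Enc(bits >> 12));
            line.Append(Enc(bits >> 6));
            line.Append(Enc(bits));
        }
        return line.ToString();
    }
}
```

Encoding zero bytes gives a line holding a single space, which is the line that follows the last line of data.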
There are two constructors:
public UUStream(Stream stream);
The first constructor takes a stream, a file name and a mode. This is used to create a stream to encode data, and the file name and mode will be written to the wrapped stream in the prolog. The second constructor only takes a stream, so this represents an object that will decode data. The wrapped stream will be read to obtain the file name and mode from the prolog.
YencStream
yEnc encoding provides several facilities that are not present in the other two schemes. Bytes are not encoded in blocks, which results in smaller encodings. As the encoding is performed a CRC is calculated and this information is added to the final encoding. This means that when encoded data is decoded the code can determine if the data was corrupted. Finally, the protocol allows binary data to be encoded into several messages, and information about how many messages there are and the index of a message in the collection is stored in each message.
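As an illustration of calculating the CRC while data passes through, here is a minimal sketch of the standard table driven CRC-32 (reflected polynomial 0xEDB88320). Crc32Sketch and Update are hypothetical names, and this is not taken from the YencStream source. Update accepts the running value so that a stream could call it once per block written.

```csharp
using System;

// Hypothetical sketch of an incremental CRC-32; not the YencStream source.
public static class Crc32Sketch
{
    private static readonly uint[] Table = BuildTable();

    // Builds the 256 entry lookup table for the reflected polynomial 0xEDB88320.
    private static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint n = 0; n < 256; n++)
        {
            uint c = n;
            for (int k = 0; k < 8; k++)
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            table[n] = c;
        }
        return table;
    }

    // Start with crc = 0 and pass the previous result back in for each further block.
    public static uint Update(uint crc, byte[] buffer, int offset, int count)
    {
        uint c = crc ^ 0xFFFFFFFFu;
        for (int i = 0; i < count; i++)
            c = Table[(c ^ buffer[offset + i]) & 0xFF] ^ (c >> 8);
        return c ^ 0xFFFFFFFFu;
    }
}
```

Feeding the data in several blocks gives the same result as feeding it in one call, which is what makes this form suitable for a stream.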
The steps are shown here:
- The prolog starts with =ybegin followed by other attributes in the <name>=<value> format for the line length (line), size of the original file (size), name of the file (name), and the part number and total number of messages if the file is split over several messages (part and total).
- Each byte I is encoded to the byte O using O = (I + 42) % 256. If the result O is one of the special values 0x00, 0x0A, 0x0D or 0x3D, it is written as two bytes: = followed by (O + 64) % 256.
- Lines can be any length and are ended with 0x0D 0x0A. Escaped characters must not be split over two lines.
- The epilog starts with =yend and is followed by the size of the unencoded file (or the part of the file encoded if multiple parts are used). If the message is part of a collection of parts, the epilog has the part number. The epilog also has the CRC of the whole data (crc32) or this part of the data (pcrc32).
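The byte level step can be sketched as follows. YencSketch and EncodeBytes are hypothetical names and this is not the YencStream source. The sketch follows the published yEnc draft, in which the test for special values is applied to the result of the + 42 step and the escaped byte is that result plus 64. It leaves out the prolog, the epilog, the line splitting and the CRC.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of yEnc byte encoding; not the YencStream source.
public static class YencSketch
{
    public static byte[] EncodeBytes(byte[] data)
    {
        // Most bytes map to one output byte; leave a little room for escapes.
        var output = new List<byte>(data.Length + data.Length / 32 + 2);

        foreach (byte input in data)
        {
            byte o = (byte)((input + 42) % 256);

            // NUL, LF, CR and '=' cannot appear as data, so they are escaped.
            if (o == 0x00 || o == 0x0A || o == 0x0D || o == 0x3D)
            {
                output.Add(0x3D); // the escape character '='
                o = (byte)((o + 64) % 256);
            }
            output.Add(o);
        }
        return output.ToArray();
    }
}
```

For example, the input byte 0xD6 becomes 0x00 after adding 42, so it is written as = followed by 0x40, while the input byte 0x00 simply becomes 0x2A.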
The class implements three constructors:
public YencStream(Stream stream);
public YencStream(Stream stream, string name, int size);
public YencStream(Stream stream, string name, int size, int part, int pbegin, int totalSize, int totalParts, uint crc);
The first one has no name and hence it is used to read yEnc data, the other two provide a file name and so are used to write yEnc data.
Download
The code is provided as a library assembly and you are free to use it in your own code as long as you acknowledge me in your product's About box. (An email telling me that you are doing this would be nice!) The source file has details of the terms to use this code.
The source code for the stream, and test cases can be downloaded from here.
Errata
If you see an error on this page, please contact me and I will fix the problem.