Encoded Streams

Each byte (octet) of binary data can hold any of 256 values. Most internet messages are based on 7 bit characters, so you cannot attach raw binary data to internet messages like email or news. There are several encoding schemes that convert binary data to 7 bit characters, and on this page I present several stream classes that implement them. The advantage of using a stream class, rather than a simple method that encodes a block of data, is twofold. Firstly, the implementation is more efficient in terms of memory allocations, and secondly, the stream classes presented here can be created from other stream objects, which allows you to chain streams.

Problems With The Framework Classes

On the cryptography pages of the Security Workshop I explained that there were issues with the .NET framework library classes provided to encode and decode base64. In this section I want to go into greater depth about those issues.

The main problem is that the FromBase64Transform and ToBase64Transform classes in System.Security.Cryptography use the FromBase64CharArray and ToBase64CharArray methods of the Convert class. These methods encode to a single block of characters without line breaks, and yet they will typically be used to read and write MIME data, which has a maximum line length of 76 characters. If you use either ToBase64Transform or ToBase64CharArray to convert a block of data you will get back a char array. You then have to create another array and copy the data into it, adding a newline after every 76th character of the encoded data. The code that consumes this data will typically send it over a socket, and sockets transmit arrays of bytes, so you also have to convert the char array to a byte array. As you can see, there are several extra allocations that you have to make just to use the data from the .NET framework classes.
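To make the extra allocations concrete, here is a minimal sketch of the manual approach described above. The class and method names (ManualBase64, EncodeForMime) are my own for illustration and are not part of the framework or of the library presented on this page; only Convert.ToBase64CharArray, Array.Copy and Encoding.ASCII are framework calls.

```csharp
using System;
using System.Text;

public static class ManualBase64
{
    // Hypothetical helper: encodes data to base64, breaks it into lines of
    // lineLen characters, and returns ASCII bytes ready to send on a socket.
    public static byte[] EncodeForMime(byte[] data, int lineLen)
    {
        // Allocation 1: the unbroken block of base64 characters.
        int encodedLen = ((data.Length + 2) / 3) * 4;
        char[] encoded = new char[encodedLen];
        Convert.ToBase64CharArray(data, 0, data.Length, encoded, 0);

        // Allocation 2: a second array with room for CR LF after every line.
        int lineCount = (encodedLen + lineLen - 1) / lineLen;
        char[] withBreaks = new char[encodedLen + lineCount * 2];
        int dst = 0;
        for (int src = 0; src < encodedLen; src += lineLen)
        {
            int count = Math.Min(lineLen, encodedLen - src);
            Array.Copy(encoded, src, withBreaks, dst, count);
            dst += count;
            withBreaks[dst++] = '\r';
            withBreaks[dst++] = '\n';
        }

        // Allocation 3: sockets send bytes, so convert the chars to bytes.
        return Encoding.ASCII.GetBytes(withBreaks);
    }
}
```

Calling ManualBase64.EncodeForMime(data, 76) gives MIME-style output, but three arrays are allocated for every block of data that is encoded.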

In addition, the Convert methods and the transform methods on the FromBase64Transform and ToBase64Transform classes perform many allocations of small buffers, and although allocations are cheap, they are more expensive than not allocating at all. Further, FromBase64CharArray and ToBase64CharArray act on char arrays, but the FromBase64Transform and ToBase64Transform classes act on byte arrays, so the transform methods have to convert between these two types, which again involves more allocations. Finally, to provide a stream interface the FromBase64Transform and ToBase64Transform classes must be wrapped in a CryptoStream object. All of these issues mean that the framework classes are not as efficient as they could be in terms of memory and performance, so I wanted to write my own stream class that addresses them.
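For reference, this is roughly what the framework route looks like when a stream interface is needed. It is a sketch only: the FrameworkBase64 class and its method name are hypothetical, while CryptoStream, ToBase64Transform and CryptoStreamMode are the framework types mentioned above.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

public static class FrameworkBase64
{
    // Hypothetical helper: encodes data by wrapping ToBase64Transform
    // in a CryptoStream that writes to a MemoryStream.
    public static string EncodeViaCryptoStream(byte[] data)
    {
        using (MemoryStream target = new MemoryStream())
        {
            using (CryptoStream encoder = new CryptoStream(
                target, new ToBase64Transform(), CryptoStreamMode.Write))
            {
                encoder.Write(data, 0, data.Length);
                // Disposing the CryptoStream flushes the final, padded block.
            }
            // ToArray still works after the MemoryStream has been closed.
            return Encoding.ASCII.GetString(target.ToArray());
        }
    }
}
```

Note that the output is one unbroken run of characters, so the line breaking required for MIME still has to be done separately.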

It is also worth pointing out that base64 is just one encoding scheme. Unix has UU encoding and there is now a new encoding scheme called yEnc which is intended to be more efficient than the other two while also providing a cyclic redundancy check. The .NET framework does not provide classes for either of these encoding schemes.

EncodedStream

In the implementation I have factored the code into a base class, EncodedStream, and provided several subclasses that extend it to give base64, uuencode and yEnc encoding. The EncodedStream class derives from Stream, and I allow that base class to do much of the work. This means that I don’t have to implement the asynchronous methods or the single byte methods. I have chosen not to allow a stream to both read and write: once an instance has been used to read data (decoding to raw data) it will not be allowed to write data (encoding raw data), and vice versa. I have also decided to disallow seeking.
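The following sketch shows one way a Stream subclass can behave like this. It is not the EncodedStream source: the class name OneWayStream and its internals are assumptions made for illustration, and it simply passes data through where a real subclass would encode or decode.

```csharp
using System;
using System.IO;

// Hypothetical pass-through stream that locks itself into read or write
// mode on first use and never supports seeking.
public class OneWayStream : Stream
{
    private enum Mode { Unset, Read, Write }

    private readonly Stream inner;
    private Mode mode = Mode.Unset;

    public OneWayStream(Stream inner) { this.inner = inner; }

    public override bool CanRead { get { return mode != Mode.Write && inner.CanRead; } }
    public override bool CanWrite { get { return mode != Mode.Read && inner.CanWrite; } }
    public override bool CanSeek { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Flush() { inner.Flush(); }

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (mode == Mode.Write) throw new NotSupportedException("Stream is already in write mode.");
        mode = Mode.Read;
        return inner.Read(buffer, offset, count); // a real subclass would decode here
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        if (mode == Mode.Read) throw new NotSupportedException("Stream is already in read mode.");
        mode = Mode.Write;
        inner.Write(buffer, offset, count); // a real subclass would encode here
    }
}
```

Only Read and Write carry any logic. ReadByte, WriteByte and the asynchronous methods are inherited from Stream, whose default implementations call these two overrides.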

The EncodedStream class provides an almost unbuffered implementation. I say 'almost' because in the case of base64 and uuencoding, blocks of three bytes are converted to blocks of four characters, so a small amount of buffering is needed to accommodate this. Further, uuencoded data is split into lines where the first character indicates the number of bytes that were encoded on the line. This means that the uuencode class must have a buffer for each line that is read or written. However, I pre-allocate these buffers and reuse them to minimize the number of memory allocations that are performed.

Base64Stream

The base64 process follows these steps:

  1. The data is split into groups of three 8 bit numbers, which together represent a 24 bit number. This is then split into four 6 bit numbers.
  2. Each 6 bit number is encoded as a printable character using Table 1.
  3. Each line of encoded data has no more than 76 characters.
  4. If there are fewer than 24 bits remaining at the end of the data, zero bits are added to the right to make up the next 6 bit number with the available data. The encoded data is then padded with = characters to make up for the missing 6 bit numbers.

Table 1
 0 A   8 I  16 Q  24 Y  32 g  40 o  48 w  56 4
 1 B   9 J  17 R  25 Z  33 h  41 p  49 x  57 5
 2 C  10 K  18 S  26 a  34 i  42 q  50 y  58 6
 3 D  11 L  19 T  27 b  35 j  43 r  51 z  59 7
 4 E  12 M  20 U  28 c  36 k  44 s  52 0  60 8
 5 F  13 N  21 V  29 d  37 l  45 t  53 1  61 9
 6 G  14 O  22 W  30 e  38 m  46 u  54 2  62 +
 7 H  15 P  23 X  31 f  39 n  47 v  55 3  63 /
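Steps 1, 2 and 4 can be sketched in a few lines. This is an illustration of the scheme rather than the Base64Stream source; the names Base64Sketch, EncodeGroup and Encode are hypothetical, and line breaking (step 3) is left out to keep the example short.

```csharp
using System;
using System.Text;

public static class Base64Sketch
{
    // Table 1 as a lookup string: the index is the 6 bit value.
    private const string Alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Encodes a group of 1, 2 or 3 bytes to four characters.
    public static void EncodeGroup(byte[] src, int offset, int count, StringBuilder output)
    {
        // Missing bytes are treated as zero bits added on the right.
        int b0 = src[offset];
        int b1 = count > 1 ? src[offset + 1] : 0;
        int b2 = count > 2 ? src[offset + 2] : 0;
        int bits = (b0 << 16) | (b1 << 8) | b2; // the 24 bit number

        output.Append(Alphabet[(bits >> 18) & 0x3F]);
        output.Append(Alphabet[(bits >> 12) & 0x3F]);
        // 6 bit numbers that contain no data are replaced by '=' padding.
        output.Append(count > 1 ? Alphabet[(bits >> 6) & 0x3F] : '=');
        output.Append(count > 2 ? Alphabet[bits & 0x3F] : '=');
    }

    public static string Encode(byte[] data)
    {
        StringBuilder output = new StringBuilder();
        for (int i = 0; i < data.Length; i += 3)
            EncodeGroup(data, i, Math.Min(3, data.Length - i), output);
        return output.ToString();
    }
}
```

For example, the ASCII bytes of "Man" encode to TWFu, "Ma" encodes to TWE= and "M" encodes to TQ==.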

This means that when encoding data a minimum of 3 bytes is required, so the stream maintains a 3 byte buffer which is filled by the Write and WriteByte methods. Once this buffer is full, the conversion to base64 occurs. The exception is when the input does not contain a multiple of 3 bytes, in which case the last block of data to be converted must be padded with zeros. This last block is converted when the stream is flushed, which occurs when the stream is closed or when the Flush method is called.

When decoding, the input must be a multiple of 4 characters, so no padding need occur. In this case the stream maintains a 3 byte buffer for the raw data already converted from base64, and when all the bytes in this buffer have been read it reads 4 more bytes from the input (base64 always encodes to 7 bit ASCII characters, so each character is one byte) and converts them to raw data.
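The decoding direction can be sketched in the same way: four characters go in and up to three raw bytes come out into a small reusable buffer. The names Base64DecodeSketch, ValueOf and DecodeGroup are hypothetical, and the sketch assumes that line breaks have already been skipped by the caller.

```csharp
using System;

public static class Base64DecodeSketch
{
    // Maps one base64 character back to its 6 bit value (Table 1).
    private static int ValueOf(char c)
    {
        if (c >= 'A' && c <= 'Z') return c - 'A';
        if (c >= 'a' && c <= 'z') return c - 'a' + 26;
        if (c >= '0' && c <= '9') return c - '0' + 52;
        if (c == '+') return 62;
        if (c == '/') return 63;
        throw new FormatException("Not a base64 character: " + c);
    }

    // Decodes four characters into the 3 byte buffer 'raw' and returns
    // how many of those bytes are real data (3, or fewer if padded).
    public static int DecodeGroup(char[] encoded, int offset, byte[] raw)
    {
        int pad = 0;
        if (encoded[offset + 3] == '=') pad++;
        if (encoded[offset + 2] == '=') pad++;

        int bits = (ValueOf(encoded[offset]) << 18) | (ValueOf(encoded[offset + 1]) << 12);
        if (pad < 2) bits |= ValueOf(encoded[offset + 2]) << 6;
        if (pad < 1) bits |= ValueOf(encoded[offset + 3]);

        raw[0] = (byte)((bits >> 16) & 0xFF);
        raw[1] = (byte)((bits >> 8) & 0xFF);
        raw[2] = (byte)(bits & 0xFF);
        return 3 - pad;
    }
}
```

Because the caller owns the 3 byte buffer, it can be allocated once and reused for every group, which is the point of the buffering strategy described above.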

The Base64Stream class defines the following constructors:

public Base64Stream(Stream stream);
public Base64Stream(Stream stream, bool read);
public Base64Stream(Stream stream, int lineLen);

The object must be based on another stream, which is either read to obtain the encoded data to be decoded, or written to with the data once it has been encoded. The first constructor can be used for a read-only stream or a write-only stream: the first time Read or Write is called an internal flag is set, and from then on the mode of the stream is fixed. You can specify explicitly whether the stream is read-only or write-only with the second constructor; a read-only stream has the second parameter set to true. The final constructor is only used for write-only streams, and it specifies the line length of the encoded data that is written. If you use either of the first two constructors to create a write-only stream, the data will not be split into lines. If you use the third constructor and pass a value of 0 as the line length, the default RFC 2045 line length of 76 will be used.

UUStream

The Unix UU encoding differs from base64 in a couple of ways. The most obvious way is that it has a prolog and an epilog. The prolog is begin <mode> <filename> where <mode> is the Unix file permissions for the attached file and <filename> is the name of the file. The epilog is just the text end on a line by itself followed by an empty line. In addition, each line starts with a character that specifies the number of bytes that have been encoded on the line. (Note that this is the number of raw bytes, not the number of the encoded characters.) The rest of the encoding is similar to base64 except that the encoding uses a base character of 0x20 rather than A used in base64. Here are the steps:

  1. The data is split into groups of three 8 bit numbers, which together represent a 24 bit number. This is then split into four 6 bit numbers.
  2. Each 6 bit number is encoded as a printable character using Table 2.
  3. The prolog is begin <mode> <filename>, where <mode> is the Unix file permission (usually 644) and <filename> is the name of the file.
  4. Each line is a maximum of 62 characters (including the 0D 0A newline). Each line begins with a character that represents the number of raw bytes encoded on the line, itself encoded using Table 2.
  5. The last line of encoded data is followed by a line containing a single space, which is a byte count of zero in Table 2.
  6. The epilog is the text end on a single line followed by a newline.

Table 2
 0 space 8 (  16 0  24 8  32 @  40 H  48 P  56 X
 1 !     9 )  17 1  25 9  33 A  41 I  49 Q  57 Y
 2 "    10 *  18 2  26 :  34 B  42 J  50 R  58 Z
 3 #    11 +  19 3  27 ;  35 C  43 K  51 S  59 [
 4 $    12 ,  20 4  28 <  36 D  44 L  52 T  60 \
 5 %    13 -  21 5  29 =  37 E  45 M  53 U  61 ]
 6 &    14 .  22 6  30 >  38 F  46 N  54 V  62 ^
 7 '    15 /  23 7  31 ?  39 G  47 O  55 W  63 _
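Table 2 is simply the 6 bit value added to the base character 0x20, so encoding one line can be sketched as follows. This is an illustration and not the UUStream source: the names UUSketch, Enc and EncodeLine are hypothetical, the caller is assumed to pass no more than 45 raw bytes per line, and the prolog, epilog and newline are left to the caller.

```csharp
using System;
using System.Text;

public static class UUSketch
{
    // Table 2: add the base character 0x20 (space) to the 6 bit value.
    private static char Enc(int sixBits)
    {
        return (char)(0x20 + (sixBits & 0x3F));
    }

    // Encodes 'count' raw bytes as one uuencoded line, without the newline.
    public static string EncodeLine(byte[] data, int offset, int count)
    {
        StringBuilder line = new StringBuilder();
        line.Append(Enc(count)); // first character: number of raw bytes on the line
        for (int i = 0; i < count; i += 3)
        {
            // Bytes missing from the final group are treated as zero.
            int b0 = data[offset + i];
            int b1 = i + 1 < count ? data[offset + i + 1] : 0;
            int b2 = i + 2 < count ? data[offset + i + 2] : 0;

            line.Append(Enc(b0 >> 2));
            line.Append(Enc((b0 << 4) | (b1 >> 4)));
            line.Append(Enc((b1 << 2) | (b2 >> 6)));
            line.Append(Enc(b2));
        }
        return line.ToString();
    }
}
```

For example, the three ASCII bytes of "Cat" encode to the line #0V%T, and a count of zero gives the single space that follows the last line of data.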

There are two constructors:

public UUStream(Stream stream, string name, string mode);
public UUStream(Stream stream);

The first constructor takes a stream, a file name and a mode. It is used to create a stream that encodes data, and the file name and mode are written to the wrapped stream in the prolog. The second constructor takes only a stream, so it creates an object that decodes data. The wrapped stream is read to obtain the file name and mode from the prolog.

YencStream

yEnc encoding provides several facilities that are not present in the other two schemes. Bytes are not encoded in blocks; most bytes are encoded to a single byte, which results in smaller encodings. As the encoding is performed a CRC is calculated and added to the final encoding, which means that when the data is decoded the code can determine whether the data was corrupted. Finally, the protocol allows binary data to be encoded into several messages, and information about how many messages there are and the index of a message in the collection is stored in each message.

The steps are shown here:

  1. The prolog starts with =ybegin followed by other attributes in the <name>=<value> format for the line length (line), size of the original file (size), name of the file (name), part and total number of messages if the file is split over several messages (part and total).
  2. Each character I is encoded to the character O using O = (I + 42) % 256. The exception is when the result O is one of the special values 0x00, 0x0A, 0x0D or 0x3D, in which case it is written as two bytes: = followed by (O + 64) % 256.
  3. Lines can be any length and are ended with 0x0D 0x0A. Escaped characters must not be split over two lines.
  4. The epilog starts with =yend and is followed by the size of the unencoded file (or the part of the file encoded if multiple parts are used). If the message is part of a collection of parts, the epilog has the part number. The epilog also has the CRC of the whole data (crc32) or this part of the data (pcrc32).
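The byte encoding in step 2 and the CRC in step 4 can be sketched as follows. The sketch follows the yEnc specification, in which the test for special values and the offset of 64 both apply to the value obtained after adding 42, and it assumes that the crc32 attribute is the standard CRC-32 used by zip. The names YencSketch, EncodeBody and Crc32 are hypothetical and not taken from the YencStream source; the prolog, epilog and line breaks are omitted.

```csharp
using System;
using System.IO;

public static class YencSketch
{
    // Encodes raw bytes to yEnc body bytes (no prolog, epilog or line breaks).
    public static byte[] EncodeBody(byte[] data)
    {
        MemoryStream output = new MemoryStream();
        foreach (byte b in data)
        {
            byte o = (byte)((b + 42) % 256);
            if (o == 0x00 || o == 0x0A || o == 0x0D || o == 0x3D)
            {
                // Special value: write the escape character, then offset by 64.
                output.WriteByte((byte)'=');
                o = (byte)((o + 64) % 256);
            }
            output.WriteByte(o);
        }
        return output.ToArray();
    }

    // Standard reflected CRC-32 (polynomial 0xEDB88320), computed bit by bit
    // over the raw, unencoded data.
    public static uint Crc32(byte[] data)
    {
        uint crc = 0xFFFFFFFFu;
        foreach (byte b in data)
        {
            crc ^= b;
            for (int bit = 0; bit < 8; bit++)
                crc = (crc & 1u) != 0u ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
        }
        return ~crc;
    }
}
```

A table-driven CRC would be faster in a real stream; the bitwise loop is used here only because it is short enough to check by eye.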

The class implements three constructors:

public YencStream(Stream stream, int byteCount, uint crc);
public YencStream(Stream stream, string name, int size);
public YencStream(Stream stream, string name, int size, int part, int pbegin, int totalSize, int totalParts, uint crc);

The first constructor has no name parameter, so it is used to read yEnc data; the other two provide a file name, so they are used to write yEnc data.

Download

The code is provided as a library assembly and you are free to use it in your own code as long as you acknowledge me in your product's About box. (An email telling me that you are doing this would be nice!) The source file has details of the terms to use this code.

The source code for the stream, and test cases can be downloaded from here.

Errata

If you see an error on this page, please contact me and I will fix the problem.

This page is (c) 2007 Richard Grimes, all rights reserved