Filter Streams

InputStream and OutputStream are fairly raw classes. They allow you to read and write bytes, either singly or in groups, but that’s all. Deciding what those bytes mean—whether they’re integers or IEEE 754 floating point numbers or Unicode text—is completely up to the programmer and the code. However, there are certain data formats that are extremely common and can benefit from a solid implementation in the class library. For example, many integers passed as parts of network protocols are 32-bit big-endian integers. Much text sent over the Web is either 7-bit ASCII or 8-bit Latin-1. Many files transferred by ftp are stored in the zip format. Java provides a number of filter classes you can attach to raw streams to translate the raw bytes to and from these and other formats.

The filters come in two versions: the filter streams and the readers and writers. The filter streams still work primarily with raw data as bytes, for instance, by compressing the data or interpreting it as binary numbers. The readers and writers handle the special case of text in a variety of encodings such as UTF-8 and ISO 8859-1. Filter streams are placed on top of raw streams such as a TelnetInputStream or a FileOutputStream or other filter streams. Readers and writers can be layered on top of raw streams, filter streams, or other readers and writers. However, filter streams cannot be placed on top of a reader or a writer, so we’ll start here with filter streams and address readers and writers in the next section.

Filters are organized in a chain as shown in Figure 4-2. Each link in the chain receives data from the previous filter or stream and passes the data along to the next link in the chain. In this example, a compressed, encrypted text file arrives from the local network interface, where native code presents it to the undocumented TelnetInputStream. A BufferedInputStream buffers the data to speed up the entire process. A CipherInputStream decrypts the data. A GZIPInputStream decompresses the deciphered data. An InputStreamReader converts the decompressed data to Unicode text. Finally, the text is read into the application and processed.

Figure 4-2. The flow of data through a chain of filters
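A shortened version of such a chain can be sketched as follows. This is only an illustration under stated assumptions: it targets Java 7 or later, a ByteArrayInputStream stands in for the network source, the decryption step is left out, and the class and method names (FilterChainSketch, gzip, readAll) are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class FilterChainSketch {

  // Produces gzip-compressed UTF-8 bytes to stand in for data from the network.
  static byte[] gzip(String text) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    try (GZIPOutputStream gzout = new GZIPOutputStream(sink)) {
      gzout.write(text.getBytes(StandardCharsets.UTF_8));
    }
    return sink.toByteArray();
  }

  // Builds the chain raw -> buffer -> decompress -> decode, then drains it.
  static String readAll(InputStream raw) throws IOException {
    InputStream in = new BufferedInputStream(raw);
    in = new GZIPInputStream(in);
    try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
      StringBuilder text = new StringBuilder();
      int c;
      while ((c = reader.read()) != -1) text.append((char) c);
      return text.toString();
    }
  }

  public static void main(String[] args) throws IOException {
    InputStream raw = new ByteArrayInputStream(gzip("Hello, filters"));
    System.out.println(readAll(raw));
  }
}
```

Only the last link, the reader, is ever read from; each earlier link is touched solely by the link above it.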

Every filter output stream has the same write( ), close( ), and flush( ) methods as java.io.OutputStream. Every filter input stream has the same read( ), close( ), and available( ) methods as java.io.InputStream. In some cases, such as BufferedInputStream and BufferedOutputStream, these may be the only methods they have. The filtering is purely internal and does not expose any new public interface. However, in most cases, the filter stream adds public methods with additional purposes. Sometimes these are intended to be used in addition to the usual read( ) and write( ) methods as with the unread( ) method of PushbackInputStream. At other times, they almost completely replace the original interface. For example, it’s relatively rare to use the write( ) method of PrintStream instead of one of its print( ) and println( ) methods.

Chaining Filters Together

Filters are connected to streams by their constructor. For example, the following code fragment buffers input from the file data.txt. First a FileInputStream object fin is created by passing the name of the file as an argument to the FileInputStream constructor. Then a BufferedInputStream object bin is created by passing fin as an argument to the BufferedInputStream constructor:

FileInputStream     fin = new FileInputStream("data.txt");
BufferedInputStream bin = new BufferedInputStream(fin);

From this point forward, it’s possible to use the read( ) methods of both fin and bin to read data from the file data.txt. However, intermixing calls to different streams connected to the same source may violate several implicit contracts of the filter streams. Consequently, most of the time you should use only the last filter in the chain to do the actual reading or writing. One way to write your code so that it’s at least harder to introduce this sort of bug is to deliberately lose the reference to the underlying input stream. For example:

InputStream in = new FileInputStream("data.txt");
in = new BufferedInputStream(in);

After these two lines execute, there’s no longer any way to access the underlying file input stream, so you can’t accidentally read from it and corrupt the buffer. This example works because it’s not necessary to distinguish between the methods of InputStream and those of BufferedInputStream. BufferedInputStream is simply used polymorphically as an instance of InputStream in the first place. In those cases where it is necessary to use the additional methods of the filter stream not declared in the superclass, you may be able to construct one stream directly inside another. For example:

DataOutputStream dout = new DataOutputStream(new BufferedOutputStream( 
 new FileOutputStream("data.txt")));

Although these statements can get a little long, it’s easy to split the statement across several lines like this:

DataOutputStream dout = new DataOutputStream(
                         new BufferedOutputStream(
                          new FileOutputStream("data.txt")
                         )
                        );

There are times when you may need to use the methods of multiple filters in a chain. For instance, if you’re reading a Unicode text file, you may want to read the byte order mark in the first three bytes to determine whether the file is encoded as big-endian UCS-2, little-endian UCS-2, or UTF-8 and then select the matching Reader filter for the encoding. Or if you’re connecting to a web server, you may want to read the MIME header the server sends to find the Content-encoding and then use that content encoding to pick the right Reader filter to read the body of the response. Or perhaps you want to send floating point numbers across a network connection using a DataOutputStream and then retrieve a MessageDigest from the DigestOutputStream that the DataOutputStream is chained to. In all these cases, you do need to save and use references to each of the underlying streams. However, under no circumstances should you ever read from or write to anything other than the last filter in the chain.
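The digest case might look roughly like the sketch below. It keeps a reference to the middle filter in order to ask it for the digest afterward, but all writing still goes through the last filter. The class and method names are hypothetical, and a ByteArrayOutputStream is assumed in place of a real network connection.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class MultiFilterSketch {

  // Writes the values through the last filter only, then asks the middle
  // filter for the digest of everything that passed through it.
  static byte[] writeAndDigest(double[] values, ByteArrayOutputStream sink)
      throws IOException, NoSuchAlgorithmException {
    DigestOutputStream digester =
        new DigestOutputStream(sink, MessageDigest.getInstance("SHA-1"));
    DataOutputStream dout = new DataOutputStream(digester);
    for (double value : values) dout.writeDouble(value);
    dout.flush();
    return digester.getMessageDigest().digest();
  }

  public static void main(String[] args) throws Exception {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    byte[] digest = writeAndDigest(new double[] {1.5, 2.5}, sink);
    System.out.println(sink.size() + " bytes written, "
        + digest.length + "-byte digest");
  }
}
```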

Buffered Streams

The BufferedOutputStream class stores written data in a buffer (a protected byte array field named buf) until the buffer is full or the stream is flushed. Then it writes the data onto the underlying output stream all at once. A single write of many bytes is almost always much faster than many small writes that add up to the same thing. This is especially true of network connections because each TCP segment or UDP packet carries a finite amount of overhead, generally about 40 bytes’ worth. This means that sending 1 kilobyte of data 1 byte at a time actually requires sending roughly 41 kilobytes over the wire (about 40 bytes of overhead plus 1 byte of data in each packet), whereas sending it all at once requires sending only a little more than 1 kilobyte. Most network cards and TCP implementations provide some level of buffering themselves, so the real numbers aren’t quite this dramatic. Nonetheless, buffering network output is generally a huge performance win.

The BufferedInputStream class also has a protected byte array named buf that serves as a buffer. When one of the stream’s read( ) methods is called, it first tries to get the requested data from the buffer. Only when the buffer runs out of data does the stream read from the underlying source. At this point, it reads as much data as it can from the source into the buffer whether it needs all the data immediately or not. Data that isn’t used immediately will be available for later invocations of read( ). When reading files from a local disk, it’s almost as fast to read several hundred bytes of data from the underlying stream as it is to read one byte of data. Therefore, buffering can substantially improve performance. The gain is less obvious on network connections where the bottleneck is often the speed at which the network can deliver data rather than either the speed at which the network interface delivers data to the program or the speed at which the program runs. Nonetheless, buffering input rarely hurts and will become more important over time as network speeds increase.
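One way to see the effect is to count how often the underlying source is actually asked for data. The sketch below does that with a hypothetical CountingInputStream wrapper; the names are illustrative, and the exact counts depend on the buffer size the runtime chooses.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class BufferedReadSketch {

  // Hypothetical helper: counts how often the wrapped source is asked for data.
  static class CountingInputStream extends FilterInputStream {
    int calls = 0;

    CountingInputStream(InputStream in) { super(in); }

    @Override public int read() throws IOException {
      calls++;
      return super.read();
    }

    @Override public int read(byte[] b, int off, int len) throws IOException {
      calls++;
      return super.read(b, off, len);
    }
  }

  // Reads the stream one byte at a time until end of stream.
  static void drain(InputStream in) throws IOException {
    while (in.read() != -1) { /* discard the byte */ }
  }

  // Returns the number of read calls that reached the source.
  static int sourceReads(byte[] data, boolean buffered) throws IOException {
    CountingInputStream source =
        new CountingInputStream(new ByteArrayInputStream(data));
    drain(buffered ? new BufferedInputStream(source) : source);
    return source.calls;
  }

  public static void main(String[] args) throws IOException {
    byte[] data = new byte[1000];
    System.out.println("unbuffered: " + sourceReads(data, false));
    System.out.println("buffered:   " + sourceReads(data, true));
  }
}
```

Without the buffer, every single-byte read reaches the source; with it, the source is consulted only when the buffer runs dry.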

BufferedInputStream has two constructors, as does BufferedOutputStream:

public BufferedInputStream(InputStream in)
public BufferedInputStream(InputStream in, int bufferSize)
public BufferedOutputStream(OutputStream out)
public BufferedOutputStream(OutputStream out, int bufferSize)

The first argument is the underlying stream from which unbuffered data will be read or to which buffered data will be written. The second argument, if present, specifies the number of bytes in the buffer. Otherwise, the buffer size is set to 2,048 bytes for an input stream and 512 bytes for an output stream. The ideal size for a buffer depends on what sort of stream you’re buffering. For network connections, you want something a little larger than the typical packet size. However, this can be hard to predict and varies depending on local network connections and protocols. Faster, higher bandwidth networks tend to use larger packets, though eight kilobytes is an effective maximum packet size for UDP on most networks today, and TCP segments are often no larger than a kilobyte.

BufferedInputStream does not declare any new methods of its own; it only overrides methods from InputStream. It does support marking and resetting. The overridden methods are:

public synchronized int read(  ) throws IOException
public synchronized int read(byte[] input, int offset, int length) 
 throws IOException
public synchronized long skip(long n) throws IOException
public synchronized int available(  ) throws IOException
public synchronized void mark(int readLimit)
public synchronized void reset(  ) throws IOException
public boolean markSupported(  )

Starting in Java 1.2, the two multibyte read( ) methods attempt to completely fill the specified array or subarray of data by reading from the underlying input stream as many times as necessary. They return only when the array or subarray has been completely filled, the end of stream is reached, or the underlying stream would block on further reads. Most input streams (including buffered input streams in Java 1.1.x and earlier) do not behave like this. They read from the underlying stream or data source only once before returning.
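Marking and resetting allow a program to look ahead and then rewind. The following sketch is one possible use, not something taken from the class library: a hypothetical looksLikeGzip( ) helper checks the first two bytes for the gzip magic number (0x1f 0x8b) and then resets the stream so the caller still sees it from the beginning.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;

class MarkResetSketch {

  // Peeks at the first two bytes, then rewinds to the marked position.
  static boolean looksLikeGzip(BufferedInputStream in) throws IOException {
    in.mark(2);                  // remember this position for up to 2 bytes
    int first = in.read();
    int second = in.read();
    in.reset();                  // rewind to the marked position
    return first == 0x1f && second == 0x8b;
  }

  public static void main(String[] args) throws IOException {
    byte[] data = {0x1f, (byte) 0x8b, 8};
    BufferedInputStream in =
        new BufferedInputStream(new ByteArrayInputStream(data));
    System.out.println("gzip? " + looksLikeGzip(in));
    System.out.println("next byte after reset: " + in.read());
  }
}
```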

BufferedOutputStream also does not declare any new methods of its own. It overrides three methods from OutputStream:

public synchronized void write(int b) throws IOException
public synchronized void write(byte[] data, int offset, int length) 
 throws IOException
public synchronized void flush(  ) throws IOException

You call these methods exactly as you would for any output stream. The difference is that each write places data in the buffer rather than directly on the underlying output stream. Consequently, it is essential to flush the stream when you reach a point at which the data needs to be sent.
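The need to flush can be shown with a small sketch. It assumes a ByteArrayOutputStream as a stand-in for the underlying stream and a message shorter than the buffer; the class and method names are hypothetical.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

class BufferedFlushSketch {

  // Returns how many bytes had reached the underlying stream
  // before and after the call to flush().
  static int[] sizesBeforeAndAfterFlush(byte[] message) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    BufferedOutputStream out = new BufferedOutputStream(sink);
    out.write(message, 0, message.length);   // lands in the buffer only
    int before = sink.size();
    out.flush();                             // pushes the buffer to the sink
    int after = sink.size();
    return new int[] {before, after};
  }

  public static void main(String[] args) throws IOException {
    int[] sizes = sizesBeforeAndAfterFlush(new byte[] {1, 2, 3, 4, 5});
    System.out.println("before flush: " + sizes[0] + ", after flush: " + sizes[1]);
  }
}
```

On a network connection, the bytes that have not yet been flushed are exactly the ones the other end is still waiting for.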

PrintStream

The PrintStream class is the first filter output stream most programmers encounter because System.out is a PrintStream. However, other output streams can also be chained to print streams, using these two constructors:

public PrintStream(OutputStream out)
public PrintStream(OutputStream out, boolean autoFlush)

By default, print streams should be explicitly flushed. However, if the autoFlush argument is true, then the stream will be flushed every time a byte array or linefeed is written or a println( ) method is invoked.

As well as the usual write( ), flush( ), and close( ) methods, PrintStream has 9 overloaded print( ) methods and 10 overloaded println( ) methods:

public void print(boolean b)
public void print(char c)
public void print(int i)
public void print(long l)
public void print(float f)
public void print(double d)
public void print(char[] text)
public void print(String s)
public void print(Object o)
public void println(  )
public void println(boolean b)
public void println(char c)
public void println(int i)
public void println(long l)
public void println(float f)
public void println(double d)
public void println(char[] text)
public void println(String s)
public void println(Object o)

Each print( ) method converts its argument to a string in a semipredictable fashion and writes the string onto the underlying output stream using the default encoding. The println( ) methods do the same thing, but they also append a platform-dependent line separator character to the end of the line they write. This is a linefeed (\n) on Unix, a carriage return (\r) on the Mac, and a carriage return/linefeed pair (\r\n) on Windows.

PrintStream is evil and network programmers should avoid it like the plague

The first problem is that the output from println( ) is platform-dependent. Depending on what system runs your code, your lines may sometimes be broken with a linefeed, a carriage return, or a carriage return/linefeed pair. This doesn’t cause problems when writing to the console, but it’s a disaster for writing network clients and servers that must follow a precise protocol. Most network protocols such as HTTP specify that lines should be terminated with a carriage return/linefeed pair. Using println( ) makes it easy to write a program that works on Windows but fails on Unix and the Mac. While many servers and clients are liberal in what they accept and can handle incorrect line terminators, there are occasional exceptions. In particular, in conjunction with the bug in readLine( ) discussed shortly, a client running on a Mac that uses println( ) may hang both the server and the client. To some extent, this could be fixed by using only print( ) and ignoring println( ). However, PrintStream has other problems.

The second problem with PrintStream is that it assumes the default encoding of the platform on which it’s running. However, this encoding may not be what the server or client expects. For example, a web browser receiving XML files will expect them to be encoded in UTF-8 or raw Unicode unless the server tells it otherwise. However, a web server that uses PrintStream may well send them encoded in CP1252 from a U.S.-localized Windows system or SJIS from a Japanese-localized system, whether the client expects or understands those encodings or not. PrintStream doesn’t provide any mechanism to change the default encoding. This problem can be patched over by using the related PrintWriter class instead. But the problems continue.
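A minimal sketch of sidestepping the first two problems is shown here. It uses an OutputStreamWriter with an explicitly named encoding and writes the carriage return/linefeed pair itself instead of calling println( ). The sendLine( ) helper and the class name are hypothetical, StandardCharsets assumes Java 7 or later, and a ByteArrayOutputStream stands in for a socket's output stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class ExplicitTextSketch {

  // Sends one protocol line with a fixed terminator, independent of the
  // platform the code runs on. The encoding is fixed by the Writer passed in.
  static void sendLine(Writer out, String line) throws IOException {
    out.write(line);
    out.write("\r\n");   // explicit carriage return/linefeed pair
    out.flush();
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    Writer out = new OutputStreamWriter(sink, StandardCharsets.UTF_8);
    sendLine(out, "GET /index.html HTTP/1.0");
    System.out.println(sink.size() + " bytes sent");
  }
}
```

Unlike PrintStream, these write( ) and flush( ) calls also throw IOException when the connection fails.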

The third problem is that PrintStream eats all exceptions. This makes PrintStream suitable for simple textbook programs such as HelloWorld, since simple console output can be taught without burdening students with first learning about exception handling and all that implies. However, network connections are much less reliable than the console. Connections routinely fail because of network congestion, phone company misfeasance, remote systems crashing, and many more reasons. Network programs must be prepared to deal with unexpected interruptions in the flow of data. The way to do this is by handling exceptions. However, PrintStream catches any exceptions thrown by the underlying output stream. Notice that the declaration of the standard five OutputStream methods in PrintStream does not have the usual throws IOException declaration:

public void write(int b)
public void write(byte[] data)
public void write(byte[] data, int offset, int length)
public void flush(  )
public void close(  )

Instead, PrintStream relies on an outdated and inadequate error flag. If the underlying stream throws an exception, this internal error flag is set. The programmer is relied upon to check the value of the flag using the checkError( ) method:

public boolean checkError(  )

If programmers are to do any error checking at all on a PrintStream, they must explicitly check every call. Furthermore, once an error has occurred, there is no way to unset the flag so further errors can be detected. Nor is any additional information available about what the error was. In short, the error notification provided by PrintStream is wholly inadequate for unreliable network connections. At the end of this chapter, we’ll introduce a class that fixes all these shortcomings.
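The silent failure is easy to reproduce. In the sketch below, a hypothetical BrokenOutputStream plays the part of a failed connection by throwing an IOException on every write; the other names are illustrative as well.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintStream;

class CheckErrorSketch {

  // Hypothetical stand-in for a broken network connection: every write fails.
  static class BrokenOutputStream extends OutputStream {
    @Override public void write(int b) throws IOException {
      throw new IOException("connection reset");
    }
  }

  // Prints through a PrintStream and reports whether the error flag is set.
  static boolean printAndCheck(OutputStream target) {
    PrintStream out = new PrintStream(target);
    out.print("hello");        // no exception reaches the caller
    return out.checkError();   // flushes, then returns the error flag
  }

  public static void main(String[] args) {
    System.out.println("working stream: " + printAndCheck(new ByteArrayOutputStream()));
    System.out.println("broken stream:  " + printAndCheck(new BrokenOutputStream()));
  }
}
```

The caller learns that something went wrong only by asking, and learns nothing about what the failure was.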

PushbackInputStream

PushbackInputStream is a subclass of FilterInputStream that provides a pushback stack so that a program can “unread” bytes onto the input stream. The HTTP protocol handler in Java 1.2 uses PushbackInputStream. You might also use it when you need to check something a little way into the stream, then back up. For instance, if you were reading an XML document, you might want to read just far enough into the header to locate the encoding declaration that tells you what character set the document uses, then push all the read data back onto the input stream and start over with a reader configured for that character set.

The read( ) and available( ) methods of PushbackInputStream are invoked exactly as with normal input streams. However, they first attempt to read from the pushback buffer before reading from the underlying input stream. What this class adds is unread( ) methods that push bytes into the buffer:

public void unread(int b) throws IOException

This method pushes an unsigned byte given as an int between 0 and 255 onto the stream. Integers outside this range are truncated to this range as by a cast to byte. Assuming nothing else is pushed back onto this stream, the next read from the stream will return that byte. As multiple bytes are pushed onto the stream by repeated invocations of unread( ), they are stored in a stack and returned in a last-in, first-out order. In essence, the buffer is a stack sitting on top of an input stream. Only when the stack is empty will the underlying stream be read.

There are two more unread( ) methods that push a specified array or subarray onto the stream:

public void unread(byte[] input) throws IOException
public void unread(byte[] input, int offset, int length) throws IOException

The arrays are stacked in last-in, first-out order. However, bytes pushed from the same array will be returned in the order they appeared in the array. That is, the zeroth component of the array will be read before the first component of the array.

By default, the buffer is only one byte long, and trying to unread more than one byte throws an IOException. However, the buffer size can be changed with the second constructor as follows:

public PushbackInputStream(InputStream in)
public PushbackInputStream(InputStream in, int size)

Although PushbackInputStream and BufferedInputStream both use buffers, BufferedInputStream uses them for data read from the underlying input stream, while PushbackInputStream uses them for arbitrary data, which may or may not have been read from the stream originally. Furthermore, PushbackInputStream does not allow marking and resetting. The markSupported( ) method of PushbackInputStream returns false.
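The look-ahead use might be sketched as follows. The peek( ) helper and the class name are hypothetical, and the sketch assumes the stream was constructed with a pushback buffer at least as large as the number of bytes inspected.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

class PushbackSketch {

  // Reads up to n bytes to inspect them, then pushes them back so the next
  // read sees the stream from the same position as before.
  static String peek(PushbackInputStream in, int n) throws IOException {
    byte[] head = new byte[n];
    int count = in.read(head, 0, n);
    if (count <= 0) return "";
    in.unread(head, 0, count);   // bytes come back in their original order
    return new String(head, 0, count, StandardCharsets.US_ASCII);
  }

  public static void main(String[] args) throws IOException {
    byte[] doc = "<?xml version=\"1.0\"?>".getBytes(StandardCharsets.US_ASCII);
    PushbackInputStream in =
        new PushbackInputStream(new ByteArrayInputStream(doc), 8);
    System.out.println("starts with: " + peek(in, 5));
    System.out.println("first byte is still: " + (char) in.read());
  }
}
```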

Data Streams

The DataInputStream and DataOutputStream classes provide methods for reading and writing Java’s primitive data types and strings in a binary format. The binary formats used are primarily intended for exchanging data between two different Java programs whether through a network connection, a data file, a pipe, or some other intermediary. What a data output stream writes, a data input stream can read. However, it happens that the formats used are the same ones used for most Internet protocols that exchange binary numbers. For instance, the time protocol uses 32-bit big-endian integers, just like Java’s int data type. The controlled-load network element service uses 32-bit IEEE 754 floating point numbers, just like Java’s float data type. (This is probably correlation rather than causation. Both Java and most network protocols were designed by Unix developers, and consequently both tend to use the formats common to most Unix systems.) However, this isn’t true for all network protocols, so you should check details for any protocol you use. For instance, the Network Time Protocol (NTP) represents times as 64-bit unsigned fixed point numbers with the integer part in the first 32 bits and the fraction part in the last 32 bits. This doesn’t match any primitive data type in any common programming language, though it is fairly straightforward to work with, at least as far as is necessary for NTP.

The DataOutputStream class offers these 11 methods for writing particular Java data types:

public final void writeBoolean(boolean b) throws IOException
public final void writeByte(int b) throws IOException
public final void writeShort(int s) throws IOException
public final void writeChar(int c) throws IOException
public final void writeInt(int i) throws IOException
public final void writeLong(long l) throws IOException
public final void writeFloat(float f) throws IOException
public final void writeDouble(double d) throws IOException
public final void writeChars(String s) throws IOException
public final void writeBytes(String s) throws IOException
public final void writeUTF(String s) throws IOException

All data is written in big-endian format. Integers are written in two’s complement in the minimum number of bytes possible. Thus a byte is written as one two’s-complement byte, a short as two two’s-complement bytes, an int as four two’s-complement bytes, and a long as eight two’s-complement bytes. Floats and doubles are written in IEEE 754 form in 4 and 8 bytes, respectively. Booleans are written as a single byte with the value 0 for false and 1 for true. Chars are written as two unsigned bytes.
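These layouts can be checked by writing a few values into a byte array and printing the result in hexadecimal. The sketch below does so; the class name and the hex( ) helper are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class DataFormatSketch {

  // Writes a few primitive values and returns the raw bytes produced.
  static byte[] encode() throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    DataOutputStream dout = new DataOutputStream(sink);
    dout.writeInt(258);        // 4 bytes: 00 00 01 02
    dout.writeShort(-1);       // 2 bytes: ff ff
    dout.writeBoolean(true);   // 1 byte:  01
    dout.writeChar('A');       // 2 bytes: 00 41
    dout.flush();
    return sink.toByteArray();
  }

  // Formats bytes as space-separated two-digit hexadecimal values.
  static String hex(byte[] data) {
    StringBuilder sb = new StringBuilder();
    for (byte b : data) {
      if (sb.length() > 0) sb.append(' ');
      sb.append(String.format("%02x", b & 0xff));
    }
    return sb.toString();
  }

  public static void main(String[] args) throws IOException {
    System.out.println(hex(encode()));   // prints 00 00 01 02 ff ff 01 00 41
  }
}
```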

The last three methods are a little trickier. The writeChars( ) method simply iterates through the String argument, writing each character in turn as a 2-byte, big-endian Unicode character. The writeBytes( ) method iterates through the String argument but writes only the least significant byte of each character. Thus information will be lost for any string with characters from outside the Latin-1 character set. This method may be useful on some network protocols that specify the ASCII encoding, but it should be avoided most of the time.

Neither writeChars( ) nor writeBytes( ) encodes the length of the string in the output stream. Consequently, you can’t really distinguish between raw characters and characters that make up part of a string. The writeUTF( ) method does include the length of the string. It encodes the string itself in a variant of UTF-8 rather than raw Unicode. Since writeUTF( ) uses a variant of UTF-8 that’s subtly incompatible with most non-Java software, it should be used only for exchanging data with other Java programs that use a DataInputStream to read strings. For exchanging UTF-8 text with all other software, you should use an InputStreamReader with the appropriate encoding. (There wouldn’t be any confusion if Sun had just called this method and its partner writeString( ) and readString( ) rather than writeUTF( ) and readUTF( ).)
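The difference among the three string methods shows up in the bytes they produce. The following sketch encodes the same string three ways; the encode( ) helper, which selects a method by name, is a hypothetical convenience for the comparison.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class StringWriteSketch {

  // Returns the bytes produced by one of the three string-writing methods.
  static byte[] encode(String s, String method) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    DataOutputStream dout = new DataOutputStream(sink);
    switch (method) {
      case "writeChars": dout.writeChars(s); break;  // 2 bytes per char, no length
      case "writeBytes": dout.writeBytes(s); break;  // low byte of each char, no length
      case "writeUTF":   dout.writeUTF(s);   break;  // 2-byte length, then the text
      default: throw new IllegalArgumentException(method);
    }
    dout.flush();
    return sink.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    for (String method : new String[] {"writeChars", "writeBytes", "writeUTF"}) {
      System.out.println(method + ": " + encode("hi", method).length + " bytes");
    }
    // Only the writeUTF() form carries its own length, so readUTF() can recover it.
    DataInputStream din = new DataInputStream(
        new ByteArrayInputStream(encode("hi", "writeUTF")));
    System.out.println("readUTF: " + din.readUTF());
  }
}
```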

As well as these methods to write binary numbers, DataOutputStream also overrides three of the customary OutputStream methods:

public void write(int b) throws IOException
public void write(byte[] data, int offset, int length) throws IOException
public void flush(  ) throws IOException

These are invoked in the usual fashion with the usual semantics.

DataInputStream is the complementary class to DataOutputStream. Every format that DataOutputStream writes, DataInputStream can read. In addition, DataInputStream has the usual read( ), available( ), skip( ), and close( ) methods as well as methods for reading complete arrays of bytes and lines of text.

There are 9 methods to read binary data that match the 11 methods in DataOutputStream (there’s no exact complement for writeBytes( ) and writeChars( ); these are handled by reading the bytes and chars one at a time):

public final boolean readBoolean(  ) throws IOException
public final byte readByte(  ) throws IOException
public final char readChar(  ) throws IOException
public final short readShort(  ) throws IOException
public final int readInt(  ) throws IOException
public final long readLong(  ) throws IOException
public final float readFloat(  ) throws IOException
public final double readDouble(  ) throws IOException
public final String readUTF(  ) throws IOException

In addition, DataInputStream provides two methods to read unsigned bytes and unsigned shorts and return the equivalent int. Java doesn’t have either of these data types, but you may encounter them when reading binary data written by a C program:

public final int readUnsignedByte(  ) throws IOException
public final int readUnsignedShort(  ) throws IOException

DataInputStream has the usual two multibyte read( ) methods that read data into an array or subarray and return the number of bytes read. It also has two readFully( ) methods that repeatedly read data from the underlying input stream into an array until the requested number of bytes have been read. If enough data cannot be read, then an IOException is thrown. These methods are especially useful when you know in advance exactly how many bytes you have to read. This might be the case when you’ve read the Content-length field out of an HTTP MIME header and thus know how many bytes of data there are:

public final int read(byte[] input) throws IOException
public final int read(byte[] input, int offset, int length) 
 throws IOException
public final void readFully(byte[] input) throws IOException
public final void readFully(byte[] input, int offset, int length) 
 throws IOException
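As a sketch of the known-length case, the hypothetical readBody( ) helper below reads exactly the number of bytes it is told to expect, for example a value taken from a Content-length field. Parsing the header itself is assumed to happen elsewhere.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadFullySketch {

  // Reads exactly length bytes, looping over the underlying stream as needed.
  static byte[] readBody(InputStream in, int length) throws IOException {
    byte[] body = new byte[length];
    // Throws EOFException, a kind of IOException, if the stream ends early.
    new DataInputStream(in).readFully(body);
    return body;
  }

  public static void main(String[] args) throws IOException {
    byte[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    byte[] body = readBody(new ByteArrayInputStream(data), 5);
    System.out.println("read " + body.length + " bytes");
  }
}
```

A plain read( ) in the same position could legally return after delivering only part of the body.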

Finally, DataInputStream provides the popular readLine( ) method that reads a line of text as delimited by a line terminator and returns a string:

public final String readLine(  ) throws IOException

However, this method should not be used under any circumstances, both because it is deprecated and because it is buggy. It’s deprecated because it doesn’t properly convert bytes to characters in most circumstances. That task is now handled by the readLine( ) method of the BufferedReader class. However, both that method and this one share the same insidious bug: they do not always recognize a single carriage return as ending a line. Rather, readLine( ) recognizes only a linefeed or a carriage return/linefeed pair. When a carriage return is detected in the stream, readLine( ) waits to see whether the next character is a linefeed before continuing. If it is a linefeed, then both the carriage return and the linefeed are thrown away, and the line is returned as a String. If it isn’t a linefeed, then the carriage return is thrown away, the line is returned as a String, and the extra character that was read becomes part of the next line. However, if the carriage return is the last character in the stream (a very likely occurrence if the stream originates from a Macintosh or a file created on a Macintosh), then readLine( ) hangs, waiting for the last character that isn’t forthcoming.

This problem isn’t so obvious when reading files because there will almost certainly be a next character, -1 for end of stream if nothing else. However, on persistent network connections such as those used for FTP and late-model HTTP, a server or client may simply stop sending data after the last character and wait for a response without actually closing the connection. If you’re lucky, the connection may eventually time out on one end or the other and you’ll get an IOException, though this will probably take at least a couple of minutes. If you’re not lucky, the program will hang indefinitely.

Note that it is not enough for your program to merely be running on Windows or Unix to avoid this bug. It must also ensure that it does not send or receive text files created on a Macintosh and that it never talks to Macintosh clients or servers. These are very strong conditions in the heterogeneous world of the Internet. It is obviously much simpler to avoid readLine( ) completely.

Compressing Streams

The java.util.zip package contains filter streams that compress and decompress streams in zip, gzip, and deflate formats. Besides its better-known uses with respect to files, this allows your Java applications to easily exchange compressed data across the network. HTTP 1.1 explicitly includes support for compressed file transfer in which the server compresses and the browser decompresses files, in effect trading increasingly cheap CPU power for still-expensive network bandwidth. This is done completely transparently to the user. Of course, it’s not at all transparent to the programmer who has to write the compression and decompression code. However, the java.util.zip filter streams make it a lot more transparent than it otherwise would be.

There are six stream classes that perform compression and decompression. The input streams decompress data and the output streams compress it. These are:

public class DeflaterOutputStream extends FilterOutputStream
public class InflaterInputStream extends FilterInputStream
public class GZIPOutputStream extends FilterOutputStream
public class GZIPInputStream extends FilterInputStream
public class ZipOutputStream extends FilterOutputStream
public class ZipInputStream extends FilterInputStream

All of these use essentially the same compression algorithm. They differ only in various constants and meta-information included with the compressed data. In addition, a zip stream may contain more than one compressed file.

Compressing and decompressing data with these classes is almost trivially easy. You simply chain the filter to the underlying stream and read or write it like normal. For example, suppose you want to read the compressed file allnames.gz. You simply open a FileInputStream to the file and chain a GZIPInputStream to that like this:

FileInputStream fin = new FileInputStream("allnames.gz");
GZIPInputStream gzin = new GZIPInputStream(fin);

From that point forward, you can read uncompressed data from gzin using merely the usual read( ), skip( ), and available( ) methods. For instance, this code fragment reads and decompresses a file named allnames.gz in the current working directory:

FileInputStream fin   = new FileInputStream("allnames.gz");
GZIPInputStream gzin  = new GZIPInputStream(fin);
FileOutputStream fout = new FileOutputStream("allnames");
int b = 0;
// copy the decompressed bytes to the output file one at a time
while ((b = gzin.read(  )) != -1) fout.write(b);
gzin.close(  );
fout.flush(  );
fout.close(  );

In fact, it isn’t even necessary to know that gzin is a GZIPInputStream for this to work. A simple InputStream type would work equally well. For example:

InputStream in = new GZIPInputStream(new FileInputStream("allnames.gz"));
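Compression runs the same way in the other direction. The following sketch compresses a byte array with a GZIPOutputStream and restores it with a GZIPInputStream, using in-memory streams in place of files or network connections; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipRoundTripSketch {

  // Compresses data by writing it through a GZIPOutputStream.
  static byte[] compress(byte[] data) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    try (GZIPOutputStream gzout = new GZIPOutputStream(sink)) {
      gzout.write(data);
    }   // closing the stream finishes the gzip trailer
    return sink.toByteArray();
  }

  // Decompresses data by reading it through a GZIPInputStream.
  static byte[] decompress(byte[] packed) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(packed))) {
      byte[] buffer = new byte[4096];
      int n;
      while ((n = in.read(buffer)) != -1) sink.write(buffer, 0, n);
    }
    return sink.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] original = "all names, all names, all names".getBytes(StandardCharsets.UTF_8);
    byte[] restored = decompress(compress(original));
    System.out.println(new String(restored, StandardCharsets.UTF_8));
  }
}
```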

DeflaterOutputStream and InflaterInputStream are equally straightforward. ZipInputStream and ZipOutputStream are a little more complicated because a zip file is actually an archive that may contain multiple entries, each of which must be read separately. Each file in a zip archive is represented as a ZipEntry object whose getName( ) method returns the original name of the file. For example, this code fragment decompresses the archive shareware.zip in the current working directory:

FileInputStream fin = new FileInputStream("shareware.zip");
ZipInputStream zin = new ZipInputStream(fin);
ZipEntry ze = null;
int b = 0;
while ((ze = zin.getNextEntry(  )) != null) {
  FileOutputStream fout = new FileOutputStream(ze.getName(  ));
  while ((b = zin.read(  )) != -1) fout.write(b);
  zin.closeEntry(  );
  fout.flush(  );
  fout.close(  );
}
zin.close(  );
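Writing an archive mirrors reading one: each entry is opened with putNextEntry( ), written, and closed. The sketch below builds a small archive in memory and lists its entries again. The class name, the method names, and the entry names are all illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipRoundTripSketch {

  // Packs named byte arrays into a zip archive held in memory.
  static byte[] pack(Map<String, byte[]> files) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    try (ZipOutputStream zout = new ZipOutputStream(sink)) {
      for (Map.Entry<String, byte[]> file : files.entrySet()) {
        zout.putNextEntry(new ZipEntry(file.getKey()));
        zout.write(file.getValue());
        zout.closeEntry();
      }
    }
    return sink.toByteArray();
  }

  // Lists the entry names in the order they appear in the archive.
  static List<String> names(byte[] archive) throws IOException {
    List<String> result = new ArrayList<>();
    try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(archive))) {
      ZipEntry ze;
      while ((ze = zin.getNextEntry()) != null) {
        result.add(ze.getName());
        zin.closeEntry();
      }
    }
    return result;
  }

  public static void main(String[] args) throws IOException {
    Map<String, byte[]> files = new LinkedHashMap<>();
    files.put("a.txt", new byte[] {1, 2, 3});
    files.put("b.txt", new byte[] {4, 5});
    System.out.println(names(pack(files)));
  }
}
```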

Digest Streams

The java.security package contains two filter streams that can calculate a message digest for a stream. They are DigestInputStream and DigestOutputStream. A message digest, represented in Java by the java.security.MessageDigest class, is a strong hash code for the stream; that is, it is a large integer (typically 20 bytes long in binary format) that can easily be calculated from a stream of any length in such a fashion that no information about the stream is available from the message digest. Message digests can be used for digital signatures and for detecting data that has been corrupted in transit across the network.

In practice, the use of message digests in digital signatures is more important. Mere data corruption can be detected with much simpler, less computationally expensive algorithms. However, the digest filter streams are so easy to use that at times it may be worth paying the computational price for the corresponding increase in programmer productivity. To calculate a digest for an output stream, you first construct a MessageDigest object that uses a particular algorithm, such as the Secure Hash Algorithm (SHA). You pass both the MessageDigest object and the stream you want to digest to the DigestOutputStream constructor. This chains the digest stream to the underlying output stream. Then you write data onto the stream as normal, flush it, close it, and invoke the getMessageDigest( ) method to retrieve the MessageDigest object. Finally you invoke the digest( ) method on the MessageDigest object to finish calculating the actual digest. For example:

MessageDigest sha = MessageDigest.getInstance("SHA");
DigestOutputStream dout = new DigestOutputStream(out, sha);
byte[] buffer = new byte[128];
while (true) {
  int bytesRead = in.read(buffer);
  if (bytesRead < 0) break;
  dout.write(buffer, 0, bytesRead);
}
dout.flush(  );
dout.close(  );
byte[] result = dout.getMessageDigest().digest(  );

Calculating the digest of an input stream you read is equally simple. It still isn’t quite as transparent as some of the other filter streams because you do need to be at least marginally conversant with the methods of the MessageDigest class. Nonetheless, it’s still far easier than writing your own secure hash function and manually feeding it each byte you write.
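
As a rough sketch of the input side, the following method wraps an arbitrary input stream in a DigestInputStream, reads it to the end, and returns the digest. The class and method names are hypothetical. I've asked for the algorithm by the name "SHA-1", on the assumption that the installed provider treats "SHA" as another name for the same 20-byte algorithm, and the code assumes Java 7 or later:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class; not part of the class library.
class DigestRead {

  // Reads in to the end and returns the SHA-1 digest of its bytes.
  // Note that this also closes in.
  static byte[] digestOf(InputStream in)
      throws IOException, NoSuchAlgorithmException {
    MessageDigest sha = MessageDigest.getInstance("SHA-1");
    try (DigestInputStream din = new DigestInputStream(in, sha)) {
      byte[] buffer = new byte[128];
      while (din.read(buffer) != -1) {
        // nothing to do here; each read updates the digest
        // as a side effect
      }
      return din.getMessageDigest().digest();
    }
  }

  public static void main(String[] args) throws Exception {
    byte[] data = "abc".getBytes(StandardCharsets.US_ASCII);
    byte[] result = digestOf(new ByteArrayInputStream(data));
    StringBuilder hex = new StringBuilder();
    for (byte b : result) hex.append(String.format("%02x", b));
    System.out.println(hex);
  }
}
```

A real program would normally do something useful with the bytes it reads inside the loop; the digest accumulates regardless.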

Of course, you also need a way of associating a particular message digest with a particular stream. In some circumstances, the digest may be sent over the same channel used to send the digested data. The sender can calculate the digest as it sends data, while the receiver calculates the digest as it receives the data. When the sender is done, it sends some signal that the receiver recognizes as indicating end of stream and then sends the digest. The receiver receives the digest, checks that the digest received is the same as the one calculated locally, and closes the connection. If the digests don’t match, the receiver may instead ask the sender to send the message again. Alternatively, both the digest and the files it digests may be stored in the same zip archive. And there are many other possibilities. Situations like this generally call for the design of a relatively formal custom protocol. However, while the protocol may be complicated, the calculation of the digest is straightforward, thanks to the DigestInputStream and DigestOutputStream filter classes.
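
Here is one way such a protocol might look. This is only a sketch under assumptions of my own: instead of an end-of-stream signal, the sender announces the message length in a four-byte header, then sends the message, then sends the digest. The class name DigestProtocol, the framing, and the choice of SHA-1 are all hypothetical. The important detail is that the header and the trailing digest bypass the digest streams, so only the message itself is digested:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical framing: 4-byte length, message bytes, SHA-1 digest.
class DigestProtocol {

  static void send(byte[] message, OutputStream out)
      throws IOException, NoSuchAlgorithmException {
    MessageDigest sha = MessageDigest.getInstance("SHA-1");
    // The length header goes straight to out and is not digested.
    new DataOutputStream(out).writeInt(message.length);
    // The message passes through the digest stream.
    DigestOutputStream dout = new DigestOutputStream(out, sha);
    dout.write(message);
    dout.flush();
    // The digest itself is written to the underlying stream.
    out.write(sha.digest());
    out.flush();
  }

  static byte[] receive(InputStream in)
      throws IOException, NoSuchAlgorithmException {
    MessageDigest sha = MessageDigest.getInstance("SHA-1");
    DataInputStream raw = new DataInputStream(in);
    int length = raw.readInt();
    if (length < 0) throw new IOException("Negative message length");
    // Read the message through the digest stream.
    byte[] message = new byte[length];
    new DataInputStream(new DigestInputStream(in, sha)).readFully(message);
    // Read the sender's digest from the underlying stream and compare.
    byte[] received = new byte[sha.getDigestLength()];
    raw.readFully(received);
    if (!MessageDigest.isEqual(received, sha.digest())) {
      throw new IOException("Digest mismatch; ask the sender to resend");
    }
    return message;
  }

  public static void main(String[] args) throws Exception {
    ByteArrayOutputStream wire = new ByteArrayOutputStream();
    send("digest me".getBytes(StandardCharsets.UTF_8), wire);
    byte[] message = receive(new ByteArrayInputStream(wire.toByteArray()));
    System.out.println(new String(message, StandardCharsets.UTF_8));
  }
}
```

Neither method closes the stream it is given, since in a real exchange the connection would typically stay open for the receiver's reply.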

Encrypting Streams

Not all filter streams are part of the core Java API. For legal reasons, the filters for encrypting and decrypting data, CipherInputStream and CipherOutputStream, are part of a standard extension to Java called the Java Cryptography Extension, JCE for short. This is in the javax.crypto package. Sun provides an implementation of this API in the U.S. and Canada available from http://java.sun.com/products/jce/, and various third parties have written independent implementations that are available worldwide. Of particular note is the more or less Open Source Cryptix package, which can be retrieved from http://www.cryptix.org/.

The CipherInputStream and CipherOutputStream classes are both powered by a Cipher engine object that encapsulates the algorithm used to perform encryption and decryption. By changing the Cipher engine object, you change the algorithm that the streams use to encrypt and decrypt. Most ciphers also require a key that’s used to encrypt and decrypt the data. Symmetric or secret key ciphers use the same key for both encryption and decryption. Asymmetric or public key ciphers use different keys for encryption and decryption. The encryption key can be distributed as long as the decryption key is kept secret. Keys are specific to the algorithm in use, and are represented in Java by instances of the java.security.Key interface. The Cipher object is set in the constructor. Like all filter stream constructors, these constructors also take the underlying stream as an argument:

public CipherInputStream(InputStream in, Cipher c)
public CipherOutputStream(OutputStream out, Cipher c)

To get a properly initialized Cipher object, you use the static Cipher.getInstance( ) factory method. This Cipher object must be initialized for either encryption or decryption with init( ) before being passed into one of the previous constructors. For example, this code fragment prepares a CipherInputStream for decryption using the password “two and not a fnord” and the Data Encryption Standard (DES) algorithm:

byte[] desKeyData = "two and not a fnord".getBytes(  );
DESKeySpec desKeySpec = new DESKeySpec(desKeyData);
SecretKeyFactory keyFactory = SecretKeyFactory.getInstance("DES");
SecretKey desKey = keyFactory.generateSecret(desKeySpec);
Cipher des = Cipher.getInstance("DES");
des.init(Cipher.DECRYPT_MODE, desKey);
CipherInputStream cin = new CipherInputStream(fin, des);

This fragment uses classes from the java.security, java.security.spec, javax.crypto, and javax.crypto.spec packages. Different implementations of the JCE support different groups of encryption algorithms. Common algorithms include DES, RSA, and Blowfish. The construction of a key is generally algorithm specific. Consult the documentation for your JCE implementation for more details.
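
To see both cipher streams working together, here is a small in-memory round trip. It is a sketch only: the class name DesRoundTrip is hypothetical, and the code assumes Java 7 or later with a provider installed that supports the "DES" algorithm name. DES is used to match the fragment above; it is no longer considered secure for new work. Note that DESKeySpec uses only the first eight bytes of the password:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.SecretKey;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.DESKeySpec;

// Hypothetical helper class; not part of the JCE.
class DesRoundTrip {

  // Builds a DES key from the first eight bytes of the password.
  static SecretKey keyFrom(String password) throws GeneralSecurityException {
    DESKeySpec spec =
        new DESKeySpec(password.getBytes(StandardCharsets.US_ASCII));
    return SecretKeyFactory.getInstance("DES").generateSecret(spec);
  }

  // Encrypts by writing plain text through a CipherOutputStream.
  static byte[] encrypt(byte[] plain, SecretKey key)
      throws IOException, GeneralSecurityException {
    Cipher des = Cipher.getInstance("DES");
    des.init(Cipher.ENCRYPT_MODE, key);
    ByteArrayOutputStream bout = new ByteArrayOutputStream();
    try (CipherOutputStream cout = new CipherOutputStream(bout, des)) {
      cout.write(plain);
    } // closing the stream writes the final padded block
    return bout.toByteArray();
  }

  // Decrypts by reading cipher text through a CipherInputStream.
  static byte[] decrypt(byte[] encrypted, SecretKey key)
      throws IOException, GeneralSecurityException {
    Cipher des = Cipher.getInstance("DES");
    des.init(Cipher.DECRYPT_MODE, key);
    ByteArrayOutputStream bout = new ByteArrayOutputStream();
    try (CipherInputStream cin =
        new CipherInputStream(new ByteArrayInputStream(encrypted), des)) {
      byte[] buffer = new byte[64];
      int bytesRead;
      while ((bytesRead = cin.read(buffer)) != -1) {
        bout.write(buffer, 0, bytesRead);
      }
    }
    return bout.toByteArray();
  }

  public static void main(String[] args) throws Exception {
    SecretKey key = keyFrom("two and not a fnord");
    byte[] plain = "secret text".getBytes(StandardCharsets.US_ASCII);
    byte[] restored = decrypt(encrypt(plain, key), key);
    System.out.println(new String(restored, StandardCharsets.US_ASCII));
  }
}
```

Each direction gets its own Cipher object because a Cipher is initialized for exactly one mode at a time.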

CipherInputStream overrides most of the normal InputStream methods like read( ) and available( ). CipherOutputStream overrides most of the usual OutputStream methods like write( ) and flush( ). These methods are all invoked much as they would be for any other stream. However, as the data is read or written, the stream’s Cipher object either decrypts or encrypts the data. (Assuming your program wants to work with unencrypted data as is most commonly the case, a cipher input stream will decrypt the data, and a cipher output stream will encrypt the data.) For example, this code fragment encrypts the file secrets.txt using the password “Mary had a little spider”:

String infile   = "secrets.txt";
String outfile  = "secrets.des";
String password = "Mary had a little spider";
    
 try {

   FileInputStream fin = new FileInputStream(infile);
   FileOutputStream fout = new FileOutputStream(outfile);

   // register the provider that implements the algorithm
   Provider sunJce = new com.sun.crypto.provider.SunJCE(  );
   Security.addProvider(sunJce);

   // create a key
   char[] pbeKeyData = password.toCharArray(  );
   PBEKeySpec pbeKeySpec = new PBEKeySpec(pbeKeyData);
   SecretKeyFactory keyFactory = 
       SecretKeyFactory.getInstance("PBEWithMD5AndDES");
   SecretKey pbeKey = keyFactory.generateSecret(pbeKeySpec);

   // use Data Encryption Standard
   Cipher pbe = Cipher.getInstance("PBEWithMD5AndDES");
   pbe.init(Cipher.ENCRYPT_MODE, pbeKey);
   CipherOutputStream cout = new CipherOutputStream(fout, pbe);

   byte[] input = new byte[64];
   while (true) {
     int bytesRead = fin.read(input);
     if (bytesRead == -1) break;
     cout.write(input, 0, bytesRead);
   }
      
   cout.flush(  );
   cout.close(  );
   fin.close(  );

  }
 catch (Exception e) {
   System.err.println(e);
   e.printStackTrace(  );
 }

I admit that this is more complicated than it needs to be. There’s a lot of setup work involved in creating the Cipher object that actually performs the encryption. Partly that’s a result of key generation involving quite a bit more than a simple password. However, a large part of it is also due to inane U.S. export laws that prevent Sun from fully integrating the JCE with the JDK and JRE. To a large extent, the complex architecture used here is driven by a need to separate the actual encrypting and decrypting code from the cipher stream classes.
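
Decryption reverses the process with a CipherInputStream. The sketch below is an in-memory round trip rather than a file-to-file program, and it makes one assumption the fragment above does not: it passes an explicit salt and iteration count to init( ) through a PBEParameterSpec so that both directions derive the same key. The class name PbeRoundTrip, the salt bytes, and the iteration count are hypothetical values chosen for illustration, and the code assumes Java 7 or later with a provider that supports PBEWithMD5AndDES:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.SecretKey;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.PBEParameterSpec;

// Hypothetical helper class; not part of the JCE.
class PbeRoundTrip {

  // Illustrative values only. Both sides must use the same ones.
  static final byte[] SALT = {1, 2, 3, 4, 5, 6, 7, 8};
  static final int ITERATIONS = 1000;

  // Creates a cipher initialized for the given mode and password.
  static Cipher cipherFor(int mode, char[] password)
      throws GeneralSecurityException {
    SecretKey key = SecretKeyFactory.getInstance("PBEWithMD5AndDES")
        .generateSecret(new PBEKeySpec(password));
    Cipher pbe = Cipher.getInstance("PBEWithMD5AndDES");
    pbe.init(mode, key, new PBEParameterSpec(SALT, ITERATIONS));
    return pbe;
  }

  static byte[] encrypt(byte[] plain, char[] password)
      throws IOException, GeneralSecurityException {
    ByteArrayOutputStream bout = new ByteArrayOutputStream();
    try (CipherOutputStream cout = new CipherOutputStream(
        bout, cipherFor(Cipher.ENCRYPT_MODE, password))) {
      cout.write(plain);
    }
    return bout.toByteArray();
  }

  static byte[] decrypt(byte[] encrypted, char[] password)
      throws IOException, GeneralSecurityException {
    ByteArrayOutputStream bout = new ByteArrayOutputStream();
    try (CipherInputStream cin = new CipherInputStream(
        new ByteArrayInputStream(encrypted),
        cipherFor(Cipher.DECRYPT_MODE, password))) {
      byte[] input = new byte[64];
      int bytesRead;
      while ((bytesRead = cin.read(input)) != -1) {
        bout.write(input, 0, bytesRead);
      }
    }
    return bout.toByteArray();
  }

  public static void main(String[] args) throws Exception {
    char[] password = "Mary had a little spider".toCharArray();
    byte[] plain = "attack at dawn".getBytes(StandardCharsets.US_ASCII);
    byte[] restored = decrypt(encrypt(plain, password), password);
    System.out.println(new String(restored, StandardCharsets.US_ASCII));
  }
}
```

In a real application, the salt would normally be generated randomly and stored or sent alongside the encrypted data rather than hardcoded.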
