Final

EE552

Data Compression Co-processor

Final Report

March 29, 1999

Dr. Duncan Elliott

Tim Bensler

Eric Chan

Abstract

The following report outlines the development of a data compression coprocessor implemented in hardware using an Altera EPF10K20RC240-4 chip interfacing with a personal computer via the RS-232 Serial Port. In this report is a description of how we built the RS-232 interface, how we interfaced with SRAM external to the FPGA, and a description of the compression algorithm that we employed.

We were able to successfully send and receive data with a personal computer at high speeds, and we were able to successfully access SRAM, which was critical for the compression algorithm we used.

Overview

In this project, the objective was to build a data compression co-processor. Based on the Lempel-Ziv algorithm, the chip is namely a trivial implementation of the LZ78 algorithm. Data is transferred as a binary data stream to/from a PC computer through a RS-232 serial port. In a client/server model point of view, the PC will act as the server, and our ASIC will act as the client, waiting for the server (host) requests.

Encoded in the FPGA is the logic to create a serial port, with handshaking, that will communicate with the computer. The clear to send (CTS) line is used to indicate when the co-processor is ready to receive more data. Logic to communicate with external SRAM is also needed, as our compression algorithm creates several tables that take up more memory than is available to us on the chip. The compression algorithm interfaces with these two components in order to create a useful co-processor.

The coprocessor works by taking a stream of uncompressed data sent into the compression engine. The data is compressed and sent back to the computer where it is received by the PC and saved as a file. Due to time constraints, and space limitations of the chip, only the compression algorithm was encoded to the chip.

A Graphical User Interface (GUI) written in the Java language is used to provide the user with an interactive feedback, and to establish a communication interface with the coprocessor. The GUI will also incorporate the software realization of the chip (meaning both the compression and the decompression program). The software will be used to prove the correctness of our hardware implementation.

In order for the PC and the Co-Processor to synchronize so that data integrity is not lost during data transmissions, we are using the following settings to establish our serial communication link.

Baud Rate: 9600bps

Non-Parity, 8bits per packet, 1start and 1 stop bit (N, 8, 1)

IC Information

Implementation Features:

The compression co-processor was implemented with the following features:

Serial Interface with a personal computer using the RS-232 serial port at 9600 baud
Lempel-Ziv-Welch compression algorithm (LZ78)
External SRAM for use by the compression algorithm
LED's on the UP1 board to indicate reception and transmission of data to and from the FPGA
Debounced start push button

Chip Information:

Module	Logic Cells Required (Total = 1152)	Speed
Byte Encoder (includes RS-232 interface)	322 (28%)	30.2 MHz
SRAM Controller and Interface	162 (14%)	32.4 MHz
LZ78 Encoder	646 (56%)	16.33 MHz

Pin Details:

Signal Name	Type	Pin/Hole	Description
Expansion Slot C
shiftin	In	175/15	Serial Data Stream into FPGA
shiftout	Out	184/19	Serial Data Stream to computer
CTS	Out	182/17	handshaking line to out to computer to indicate FPGA ready for more data
D(0)	INOUT	196/30	Data line to/from SRAM I/O1
D(1)	INOUT	194/28	Data line to/from SRAM I/O1
D(2)	INOUT	192/26	Data line to/from SRAM I/O1
D(3)	INOUT	191/25	Data line to/from SRAM I/O1
D(4)	INOUT	193/27	Data line to/from SRAM I/O1
D(5)	INOUT	195/29	Data line to/from SRAM I/O1
D(6)	INOUT	198/31	Data line to/from SRAM I/O1
D(7)	INOUT	200/33	Data line to/from SRAM I/O1
A(0)	OUT	199/32	Address line for SRAM A1
A(1)	OUT	201/34	Address line for SRAM A2
A(2)	OUT	203/36	Address line for SRAM A3
A(3)	OUT	206/38	Address line for SRAM A4
A(4)	OUT	208/40	Address line for SRAM A5
A(5)	OUT	215/42	Address line for SRAM A6
A(6)	OUT	218/44	Address line for SRAM A7
A(7)	OUT	220/46	Address line for SRAM A8
A(8)	OUT	217/43	Address line for SRAM A9
A(9)	OUT	214/41	Address line for SRAM A10
A(10)	OUT	204/37	Address line for SRAM A11
A(11)	OUT	207/39	Address line for SRAM A12
A(12)	OUT	222/48	Address line for SRAM A13
A(13)	OUT	219/45	Address line for SRAM A14
A(14)	OUT	221/47	Address line for SRAM A15
nce	OUT	202/35	SRAM chip enable line (ce')
noe	OUT	227/52	SRAM output enable line (oe')
nwe	OUT	229/54	SRAM write enable line (we')
reset	IN	29	reset button
button	IN	28	start button
Expansion Slot A
led_in	OUT	45	LED to indicate incoming data is being received in from serial port
led_out	OUT	48	LED to indicate data is being sent out the serial port

Table of Contents

Abstract *

Overview *

IC Information *

Implementation Features: *

Chip Information: *

Pin Details: *

Type *

Pin/Hole *

Description *

Design Details and Documentation *

RS-232 Serial Port *

SRAM Interface *

Compression Engine *

Lempl-Ziv-Welch (LZW) Data Compression Algorithm *

External Circuitry *

Tests and Experiments *

Design Verification *

RS232 Interface *

Byte Encoder *

SRAM Interface *

Compression Engine and Overall Architecture *

References *

VHDL Code *

Design Details and Documentation

The design of this chip can be broken down into three main areas: the serial port interface, the SRAM interface and the compression engine. The block diagram illustrates the roles of the various entities and how they are tied together. A detailed explanation of each area is provided below. (block diagram and design hierarchy)

RS-232 Serial Port

The serial port component consists of a receiver, RS232_In, and a transmitter, RS232_Out, component. Both components are written to send and receive a packet consisting of one stop bit (logic 0), then eight data bits followed by one stop bit (logic 1).

The receiver, operating at a 16X baud rate, waits for the value on the input line to go low. Since the input line is held high in the idle state, a falling edge will indicate that there is incoming data. In order for us to read in data at the middle of each clock period instead of the triggering at every clock edge, the first data bit is sampled after 24 clock cycles. We will then sample the middle of every data bit every 16 clock cycles thereafter until the complete data word has been read in. The stop bit (logic 1) indicates the end of transmission, and the cycle repeats.

The transmitter is a much simpler model than the receiver. When a data word is to be transmitted over the serial port, we place the data into a shift register. Before putting the data word into it however, we need to wait until the ready signal is high. Once it is ready, a start bit and one stop bit are appended to the data word. We then assert the load signal. This will trigger the register to start transmitting data. The ready signal will go low to indicate that the register is busy. Once the shift register has finished transmitting the data, ready signal is asserted to indicate that it is idle. Note that the transmitter operates at the same baud rate as the computer, and the data can be successfully sent.

We have successfully implemented the hardware handshaking protocol using the RTS/CTS lines on the serial port. This will eliminated any chances of a buffer underrun due to our processor's inability to process the incoming data fast enough.

The data traffic transmitted by the compression engine encodes data as 12-bit words. In order to transmit this data, a component is needed split the twelve bit code words into eight bit data words. The component dataStreamOut takes the first code word and transmits the most significant eight bits. It then takes the last four bits of the code word and the first four bits of the next data word to create the next 8-bit data word to be transmitted. This way, the component will correctly convert the 12-bit words into 8-bit words that can be transmitted over the serial interface. The entity byteEncoder ties the serial port and the dataStreamOut components together.

SRAM Interface

The SRAM interface is written to allow us to access a 32K X 8 Static RAM module. The SRAM has a 100ns cycle time for both read/write. Due to the limitations of our system clock (25.175MHz), we were only able to access the data from the SRAM with a cycle time of 140ns. This is not a big issue to us, because our transmission is at 9600bps with a cycle time of over 1000 ms when we factor into account that we read 10 bits at a time.

For the user to read data from the SRAM, the CE' signal is asserted and WE' signal should be "high". At the same time, the user should input a valid data address. The data on the data bus should not be read until the done signal is asserted.

For the user to write data from the SRAM, the CE' signal is asserted and WE' signal is also asserted. At the same time, the user should input a valid data address and a valid data word. The data won't be written to the SRAM until the done signal is asserted. To facilitate using the SRAM with our compression engine, an entity sram_access was written. Since the SRAM is 8 bits wide, and we require storage 12 bits wide, this module was created to take care of the double read/write that is required each time ram is accessed. This entity also takes care of initializing the entire SRAM when required. The SRAM is initialized with the constant H"0100" to indicate that the space is free to be written to.

Compression Engine

Lempl-Ziv-Welch (LZW) Data Compression Algorithm

The LZW algorithm is a dynamically adaptive algorithm was compresses binary data into a stream of codes. It does not analyze the incoming data. It merely adds every new string of characters (substrings) it sees to a Look-Up-Table (LUT) of strings.

The code that the LZW algorithm outputs codes that can be of any arbitrary length, but it must be expressed in more than a single byte (one character). For our project, 12bit codes were implemented. The first 256 codes are reserved for the standard character set with the remaining codes assigned to substrings as the algorithm is executed.

In a nutshell, the basic operation of the algorithm revolves around looking through the LUT for substrings which is composed of a prefix code and a append character. Every substring is composed of a prefix code and a append char. A prefix code can either reference another prefix code or a append char.

Input String = /WED/WE/WEE/WEB/WET
Character Input	Code Output	New code value	New String
/W	/	256	/W
E	W	257	WE
D	E	258	ED
/	D	259	D/
WE	256	260	/WE
/	E	261	E/
WEE	260	262	/WEE
/W	261	263	E/W
EB	257	264	WEB
/	B	265	B/

Figure 1: Example of how LZW is performed

The compression engine is a large state machine that implements the Lempel-Ziv-Welch compression algorithm (LZ78), which is a dictionary based coding algorithm. The algorithm works by looking for patterns in the data stream and matching them to patterns in the dictionary (string table). When a new pattern, or substring, is found it is added to the dictionary and a code is created based upon the data in the substring. If the substring is encountered again it can simply be encoded with the code that has already been identified. The compression engine entity, comp_eng, is tied together with the SRAM and the serial port in the entity coprocessor.

External Circuitry

The schematic for the external circuitry shows how we connected our circuit to the SRAM, as well as the circuitry needed for our serial port.

Connecting to the SRAM was simply a matter of connecting all the pins on the SRAM chip to one of the expansion slots on the UP1 board.

To connect to the computer we used voltage level converter chips to take the voltage from the computer's RS-232 and make it compatible with the voltages on the UP1 board. Using the MC1488 chip, a logic 1 (-15V to -3V) is taken from the computer and converted to a logic one (5V) for use by the FPGA. Likewise it takes any voltage from +3 volts to +15 volts and converts it to 0 volts for the UP1 board. The MC1489 chip does the compliment of the MC1488 chip, taking logic 0 from the board and converting it to +12 volts and in a similar manner takes +5 volts and converts it to -12volts. (schematic of circuit)

Tests and Experiments

Throughout the design process of the data co-processor, various tests and experiments were performed. Some of the specific problems we had were solved by the testing of different solutions.

On our RS-232 serial port interface, we experimented with different baud rates, and with the best way to implement the handshaking. We tested our serial port up to a baud rate of 56000, and found that it could successfully send and receive at this speed (although we will be using it at 9600 baud). Initially we had no handshaking and found that we were dropping approximately 10% of the data in a test which simply took the data (as an 8-bit word) and sent it immediately back out. We experimented with using a 4-byte FIFO, but found this prohibitively large, and it still gave us problems. With handshaking we were able to eliminate the problem of buffer underrun. We are only using the RTS/CTS lines, which means the DTR/DSR lines should be tied together to avoid errors with the software which will also check these lines in handshaking mode.

In the compression algorithm, we repeatedly perform 12-bit subtraction between two non-static registers, and these severely cut our maximum clock cycle. We tried converting the std_logic_vectors to integers first and then converting them back, but this was unsuccessful in improving the clock speed. We tried changing the number that was to be subtracted to a negative number in one state, and then adding it in the next state (thus saving some time in not having to convert and add all in one state). Although this improved the clock speed by several megahertz, it also increased our logic by 3%, and it still did not get our speed above 25.175MHz. Optimizing for speed in compilation only saved 1ns for a 1% cost in logic. We found that by splitting our state machine into several processes we were able to trim 8% from the logic size.

To test our SRAM controller, a test architecture was written that writes to the entire SRAM, and then, using the push button to advance to different addresses, would display the contents of each address to the seven-segment display. This test was carried out to ensure that all our timing and pin connections where correct for later use.

Design Verification

The ultimate verification of our design is a synthesised chip that does what we have set out to make it do. On the way to that end each individual module was tested and verified.

RS232 Interface

To test the serial port, each individual component was simulated, and verified to be working correctly.

The RS_232_In, and RS_232_Out components were verified to send/receive data as expected when we applied a sample data word. The two components were then combined into one component called serialPort.

A component loopback was written that would take data that was being read in and send the same data back to the computer. This also simulated correctly in MaxPlus2. We then took this to the hardware lab and synthesized it to ensure proper operation.

A test batch file was written that would send a file out the computer's serial port, and also receive the data that was looped back. It would write the looped back data to a new file so that we could verify that the serial port was not dropping any bits. The correctness of the design can be proven by comparing the transmitted data with the received data.

By using this method of testing our design, we can perform long transmissions at full speed to ensure that the design is reliable. Simulations cannot detect clocking skews or rare bugs in our design. In testing the hardware, we discovered that we can not get our clock divider to produce exactly 9600 baud from our 25.175MHz clock (clock frequency on the UP1 board). Ultimately, after long-term tests, we determined that hardware handshaking signals are required to ensure a reliable connection.

A testbench (loopbacktest) for the entire serial port was written to continuously send data to the serial port. In the testbench, two different data words were alternately sent to ensure the data was being read in correctly. The loopback architecture that we had written for the hardware test took the data that we were sending to the serial port and immediately sent it back to the testbench. To ensure that the serial port is sending and receiving correctly, assert statements were used to check the data coming back in.

Byte Encoder

Our compression engine will generate 12-bit word on its output, and so we needed to write a module that would take the 12-bit code and convert it to 8-bit words to be sent out the serial port. We simulated this module by loading data into the encoder, and checking to ensure that the data is correctly parsed into 8-bit words to be sent out the serial port. This module is coupled with the serial port architecture in the larger architecture byteEncoder. This architecture is responsible for taking data in from the serial port and passing it to the compression engine. It is also responsible for taking 12 bit code words from the compression engine and correctly sending the data back to the computer. To test the parsing of data done by this module, dataStreamOut was simulated by inputting 12 bit code words and loading them in when it indicates that it is ready to receive the new bytes. This module performed as expected. With these two modules working together correctly the overall byteEncoder is simulated and shown to be working correctly.

SRAM Interface

Although we cannot simulate the reading of data from the SRAM device through simulation due to lack of a VHDL module, we have developed a way to test our design. What we did was we wrote a test module which flooded the values H"00" to H"FF" repeatedly into the SRAM. Once we have written the data into the device, we read the data back from the SRAM back to a LED on the UP1 Flex board. By performing these series of operations, we were able to determine the correctness of our VHDL module.

The sram_access entity provides a front end to the SRAM controller so that the compression engine need only indicate that it wants to do a read, and the sram_access will take care of the details of writing/reading the 12-bit word to/from the 8-bit wide memory. Simulation shows that this entity performs correctly.

One thing we did notice when doing these simulations on MaxPlus2 was that if we simulated in our waveform the data line in to the SRAM as well as the same data line back from the SRAM (an INOUT signal) we received logic contention warnings (since it couldn't figure out who was driving that line and when). However, if we removed from the waveform the data line coming back from the SRAM then no warnings were given.

Compression Engine and Overall Architecture

The first concern that we have to contend during the implementation phase of this module was the testability of the design. Because we cannot read data back from the SRAM in our simulations (lack of VHDL model), we were not able to prove the correctness of the written code. However, due to the fact that this module is essentially a hardware counterpart of our working software code, we were confident that the problems are mainly in either the SRAM module or the byteEncoder module. However, as of this writing a test bench was written to show correct operation of the code during startup. This behaviour is also expressed when the code was programmed into the ALTERA chip. However, to no avail we were unable to determine why our project would not send back any data even though all operations seems to be working properly.

The simulation does show that the chip starts out as expected. After reset, the entire SRAM is initialised, and it waits for data to come in. As soon as the first byte is received the compression engine starts to process the data. Once the processing of data begins and it begins to write data to memory, and begin to rely on this data, our simulation becomes invalid.

Since it is hard to simulate how the SRAM will respond when it is called upon by the compression engine, we found that after carefully checking the code, the chip would best be verified by synthesising it, and verifying it in hardware.

To verify the compression engine was to verify the entire project, since verifying that the compression engine was working properly also meant that all the other components had to be working correctly. After testing the other modules and having some confidence that they would work as expected, we connected the components together in the coprocessor architecture, and synthesized the chip. Our design was a tight fit, using 97% of the logic cells, which left us very little logic to write in debugging signals/states. At the time of writing we have been unable to perform a complete verification of the chip.

References

Adaptive Data Compression, Ross N. Williams, Kluwer Academic Publishers, 1991

Data Compression, Gilbert Held, Wiley Heyden Publication, 1983

LZW Data Compression, Mark Nelson, Dr. Dobb's Journal, Oct. 1989

http://www.dogma.net/markn/articles/lzw/lzw.htm

Compression Algorithms Compared

http://www.iucr.ac.uk/iucr-top/cif/cbf/PAPER/performances.html

The Data Compression, Mark Nelson, M&T Books, 2^nd Edition, 1995

VHDL Code

VHDL Code

EE552 Final Presentation