EE 552

EE 552 Final Report

The uCMK Microprocessor

Michael Rivest
Kelly Lawson
Charlene Eriksen

March 30, 1999

Sections

Data Sheet

Abstract

The uCMK Microprocessor

Overall Design Diagram

Design Documentation and Interface Descriptions

The uCMK Microprocessor Data Sheet

Overview and Features

The uCMK microprocessor consists of the following:

16-bit load/store RISC architecture
11 16-bit instructions
8 general-purpose registers
5 pipeline stages, so that one instruction completes on each clock cycle and up to five instructions are in progress at any one time
512 x 16-bit instruction ROM initialized by .mif file
256 x 16-bit data RAM initialized by .mif file
output connections for four 7-segment displays to represent data as 4 hexadecimal values
designed to run from a 12.5875 MHz clock signal (the 25.175 MHz on-board oscillator, halved)
uses 81% of the Altera FLEX10K20 logic cells and 100% of the embedded array blocks

Instruction Set

The uCMK microprocessor implements a set of eleven 16-bit instructions. These comprise arithmetic instructions (addition, subtraction, and comparison), memory instructions (loading from and storing to memory), and branch and data movement instructions. The complete instruction set is described in the table below.

*Instruction*	Syntax	*Description*	*Binary Encoding³*	*Cycles required to execute*
load from memory	load rx, ry	Loads data from memory address in lower order 8 bits of rx into register ry	0001xxxyyy000000	1 normally, 2 if the next instruction uses the result in ry⁵
store to memory or to displays	store rx, ry	Stores data in register ry to address in lower order 9 bits of rx. See section on Memory Addressing and Output for more information.	0010xxxyyy000000	1
add	add rx, ry	Adds the value in rx and ry and places the result in rx	0011xxxyyy000000	1
subtract	sub rx, ry	Subtracts the value in ry from rx and places the result in rx	0100xxxyyy000000	1
compare greater than or equal to	cmpge rx, ry	If the value in rx is greater than or equal to the value in ry, the C bit is set to 1. Otherwise, the C bit is set to 0	0101xxxyyy000000	1
compare not equal to	cnet rx, ry	If the value in rx is not equal to the value in ry, the C bit is set to 1. Otherwise, the C bit is set to 0.	0110xxxyyy000000	1
move immediate	movi rx, iiiiiiiiiiiiiiii¹	Moves the immediate value specified into register rx	0111xxx000000000 iiiiiiiiiiiiiiii⁴	2 normally, 3 if followed by a branch or compare⁶
branch	branch aaaaaaaaa²	Branch to the instruction located at address aaaaaaaaa	1000000aaaaaaaaa	2 if branch not taken, 3 if branch taken
move between registers	move rx, ry	Move the contents of ry to register rx	1001xxxyyy000000	1
end program	fini	End of program - suspends processor operation	1010000000000000	1
no operation	noop	No operation occurs	0000000000000000	1

Notes:

1. For the movi instruction, the 16-bit immediate value is represented as "iiiiiiiiiiiiiiii".
2. The address specified with the branch instruction is a 9-bit absolute value represented as "aaaaaaaaa". When an assembler is used, this would be a label, so that instruction addressing is transparent to the user.
3. Explanation of binary encoding:

xxx = three-bit value for rx -> r0 = 000, r1 = 001, ..., r7 = 111

yyy = three-bit value for ry -> r0 = 000, r1 = 001, ..., r7 = 111

aaaaaaaaa = 9-bit absolute address to branch to

iiiiiiiiiiiiiiii = 16-bit immediate value

4. The immediate value is actually the instruction word following the movi instruction, which allows for a full 16-bit immediate value. For example, movi r1, 0xF1FA would be encoded as 0111001000000000 for the first word and 1111000111111010 for the next word (the actual immediate value).
5. The extra clock cycle occurs because the assembler automatically inserts a noop, which allows extra time for the destination register ry to be written in this situation.
6. The extra clock cycle occurs because the assembler automatically inserts a noop to ensure that immediate values are not interpreted as instructions.
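The encoding rules above can be sketched in a few lines of Python. This is an illustration only and not part of the uCMK project (no assembler was written for it). The helper name `encode` and its keyword arguments are hypothetical. The opcodes and field positions are taken from the Binary Encoding column, and automatic noop insertion is not modelled.

```python
# Opcodes copied from the Binary Encoding column of the instruction table.
OPCODES = {
    "noop": 0b0000, "load": 0b0001, "store": 0b0010, "add": 0b0011,
    "sub": 0b0100, "cmpge": 0b0101, "cnet": 0b0110, "movi": 0b0111,
    "branch": 0b1000, "move": 0b1001, "fini": 0b1010,
}


def encode(mnemonic, rx=0, ry=0, imm=None, addr=0):
    """Return the 16-bit word(s) for one instruction (hypothetical helper)."""
    op = OPCODES[mnemonic]
    if mnemonic == "branch":
        # 1000000aaaaaaaaa: 9-bit absolute address in the low-order bits.
        if not 0 <= addr < 512:
            raise ValueError("branch address must fit in 9 bits")
        return [(op << 12) | addr]
    if not (0 <= rx < 8 and 0 <= ry < 8):
        raise ValueError("register numbers must be 0..7")
    # oooo xxx yyy 000000: opcode, rx, ry, then six unused bits.
    word = (op << 12) | (rx << 9) | (ry << 6)
    if mnemonic == "movi":
        # The immediate occupies the word that follows the movi word.
        if imm is None or not 0 <= imm <= 0xFFFF:
            raise ValueError("movi needs a 16-bit immediate")
        return [word, imm]
    return [word]


# The movi example from note 4.
for w in encode("movi", rx=1, imm=0xF1FA):
    print(format(w, "016b"))
```

The two printed words are the two encodings quoted in the note on the immediate value.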

Memory Addressing and Output

Two separate memory blocks are used, one for storing instructions and one for general-purpose data. The instruction memory is accessed using 9-bit addresses, and all addressing is absolute. The data address determines whether a store writes to RAM or to the output. If the most significant bit of the 9-bit address is '0' during a store instruction, the data is written to RAM, using the least significant 8 bits as the address. If the most significant bit is '1' during a store, the data is written to the output pins in seven-segment display format as 4 hexadecimal digits. For a load instruction, the low-order 8 bits of the specified address register are always used to access the RAM.
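The address decoding described above can be summarised with a short Python sketch. The function names `route_store` and `load_address` are hypothetical and used only for illustration. The actual decoding is done in hardware in the MEM stage.

```python
def route_store(rx_value):
    """Decide where a store goes, given the value in the address register rx."""
    address = rx_value & 0x1FF        # only the low-order 9 bits of rx are used
    if address & 0x100:               # most significant of the 9 bits is 1
        return ("display", None)      # data goes to the 7-segment outputs
    return ("ram", address & 0xFF)    # data goes to RAM at the 8-bit address


def load_address(rx_value):
    """A load always uses the low-order 8 bits of rx as the RAM address."""
    return rx_value & 0xFF


print(route_store(0x100))   # a store to the displays
print(route_store(0x0A5))   # a store to RAM address 0xA5
```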

IO Pins

The table below describes each of the input and output pins used on the UP-1 board. There are two inputs from push buttons to start and reset the processor. The clock signal from the oscillator on the UP-1 board is halved and sent to an output, which is fed back to the second clock pin on the FLEX10K20. Finally, there are 28 outputs for connection to cathode 7-segment displays for displaying data as 4 hexadecimal digits. A designation in parentheses after a pin number, for example (A21), indicates the FLEX_EXPAN hole number, where the letter indicates which FLEX_EXPAN is used.

*Pin Number*	*Input/Output*	*Signal Name*	*Description*
29	input	start_button	Connected to active-low pushbutton used to start processing - no connection to be done
28	input	reset_button	Connected to active-low pushbutton used to reset the uCMK - no connection to be done
91	input	double_clock	Clock signal provided by on-board oscillator (25.175 MHz) - no connection to be done
211 (C14)	input	clock	Clock signal used by uCMK (designated clock pin) - connect to pin 181
181 (C16)	output	clock_out	Halved clock signal which must be fed back in to designated clock pin - connect to pin 211
56 (A24)	output	display1_0	For segment a of first hex digit (4 highest order bits of data to be displayed)
61 (A25)	output	display1_1	For segment b
63 (A27)	output	display1_2	For segment c
65 (A29)	output	display1_3	For segment d
64 (A28)	output	display1_4	For segment e
55 (A23)	output	display1_5	For segment f
62 (A26)	output	display1_6	For segment g
67 (A31)	output	display2_0	For segment a of second hex digit
68 (A32)	output	display2_1	For segment b
71 (A34)	output	display2_2	For segment c
73 (A36)	output	display2_3	For segment d
72 (A35)	output	display2_4	For segment e
66 (A30)	output	display2_5	For segment f
67 (A31)	output	display2_6	For segment g
75 (A38)	output	display3_0	For segment a of third hex digit
76 (A39)	output	display3_1	For segment b
79 (A41)	output	display3_2	For segment c
81 (A43)	output	display3_3	For segment d
80 (A42)	output	display3_4	For segment e
74 (A37)	output	display3_5	For segment f
78 (A40)	output	display3_6	For segment g
83 (A45)	output	display4_0	For segment a of fourth hex digit
84 (A46)	output	display4_1	For segment b
87 (A48)	output	display4_2	For segment c
94 (A50)	output	display4_3	For segment d
88 (A49)	output	display4_4	For segment e
82 (A44)	output	display4_5	For segment f
86 (A47)	output	display4_6	For segment g
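As an illustration of how a 16-bit value maps onto the four displays, the following Python sketch splits a word into four hexadecimal digits, highest-order digit first (display1), and looks up the lit segments for each digit. The segment patterns are the conventional hexadecimal 7-segment shapes and are an assumption here. The report does not list the patterns or the output polarity used by the uCMK. The names `SEGMENTS`, `split_hex_digits` and `display_segments` are hypothetical.

```python
# Conventional lit segments (a..g) for each hex digit; assumed, not from the report.
SEGMENTS = {
    0x0: "abcdef", 0x1: "bc",     0x2: "abdeg",   0x3: "abcdg",
    0x4: "bcfg",   0x5: "acdfg",  0x6: "acdefg",  0x7: "abc",
    0x8: "abcdefg", 0x9: "abcdfg", 0xA: "abcefg", 0xB: "cdefg",
    0xC: "adef",   0xD: "bcdeg",  0xE: "adefg",   0xF: "aefg",
}


def split_hex_digits(word):
    """Return the four hex digits of a 16-bit word, display1 (highest) first."""
    return [(word >> shift) & 0xF for shift in (12, 8, 4, 0)]


def display_segments(word):
    """Return the lit segments for display1..display4."""
    return [SEGMENTS[digit] for digit in split_hex_digits(word)]


for number, segs in enumerate(display_segments(0xF1FA), start=1):
    print(f"display{number}: {segs}")
```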

Abstract

This document discusses the design of the uCMK, a load/store RISC microprocessor with five pipeline stages as in the DLX architecture, and a set of 11 instructions. To create the uCMK, we needed to determine how to transform our theoretical knowledge of the DLX architecture into a digital circuit realizable using VHDL and FPGAs. Many important design features were implemented, including the use of forwarding and stalling to control data and instruction flow through the pipeline stages. Our design also included the use of separate ROM and RAM memory blocks and the implementation of a general-purpose register file.

The details of all these design features are described in this report, along with descriptions of and solutions to specific design problems encountered. Explanations of experiments done during the design stage and the conclusions from these experiments are included. A general discussion of our testing method is found later in the document. Finally, complete design documentation and component interfacing is given.

As is shown in the document, our implementation of the uCMK worked very well. Our testing proved that our design met the requirements we initially designed for and our design choices worked well. The result is a simple microprocessor that can be used for many applications and can be expanded easily for more complex applications.

The uCMK Microprocessor

The uCMK microprocessor is a 16-bit load/store RISC machine with 16-bit instructions and 8 general-purpose registers. There are two separate memory blocks, one that stores instructions and one for general-purpose data use. It decodes and executes the following instructions, which are given by the user in binary format.

move to move values from register to register
movi to move immediate values to registers
load to load from memory to a register
store to store from a register to memory
add to add register values
sub to subtract register values
branch to branch to a different program location on C bit being set
cnet to compare register values to see if they are not equal to each other
cmpge to compare register values to see if they are greater than or equal to each other
noop for no operation
fini to indicate the end of the program

Each of these instructions is explained in the data sheet, where the syntax is also given.

The uCMK is programmed using binary files to initialize the instruction memory with the list of instructions to execute and the data memory with any data to initially be stored. The memory initialization files can be created with an assembler based on a simple text file containing instructions in ASCII format. When the uCMK is downloaded to the UP-1 board, the ROM and RAM are initialized with the values contained in these files.

After the uCMK is configured with the instructions and initial data the user can reset the processor to clear the register file and reset processing. This can be done any time during processing to reset operation as well. To start the processor, the start pushbutton is pressed and the processor will then start executing the instructions in memory.

Instruction addresses are transparent to the user, as labels will be used when programming, but the addresses for instructions are 9 bits. The data addresses are 9 bits but only 8 of those bits are used when writing to the data memory. If the most significant bit of the 9-bit address is 1 when a store instruction is executed, data is written to the output pins in cathode 7-segment display format instead of to memory. The output pins for connection to four 7-segment displays are explained in the data sheet. These displays are the main output interface of the processor and can be used to display data in any register. The final results of any processing can be written to the displays as well as any intermediate values.

Design Overview

The uCMK is pipelined to allow five instructions to execute at any given time, following the DLX architecture. The five stages of the pipeline are Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory (MEM), and Write Back (WB), and are shown on the Overall Diagram. Supporting hardware for program counter address calculation and for forwarding of data is also included. The pipeline registers between the stages hold instruction values and any operands required for the instruction to execute. Each instruction passes through every pipeline stage, though certain instructions may only affect certain stages. For example, the branch instruction doesn't perform any useful function after the Execute (EX) stage, since it doesn't modify memory or the register file. The function of each of the five stages of the processor is described in the Design Details section.

The two main features of our design are the use of forwarding and stalling. The pipeline registers are controlled by enables to allow the processor to stall. Processor stalling prevents certain pipeline registers from updating for one clock cycle, effectively freezing a portion of the microprocessor. In the uCMK, this function is used to prevent the processor from attempting to execute an incorrect instruction. Operand forwarding occurs when a current instruction requires the result of a previous instruction that has not yet been written to the register file. These features are discussed in the Design Details section as well.

In order to best implement the uCMK, we decided to break it up into a control path and a data path. The data path consists of the pipeline registers and the combinational logic for evaluating instructions. The control path contains the logic to control the progression of instructions through the registers (stalling) and the flow of data in the data path, including forwarding. During the implementation, we also decided to create a design hierarchy in which each stage in the data path contains the pipeline register that holds the instruction currently being evaluated by that stage, together with the combinational logic for evaluating the instruction. The control path contains a separate component for each of these stages and also includes a component that covers the overall control of processing through the enabling of all the registers.

Notes on the Assembler

We did not create an assembler for this project; instead, we converted instructions into binary in the memory initialization files by hand. The binary encoding for each instruction is given in the data sheet. We designed our microprocessor on the assumption that an assembler would be available, which let us solve some of our design problems more easily than a hardware implementation would have. However, we did not have enough time to write an assembler, because we felt that it was a lower priority than the report or having the VHDL code working properly. The design problems that an assembler is assumed to solve are discussed later, and their solutions would make up part of the requirements for an assembler design.

Design Details

The details of our design are given in this section. It includes discussions of general design decisions made and explanations of each of the pipeline stages. Complete explanations of forwarding and stalling are also given in this section. The problems we faced and the solutions we found are discussed for each point.

Instruction Set Selection

One of the first design decisions to be made was the selection of an instruction set. We needed to select a set of instructions that was useful but a reasonable size. The instructions we decided on are mentioned above and the details of the implementation of each instruction are given in subsequent sections, while the syntax and binary encoding of each instruction is given in the datasheet.

We chose these basic instructions so as to limit the user as little as possible while keeping a reasonable number of instructions to implement. From these basic instructions, the user can build many operations. The add and subtract instructions can be used in loops to multiply or divide. Writing to memory or to external hardware is implemented with a single instruction (store), which keeps the number of instructions down. The compare instructions we chose cover all the possibilities. For a set of two numbers, the first can either be greater than, equal to, or less than the second. The cmpge instruction will be true if the first value is greater than or equal to the second value. It will be false if the first value is less than the second value. The cnet instruction can then be used to further narrow down the comparison. If the two values are not equal, this will return true. Therefore, if the cmpge instruction returns true and the cnet instruction also returns true, the first value is greater than the second; if cmpge returns true and cnet returns false, the two values are equal.
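The way the two compare instructions combine can be checked with a small Python sketch. It follows the data sheet definitions (the C bit is 1 when the stated condition holds). The function `relation` is a hypothetical helper, and plain Python integers stand in for register values, since the report does not say whether comparisons are signed or unsigned.

```python
def cmpge(rx, ry):
    """C bit after cmpge rx, ry: 1 if rx >= ry, otherwise 0."""
    return 1 if rx >= ry else 0


def cnet(rx, ry):
    """C bit after cnet rx, ry: 1 if rx != ry, otherwise 0."""
    return 1 if rx != ry else 0


def relation(rx, ry):
    """Classify rx against ry using only the two compare results."""
    ge = cmpge(rx, ry)
    ne = cnet(rx, ry)
    if ge and ne:
        return "greater"
    if ge:
        return "equal"
    return "less"


for pair in [(5, 3), (4, 4), (2, 9)]:
    print(pair, relation(*pair))
```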

During testing of our microprocessor we found this instruction set to be fairly good. It gave us many options, and the small number of instructions did not make writing programs tedious.

Use of Generics

The use of generics was very important to our implementation of the uCMK and resulted in a very flexible design. Our design is set up so that the generics used propagate from the highest-level ucmk entity to the lowest level components. The package constants_pkg contains the global settings that these generics are mapped to at the highest level. A change to a constant in this file would be applied throughout the design.

We set up our code to have separate generics for the instruction width and data width. Currently they are the same size, but they can be changed to customize the uCMK to the user's needs. We also have separate generics for the widths of the instruction and data addresses. This is a very useful feature because the size of the memory blocks can be changed if necessary without needing to change lower-level components.

We already have found this extra flexibility very useful. For example, we were originally unsure if our design would fit on the FLEX10K20, so it was important for us to be able to change the size of any features, such as the register file or RAM, in order to give more room for the logic.

In the end, we did not need to do this, because the number of logic cells decreased when we changed a major part of the implementation. However, if we had needed to, it would have been very simple to do by just changing the global constant value in the package.

Pipeline Stages and Control of Flow

The details of each of the pipeline stages are discussed below and, later on, the use of forwarding and stalling is examined.

Instruction Fetch Stage (IF)

This stage is responsible for reading the next instruction from the instruction memory block (constructed from the lpm_rom component). The memory consists of 512 16-bit words, which means that there is a maximum of 512 instructions at the current size. If necessary, the memory can be expanded up to 4096 16-bit words, if more memory is available on the FPGA, without any changes to the control or data path of the processor. The limiting factor here is the fact that branch instructions include an absolute destination address in the 16-bit instruction. Since the branch opcode occupies 4 bits, 12 bits are left for the absolute address, and 12 bits can address 4096 locations.

The only input to this stage is the Program Counter address, which is used to address the instruction ROM. The instruction memory block is separate from the data memory block, since we chose to use a Harvard architecture. This was necessary to avoid memory access conflicts that are possible when the data and instructions are stored in the same memory block. Conflicts would occur if the memory was not separate when a load or store instruction tried to access the memory block while an instruction is being read from the memory block.

Instruction Decode Stage (ID)

The input to the register block is set in the WB stage, and includes an 8 bit enable vector (to select the register that will receive the data), and a 16 bit data word.

The instruction decode stage is responsible for pre-processing of instructions. This section consists of the general-purpose register block (8 x 16-bit registers) and the logic to select the necessary operands from the register block for the next stage. The final design of this logic is discussed in the Experiments section. Partial processing of branch instructions is completed in this stage. Branch instructions are evaluated for the first time here to calculate the next instruction address as early as possible. This is discussed in the PC Address Calculation section.

Execution Stage (EX)

The inputs to this stage include the A and B input register values (from the selected registers in the register block), the forwarded values for the A and B inputs, the immediate number, and several mux select signals from the control path.

The Execution stage is responsible for the majority of the calculations necessary for each instruction. Arithmetic instructions are computed here, compare instructions are evaluated here, and branch instructions are evaluated for the second time. This stage consists of a 16-bit adder/subtractor (implemented using the lpm_add_sub function) and a 16-bit comparator (implemented using the lpm_compare function). There are also two 16-bit muxes to select the inputs to the adder and the comparator, and a 2-input, 1-bit mux to choose the source of the C bit from the comparator.

The purpose of the input muxes is to choose the value used for each of the two operands (A and B) of the adder/subtractor and comparator. For the A input, there are two possible inputs to the mux: the value from the register block in the ID stage, and a forwarded value. For the B input, there are three: the value from the register block in the ID stage, the forwarded value, and the immediate value from the ID stage. The control path for the EX stage sets the select signals for the two input muxes, as well as the select signals for the forwarding muxes. The 16-bit outputs from the two input muxes are inputs to both the adder and the comparator. The output from the A mux is also connected directly to the address output of this stage, and the output of the B mux is connected directly to the data output of this stage.

The output from the adder/subtractor is connected to the ALU output of this stage, and the function is set to either add or subtract by the control path according to the current instruction. The comparator performs all compare functions, but only the outputs for 'greater than or equal to' and 'not equal to' are used. These signals are inputs to a 2-to-1 mux, the output of which is connected to the input of the C bit register. The mux select and the C bit register enable are both set by the control path based on the current instruction.

Halving of the Clock

During the design stage, we found that the EX stage has the longest minimum clock period, approximately 63 ns. This critical path results from the use of forwarding: the delay comes from data going through the WB_muxes, then the ALU input muxes, and finally the adder/subtractor. Since 63 ns is longer than the period of the 25.175 MHz on-board oscillator (about 39.7 ns), the processor cannot run directly from that clock. We had already minimized the delay, and since forwarding occurs quite often during regular program execution, we were forced to halve the clock signal. This was accomplished by taking the original clock signal and defining a new clock signal that changes only on the rising edge of the original clock signal. This new clock was then sent to an output, which was connected to the secondary clock input. All this was done in the highest level of the design, in ucmk.vhd.
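The timing argument and the divide-by-two behaviour can be checked numerically. The Python sketch below is only a behavioural model, not the VHDL in ucmk.vhd. The names `period_ns` and `halve_clock` are hypothetical. The oscillator frequency and the 63 ns figure are taken from the text.

```python
OSC_MHZ = 25.175          # on-board oscillator frequency from the data sheet
CRITICAL_PATH_NS = 63.0   # approximate minimum clock period of the EX stage


def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def halve_clock(samples):
    """Toggle the output on every rising edge of the sampled input clock."""
    out, prev, result = 0, 0, []
    for level in samples:
        if prev == 0 and level == 1:   # rising edge of the original clock
            out ^= 1
        result.append(out)
        prev = level
    return result


print(round(period_ns(OSC_MHZ), 2), "ns at full speed")
print(round(period_ns(OSC_MHZ / 2), 2), "ns after halving")
print(halve_clock([0, 1, 0, 1, 0, 1, 0, 1]))
```

The full-speed period is shorter than the critical path, while the halved clock leaves margin, which is consistent with the decision described above.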

Memory Stage (MEM)

The inputs to this stage are the 16-bit ALU output from the EX stage, the 16-bit Data output from the EX stage, and the 9-bit address output from the EX stage.

The memory stage is responsible for reading from and writing to the data memory block. This stage consists of the memory block (256 x 16 bit) and a mux on the data lines to bypass the memory when necessary. The memory was constructed using the lpm_ram component. The MEM stage also writes to the 7-segment displays using a store instruction. This is controlled by the address input: if the most significant bit is a 1, the data register contents are sent to the 7-segment displays. Otherwise, the data register contents are written to the proper address in the memory block.

Most instructions do not modify the memory contents, so there is a path for the data register or ALU output register contents to bypass the memory block entirely. To control whether the data register contents or the ALU output register contents are passed to the next stage, both 16-bit signals are inputs to a 2-to-1 mux. The output of this mux is the 16-bit data value that is passed on to the next stage (WB). The 16-bit data output from the memory block is also passed to the next stage.

This stage also sets an 8-bit enable vector for the WB stage. This vector indicates the register to which the WB stage will write its data. The vector is set by the control path based on the current instruction and destination register code.

The use of address decoding along with the use of generics in this stage makes it easy to add more external hardware to the uCMK. The width of the address sent to the memory stage can be increased to up to 16 bits, and various settings of the higher-order bits can be used to control selection of external hardware.

Write Back Stage (WB)

The inputs to this stage are the data output from the MEM stage, the memory output from the MEM stage, and the 8 bit register block enable from the MEM stage.

The write back stage is responsible for writing the result of any operation back to the proper register in the general purpose register block of the ID stage. This stage consists of one 16-bit 2 to 1 mux that selects either the data output from the MEM stage, or the memory output from the MEM stage. At this point, the data output holds the result of an add, subtract, move or movi operation, and the memory output holds the result of a load operation. The mux select is set by the control path, based on the current instruction. The data output and the 8-bit register block enable vector are connected to the inputs of the ID stage. The 8-bit register block enable vector is taken directly from the output of the MEM stage.
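A behavioural sketch of the write-back path may help to show how the enable vector and the mux work together. It is written in Python for illustration only. The names `register_enable` and `write_back` are hypothetical, and the assumption that bit i of the enable vector selects register ri is ours, since the report does not give the bit ordering.

```python
def register_enable(dest_reg, writes_register):
    """8-bit enable vector: one bit set for the destination register, or none."""
    if not writes_register:
        return 0
    return 1 << dest_reg            # assumed ordering: bit i enables ri


def write_back(registers, enable, data_out, mem_out, select_memory):
    """Return the register file contents after one write-back.

    select_memory models the WB mux select: True for a load (memory output),
    False for add, sub, move and movi (data output).
    """
    value = mem_out if select_memory else data_out
    return [value if (enable >> i) & 1 else old
            for i, old in enumerate(registers)]


regs = [0] * 8
regs = write_back(regs, register_enable(1, True), 0x1234, 0xBEEF, True)
print([hex(r) for r in regs])
```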

Forwarding

Since our processor is pipelined, we need to use forwarding to handle a series of instructions that all use the same registers. The third pipeline stage, EX, is the stage that performs all calculations based on register inputs. However, since each instruction requires five clock cycles to complete execution, results of previous instructions may not be available in the register file for the EX stage to work with. Forwarding the result of the previous instruction directly to the ALU input by overriding the registered input value solves this problem.

Since there are two pipeline register stages following the EX stage, there are two possible stages to be considered as a source for forwarding: Memory (MEM) and Write Back (WB). Forwarding is accomplished by checking the instructions in the MEM and WB stages to see if they are instructions that would have modified a register (Addition, Subtraction, Move, Move Immediate, and Load). If the instruction in the MEM or WB stage is one of these instructions, the register that was modified is compared with the two registers currently being used in the EX stage. If this register matches either of the registers currently being used, the value from the MEM or WB stage is forwarded to the appropriate register in the EX stage.

The logic necessary for forwarding is relatively simple: a pair of 3-input 16-bit muxes, one for each input to the EX stage. The select signals for these muxes are set by the EX stage control path to choose the source of the forwarded value. The forwarded value is either an immediate value from the MEM stage, an ALU output from the MEM stage, or the output from the WB stage. There are many cases to consider for forwarding, some of which were neglected or incomplete until problems became apparent with extensive testing. The following are descriptions of each of the different classes of forwarding.

Either Register Input to EX Stage Forwarded from the MEM Stage:

If one of the two register inputs in the EX stage was modified by the instruction in the MEM stage, the value must be forwarded from that stage. The value forwarded in this case is either the output from the ALU (add or subtract), or a data value that bypassed the ALU entirely (move or move immediate instruction). If the instruction in the WB stage also modified the same register, the value from the MEM stage will take precedence and be forwarded. The value in the MEM stage must take precedence since it was modified more recently than the value in the WB stage.

Either Register Input to EX Stage Forwarded from the WB stage:

If one of the two register inputs in the EX stage was modified by the instruction in the WB stage, the value must be forwarded from that stage. The value forwarded in this case is the output being sent back to the register file as the result of an instruction (load, add, subtract, move, and move immediate instructions). Once again, if the instruction in the MEM stage also modified the same register as the instruction in the WB stage, the value in the MEM stage will take precedence and be forwarded.

One Register Input to EX Stage Forwarded from the MEM Stage, the other Register Input to EX Stage Forwarded from the WB Stage:

If one of the two register inputs in the EX stage was modified by the instruction in the WB stage, and the other of the two register inputs was modified by the instruction in the MEM stage, then both values must be forwarded to the respective registers. The values forwarded in this case are the same as those above for the MEM and WB stages.
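The precedence rule for these three cases can be written as a small selection function. The Python sketch below is a simplified model, not the control path VHDL. The name `forward_source` and the (mnemonic, destination register) tuples are hypothetical. Following the value descriptions above, a load is treated as forwardable only from the WB stage, on the assumption that the assembler's noop separates a load from an instruction that uses its result.

```python
# Instructions whose result can be forwarded from each stage (see text above).
MEM_FORWARDABLE = {"add", "sub", "move", "movi"}
WB_FORWARDABLE = MEM_FORWARDABLE | {"load"}


def forward_source(reg, mem_instr, wb_instr):
    """Choose the source for one EX-stage register input.

    mem_instr and wb_instr are (mnemonic, dest_reg) tuples, or None when the
    stage holds nothing relevant. Returns "MEM", "WB", or "REG" (register file).
    """
    if mem_instr and mem_instr[0] in MEM_FORWARDABLE and mem_instr[1] == reg:
        return "MEM"                 # most recent value takes precedence
    if wb_instr and wb_instr[0] in WB_FORWARDABLE and wb_instr[1] == reg:
        return "WB"
    return "REG"


# cnet r1, r2 in EX, sub r2, r1 in MEM, add r1, r2 in WB.
mem, wb = ("sub", 2), ("add", 1)
print("r1 from", forward_source(1, mem, wb))
print("r2 from", forward_source(2, mem, wb))
```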

The Source Register in the EX Stage for a Move Instruction Forwarded from the MEM or WB Stages:

Another special case occurs when the instruction in the EX stage is a move instruction. This instruction differs from others since only one of the registers is required to have the correct value. The source register must be correct, since it contains the value that will be moved into the destination register. The destination register, however, is not required to have the correct value, since it will be overwritten by the value in the source register. For this reason, only the source register may be overwritten by a forwarded value, which simplifies the checking process.

The C bit for the ID Stage Forwarded from the EX Stage:

This case of forwarding is required to minimize the delay for a Branch instruction. A Branch instruction must be evaluated as early as possible to allow the Program Counter address calculation logic to calculate the address of the next instruction. If a branch is taken, the next instruction will be found at the address contained in the Branch instruction; otherwise, the next instruction will be found at the current program counter address incremented by 1. The earliest possible opportunity to calculate the result of a Branch is in the ID stage, since this is the first stage in which the actual instruction is known. If the C bit is set by a compare instruction in the EX stage and there is a Branch instruction in the ID stage, the new C value must be forwarded to the ID stage to allow the Branch result to be calculated.

Examples of each case are included below. The table shows each instruction as it moves through the pipeline, with time increasing downward. Please see the descriptions above for each of the five types of forwarding.

Table 1: Forwarding Examples

Case #	Clock Pulse	Forwarding Required	IF Stage	ID Stage	EX Stage	MEM Stage	WB Stage
1	1		Add r1, r2	Move r1, r2
	2		Sub r1, r3	Add r1, r2	Move r1, r2
	3	r1 from MEM to EX.		Sub r1, r3	Add r1, r2	Move r1, r2
	4	r1 from MEM to EX.			Sub r1, r3	Add r1, r2	Move r1, r2
	5					Sub r1, r3	Add r1, r2
2	1		Add r1, r2
	2		NoOp	Add r1, r2
	3		Store r2, r1	NoOp	Add r1, r2
	4			Store r2, r1	NoOp	Add r1, r2
	5	r1 from WB to EX.			Store r2, r1	NoOp	Add r1, r2
3	1		Add r1, r2
	2		Sub r2, r1	Add r1, r2
	3		Cnet r1, r2	Sub r2, r1	Add r1, r2
	4			Cnet r1, r2	Sub r2, r1	Add r1, r2
	5	r1 from WB to EX. r2 from MEM to EX.			Cnet r1, r2	Sub r2, r1	Add r1, r2
4	1		Add r1, r2
	2		Move r2, r1	Add r1, r2
	3			Move r2, r1	Add r1, r2
	4	r1 from MEM to EX.			Move r2, r1	Add r1, r2
	5					Move r2, r1	Add r1, r2
5	1		Cmpge r2, r1
	2		Branch Loop	Cmpge r2, r1
	3	C bit from EX to ID		Branch Loop	Cmpge r2, r1
	4				Branch Loop	Cmpge r2, r1
	5					Branch Loop	Cmpge r2, r1
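The stage occupancy shown in a table of this kind can be generated mechanically when no stalls occur. The Python sketch below is an illustration under that assumption. The name `pipeline_trace` is hypothetical, and the first instruction is taken to enter the IF stage on clock pulse 1.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_trace(program, clocks):
    """Return, for each clock pulse, a dict mapping stage name to instruction.

    Assumes no stalls: instruction number i (counting from 0) is in stage
    number s on clock pulse i + s + 1.
    """
    rows = []
    for clock in range(1, clocks + 1):
        row = {}
        for s, stage in enumerate(STAGES):
            index = clock - 1 - s
            if 0 <= index < len(program):
                row[stage] = program[index]
        rows.append(row)
    return rows


program = ["add r1, r2", "noop", "store r2, r1"]
for clock, row in enumerate(pipeline_trace(program, 5), start=1):
    print(clock, row)
```

On the fifth clock pulse, the store is in EX while the add is in WB, which is the situation in which r1 is forwarded from WB to EX.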

Starting, Resetting and Stalling the Pipeline

The progress of the microprocessor is controlled by the pipeline register enables - when the enables are set low, none of the registers update, and the pipeline is stalled. The enables are all set on the falling clock edge. This ensures that all enables will be set in time to affect the proper register on the next rising clock edge.

Resetting the processor is accomplished by pressing the Reset button, which sets all of the pipeline register enables low. This can be done any time during execution of a program. Reset also sets the contents of all the registers to 0, which corresponds to a "No Operation" instruction for the instruction register. This ensures that the processor will not be executing any instructions before the Start input is received. To begin execution of the program in memory, the Start button must be pressed. At this time, all pipeline register enables are set high, and the first instruction is read into the IF/ID register from address 0.

Stalling may be required in different situations during execution of the program. There are three types of stalls, each of which is described below in order of precedence. The first stall, Branch True stall, has the highest priority. This means that if the conditions are present for this stall to occur, and the conditions are also present for one of the other stalls to occur, the Branch True stall will be given priority for setting pipeline register enables.

The program will finish execution when it reaches a "fini" instruction. Problems encountered with the implementation of this instruction are also described below.

Branch True Stall:

This stall has the highest priority. It occurs when there is a Branch instruction in the Execute stage and the branch is evaluated to be true. The branch result is not calculated until the ID stage, and by that time the address of the instruction immediately after the Branch instruction has already been loaded into the PC. This instruction address may be incorrect, depending on the result of the Branch instruction. For this reason, when the Branch instruction moves on to the EX stage, it is evaluated again. If the result of the branch is true, the instruction immediately following the Branch instruction is known to be the incorrect instruction. To overwrite this instruction, the ID/EX, EX/MEM, and MEM/WB pipeline register enables are set low for one clock pulse. This allows the PC and IF/ID registers to update, but holds the instructions in the other pipeline registers. The instruction in the IF/ID register would normally move on to the ID/EX register on the rising clock edge, but since the ID/EX register enable is set low, it has nowhere to go and is overwritten by the next instruction. The overwritten instruction is the incorrect one, and the new instruction read is the one pointed to by the branch statement. If the Branch statement in the EX stage is found to be false, no stall occurs, and the program continues without interruption.

Branch Timing Stall:

This stall occurs when a branch instruction is in the ID stage. This stall has higher priority than the Move Immediate stall, but has lower priority than the Branch True stall. The purpose of this stall is to provide an extra clock cycle to allow the calculation of the C bit in the Execute stage, and then allow the PC address calculation logic to finish calculating the next PC address. This is accomplished by setting all five of the pipeline register enables low, which freezes the processor for one clock cycle. After one clock cycle, all of the register enables will be set high again, and the processor will continue with the next instruction.

Move Immediate Stall:

This stall occurs when a Move Immediate instruction is in the MEM stage. A Move Immediate instruction is followed by the 16-bit immediate number, which is read as the next instruction. The immediate number is not stored as data until the Move Immediate instruction reaches the MEM stage. At this point, the immediate number is stored in the data register of the EX/MEM pipeline register. It is possible that the immediate number may be interpreted by the processor as a store instruction that could alter memory in the MEM stage. To prevent this, the immediate number is overwritten in the EX stage. This is accomplished by setting the enables for the EX/MEM and MEM/WB pipeline registers low for one clock cycle, which prevents them from updating on the next rising clock edge. Since the immediate value is in the ID/EX pipeline register, it is overwritten by the next instruction. Since the EX/MEM pipeline register cannot update, the immediate value is deleted from the pipeline. The register enables are set high again on the next clock cycle, and the processor continues with the next instruction.
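
As a rough illustration of the three stalls and their priority, the following Python sketch models which pipeline register enables are held low in each case. It is a behavioural model only, not the VHDL control path; the function name, argument names, and register labels are assumptions chosen for this example.

```python
# Behavioural sketch of the stall priority logic (not the actual VHDL control path).
# Register labels and argument names are illustrative assumptions.
REGISTERS = ("PC", "IF/ID", "ID/EX", "EX/MEM", "MEM/WB")


def pipeline_enables(branch_true_in_ex, branch_in_id, movi_in_mem):
    """Return a dict mapping each pipeline register to its enable (True = high)."""
    if branch_true_in_ex:
        # Branch True stall: hold ID/EX, EX/MEM and MEM/WB; PC and IF/ID update.
        held = {"ID/EX", "EX/MEM", "MEM/WB"}
    elif branch_in_id:
        # Branch Timing stall: freeze all five registers for one cycle.
        held = set(REGISTERS)
    elif movi_in_mem:
        # Move Immediate stall: hold EX/MEM and MEM/WB only.
        held = {"EX/MEM", "MEM/WB"}
    else:
        # No stall condition present: every register updates normally.
        held = set()
    return {name: name not in held for name in REGISTERS}
```

Calling pipeline_enables(True, False, True), for instance, gives the Branch True pattern, because that stall outranks the Move Immediate stall when both conditions are present.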

Fini Instruction:

This instruction serves to disable the microprocessor when it has reached the end of the current program. The instruction will disable the registers as it passes through them, so that all instructions currently in the pipeline will have a chance to complete operation before the pipeline is disabled. The three stalls cause problems in this situation, since they can conflict with the register settings required by the Fini instruction. For this reason, if there is a stall currently in the pipeline, it will be given priority over the Fini instruction to set the pipeline register enables. This is necessary to allow the three stalls to continue operation as normal. Once the stall has passed, the Fini instruction will be given priority to set the enables low.

The table below shows examples of each of the three types of stalls. Time increases downward. A stall is denoted by an asterisk (*) beside the instruction. See above for detailed descriptions of each stall.

Table 2: Stalling Examples

Case #	Clock Pulse	Stall	IF Stage	ID Stage	EX Stage	MEM Stage	WB Stage
1 - Branch True Stall	1		Load r1, r3	Branch	Cnet r1, r2	NoOp
	2	Stall	Store r1, r3	Load r1, r3	Branch*	Cnet r1, r2*	NoOp*
	3		NoOp	Store r1, r3	Branch	Cnet r1, r2	NoOp
	4			NoOp	Store r1, r3	Branch	Cnet r1, r2
	5				NoOp	Store r1, r3	Branch
2 - Branch Timing Stall	1		Branch	Cmpge r1,r2	NoOp
	2	Stall	Add r1, r2*	Branch*	Cmpge r1,r2*	NoOp*	NoOp*
	3		Add r1, r2	Branch	Cmpge r1,r2	Cmpge r1,r2	NoOp
	4		NoOp	Add r1, r2	Branch	Cmpge r1,r2	Cmpge r1,r2
	5		NoOp	NoOp	Add r1, r2	Branch	Cmpge r1,r2
3 - Move Immed.Stall	1		Immed. #	Movi r1	NoOp
	2		Sub r2, r1	Immed. #	Movi r1	NoOp
	3	Stall	Store r5, r7	Sub r2, r1	Immed. #	Movi r1*	NoOp*
	4		NoOp	Store r5, r7	Sub r2, r1	Movi r1	NoOp
	5			NoOp	Store r5, r7	Sub r2, r1	Movi r1

Comparing, Branching and Program Counter Address Calculation

A branch instruction is evaluated by checking the C bit. The C bit is stored in a 1-bit register, and the enable to this register is only set high when there is a compare instruction in the EX stage (that is, held in the ID/EX pipeline register). This ensures that only a compare instruction sets the C bit. In the original design, the C bit was computed by the control path and then sent to the C bit register. This presented timing problems (discussed in detail in the Experiments section), so an alternative solution was found. The lpm_comparator function was used to implement the necessary compare functions in the data path; the control path selects the proper lpm_comparator output to set the C bit based on the compare instruction, and enables the C bit register.

The Program Counter (PC) register holds the address of the next instruction in the instruction memory. The program counter address is normally calculated by simply adding 1 to this address. This requires a single 8 bit adder. However, when there is a branch instruction that is found to be true in the ID stage, the PC address is loaded directly from the branch instruction. The first design attempt for this section used an offset address stored in the Branch instruction, rather than an absolute address. This method required an additional adder to add the offset to the PC address. However, since the instruction memory address is only 8 bits, all possible instruction addresses can be reached with an absolute address, making an offset unnecessary.
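
The program counter update described above can be sketched in a few lines of Python. This is only a behavioural illustration under stated assumptions: the function and argument names are hypothetical, and the wrap-around at the top of the 8-bit address range is assumed rather than taken from the design.

```python
PC_MASK = 0xFF  # 8-bit instruction address, as stated in the text


def next_pc(pc, branch_in_id, c_bit, branch_target):
    """Return the next program counter value.

    pc            -- current instruction address
    branch_in_id  -- True when a Branch instruction is in the ID stage
    c_bit         -- current value of the C bit (1 means the branch is true)
    branch_target -- absolute address carried in the Branch instruction
    """
    if branch_in_id and c_bit:
        # Taken branch: load the absolute address directly, no offset adder.
        return branch_target & PC_MASK
    # Normal flow: a single 8-bit increment (wrap-around is an assumption).
    return (pc + 1) & PC_MASK
```

Because the branch carries an absolute address, the only arithmetic in this path is the increment, which matches the single adder mentioned in the text.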

Assembler Inserted Non-operation Instructions

There are three instances where instructions will interfere with one another as they pass through the pipeline. To solve this problem it becomes necessary to space instructions apart within the pipeline. Two ways to space instructions apart are partial pipeline stalling, and inserting a NoOp between the two instructions. In these three cases, pipeline stalling is more difficult to implement than inserting a NoOp instruction, since pipeline stalling is realized using hardware and inserting a NoOp can be accomplished by assembler software. Each of the three situations is described in detail below.

Load Instruction:

This is the case where a load operation is followed immediately by an operation that uses the result of the load. The problem occurs when the load instruction is in the MEM stage, and the instruction that requires the result of the load is in the EX stage. In this case, forwarding cannot be used to move the correct value to the instruction following the load, since the load result will not be retrieved from memory until the end of the MEM stage. Once the load instruction moves on to the WB stage (along with the resulting data), forwarding can be used to move the correct value to the EX stage. However, if there was an instruction that required the load result immediately after the load instruction, it has now moved on to the MEM stage, and did not use the correct value in the EX stage. To correct this, a Non-Operation instruction must be inserted after the load instruction. In this way, we can be sure that there will not be an instruction in the EX stage that requires the result of a load in the MEM stage.

There are two ways to insert the NoOp instruction. The first is realized in hardware, and involves stalling the instruction in the EX stage for one clock cycle if it requires the result of a load instruction in the MEM stage. The load instruction is then allowed to move on to the WB stage on the next rising clock edge, where the load result can be forwarded back to the EX stage. A NoOp instruction must then be written into the vacant spot in the MEM stage to ensure that no unexpected operations will be performed. This method requires hardware to be dedicated to the stalling and the NoOp insertion.

The second method of inserting the NoOp is simpler. In this case, the assembler inserts the NoOp automatically into the coded program when necessary. If there is a load instruction followed directly by an instruction that requires the load result, the assembler places a NoOp instruction after the Load instruction. This method is functionally equivalent to the first method, but no extra hardware is needed for its realization. For this reason, this method was implemented in our design.
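
A minimal sketch of such an assembler pass is shown below in Python. It is not the project's assembler: the program representation (a list of mnemonic and operand tuples) and the function name are assumptions made for this example, and the pass is conservative in that any mention of the loaded register by the next instruction is treated as a use.

```python
def insert_load_noops(program):
    """Insert a NoOp after each load whose result the next instruction uses.

    program -- list of (mnemonic, operands) tuples,
               e.g. ("load", ("r2", "r1")) for "load r2, r1".
    """
    result = []
    for index, (mnemonic, operands) in enumerate(program):
        result.append((mnemonic, operands))
        if mnemonic != "load" or index + 1 >= len(program):
            continue
        loaded_register = operands[1]  # "load rx, ry" writes register ry
        _, next_operands = program[index + 1]
        if loaded_register in next_operands:
            # The next instruction would reach EX while the load is in MEM.
            result.append(("noop", ()))
    return result
```

For the sequence load r2, r1 followed by add r1, r2, the pass emits load, noop, add, so the load has reached the WB stage by the time the add is in the EX stage and the result can be forwarded.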

Move Immediate Instruction followed by a Compare Instruction:

When a Movi instruction is followed by a Compare instruction, protecting the C bit register from being erroneously updated becomes a problem. A Move Immediate instruction is actually read from memory as two separate words: a Movi identifier and the 16-bit immediate number. Since the immediate number is first read into the instruction register, we must ensure that the processor does not interpret the immediate number as an instruction and change any memory elements accordingly. This is discussed above in the Starting, Resetting and Stalling the Pipeline section, which deals with removing the immediate value from the pipeline before it reaches the MEM stage, so that the working memory block is not written to. However, the EX stage also contains a memory element, the C bit register. Since the immediate value is not properly stored until the Movi instruction reaches the MEM stage, the immediate value is read into the EX instruction register. At this point, if the processor interprets the immediate value as a compare instruction, the C bit could be modified incorrectly. To correct this, the C bit enable is not set high when there is a Movi instruction in the MEM stage. There is a problem with this logic, however: if a Movi instruction is followed directly by a legitimate compare instruction, then when the immediate number is overwritten by the compare instruction, the C bit register enable will still be set low. This is very difficult to correct using hardware, but the assembler can insert a Non-Operation instruction between the Move Immediate and Compare instructions quite easily. Therefore, when the assembler reads a Move Immediate instruction followed by either compare instruction, it automatically inserts a Non-Operation instruction between the two.
This solution results in a loss of one clock cycle when a Movi instruction is followed by a Compare, but this was justified by avoiding the complicated logic involved in the verification of a Compare instruction following a Movi instruction. The logic required in this case is complicated by the fact that the control path is not clocked, so it becomes difficult to make a distinction between valid and invalid Compare instructions.

Move Immediate Instruction Followed by a Branch Instruction

When a Move Immediate instruction is followed by a Branch instruction, stalling priority becomes a problem. When the Movi instruction reaches the MEM stage, the immediate value is in the Execute stage, and the Branch instruction is in the ID stage. This causes a conflict, since the Movi instruction and the Branch instruction will attempt to stall the pipeline in different ways. Since the Branch stall has higher priority, the Movi stall does not occur. The Branch stall then automatically re-enables all of the pipeline registers, which prevents the immediate value from being overwritten. Once again, this is difficult to correct logically, since the control path has no memory of stalls that have occurred. To avoid complicated logic solutions, this conflict was resolved by inserting a Non-Operation instruction between the Movi instruction and the Branch instruction. This was once again implemented within the assembler software. This solution ensures that a Movi instruction will never be followed directly by a Branch instruction.
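
The two Move Immediate rules can be expressed as one small assembler check, sketched here in Python. The representation is an assumption for illustration: each list entry is a (mnemonic, operands) tuple, and a single "movi" entry stands for both the Movi word and its immediate word, so the NoOp lands after the immediate number. The function and constant names are hypothetical.

```python
# Instructions that must not directly follow a Move Immediate,
# according to the two cases described in the text.
NEEDS_GAP_AFTER_MOVI = {"cmpge", "cnet", "branch"}


def insert_movi_noops(program):
    """Insert a NoOp between a movi and a following compare or branch.

    program -- list of (mnemonic, operands) tuples; one "movi" entry
               represents the Movi word plus its 16-bit immediate word.
    """
    result = []
    for index, (mnemonic, operands) in enumerate(program):
        result.append((mnemonic, operands))
        if mnemonic == "movi" and index + 1 < len(program):
            next_mnemonic = program[index + 1][0]
            if next_mnemonic in NEEDS_GAP_AFTER_MOVI:
                result.append(("noop", ()))
    return result
```

Handling both cases in one pass keeps the rule in a single place: any instruction that either reads the C bit register enable or triggers a competing stall is simply kept one slot away from the immediate word.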

The following table shows examples of the above three cases. For each case, the problem is illustrated by showing the instructions in each stage without the necessary Non-Operation instruction, followed by the solution with the Non-Operation instruction inserted. Please see above for detailed descriptions of the three situations.

Table 3: Inserting Non-operation Examples

Case #	Clock Pulse	Description	IF Stage	ID Stage	EX Stage	MEM Stage	WB Stage
1 - Load instruction	1		Add r1, r2	Load r2,r1
	2		Store r1,r2	Add r1, r2	Load r2,r1
	3	Load has not read r1 from memory yet, so it can't be forwarded.		Store r1,r2	Add r1, r2	Load r2,r1
	4	Add has old value of r1.			Store r1,r2	Add r1, r2	Load r2,r1
	5					Store r1,r2	Add r1, r2
Solution	1		NoOp	Load r2,r1
	2		Add r1, r2	NoOp	Load r2,r1
	3		Store r1,r2	Add r1, r2	NoOp	Load r2,r1
	4	r1 can now be forwarded		Store r1,r2	Add r1, r2	NoOp	Load r2,r1
	5				Store r1,r2	Add r1, r2	NoOp
2 - Movi followed by Cmpge or Cnet	1		Immed(15 -12) = Cnet	Movi r2
	2		Cnet r2,r6	Immed. #	Movi r2
	3		Add r1, r2	Cnet r2,r6	Immed. # (C_reg_enable = 0)	Movi r2*
	4	Cnet instr. is ignored because of the Movi instr.		Add r1, r2	Cnet r2,r6 (C_reg_enable = 0)	Movi r2
	5				Add r1, r2	Cnet r2,r6	Movi r2
Solution	1		Immed(15 -12) = Cnet	Movi r2
	2		NoOp	Immed. #	Movi r2
	3		Cnet r2,r6	NoOp	Immed. # (C_reg_enable = 0)	Movi r2*
	4		Add r1, r2	Cnet r2,r6	NoOp	Movi r2
	5	Cnet instr. works properly		Add r1, r2	Cnet r2,r6 (C_reg_enable = 1)	NoOp	Movi r2
3 - Movi followed by branch	1		Immed. #	Movi r1	NoOp
	2		Branch	Immed. #	Movi r1	NoOp
	3	Branch stalls (higher priority)	NoOp*	Branch*	Immed. #*	Movi r1*	NoOp*
	4	Branch stall removed, Immed. # is not overwritten	NoOp	Branch	Immed. #	Movi r1	NoOp
	5			NoOp	Branch	Immed. #	Movi r1
Solution	1		Immed. #	Movi r1	NoOp
	2		NoOp	Immed. #	Movi r1	NoOp
	3	Immed. # is overwritten.	Branch	NoOp	Immed. #	Movi r1*	NoOp*
	4	Branch stall occurs.	NoOp*	Branch*	NoOp*	Movi r1*	NoOp*
	5			Branch	NoOp	Movi r1	NoOp

Experiments

In this section, we discuss the various experiments performed during the design and implementation stages. Most of our concerns involved timing issues because of the critical path involving forwarding of values to the EX stage and setting the C bit from these values to choose the correct program counter value for a branch instruction. Other experiments on implementation of the register file and other components are explained and results and conclusions are given for each.

Subtractor

The lpm_add_sub component is used for the ALU in the uCMK datapath. Test programs that used addition always worked. However, we found that if we subtracted 1 from 1, the result was not consistent. Sometimes it would work as expected, but sometimes there would be a long period during which the output value was fluctuating, and the output only became valid after about 68 ns. This caused problems with our processor because the adder/subtractor had previously been calculated to take around 30 ns or less. This problem was only found when subtracting 1 from 1, and it caused one of our programs to loop infinitely because the result of the subtraction never reached the value of 0 needed to skip the branch statement.

To determine what the problem was, we wrote test code that instantiated the lpm_add_sub component just like in our EX_alu.vhd code. We then simulated it and found that it also gave inconsistent results. Because of this, we feel that this is not happening due to a problem in another part of the datapath.

Even before this problem, we found that the lpm_add_sub was producing incorrect results and used our test program to determine why. We initially had the cin input to the lpm_add_sub component mapped to a signal that was set equal to '0'. For some reason this gave incorrect results and when we removed the cin mapping altogether, the problem was fixed.

We are not sure why these problems have appeared. It seems that the problem with the cin mapping was fixed but the subtraction of 1 and 1 is very inconsistent. We looked at the documentation on the lpm_add_sub component and found that our implementation followed what was expected. We are not using the cout or cin signals on the lpm_add_sub component but, according to the documentation, these signals are not required. Since our small test program isolates the lpm_add_sub component, it seems that the implementation of this must be the source of the errors.

Setting the C Bit

The C bit is used to evaluate a branch instruction. If the C bit is 1, then the branch is found true. The C bit is held in a register in the EX stage of the microprocessor. The C bit is set only by the two compare instructions, Compare Not Equal To (Cnet) and Compare Greater Than or Equal To (Cmpge). We investigated two methods for setting the C bit, and both are discussed below.

The first method was implemented behaviorally in the control path for the EX stage. VHDL conditional statements were used to generate the necessary functions by comparing the inputs to the ALU. For example, for the compare greater than or equal to instruction, we used the following code segment:

                if A >= B then
                    C <= '1';
                else
                    C <= '0';
                end if;

The C bit was then written directly into the C bit register. The C bit register only reads its input when its enable is set high. This enable is set in the control path for the pipeline register enables, and it is only set high when one of the two compare instructions is being evaluated in the EX stage. This method of setting the C bit was relatively simple to code; however, the synthesized "not equal to" and "greater than or equal to" statements took too long to evaluate. The average time was about 34 ns, and since the compare instruction is part of the critical path for our microprocessor, as mentioned earlier, this was unacceptable.

The second method, and our chosen solution, was to use the lpm_comparator component to implement our compare instructions in the datapath. In this case, the required outputs for "not equal to" and "greater than or equal to" were both taken from the comparator to a 2-to-1 mux, the output of which was connected to the C bit register. The mux select is controlled by the control path for the EX stage, and it is based on the instruction currently in the EX stage. This mux selects the comparator function that writes to the C bit. The enable for the C bit register is set the same as before, within the control path for the pipeline register enables. This method was much faster, and an average compare instruction now requires around 13 ns to set the C bit.
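
The comparator, mux, and enabled register arrangement can be modelled behaviourally as follows. This Python sketch is illustrative only: the function and variable names are hypothetical, and treating the operands as unsigned 16-bit values is an assumption, since the text does not say whether the comparison is signed.

```python
WORD_MASK = 0xFFFF  # 16-bit datapath


def c_bit_next(c_bit, a, b, ex_mnemonic):
    """Return the C bit value after the next rising clock edge.

    c_bit       -- value currently held in the C bit register
    a, b        -- the two ALU operands (treated as unsigned, an assumption)
    ex_mnemonic -- mnemonic of the instruction currently in the EX stage
    """
    a &= WORD_MASK
    b &= WORD_MASK
    # Both comparator outputs are always computed in the datapath.
    not_equal = int(a != b)
    greater_or_equal = int(a >= b)
    # The control path selects one output and raises the register enable
    # only for the two compare instructions.
    if ex_mnemonic == "cnet":
        return not_equal
    if ex_mnemonic == "cmpge":
        return greater_or_equal
    return c_bit  # enable low: the register keeps its old value
```

The point of the arrangement is that the control path only drives a 1-bit select and an enable, while the wide comparison itself stays in the datapath component.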

Muxes after Register File in ID stage

The two multiplexers that follow the register file block in the ID stage presented a problem early on. These 8-to-1 muxes select which two registers are output from the register file in the ID stage to the EX stage. Each multiplexer has eight inputs, and each input is a 16-bit vector. The enables for the muxes are taken from the instruction register in the ID stage: the 3-bit source and destination register fields in the IR are decoded to give two 8-bit enable signals. When an instruction requires data that is being written to the register file, the data travels from the output of the WB stage into the register file, then through the output muxes and on towards the EX stage. This delay was too large for certain combinations of instructions, approximately 62 ns (this was before the clock period for the system was changed from 40 ns to 80 ns), and an alternate solution was necessary for the output muxes. Forming each multiplexer out of eight 2-input 'AND' gates and a single 8-input 'OR' gate reduced the delay; the output of each of the eight 'AND' gates is an input to the 'OR' gate. The output of each register is sent to one input of an 'AND' gate. The other input to the same 'AND' gate consists of the corresponding enable bit (i.e. reg0 => enable(0), reg4 => enable(4), ...) concatenated with itself 16 times to form a uniform 16-bit vector. With only one bit in the enable signal set at any one time, only one of the eight 'AND' gate outputs can be non-zero, so only one register passes data through the 'OR' gate. This solution reduced the delay enough for the worst case combination of instructions to complete in under 40 ns, with the new mux implementation taking only about 12 ns.
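
The AND-OR multiplexer structure can be illustrated with a short Python model. It is a sketch of the gate-level idea only, with hypothetical names, and it assumes the enable vector is one-hot as described above.

```python
def and_or_mux(registers, enable):
    """Select one of eight 16-bit register values using a one-hot enable.

    registers -- sequence of eight 16-bit integers (reg0 .. reg7)
    enable    -- 8-bit integer with exactly one bit set (bit i selects reg i)
    """
    output = 0
    for index, value in enumerate(registers):
        bit = (enable >> index) & 1
        # The enable bit "concatenated with itself 16 times".
        mask = 0xFFFF if bit else 0x0000
        # One 2-input AND per register, all feeding a single wide OR.
        output |= value & mask
    return output
```

With registers holding 0x0000, 0x1111, ..., 0x7777 and enable equal to 0b00010000, only the reg4 term survives the AND stage, so the OR stage outputs 0x4444.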

Synchronous Data Memory

The data memory in the MEM stage was originally designed to be asynchronous. This was meant to save time when we were working with a 40 ns clock period: we found that the average memory access took about 30 ns, so it could not be based on the clock. While testing programs that used the 'store' instruction, we found that data was not stored to memory properly because of glitches in both the data and the memory write enable signal, which is set by combinational logic in the control path. At this point we had already decided to make the clock period 80 ns, so we now had room to make the data memory synchronous on the falling edge of the clock. With the registers that separate each stage being clocked on the rising edge, data could still enter and leave the MEM stage in one clock cycle. Programs that were tested after making the data memory synchronous properly stored data to memory, and the data loaded from memory could be written to a register. This problem does not apply to the instruction memory because it is only read from, and the address to read from comes directly from a register.

The enable for the register file block located in the ID stage was originally set from the IR (instruction register) in the WB stage. The control path would examine the IR and, depending on the instruction, an 8-bit enable vector would be decoded and used to enable writing to the register file. Due to the control path being purely combinational, the enable signal contains glitches. The enable glitches that were generated caused the register file block to be forced into an unstable state during simulation. The register file block would output a string of 'don't cares' for each register that would be sustained for the duration of the simulation. In order to solve this, the IR was looked at one stage earlier, in the MEM stage, and the decoded 8 bit enable vector would transition through a clocked register stage before reaching the register file block. By clocking the enable signal the register file only receives the steady-state value, and thus the proper register is enabled without error.

Synchronous Register File

The original design for the register file block was asynchronous. The main motivation was that only the registers separating stages in the datapath were designed to be synchronous. Everything else was planned to be asynchronous, in order to have no timing conflicts (an instruction could only take one clock cycle to move from stage to stage in the datapath). The register file block takes two inputs, both from the WB stage. The first is the enable for the register file block, and the second is the data that is sent to load a register. The above paragraph discusses the enable signal, the problems associated with it, and how clocking the enable signal solved them. However, problems were still found with the register file block, and they were determined to occur because the data sent to the registers was the output of a mux, and therefore not registered. While the data was fluctuating, incorrect data values would be sent, and this caused the register file block to crash, just as in the above paragraph. It was determined that the best way to solve this was to make the register file block synchronous and sensitive to the falling edge of the clock. This was implemented and found to work adequately without conflicting with any timing constraints. In fact, had this design started with a synchronous register file block, neither this problem nor the one above would have been encountered.

Besides solving the problem of invalid data being written and making the register file unstable, the new implementations of the ID stage greatly reduced the size of the register file. Originally, the register file and muxes in the ID stage took up about 45% of the total logic cells on the FLEX10K20. This left little room for the rest of the components. Now, the ID stage only takes up about 28% of the logic cells and we no longer have any problems with our design exceeding the size of the FPGA.

Stalling and the Clock Period

After connecting the complete datapath, the timing analyzer in MAXplusII was used to determine the critical timing constraints. It was found that the worst case timing was approximately 96 ns. By writing test programs that exercised only certain sections of the datapath to isolate the timing problems, we found that the path for forwarding data from the WB stage to the EX stage takes 64 ns, and that the path from setting the C bit in the EX stage to selecting the new program counter value for a branch instruction takes about 36 ns. The worst case therefore involves a combination of instructions that must forward data and also forward the C bit because a branch instruction follows. Although this combination can certainly occur, it is not the most common set of instructions, and can therefore be thought of as a rare worst case scenario.

At this point the clock period for the system was 80 ns, so a solution was necessary. Two remedies were considered. The first was to increase the clock period to 160 ns; the second was to stall the system for one clock cycle in this case, giving the components and modules time to output steady-state values. For the affected instructions the second solution is equivalent to the first, because the system is stalled for one clock cycle and then allowed to proceed for one cycle, for a total of two clock cycles or 160 ns. The advantage of stalling is that only a rare combination of instructions causes this worst case timing conflict. Whenever it occurs the control path stalls the system, so the processor is not continuously slowed down because of a rare set of instructions. For a detailed table of the cases where stalling is implemented, see Table 2: Stalling Examples.
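
The trade-off between the two remedies can be made concrete with a small calculation, sketched here in Python. The function name and the stall fraction used in the example are hypothetical; no measured stall frequency is given in this report, and the model assumes each stall costs exactly one extra clock cycle and ignores all other effects.

```python
def average_time_per_instruction_ns(clock_period_ns, stall_fraction=0.0):
    """Average time per instruction for a one-instruction-per-cycle pipeline.

    clock_period_ns -- clock period in nanoseconds
    stall_fraction  -- fraction of instructions that incur one stall cycle
                       (a hypothetical input, not a measured value)
    """
    # Each stalled instruction costs one additional clock period.
    return clock_period_ns * (1.0 + stall_fraction)


# Option 1: slow the clock so the worst case path always fits.
slow_clock = average_time_per_instruction_ns(160)

# Option 2: keep the 80 ns clock and stall only when needed, here assuming
# (purely for illustration) that one instruction in ten triggers a stall.
stalling = average_time_per_instruction_ns(80, stall_fraction=0.1)
```

Under this simple model the stalling approach only becomes as slow as the 160 ns clock when every instruction stalls, which is why it is preferred for a rare worst case.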

Design Verification

To verify our design, we organized test cases and ran simulations throughout the implementation stage. As mentioned in previous documents, we made sure to do extensive testing of the lower level components to reduce the amount of debugging necessary at the higher levels. Most of the testing of the complete uCMK was done for the Simulation Documentation. All the simulations we ran worked as expected, but at that time, we found that we needed to change our data memory to be synchronous. After changing this, we ran the same test programs and found that access to data memory now worked well. Our final step in testing was to configure the chip with our test programs loaded in memory. We wired up seven segment displays and checked that each of the tests gave the proper output. Since our simulation worked well, we were confident that on-chip testing would not bring out any problems. However, we did have problems previously with our demo. The simulation of the demo worked properly but the on-chip testing was not consistent. We found that this was because of timing problems. Therefore, we made sure to do complete testing on-chip to make sure that all timing issues were taken care of.

The test programs that we ran were designed to test as many combinations of instructions as possible. Many of the tests test various types of forwarding when subsequent instructions require data before it is written to the register file. We also tested stalling for 'movi' and 'branch' instructions. For the 'branch' instructions, we made sure to test the case when the branch was taken and also when it was not. We designed programs to make sure that branching worked properly for both forward and backward movement through the program (i.e. positive displacement from current address and negative displacement). We were careful to cover each of the instructions to test its basic operation and also any situations where the data generated or used in that instruction may conflict with other instructions in the pipeline stages. However, we also realized that many of the instructions act much the same in certain situations and this reduced the number of test cases. For example, the forwarding that is done for an 'add' instruction is exactly the same as that done for a 'sub' instruction. Therefore, we really only needed to test that the subtraction produced the correct result under normal circumstances and could assume that the rest of the testing for the 'sub' instruction would be taken care of by testing the 'add' instruction.

The test programs are described in detail below. All of the programs worked during on-chip testing, which verifies that we determined our critical path and have removed it as a potential problem. All but one of the programs worked during simulation. This program has worked during simulation before but we are unsure as to why it is not working now. The .sof file we have for on-chip testing works properly but may be from a different compilation. We did not have time to investigate this problem but are confident that it is fairly isolated. We feel that our testing shows that the uCMK works under most situations. It would be impossible to test all of the combinations of instructions but we feel we have covered most of them fairly well.

Test Programs

Prog_1

This program tests the movi, add, and store instructions. The program loads registers using the movi instruction, and then adds two of the loaded registers. The result of the add is stored to the 7-segment displays. This program also tests forwarding and stalling. Forwarding is necessary for the add instruction, and partial stalling is used with each movi instruction. The simulation worked correctly and testing on-chip was successful as well.

Prog_2

This program tests storing data to memory, loading data from memory, and the subtract instruction. The program starts by loading two registers with 'movi' instructions. Then one of the loaded values is stored to memory. Next, the stored data is loaded into an empty register. To ensure that the 'load' instruction was executed properly, a 'sub' instruction uses the register that contains data loaded from memory. The result of the 'sub' instruction is stored on the four 7-segment displays. This program worked as expected during simulation and on-chip testing.

Prog_3

Program 3 tests the compare instructions ('cnet' and 'cmpge') and the branch instruction. The program starts by loading two registers, r0 and r1, using the 'movi' instruction. A loop is entered and a 'sub' instruction decrements register r0 until r0 < r1 (determined by the compare greater than or equal to 'cmpge' instruction). Next, a move instruction is used to copy the data in r0 to r3. Registers r4, r5, and r6 are loaded using the 'movi' instruction. Next, a compare not equal to ('cnet') instruction compares r3 and r6. Because these registers are not equal, the C bit is set and the program branches on the next branch instruction. A 'store' instruction is used to display the contents of r6 on the four 7-segment displays. Finally, a 'fini' instruction is used to terminate the program. It should be mentioned that a 'noop' instruction was inserted after the 'movi r6', because the following instruction is a compare instruction, and the uCMK requires a noop after a 'movi' when it is followed by a compare instruction. The program simulated as expected and ran on chip without error.

Prog_4

This program tests whether the processor can handle nested loops and multiple branch instructions. The program uses multiple loops to delay a write to the 7-segment displays. This can be done if the user wishes to have more than one output displayed per program. This program also uses most instructions found on the uCMK, and therefore is a well-rounded program to use in order to test the processor. The program is too long to run on simulation but on-chip testing shows that it works as expected.

Prog_5

This program implements an infinite loop that outputs to the four 7-segment displays every 7 seconds. The output is incremented each time the program writes to the displays, which allows the user to see the loop in motion. After the first 7-segment display is incremented to 'F', on the next output the second 7-segment display will be set to '1' and the first display will wrap around to '0'. This program was originally designed to test seven-segment display decoding and the wiring. The program covers too much time to run in simulation but works correctly when configured on the chip.
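The way a 16-bit value maps onto four hexadecimal digits can be sketched in Python as follows. This is a minimal illustration, not the display decoder itself; the function names are hypothetical, and the list is ordered most significant digit first, so the "first" display described above corresponds to the last element.

```python
def to_hex_digits(value):
    """Split a 16-bit value into four 4-bit digits, most significant first."""
    value &= 0xFFFF
    return [(value >> shift) & 0xF for shift in (12, 8, 4, 0)]


def digits_as_text(value):
    """Render the four digits as the characters a hex display would show."""
    return "".join("0123456789ABCDEF"[d] for d in to_hex_digits(value))
```

Incrementing the output from 0x000F to 0x0010 changes the rendered text from "000F" to "0010", which is the carry from the first display into the second.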

Prog_6

This program tests the branch instruction as well as a load followed by a store instruction. The program starts by loading registers r0 and r1 with the 'movi' instruction. Then the C bit is set using the compare instruction and the program branches. If the program did not branch, the wrong value would be displayed and the program would terminate. The program does not terminate and proceeds to store data to memory, and then to load the same data to another register. An 'add' instruction is used and the result is displayed on the four 7-segment displays. The program requires the insertion of a 'noop' instruction after the 'load r2,r4', because the following instruction uses the result of the load. This program adequately tests forwarding and stalling and makes sure that any end instruction that should be missed because of a branch is missed. The simulation shows that this program works correctly. The on-chip testing ran as expected as well.

Prog_7

This program tests the compare greater than or equal to ('cmpge') and branch instructions. Two values are loaded into 'r0' and 'r5', and the compare instruction sets the C bit. This allows the program to jump to the store instruction and set the 7-segment displays. The 'noop's were inserted to leave room in memory between the branch instruction and the destination address. The simulation worked as expected. On-chip testing yielded the same result.

Prog_8

This program tests whether the processor will ignore a branch instruction when the C bit is not set, as well as whether an addition or subtraction instruction can have the same register for destination and source (i.e. add r4,r4 or sub r1,r1). The program loads registers r1 and r2, and the C bit is not set after a 'cmpge' instruction is used. Then register r3 is loaded using the 'movi' instruction. Afterwards, a branch instruction is ignored because the C bit was never set, and the program proceeds to an 'add' instruction. The program then outputs to the 7-segment displays and ends using a 'fini' instruction. The simulation of this program worked as expected, as did the on-chip testing.

Prog_9

This program checks that forwarding works as designed. Two 'movi' instructions are used to load r1 and r2. The program then needs to forward r1 and r2 to complete the next instruction which is 'add r1,r2'. The contents of r1 are copied to r3 and r4 in the next two 'move' instructions. Finally through consecutive store instructions the contents of r3 and then r4 are displayed. Although the user will not observe the first display, it was executed to see that the proper data would be displayed on the four 7-segment displays.

We realized when writing the report that we did not have a simulation diagram for this test. When we tried to create one, we got incorrect results. Previously this simulation did work; we are unsure why it doesn't now and did not have time to explore the problem. It seems to be a simple program with basic forwarding and stalling for the immediate values, which has been tested in many of the other test programs. The .sof file we have for this program works fine when configured on the chip.

Prog_10

This program loads pre-stored data from memory to registers r6 and r7. A 'sub' instruction is used to subtract r7 from r6. Then the result is displayed. The functionality of this program is to ensure that data set in memory, prior to the execution of the program, can be loaded at any time during the program. It also tests that unsigned subtraction works as we expect. The simulation results given on page TEST-18 show that the program works properly and on-chip testing gives the same result.

Prog_11

Program 11 tests that a branch instruction is ignored when a compare instruction does not set the C bit to '1'. A 'movi' instruction loads r1 and the value is copied into r6. Both registers are compared with a 'cnet' instruction; because they are equal, the C bit is not set. The program then ignores the branch instruction and outputs the data in r6 to the four 7-segment displays. The simulation results show that this program works and the on-chip testing did as well.

Prog_12

This program tests the subtract instruction as well as uses the branch instruction to create a simple loop, which is a common code structure in assembly programming. The value in register 'r4' is decremented until it holds the same value as register 'r2'. At this point the value is displayed to the 7-segment display. This program worked as expected both during simulation and on-chip testing.

Prog_13

This program moves 6 numbers from memory into registers, adds the first three together, and adds the last three numbers together. The absolute difference between the sums is then calculated. These sums are compared, and if sum2 < sum1, then sum1-sum2 is sent to the LCD. If sum2 >= sum1, then sum2-sum1 is sent to the LCD. This program is a good example of a real program that would be implemented using a microprocessor.
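The calculation this program performs can be written out as a short Python sketch. It mirrors the comparison described above but is not the uCMK assembly itself; the function name and the masking of each sum to 16 bits are assumptions made for illustration.

```python
MASK16 = 0xFFFF  # uCMK registers are 16 bits wide


def abs_difference_of_sums(values):
    """Sum the first three and last three numbers, then return |sum1 - sum2|.

    Assumption: each sum is truncated to 16 bits, as a 16-bit adder would do.
    """
    if len(values) != 6:
        raise ValueError("expected exactly six numbers")
    sum1 = sum(values[:3]) & MASK16
    sum2 = sum(values[3:]) & MASK16
    if sum2 >= sum1:
        return sum2 - sum1   # case sum2 >= sum1: display sum2 - sum1
    return sum1 - sum2       # case sum2 < sum1: display sum1 - sum2
```

Choosing the subtraction order from the comparison result keeps the subtraction from underflowing, which matters because the ALU works on unsigned numbers.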

This program illustrates the following situations:

The necessary noop instruction inserted after a load instruction if the source/data register of the load instruction is used in the following instruction.
A branch instruction after a compare instruction, which requires the new C bit to be forwarded to the branch instruction.
Stalling performed by a branch instruction found to be true, and stalling to overwrite an immediate value once it has been stored.
Loading data from memory works properly.

Simulation and on-chip testing both show that this program works as expected.

Prog_14

This program tests whether registers are initialized to zero on reset. Register r7 is loaded with data and then the 'add' instruction is used to add r7 to r1. Register r1 has not yet been given a value, thus the 'add' instruction should just copy the contents of r7 to r1 (adding zero to r7). The program branches to load an immediate value into r0 and then displays the value. If r1 was not greater than or equal to r7, the first branch is not taken and a different immediate value is written to r0 to display. This program also tests that a 'movi' right before the end of the program is not a problem. Because move immediate instructions involve stalling, we had to make sure that the stalling did not conflict with the disabling of registers that occurs when the end of the program is reached. Simulation and on-chip testing both show that this program works as expected.

IC Test Measurements

Since there are very few hardware interfaces to our microprocessor, there are not really any test measurements to be made. Our main concern in our design was not external hardware, but our clock speed. As discussed in the previous section, to test our IC, we wired up cathode seven segment displays to the UP-1 board and ran test programs loaded into memory using .mif files. The displays showed the correct numbers, which confirms that our IC runs at the specified clock speed of 12.58 MHz.

References

During the design stage of our microprocessor, we referenced the following two textbooks quite often. The Computer Architecture book was the main source of information on the DLX architecture.

Computer Architecture - A Quantitative Approach, John L. Hennessy and David A. Patterson, Second Edition, Morgan Kaufmann, 1996
The Designer's Guide to VHDL, Peter J. Ashenden, Morgan Kaufmann, 1998

Design Documentation and Interface Descriptions

In this section, each of the components is discussed briefly to provide an explanation of the interfaces between components. For more information, see the individual diagrams for each section.

IF Stage

The IF stage consists primarily of the Instruction Memory block that is used to store the programs that are run on the uCMK processor. The Instruction Memory is the only asynchronous block in the system, and it is always enabled. It receives a 9-bit address vector from the PC Register and outputs the corresponding 16-bit data value towards the ID stage.

ID Stage

The ID stage consists of the Register File block and two multiplexers. The Register File is a synchronous block sensitive to the falling edge of the clock, and holds all eight registers used to hold data in programs. The data and enable signals that are inputted to the Register File block come from the WB stage. The Register File outputs the data in all eight registers to each of the two multiplexers. The output of the muxes goes to the EX stage. The muxes are enabled from the IR (instruction register) in the IF / ID Register. The IF / ID Register is enabled from the control path and can be stalled when desired. The IR coming out of the IF / ID Register is broken up into five different signals. Two of the signals 'source_register' and 'destination_register' are used for the mux selects in the ID stage. Signal 'PC_offset' consists of the least significant 9 bits of the IR and is used if the program uses a branch instruction. The least significant four bits are sent to the control path in signal 'IR_to_cp'. The complete IR is sent to the next stage through signal 'IR_going_to_EX'.
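As an illustration of how the IR is split into fields, the following Python sketch extracts the fields implied by the binary encodings in the data sheet (a 4-bit opcode, then the 3-bit rx and ry register numbers) together with the 9-bit 'PC_offset'. The function and key names are hypothetical, and the sketch does not claim which of rx and ry the design wires to 'source_register' or 'destination_register'.

```python
def decode_ir(ir):
    """Split a 16-bit instruction word into its fields.

    Field positions follow the data sheet encoding pattern
    'ooooxxxyyy000000': opcode in bits 15-12, rx in bits 11-9,
    ry in bits 8-6. pc_offset is the least significant 9 bits.
    """
    ir &= 0xFFFF
    return {
        "opcode": (ir >> 12) & 0xF,
        "rx": (ir >> 9) & 0x7,
        "ry": (ir >> 6) & 0x7,
        "pc_offset": ir & 0x1FF,
    }
```

For example, the word 0001011101000000 (a load with rx = 3 and ry = 5) decodes to opcode 1, rx 3, ry 5.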

EX Stage

The EX stage consists of two multiplexers, an ALU, a comparator, and a multiplexer for selecting the output of the comparator. The first two multiplexers select which data is going to enter the ALU and comparator. Either forwarded data, an immediate value, or data from the Register File block in the ID stage can be selected. The outputs of the two muxes directly enter the ALU and the comparator, and are also sent to the MEM stage to set the address and data signals needed for load and store instructions. The ALU can be selected to add or subtract unsigned numbers. The output of the ALU proceeds to the MEM stage. The comparator block can either compare greater than or equal to, or compare not equal to. Both compares are performed and outputted to the mux which follows the compare block. This mux is selected from the control path. The output of the mux is used to set the C bit, which is used for branching. The IR in the ID / EX Register is aliased to 10 bits from 16 bits, and sent directly to the EX / MEM Register in the MEM stage.
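The behavior of the ALU, the comparator, and the comparator output mux can be summarized with a small Python model. This is a behavioral sketch only, not the VHDL: the function and parameter names are hypothetical, and wrapping results to 16 bits on overflow or underflow is an assumption, since the report does not state how the ALU handles those cases.

```python
MASK16 = 0xFFFF  # datapath width


def alu(a, b, subtract):
    """Unsigned 16-bit add or subtract (assumed to wrap modulo 2**16)."""
    a &= MASK16
    b &= MASK16
    result = a - b if subtract else a + b
    return result & MASK16


def comparator(a, b):
    """Perform both compares at once, as the comparator block does."""
    a &= MASK16
    b &= MASK16
    return a >= b, a != b   # (greater-or-equal, not-equal)


def c_bit(a, b, select_ne):
    """Mux after the comparator: pick which compare result sets the C bit."""
    ge, ne = comparator(a, b)
    return int(ne if select_ne else ge)
```

Computing both compare results and selecting afterwards matches the structure described above, where the control path only has to drive one mux select.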

MEM Stage

The MEM stage consists of the Data Memory block and a multiplexer. The Data Memory block is used for load and store instructions. The Data Memory is synchronous and sensitive to the falling edge of the clock. The enable for the Data Memory is set in the control path. A load/store signal set through the control path determines whether data is stored to memory or loaded from memory. If a load instruction is executed the data on the address line is ignored, and during a store instruction the data that is written to memory is also outputted from the memory block. The mux is used to select between the output of the ALU and the data line. The data line was one of the inputs to the ALU that were passed on to this stage from the EX stage. The mux is set through the control path and is used to pass on immediate values ('movi' instruction), outputs of the ALU ('add' and 'sub' instructions), and moved data ('move' instruction).

Write Back Stage

This file includes the MEM/WB pipeline register, as well as the 16-bit 2-to-1 mux that selects data from memory or data from the ALU. The control for this mux is provided by the control path for the WB stage, and is based on the instruction in the WB stage. The inputs to the pipeline register are the Instruction Register from the MEM stage, the 16-bit data word read from memory (for a load instruction), and the 16-bit ALU output from the EX stage (which could be an immediate value, a number to be moved, or output from the adder/subtracter). The pipeline register is clocked and enabled. The outputs from this stage are the upper 7 bits of the instruction register (to the control path to determine the mux select), and the 16 bit data word to be written back to the general purpose register file in the ID stage.

Forwarding or WB Muxes

These components are used to provide forwarded data for the EX stage. They are controlled by the control path for the EX stage. There are two 16-bit 3 to 1 muxes, one for each of the two data inputs in the EX stage. The three inputs to these two muxes are the ALU output from the MEM stage, the immediate or move data output from the MEM stage, and the output data from the WB stage. The two outputs from the muxes are connected to the forwarded data inputs of the EX stage. The EX control path sets the mux selects based on the current requirements for forwarding. Forwarding is discussed in detail in the Design Details section.
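Each forwarding mux is a plain 3-to-1 selection, which the following Python sketch models. The select encoding (0, 1, 2) and all names are assumptions made for illustration; the actual encoding is defined by the EX control path and is not given in this section.

```python
def forward_mux(select, mem_alu_out, mem_imm_or_move, wb_data):
    """Model one 16-bit 3-to-1 forwarding mux.

    Assumed (hypothetical) select encoding:
      0 -> ALU output from the MEM stage
      1 -> immediate or move data from the MEM stage
      2 -> output data from the WB stage
    """
    inputs = (mem_alu_out, mem_imm_or_move, wb_data)
    if select not in (0, 1, 2):
        raise ValueError("select must be 0, 1, or 2")
    return inputs[select] & 0xFFFF
```

Two instances of this mux, one per EX data input, would be driven with independent select values so that each operand can be forwarded from a different stage.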

Program Counter

This section is supporting hardware for calculating the program counter address. The section centers around the PC register, which is clocked and enabled. There is one 9-bit 2-to-1 mux that controls the address to be read into the PC. The inputs to this stage are the absolute destination address from a branch instruction, and the mux select from the control path. In addition, there is an adder to increment the PC by 1 on each clock cycle. The mux then selects the incremented PC value or the absolute branch destination address. The PC address is then connected to the address input of the IF stage. The control of this stage is discussed in more detail in the Design Details section.
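The next-address selection described above can be sketched in Python as follows. The names are hypothetical, and the wrap from address 511 back to 0 is an assumption that follows from using a 9-bit adder; the report does not say what happens at the end of the instruction ROM.

```python
PC_MASK = 0x1FF  # 9-bit program counter for the 512-word instruction ROM


def next_pc(pc, branch_target, take_branch, enable=True):
    """Compute the value loaded into the PC register on the next clock.

    take_branch models the mux select from the control path;
    enable models the PC register enable (the PC holds when disabled).
    """
    if not enable:
        return pc & PC_MASK
    if take_branch:
        return branch_target & PC_MASK   # absolute branch destination
    return (pc + 1) & PC_MASK            # incremented PC
```

Holding the PC when the register is disabled is what lets the control path stall instruction fetch without losing the current address.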