Voice Controlled Remote Control

EE 552

High Level ASIC Design with CAD

Final Project

 

 

 

 

 

Dr. Duncan Elliott

December 6, 1999

Scott Medynski

Gabriel Ricardo

Michael Vandegriend

 

Abstract

This document describes the design and results of a successful attempt to implement a voice-controlled television remote control using an Altera FLEX10K70 field programmable gate array and some interfacing circuitry. All of the internal workings of the device are synchronous and operate on a 5 MHz clock. Voice input is sampled into the FPGA at a rate of approximately 8 kHz with 8-bit resolution. The device responds to four spoken commands: "power", "up", "down", and "surf". These correspond to turning the television on and off, changing the channel up and down, and an automatic surfing function that previews each channel for three seconds before incrementing. The speech algorithm makes use of word-boundary detection, zero-crossing counts, and a modified energy analysis. The recognition algorithm is based on isolated word utterances and is speaker-dependent, with training required. In the active mode the device is constantly listening, so no physical contact with it is necessary. The device also operates as a programmable remote: it can accept various codes during IR training and can therefore operate almost any single device at one time. Although designed with a Sony brand coding scheme in mind, the algorithm can easily be modified to suit almost any coding scheme in use today. In our controlled-environment testing, we achieved approximately 75% recognition of commands.

 

 

 

Declaration of Original Content

The project, its conception, and the contents of this report are entirely the original work of the authors except as follows:

 

1) The clock divider has been modified from reference [1]

2) The design for the input hardware was inspired by, but not taken directly from, reference [5]

3) Information on IR coding schemes [8]

4) Information on IR decoding and transmitting circuits [9]

5) The VHDL code for Mux_1, Mux2N_N and registerN was taken and modified from the EE 480 web site [10]

6) The VHDL code for a ROM entity was taken and modified from the EE 552 web site [11]

7) Speech recognition methods, data processing and external RAM interfacing were referenced from [2], [3], [4], [6], [7], [8]

 

 

 

 

 

Scott Medynski

 

Gabriel Ricardo

 

Michael Vandegriend

 

 

 

Table of Contents

Description of Operation (overview) 1

Description of Operation (Analog to digital converter) 3

Description of Operation (top level control) 4

Description of Operation (IR interfacing) 5

Description of Operation (Speech processing and recognition) 10

Achievements 12

FPGA utilization 15

Results of Experiments 16

References 21

Appendix A 22

Appendix B 82

Appendix C 152

Appendix D 166

Appendix E 176

Appendix F 189

Data Sheets 196

 

 

 

 

 

Description Of Operation

Overview

This is a general overview of how the system functions from the user’s point of view. Each sub-system is discussed separately in detail in following sections. Please refer to each sub-system’s section for comprehensive information on operation, design, implementation and results.

The overall function of the system is to take in voice commands, process them and transmit the appropriate IR codes to a target audio/visual electronic device (i.e. TV, VCR, stereo, etc.). The system is capable of recognizing four speaker-dependent voice commands once it has been trained to the user's voice and spoken commands. It can also be trained to store the IR codes of the target A/V device by beaming the codes at the system's IR detector.

The system is divided into four sub-systems: the analogue-to-digital front-end, the top-level control, the speech processing/recognition, and the IR detecting/transmitting. The top-level control delegates control and activates the sub-systems responsible for IR training, voice training, IR transmitting, voice recognition, and noise sampling based on the user's inputs. The user interacts with the system via the two push buttons on the UP1 board (and of course a microphone for speech input) and receives information on the system's status via the two seven-segment displays. Each sub-system is granted control of, or access to, the push buttons and seven-segment display by the top-level control according to which mode of operation it is in. The system has four modes of operation: noise sample, voice training, IR training and active, numbered 0 through 3 respectively on the seven-segment displays.

When the system is first activated the user can scroll between the four possible modes of operation by pushing push-button one (PB1) on the UP1 board, and can select a mode by pushing the other push button (PB2). Initially the user is not free to choose any mode of operation: a noise sample, voice training and IR training must all be completed before the active mode can be entered. In active mode the user is free to utter a voice command, and the device responds by transmitting the associated IR code to the target A/V device.

The details of the first three modes of operation are dealt with in subsequent sections.

 

The Analog to Digital Converter

In everyday life, speech is an analog signal. Before speech can be interpreted and analyzed, the signal must be converted from its analog form into a digital representation that can be manipulated by the logic internal to the device. For this, a block of interface circuitry was designed to bring the signal to significant, measurable levels and then convert it into a digital representation. This is commonly referred to as analog-to-digital conversion, or ADC, and will be referred to as such throughout the remainder of this document.

The first part of the interface is a transducer that brings the pressure wave into the electronic world. This stage took the form of a condenser microphone. The microphone element contains a mechanical means of converting the variations in pressure of the surrounding air (which we hear as speech or sound) into corresponding variations in voltage. The actual voltage variations are minute and leave little signal swing for sampling, so the signal is then amplified. Amplification is done in two stages: a pre-amplification stage with very high gain, followed by a second stage of relatively low gain. The gain of the second stage is variable to allow fine-tuning of the signal strength; if the speaker's voice is very quiet, the gain may be adjusted so that the conversion fully captures the unique aspects of the signal.

Once the signal is amplified, precautions are taken, with the help of a Zener diode, to ensure that the signal cannot exceed the allowable input limits of the conversion chip. The signal then undergoes sampling and conversion at a rate of approximately 8000 samples per second, with each sample converted to an eight-bit digital representation. The conversion is linear: there is a direct relationship between the magnitude of the sample voltage and the binary representation placed at the outputs.

At this point, the signal is taken into the FPGA by means of a bank of two eight-bit registers. The first register is clocked by the sampling frequency and the second on the system clock. This last step ensures that the data is synchronous with all aspects of the FPGA before it is passed on to the speech recognition algorithm.
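The linear conversion described above can be illustrated with a short software sketch. This is a model only, not part of the actual hardware; the 0 to 5 V input range and the function name are our assumptions.

```python
def quantize_8bit(voltage, v_ref=5.0):
    """Linear 8-bit ADC model: clamp the sample to the input range,
    then map it directly onto a code from 0 to 255."""
    clamped = min(max(voltage, 0.0), v_ref)
    return round(clamped / v_ref * 255)

print(quantize_8bit(0.0))   # 0   (bottom of range)
print(quantize_8bit(5.0))   # 255 (top of range)
```

A real sampled waveform would apply this mapping roughly 8000 times per second.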

 

The Control Path

The control path was designed in two stages that reflect the two main phases of the device's operation. Both stages were implemented as Moore-type state machines to ensure that all aspects remain synchronous.

The first manner of behavior to be described is that of the initialization and the training of the device. Upon reset of the device, training for the IR codes is required as well as the ambient noise sample and the spoken command training. Since proper operation in the active mode cannot be realized without all of these steps being completed at least once, entry into active recognition is not granted until all of the training steps have been completed. Even within the training sequence, a voice-training session will prove completely worthless without first testing the ambient noise in the environment. For these reasons, a flag system has been implemented. The general idea behind this system is that upon completion of each step in the training process, a flag is set. Entry into the voice-training session is not granted until the noise-sample flag has been set. Only after all of the training flags are set is entry into the active mode granted. This system ensures that a reliable, intuitive process for the initialization is followed.
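The flag system can be sketched as follows. This is a behavioural illustration in Python, not the VHDL state machine; the flag and mode names are ours, and the assumption that noise sampling and IR training have no prerequisite of their own is ours as well.

```python
class TrainingFlags:
    """Gate entry into each mode on the completion flags set so far."""
    def __init__(self):
        self.noise_sampled = False
        self.voice_trained = False
        self.ir_trained = False

    def may_enter(self, mode):
        if mode == "voice_training":   # requires a prior noise sample
            return self.noise_sampled
        if mode == "active":           # requires every training flag
            return (self.noise_sampled and self.voice_trained
                    and self.ir_trained)
        return True                    # noise sampling / IR training: no prerequisite assumed
```

For example, `may_enter("active")` stays false until all three flags are set, mirroring the gated initialization sequence described above.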

The second distinct manner of operation involves the active listening and recognition of commands and the sending of the corresponding infrared code to the television or other piece of audio-visual equipment. This behavior has been dubbed the active mode of the device. In the conceptual stages of the design, approximately fourteen command words were identified. To handle such a large list of commands, a hierarchical command tree was constructed, with a subset of relevant words identified at each location on the tree. This would require the speech recognition module to compare against only a minimal list of words at any given time, thereby increasing the recognition rate of the device. Although the final version of the device supports four words and no longer requires this feature, the capability remains for larger instruction sets should the device be implemented in the future on an FPGA with more space. The active control path waits in an initial mode until the initial control path sets a flag indicating that all training is complete and the device is ready to begin active operation. Upon receipt of this signal, the active control path lies dormant until the speech module recognizes a word and passes it to the control path. The control path determines whether or not the recognized word has any relevance in the current context. (For example, commanding "TV… EJECT" would have no meaning in any context and would be ignored.) If the word is a valid command, the control path sends a request to the IR module telling it to output the code corresponding to the command spoken. The control path then resumes waiting for another word to be recognized.

 

 

 

Infrared Interfacing

The IR interfacing of our system is performed by the IR-module, which is discussed in this section. The IR-module has two modes of operation, allowing it to detect codes, store them in memory and re-transmit them at a later time. The IR-module sub-system is thus divided into two main components, the detector-module and the transmission-module. A top-level control activates the appropriate module depending on the inputs from the top control of the system. A hierarchical diagram of the IR-module and how it interacts with key components can be found on the following page. A detailed description of the operation, design, testing and implementation results of the sub-system is given below.

Operation of the system

When the IR sub-system is cued by the top-level control it enters either a training mode or a transmit mode, depending on the mode signal sent by the top-level control. In the training mode the user interacts with the system via push-button one and the seven-segment displays. The display lists which code the user may beam at the device: either PO (for the power command), CU (for the channel-up command) or Cd (for the channel-down command). The user can scroll through these commands by pushing the push button. Once the appropriate code has been selected the user may then beam the code at the detector module. When a valid code sequence has been detected and recorded, the display responds by illuminating all three horizontal segments for approximately a 2-second duration, signifying that the code has been accepted and stored into memory. To exit the IR training mode the user must place dip-switch 1 on the UP1 board into the closed position. When the IR-module is in transmit mode the user receives no information about its operation other than by observing the correct command taking place on the target A/V device.

 

Design of the system

 

The IR-detector

Originally our system was to be capable of storing any IR code sequence in memory and transmitting it at a later time. In researching the variations in IR coding techniques used by different manufacturers, we concluded that the system could be greatly simplified, while still having broad applications, if we restricted our IR module to capture and train for only one type of IR code sequence. There are two predominant IR code sequences: one consists of a start sequence, 32 coded bits and a stop sequence; the other has a start pulse, 12 coded bits and no stop bits. We decided to implement a system that can capture and transmit codes from IR devices using the latter sequence.

(Note: for each code sequence there are various coding schemes for representing ones and zeros. Information on these coding schemes and circuits for IR detection was taken from the Internet [8], [9]. For our purposes it is not important to decode or distinguish between the ones and zeros of the incoming pulse; the reason for this will become obvious in the discussion of our design. For a detailed look at coding schemes, coding standards and examples of typical IR pulses please refer to our student application notes titled IR_codes.)

The IR module sub-system needed the capability to switch from training (code-capturing) mode to a transmission mode. Training and transmitting required a control path able to set the system into the appropriate mode and to capture the incoming bit sequence into memory. The sub-system components are now discussed.

 

 

 

 

 

The demodulation component:

This component of the sub-system consists of a demodulation chip, purchased at Radio Shack, which demodulates a 38 kHz light signal of 940 nm +/- 100 nm. The demodulation chip has three pins (Vcc, ground, Vout) and produces at the Vout pin a pseudo-demodulated 0 to 5 V bit stream. The output of this chip gives the bits in whatever coding scheme is used by the manufacturer of the IR device (i.e. most bits are coded by some pulse-width or pulse-space coding scheme, and the chip gives its output in this raw form). A typical bit-stream pulse output from the demodulation chip is shown below, and a schematic of the demodulation chip and how it connects to the rest of our system can be found on the following page. Since our IR module only needs to capture the code and reproduce it, there is no need to further decode the signal into ones and zeros. The next step of the demodulation component is to sample the incoming pulse stream (at a 1 MHz sampling rate) and use the sampled data as the count enable of two counters (built from lpm library counters), which count how long the pulse is high and how long it is low. In this way a number is produced representing the duration of the high and low portions of the pulse making up the IR command. The output of each count is sent to the control path, which handles the storing of the data and the resetting of the counts, and also determines when a valid pulse sequence has started and ended.
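The counting scheme amounts to run-length encoding the demodulated bit stream. The following is an illustrative software sketch only; the real design uses two lpm counters clocked at 1 MHz, and the function name is ours.

```python
def measure_pulse(samples):
    """Collapse a sampled bit stream into (level, duration) pairs,
    where duration is measured in sample periods (1 us at 1 MHz)."""
    runs = []
    level, count = samples[0], 1
    for s in samples[1:]:
        if s == level:
            count += 1
        else:
            runs.append((level, count))
            level, count = s, 1
    runs.append((level, count))
    return runs

# A toy pulse: 3 samples high, 2 low, 4 high.
print(measure_pulse([1, 1, 1, 0, 0, 1, 1, 1, 1]))  # [(1, 3), (0, 2), (1, 4)]
```

Storing only these durations is exactly why the module never needs to decode ones and zeros: replaying the durations reproduces the code verbatim.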

The transmission component:

The data stored in memory, representing how long the output pulse must be high or low, is loaded into a counter (lpm counter) which counts down once per microsecond (the same rate as the sampling clock). The control path for the transmitter enables or disables the modulation of the outgoing signal depending on which point in memory it is at. It turns out that modulating this pulse sequence onto a 38 kHz carrier is quite simple: all low portions of the sequence are modulated to 38 kHz and all high portions are not. A 38 kHz clock was therefore created as the output to the base of a BJT, which controls the current passing through the IR LED. The modulating signal drives the base of the transistor and either turns it on (at a 5 V input) or off (at a 0 V input). A schematic for this circuit is shown on the following page. The value of the resistor R is limited only by how much current should ideally be drawn through the two IR LEDs. Two IR LEDs are used to increase the range of the transmitter, and a resistor of 4.3 kΩ is used to draw a current of 1 mA through them.
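The modulation rule (carrier on during low portions, off during high portions) can be modelled in a few lines. This is a toy model under our own assumptions: a 1 µs timebase, and a carrier period of 26 µs to approximate 1/38 kHz.

```python
def modulate(runs, carrier_period_us=26):
    """Replay (level, duration_us) pairs, emitting a square-wave carrier
    during low portions and silence during high portions."""
    out, t = [], 0
    for level, duration_us in runs:
        for _ in range(duration_us):
            if level == 0:   # low portion: 38 kHz carrier on
                out.append(1 if (t % carrier_period_us) < carrier_period_us // 2 else 0)
            else:            # high portion: carrier off
                out.append(0)
            t += 1
    return out
```

One full carrier period appears for every 26 µs of a low portion, while a high portion of any length produces a flat output, matching the rule stated above.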

Address/Code select component:

This component simply passes a code, either from the higher-level control or from the user's inputs, to the lpm_ram address inputs to select the appropriate address location. In this way the address always points to the beginning of the desired code, whether in training mode to start storing to memory or in transmit mode to start transmitting from memory.

 

The control path:

The control path was implemented as a simple Moore state machine. It receives data from all components and from the higher-level control of the system to determine which mode it is in, i.e. whether to transmit or to store counts to memory. It also delegates control of the address counter (the lpm counter that points into memory). By using two multiplexers for the increment and clear signals of the address counter, the control path selects either the IR detector or the IR transmitter to perform address incrementing and clearing.

 

The LPM_RAM memory

For storing the counts, representing the pulse sequence, an lpm_ram module has been created.

For addressing the memory, an lpm counter is used to feed in the least significant five bits of the address. The two most significant bits of the address are sent straight to the RAM from the Address/Code select controller. In this manner, chunks of 32 addresses are reserved for each code. Breaking up the memory address this way, and allowing the counter to address only the least significant five bits, also provides memory wrapping and protects memory in case of incorrect incrementing of the address counter. The RAM requires a data width of 12 bits (since no valid count should exceed 4000), and each IR code bit consists of a high and a low portion. The start pulse is not stored in memory since it can be expected to begin every code. Since each command takes up 32 address locations and each address location holds 12 bits, the memory works out to be:

12 bits × 32 = 384 bits per code.

Our system implements 4 voice commands which correspond to 3 IR commands. This leads to a memory requirement of:

384 bits/code × 3 codes = 1152 bits

In the lpm_ram this is implemented with a data width of 12 and an address width of 7, which uses up 2 EABs.
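The memory budget above can be checked with a couple of lines of arithmetic:

```python
width_bits = 12        # bits per stored count (no valid count exceeds 4000)
counts_per_code = 32   # address locations reserved per IR command
codes = 3              # power, channel up, channel down

bits_per_code = width_bits * counts_per_code
total_bits = bits_per_code * codes
print(bits_per_code, total_bits)  # 384 1152
```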

Further design considerations

It is important to note that certain alterations were made to the design along the way to achieve the desired functionality. These alterations made the system more synchronous and more coherent in its interactions; synchronous systems proved much more reliable.

 

 

 

 

 

 

Speech Processing and Recognition

The voice recognition system implemented for this project is a speaker-dependent, isolated-word system. It can be used by only one person at a time, who must train the recognition algorithms. For this type of recognition system to work, words must be spoken with pauses separating them. Once initialized and trained, the voice recognition is always active and is able to distinguish words from surrounding noise. The system has been designed for 8-bit speech data sampled at a rate of 8 kHz.

The voice recognition system consists of three major algorithms, the word-boundary detection and speech processing algorithm, the training algorithm, and the recognition algorithm. The word-boundary detection algorithm differentiates the beginning and end of a spoken word from the ambient noise. In order to determine the threshold at which the start of a word is detected, the noise characteristics must be analyzed first.

When the word-boundary detector is set to sample the ambient noise, 1024 data values of the noise are averaged to find the zero level of the sound data. Once this zero level is found, the number of zero-crossings and the energy of the noise, referenced to the zero level, are found for a set of 256 data values. The peak noise value, referenced to the zero level, is also found. To deal with space constraints, the energy is calculated as the absolute value of the amplitude of the input data, and not the square of the amplitude of the data. The data from this analysis is used to determine the threshold level at which a word is detected.
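The noise analysis can be expressed compactly in software. This sketch follows the description above; the function name is ours, and "energy" is the sum of absolute deviations, not squared amplitudes, exactly as in the text.

```python
def analyze_noise(zero_block, noise_block):
    """zero_block: 1024 samples used to find the zero level.
    noise_block: 256 samples characterized against that zero level."""
    zero_level = sum(zero_block) // len(zero_block)
    deviations = [s - zero_level for s in noise_block]
    energy = sum(abs(d) for d in deviations)           # absolute, not squared
    crossings = sum(1 for a, b in zip(deviations, deviations[1:])
                    if (a < 0) != (b < 0))
    peak = max(abs(d) for d in deviations)
    return zero_level, energy, crossings, peak
```

The four returned quantities are precisely what the detector needs: the zero reference, the two threshold features, and the peak used as the first-stage trigger.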

While in the active mode, the word-boundary detector constantly processes incoming data. The detection process compares the incoming data to the peak noise value, and promptly ignores the data unless the peak noise value is exceeded. When one value exceeds the peak noise value, the detection algorithm determines the energy and number of zero-crossings of it and the next 255 data values. If the energy and zero-crossings of the 256-sample block exceed the noise threshold level, then the start of a word is detected and the energy and zero-crossings data is passed to the training and recognition algorithms. The next 256 data values are then analyzed and their energy and zero-crossings data are likewise passed on. This process continues until the energy and zero-crossings of the incoming speech data fall below the noise threshold level, at which point the end of the word is indicated. Note that the data used to determine the start and end of a word is also used for the recognition process.
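A software sketch of this detection loop follows, under our assumptions that a single threshold pair (energy, zero-crossings) comes from the noise analysis and that blocks are 256 samples, as in the text. The names and the generator form are illustrative, not the VHDL.

```python
def detect_word(samples, zero_level, peak_noise, energy_thr, zc_thr):
    """Yield one (energy, zero_crossings) pair per 256-sample block of a word."""
    i, in_word = 0, False
    while i + 256 <= len(samples):
        if not in_word and abs(samples[i] - zero_level) <= peak_noise:
            i += 1                        # quiet: slide forward one sample
            continue
        block = [s - zero_level for s in samples[i:i + 256]]
        energy = sum(abs(d) for d in block)
        zc = sum(1 for a, b in zip(block, block[1:]) if (a < 0) != (b < 0))
        if energy > energy_thr and zc > zc_thr:
            in_word = True
            yield energy, zc              # pass the block's data onward
            i += 256                      # analyze the next block of the word
        elif in_word:
            return                        # fell below threshold: end of word
        else:
            i += 1                        # a lone loud sample, not a word
```

Note how a single loud sample that fails the block test is discarded, while a genuine word yields one feature pair per 256-sample block until its energy and zero-crossings fall back below threshold.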

When the training algorithm is activated, the user is prompted, via the 7-segment LED display, to speak a sequence of words, saying each word four times. The energy and zero-crossings data from each utterance of a word are stored in memory, until all four utterances have been received, at which point the data is averaged and stored in memory. After all the words have been trained, the memory contains a set of energy and zero-crossings data for each word, representing the speech characteristics of the words.

The recognition algorithm activates when the start of a word is detected. The algorithm then finds the difference between the incoming analyzed speech data and the data for each trained word. These differences are cumulatively summed, separately for each word, until the word ends. At the end of the incoming word, each trained word is represented by two distance values, one for energy and one for zero-crossings. These values are compared across the trained words to find the smallest energy and zero-crossings distances. If both smallest distances correspond to the same trained word, that word is chosen as the recognized word; otherwise the incoming word is deemed invalid and ignored.
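The decision rule can be sketched as follows. Names and numbers are illustrative; in the real design the comparison runs in hardware as the word arrives rather than on complete lists.

```python
def recognize(word_blocks, templates):
    """word_blocks: [(energy, zc), ...] for the incoming word.
    templates: {word: [(energy, zc), ...]} from training."""
    dist = {name: [0, 0] for name in templates}
    for name, tmpl in templates.items():
        for (e, z), (te, tz) in zip(word_blocks, tmpl):
            dist[name][0] += abs(e - te)   # cumulative energy distance
            dist[name][1] += abs(z - tz)   # cumulative zero-crossing distance
    best_e = min(dist, key=lambda n: dist[n][0])
    best_z = min(dist, key=lambda n: dist[n][1])
    return best_e if best_e == best_z else None  # disagreement: word is invalid

templates = {"up": [(100, 30), (90, 28)], "down": [(40, 10), (50, 12)]}
print(recognize([(95, 29), (92, 27)], templates))  # up
```

Requiring both features to pick the same word is what rejects utterances that resemble one template in energy but a different one in zero-crossings.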

 

 

 

 

 

The following diagram shows the processing steps during voice recognition.

 

[Block diagram: speech data in → energy calculation and zero-cross calculation → threshold comparison → distance comparison against trained energy and zero-cross peak values → distance comparators → word selector → recognized word]

 

 

 

 

 

 

Achievements

Speech Processing and Recognition

Two versions of the speech recognition and processing block have been created: the full version, which fits on a Flex 10K70 series FPGA, and a smaller version that fits on the Flex 10K20 series FPGA. The full version has been simulated without known error; however, due to time constraints, it has not been tested in hardware. Hardware testing of the full version will undoubtedly reveal some errors, since complete simulation of the system requires too much time. The smaller version has been tested on the FPGA.

To make the smaller version fit on the 10K20 FPGA, the energy analysis method was removed, leaving zero-crossings analysis as the only speech analysis method. Removing the energy analysis required that the zero level for the input speech data be set as a constant at 128, the midpoint of the 0-to-255 input data range. Relying only on zero-crossings analysis means that other words and loud noises will be recognized as one of the four trained words, since the recognition algorithm simply chooses the trained word that most closely matches the incoming word.

Space limitations have affected our design in other areas. The recognition and training algorithms have been limited to a total of four words, which allows for an easy comparison algorithm. More words require much more complex algorithms, which would take up a lot more space. Other features, such as a digital filter, have also been eliminated due to space and time constraints.

Using just zero-crossings analysis for the recognition algorithm limits the recognition accuracy. Zero-crossings analysis provides a crude representation of speech and should not be expected to provide above 90% recognition accuracy. The accuracy of this method depends on how different the trained words are from each other. The greater the difference, the better the recognition accuracy. Thus, the recognition accuracy depends on the set of words selected for training.

The word boundary detector is able to distinguish the start and end of words from the ambient noise; however, some modifications on the original algorithm were needed. Initially, the word boundary detector would often indicate the end of a word during the middle of a multi-syllable word, and then trigger again at the start of the next syllable, which confused the training and recognition algorithms. To fix this problem, the algorithm was modified so that the end of a word is detected when two zero-crossing counts fail to exceed the threshold. This modification allows for a low zero-crossing count in the middle of the word without triggering the end of the word signal.
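The modified end-of-word rule can be stated in a few lines. This is an illustrative sketch under our own naming; `zc_counts` stands for the per-block zero-crossing sequence produced by the detector.

```python
def end_of_word(zc_counts, zc_thr):
    """Return the index of the block at which the word ends, i.e. the
    second consecutive count at or below the threshold, or None."""
    misses = 0
    for i, zc in enumerate(zc_counts):
        misses = misses + 1 if zc <= zc_thr else 0
        if misses == 2:
            return i
    return None

# A dip in the middle of a word (the 8) no longer ends it; two low counts do.
print(end_of_word([40, 35, 8, 42, 7, 6, 5], zc_thr=10))  # 5
```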

The Analog to Digital Conversion Interface

In the creation of the ADC interface, various obstacles were overcome, ranging from deciphering poorly documented work from previous terms, to determining the physical reasons why certain approaches were not working, to faulty test equipment. At first, the input ADC circuit was to be taken directly from a previous design for the same purpose, completed by students in the previous term. Upon further investigation of this circuit, it was not clear why certain stages and components were used, nor even how they were connected. In light of this, it was decided that a re-design of the circuit would be beneficial, both from an educational standpoint and for confidence in the ability and stability of the circuit itself.

In the re-design, the first obstacle was to interpret the minimal detail provided on the specification sheet for the microphone element that was purchased. The only information given was a recommended voltage across the device and a small diagram of the connection. With no knowledge of the inner workings of the component, it seemed an impossible task to connect it successfully. The signal emanating from the element itself appeared to have no measurable voltage, and it was assumed that at least pre-amplification would be required before the signal could be verified. This problem was solved within the hour after assuming that the component had a variable impedance. This was later confirmed by the professor, with one small twist: the device, originally assumed to have a variable resistance, turned out to have a variable capacitance. The next stumbling block was that our pre-amp appeared to be clipping the signal at one-tenth of its intended value. This was quickly traced to a combination of faulty equipment (an oscilloscope probe was set to 1x internally although the switch on it indicated a setting of 10x) and a resistance that was off by an order of magnitude.
With these two points resolved, the design continued. Borrowing from the schematic of our colleagues in the previous term, it was assumed that the ADC0809 chip requires a stable, held signal for proper conversion to take place. Despite much time and energy devoted to studying the specification sheet and to operational testing, the sample-and-hold chip would not yield consistent results; in fact, the amplified signal appeared to be affected merely by connection to the powered chip. The entire stage was dropped from the new design under the new assumption that the ADC0809 chip would be self-sufficient in its conversion of signals. Testing later confirmed this assumption: the ADC0809 chip was fed a steady clock pulse, its input was connected to an oscillating source, the start-of-conversion signal was triggered, and the output was observed on the oscilloscope while the frequency of the signal was varied. With the circuit operational on both ends, but before connection, the signal's amplitude was capable of reaching voltages twice the allowable input limits of the ADC integrated circuit and therefore had to be limited. Three separate implementations of this limiting were tried. The first attempt involved running two diodes in series from ground to power and routing the signal through the node between them. In theory, this would forward-bias one of the diodes at the -0.7 V and +5.7 V marks; in practice, the voltage of the signal at that node ranged from -0.7 V to approximately 7.0 V. This was deemed unacceptable and the idea was dropped from further consideration. The next attempt made use of an operational amplifier's inability to drive its output beyond its supply voltages.
In light of this, a buffer was connected with its supplies tied to ground and +5 V. For reasons not fully investigated, the circuit oscillated up to nearly +8 V when a loud whistle was applied. The investigation ceased upon the implementation of our third and final attempt: limiting the signal's swing with a 5.1 V Zener diode. This final implementation provided a simple means of limiting the signal near the -0.7 V and 5.2 V levels. In the end, the design of the input ADC circuit proved to work fully.

The on-chip support for the ADC is based on work by our aforementioned colleagues. Their interfacing code was modified slightly to tailor it to our specific needs; the details of our modifications can be seen in the code, and involve adding an extra process to their input_reader.vhd to ensure synchronicity of the input data.

Test cases that were used to confirm proper operation

The Control Path

Testing of the control path was done extensively in simulation and confirmed by the results of real-time operation. A battery of simulations was performed to examine the behavior of the entity under various situations; in particular, simulations were done to test the training and initialization in every possible configuration, to ensure that nothing the user could do would cause the device to operate incorrectly. In the course of these simulations, some of the problematic issues that arose pertained not to the digital operation itself but to that of the overall device. The training sequence was highly dependent on the successful operation of the device: if, for instance, the voice training were done without knowledge of the ambient noise, the information gathered in the voice training could be rendered useless. Another problem, more directly related to the digital implementation, was the question of when the surf function was sensitive to the command to cease and exit into normal active operation. It was found that if the cease command was issued at a particular time, while the surf function was counting its delay, the request would be ignored. This was potentially fatal to the reliability of the surf function and was uncovered only after simulations specifically targeting that aspect of the design. In real-time testing, the control path was tested using dummy signals and longer time durations to confirm the accuracy of the simulations before the large-scale integration of the various parts. Although it proved rather tricky to manipulate the various signals manually and in the correct sequence, the majority, if not all, of the functionality was confirmed to the designer's satisfaction.

IR interfacing

The first attempt at implementing the IR module had the following results:

Note: Simulations of over 10 ms were performed on the higher-level entity to better check the design. This required many hours of simulation and yielded no useful results.

With the failure of the initial system, the design was rethought and completely redone. The control path was broken up into four separate controller entities: the decoder control, the top-level control, the address/display control, and the transmitter control (as shown in the hierarchy diagram). Each control path was remodeled after Altera's recommended state machine design (i.e., a Moore machine). Separating the sequential logic from the combinational logic was a key change in the new design. A new and improved functionality for the training mode was developed, and a more modular testing approach was taken.
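In the Moore style, the state is registered in one process and the outputs are derived combinationally from the current state alone. As a rough software analogue (the states, inputs, and output encodings here are invented for illustration):

```python
# Illustrative software analogue of the two-process Moore machine style:
# one "combinational" table computes the next state, a separate table
# derives outputs from the state alone, and the state is "registered"
# once per clock tick.
NEXT = {                      # next-state logic: (state, input) -> state
    ("idle", 1): "train",
    ("idle", 0): "idle",
    ("train", 1): "train",
    ("train", 0): "done",
    ("done", 0): "done",
    ("done", 1): "done",
}
OUTPUT = {"idle": 0b00, "train": 0b01, "done": 0b10}  # depends on state only

def run_moore(inputs, state="idle"):
    trace = []
    for x in inputs:
        trace.append(OUTPUT[state])   # Moore: output of the *current* state
        state = NEXT[(state, x)]      # state register updates on the clock edge
    return trace, state

trace, final = run_moore([1, 1, 0])
assert trace == [0b00, 0b01, 0b01] and final == "done"
```

Because the output depends only on the registered state, glitches in the input cannot propagate directly to the outputs, which is one reason this style synthesizes so cleanly.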

Due to the failures and the time-consuming nature of the simulations, a new testing strategy was developed. Each control path was designed to be capable of separately controlling a section of the system and was independently verifiable on the hardware. In this manner, separate portions of the overall design could be tested and verified on the hardware before being integrated. As each component was tested successfully, a new component would be added to the existing design. The strategy was to build up the overall module from existing, functioning components, eventually leading to a functioning overall system. This approach worked quite well, and the overall IR module entity was completed with only one major problem along the way. While testing (on the hardware) whether correct data was being stored in the memory, it became apparent that the system was not storing the expected sequences. This resulted in a long and time-consuming hunt for the bug causing the incorrect values to appear in memory. Eventually it was discovered that the system was being over-clocked. The module was modified to run at a slower clock, and the system performed as intended.
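The slower-clock fix reduces to simple divider arithmetic. A back-of-the-envelope sketch (the frequencies below are our illustrative numbers, not values measured from the final design): a toggle-style divider flips its output every f_in / (2 * f_out) input cycles.

```python
# Rough divider arithmetic (our numbers, for illustration only): a simple
# toggle-style clock divider flips its output every f_in / (2 * f_out)
# input cycles, so the achievable frequency is quantized.
def divider_terminal_count(f_in_hz, f_out_hz):
    """Input cycles per output toggle for a toggle-style divider."""
    return round(f_in_hz / (2 * f_out_hz))

def actual_output_hz(f_in_hz, terminal_count):
    """Frequency actually produced with that terminal count."""
    return f_in_hz / (2 * terminal_count)

# e.g. dividing a 25.175 MHz board oscillator down toward a 5 MHz target:
n = divider_terminal_count(25_175_000, 5_000_000)
assert n == 3                                  # toggle every 3 input cycles
assert round(actual_output_hz(25_175_000, n)) == 4_195_833  # ~4.2 MHz actual
```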

 

Space Requirements

 

 

Speech processing and recognition: 784 logic cells, 2 EABs

Input logic: 117 logic cells

Control path: 107 logic cells

Infrared Interface: 365 logic cells, 2 EABs

Total: 1373 logic cells

The total number of logic cells exceeds the capacity of the FLEX10K20, so we were forced to use the FLEX10K70.

 

 

 

Results of Experiments

Matlab Simulation of Speech Analysis and Recognition Algorithms

Before the speech analysis and recognition algorithms were implemented in VHDL, they were written in Matlab for evaluation and to generate test vectors for VHDL simulations. The Matlab algorithms were written to produce the same results as the VHDL implementations, using methods such as truncating the fractional portion of integer division results. The Matlab code is provided in Appendix C.
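The truncation detail matters because high-level languages do not all round division the same way hardware does. A small Python illustration of the distinction the Matlab models had to respect (the helper function is ours): floor division rounds toward negative infinity, while hardware integer division truncates toward zero.

```python
# Why the Matlab model truncates division results: Python's floor division
# rounds toward negative infinity, whereas hardware integer division
# (and Matlab's fix()) truncate toward zero. The two differ on negative
# intermediate values.
def hw_div(a, b):
    """Integer division truncated toward zero, matching hardware behaviour."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q

assert hw_div(7, 2) == 3 and (7 // 2) == 3        # agree for positive values
assert hw_div(-7, 2) == -3 and (-7 // 2) == -4    # differ for negative values
```

Without this care, the software reference model and the VHDL implementation would silently disagree on test vectors containing negative samples.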

Four different words were used to test the Matlab algorithms: "channel", "up", "down", and "TV". Each word was recorded four times to provide enough data for training. The input speech data samples used by the Matlab simulations were recorded as wav files using Microsoft's Sound Recorder program. Speech samples were digitally recorded with 8-bit resolution at an 8 kHz sampling rate.

 

The word boundary detector was the first algorithm tested. This algorithm could accurately detect the start and end of a word, as illustrated by the graphs on the following pages. These graphs show the raw speech waveform, the speech waveform after word boundary detection, and the zero-crossings and energy analysis for several different words.
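As a rough illustration of the idea (the block size and threshold below are invented for the example, not the report's actual parameters), a threshold-style boundary detector marks the word as spanning the first through last blocks whose short-time energy exceeds the ambient-noise level:

```python
# Simplified sketch of threshold-based word boundary detection (block
# length and threshold are illustrative assumptions): the word starts at
# the first block whose short-time energy exceeds the ambient-noise
# threshold and ends at the last such block.
def word_boundaries(samples, block_len=8, energy_threshold=50):
    energies = [
        sum(abs(s) for s in samples[i:i + block_len])
        for i in range(0, len(samples), block_len)
    ]
    active = [i for i, e in enumerate(energies) if e > energy_threshold]
    if not active:
        return None                     # no word detected
    return active[0] * block_len, (active[-1] + 1) * block_len

silence = [0, 1, -1, 0, 1, 0, -1, 0]
word = [40, -35, 30, -28, 25, -20, 18, -15]
start, end = word_boundaries(silence + word + silence)
assert (start, end) == (8, 16)          # the loud middle block is the word
```

This is why the ambient-noise sampling mode must be run before voice training: the threshold separating silence from speech comes directly from it.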

 

 

Using the results from the word boundary detector, the training and recognition algorithms were tested. The recognition algorithm was tested with the same data used for training the system. Two versions of the recognition and training algorithms were evaluated: one performed the speech analysis over blocks of 256 samples of the input speech data, while the other used blocks of 128 samples. The following tables show the results of the recognition tests for the zero-crossings analysis, the energy analysis, and the combined analysis.
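The recognition step itself can be sketched roughly as nearest-template matching over per-block features, with the combined result accepted only when both analyses agree (the distance metric and template values here are assumptions for illustration, not taken from the report):

```python
# Illustrative sketch of per-block template matching (distance metric and
# template values assumed): each trained word stores one feature value per
# block; a test word is classified by smallest total absolute distance.
def classify(test, templates):
    return min(templates, key=lambda w: sum(
        abs(a - b) for a, b in zip(test, templates[w])))

def combined_result(zc_word, energy_word):
    """Accept only when zero-crossings and energy analyses agree."""
    return zc_word if zc_word == energy_word else "invalid"

zc_templates = {"up": [5, 9, 7], "down": [2, 3, 4]}
assert classify([3, 4, 4], zc_templates) == "down"   # distance 2 vs 10
assert combined_result("up", "up") == "up"
assert combined_result("down", "channel") == "invalid"
```

This agreement rule explains the "invalid" entries in the tables below: the combined column matches only when the zero-crossings and energy columns name the same word.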

Recognition Results For Blocks of 256 Samples

Test Word     Zero-Crossings Result   Energy Analysis Result   Combined Result
down #1       up                      up                       up
down #2       up                      up                       up
down #3       down                    channel                  invalid
down #4       down                    down                     down
up #1         up                      up                       up
up #2         up                      up                       up
up #3         up                      up                       up
up #4         up                      up                       up
channel #1    channel                 up                       invalid
channel #2    channel                 channel                  channel
channel #3    channel                 tv                       invalid
channel #4    channel                 channel                  channel
tv #1         tv                      tv                       tv
tv #2         up                      up                       up
tv #3         tv                      tv                       tv
tv #4         up                      tv                       invalid

Total matches            12           10                       9
Recognition Accuracy     75 %         62.5 %                   56.3 %

Recognition Results For Blocks of 128 Samples

Test Word     Zero-Crossings Results  Energy Analysis Results  Combined Results
down #1       up                      down                     invalid
down #2       up                      up                       up
down #3       down                    down                     down
down #4       up                      down                     invalid
up #1         up                      down                     invalid
up #2         up                      down                     invalid
up #3         up                      up                       up
up #4         up                      tv                       invalid
channel #1    channel                 channel                  channel
channel #2    channel                 channel                  channel
channel #3    channel                 channel                  channel
channel #4    channel                 channel                  channel
tv #1         tv                      channel                  invalid
tv #2         up                      up                       up
tv #3         tv                      tv                       tv
tv #4         tv                      tv                       tv

Total Matches            12           10                       8
Recognition Accuracy     75 %         62.5 %                   50 %

From these results, the two versions have about the same performance. However, the method using 128-sample blocks requires more memory than the 256-sample method, since each word requires twice as many blocks to represent it.
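The storage cost behind this observation is simple arithmetic (the one-second utterance length is our assumption for illustration):

```python
# Rough arithmetic behind the memory trade-off (word length assumed):
# at 8 kHz, a one-second utterance is 8000 samples, so halving the block
# size doubles the number of feature blocks stored per word.
samples_per_word = 8000                    # ~1 s at the 8 kHz sampling rate
assert samples_per_word // 256 == 31       # blocks per word, 256-sample version
assert samples_per_word // 128 == 62       # twice as many for 128-sample blocks
```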

Testing of IR module

As mentioned above, a new approach to verifying the design was undertaken: hardware testing of small, simple components rather than simulation of more complex entities followed by hardware verification of higher-level entities. The module was tested in these stages:

Display and addressing:

This section allows the user to scroll through a list of IR commands and select which one to 'train' for. This is the new functionality mentioned above (the initial design did not allow the user to select the IR command). This sub-system displays the command on the seven-segment displays and also outputs the appropriate address location of the command (the starting location in memory of the code). It was tested by routing all would-be internal signals to observable/alterable I/O devices (such as LEDs, push buttons, and DIP switches) to confirm the correct operation of the entity. All aspects of this sub-section worked on the hardware.

 

Code sequence detection/storage:

This section independently receives the count values from the IR decoder (the IR decoder module, called bit stream counter in the diagram, was reused from the initial design) and performs the necessary incrementing of the address and output of data (count values) to the lpm_ram (also reused from the initial design). The entity sends a signal to the display and addressing control to inform it when a full code has been received. The display then lets the user know that a code was acknowledged (by showing an acknowledgment symbol on the seven-segment display). This sub-unit was tested in the same way as above: all would-be internal signals were routed to observable/alterable I/O devices, and the sub-system was exercised through its intended sequence of operation. The sub-system performed as required on the hardware. It was then integrated with the display and addressing sub-system, and the two were tested together using the same techniques; the integrated sub-systems also performed as required on the hardware. Code for the entity used to test the combination of these two sub-systems can be found in Appendix A and is called 'detector_module.vhd'.
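The counting idea can be sketched in a few lines (a software model for illustration, not the lpm counter implementation): a counter measures how many clock ticks the demodulated IR line stays high, and the stored run lengths later distinguish long marks from short ones.

```python
# Hypothetical software model of the bit stream counter: measure the run
# length (in clock ticks) of each high period on the demodulated IR line.
# The stored counts later distinguish long marks (logical 1) from short
# marks (logical 0).
def pulse_counts(samples):
    """Return the run lengths of the high periods in a 0/1 sample stream."""
    counts, run = [], 0
    for level in samples:
        if level:
            run += 1
        elif run:
            counts.append(run)
            run = 0
    if run:
        counts.append(run)
    return counts

assert pulse_counts([1, 1, 1, 0, 1, 0, 0, 1, 1, 0]) == [3, 1, 2]
```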

 

Top level training mode and memory verification:

To complete the full operation of the system for training mode (i.e., grabbing codes and dumping them into the appropriate memory location), the sub-systems described above were integrated with a top-level control, which selects between a training and a transmitting mode, and the lpm memory. To verify that correct data was being stored in memory, the transmitting mode of the top-level controller was replaced (for modular verification purposes) with a "view memory" mode. This mode allowed the data in memory to be output to 12 LEDs (the width of the data), showing the contents of the memory. A feature was also built in that allows all memory locations to be scrolled through one at a time. In this manner, the full training side of the IR module was tested on the hardware. Results are currently being verified but appear to show correct operation of the module thus far. Code for this entity and all the sub-entities and components can be found in Appendix A. Since no significant simulation was performed, no waveforms are presented. A demonstration can be arranged to show that full testing and verification of the system has been accomplished successfully. The code for the current entity used to test the system on the hardware can be found in Appendix A and is called 'frontmodule.vhd'.

Transmission verification from ROM:

The transmission module was first tested on the hardware by separating it from all other components and having it read values from a ROM[11]. This allowed for the verification of this entity independently of the rest of the module. The correct transmission of codes was verified by using the demodulator circuit to detect and observe the bit sequence on an oscilloscope. The oscilloscope waveform was compared to the actual output from a remote control and proved to be virtually identical. Below is the waveform of the transmitted output as seen on the oscilloscope.
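The pulse-width frame the transmitter reproduces can be modelled from publicly documented Sony SIRC timings (these are textbook values, not measurements from our hardware; Sony's nominal carrier is 40 kHz, while our modulator runs at 38 kHz): a 2.4 ms header burst, then each bit as a mark of 1.2 ms (logical 1) or 0.6 ms (logical 0) followed by a 0.6 ms space.

```python
# Sketch of a Sony-style pulse-width frame using publicly documented SIRC
# timings (not measured from our hardware): 2.4 ms header, then each bit
# as a 1.2 ms (1) or 0.6 ms (0) mark followed by a 0.6 ms space. All marks
# ride on the modulated carrier.
HEADER_MS, SPACE_MS = 2.4, 0.6
MARK_MS = {0: 0.6, 1: 1.2}

def sirc_durations(bits):
    """Return the (mark, space) durations in ms for a bit sequence, LSB first."""
    out = [(HEADER_MS, SPACE_MS)]
    out += [(MARK_MS[b], SPACE_MS) for b in bits]
    return out

frame = sirc_durations([1, 0, 1])
assert frame == [(2.4, 0.6), (1.2, 0.6), (0.6, 0.6), (1.2, 0.6)]
```

Because the information is carried entirely in the mark durations, the decoder on the receive side only needs the count values described above to recover the bits.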

[Oscilloscope capture of the transmitted IR waveform]

Integration of Transmission and detection entities:

The frontmodule.vhd entity was integrated with the transmission module to form the IRmodule.vhd. This final entity was also tested independently of the other sub-systems before integration.

 

 

 

Concluding thoughts:

Overall, our system performed according to the goals laid out at the beginning of the term, although it became much more limited in its vocabulary and IR code compatibility than our initial plan. This was a result of the limitations of the FLEX10K20 board and the time constraints of a smaller group. Our group lost its fourth member after the project proposal, yet we nonetheless attempted to complete the initial functionality laid out in the proposal. The complexity of our system was perhaps incongruent with the size of our group and the man-hours available for the project. The amount of documentation required was the largest obstacle hindering our efforts to complete the project. Invaluable knowledge was gained in the areas of digital design, signal processing and recognition, and IR coding.

 

 

 

 

 

References

[1] Asiedu-Ampem, P. et al. EE 552 Application Notes – Clock Divider. Internet: 1998.

www.ee.ualberta.ca/~elliot/ee552/studentAppNotes/98f/clk_div/clk_div.html

[2] Bensler, T. and E. Chan. Interfacing External SRAM. Internet: 1999.

www.ee.ualberta.ca/~elliot/ee552/studentAppNotes/99w/SRAM/

[3] Bo, N., K. Leung, and D. Ritter. FIR Filter Design. Internet: 1999.

www.ee.ualberta.ca/~elliot/ee552/studentAppNotes/99w/FIRFilter/

[4] Furui, S. Digital Speech Processing, Synthesis, and Recognition. New York: Marcel Dekker,

Inc., 1989.

[5] Gould, D., K. Grant, and A. Stanley-Jones. Voice-Activated RC Car. Internet: 1999.

www.ee.ualberta.ca/~elliot/ee552/projects/99w/voice_activated_rc_car/

[6] Robinson, T. Speech Analysis. Internet: 1998. svr-www.eng.cam.ac.uk/~ajr/SA95/

[7] Yadunandana, R. Speech Recognition Using Hidden Markov Models. Internet: 1999.

www.angelfire.com/ny/yadunandana/report1.html

[8] Willmott, K. IR Remote Control Codes. Newsgroup posting displayed via Internet: 1999.

http://www.hut.fi/Misc/Electronics/docs/ir/ircodes.html

http://www.ee.washington.edu/eeca/text/ircodes.txt

[9] University of Washington. Circuit Archive. Internet: 1999.

http://www.ee.washington.edu/eeca/circuits/

 

[10] University of Alberta. EE 480 gcd package VHDL code. Internet: 1999.

http://www.ee.ualberta.ca/~ee480

[11] University of Alberta. EE 552 Lab 6 ROM VHDL code. Internet: 1999.

http://www.ee.ualberta.ca/~elliott/ee552

 

 

 

 



SGM Electronics

Data-sheet for Product: Proto-A Voice-IR interface system

 

General Description

The Proto-A Voice-IR interface is an integrated system composed of an Altera FLEX10K70 field programmable gate array mounted on the UP1 board, a condenser microphone, a dual 741 op-amp, an 8-bit analog-to-digital converter, an IR demodulator, and an IR-LED transmitter. The Proto-A has a preprogrammed logic configuration downloaded onto the FLEX10K70. This allows the FPGA to interact with the peripherals listed above to provide the user with superior voice control of consumer audio/visual (A/V) electronics equipment. The system has four modes of operation, giving it a wide range of user and A/V electronics interaction. The first mode samples the ambient noise to set the threshold for speech training and recognition. The second mode allows the user to train the system to their specific voice commands. The third mode allows the user to train the Proto-A to transmit their A/V electronics' IR codes. The fourth mode allows the user to utter voice commands, whereupon the Proto-A transmits the appropriate IR code, thus controlling the A/V device by voice. The Proto-A features zero-crossing and energy-analysis speech processing techniques for speaker-dependent recognition of up to four voice commands. The IR demodulator allows the user to send the Proto-A codes for memory storage and subsequent transmission via the IR LEDs.

The device eliminates the need for the tedious, repetitive, and wearing motions of continually pushing the buttons of A/V electronics remote controls. Easy, wireless, and exclusive control is made possible with the Proto-A system. The design of the Proto-A has been optimized to include such calorie-saving functions as 'surf', which allows a single IR command to be repeated every 3 seconds. These features make the Proto-A ideally suited for applications ranging from 'couch-potato' to 'TV-hog'.

 

Features

 

Key Specifications

 

 

Block Diagram/ Schematic

 

 

[Schematic: IR-LED transmitter output driven from pin B; IR demodulator input on pin A; Vcc = 5 V]

Pin Connections to UP1 Board Expand A-slot

D0 Pin 20

D1 Pin 21

D2 Pin 22

D3 Pin 23

D4 Pin 24

D5 Pin 25

D6 Pin 26

D7 Pin 27

SOC Pin 28

EOC Pin 29

A Pin 15

B Pin 17

 

 

Appendix A

This appendix contains all the VHDL code for the IR module including the code of individually tested components. Simulations of key entities are also contained in this appendix.

 

Index:

 

frontmodule.vhd – entity used for testing training mode of IR module

topview.vhd – top level control of IR module, modified for viewing the contents of memory during hardware verification.

Mux_1.vhd – 1-bit, 2-input multiplexer for control of address incrementing and clearing (modified from EE480 web-site) [10]

Simplecontrol.vhd – control path for displaying and addressing

delay_counter – introduces a delay flag for portions of design requiring a delay

IR_decoder – two lpm counters that count the demodulated IR bit stream and send the count values

IRmodule – Top level IRmodule entity

Irransmit.vhd – top level entity for transmission

Mux2N_N.vhd – two-input, N-bit-wide multiplexer, taken and modified from EE480 web-site [10]

Modulator.vhd – entity that outputs 38kHz 50% duty cycle signal.

Rom.vhd – used for independent test of testtrans2 – taken and modified from EE552 web-site[11]

Testtrans2.vhd – entity used for independent test of transmission module

Filter.vhd – used to display and increment code requested during training mode

Mother_control.vhd – this is a modified and final version of the detector control path

RegisterN.vhd – an N-bit register used for registering the count values from IR_decoder to mother_control

Subtestdet.vhd – the sub-control responsible for the write-enable and address-increment functions of mother_control

Testdet.vhd – this is the main control for mother_control.

Topcontrol.vhd – this is the top level control of the IR module.

Trans_control – this controls the modulation of the out-going IR code.

Detector_module.vhd – entity used for testing the counting sequence of IR bit stream input