EE552 Project:

Voice Recognition Telephone Dialer

Final Report

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Prepared for Dr. Elliott

by Tim Golding                        

     Eric Cheung                        

     Felicia Cheng                      

     Wilson (Tin) Kwan

     David Li                                         


Abstract

 

            Our project is a voice recognition telephone dialer system.  The first major goal of this design is to develop a system with a vocabulary of between four and twenty or more words.  In addition, the system will be capable of generating the required DTMF (dual-tone multi-frequency) signals to permit direct dialing of the telephone number.  Finally, the system will have a user-friendly interface to permit easy programming of hot numbers.

            The user can store a vocabulary consisting of the digits “zero” through “nine”, the commands “dial” and “delete”, and various hot words, such as “help” to dial 911 or “Susan” to call a friend.  A zero-crossing pattern recognition algorithm is used for the speech matching.

            The system has the capability to generate the DTMF signals that allow direct dialing of the number.  The present approach is to shape each tone with pulse-width modulation and then low-pass filter the result.

            The user interface will require both LCD and keypad interfacing and we believe that there are existing application notes that could be incorporated here.  We believe that we can implement this design on the Altera UP1 Education board.


Description of operation

Introduction

 

            The voice recognition telephone dialer is a telephone accessory that permits hands-free dialing of telephone numbers by speaking simple words.  Many people will find this accessory very useful, especially people with disabilities or those who suffer attacks from diseases such as MS or arthritis.

            The voice recognition telephone dialer will be easy to learn and will have a user-friendly interface.  This is necessary because a training mode is required to increase word recognition accuracy.  The recognition system is speaker dependent: if another user wishes to operate the dialer, they will need to re-train the system for their own voice patterns.  With word recognition it should be possible to dial a telephone number by speaking the digits of the number and then issuing the dial command.

            An additional feature that may be included with the dialer is the ability to store hot numbers.  These hot numbers will be stored in memory and a specific word or name will be assigned to this number.  When this word is spoken the dialer will recognize it, and the full number will then be displayed and dialed when the dial command is given.  This feature will require a programming mode and user interface so these numbers can be stored in memory.  Other features that may be included are DTMF generation, keypad and LCD interfacing.

Operation

 

            As stated in the introduction, the best way to design an accurate word recognition system is to make it user specific.  Because the memory used in this design is nonvolatile, the system does not need to start in training mode each time power is restored.  However, training mode can be selected through the user interface, mainly so that the user can add hot numbers or try to increase the recognition rate.

            When initiated, training mode requires the user to repeat each digit or command a few times.  Once training is completed, the dialer enters active mode.  While in active mode, the system continuously monitors the microphone for a recognizable word and the keypad for a signal.  Whenever the dialer recognizes a digit, that digit is stored in a register in memory and simultaneously displayed on the LCD panel.  The system continues this process until 11 digits are in memory or the dial command is issued.

If a wrong digit is entered, it can be deleted with a clear (delete) command, which overwrites the last entry in the register.  Once the user has entered the number they wish to dial, the dial command is given to dial the number.  The number in the register is then passed to the dialer to generate the DTMF signals.  The system returns to active mode once the call is terminated.
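The register behaviour described above can be sketched in a few lines of Python.  This is only an illustration of the control flow, not the project's VHDL: the function name `run_active_mode`, the use of strings for recognized words, and the choice to ignore further digits once the 11-digit limit is reached are all assumptions made for the example.

```python
MAX_DIGITS = 11  # register limit stated in the text


def run_active_mode(words):
    """Return the number handed to the dialer, or None if "dial" never arrives.

    `words` is a sequence of already-recognized words: the digit characters
    "0".."9" plus the commands "delete" and "dial" (a hypothetical encoding).
    """
    register = []
    for word in words:
        if word == "dial":
            return "".join(register)
        if word == "delete":
            if register:
                register.pop()  # drop the last entry in the register
        elif len(word) == 1 and word.isdigit():
            if len(register) < MAX_DIGITS:
                register.append(word)  # in hardware the digit is also shown on the LCD
    return None
```

For example, speaking "5", "5", "5", "delete", "1", "dial" would leave "551" in the register when the dial command arrives.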

 

Control flow diagram


 


Voice recognition methodology

                The Voice Recognition System must generate a signal whenever any familiar voice instruction is detected.  In order to accomplish this, the analog signal must be manipulated and transformed so the dialer can recognize the word.  First, a special algorithm is needed for word boundary detection.  Once the beginning of the word is identified, the word data will be passed to the next stage for pattern recognition.

            The pattern recognition method has changed from the resource requirement document.  We will no longer use Dynamic Time Warping for pattern comparison.  Because time constraints limited our research, we are using a simpler recognition method based on Zero Crossing and Average Power.

            The idea behind Zero Crossing and Average Power is simple.  Voice data consists of data points that oscillate between positive and negative voltage, and higher-frequency data oscillates at a higher rate.  One way to extract the frequency characteristic of the data is to count the number of times the waveform crosses the x-axis, in other words, the zero crossings.  As the name suggests, Average Power is simply the average power of the waveform during the time identified as a word.

            During the training mode, the zero crossing and average power information of the templates will be stored into memory.  During the active mode, any input that can be identified as a word is passed to the algorithm to extract the zero crossing and average power.  Then these data will be compared with each template to find a match.  Upon a match with high accuracy, that template will be declared as a winner and a signal will be sent to the system.
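The feature extraction and template comparison described above can be illustrated with a short Python sketch.  It is not the hardware implementation: the function names, the relative-error score, and the `tolerance` threshold are assumptions chosen for the example, since the report does not specify how "a match with high accuracy" is measured.

```python
def zero_crossings(samples):
    """Count sign changes between consecutive samples."""
    count = 0
    for prev, cur in zip(samples, samples[1:]):
        if (prev < 0 <= cur) or (cur < 0 <= prev):
            count += 1
    return count


def average_power(samples):
    """Mean of the squared sample values over the word."""
    return sum(s * s for s in samples) / len(samples)


def features(samples):
    """Feature pair stored for a template or extracted from an input word."""
    return zero_crossings(samples), average_power(samples)


def best_match(word, templates, tolerance=0.2):
    """Return the name of the closest template, or None if nothing is close.

    `templates` maps a word name to its stored (zero crossings, power) pair.
    The score is the sum of the relative errors of the two features
    (an assumed metric, used here only for illustration).
    """
    zc, pw = features(word)
    best_name, best_err = None, None
    for name, (tzc, tpw) in templates.items():
        err = abs(zc - tzc) / max(tzc, 1) + abs(pw - tpw) / max(tpw, 1e-12)
        if best_err is None or err < best_err:
            best_name, best_err = name, err
    if best_err is not None and best_err <= tolerance:
        return best_name
    return None
```

In training mode the `features` of each spoken template would be stored; in active mode `best_match` plays the role of declaring a winner.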

 

Matlab interpretation of the word 'two' in a wave file format:


Hardware requirements

            To implement this design, the following major components are needed.  The first is a re-programmable FPGA.  After examining the various selections available, we decided to use the Altera UP1 education board.  The main deciding factor is that the UP1 board has two devices: the EPM7128S with 2,500 gates and the EPF10K20 with 20,000 gates.  In addition, the education board has a large number of input and output pins available on its expansion ports.

            Since we are dealing with analog signals, the analog signal must be converted to a usable digital format.  The easiest way to do this is to use an analog-to-digital converter that is presently available on the market.  After examining various types of converters, the team decided to use the AD7823, which is manufactured by Analog Devices; its data sheet is included in Appendix I.

            To provide a large enough signal, with a bandwidth of 5 kHz, to the A/D converter, some associated analog circuitry is needed, specifically a band-pass filter (high-pass and low-pass filters) and an amplifier.  The decision was made to implement the analog filter and any amplifiers with the LM741 operational amplifier.  This op-amp is used in the circuits to amplify both the input analog signal and the outgoing DTMF signal.  Its data sheet is also included in Appendix I.

 

Since this design deals with a top frequency of around 5 kHz, the minimum sampling rate is given by the Nyquist rate:

Nyquist Rate

            Tsample < 1/(2B)

            Tsample = sampling interval

            B = Bandwidth

By applying the Nyquist rate we can approximate the memory needed to store words and templates.  Since the bandwidth is around 5 kHz, the sampling interval must be at most about 100 µs (a sampling rate of 10 kHz), which gives 10,000 samples for a word lasting one second.  The A/D converter represents each sample with 8 bits of data.  Therefore, each word will require around 80,000 bits of memory.

            The goal of this design is to achieve a substantial vocabulary of around 20 words.  Therefore, this design will require a maximum of 1.6 Mbits to 2.0 Mbits of memory.  However, this requirement may be lowered by applying the method outlined in the voice recognition methodology section.  Since the Altera UP1 board operates at 5 volts, the decision was made to use static RAM, specifically the DS1258Y 128K×16 from Dallas Semiconductor.
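The memory estimate can be checked with a short calculation.  The sketch below is illustrative only; the one-second word length is an assumption implied by the figure of 10,000 samples per word, and the function names are made up for the example.

```python
def sampling_interval_limit(bandwidth_hz):
    """Largest sampling interval allowed by the Nyquist rate, in seconds."""
    return 1.0 / (2.0 * bandwidth_hz)


def bits_per_word(bandwidth_hz, word_seconds, bits_per_sample):
    """Raw storage for one word sampled at the Nyquist rate."""
    sample_rate = 2 * bandwidth_hz
    return int(sample_rate * word_seconds) * bits_per_sample


BANDWIDTH_HZ = 5000      # top frequency from the text
WORD_SECONDS = 1.0       # assumed word length (gives 10,000 samples)
BITS_PER_SAMPLE = 8      # A/D resolution from the text
VOCABULARY = 20          # target number of words
SRAM_BITS = 128 * 1024 * 16  # 128K x 16 SRAM

per_word = bits_per_word(BANDWIDTH_HZ, WORD_SECONDS, BITS_PER_SAMPLE)
total = per_word * VOCABULARY

if __name__ == "__main__":
    print("interval limit (s):", sampling_interval_limit(BANDWIDTH_HZ))
    print("bits per word:", per_word)
    print("bits for vocabulary:", total)
    print("fits in SRAM:", total <= SRAM_BITS)
```

Under these assumptions a 20-word vocabulary of raw samples needs 1.6 Mbits, which fits within the 128K×16 (2,097,152-bit) SRAM.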

 

Demonstration methodology

            Using the above components, we expect to be able to provide a working prototype by the end of the semester.  To save time during the demonstration, we will train the dialer beforehand; since the memory is nonvolatile, the data will not be lost even if the system is not powered.  During the demonstration a “Hot Number” could be configured and various numbers could be dialed using the voice recognition system.

 


IC Circuits

The following page has three separate analog circuits that will be needed to interface the voice recognition dialer with the outside world.  A brief description of each circuit and its use follows:

 

1)         The first circuit is an 800 to 3200 Hz active band-pass filter.  This filter limits the input frequency range presented to the A/D converter, mainly to limit the amount of data that has to be handled by the word recognition circuit.  The first op-amp is a high-pass filter with a 3 dB frequency of 800 Hz.  The second op-amp is a buffer that isolates the previous stage from the output stage and permits a sharper response.  The final op-amp is configured as a low-pass filter with a 3 dB frequency of 3200 Hz.  The 3 dB frequency was set by selecting the resistors according to the following formula:

                                    R = 1/(2*pi*F3dB*C)

 

2)         The amplifier is used to amplify the microphone signal to the level required by the A/D converter.  In addition, this circuit is also used to amplify the DTMF signal.  The magnitude of the gain of the inverting op-amp can be adjusted if required according to the following formula:

                                    Av = R1/R2

R1 will be a variable resistor so that the gain of the amplifier can be adjusted as needed.

 

3)         The telephone line interface is used to connect the DTMF section of the design to the telephone system.
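The two design formulas in the list above can be evaluated numerically.  The sketch below is only an illustration: the 10 nF capacitor and the example resistor values are assumptions, not the component values used in the project circuits.

```python
import math


def resistor_for_cutoff(f3db_hz, capacitance_f):
    """R = 1 / (2 * pi * F3dB * C), in ohms."""
    return 1.0 / (2.0 * math.pi * f3db_hz * capacitance_f)


def inverting_gain_magnitude(r1_ohms, r2_ohms):
    """Magnitude of the inverting amplifier gain, Av = R1 / R2."""
    return r1_ohms / r2_ohms


C_ASSUMED = 10e-9  # hypothetical 10 nF capacitor for both filter stages

if __name__ == "__main__":
    for f3db in (800, 3200):  # high-pass and low-pass corner frequencies
        print(f3db, "Hz ->", round(resistor_for_cutoff(f3db, C_ASSUMED)), "ohms")
    # Example: a 100 kOhm setting of the variable resistor R1 with R2 = 10 kOhm
    print("gain magnitude:", inverting_gain_magnitude(100e3, 10e3))
```

With the assumed capacitor, the 800 Hz stage needs roughly four times the resistance of the 3200 Hz stage, as expected from the inverse relationship in the formula.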

 

 


Datasheet (voice recognition telephone dialer)

 

Features

 

-          Built-in 16-character x 2-line LCD display

-          Supply voltage: 5 V ± 10%

-          10-year minimum data retention in the absence of external power

-          Data is automatically protected during a power loss

-          Auto-protection from loud input signals

-          Maximum speed: 12.5 MHz

 

 

I/O pin descriptions

| Pin No. | Mnemonic | Description | Signal direction |
| --- | --- | --- | --- |
| 1 | VIN+ | Positive input power (+5V) | Input |
| 2 | VIN- | Negative input power (-5V) | Input |
| 3 | GND | Ground reference | Input |
| 4 | MODE | Selects training mode or user mode.  When MODE is low, the system is in training mode: whenever Reset is pressed, the user has to train the system before use.  When MODE is high, a reset bypasses the training step. | Input |
| 5 | Start | The system starts training or normal use, according to MODE | Input |
| 6 | AINPUT | Analog input | Input |
| 8 | Output | Output frequency corresponding to the analog input | Output |
| 9 | Reset | When Reset is low, the system resets itself according to MODE | Input |
| 11-14 | Row 0-3 | Reserved for keypad (16 buttons –001); Row0-3 correspond to Row0-3 of the keypad | Output |
| 15-18 | Col 0-3 | Reserved for keypad (16 buttons –001); Col0-3 correspond to Col0-3 of the keypad | Input |

 

General Description

The voice recognition telephone dialer (VRTD) is a hands-free telephone dialer with a user-friendly interface that includes an LCD display.  The VRTD can also be operated with a specific keypad (16 buttons –001) if desired.  The operation of the voice recognition telephone dialer is easy to learn.  Before using the VRTD, the user needs to train the system in training mode in order to increase word recognition accuracy.  The voice recognition telephone dialer can recognize up to four hot numbers.

 

 

 


 

Design hierarchy

 

 

| Module | Description |
| --- | --- |
| A/D Converter | ADC controller |
| SRAM | SRAM controller |
| Main User Interface | Control flow of system |
| LCD | LCD controller |
| Keypad | 4x4 keypad controller |
| DTMF | Dial tone generator |
| Hot Number Programming | For quick dial |
| Voice Recognition Algorithm | Control flow of algorithm |

 

 

Design hierarchy chart


Keypad

Description of operation

 

The keypad decoder was modified from the existing keydecode VHDL file from the application note.  It applies a common column- and row-scanning algorithm, implemented as a Moore state machine.  The state machine starts by driving all the columns to zero and checking whether any of the rows, which are connected to 4.7 kΩ pull-up resistors, has been pulled to zero.  A row signal at zero indicates that a key has been pressed.  The state machine then enters the prepare-drive state and waits until the key-bounce transients die down.  Next, it detects which particular row of the keypad has gone to zero, and then which column is driven to zero at the intersection with that row.  The pressed key is identified by the intersection of the zero row and the zero column.  The resulting key value is latched into an output register, where it is held until another key press is detected.  In addition, the keyvalid signal is asserted for one clock period following the detection of a valid key press.  The key value is output to the LED decoder, which displays the corresponding character on the LED.  Furthermore, the key value decoder in the tone generator module decodes the key value to generate the corresponding output tone.
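The scanning idea can be modelled in software to make the row and column intersection concrete.  The Python sketch below is a behavioural illustration only, not the VHDL state machine: the key layout in `KEYS`, the `make_keypad` simulator, and the one-column-at-a-time search are assumptions, and debouncing is left out.

```python
KEYS = ["123A", "456B", "789C", "*0#D"]  # assumed 4x4 layout, indexed [row][col]


def make_keypad(pressed):
    """Simulate a keypad; `pressed` is a (row, col) pair or None.

    The returned function takes the four column levels (0 = driven low) and
    returns the four row levels, which idle high because of the pull-ups.
    """
    def read_rows(col_levels):
        rows = [1, 1, 1, 1]
        if pressed is not None:
            row, col = pressed
            if col_levels[col] == 0:
                rows[row] = 0  # the pressed key connects the low column to its row
        return rows
    return read_rows


def scan(read_rows):
    """Return the pressed key character, or None if no key is pressed."""
    rows = read_rows([0, 0, 0, 0])  # drive every column low
    if 0 not in rows:
        return None
    row = rows.index(0)
    for col in range(4):  # drive one column low at a time
        levels = [1, 1, 1, 1]
        levels[col] = 0
        if read_rows(levels)[row] == 0:
            return KEYS[row][col]
    return None
```

For instance, with the key at row 1, column 2 held down, `scan(make_keypad((1, 2)))` identifies the "6" key under the assumed layout.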

 

LCD

Description of operation

 

The LCD encoder was modified from the code in the 1997 application notes.  A Mealy state machine is used in this design.  The LCD encoder initializes the LCD by setting the appropriate enable, read/write, and register select signals.  After initialization, the encoder processes the 4-bit input signal from the keypad and sends the corresponding 8-bit data signal to the LCD to display the number.  A time delay is associated with every state because the LCD needs a specific period of time to process each instruction.

 

DTMF

 

            DTMF, or dual-tone multi-frequency signaling, is how touch-tone telephones operate.  The system combines two frequencies that are not harmonically related to generate a tone that can be recognized by the telephone company.  To achieve this, the following tones are used according to the figure below:

 

 

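|        | 1209 Hz | 1336 Hz | 1477 Hz | 1633 Hz |
| --- | --- | --- | --- | --- |
| 697 Hz | 1 | 2 | 3 | A |
| 770 Hz | 4 | 5 | 6 | B |
| 852 Hz | 7 | 8 | 9 | C |
| 941 Hz | * | 0 | # | D |

Note: the 1633 Hz column (A to D) is not used.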

 

 

            By pressing the appropriate button the two intersecting tones are generated together to produce a unique tone.  Two approaches were initially taken to design a tone generator.  The first approach was to use a pulse-width modulation scheme to generate the tones.  However, due to time constraints it was decided that a simpler approach was needed.
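The mapping from a key to its pair of tones can be written as a small lookup.  The Python sketch below is for illustration and uses the frequencies from the table; the function names, the 8 kHz sample rate, and the 50 ms duration are assumptions, and it generates ideal sine tones rather than the square waves produced by the hardware.

```python
import math

ROW_HZ = [697, 770, 852, 941]
COL_HZ = [1209, 1336, 1477, 1633]
KEYS = ["123A", "456B", "789C", "*0#D"]  # keys at each row/column intersection


def dtmf_pair(key):
    """Return the (row frequency, column frequency) pair for a key."""
    if len(key) == 1:
        for r, row in enumerate(KEYS):
            c = row.find(key)
            if c >= 0:
                return ROW_HZ[r], COL_HZ[c]
    raise ValueError("not a DTMF key: %r" % (key,))


def dtmf_samples(key, sample_rate=8000, duration=0.05):
    """Sum of the two sine tones for `key`, scaled to stay within [-1, 1]."""
    low, high = dtmf_pair(key)
    n = int(sample_rate * duration)
    return [
        0.5 * (math.sin(2 * math.pi * low * i / sample_rate)
               + math.sin(2 * math.pi * high * i / sample_rate))
        for i in range(n)
    ]
```

For example, `dtmf_pair("5")` gives the 770 Hz row tone and the 1336 Hz column tone.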

 

            As noted in an application note, any frequency can be generated by simply dividing down the master clock.  This was the approach taken for this portion of the design, and the complete code for this block is included in Appendix I.  Because of the long periods of the waveforms, it was impossible to test the complete design in Altera MAX+plus II at the required frequencies.

 

            However, some basic testing of the theory behind this design was performed by using short-period signals to test the control of the design.  Once the basic principle had been tested, the next step was to determine whether the design would generate the proper frequencies to within 1.5% of the required values.
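The clock-division approach and the 1.5% target can be checked on paper before measuring the board.  The sketch below is an illustration under stated assumptions: it assumes a 25.175 MHz master clock and an output that toggles every N clock cycles, neither of which is confirmed by this report, and the function names are made up for the example.

```python
MASTER_CLOCK_HZ = 25_175_000  # assumed master clock; substitute the clock actually used

DTMF_HZ = [697, 770, 852, 941, 1209, 1336, 1477]


def divider_for(clock_hz, target_hz):
    """Number of master-clock cycles between output toggles."""
    return round(clock_hz / (2 * target_hz))


def generated_hz(clock_hz, divider):
    """Square-wave frequency when the output toggles every `divider` cycles."""
    return clock_hz / (2 * divider)


def percent_error(clock_hz, target_hz):
    """Difference between generated and required frequency, in percent."""
    actual = generated_hz(clock_hz, divider_for(clock_hz, target_hz))
    return 100.0 * abs(actual - target_hz) / target_hz


if __name__ == "__main__":
    for f in DTMF_HZ:
        n = divider_for(MASTER_CLOCK_HZ, f)
        print(f, n, round(generated_hz(MASTER_CLOCK_HZ, n), 2),
              round(percent_error(MASTER_CLOCK_HZ, f), 4))
```

Because the dividers are large integers for a clock of this order, the rounding error is far below the 1.5% target, so any larger error measured on the board would have to come from another source.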

 

            The complete program, which uses the toggle switches on the UP1 board to enable and select the frequencies, was downloaded to a UP1 board and the frequencies were measured.  The following table has the required values and the measured values along with the percent difference of the frequencies.

 

            As seen from the above table, all the required frequencies are successfully generated using this method.  However, another problem came to light during testing: the dialer did not seem to work.  The suspected cause is that a square wave contains a large number of harmonics, and these harmonics may be confusing the telephone system.  To overcome this problem, a low-pass filter network is now being employed to eliminate these harmonics.  This design has yet to be fully tested, but we believe that the filtering will eliminate enough of the harmonics to make dialing possible.
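The harmonic content behind this suspicion can be estimated.  An ideal 50% duty-cycle square wave contains only odd harmonics, with the n-th harmonic at 1/n of the fundamental's amplitude.  The Python sketch below lists them and shows the effect of a filter; the first-order RC response and any cutoff frequency passed to it are assumptions for illustration, since the report does not give the order or cutoff of the filter network.

```python
import math


def square_wave_harmonics(fundamental_hz, count):
    """First `count` harmonics of an ideal 50% duty-cycle square wave.

    Returns (frequency, amplitude relative to the fundamental) pairs.
    """
    return [(fundamental_hz * n, 1.0 / n) for n in range(1, 2 * count, 2)]


def rc_lowpass_gain(freq_hz, cutoff_hz):
    """Magnitude response of a first-order RC low-pass filter (assumed filter type)."""
    return 1.0 / math.sqrt(1.0 + (freq_hz / cutoff_hz) ** 2)


def filtered_harmonics(fundamental_hz, cutoff_hz, count=4):
    """Harmonic amplitudes after filtering, relative to the filtered fundamental."""
    base = rc_lowpass_gain(fundamental_hz, cutoff_hz)
    return [
        (f, a * rc_lowpass_gain(f, cutoff_hz) / base)
        for f, a in square_wave_harmonics(fundamental_hz, count)
    ]
```

For the 697 Hz tone, the third harmonic falls at 2091 Hz, well above every DTMF frequency, so it is the kind of component the low-pass network is meant to attenuate.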

SRAM

 

Design

            A finite state machine (Moore) is used to interface to the SRAM.  The SRAM has two operating cycles, read and write.  Each cycle has a different timing diagram, so a state machine is used to ensure that the proper timing for each cycle is achieved.  The VHDL code of the design is still under development.  An error has occurred during compilation and is under investigation.

 

 

Schematic diagram

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


The SRAM is 128K x 16 bits and non-volatile.  For this project, the samples output by the ADC are 8 bits wide.  A design decision was made to use each 16-bit row of the SRAM as two separate 8-bit slots, with the design handling the conversion between 8 and 16 bits.  The address inside the system is therefore 18 bits rather than the 17 bits used by the SRAM.  The interface uses the most significant bit of the 18-bit address to decide whether to access the upper byte or the lower byte.  As long as the system is consistent in this for both read and write cycles, the SRAM can be treated as 256K x 8 bits.
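The byte-addressing scheme can be modelled in software to check the address arithmetic.  The Python sketch below is a behavioural illustration, not the VHDL interface; the class name `ByteSram` and the choice that an MSB of 1 selects the upper byte are assumptions, since the report does not say which value selects which byte.

```python
ROWS = 128 * 1024  # 128K rows of 16 bits, addressed with 17 bits


class ByteSram:
    """Model of a 128K x 16 SRAM presented to the system as 256K x 8."""

    def __init__(self):
        self.rows = [0] * ROWS

    def _split(self, addr18):
        """Split an 18-bit address into (17-bit row address, upper-byte flag)."""
        if not 0 <= addr18 < 2 * ROWS:
            raise ValueError("address out of range")
        upper = (addr18 >> 17) & 1  # most significant bit selects the byte (assumed polarity)
        row = addr18 & (ROWS - 1)   # remaining 17 bits address the SRAM row
        return row, upper

    def write(self, addr18, byte):
        row, upper = self._split(addr18)
        byte &= 0xFF
        if upper:
            self.rows[row] = (self.rows[row] & 0x00FF) | (byte << 8)
        else:
            self.rows[row] = (self.rows[row] & 0xFF00) | byte

    def read(self, addr18):
        row, upper = self._split(addr18)
        word = self.rows[row]
        return (word >> 8) & 0xFF if upper else word & 0xFF
```

Writing to addresses 5 and 5 + 128K places two different bytes in the same 16-bit row, and each reads back independently, which is the consistency property the text relies on.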

 

The data bus connected to the SRAM is a bi-directional bus.  To avoid I/O contention, a tri-state bus is needed to act as a buffer between the SRAM and the interface.  The lpm_bustri is used as the tri-state bus.

 


State Diagram

Figure 1: read cycle

 

Figure 2: write cycle
Matlab code

 

The following matlab code was written to demonstrate the accuracy of the zero-crossing pattern matching method.

 

% A matlab representation of zero crossing algorithm.

% File: alg.m

% Input a template of the word.

[y,fs,bits] = wavread ('hello.wav');

siz = wavread('hello.wav','size');

 

% plot the original waveform.

subplot (2,1,1), plot(y), grid;

 

mx = max (y);

mi = min (y);

avg = mean (y);

hiRef = (mx + avg) / 2;

lowRef = (mi + avg) /2;

 

hitavg1 = hit (y, siz, avg );

hithi1 = hit (y, siz, hiRef);

hitlow1 = hit (y, siz, lowRef);

 

% Input a test case of the word.

[y,fs,bits] = wavread ('two.wav');

siz = wavread('two.wav','size');

subplot (2,1,2), plot(y), grid;

 

mx = max (y);

mi = min (y);

avg = mean (y);

hiRef = (mx + avg) / 2;

lowRef = (mi + avg) /2;

 

hitavg2 = hit (y,siz,avg);

hithi2 = hit (y, siz, hiRef);

hitlow2 = hit (y, siz, lowRef);

 

% Try to find the percentage match between the two samples.

matchratio = ((hitavg1/hitavg2) + (hithi1 / hithi2) + (hitlow1 / hitlow2 ) ) * 100 /3;

% -------------end of alg.m ---------------

 

% A function to find the number of zero crossing

% File: hit.m

% Input: the vector, size of the vector, the reference line

% Output: the number of zero crossing.

 

function result = hit ( x, siz, avg)

tmphit = 0;

oldpt = 0;

 

if x(1) == avg

   tmphit = tmphit + 1;

end;

oldpt = x(1);

for n = 2 : siz(1),

   curpt = x(n);

   if curpt == avg

      tmphit = tmphit + 1;

   elseif ((oldpt < avg) & (curpt > avg ) )| ((curpt < avg) & (oldpt > avg ))

      tmphit = tmphit + 1;

   end;

   % Remember this sample so the next iteration compares against it.

   oldpt = curpt;

end;

result = tmphit;

% --------------- end of hit.m -------------------------

 

 

References

 

[1]Andrew Kingston, “Speech Recognition by Machine”, Technical Report CS-TR-92/2, Department of Computer Science, Victoria University of Wellington, October 1992

 

[2]Hiroaki Sakoe and Seibi Chiba, “Dynamic Programming Algorithm Optimization for Spoken Word recognition”, IEEE Transactions On Acoustics, Speech, And Signal Processing, vol. ASSP-26, NO. 1, Feb. 1978

 

[3]http://www.ee.ualberta.ca/~elliott/ee552/studentAppNotes/1999f/lcd_driver/lcd_package.vhd

 

[4]http://www.ee.ualberta.ca/~elliott/ee552/studentAppNotes/1999f/lcd_driver/lcd.vhd

 

[5]http://www.ee.ualberta.ca/~elliott/ee552/studentAppNotes/1999f/lcd_driver/lcd_out.vhd

 

[6]http://www.ee.ualberta.ca/~elliott/ee552/studentAppNotes/1999f/lcd_driver/lcd.mif

 

[7]http://www.ee.ualberta.ca/~elliott/ee552/studentAppNotes/1999_w/keydecoder/debounce.vhd

 

[8]http://www.ee.ualberta.ca/~elliott/ee552/studentAppNotes/1999_w/keydecoder/display.vhd

 

[9]http://www.ee.ualberta.ca/~elliott/ee552/studentAppNotes/1999_w/keydecoder/keypad.vhd

 

[10]http://www.ee.ualberta.ca/~elliott/ee552/studentAppNotes/1999_w/keydecoder/keytop.vhd

 

[11]http://www.ee.ualberta.ca/~elliott/ee552/studentAppNotes/1999_w/SRAM/

 

[12]http://www.ee.ualberta.ca/~elliott/ee552/studentAppNotes/1999f/musical_notes/music.html