EE552
Project:
Prepared for Dr. Elliott
by Tim Golding
Eric Cheung
Felicia Cheng
Wilson (Tin) Kwan
David Li
Our project is a voice recognition
telephone dialer system. The first major goal of this design is to develop a
system with a vocabulary of between four and twenty or more words. In addition, this system will be capable of
generating the required DTMF (dual-tone multi-frequency) signals to permit direct
dialing of the telephone number.
Finally, the system will have a user-friendly interface to permit easy
programming of hot numbers.
The user can store a vocabulary
consisting of the numbers “zero” through “nine”, the commands “dial” and “delete”, and
various hot words such as “help” to dial 911 or “Susan” to call a friend.
A zero-crossing pattern
recognition algorithm is used for the speech matching.
The system has the capability to
generate the DTMF signals that will allow for direct dialing of the
number. The present solution to this
goal is to use pulse-width modulation to shape the tone and then low pass
filter the signal to generate it.
The user interface will require both LCD and keypad interfacing, and we believe that
existing application notes could be incorporated here. We believe that we can implement this design
on the Altera UP1 Education board.
The voice recognition telephone
dialer is a telephone accessory that will permit hands-free dialing of
telephone numbers with the utterance of simple words. Many people will find this accessory very useful, especially the
disabled or people who suffer attacks from diseases such as MS or
arthritis.
The operation of the voice
recognition telephone dialer will be easy to learn and will have a
user-friendly interface. This is
necessary since a training mode is required to increase word recognition
accuracy. This recognition system will
be speaker dependent; if another user wishes to operate the dialer, they
will need to re-train the system for their voice patterns. By using word recognition it should be
possible to dial a telephone number by speaking the digits of the number and
then issuing the dial command.
An additional feature that may be
included with the dialer is the ability to store hot numbers. These hot numbers will be stored in memory
and a specific word or name will be assigned to this number. When this word is spoken the dialer will
recognize it, and the full number will then be displayed and dialed when the dial
command is given. This feature will require
a programming mode and user interface so these numbers can be stored in
memory. Other features that may be
included are DTMF generation, keypad and LCD interfacing.
As stated in the introduction, the best way to design an accurate word recognition
system is to make it user specific.
Since the memory used in this design is nonvolatile, the system
will not need to start in training mode each time it is powered on. However, the training mode can still be selected
via the user interface, mainly because the user may wish to add
hot numbers or try to increase the recognition rate.
When initiated, this mode will
require the user to repeat each digit or command a few times. Once the training mode is completed, the
dialer will enter the active mode.
While in active mode, the system continuously monitors the microphone
for a recognizable word and the keypad for a key press.
Whenever the dialer recognizes a digit, that digit will be stored in a register
in memory and simultaneously displayed on the LCD panel. The system will
continue this process until 11 digits are in memory or the dial command is
issued.
If a wrong digit is entered, it can be deleted with a
clear command, which removes the last entry in the register. Once the user has entered the number
they wish to dial, the dial command is given to dial the number. The number in the register is then passed to
the dialer to generate the DTMF signals.
The system will then return to active mode once the call is terminated.
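The digit-entry behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the VHDL used in the design: the event names ("clear", "dial") and the function name are hypothetical, and the sketch assumes that entry simply stops once 11 digits have been collected.

```python
MAX_DIGITS = 11            # limit stated in the text
DIGITS = set("0123456789")

def run_active_mode(events):
    """Collect digits from recognized words or key presses.

    `events` is an iterable of strings: a single digit, "clear", or "dial"
    (hypothetical event names). Returns the digit string passed to the dialer.
    """
    register = []
    for event in events:
        if event == "dial":
            break                      # dial command ends digit entry
        if event == "clear":
            if register:
                register.pop()         # remove the last entry only
        elif event in DIGITS:
            register.append(event)
            if len(register) == MAX_DIGITS:
                break                  # assumption: entry stops at 11 digits
        # any other event is ignored
    return "".join(register)

if __name__ == "__main__":
    print(run_active_mode(["4", "9", "2", "clear", "3", "dial"]))
```

In the example call, the wrong digit "2" is removed by the clear command before "3" is entered.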
The Voice Recognition System
must generate a signal whenever any familiar voice instruction is
detected. In order to accomplish this,
the analog signal must be manipulated and transformed so the dialer can
recognize the word. First, a special
algorithm is needed for word boundary detection. Once the beginning of the word is identified, the word data will
be passed to the next stage for pattern recognition.
The pattern recognition method has
changed from the resource requirement document. We will no longer use Dynamic Time Warping for pattern
comparison. Because time constraints
have limited our research, we are forced to use a simpler recognition
method based on Zero Crossing and Average Power.
The idea behind Zero Crossing and
Average Power is simple. Voice data
consists of data points that oscillate between positive and negative voltage;
higher-frequency data contains a higher rate of oscillation. One way to extract the frequency
characteristic of the data is to count the number of times the waveform crosses
the x-axis, in other words, the zero crossings.
As the name suggests, Average Power is just the average power of the
waveform during the time identified as a word.
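As a rough illustration of the two features, the following Python sketch counts sign changes and computes the mean squared sample value. The function names are hypothetical, and the handling of samples that are exactly zero (they are skipped) is an assumption; the Matlab code in the appendix counts such samples as hits instead.

```python
def zero_crossings(samples):
    """Count sign changes between consecutive non-zero samples."""
    count = 0
    prev = 0.0
    for s in samples:
        if s == 0:
            continue                   # assumption: exact zeros are skipped
        if prev != 0 and (s > 0) != (prev > 0):
            count += 1                 # sign differs from last non-zero sample
        prev = s
    return count

def average_power(samples):
    """Mean of the squared samples over the span identified as a word."""
    return sum(s * s for s in samples) / len(samples)

if __name__ == "__main__":
    word = [0.2, 0.5, -0.1, -0.4, 0.3, 0.1, -0.2]
    print(zero_crossings(word), average_power(word))
```

A higher-pitched word of the same length would give a larger crossing count, while a louder word would give a larger average power.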
During the training mode, the zero
crossing and average power information of the templates will be stored into
memory. During the active mode, any
input that can be identified as a word is passed to the algorithm to extract
the zero crossing and average power.
Then these data will be compared with each template to find a
match. Upon a match with high accuracy,
that template will be declared as a winner and a signal will be sent to the
system.
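The comparison step might look like the following Python sketch. It is not the team's implementation: the similarity measure (smaller value divided by larger value, averaged over both features), the 90% threshold, and all names and template values are assumptions made for illustration.

```python
def similarity(a, b):
    """Percent similarity of two non-negative feature values (100 = identical)."""
    if a == 0 and b == 0:
        return 100.0
    return 100.0 * min(a, b) / max(a, b)

def best_match(templates, features, threshold=90.0):
    """Return (word, score) for the closest template, or (None, score).

    `templates` maps each trained word to (zero_crossings, average_power);
    `features` is the same pair measured for the spoken input.
    The threshold is a hypothetical value, not one taken from the design.
    """
    best_word, best_score = None, 0.0
    for word, tmpl in templates.items():
        score = sum(similarity(t, f) for t, f in zip(tmpl, features)) / len(tmpl)
        if score > best_score:
            best_word, best_score = word, score
    if best_score >= threshold:
        return best_word, best_score   # declared the winner
    return None, best_score            # no template matched closely enough

if __name__ == "__main__":
    trained = {"one": (120, 0.30), "two": (200, 0.10)}   # made-up templates
    print(best_match(trained, (198, 0.11)))
```

Rejecting inputs that fall below the threshold keeps background noise from being reported as a word.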
Figure: Matlab interpretation of the word 'two' in a wave file format.
To implement this design, the
following major components are needed.
The first component is a re-programmable FPGA. After
examining the various selections available, we have decided to use the Altera
UP1 education board. The main deciding
factor behind this selection is that the UP1 board has two devices: the EPM7128S
with 2,500 gates and the EPF10K20 with 20,000 gates. In addition, the education board has a large number of input and
output pins available on the expansion ports.
Since we are dealing with analog
signals, the analog signal must be converted to a usable digital
format. The easiest way to manage this
is to use an analog-to-digital converter that is presently available on the
market. After examining various types
of converters, the team decided to use the AD7823, which is manufactured by
Analog Devices; its data sheet is included in Appendix I.
In order to provide a large enough
signal, with a bandwidth of 5 kHz, to the A/D converter, some associated
analog circuitry will be needed, specifically a band pass filter (high pass and
low pass filters) and some sort of amplifier.
The decision was made to implement the analog filter and any amplifiers
with the LM741 operational amplifier.
This op-amp is used in the circuits to amplify both the input analog
signal and the outgoing DTMF signal.
The data sheet is also included in Appendix I.
Since this design will
be dealing with a top frequency of around 5 kHz, the largest allowable sampling
interval (that is, the smallest sampling rate) is given by the Nyquist rate:
Tsample < 1/(2B)
Tsample = sampling interval
B = bandwidth
By applying the
Nyquist rate we can approximate the memory that will be needed to
store words and templates. Since the
bandwidth is around 5 kHz, the sampling interval will be 100 µs (a 10 kHz sampling rate), which gives 10,000 samples for a word lasting about one second. The A/D being used will represent each
sample in 8 bits of data. Therefore,
each word will require around 80,000 bits of memory.
The goal of this design is to
achieve a substantial vocabulary of around 20 words. Therefore, this design will require between 1.6 Mbits and 2.0
Mbits of memory. However, this
requirement may be lowered by applying the method outlined in the voice
recognition methodology section. Since
the Altera UP1 board functions at 5 volts, the decision was made to go with
static RAM, specifically the DS1258Y 128K×16 from Dallas Semiconductor.
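The memory estimate can be reproduced with a short Python calculation. The one-second word duration is an assumption implied by the 10,000 samples per word figure; the function and key names are hypothetical.

```python
BANDWIDTH_HZ = 5_000        # top frequency from the text
BITS_PER_SAMPLE = 8         # A/D resolution from the text
VOCABULARY_SIZE = 20        # target vocabulary from the text
WORD_DURATION_S = 1.0       # assumption: about one second per word
SRAM_BITS = 128 * 1024 * 16 # capacity of a 128K x 16 device

def memory_budget(bandwidth_hz, word_duration_s, bits_per_sample, words):
    """Estimate storage for raw word samples at the Nyquist rate."""
    sample_rate_hz = 2 * bandwidth_hz
    samples_per_word = round(word_duration_s * sample_rate_hz)
    bits_per_word = samples_per_word * bits_per_sample
    return {
        "sample_interval_s": 1.0 / sample_rate_hz,
        "samples_per_word": samples_per_word,
        "bits_per_word": bits_per_word,
        "total_bits": bits_per_word * words,
    }

if __name__ == "__main__":
    budget = memory_budget(BANDWIDTH_HZ, WORD_DURATION_S,
                           BITS_PER_SAMPLE, VOCABULARY_SIZE)
    print(budget)
    print("fits in SRAM:", budget["total_bits"] <= SRAM_BITS)
```

With these inputs the raw samples for 20 words need 1.6 Mbits, which fits within the 128K×16 device.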
Using the above components we expect
to be able to provide a working prototype by the end of the semester. To save time during the demonstration, we will
train the dialer beforehand; since the
memory is nonvolatile, the data will not be lost even if the system is not
powered. During the demonstration a
“Hot Number” could be configured and various numbers could be dialed using
the voice recognition system.
The following page has
three separate analog circuits that will be needed to interface the voice
recognition dialer with the outside world.
A brief description of each circuit and their use follows:
1) The first circuit is an 800 to 3200 Hz
active band pass filter. This filter is
used to limit the input frequency to the A/D converter. The main reason to utilize this filter is to
limit the amount of data that has to be handled by the word recognition
circuit. The first op-amp is a high
pass filter with a 3 dB frequency of 800 Hz.
The second op-amp is used as a buffer to isolate the previous stage from
the output stage and permits a sharper response.
The final op-amp is configured as a low pass filter with a 3 dB frequency
of 3200 Hz. The 3 dB frequency was set
by selecting the resistors according to the following formula:
R = 1/(F3dB*2pi*C)
2) The amplifier will be used to amplify
the microphone signal to the proper level used by the A/D converter. In addition, this circuit will also be used
to amplify the DTMF signal. The gain of
the inverting op-amp can be adjusted if required by the following formula:
Av = R1/R2
R1 will be a variable
resistor in order to adjust the amplification of the amplifier as needed.
3) The telephone line
interface is used to connect the DTMF section of the design to the telephone
system.
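The two design formulas above can be evaluated with a small Python helper. The 10 nF capacitor and the resistor values in the example are assumptions chosen for illustration, not values from the schematic, and the function names are hypothetical.

```python
import math

def resistor_for_cutoff(f_3db_hz, capacitance_f):
    """R = 1/(F3dB*2pi*C) for a single RC filter section."""
    return 1.0 / (f_3db_hz * 2 * math.pi * capacitance_f)

def amplifier_gain(r1_ohm, r2_ohm):
    """Gain magnitude Av = R1/R2 as given in the text."""
    return r1_ohm / r2_ohm

if __name__ == "__main__":
    C = 10e-9  # assumption: 10 nF capacitor in both filter sections
    print("high pass R: %.0f ohm" % resistor_for_cutoff(800, C))
    print("low pass R:  %.0f ohm" % resistor_for_cutoff(3200, C))
    print("gain with R1 = 100k, R2 = 10k:", amplifier_gain(100e3, 10e3))
```

Because R is inversely proportional to the 3 dB frequency, the 800 Hz section needs four times the resistance of the 3200 Hz section for the same capacitor.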
Features
- Built-in 16-character x 2-line LCD display
- Specification at 5V ± 10%
- 10-year minimum data retention in the absence of external power
- Data is automatically protected during a power loss
- Auto-protection from loud input signal
- maximum speed 12.5MHz
I/O pin descriptions
| Pin No. | Mnemonic | Description | Signal direction |
|---|---|---|---|
| 1 | VIN+ | Positive input power (+5V) | Input |
| 2 | VIN- | Negative input power (-5V) | Input |
| 3 | GND | Ground reference | Input |
| 4 | MODE | Training mode or user mode. When MODE is low, the system is in training mode: whenever reset is pressed, the user has to train the system before use. When MODE is high, a reset bypasses training. | Input |
| 5 | Start | The system will start training or normal operation according to MODE | Input |
| 6 | AINPUT | Analog input | Input |
| 8 | Output | Output frequency corresponding to the analog input | Output |
| 9 | Reset | When Reset is low, the system will reset itself according to MODE | Input |
| 11-14 | Row 0-3 | Reserved for keypad (16 buttons –001); Row0-3 correspond to Row0-3 of the keypad | Output |
| 15-18 | Col 0-3 | Reserved for keypad (16 buttons –001); Col0-3 correspond to Col0-3 of the keypad | Input |
General Description
The voice recognition telephone dialer (VRTD) is a hands-free telephone dialer with a user-friendly interface that includes an LCD display. The VRTD can also be operated using a specific keypad (16 buttons –001) if desired. The operation of the voice recognition telephone dialer is easy to learn. Prior to using the VRTD, the user needs to train the system in training mode in order to increase word recognition accuracy. The voice recognition telephone dialer can recognize up to four hot numbers.
| Module | Description |
|---|---|
| A/D Converter | ADC controller |
| SRAM | SRAM controller |
| Main User Interface | Control flow of system |
| LCD | LCD controller |
| Keypad | 4x4 keypad controller |
| DTMF | Dial tone generator |
| Hot Number Programming | For quick dial |
| Voice Recognition Algorithm | Control flow of algorithm |
Design hierarchy chart
The keypad decoder was modified from the existing keydecode VHDL file from the application note. The keypad decoder applies a common column and row-scanning algorithm, implemented as a Moore state machine. The state machine starts by driving all the columns to zero and checking whether any of the rows, which are connected to 4.7 kΩ pull-up resistors, has been pulled to zero. If any one of the row signals is zero, a key has been pressed. The state machine then enters the prepare-drive state and waits for the key-bouncing transients to die down. Next, it detects which particular row of the keypad is at zero. With that row identified, the columns are driven to zero in turn, and the key being pressed is found at the intersection of the zero row and the zero column. The resultant key value is latched to an output register, where it is held until another keypress is detected. Also, the keyvalid signal is asserted for one clock period following the detection of a valid keypress. The resultant keyvalue is output to the LED decoder for displaying the corresponding characters on the LED. Furthermore, the keyvalue decoder in the tone generator module decodes the keyvalue to generate the particular output tone.
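The scanning idea can be modelled in software as follows. This Python sketch is not the VHDL decoder: it omits debouncing and the keyvalid pulse, the key layout is assumed to follow the standard telephone arrangement, and all names are hypothetical.

```python
# Assumed key layout (rows top to bottom, columns left to right).
KEYMAP = [
    ["1", "2", "3", "A"],
    ["4", "5", "6", "B"],
    ["7", "8", "9", "C"],
    ["*", "0", "#", "D"],
]

def read_rows(col_drive, pressed):
    """Model the pulled-up rows: a row reads 0 only when the pressed key
    connects it to a column that is currently driven to 0."""
    rows = [1, 1, 1, 1]
    if pressed is not None:
        r, c = pressed
        if col_drive[c] == 0:
            rows[r] = 0
    return rows

def scan_keypad(pressed):
    """Return the key at `pressed` = (row, col), or None if no key is down."""
    rows = read_rows([0, 0, 0, 0], pressed)   # drive every column low
    if 0 not in rows:
        return None                           # all rows high: nothing pressed
    row = rows.index(0)
    for col in range(4):                      # drive one column low at a time
        drive = [1, 1, 1, 1]
        drive[col] = 0
        if read_rows(drive, pressed)[row] == 0:
            return KEYMAP[row][col]
    return None

if __name__ == "__main__":
    print(scan_keypad((1, 2)))
```

In hardware the same two steps are separated by a wait state so that contact bounce has settled before the column scan begins.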
The LCD encoder was modified from the code in the 1997 application notes. A Mealy state machine was used in this design. The LCD encoder initializes the LCD by setting the appropriate enable, read/write, and register select signals. After initialization, the encoder processes the 4-bit input signal from the keypad and sends the 8-bit data signal to the LCD to display the number. A certain time delay is associated with every state because the LCD needs a specific time period to process each instruction.
DTMF, or dual-tone multi-frequency signaling, is how touch-tone telephones operate. The system combines two frequencies that are not harmonically related to generate a tone that can be recognized by the telephone company. To achieve this, the following tones are used according to the table below:
| | 1209 Hz | 1336 Hz | 1477 Hz | 1633 Hz |
|---|---|---|---|---|
| 697 Hz | 1 | 2 | 3 | A |
| 770 Hz | 4 | 5 | 6 | B |
| 852 Hz | 7 | 8 | 9 | C |
| 941 Hz | * | 0 | # | D |
The 1633 Hz column is not used.
By pressing the appropriate button, the two intersecting tones are generated together to produce a unique tone. Two approaches were considered for the tone generator. The first approach was to use a pulse-width modulation scheme to generate the tones. However, due to time constraints it was decided that a simpler approach was needed.
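The row and column lookup can be illustrated with the Python sketch below. It uses the standard DTMF frequency pairs and sums two ideal sine waves, which is a reference model only; the hardware design generates its tones differently. The function names, the 8 kHz sample rate, and the 0.1 s duration are assumptions.

```python
import math

ROW_HZ = (697, 770, 852, 941)
COL_HZ = (1209, 1336, 1477, 1633)
KEYS = ("123A", "456B", "789C", "*0#D")

def dtmf_pair(key):
    """Return the (row, column) frequencies in Hz for one key."""
    if len(key) == 1:
        for r, row in enumerate(KEYS):
            c = row.find(key)
            if c != -1:
                return ROW_HZ[r], COL_HZ[c]
    raise ValueError("not a DTMF key: %r" % (key,))

def dtmf_samples(key, duration_s=0.1, sample_rate_hz=8000):
    """Sum of the two tones as floats in [-1, 1] (ideal sines, for reference)."""
    low, high = dtmf_pair(key)
    n = round(duration_s * sample_rate_hz)
    return [0.5 * (math.sin(2 * math.pi * low * i / sample_rate_hz) +
                   math.sin(2 * math.pi * high * i / sample_rate_hz))
            for i in range(n)]

if __name__ == "__main__":
    print(dtmf_pair("5"), len(dtmf_samples("5")))
```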
As noted in an application note, any frequency can be generated by simply dividing down the master clock. This was the approach taken for this portion of the design, and the complete code for this block is included in Appendix I. Due to the long periods of the waveforms, it was impossible to simulate the complete design in Altera MAX+PLUS II with the required frequencies.
However, some basic testing was performed on the theory of this design by using short period signals to test the control of the design. Once the basic principle of the design was tested the next approach was to determine if this would generate the proper frequencies to within 1.5% of the required value.
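A quick way to check whether clock division can meet the 1.5% tolerance is to compute the divider values and the resulting frequencies. The Python sketch below assumes a 25.175 MHz master clock and a counter that toggles the output once per half period; both are assumptions for illustration, and the names are hypothetical.

```python
MASTER_CLOCK_HZ = 25_175_000   # assumption: master clock frequency
DTMF_HZ = (697, 770, 852, 941, 1209, 1336, 1477, 1633)

def toggle_count(clock_hz, tone_hz):
    """Master clock cycles per half period of the square wave output."""
    return round(clock_hz / (2 * tone_hz))

def actual_frequency(clock_hz, count):
    """Frequency produced when the output toggles every `count` cycles."""
    return clock_hz / (2 * count)

def percent_error(target_hz, actual_hz):
    return 100.0 * abs(actual_hz - target_hz) / target_hz

if __name__ == "__main__":
    for f in DTMF_HZ:
        n = toggle_count(MASTER_CLOCK_HZ, f)
        got = actual_frequency(MASTER_CLOCK_HZ, n)
        print("%5d Hz: count %6d, actual %9.3f Hz, error %.4f%%"
              % (f, n, got, percent_error(f, got)))
```

Because the divider values are in the thousands, the rounding error of an ideal divider is far smaller than the 1.5% target; measured errors on real hardware also depend on the oscillator tolerance.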
The complete program, which uses the toggle switches on the UP1 board to enable and select the frequencies, was downloaded to a UP1 board and the frequencies were measured. The following table has the required values and the measured values along with the percent difference of the frequencies.
As seen from the above table, all the required frequencies are successfully generated using this method. However, another problem came to light during the testing of the dialer: it did not seem to work. The suspected culprit is that a square wave contains a large number of harmonics, and these harmonics may be confusing the telephone system. To overcome this problem, a low-pass filter network is now being employed to eliminate these harmonics. This design has yet to be fully tested, but we believe that this filtering will eliminate enough of these harmonics to make dialing possible.
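To see why filtering should help, the harmonic content of an ideal square wave can be listed alongside the attenuation of a low-pass filter. The Python sketch below assumes a 50% duty cycle (odd harmonics with amplitude 4/(n·pi)) and a single first-order RC section with a hypothetical 2 kHz cutoff; the actual filter network in the design may differ.

```python
import math

def square_wave_harmonics(fundamental_hz, max_hz):
    """(frequency, amplitude) of the odd harmonics of a unit square wave."""
    harmonics = []
    n = 1
    while n * fundamental_hz <= max_hz:
        harmonics.append((n * fundamental_hz, 4.0 / (math.pi * n)))
        n += 2                          # an ideal square wave has no even harmonics
    return harmonics

def lowpass_gain(freq_hz, cutoff_hz):
    """Magnitude response of a first-order RC low-pass filter."""
    return 1.0 / math.sqrt(1.0 + (freq_hz / cutoff_hz) ** 2)

if __name__ == "__main__":
    CUTOFF_HZ = 2000                    # hypothetical cutoff, for illustration
    for freq, amp in square_wave_harmonics(697, 5000):
        print("%5d Hz: amplitude %.3f, after filter %.3f"
              % (freq, amp, amp * lowpass_gain(freq, CUTOFF_HZ)))
```

The third harmonic of the 697 Hz row tone falls at 2091 Hz, well above the highest DTMF frequency, which is why unfiltered harmonics are a plausible source of misdetection.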
A finite state machine (Moore) is used to interface the SRAM. The SRAM has two operation modes, read and write cycles. Each cycle has a different timing diagram, thus a state machine is used to ensure that proper timing for each cycle is achieved. The VHDL code of the design is still under development. An error has occurred during compilation and is under investigation.
The SRAM is a non-volatile 128K x 16-bit device. For this project, each sample output from the ADC is 8 bits wide. A design decision was made to use each 16-bit row of the SRAM as two separate 8-bit slots, with the interface handling the conversion between 8 and 16 bits. The address inside the system is therefore 18 bits wide rather than the 17 bits used by the SRAM. The interface uses the most significant bit of the 18-bit address bus to decide whether to access the upper byte or the lower byte. As long as the system is consistent in this for both read and write cycles, the SRAM can be treated as 256K x 8 bits.
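The address mapping can be modelled in software as shown below. This Python sketch is not the VHDL interface: it assumes that a most significant bit of 1 selects the upper byte (the text does not say which polarity is used), and the class and function names are hypothetical.

```python
ROW_ADDR_BITS = 17                     # address width of the 128K x 16 SRAM

def split_address(addr18):
    """Split an 18-bit byte address into (17-bit row address, upper-byte flag)."""
    if not 0 <= addr18 < (1 << (ROW_ADDR_BITS + 1)):
        raise ValueError("address out of range")
    row = addr18 & ((1 << ROW_ADDR_BITS) - 1)
    upper = bool(addr18 >> ROW_ADDR_BITS)   # assumption: MSB = 1 selects upper byte
    return row, upper

class ByteWideSram:
    """Software model that presents a 128K x 16 memory as 256K x 8."""

    def __init__(self):
        self.rows = {}                 # sparse storage: row address -> 16-bit word

    def write(self, addr18, byte):
        row, upper = split_address(addr18)
        word = self.rows.get(row, 0)
        if upper:
            word = (word & 0x00FF) | ((byte & 0xFF) << 8)
        else:
            word = (word & 0xFF00) | (byte & 0xFF)
        self.rows[row] = word          # the other byte of the row is preserved

    def read(self, addr18):
        row, upper = split_address(addr18)
        word = self.rows.get(row, 0)
        return (word >> 8) & 0xFF if upper else word & 0xFF

if __name__ == "__main__":
    ram = ByteWideSram()
    ram.write(5, 0x12)
    ram.write(5 | (1 << 17), 0x34)
    print(hex(ram.read(5)), hex(ram.read(5 | (1 << 17))), hex(ram.rows[5]))
```

Note that a byte write in this model is a read-modify-write of the 16-bit row, so the unselected byte is left unchanged.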
The data bus connected to the SRAM is a bi-directional bus. To avoid I/O contention, a tri-state bus is needed to act as a buffer between the SRAM and the interface. The lpm_bustri is used as the tri-state bus.
Figure 1: read cycle
Figure 2: write cycle
Matlab code
The following matlab code was written to demonstrate the accuracy of the zero-crossing pattern matching method.
% A matlab representation of zero crossing algorithm.
% File: alg.m
% Input a template of the word.
[y,fs,bits] = wavread ('hello.wav');
siz = wavread('hello.wav','size');
% plot the original waveform.
subplot (2,1,1), plot(y), grid;
mx = max (y);
mi = min (y);
avg = mean (y);
hiRef = (mx + avg) / 2;
lowRef = (mi + avg) /2;
hitavg1 = hit (y, siz, avg );
hithi1 = hit (y, siz, hiRef);
hitlow1 = hit (y, siz, lowRef);
% Input a test case of the word.
[y,fs,bits] = wavread ('two.wav');
siz = wavread('two.wav','size');
subplot (2,1,2), plot(y), grid;
mx = max (y);
mi = min (y);
avg = mean (y);
hiRef = (mx + avg) / 2;
lowRef = (mi + avg) /2;
hitavg2 = hit (y,siz,avg);
hithi2 = hit (y, siz, hiRef);
hitlow2 = hit (y, siz, lowRef);
% Try to find the percent of accuracy between the two samples.
matchratio = ((hitavg1 / hitavg2) + (hithi1 / hithi2) + (hitlow1 / hitlow2)) * 100 / 3;
% ------------- end of alg.m ---------------
% A function to find the number of zero crossings
% File: hit.m
% Input: the vector, size of the vector, the reference line
% Output: the number of zero crossings.
function result = hit ( x, siz, avg)
tmphit = 0;
oldpt = 0;
if x(1) == avg
tmphit = tmphit + 1;
end;
oldpt = x(1);
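% wavread(...,'size') returns [samples channels], so use the sample count.
for n = 2 : siz(1),
curpt = x(n);
if curpt == avg
tmphit = tmphit + 1;
elseif ((oldpt < avg) & (curpt > avg)) | ((curpt < avg) & (oldpt > avg))
tmphit = tmphit + 1;
end;
oldpt = curpt; % remember this sample for the next comparison
end;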
result = tmphit;
% --------------- end of hit.m -------------------------
[1] Andrew Kingston, “Speech Recognition by Machine”, Technical Report CS-TR-92/2, Department of Computer Science, Victoria University of Wellington, October 1992
[2] Hiroaki Sakoe and Seibi Chiba, “Dynamic Programming Algorithm Optimization for Spoken Word Recognition”, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-26, no. 1, Feb. 1978
[3]http://www.ee.ualberta.ca/~elliott/ee552/studentAppNotes/1999f/lcd_driver/lcd_package.vhd
[4]http://www.ee.ualberta.ca/~elliott/ee552/studentAppNotes/1999f/lcd_driver/lcd.vhd
[5]http://www.ee.ualberta.ca/~elliott/ee552/studentAppNotes/1999f/lcd_driver/lcd_out.vhd
[6]http://www.ee.ualberta.ca/~elliott/ee552/studentAppNotes/1999f/lcd_driver/lcd.mif
[7]http://www.ee.ualberta.ca/~elliott/ee552/studentAppNotes/1999_w/keydecoder/debounce.vhd
[8]http://www.ee.ualberta.ca/~elliott/ee552/studentAppNotes/1999_w/keydecoder/display.vhd
[9]http://www.ee.ualberta.ca/~elliott/ee552/studentAppNotes/1999_w/keydecoder/keypad.vhd
[10]http://www.ee.ualberta.ca/~elliott/ee552/studentAppNotes/1999_w/keydecoder/keytop.vhd
[11]http://www.ee.ualberta.ca/~elliott/ee552/studentAppNotes/1999_w/SRAM/
[12]http://www.ee.ualberta.ca/~elliott/ee552/studentAppNotes/1999f/musical_notes/music.html