Abstract
[*]Overview
[*]Design Details: History, Problems and Solutions
[*]Voice Recognition Algorithm
[*]Initial Single Chunk Algorithm
[*]40 Chunks With One Version
[*]40 Chunks With Four Versions and Voting
[*]40 Chunks With Four Versions, Voting and Alignment
[*]40 Chunks With Four Versions, Voting and Micro Alignment
[*]Summary
[*]Microphone Amplifier External Circuitry
[*]RC Car External Circuitry
[*]Tests and Experiments
[*]Microphone Input Amplifiers
[*]Initial Single Chunk Algorithm
[*]40 Chunks With One Version
[*]40 Chunks With Four Versions and Voting
[*]40 Chunks With Four Versions, Voting and Alignment
[*]Sum of Absolute Differences
[*]Sum of Squared Differences
[*]40 Chunks With Four Versions, Voting and Micro Alignment
[*]Speed vs. Logic Size Trade-offs
[*]Speed vs. Style
[*]IC Test Measurements
[*]Closing Remarks
[*]References
[*]Declaration of Authenticity
[*]Appendix
[*]A - Brief Simulation Summary
[*]B - Schematics
[*]Amplifier and A/D
[*]Relay Control Circuitry (4, one for each control)
This project implements a voice recognition system capable of issuing
four different commands to a radio-controlled (RC) car. The voice data is acquired using
an external microphone and circuitry, which amplifies and digitizes the voice at 11,000
samples per second. The digitized voice is then fed into a field-programmable gate array
(FPGA), the Altera EPF10K20RC240-4. The system has two main operating modes: training and
recognition. Training mode is used to initialize and train the system for each command
and/or speaker. Each of the four commands is recorded four times; thus, sixteen recordings
are required to completely train the system. In recognition mode, the FPGA performs voice
recognition and issues commands to the RC car. After the "recognize" button is pressed, the
command is spoken. If a valid command is found, that command is issued to the RC car.
The RC car controller can issue four commands: forward, backward, turn left, and turn right.
The car is normally stopped, and the controller issues each command to the RC car for one
second, allowing the relatively slow voice commands to control the fast-moving RC car. To
control the RC car, we use the original commercial handheld controller with the mechanical
switches replaced by relays. Overall, system performance was relatively good, at 90%
recognition for a consistent speaker and audibly different words.
The FPGA implements a training, voice recognition, and command translation system. The system uses two push buttons: one to activate listening mode and one to activate training mode. Training mode also requires four mutually exclusive switches, which select the command being trained. The microphone is the most natural human interface of the system, allowing speech to be input to the machine. It is secured to the user's head to maintain a constant distance from the mouth. There are two user outputs: the RC car's response and the "train next version" LED, which flashes to indicate that the system is ready to sample the next version of a command.
The voice recognition algorithm requires that four samples be "trained" for
each command; in other words, 16 recordings are needed to train all four commands. Each
command is recorded for a period of precisely two seconds, divided into twenty 100 ms time
chunks. A count of zero-axis crossings is recorded for each time chunk, for a total of 20
eight-bit counts per recording, or 320 counts in all. The counts are stored in the FPGA's
onboard embedded array blocks. To train a voice, the system counts crossings during each
time chunk, stores the count at the appropriate memory address when the chunk ends, and
resets the counter. A "crossing" is defined as two consecutive samples having values on
either side of the defined axis value. The analog-to-digital (A/D) converter is eight bits
wide, meaning samples have values from 0 to 255, so the axis value is defined in this range.
With no input signal, the output from the A/D is 127. After some testing, the axis was
chosen to be slightly below the exact centre to eliminate background and system noise;
please refer to the attached source file "toplevel.vhd" for the exact generic value chosen
for the zero axis. To avoid recording initial silence, the machine has been designed to
begin its two-second period only when a substantial number of crossings are detected. By
cutting out the initial silence, there is less chance of two recordings of the same word
being seen as different simply because they are shifted in time.
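As an illustrative sketch only (the project itself is implemented in VHDL), the per-chunk crossing count described above could be written as follows. The axis value of 120 is a placeholder assumption; the real value is a generic in "toplevel.vhd".

```python
AXIS = 120  # placeholder assumption; the actual value is a generic in toplevel.vhd

def crossings_per_chunk(samples, chunk_len):
    """Count axis crossings in each fixed-length chunk of 8-bit samples (0..255)."""
    counts = []
    for start in range(0, len(samples) // chunk_len * chunk_len, chunk_len):
        chunk = samples[start:start + chunk_len]
        crossings = 0
        for prev, cur in zip(chunk, chunk[1:]):
            # A "crossing": two consecutive samples on either side of the axis.
            if (prev < AXIS) != (cur < AXIS):
                crossings += 1
        counts.append(crossings)
    return counts
```

At 11,000 samples per second, a 100 ms chunk corresponds to 1,100 samples, so each trained command is reduced to just twenty 8-bit counts.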
The design of both the FPGA-based hardware and the external circuitry
underwent significant changes throughout the course of the project. The FPGA was used to
implement the voice recognition algorithm, which went through many revisions. The external
circuitry required some minor enhancements to reduce signal noise and improve reliability.
Initially, a simple zero-crossing algorithm was implemented, in which a single count of crossings over the entire sample was taken. This approach quickly proved ineffective, and numerous modifications were made. This section details the problems encountered and their solutions; a summary of problems and solutions appears at the end of the section as well.
The first algorithm was very simple: the system listened for two seconds after the listen button was pushed and counted the number of zero crossings made. Training was done the very same way: a command was selected and the number of crossings was saved into memory.
Problems with the one-chunk zero crossing count algorithm:
Solution:
To improve voice recognition, more information was needed. Time is split into 40 chunks of 50 ms each. The number of crossings in each chunk is recorded and compared against the trained samples. The chunks allow voices that have the same total number of crossings to be differentiated by their different frequency content at different times. Two samples are compared by taking the difference between each respective pair of chunks and summing these into a total difference. The total difference is then compared against a threshold value; if the sum is less than the threshold, the comparison is considered a success.
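Continuing the sketch above (Python rather than the project's VHDL), the chunk-by-chunk comparison against a trained sample might look like this; the threshold is a tunable placeholder, as discussed throughout this report:

```python
def total_difference(trained_counts, recorded_counts):
    """Sum the per-chunk differences between two lists of crossing counts."""
    return sum(abs(t - r) for t, r in zip(trained_counts, recorded_counts))

def is_match(trained_counts, recorded_counts, threshold):
    """A total difference below the threshold counts as a success."""
    return total_difference(trained_counts, recorded_counts) < threshold
```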
Problems with the 40 zero crossing count algorithm:
Solutions:
Our third attempt makes use of four different versions of each command plus voting. Several features had to be added. Four versions of each command are now recorded; to support this, we connected a signal to an external LED which flashes while the system waits for the next sample. The user then pushes the train button again and another version is recorded, and this continues until all four versions are recorded. The voter required a separate state machine and control logic of its own.
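A sketch of the voting idea (the real voter is a separate hardware state machine): the recorded sample is compared against all four stored versions of a command, and the command succeeds if enough versions match. The default of two required matches reflects the observation later in this report that voting still succeeds on two matches.

```python
def vote(stored_versions, recorded_counts, threshold, needed=2):
    """Count how many stored versions of a command match the recorded
    sample, and report success if at least `needed` of them do."""
    matches = 0
    for version in stored_versions:
        diff = sum(abs(t - r) for t, r in zip(version, recorded_counts))
        if diff < threshold:
            matches += 1
    return matches >= needed
```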
There is one problem remaining, however. Using this algorithm, it is still possible that two different analog signals could have similar waveforms, but be shifted in time. The information that is stored on such samples would be insufficient to handle this case, resulting in the two voices being seen as "different". This is a major problem for a system with human interfaces, because the likelihood of recording two versions of the same voice differing by shifts in time is very high.
Problem:
Solution:
The easiest solution to this problem is to ignore initial silence. Suppose, for example, that on one command the user hits the button and waits 100 ms before saying a word, while on the next sample the user waits 200 ms before saying the same word. These samples can be compared correctly by eliminating the 100 ms and 200 ms periods of silence preceding the voice in each respective case. This brings us to the fourth algorithm, which introduces sample alignment. Basically, the two-second recording period does not officially begin when the button is pressed, but rather as soon as the silence breaks within the duration of one chunk. This modification has the added benefit of maximizing the amount of voice that can be recorded in the allotted time. However, it also causes a potential problem, due to the uncertainty as to where within a chunk the silence breaks: chunks are very long, which means that data appearing anywhere in the chunk might trigger a recording, even if the chunk consists mostly of silence!
Problem:
Solution:
This prompted yet another modification to the algorithm. In this fifth edition of the voice recognition algorithm, we now increase the size of chunks, and therefore decrease the number of chunks in the two-second period. With large chunks, you tend to get a better sum-of-differences output due to larger differences and sharper boundaries.
However, this solution was not ideal: after all, if a chunk is large, a tremendous amount of silence might be introduced into a sample, greatly reducing the likelihood of two similar voices matching. The solution is to change the algorithm to accept no more than 5 ms of silence. Specifically, the first 5 ms period containing substantial crossings marks the beginning of the two-second recording period. This reduced the effects of the two problems mentioned above.
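The 5 ms trigger can be sketched as follows. At 11,000 samples per second, a 5 ms window is 55 samples; both the axis value and the minimum crossing count here are placeholder assumptions, not the project's actual generics.

```python
def recording_start(samples, window=55, axis=120, min_crossings=3):
    """Return the index of the first 5 ms window containing substantial
    crossings; the two-second recording period begins there, so the
    initial silence is skipped."""
    for start in range(0, len(samples) - window + 1, window):
        w = samples[start:start + window]
        crossings = sum((p < axis) != (c < axis) for p, c in zip(w, w[1:]))
        if crossings >= min_crossings:
            return start
    return None  # nothing but silence
```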
One problem still remained with the sum-of-differences thresholds. Throughout the design of
many parts of the voice recognition algorithm, it has been necessary to decide on
appropriate thresholds, usually for choosing what counts as a "match" between two voices,
what constitutes voice versus noise, and so on. These thresholds are best designed to be
flexible and easily configurable.
Summary of voice recognition problems (all solved):
Summary of solutions to voice recognition problems:
The microphone amplifier circuitry consisted mostly of discrete components, and required some enhancements to reduce the minor problems encountered.
Problems with the microphone amplifier circuitry:
The noise output from the amplifier was initially a problem, but was corrected by
setting the zero-axis value slightly below the exact centre, as described in the voice
recognition algorithm above. The feedback oscillation was reduced by tying a filter
capacitor from the voltage source VCC to ground. Had we not corrected the oscillation, the
microphone input would have been completely useless, since the input voltage levels would
have been inconsistent! Another solution would have been to make the entire amplifier
inverting, but that would have required a redesign of the amplifier stages.
A circuit was built for the purpose of driving the RC car's remote control whenever
the FPGA sent signals in the form of commands. The first idea was to use enhancement-mode
transistors. Problems encountered initially:
The solution to the transistor problems listed above was to use relays. However, their
switching speeds were not fast enough to accommodate speed control of the vehicle. Within
the scope of this project we do not actually require speed control, but in the interest of
keeping the system flexible we chose to upgrade the hardware to use reed relays, which are
fast enough for that purpose. The problem of the radio frequency being close to a harmonic
of the system clock has not been solved in this design.
Throughout the circuit design, testing and experimentation were required to set various parameters, such as the gain of the microphone amplifier circuit and the correct threshold levels.
Gain through the input amplifiers is controlled by the two resistors that set the gain of each op amp. The first-stage amplifier was fixed at a gain of 52 dB; the second stage was at first varied with a potentiometer and later fixed with a resistor. Testing was done by measuring the amplifier output on the oscilloscope while speaking into the microphone and watching for saturation and the like. A good overall gain, providing a strong input signal without being too adversely affected by noise in the lab, was found to be 93 dB.
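As a quick sanity check of these figures (assuming the dB values refer to voltage gain, i.e. 20·log10 of the ratio): 93 dB overall with 52 dB in the first stage leaves about 41 dB, roughly a factor of 110, for the second stage.

```python
import math

def db_to_gain(db):
    """Convert a voltage gain in decibels to a linear ratio."""
    return 10 ** (db / 20)

first_stage = db_to_gain(52)            # roughly 400x
overall = db_to_gain(93)                # roughly 45,000x
second_stage = overall / first_stage    # the remaining stage, roughly 110x
second_stage_db = 20 * math.log10(second_stage)  # 93 - 52 = 41 dB
```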
This was our initial test at finding the differences in sums of values. Testing
was not very rigorous at this stage, since it was quickly recognized that the algorithm was
impractical. Sample words were spoken into the microphone and the zero crossings count was
output to the Altera board's seven-segment display. The output from the zero crossings
counter was a 12-bit value, but only the low-order two hexadecimal digits could be shown on
the display.

Word                            Range (approx.)
"Forward"                       0x06 to 0x0F
"Forwards" (emphasis on "s")    0x0A to 0x2F
"Backward"                      0x0E to 0x1A
"Backwards" (emphasis on "s")   0x10 to 0x25

As can be seen from the table, the various words do have different counts; unfortunately,
the ranges overlap and vary greatly. This provided a proof of concept, but would not provide
enough accuracy for voice recognition.
This was our first attempt at using the "chunk" idea and splitting the voice sample into
small pieces. Unfortunately, this involved a major rewrite of the entire circuit, and much
of the testing time was spent debugging the state machines and control logic.
Debugging and testing were done by watching the output command selection signals and
comparing the sum-of-differences module's reset line against the comparator's match signal.
As the sum-of-differences value increased due to a bad match, the comparator would
eventually fall to no match. By comparing the reset pulse to the time required for the
comparator to change, we could tell how well a match or non-match was performing. It was at
this point that a sum-of-differences value of about 400-500 became evident for approximate
differences between words (this will be seen in more convincing detail later). A couple of
problems became evident during testing. Since our process stopped as soon as it found a
match, it would often match a word too quickly and exit having selected the wrong word. The
solution, of course, was to make the requirements for a match more restrictive; the effect
was often to miss all words, including the correct one! While trying various combinations,
we found that if we recorded all four samples for the same word, we could make the
requirements far more stringent, and the chances were good that the sample would match at
least one of the four versions.
As the system grew with the addition of the voter and versioning, the traditional method of
testing by bringing a few selected inputs out began to become a problem. Verifying the
correct operation of the memory and the voter turned out to be a challenge. The latter was
solved by bringing out most of the control signals so we could examine the outcome of the
vote.
Our initial tests of the alignment proved very unsatisfactory: recognition, which had been
somewhat usable, became worthless. To solve this problem we needed to design testability
into the main system, since looking at control signals was not proving useful.
For testing, we decided to modify the main state machine to include a new branch that dumps
the contents of memory, and to provide switchable control of the LED display output. By
dumping the contents of memory through the LEDs, with a two-second pause between values, it
was possible to verify that memory contents had been written as expected. This became the
main debugging routine for the remainder of the project, used to look at the chunks, their
values, and the similarity between them.
The first dump made of a raw sample for voice 1 ("Go Forwards") and voice 2 ("Go Back"):
(values in hexadecimal; V1v2 = voice 1, version 2)

Chunk  V1v1  V1v2  V1v3  V1v4  V2v1  V2v2  V2v3  V2v4
  1      0     0     0     0     0     0    30     0
  2      0     0     0     0     0     0    90    60
  3      0     0     0     0     0    14     0     0
  4      2     0     2     0     2    34     8     0
  5      0     0     0    17    26    34     2     1
  6      0     0     0    2a    20    1c    21     6
  7      0     0     0    27    22    10    34    22
  8      0    24     0    22    20     2    24    22
  9      0    21     0    20    1a    1a    18    16
 10      0    16     0    16    20    76    10    16
 11      d     0     0    10     c     0     c     6
 12     18     0     4     4     0     0    42     6
 13     16     0    14     0     0     0    a6    20
 14     25     0    20     0     6     0     4    3c
 15     1e    2c    1c     0     0     0     0     0
 16      2    22    21     0     0     0     0     0
 17      0    24    28     0     0     0     0     0
 18      0     6    1a     0     0     0     0     0
 19      0     0     4     0     0     0     0     0
 20      0     0     0     0     0     0     0     0
 21      0     0     0     0     0     0     0     0
 22      0     0     0     0     0     0     0     0
 23     10     c     0     0     0     0     0     0
 24     29    28     0     2     0     2     0     0
 25     30    2e     0     0     0     0     0     0
 26     22    2b    17     0     0     0     0     0
 27     1c    18    2a     0     0     0     0     0
 28     20    20    27     0     0     0     0     0
 29     10    14    22     0    15     0     0     0
 30      2    1e    20     0    4a     0     0     0
 31      0     0    16     0    51     c     0     0
 32      0     2    10    1c    32    3a    2d     8
 33      0     0     4    26     0    3d    3e    2f
 34      0     0     0    2c     0    4d    4a    3e
 35      0     0     0    2a    17     8    28    44
 36      0     0     0    22    1c     2     0    10
 37      0     0     0    18     0     0     0     0
 38      0     0     0     4    26     e    18     0
 39      0     0     0     0    50     c    32    1e
 40      0     0     0     0     0     0    14    22

From the dump, several interesting points were found:
Changes made:
A table of values after changes #1 and #2 now shows:
(values in hexadecimal; Rec = the recorded sample being recognized)

Chunk  V1v1  V1v2  V1v3  V1v4  V2v1  V2v2  V2v3   Rec
  1     34    79    75    2c    98    68    3e    7b
  2     66    52    54    77    54    71    20    80
  3     6b    4a    4c    56     4    14     0     a
  4     3a    16    2c    59    1a     2     c    1a
  5      0     0     0    30     0    12     4     c
  6      0     0     0     0     8    82    2a    11
  7      0     0     0     0     0    24     0     e
  8     36     f     0     0     0     0     0     0
  9     69    62    62    55     0     0     0     0
 10     46    4e    70    6a     0     0     0    16
 11     3c    4a    44    4d     0     0     0     0
 12      a    1e    34    48     0     0     0     0
 13     2b     0     c     8     0     0     0     0
 14      0    8c     0     a     0     0     0     0
 15      0     0    30     0     0     0     0     0
 16      0     0     0     0     0     0     0     0
 17      0     0     0     0     0     0     0     0
 18      0     0     0     0     0     0     0     0
 19      0     0     0     0     0     0     0     0
 20      0     0     0     0     0     0     0     0

Two of the main problems in the prior set were solved, and minor variations
can also be eliminated using the short-chunk method (#3 above). From the above table we can
now compare various methods of looking for correlation between the data sets.
This is the method used for this algorithm. It is simple to implement, and
since it consists of two adders and two registers it is fairly compact. The equation is:

    sum = |A1 - B1| + |A2 - B2| + ... + |A20 - B20|

where Ai and Bi are the crossing counts in chunk i of the two recordings being compared.
Sets compared                        Sum
Voice 1 Ver 1 to Recorded Sample     589
Voice 1 Ver 2 to Recorded Sample     631
Voice 1 Ver 3 to Recorded Sample     631
Voice 1 Ver 4 to Recorded Sample     630
Voice 2 Ver 1 to Recorded Sample     274 *
Voice 2 Ver 2 to Recorded Sample     231 *
Voice 2 Ver 3 to Recorded Sample     250 *
Voice 1 Ver 1 to Voice 1 Ver 2       518 *
Voice 1 Ver 1 to Voice 1 Ver 3       472 *
Voice 1 Ver 1 to Voice 1 Ver 4       459 *
Voice 1 Ver 1 to Voice 2 Ver 1       603
Voice 1 Ver 1 to Voice 2 Ver 2       732
Voice 1 Ver 1 to Voice 2 Ver 3       621

The sums marked with an asterisk are the sets that are supposed to match. The
table suggests that a threshold for a match would be around 500. The one such value above
500, 518, could be a bad recording and can simply be ignored, since the voting will still
succeed on two matches.
Even though the sum of absolute differences provides basic information about the
cross-correlation between two sets of counts, squaring each chunk difference weights large
differences much more heavily:

    sum2 = (A1 - B1)^2 + (A2 - B2)^2 + ... + (A20 - B20)^2

If we apply this algorithm to the same numbers as above, the new table is:

Sets compared                        Sum of squares
Voice 1 Ver 1 to Recorded Sample     36673
Voice 1 Ver 2 to Recorded Sample     51667
Voice 1 Ver 3 to Recorded Sample     41927
Voice 1 Ver 4 to Recorded Sample     44330
Voice 2 Ver 1 to Recorded Sample      3718 *
Voice 2 Ver 2 to Recorded Sample     15035 *
Voice 2 Ver 3 to Recorded Sample     14602 *
Voice 1 Ver 1 to Voice 1 Ver 2       39288 *
Voice 1 Ver 1 to Voice 1 Ver 3       26232 *
Voice 1 Ver 1 to Voice 1 Ver 4       26025 *
Voice 1 Ver 1 to Voice 2 Ver 1       46731
Voice 1 Ver 1 to Voice 2 Ver 2       56760
Voice 1 Ver 1 to Voice 2 Ver 3       45055

Compared with the plain sum of absolute differences, a failure to match is now much further
separated from a match. For example, a threshold of 35000 would assure at least two matches
for every word stored in memory.
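A sketch of the two comparison measures side by side; this also illustrates the noise sensitivity discussed later in this report, where a single chunk differing by 50 crossings adds only 50 to the sum of absolute differences but 2500 to the sum of squared differences:

```python
def sad(a, b):
    """Sum of absolute differences between two lists of chunk counts."""
    return sum(abs(x - y) for x, y in zip(a, b))

def ssd(a, b):
    """Sum of squared differences; large per-chunk differences dominate."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# A single noisy chunk that differs by 50 crossings
# adds 50 to the SAD but 50**2 = 2500 to the SSD.
```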
The commands used for this test are two-word commands. The prior algorithm runs
into problems depending on when the user starts speaking within one of the 100 ms chunks.
To solve this problem, we reduced the first chunk length down to 5 ms and look for a few
crossings within that period. Testing for multiple words followed the same procedure as
before: dumping memory, comparing sums of squared differences, and so on. A matching
threshold of 35000 was used. For multiple words, the following commands were found to be
about 90% accurate and easy to say in a similar way each time. Suggested words (spelled to
suggest pronunciation):

Once again, we tried to see if we could recognize single-word commands. Four words were
recorded, and then recognition was attempted. The input to the voter was watched for the
number of matches on each word. Initially, at a threshold of 35000, all the words received
4 matches (the maximum possible). The threshold was successively lowered until matches were
made on only one word. We found that with a threshold value of 7500 we were able to
distinguish between these words:

The above command words are ideal, but unfortunately we ran into two problems.
The first was our inability to say the word the same way each time: at a threshold of 7500,
one must not change inflection or pitch greatly. The second problem was background lab
noise. When we performed the test the lab was fairly noisy, and if a noise such as a stool
squealing was recorded, it resulted in a very large sum. For example, if a stool squeal
caused 50 extra crossings, which is easy considering it is loud and of high frequency, we
would receive an additional 2500 on the sum of squared differences, on top of the actual
differences between the voices. We found recognition dropped to about 50-60% for a trained
speaker.
Throughout the design there has been reuse of large components such as counters
and timers, so space tended not to be an issue. The major speed vs. size trade-off was in
the sum-of-differences calculation.
The initial design took a more decentralized approach, and the number of logic cells was
found to grow rather quickly. After redesigning the system with a central controller and a
separate datapath, we found that we were able to reuse many large components, saving a
large amount of space.
Since all the values in the system were designed or checked via trial and error,
few signals were left to actually measure.
The final design of the project meets the original design goals: voice recognition of four different commands with reasonable accuracy on the Altera Flex10K. With careful training and a quiet room, the system can be set up to distinguish and recognize single English words with reasonable accuracy.
Limitations of the system are:
Strengths of the system are:
The algorithm we used was also limited by the amount of memory we have on the system. Currently we use all but one embedded array block (EAB), leaving little room for storing more data. To use a better voice recognition algorithm the use of external memory would be a requirement.
Overall, the system performs as expected and meets all the initial design goals.
[1] Ullmann, J.R., Pattern Recognition Techniques. Butterworth & Co. (Publishers) Ltd., London, 1973.
[2] Dr. Elliott, for suggesting the technique of using zero-crossing counts.
[3] http://www.btinternet.com/~netsurf/SudburyRC/coxamp.html
[4] http://www.darkportal.com/cc/index.htm
[5] Ashenden, P.J., The Designer's Guide to VHDL. Morgan Kaufmann Publishers, Inc., San Francisco, 1996.
[6] http://www.ee.ualberta.ca/~elliott/ee552/studentAppNotes/
[7] M. L. Stanley-Jones
[8] Altera code for LPM modules and from the LPM Wizard.
We the authors, namely Andrew Stanley-Jones, Kevin Grant, and Darren Gould,
declare that to the best of our knowledge all the code used in this project, and this
document itself, are original work, with the exception of:
(sorry, simulation output is not available in a web-readable format)
(schematics not available in a web-readable format)