Abstract
[*]Overview
[*]Design Details: History, Problems and Solutions
[*]Voice Recognition Algorithm
[*]Initial Single Chunk Algorithm
[*]40 Chunks With One Version
[*]40 Chunks With Four Versions and Voting
[*]40 Chunks With Four Versions, Voting and Alignment
[*]40 Chunks With Four Versions, Voting and Micro Alignment
[*]Summary
[*]Microphone Amplifier External Circuitry
[*]RC Car External Circuitry
[*]Tests and Experiments
[*]Microphone Input Amplifiers
[*]Initial Single Chunk Algorithm
[*]40 Chunks With One Version
[*]40 Chunks With Four Versions and Voting
[*]40 Chunks With Four Versions, Voting and Alignment
[*]Sum of Absolute Differences
[*]Sum of Squared Differences
[*]40 Chunks With Four Versions, Voting and Micro Alignment
[*]Speed vs. Logic Size Trade-offs
[*]Speed vs. Style
[*]IC Test Measurements
[*]Closing Remarks
[*]References
[*]Declaration of Authenticity
[*]Appendix
[*]A - Brief Simulation Summary
[*]B - Schematics
[*]Amplifier and A/D
[*]Relay Control Circuitry (4, one for each control)
This project implements a voice recognition system capable of issuing
four different commands to a radio-controlled (RC) car. The voice data is acquired using
an external microphone and circuitry, which amplifies and digitizes the voice at 11,000
samples per second. The digitized voice is then fed into a field-programmable gate array
(FPGA), the Altera EPF10K20RC240-4. The system has two main operating modes: training and
recognition. Training mode is used to initialize and train the system for each command
and/or speaker. Each of the four commands is recorded four times; thus, sixteen recordings
are required to completely train the system. In recognition mode, the FPGA performs voice
recognition and issues commands to the RC car. After the "recognize" button is pressed, the
command is spoken. If a valid command is found, that command is issued to the RC car.
The RC car controller can issue four commands: forward, backward, turn left, and turn right.
The car is normally stopped, and the controller issues each command to the RC car for one
second, allowing the relatively slow voice commands to control the fast-moving RC car. To
control the RC car, we use the original commercial handheld controller with the mechanical
switches replaced by relays. Overall, system performance was relatively good, at 90%
recognition for a consistent speaker and audibly different words.
The FPGA implements a training, voice recognition, and command translation system. The system uses two push buttons: one to activate listening mode and one to activate training mode. Training mode also requires four mutually exclusive switches, which select the command being trained. The microphone is the most natural human interface of the system, allowing speech to be input to the machine. It is secured to the user's head to maintain a constant distance from the mouth. There are two user outputs: the RC car's response and the "train next version" LED, which flashes to indicate that the system is ready to sample the next version of a command.
The voice recognition algorithm requires that four samples be "trained" for
each command; in other words, 16 recordings are needed to train all four commands. Each
command is recorded for a period of precisely two seconds, divided into twenty 100 ms time
chunks. A count of zero-axis crossings is recorded for each time chunk, for a total of 20
eight-bit counts per recording, or 320 counts in all. The counts are stored in the FPGA's
onboard embedded array blocks. To train a voice, the system counts crossings during each
time chunk, stores the count at the appropriate memory address when the chunk ends, and
resets the counter. A "crossing" is defined as two consecutive samples having values on
either side of the defined axis value. The analog-to-digital (A/D) converter is eight bits
wide, meaning samples have values from 0 to 255, so the axis value is defined in this range.
With no input signal, the output from the A/D is 127. After some testing, the axis was
chosen to be slightly below the exact centre to eliminate background and system noise;
please refer to the attached source file "toplevel.vhd" for the exact generic value chosen
for the zero axis. To avoid recording initial silence, the machine has been designed to
begin its two-second period only when a substantial number of crossings are detected. By
cutting out the initial silence, there is less chance of two recordings of the same word
being seen as different simply because they are shifted in time.
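As an illustrative sketch only (the project itself is implemented in VHDL), the per-chunk crossing count described above could be written as follows. The axis value of 120 is a placeholder assumption; the real value is a generic in "toplevel.vhd".

```python
AXIS = 120  # placeholder assumption; the actual value is a generic in toplevel.vhd

def crossings_per_chunk(samples, chunk_len):
    """Count axis crossings in each fixed-length chunk of 8-bit samples (0..255)."""
    counts = []
    for start in range(0, len(samples) // chunk_len * chunk_len, chunk_len):
        chunk = samples[start:start + chunk_len]
        crossings = 0
        for prev, cur in zip(chunk, chunk[1:]):
            # A "crossing": two consecutive samples on either side of the axis.
            if (prev < AXIS) != (cur < AXIS):
                crossings += 1
        counts.append(crossings)
    return counts
```

At 11,000 samples per second, a 100 ms chunk corresponds to 1,100 samples, so each trained command is reduced to just twenty 8-bit counts.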
The design of both the FPGA-based hardware and the external circuitry
underwent significant changes throughout the course of the project. The FPGA was used to
implement the voice recognition algorithm, which went through many revisions. The external
circuitry required some minor enhancements to reduce signal noise and improve reliability.
Initially, a simple zero-crossing algorithm was implemented, in which a single count of crossings over the entire sample was taken. This approach quickly proved ineffective, and numerous modifications were made. This section details the problems encountered and their solutions; a summary of problems and solutions appears at the end of the section as well.
The first algorithm was very simple: the system listened for two seconds after the listen button was pushed and counted the number of zero crossings made. Training was done the very same way: a command was selected and the number of crossings was saved into memory.
Problems with the one-chunk zero crossing count algorithm:
Solution:
To improve voice recognition, more information was needed. Time is split into 40 chunks of 50 ms each. The number of crossings in each chunk is recorded and compared against the trained samples. The chunks allow voices that have the same total number of crossings to be differentiated by their different frequency content at different times. Two samples are compared by taking the difference between each respective pair of chunks and summing these into a total difference. The total difference is then compared against a threshold value; if the sum is less than the threshold, the comparison is considered a success.
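Continuing the sketch above (Python rather than the project's VHDL), the chunk-by-chunk comparison against a trained sample might look like this; the threshold is a tunable placeholder, as discussed throughout this report:

```python
def total_difference(trained_counts, recorded_counts):
    """Sum the per-chunk differences between two lists of crossing counts."""
    return sum(abs(t - r) for t, r in zip(trained_counts, recorded_counts))

def is_match(trained_counts, recorded_counts, threshold):
    """A total difference below the threshold counts as a success."""
    return total_difference(trained_counts, recorded_counts) < threshold
```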
Problems with the 40 zero crossing count algorithm:
Solutions:
Our third attempt makes use of four different versions of each command plus voting. Several features had to be added. Four versions of each command are now recorded; to support this, we connected a signal to an external LED which flashes while the system waits for the next sample. The user then pushes the train button again and another version is recorded, and this continues until all four versions are recorded. The voter required a separate state machine and control logic of its own.
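A sketch of the voting idea (the real voter is a separate hardware state machine): the recorded sample is compared against all four stored versions of a command, and the command succeeds if enough versions match. The default of two required matches reflects the observation later in this report that voting still succeeds on two matches.

```python
def vote(stored_versions, recorded_counts, threshold, needed=2):
    """Count how many stored versions of a command match the recorded
    sample, and report success if at least `needed` of them do."""
    matches = 0
    for version in stored_versions:
        diff = sum(abs(t - r) for t, r in zip(version, recorded_counts))
        if diff < threshold:
            matches += 1
    return matches >= needed
```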
There is one problem remaining, however. Using this algorithm, it is still possible that two different analog signals could have similar waveforms, but be shifted in time. The information that is stored on such samples would be insufficient to handle this case, resulting in the two voices being seen as "different". This is a major problem for a system with human interfaces, because the likelihood of recording two versions of the same voice differing by shifts in time is very high.
Problem:
Solution:
The easiest solution to this problem is to ignore initial silence. Suppose, for example, that on one command the user hits the button and waits 100 ms before saying a word, while on the next sample the user waits 200 ms before saying the same word. These samples can be compared correctly by eliminating the 100 ms and 200 ms periods of silence preceding the voice in each respective case. This brings us to the fourth algorithm, which introduces sample alignment. Basically, the two-second recording period does not officially begin when the button is pressed, but rather as soon as the silence breaks within the duration of one chunk. This modification has the added benefit of maximizing the amount of voice that can be recorded in the allotted time. However, it also causes a potential problem, due to the uncertainty as to where within a chunk the silence breaks: chunks are very long, which means that data appearing anywhere in the chunk might trigger a recording, even if the chunk consists mostly of silence!
Problem:
Solution:
This prompted yet another modification to the algorithm. In this fifth edition of the voice recognition algorithm, we now increase the size of chunks, and therefore decrease the number of chunks in the two-second period. With large chunks, you tend to get a better sum-of-differences output due to larger differences and sharper boundaries.
However, this solution was not ideal: after all, if a chunk is large, a tremendous amount of silence might be introduced into a sample, greatly reducing the likelihood of two similar voices matching. The solution is to change the algorithm to accept no more than 5 ms of silence. Specifically, the first 5 ms period containing substantial crossings marks the beginning of the two-second recording period. This reduced the effects of the two problems mentioned above.
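The 5 ms trigger can be sketched as follows. At 11,000 samples per second, a 5 ms window is 55 samples; both the axis value and the minimum crossing count here are placeholder assumptions, not the project's actual generics.

```python
def recording_start(samples, window=55, axis=120, min_crossings=3):
    """Return the index of the first 5 ms window containing substantial
    crossings; the two-second recording period begins there, so the
    initial silence is skipped."""
    for start in range(0, len(samples) - window + 1, window):
        w = samples[start:start + window]
        crossings = sum((p < axis) != (c < axis) for p, c in zip(w, w[1:]))
        if crossings >= min_crossings:
            return start
    return None  # nothing but silence
```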
One problem still remained with the sum-of-differences thresholds. Throughout the design of
many parts of the voice recognition algorithm, it has been necessary to decide on
appropriate thresholds, usually for choosing what counts as a "match" between two voices,
what constitutes voice versus noise, and so on. These thresholds are best designed to be
flexible and easily configurable.
Summary of voice recognition problems (all solved):
Summary of solutions to voice recognition problems:
The microphone amplifier circuitry consisted mostly of discrete components, and required some enhancements to reduce the minor problems encountered.
Problems with the microphone amplifier circuitry:
The noise output from the amplifier was initially a problem, but was corrected by
setting the zero-axis value slightly below the exact centre, as described in the voice
recognition algorithm above. The feedback oscillation was reduced by tying a filter
capacitor from the voltage source VCC to ground. Had we not corrected the oscillation, the
microphone input would have been completely useless, since the input voltage levels would
have been inconsistent! Another solution would have been to make the entire amplifier
inverting, but that would have required a redesign of the amplifier stages.
A circuit was built for the purpose of driving the RC car's remote control whenever
the FPGA sent signals in the form of commands. The first idea was to use enhancement-mode
transistors. Problems encountered initially:
The solution to the transistor problems listed above was to use relays. However, their
switching speeds were not fast enough to accommodate speed control of the vehicle. Within
the scope of this project we do not actually require speed control, but in the interest of
keeping the system flexible we chose to upgrade the hardware to use reed relays, which are
fast enough for that purpose. The problem of the radio frequency being close to a harmonic
of the system clock has not been solved in this design.
Throughout the circuit design, testing and experimentation were required to set various parameters, such as the gain of the microphone amplifier circuit and the correct threshold levels.
Gain through the input amplifiers is controlled by the two resistors that set the gain of each op amp. The first-stage amplifier was fixed at a gain of 52 dB; the second stage was at first varied with a potentiometer and later fixed with a resistor. Testing was done by measuring the amplifier output on the oscilloscope while speaking into the microphone and watching for saturation and the like. A good overall gain, providing a strong input signal without being too adversely affected by noise in the lab, was found to be 93 dB.
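As a quick sanity check of these figures (assuming the dB values refer to voltage gain, i.e. 20·log10 of the ratio): 93 dB overall with 52 dB in the first stage leaves about 41 dB, roughly a factor of 110, for the second stage.

```python
import math

def db_to_gain(db):
    """Convert a voltage gain in decibels to a linear ratio."""
    return 10 ** (db / 20)

first_stage = db_to_gain(52)            # roughly 400x
overall = db_to_gain(93)                # roughly 45,000x
second_stage = overall / first_stage    # the remaining stage, roughly 110x
second_stage_db = 20 * math.log10(second_stage)  # 93 - 52 = 41 dB
```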
This was our initial test at finding the differences in sums of values. Testing
was not very rigorous at this stage, since it was quickly recognized that the algorithm was
impractical. Sample words were spoken into the microphone and the zero crossings count was
output to the Altera board's seven-segment display. The output from the zero crossings
counter was a 12-bit value, but only the low-order two hexadecimal digits could be shown on
the display.

Word                            Range (approx.)
"Forward"                       0x06 to 0x0F
"Forwards" (emphasis on "s")    0x0A to 0x2F
"Backward"                      0x0E to 0x1A
"Backwards" (emphasis on "s")   0x10 to 0x25

As can be seen from the table, the various words do have different counts; unfortunately,
the ranges overlap and vary greatly. This provided a proof of concept, but would not provide
enough accuracy for voice recognition.
This was our first attempt at using the "chunk" idea and splitting the voice sample into
small pieces. Unfortunately, this involved a major rewrite of the entire circuit, and much
of the testing time was spent debugging the state machines and control logic.
Debugging and testing were done by watching the output command selection signals and
comparing the sum-of-differences module's reset line against the comparator's match signal.
As the sum-of-differences value increased due to a bad match, the comparator would
eventually fall to no match. By comparing the reset pulse to the time required for the
comparator to change, we could tell how well a match or non-match was performing. It was at
this point that a sum-of-differences value of about 400-500 became evident for approximate
differences between words (this will be seen in more convincing detail later). A couple of
problems became evident during testing. Since our process stopped as soon as it found a
match, it would often match a word too quickly and exit having selected the wrong word. The
solution, of course, was to make the requirements for a match more restrictive; the effect
was often to miss all words, including the correct one! While trying various combinations,
we found that if we recorded all four samples for the same word, we could make the
requirements far more stringent, and the chances were good that the sample would match at
least one of the four versions.
As the system grew with the addition of the voter and versioning, the traditional method of
testing by bringing a few selected inputs out began to become a problem. Verifying the
correct operation of the memory and the voter turned out to be a challenge. The latter was
solved by bringing out most of the control signals so we could examine the outcome of the
vote.
Our initial tests of the alignment proved very unsatisfactory: recognition, which had been
somewhat usable, became worthless. To solve this problem we needed to design testability
into the main system, since looking at control signals was not proving useful.
For testing, we decided to modify the main state machine to include a new branch that dumps
the contents of memory, and to provide switchable control of the LED display output. By
dumping the contents of memory through the LEDs, with a two-second pause between values, it
was possible to verify that memory contents had been written as expected. This became the
main debugging routine for the remainder of the project, used to look at the chunks, their
values, and the similarity between them.
The first dump made of a raw sample for voice 1 ("Go Forwards") and voice 2 ("Go Back"):
(values in hexadecimal; V1v2 = voice 1, version 2)

Chunk  V1v1  V1v2  V1v3  V1v4  V2v1  V2v2  V2v3  V2v4
  1      0     0     0     0     0     0    30     0
  2      0     0     0     0     0     0    90    60
  3      0     0     0     0     0    14     0     0
  4      2     0     2     0     2    34     8     0
  5      0     0     0    17    26    34     2     1
  6      0     0     0    2a    20    1c    21     6
  7      0     0     0    27    22    10    34    22
  8      0    24     0    22    20     2    24    22
  9      0    21     0    20    1a    1a    18    16
 10      0    16     0    16    20    76    10    16
 11      d     0     0    10     c     0     c     6
 12     18     0     4     4     0     0    42     6
 13     16     0    14     0     0     0    a6    20
 14     25     0    20     0     6     0     4    3c
 15     1e    2c    1c     0     0     0     0     0
 16      2    22    21     0     0     0     0     0
 17      0    24    28     0     0     0     0     0
 18      0     6    1a     0     0     0     0     0
 19      0     0     4     0     0     0     0     0
 20      0     0     0     0     0     0     0     0
 21      0     0     0     0     0     0     0     0
 22      0     0     0     0     0     0     0     0
 23     10     c     0     0     0     0     0     0
 24     29    28     0     2     0     2     0     0
 25     30    2e     0     0     0     0     0     0
 26     22    2b    17     0     0     0     0     0
 27     1c    18    2a     0     0     0     0     0
 28     20    20    27     0     0     0     0     0
 29     10    14    22     0    15     0     0     0
 30      2    1e    20     0    4a     0     0     0
 31      0     0    16     0    51     c     0     0
 32      0     2    10    1c    32    3a    2d     8
 33      0     0     4    26     0    3d    3e    2f
 34      0     0     0    2c     0    4d    4a    3e
 35      0     0     0    2a    17     8    28    44
 36      0     0     0    22    1c     2     0    10
 37      0     0     0    18     0     0     0     0
 38      0     0     0     4    26     e    18     0
 39      0     0     0     0    50     c    32    1e
 40      0     0     0     0     0     0    14    22

From the dump, several interesting points were found:
Changes made:
A table of values after changes #1 and #2 now shows:
(values in hexadecimal; Rec = the recorded sample being recognized)

Chunk  V1v1  V1v2  V1v3  V1v4  V2v1  V2v2  V2v3   Rec
  1     34    79    75    2c    98    68    3e    7b
  2     66    52    54    77    54    71    20    80
  3     6b    4a    4c    56     4    14     0     a
  4     3a    16    2c    59    1a     2     c    1a
  5      0     0     0    30     0    12     4     c
  6      0     0     0     0     8    82    2a    11
  7      0     0     0     0     0    24     0     e
  8     36     f     0     0     0     0     0     0
  9     69    62    62    55     0     0     0     0
 10     46    4e    70    6a     0     0     0    16
 11     3c    4a    44    4d     0     0     0     0
 12      a    1e    34    48     0     0     0     0
 13     2b     0     c     8     0     0     0     0
 14      0    8c     0     a     0     0     0     0
 15      0     0    30     0     0     0     0     0
 16      0     0     0     0     0     0     0     0
 17      0     0     0     0     0     0     0     0
 18      0     0     0     0     0     0     0     0
 19      0     0     0     0     0     0     0     0
 20      0     0     0     0     0     0     0     0

Two of the main problems in the prior set were solved, and minor variations
can also be eliminated using the short-chunk method (#3 above). From the above table we can
now compare various methods of looking for correlation between the data sets.
This is the method used for this algorithm. It is simple to implement, and
since it consists of two adders and two registers it is fairly compact. The equation is:

    sum = |A1 - B1| + |A2 - B2| + ... + |A20 - B20|

where Ai and Bi are the crossing counts in chunk i of the two recordings being compared.
Sets compared                        Sum
Voice 1 Ver 1 to Recorded Sample     589
Voice 1 Ver 2 to Recorded Sample     631
Voice 1 Ver 3 to Recorded Sample     631
Voice 1 Ver 4 to Recorded Sample     630
Voice 2 Ver 1 to Recorded Sample     274 *
Voice 2 Ver 2 to Recorded Sample     231 *
Voice 2 Ver 3 to Recorded Sample     250 *
Voice 1 Ver 1 to Voice 1 Ver 2       518 *
Voice 1 Ver 1 to Voice 1 Ver 3       472 *
Voice 1 Ver 1 to Voice 1 Ver 4       459 *
Voice 1 Ver 1 to Voice 2 Ver 1       603
Voice 1 Ver 1 to Voice 2 Ver 2       732
Voice 1 Ver 1 to Voice 2 Ver 3       621

The sums marked with an asterisk are the sets that are supposed to match. The
table suggests that a threshold for a match would be around 500. The one such value above
500, 518, could be a bad recording and can simply be ignored, since the voting will still
succeed on two matches.
Even though the sum of absolute differences provides basic information about the
cross-correlation between two sets of counts, squaring each chunk difference weights large
differences much more heavily:

    sum2 = (A1 - B1)^2 + (A2 - B2)^2 + ... + (A20 - B20)^2

If we apply this algorithm to the same numbers as above, the new table is:

Sets compared                        Sum of squares
Voice 1 Ver 1 to Recorded Sample     36673
Voice 1 Ver 2 to Recorded Sample     51667
Voice 1 Ver 3 to Recorded Sample     41927
Voice 1 Ver 4 to Recorded Sample     44330
Voice 2 Ver 1 to Recorded Sample      3718 *
Voice 2 Ver 2 to Recorded Sample     15035 *
Voice 2 Ver 3 to Recorded Sample     14602 *
Voice 1 Ver 1 to Voice 1 Ver 2       39288 *
Voice 1 Ver 1 to Voice 1 Ver 3       26232 *
Voice 1 Ver 1 to Voice 1 Ver 4       26025 *
Voice 1 Ver 1 to Voice 2 Ver 1       46731
Voice 1 Ver 1 to Voice 2 Ver 2       56760
Voice 1 Ver 1 to Voice 2 Ver 3       45055

Compared with the plain sum of absolute differences, a failure to match is now much further
separated from a match. For example, a threshold of 35000 would assure at least two matches
for every word stored in memory.
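A sketch of the two comparison measures side by side; this also illustrates the noise sensitivity discussed later in this report, where a single chunk differing by 50 crossings adds only 50 to the sum of absolute differences but 2500 to the sum of squared differences:

```python
def sad(a, b):
    """Sum of absolute differences between two lists of chunk counts."""
    return sum(abs(x - y) for x, y in zip(a, b))

def ssd(a, b):
    """Sum of squared differences; large per-chunk differences dominate."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# A single noisy chunk that differs by 50 crossings
# adds 50 to the SAD but 50**2 = 2500 to the SSD.
```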
The commands used for this test are two-word commands. The prior algorithm runs
into problems depending on when the user starts speaking within one of the 100 ms chunks.
To solve this problem, we reduced the first chunk length down to 5 ms and look for a few
crossings within that period. Testing for multiple words followed the same procedure as
before: dumping memory, comparing sums of squared differences, and so on. A matching
threshold of 35000 was used. For multiple words, the following commands were found to be
about 90% accurate and easy to say in a similar way each time. Suggested words (spelled to
suggest pronunciation):

Once again, we tried to see if we could recognize single-word commands. Four words were
recorded, and then recognition was attempted. The input to the voter was watched for the
number of matches on each word. Initially, at a threshold of 35000, all the words received
4 matches (the maximum possible). The threshold was successively lowered until matches were
made on only one word. We found that with a threshold value of 7500 we were able to
distinguish between these words:

The above command words are ideal, but unfortunately we ran into two problems.
The first was our inability to say the word the same way each time: at a threshold of 7500,
one must not change inflection or pitch greatly. The second problem was background lab
noise. When we performed the test the lab was fairly noisy, and if a noise such as a stool
squealing was recorded, it resulted in a very large sum. For example, if a stool squeal
caused 50 extra crossings, which is easy considering it is loud and of high frequency, we
would receive an additional 2500 on the sum of squared differences, on top of the actual
differences between the voices. We found recognition dropped to about 50-60% for a trained
speaker.
Throughout the design there has been reuse of large components such as counters
and timers, so space tended not to be an issue. The major speed vs. size trade-off was in
the sum-of-differences calculation.
The initial design took a more decentralized approach, and the number of logic cells was
found to grow rather quickly. After redesigning the system with a central controller and a
separate datapath, we found that we were able to reuse many large components, saving a
large amount of space.
Since all the values in the system were designed or checked via trial and error,
few signals were left to actually measure.
The final design of the project meets the original design goals: voice recognition of four different commands with reasonable accuracy on the Altera Flex10K. With careful training and a quiet room, the system can be set up to distinguish and recognize single English words with reasonable accuracy.
Limitations of the system are:
Strengths of the system are:
The algorithm we used was also limited by the amount of memory we have on the system. Currently we use all but one embedded array block (EAB), leaving little room for storing more data. To use a better voice recognition algorithm the use of external memory would be a requirement.
Overall, the system performs as expected and meets all the initial design goals.
[1] Ullmann, J.R., Pattern Recognition Techniques. Butterworth & Co. (Publishers) Ltd., London, 1973.
[2] Dr. Elliott, for suggesting the technique of using zero-crossing counts.
[3] http://www.btinternet.com/~netsurf/SudburyRC/coxamp.html
[4] http://www.darkportal.com/cc/index.htm
[5] Ashenden, P.J., The Designer's Guide to VHDL. Morgan Kaufmann Publishers, Inc., San Francisco, 1996.
[6] http://www.ee.ualberta.ca/~elliott/ee552/studentAppNotes/
[7] M. L. Stanley-Jones
[8] Altera code for LPM modules and from the LPM Wizard.
We the authors, namely Andrew Stanley-Jones, Kevin Grant, and Darren Gould,
declare that to the best of our knowledge all the code used in this project, and this
document itself, are original work, with the exception of:
(sorry, simulation output is not available in a web-readable format)
(schematics not available in a web-readable format)