There are three ways to create a collective work: 1. Pay people. 2. Get volunteers. 3. Architect your product in such a way that people create collective value by pursuing their individual self-interest. By way of example, Yahoo! built their directory using method 1. Many open source projects as well as shared content projects like Wikipedia use method 2. But many of the great successes of the Internet age have discovered method 3.
Brough proposes method 3 as the way to accumulate large speech corpora. That is, to provide ways for people to contribute while pursuing their individual self-interest. One way to achieve this might be through the creation of a game that, as a side effect, would collect speech as the user plays the game.
The game approach has been used successfully, as shown in this presentation by Luis von Ahn (from CMU) describing initiatives that use two-player games to provide descriptive tags for web images:
http://video.google.com/videoplay?docid=-8246463980976635143
Key success factors:
For a game using speech recognition, it is important to have a word or phrase that is known, and to use that in your grammar (so that open source speech recognition will work, since we are essentially limited to grammar-based speech rec). If the system doesn't recognize the utterance, then maybe get the user to type the answer and say it (not an optimal approach).
The user's goals in the game should reasonably align with the site's goals (i.e. those of VoxForge).
The best games are those that don't attract too much attention to the player. For example, most solitaire is played while goofing off at work.
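As an illustration of the first success factor, here is a minimal sketch of how a game might build a small grammar around a known answer and check a recognizer hypothesis against it. The names build_grammar and check_answer, the normalization rules, and the three result labels are assumptions made for this example; they are not part of VoxForge or of any speech recognition engine's API. The hypothesis string stands in for whatever text the recognizer returns.

```python
import re

def normalize(text):
    """Lower-case and drop punctuation so spoken and written forms compare equal."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))

def build_grammar(answer, distractors=()):
    """Return the set of phrases the recognizer is allowed to return.

    Hypothetical helper: a real engine would need this list converted
    into its own grammar format.
    """
    return {normalize(p) for p in (answer, *distractors)}

def check_answer(hypothesis, answer, grammar):
    """Return 'correct', 'wrong', or 'unrecognized'.

    'unrecognized' is the case where the game might fall back to asking
    the user to type the answer.
    """
    hyp = normalize(hypothesis or "")
    if hyp not in grammar:
        return "unrecognized"
    return "correct" if hyp == normalize(answer) else "wrong"
```

Including a few plausible wrong answers in the grammar gives the recognizer something to choose between, so an utterance that lands on the known answer is more meaningful than a match against a one-phrase grammar.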
Here is a list of possible games that might be created to help with the collection of speech for the VoxForge Speech Corpus:
Single Player vs computer, using speech recognition:
Riddles - with one answer or phrase. User guesses the answer to the riddle using speech. Speech Rec is used to check for correct answer. Collect the speech for incorporation into VF speech corpus.
Word jumble guessing – user is presented with a word/sentence with its letters randomly mixed up, must guess correct word/sentence. Speech Rec is used to check for correct answer. Collect the speech for incorporation into VF speech corpus.
Anagram guesser – user needs to guess all possible combinations of words using only the letters in a particular word. There are on-line websites/algorithms for determining all possible anagrams for a given word. Speech Rec is used to check for correct answer. Collect the speech for incorporation into VF speech corpus.
Identify music passages – play a section of music, user guesses the name of the song. Speech Rec is used to check for correct answer. Collect the speech for incorporation into VF speech corpus. Copyright issues need to be resolved if current music is used. Might use MIDI versions of out-of-copyright music.
Simon says – computer recites an ever-increasing sequence of words that the user must repeat correctly (could be random words, or themed random sentences ... from Shakespeare for example). Speech Rec is used to check for correct answer. Collect the speech for incorporation into VF speech corpus.
Repeat a rhyme or limerick as quickly as possible – basically trying to collect speech that more closely corresponds to spontaneous speech (as opposed to read speech). Speech Rec is used to check for correct answer. Collect the speech for incorporation into VF speech corpus.
Bumper Stumper game – user presented with a word in abbreviated license plate format, must guess the correct word. Speech Rec is used to check for correct answer. Collect the speech for incorporation into VF speech corpus.
Text-messaging stumper game – players compete to see who can guess an abbreviated text message the quickest. Speech Rec is used to check for correct answer. Collect the speech for incorporation into VF speech corpus.
Wheel of Fortune game – user given a sentence with blanks representing the letters. User would key in letters as consonants or vowels are guessed, but must speak the answer to win. Speech Rec is used to check for correct answer. Collect the speech for incorporation into VF speech corpus.
20 questions – user is presented with a list of questions they can ask (basically this is the grammar for the SRE) to identify a person/place/thing. They have twenty tries to get an answer. Speech rec is used to recognize the question (each question is removed from the list once it has been selected). User then must guess the word.
Voice tetris/scrabble – words fall down the screen to one side and player selects it by saying the word and using mouse to move the word into place on a scrabble-like board. Speech Rec is used to select the word. Collect the speech for incorporation into VF speech corpus.
Falling Word de-scramble – scrambled word or anagram falls down a screen and if person solves the word puzzle (by uttering the correct word) before it hits the bottom of the screen they get points. Speech Rec is used to check for correct answer. Collect the speech for incorporation into VF speech corpus.
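Several of the games above (word jumble, anagram guesser, falling word de-scramble) need the same two building blocks: scrambling a known word, and listing the words that can be spelled from a given set of letters. The sketch below shows one way to do both. The function names and the word_list argument are assumptions for illustration; a real game would load its word list from a dictionary file and feed the resulting words into the recognizer grammar.

```python
import random
from collections import Counter

def jumble(word, rng=random):
    """Shuffle the letters of word, retrying until the result differs from it."""
    letters = list(word)
    if len(set(letters)) < 2:
        return word  # e.g. "aaa" cannot be scrambled into anything different
    while True:
        rng.shuffle(letters)
        scrambled = "".join(letters)
        if scrambled != word:
            return scrambled

def sub_anagrams(source, word_list):
    """Return the words in word_list spelled using only the letters in source.

    Each letter may be used at most as many times as it occurs in source.
    """
    available = Counter(source.lower())
    # Counter subtraction keeps only positive counts, so an empty result
    # means the candidate needs no letter that source lacks.
    return sorted(w for w in word_list
                  if w and not (Counter(w.lower()) - available))
```

The list returned by sub_anagrams is exactly the set of acceptable answers for a round, so it can double as the grammar for that round.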
Single Player – no speech rec:
Virtual "Geeks With Talent"/"American Idol" – use Shakespeare (or another out-of-copyright author) and get people to recite passages in different styles (e.g. rapper, valley girl, etc.). Get others to vote. Collect the speech for incorporation into VF speech corpus.
Two Player Asymmetric games (i.e. one player has information that the other player does not) - speech rec:
"Pictionary"-like game - board game where one player tries to get other player to guess a word, but can only use drawings. One player would draw a picture on-line (using mouse; or searches for pictures in Google, but other user cannot see any text – only the resulting pictures from the search), other player would verbally guess, and speech rec would tell them when they got the right word.
Word guessing game – One player is given a word, and gives the other player hints as to what it is (speech or text) without actually saying the word (might have a "not usable" word list). Speech Rec is used to check for correct answer from second player. Collect the speech for incorporation into VF speech corpus.
Image Content Guessing Game Using Words. A picture is shown to one user. S/he describes the picture using text, the other player guesses the answer. Speech Rec is used to check for correct answer. Collect the speech for incorporation into VF speech corpus.
Quote jumble game – Two users are presented with a speech recording of a well-known quote, but jumbled up. First person to guess the sentence wins. Speech Rec is used to check for correct answer. Collect the speech for incorporation into VF speech corpus.
Virtual Scavenger Hunt – one player controls a browser that is viewable to both players. The second player gives directions telling the first where to go and what to collect, using a limited grammar of words (e.g. go to the Sun website, and collect the logo for open source Java). The spoken directions are converted to text (based on the grammar) and sent to the other user. Speech Rec is used to validate the words used. Collect the speech for incorporation into VF speech corpus.
Two Player Asymmetric Game – no speech rec (speech collection for later transcription)
Assemble 3D part – one player has pictorial instructions to assemble a virtual machine. Instructs other user how to put the machine together. Collect speech for later transcription.
Map Directions – both players have access to Google Maps. One player gives verbal directions to the other. Collect speech for later transcription.
Two Player Symmetric Game (i.e. both players have access to the same information) - speech rec
Image ESP game knockoff – System presents a known image (that has a predefined grammar of words that describe the picture) to both users. Each user utters words that describe the contents of the image. Players get points for using the same words to describe the image. Speech Rec is used to check for matching answers. Collect the speech for incorporation into VF speech corpus.
Sound ESP – same thing as the Image ESP, but using sounds.
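The matching step in the ESP-style games can be sketched as a set intersection: keep only the words each player said that are in the image's predefined grammar, then award points for the words both players used. The function name score_round and the one-point-per-match rule are assumptions for this example, not a description of how the original ESP game scores.

```python
def score_round(words_a, words_b, image_grammar):
    """Return the sorted list of in-grammar words that both players said."""
    grammar = {w.lower() for w in image_grammar}
    said_a = {w.lower() for w in words_a} & grammar
    said_b = {w.lower() for w in words_b} & grammar
    return sorted(said_a & said_b)

# Hypothetical round: one point per matching word.
matches = score_round(["Dog", "ball", "grass"],
                      ["dog", "park", "ball"],
                      ["dog", "ball", "grass", "tree"])
points = len(matches)
```

Restricting both players to the grammar first mirrors the grammar-based recognition constraint: a word outside the grammar would not have been recognized in the first place.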
Two Player Symmetric game – no speech rec
Transcribing game – two users are presented with an audio sentence that needs transcribing. The words are jumbled up, and the first person who puts them in the right order wins. The sentence gets un-jumbled bit by bit as time passes (with the point value for a correct answer going correspondingly down). Speech rec cannot be used in this case (at least not open source ... yet), but if enough players give the same answer for the same jumble of words, then at some point we can say that we have a good transcription.
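The "enough players guessing the same answer" idea can be sketched as a simple agreement check over submitted transcriptions. The thresholds min_votes and min_share below are arbitrary assumptions chosen for illustration; suitable values would have to be found by experiment.

```python
from collections import Counter

def consensus(answers, min_votes=3, min_share=0.6):
    """Return the agreed transcription, or None if agreement is too weak.

    answers: transcriptions submitted by different players for one recording.
    """
    # Ignore case and spacing differences; skip empty submissions.
    normalized = [" ".join(a.lower().split()) for a in answers if a and a.strip()]
    if not normalized:
        return None
    text, votes = Counter(normalized).most_common(1)[0]
    if votes >= min_votes and votes / len(normalized) >= min_share:
        return text
    return None
```

Requiring both a minimum number of votes and a minimum share guards against accepting a transcription after only a couple of plays, or one that merely edges out several competing answers.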
Interesting article from CMU: