You will be replicating a published experiment on unsupervised (statistical) word segmentation.
Since the original work by Jenny Saffran, many researchers have run studies investigating how humans might use unsupervised learning to segment words. While the bulk have been run with infants, at least a couple dozen have also been run with adults. (At the moment, we know very little about how unsupervised learning differs between adults and infants.)
Over the next several years, students in this class will be replicating as many of the adult studies as possible. The educational purpose is obvious (to give you hands-on experience with doing research). There are several scientific purposes as well:
The result of this project should be a much clearer view of the ability of humans to use unsupervised learning for word segmentation -- and, by extension, a clearer view of language acquisition itself.
By the end of reading period, you should turn in via git:
Each group should complete at least one experiment per three students. So if you have 4 students in your group, you need to do two papers. You only need to turn in one copy of each paper and experiment.
There are three phases to this project. First, you make the experiment. Second, my lab runs the experiment and gives you the data. Finally, you analyze the data and write up the results.
Collecting the data is expensive. So I'm going to confirm that you have done everything correctly before I collect data. This may take me a few days. If there are errors, I'll send it back to you for corrections, and then the process repeats. And that will take more days. Plus collecting the data itself will take 2-3 days. There are three key take-aways from this:
n1bapudipadubi.wav) not non-descriptive ones (e.g., 1.wav). That makes it easier for me to figure out what is what. Document why you believe your nonwords or partwords meet the definition. Also, record any code you used, including sox commands or cat commands (see below). Basically, the more you can record what you did and why you did it, the faster I can double-check. If I cannot figure out what you did, I may need you to come in to explain it to me.
Here is a rough suggested timeline. If anything, I would try to be ahead of this schedule:
4/10 - Complete method section, so you know what you have to do (consider sending me a copy for comments)
4/24 - Turn in experiment code
5/1 - Receive data from instructor
5/8 - Paper due at midnight
I highly recommend that while you are waiting for your experiment to be checked or for data to be collected, you start making your analysis scripts. Some of the experiments involve analyses you may not have done before in R.
(If you aren't working with synthesized speech, you can skip this section.)
Watson text-to-speech is probably great for making Web apps. It was not good for making science experiments. Even for the first experiment, we ended up having to bend some rules. (For instance, we didn't have any co-articulation.) And many of the other experiments simply weren't possible with Watson.
I spent a lot of time digging around for an alternative that was more flexible but not too difficult to use. I ended up with MBROLA. You certainly can try making Watson work for you, but I strongly recommend using MBROLA instead.
Open the terminal and type the following:
$ cd ~/Downloads
$ curl -O "http://tcts.fpms.ac.be/synthesis/mbrola/bin/macintosh/mbrola"
$ chmod 744 "mbrola"
$ mv mbrola /usr/local/bin/
Assuming you haven't gotten any errors, close the terminal and reopen (if you got errors, speak with an instructor). Type the following:
$ mbrola -h
You should see the following in response:
USAGE: mbrola [-i] [-e] [-c CC] [-v VR] [-f FR] [-t TR] [-l VF] [-R RL] [-C CL] database pho_file* output_file
A - instead of pho_file or output_file means stdin or stdout
Extension of output_file ( raw, au, wav, aiff ) tells the wanted audio format
i = Print the database information if any
e = No fatal error on unkown diphone
CC= Comment Char, escape sequence for a comment
VR= Volume Ratio, float ratio applied to ouput samples
FR= Frequency Ratio, float ratio applied to pitch points
TR= Time Ratio, float ratio applied to phone durations
VF= Voice Freq, target freq for voice quality
RL= Phoneme renaming list of the form a A b B ...
CL= Phoneme cloning list of the form a A b B ...
If you get the error -bash: mbrola: command not found then you need to add /usr/local/bin to your path. Speak with an instructor.
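While you wait to hear back, here is a minimal sketch of how you could check your path yourself. It assumes you are using bash (as the error message above suggests), and it only changes the path for the current Terminal window; making the change permanent means adding the export line to your shell startup file, which is worth doing together with an instructor.

```shell
# Split PATH into one directory per line and look for an exact match.
if echo "$PATH" | tr ':' '\n' | grep -qx '/usr/local/bin'; then
  echo "/usr/local/bin is already on your PATH"
else
  # Add it for this Terminal session only (not permanent).
  export PATH="/usr/local/bin:$PATH"
  echo "Added /usr/local/bin to PATH for this session"
fi
```

After running this, try mbrola -h again in the same window.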
Now, we need to pick a voice for synthesis. You can download voices from the MBROLA project page. Choose 'downloads' from the menu on the top left, then click on 'MBROLA binary and voices'.
Scroll down until you see 'Getting the MBROLA Voices'. For now, download the voice us1. (If your experiment involves non-English phonemes, you may ultimately need to use a different voice. For the purposes of this example, though, use us1.)
Move the us1 folder to your repo. Now, in the terminal, move yourself to the us1 folder. Then type the following:
$ mbrola us1 TEST/alice.pho sample.wav
$ open sample.wav
If the gods are good, you will hear a very robotic rendition of the opening lines of Alice in Wonderland. (If you get an error, check with an instructor.)
In the last part above, we asked MBROLA to convert the file alice.pho into a wav file. Let's take a look at alice.pho. Here are the first couple dozen lines:
; Male voice -> 0.7, female voice 1.5
; F=0.7
; F=1.5
; Beginning of "Alice's adventures in Wonderland"
;
; file created with MBROLIGN v1.0
; Software distributed on http://tcts.fpms.ac.be/synthesis
; Malfrère malfrere@tcts.fpms.ac.be
_ 48 0 222
{ 80 40 222 90 235
l 72 44 250 66 250
I 80
s 88
w 40 80 235
@ 40 80 210
z 64 12 210 50 181 75 181
b 72 33 173 88 181
I 56 71 181
g 56 28 153 100 166
I 64 62 160
n 40 60 160
I 88 27 166
N 40 100 166
The first few lines -- the ones that start with ; -- are comments. They are meant to provide information to human readers. They are ignored by MBROLA. After that, we get a list of phonemes, with one phoneme per line. The first two entries on each row indicate the sound and the duration of the sound. So _ 48 means to play 48 milliseconds of silence. { 80 means to play 'a' as in 'cat' for 80 milliseconds. The third line plays the 'l' sound for 72 milliseconds, and so on.
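Since the second entry on every row is a duration, you can add those up to check how long a .pho file will be before you synthesize it. Here is a small sketch using awk; the file demo.pho is a made-up stand-in that just holds the first three phoneme lines from alice.pho plus one comment.

```shell
# Build a tiny stand-in .pho file (one comment line, three phoneme lines).
printf '; demo file\n_ 48 0 222\n{ 80 40 222 90 235\nl 72 44 250 66 250\n' > demo.pho

# Skip comment lines (starting with ;) and sum the second column (durations in ms).
awk '!/^;/ && NF >= 2 { total += $2 } END { print total " ms" }' demo.pho
# prints: 200 ms
```

You can point the same awk command at any of your own .pho files to double-check that a word or training stream has the length the paper calls for.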
As you may have noticed, MBROLA does not use IPA. Each voice actually defines its own phoneme inventory. Here is the list for us1:
SYMBOL  PRONOUNCED LIKE IN
p       drop proxy
p_h     pod (aspirated allophone of p)
t       plot tromp
t_h     top (aspirated allophone of t)
4       later (flapped allophone of t)
k       rock crop
k_h     cot (aspirated allophone of k)
b       cob box
d       nod dot
g       jog gospel
f       prof fox
s       boss sonic
S       wash shop
tS      notch chop
T       cloth thomp
v       salve volley
z       was zombie
Z       garage jacques
dZ      dodge jog
D       clothe thy
m       palm mambo
n       john novel
N       bong
l       doll lockwood
r       star roxanne
j       yacht
w       show womble
h       harm
r=      her urgent
i       even
A       arthur
O       all
u       oodles
I       illness
E       else
{       apple
V       nut
U       good
@       about
EI      able
AI      island
OI      oyster
@U      over
aU      out
Some of the rows in alice.pho have additional numbers that come after the duration. These numbers allow us to set the pitch contour: how the pitch rises and falls during the speech. We specify the pitch we want at particular points in the sound, and MBROLA connects the dots. So if we wanted the vowel { to start out fairly low (100 Hz) and end up high (250 Hz), we would write:
{ 200 0 100 100 250
MBROLA reads this as: Play 'a' for 200 milliseconds. 0% of the way through the sound -- that is, at the beginning -- the pitch should be 100 Hz. 100% of the way through the sound -- that is, at the end -- the pitch should be 250 Hz.
We've only told MBROLA the pitches we want at two locations in the sound; MBROLA fills in the rest. If we wanted the pitch to peak halfway through the vowel and then go back down, we could write:
{ 200 0 100 50 250 100 100
As you can tell, we can place as many pitch markers on a single sound as we want. Or, we can put none at all:
k 100 0 200
{ 200
t 100 100 250
This will produce the word 'cat', starting at 200 Hz and ending up at 250 Hz. Again, MBROLA simply changes the pitch smoothly in between the markers.
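To get a feel for what "connecting the dots" means, here is a rough sketch that works out the pitch between two markers. It assumes MBROLA draws a straight line between neighbouring markers; interp is a made-up helper name for this example, not an MBROLA command. The numbers come from the peaked vowel above ({ 200 0 100 50 250 100 100).

```shell
# Hypothetical helper: pitch at position POS (in % of the sound), assuming a
# straight line between marker (P1, HZ1) and marker (P2, HZ2).
# usage: interp POS P1 HZ1 P2 HZ2
interp() {
  awk -v x="$1" -v x1="$2" -v y1="$3" -v x2="$4" -v y2="$5" \
    'BEGIN { print y1 + (y2 - y1) * (x - x1) / (x2 - x1) }'
}

interp 25 0 100 50 250     # a quarter of the way through: prints 175
interp 75 50 250 100 100   # three quarters of the way through: prints 175
```

So under that assumption, the pitch climbs from 100 Hz through 175 Hz to the 250 Hz peak, then falls back through 175 Hz to 100 Hz.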
For most of our projects, we actually want a monotone. So we might start and stop at 200 Hz:
_ 50 0 200
k 100
{ 200
t 100
_ 50 100 200
However, MBROLA doesn't like going too long without being given a pitch marker. So every dozen or so phonemes, you may want to pop one in. If it's at the same pitch as the previous one, there won't be any pitch contour.
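Here is a sketch of what that might look like, written straight from the Terminal. The nonword "bapudi", the file name, and all the durations are made up for illustration (and I have not checked how it sounds), so take your own values from the paper you are replicating. The extra marker on the u line is at the same 200 Hz as the others, so the word stays monotone.

```shell
# Write a monotone nonword to a .pho file. Quoting 'EOF' stops the shell
# from interpreting any characters in the phoneme lines.
cat > bapudi.pho <<'EOF'
; hypothetical nonword "bapudi", monotone at 200 Hz
_ 50 0 200
b 80
A 160
p 80
u 160 100 200
d 80
i 160
_ 50 100 200
EOF

# Count the rows that carry at least one pitch marker (4 or more fields).
awk '!/^;/ && NF >= 4' bapudi.pho | wc -l
```

In a word this short the middle marker is not really needed; it is there to show the form you would repeat every dozen or so phonemes in a longer stream.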
Read the paper you are trying to replicate and decide! Also listen to the sound files and see if you can make them out. If they sound bad, keep working. (They won't sound great, though, as the Alice example shows; this is pretty old tech.)
One major reason to use MBROLA to make our sound files is to get natural co-articulation. So we don't want to make the wav files for each syllable separately and then concatenate them. Instead, we want MBROLA to read the entire word at once.
So instead of concatenating wav files, we're going to concatenate pho files. A .pho file is really just a plain text file that ends in .pho and has content in the format described above. You can actually make plain text files in Word; just be sure to save as plain text, not as a Word doc. (TextEdit will edit plain text files but will not create new ones.)
It is easy to concatenate plain text files in the Terminal using the command cat. The syntax for cat is:
cat file1 file2 ... > output_file
That is, you list all the files that you want concatenated, and then you use the > to direct all the output to a specific file (you can call that output file whatever you want).
Navigate to the us1 folder, then type the following:
$ cat TEST/alice.pho TEST/mbrola.pho > concatenated.pho
This concatenates alice.pho and mbrola.pho and saves the result as concatenated.pho. (Notice that we have to specify that alice.pho and mbrola.pho are in the TEST folder; if we don't tell cat where the files are, it can't find them!)
Now, we are ready to call MBROLA. The syntax for using MBROLA is:
mbrola voice_file input.pho output.wav
That is, we specify the voice file we want to use (us1), the .pho file we want to parse, and what we want to call the resulting .wav file. Remember, if these files are not in your current working directory, you will need to specify their paths. Now, type:
$ mbrola us1 concatenated.pho concatenated.wav
$ open concatenated.wav
This should play the Alice story and then a description of MBROLA.
Thus, by using cat, you can easily combine .pho files to make the question files (which, for most experiments, consist of one word from the language and one foil). Note that you'll also need to make a file called silence.pho, which consists of however much silence you need between the words.
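As a rough sketch, making silence.pho and one question file might look like this. The 500 ms gap and the names word.pho, foil.pho, and question1.pho are placeholders I made up for the example; the two-line word and foil files are only stand-ins so the commands run on their own.

```shell
# Half a second of silence (an assumed duration; use whatever gap your
# experiment calls for).
printf '_ 500\n' > silence.pho

# Two tiny stand-in .pho files; replace these with your real word and foil.
printf 'b 80\nA 160\n' > word.pho
printf 'd 80\ni 160\n' > foil.pho

# Word, then silence, then foil, all in one .pho file for MBROLA to read.
cat word.pho silence.pho foil.pho > question1.pho
cat question1.pho
```

Your real files will also need pitch markers, as described earlier.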
Remember that you will need to convert all the .wav files to .mp3, just as in the first experiment.
I have written an R script, makestims.R, which contains a single function, make.training(wordlist,repsperfile,nblocks). First, run the code in order to load the function (the easiest method is to select all, then press control+enter). Then run make.training(wordlist,repsperfile,nblocks), where wordlist is the list of .pho files for your words, repsperfile is the total number of times each word should be presented, and nblocks is the number of training blocks. For example:
> make.training(c("alice.pho","mbrola.pho"),5,2)
[1] " cat alice.pho mbrola.pho alice.pho mbrola.pho alice.pho > training1.pho; cat mbrola.pho alice.pho mbrola.pho alice.pho mbrola.pho > training2.pho;"
Now, in Terminal, make sure you are in the us1/TEST folder. Copy everything within the quotation marks above into the Terminal:
$ cat alice.pho mbrola.pho alice.pho mbrola.pho alice.pho > training1.pho; cat mbrola.pho alice.pho mbrola.pho alice.pho mbrola.pho > training2.pho;
This will make two files: training1.pho and training2.pho. You can then use MBROLA to convert these into wav files. (Remember about paths. If you are still in the TEST folder, then us1 isn't local and you'll need to specify the path: mbrola ../us1 training1.pho training1.wav.)
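If you end up with many training files, converting them one at a time gets tedious. Here is a sketch of a loop that does them all, assuming you are in the TEST folder (so the voice file is at ../us1) and your files are named training1.pho, training2.pho, and so on. If MBROLA isn't installed, it just prints the command it would have run.

```shell
# Convert every training*.pho file in the current folder to a .wav file.
for pho in training*.pho; do
  [ -e "$pho" ] || continue        # nothing matched the pattern; skip
  wav="${pho%.pho}.wav"            # training1.pho -> training1.wav
  if command -v mbrola >/dev/null 2>&1; then
    mbrola ../us1 "$pho" "$wav"
  else
    echo "mbrola not found; would run: mbrola ../us1 $pho $wav"
  fi
done
```

Save the loop in a file in your repo along with your other commands, so I can see exactly how each wav file was made.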
This script may not be enough for your experiment. You may need to modify it! (This is why I wrote it in R.) See if you can figure it out for yourself, but also don't be afraid to discuss with an instructor.