Introduction to Baseline SMT (With Moses) (Part 1)
BY WZ & CK.D.LEE, 2013.01.30. We developed a baseline KR-CH bi-directional SMT system with Moses.
- Server: Cala
- OS: 64bit Ubuntu 11.10
- Language: Korean – Chinese Bi-Direction
(PS. This introduction is based on the Korean-to-Chinese MT system.)
Installation
The minimum software requirements are:
- Moses (obviously!)
- GIZA++, for word-aligning your parallel corpus
- Either IRSTLM or SRILM for Language model estimation
1. Moses Linux Installation
Moses requires boost. Your distribution probably has a package for it. If your distribution has separate development packages, you need to install those too. For example, Ubuntu requires
sudo apt-get install libboost-all-dev
Moses source code
The source code is stored in a git repository on github.
You can clone this repository with the following command (the instructions that follow from here assume that you run this command from your home directory):
git clone git://github.com/moses-smt/mosesdecoder.git
After checking out from git, examine the options you want.
cd ~/mosesdecoder
./bjam --help
For example, if you have 8 CPUs, build in parallel:
./bjam -j8
Run it for the first time
Download the sample models and extract them into your working directory:
cd ~/mosesdecoder
wget http://www.statmt.org/moses/download/sample-models.tgz
tar xzf sample-models.tgz
cd sample-models
Note that the configuration file moses.ini in each directory is set to use the KenLM language model toolkit by default. If you prefer to use another LM toolkit, edit the language model entry in moses.ini. See the Moses documentation for more details.
Run the decoder
cd ~/mosesdecoder/sample-models
~/mosesdecoder/bin/moses -f phrase-model/moses.ini < phrase-model/in > out
If everything worked out right, this should translate the sentence das ist ein kleines haus (in the file phrase-model/in) as it is a small house (in the file out).
PS. Details can be found in the official Moses documentation.
2. Installing GIZA++
GIZA++ is hosted at Google Code, and a mirror of the original documentation can be found here. I recommend that you download the latest version from Google Code – I’m using 1.0.7 in this guide so I downloaded and built it with the following commands (issued in my home directory):
wget http://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz
tar xzvf giza-pp-v1.0.7.tar.gz
cd giza-pp
make
This should create the binaries ~/giza-pp/GIZA++-v2/GIZA++, ~/giza-pp/GIZA++-v2/snt2cooc.out and ~/giza-pp/mkcls-v2/mkcls. These need to be copied to somewhere that Moses can find them, as follows:
cd ~/mosesdecoder
mkdir tools
cp ~/giza-pp/GIZA++-v2/GIZA++ ~/giza-pp/GIZA++-v2/snt2cooc.out ~/giza-pp/mkcls-v2/mkcls tools
When you come to run the training, you need to tell the training script where GIZA++ was installed using the -external-bin-dir argument:
train-model.perl -external-bin-dir $HOME/mosesdecoder/tools
3. Installing IRSTLM (because you need to pay if you want to use SRILM commercially.)
IRSTLM is a language modelling toolkit from FBK, hosted on SourceForge. Again, you should download the latest version. I used version 5.80.01 for this guide, so assuming you downloaded the tarball into your home directory (and making the obvious changes if you download a later version), the following commands should build and install IRSTLM:
tar zxvf irstlm-5.80.01.tgz
cd irstlm-5.80.01
./regenerate-makefiles.sh
./configure --prefix=$HOME/irstlm
make install
You should now have several binaries and scripts in ~/irstlm/bin, in particular build-lm.sh.
Corpus Preparation
To train a translation system we need parallel data (text in two different languages) which is aligned at the sentence level. This time we used a Korean-Chinese corpus from CSLI with 300,000 parallel sentences for the experiment. We placed it as follows:
cd
mkdir ~/corpus
We created a directory named corpus and placed the two files seg.ko & seg.zh in it (PS. they are already segmented).
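Before training, it is worth sanity-checking that the two sides of the corpus really are parallel: the files must have exactly the same number of lines, and neither side should contain empty lines. A minimal sketch using toy stand-in files (in practice you would point these checks at ~/corpus/seg.ko and ~/corpus/seg.zh; the toy sentences and the /tmp path are just for illustration):

```shell
# Toy stand-ins for ~/corpus/seg.ko and ~/corpus/seg.zh
mkdir -p /tmp/corpus-check && cd /tmp/corpus-check
printf '안녕 하세요\n감사 합니다\n' > seg.ko
printf '你 好\n谢 谢\n' > seg.zh

# Both sides must have the same number of lines (sentence-aligned)...
ko_lines=$(wc -l < seg.ko)
zh_lines=$(wc -l < seg.zh)
[ "$ko_lines" -eq "$zh_lines" ] && echo "line counts match"

# ...and neither side should contain empty lines.
! grep -q '^$' seg.ko seg.zh && echo "no empty lines"
```

If either check fails, fix the corpus before training; mismatched line counts will silently mis-align every sentence pair that follows the first error.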
The language model (LM) is used to ensure fluent output, so it is built with the target language (Chinese, in our Korean-to-Chinese setup). The IRSTLM documentation gives a full explanation of the command-line options, but the following will build an appropriate 3-gram language model, removing singletons, smoothing with improved Kneser-Ney, and adding sentence boundary symbols (the ~/lm directory can be anywhere):
mkdir ~/lm
cd ~/lm
~/irstlm/bin/add-start-end.sh < ~/corpus/seg.zh > seg.sb.zh
export IRSTLM=$HOME/irstlm
~/irstlm/bin/build-lm.sh -i seg.sb.zh -t ./tmp -p -s improved-kneser-ney -o seg.lm.zh
~/irstlm/bin/compile-lm --text yes seg.lm.zh.gz seg.arpa.zh
This should give a language model in the seg.arpa.zh file, which we'll then binarise (for faster loading) using KenLM:
~/mosesdecoder/bin/build_binary seg.arpa.zh seg.blm.zh
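For reference, the intermediate *.arpa file produced by compile-lm is a plain-text model in the standard ARPA format: a header of n-gram counts, then one block per n-gram order with log10 probabilities (and back-off weights where applicable). The counts and probabilities below are made up purely for illustration:

```
\data\
ngram 1=12345
ngram 2=34567
ngram 3=23456

\1-grams:
-2.3456	</s>
-2.1098	<s>	-0.5432
-3.4567	你	-0.2345
...

\2-grams:
-1.2345	<s> 你	-0.1234
...

\end\
```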
The KR-CH language model can be found at jiafei427@cala:~/mosesdecoder/ckLm$, and we put the commands above into a shell file called mtTarCH.sh. Run it by typing sh mtTarCH.sh in the console.
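A minimal sketch of what mtTarCH.sh might look like, assuming it wraps the target-side (Chinese) LM commands above; the exact contents of the file on cala may differ, and /tmp is used here only for demonstration:

```shell
# Write a sketch of mtTarCH.sh; file names and paths mirror the steps above
# and are assumptions, not the actual script from the server.
cat > /tmp/mtTarCH.sh <<'EOF'
#!/bin/sh
# Build and binarise the target-side (Chinese) language model.
set -e
export IRSTLM=$HOME/irstlm
mkdir -p ~/lm && cd ~/lm
$IRSTLM/bin/add-start-end.sh < ~/corpus/seg.zh > seg.sb.zh
$IRSTLM/bin/build-lm.sh -i seg.sb.zh -t ./tmp -p -s improved-kneser-ney -o seg.lm.zh
$IRSTLM/bin/compile-lm --text yes seg.lm.zh.gz seg.arpa.zh
~/mosesdecoder/bin/build_binary seg.arpa.zh seg.blm.zh
EOF
chmod +x /tmp/mtTarCH.sh
sh -n /tmp/mtTarCH.sh && echo "mtTarCH.sh syntax OK"
```

The script is only written and syntax-checked here (sh -n), not executed, since it needs IRSTLM, Moses and the corpus in place.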
Training the Translation System
Finally we come to the main event: training the translation model. This will run word-alignment (using GIZA++), phrase extraction and scoring, create lexicalised reordering tables and create your Moses configuration file, all with a single command. I recommend that you create an appropriate directory as follows, and then run the training command, catching logs:
mkdir ~/working
cd ~/working
nohup nice ~/mosesdecoder/scripts/training/train-model.perl -root-dir train \
 -corpus ~/corpus/seg -f ko -e zh -alignment grow-diag-final-and \
 -reordering msd-bidirectional-fe -lm 0:3:$HOME/lm/seg.blm.zh:8 \
 -external-bin-dir ~/mosesdecoder/tools >& training.out &
If you have a multi-core machine, it's worth using the -cores argument to encourage as much parallelisation as possible.
This is the slowest part of the process, so you might want to line up something to read while it's progressing. Tuning requires a small amount of parallel data, separate from the training data, so we placed another parallel data set in the ~/corpus directory: seg.dev.ko and seg.dev.zh (also segmented). While the system tunes, you can play a few games of LOL, Warcraft or maybe Dragon Flight, because this step will burn plenty of your precious time.
Now go back to the directory we used for training, and launch the tuning process:
cd ~/working
nohup nice ~/mosesdecoder/scripts/training/mert-moses.pl ~/corpus/seg.dev.ko ~/corpus/seg.dev.zh \
 --decoder-flags="-threads 4" ~/mosesdecoder/bin/moses train/model/moses.ini \
 --mertdir ~/mosesdecoder/bin/ &> mert.out &
If you have several cores at your disposal, it's a lot faster to run Moses multi-threaded; the --decoder-flags="-threads 4" option in the command above runs the decoder with 4 threads. With this setting, tuning took about 2 hours for me. The end result of tuning is an ini file with trained weights, which should be in ~/working/mert-work/moses.ini if you've used the same directory structure as me.
You can now run Moses with
~/mosesdecoder/bin/moses -f ~/working/mert-work/moses.ini
and type in your favorite Korean sentence to test the system. AND~! BAM!!! You will see your sentence translated like a CHARM~!!
Appendix. [Korean Parser] We created a rough, morphological-level Korean parser with the JHanNanum Korean Morphological Analyzer. Usage: there is a file named krSeg.jar and folders called conf & data. Put them in the same folder. Write the Korean sentence you want to split, and save the file as sent.ko.
You can now run it with
java -jar krSeg.jar
Then BAM! A file seg.ko, with the segmentation already done, will appear in the same directory.
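The two steps above (write sent.ko, run the jar) are easy to wrap in a tiny helper; a sketch, assuming krSeg.jar, conf and data sit in the current directory. The script is only written and syntax-checked here, not run, since the jar is not distributed with this guide:

```shell
# Write a hypothetical wrapper for the segmenter; the script name and the
# in/out file names (sent.ko / seg.ko) follow the usage described above.
cat > /tmp/krseg.sh <<'EOF'
#!/bin/sh
# Usage: krseg.sh "Korean sentence"
# Writes the sentence to sent.ko, runs the segmenter, prints seg.ko.
set -e
printf '%s\n' "$1" > sent.ko
java -jar krSeg.jar          # reads sent.ko, writes seg.ko
cat seg.ko
EOF
chmod +x /tmp/krseg.sh
sh -n /tmp/krseg.sh && echo "krseg.sh syntax OK"
```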