Introduction to Baseline SMT (With Moses) (Part 1)

By WZ & CK.D.Lee, 2013.01.30. We developed a baseline Korean-Chinese bidirectional SMT system with Moses.

  • Server: Cala
    • OS: 64-bit Ubuntu 11.10
    • Language: Korean – Chinese, bidirectional

(PS. This introduction is based on the Korean-to-Chinese direction of the system.)

Installation

     The minimum software requirements are:

  • Moses (obviously!)
  • GIZA++, for word-aligning your parallel corpus
  • Either IRSTLM or SRILM, for language model estimation

1. Moses Linux Installation

Install boost

Moses requires boost. Your distribution probably has a package for it. If your distribution has separate development packages, you need to install those too. For example, Ubuntu requires libboost-all-dev.

 sudo apt-get install libboost-all-dev

Moses source code

The source code is stored in a git repository on github.

You can clone this repository with the following command (the instructions that follow from here assume that you run this command from your home directory):

 git clone git://github.com/moses-smt/mosesdecoder.git

Compile

After checking out from git, examine the available build options:

 cd ~/mosesdecoder
 ./bjam --help

For example, if you have 8 CPUs, build in parallel:

 ./bjam -j8
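If you're not sure how many cores the machine has, you can let the shell fill the number in (assuming GNU coreutils' nproc is available, which it is on Ubuntu):

 ./bjam -j"$(nproc)"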

Run it for the first time

Download the sample models and extract them into your working directory:

 cd ~/mosesdecoder
 wget http://www.statmt.org/moses/download/sample-models.tgz
 tar xzf sample-models.tgz
 cd sample-models

Note that the configuration file moses.ini in each directory is set to use the KenLM language model toolkit by default. If you prefer to use another LM toolkit, edit the language model entry in moses.ini. See the Moses documentation for more details.
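For reference, in the Moses releases of this era the entry is a [lmodel-file] line of the form <type> <factor> <order> <path>, where the type code selects the toolkit (0 = SRILM, 1 = IRSTLM, 8 = KenLM). A sketch of what it might look like, with an illustrative path:

 [lmodel-file]
 8 0 3 lm/europarl.srilm.gz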

Run the decoder

 cd ~/mosesdecoder/sample-models
 ~/mosesdecoder/bin/moses -f phrase-model/moses.ini < phrase-model/in > out

If everything worked out right, this should translate the sentence das ist ein kleines haus (in the file in) as this is a small house (in the file out).

PS. Details can be found here.

2. Installing GIZA++

GIZA++ is hosted at Google Code, and a mirror of the original documentation can be found here. I recommend that you download the latest version from Google Code – I’m using 1.0.7 in this guide so I downloaded and built it with the following commands (issued in my home directory):

 wget http://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz
 tar xzvf giza-pp-v1.0.7.tar.gz
 cd giza-pp
 make

This should create the binaries ~/giza-pp/GIZA++-v2/GIZA++, ~/giza-pp/GIZA++-v2/snt2cooc.out and ~/giza-pp/mkcls-v2/mkcls. These need to be copied somewhere that Moses can find them, as follows:

 cd ~/mosesdecoder
 mkdir tools
 cp ~/giza-pp/GIZA++-v2/GIZA++ ~/giza-pp/GIZA++-v2/snt2cooc.out ~/giza-pp/mkcls-v2/mkcls tools

When you come to run the training, you need to tell the training script where GIZA++ was installed, using the -external-bin-dir argument:

   train-model.perl -external-bin-dir $HOME/mosesdecoder/tools

3. Installing IRSTLM (because SRILM requires a paid license for commercial use)

IRSTLM is a language modelling toolkit from FBK, and is hosted on SourceForge. Again, you should download the latest version. I used version 5.80.01 for this guide, so assuming you downloaded the tarball into your home directory (and making the obvious changes if you download a later version), the following commands should build and install IRSTLM:

 tar zxvf irstlm-5.80.01.tgz
 cd irstlm-5.80.01
 ./regenerate-makefiles.sh
 ./configure --prefix=$HOME/irstlm
 make install

You should now have several binaries and scripts in ~/irstlm/bin, in particular build-lm.sh.
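Optionally, put these on your PATH for the rest of the session so you can call them without the full path:

 export PATH=$HOME/irstlm/bin:$PATH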

Corpus Preparation

To train a translation system we need parallel data (text in two different languages) which is aligned at the sentence level. For this experiment we used a Korean-Chinese corpus from CSLI containing 300,000 parallel sentence pairs. We placed it as follows:

 cd
 mkdir ~/corpus

We created a directory named corpus and placed two files, seg.ko and seg.zh, in it (they are already segmented).
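One sanity check worth running before training: both sides of a sentence-aligned corpus must contain exactly the same number of lines, so the line counts below should match.

 wc -l ~/corpus/seg.ko ~/corpus/seg.zh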

Language Model Training

The language model (LM) is used to ensure fluent output, so it is built with the target language (Chinese, in our Korean-to-Chinese setup). The IRSTLM documentation gives a full explanation of the command-line options, but the following will build an appropriate 3-gram language model, removing singletons, smoothing with improved Kneser-Ney, and adding sentence boundary symbols:

 mkdir ~/lm   # this directory can be anywhere
 cd ~/lm
 ~/irstlm/bin/add-start-end.sh < ~/corpus/seg.zh > seg.sb.zh
 export IRSTLM=$HOME/irstlm; ~/irstlm/bin/build-lm.sh -i seg.sb.zh -t ./tmp -p \
        -s improved-kneser-ney -o seg.lm.zh
 ~/irstlm/bin/compile-lm --text yes seg.lm.zh.gz seg.arpa.zh

This should give a language model in the seg.arpa.zh file, which we'll then binarise (for faster loading) using KenLM:

 ~/mosesdecoder/bin/build_binary  seg.arpa.zh  seg.blm.zh
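As a quick check that the binarised model loads and scores text, you can pipe a segmented sentence through KenLM's query tool, which is built alongside Moses (the sentence below is just an illustration):

 echo "这 是 一 个 测试" | ~/mosesdecoder/bin/query seg.blm.zh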

The KR-CH language model can be found at jiafei427@cala:~/mosesdecoder/ckLm, and we put the commands above into a shell script called mtTarCH.sh. Run it by typing sh mtTarCH.sh in the console.

Training the Translation System

Finally we come to the main event – training the translation model. This will run word alignment (using GIZA++), phrase extraction and scoring, create the lexicalised reordering tables and create your Moses configuration file, all with a single command. I recommend that you create an appropriate directory as follows, and then run the training command, catching logs:

 mkdir ~/working
 cd ~/working
 nohup nice ~/mosesdecoder/scripts/training/train-model.perl -root-dir train -corpus ~/corpus/seg \
   -f ko -e zh -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
   -lm 0:3:$HOME/lm/seg.blm.zh:8 -external-bin-dir ~/mosesdecoder/tools >& training.out &

If you have a multi-core machine it’s worth using the -cores argument to encourage as much parallelisation as possible.
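Training takes a while. Once it finishes, it's worth a quick smoke test of the untuned model before tuning: train-model.perl writes its configuration to train/model/moses.ini, so you can run the decoder interactively against it.

 cd ~/working
 ~/mosesdecoder/bin/moses -f train/model/moses.ini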

Tuning

This is the slowest part of the process, so you might want to line up something to read whilst it's progressing. Tuning requires a small amount of parallel data, separate from the training data, so we placed another parallel data set in the ~/corpus directory: seg.dev.ko and seg.dev.zh (also segmented). While the system is tuning, you can play some games such as LoL, Warcraft or maybe Dragon Flight, because this step will burn a lot of your precious time.

Now go back to the directory we used for training, and launch the tuning process:

 cd ~/working
 nohup nice ~/mosesdecoder/scripts/training/mert-moses.pl ~/corpus/seg.dev.ko ~/corpus/seg.dev.zh --decoder-flags="-threads 4" \
  ~/mosesdecoder/bin/moses  train/model/moses.ini --mertdir ~/mosesdecoder/bin/ &> mert.out &

If you have several cores at your disposal, it's a lot faster to run Moses multi-threaded; the --decoder-flags="-threads 4" option in the command above runs the decoder with 4 threads. With this setting, tuning took about 2 hours for me. The end result of tuning is an ini file with trained weights, which should be in ~/working/mert-work/moses.ini if you've used the same directory structure as me.

Testing

You can now run Moses with

 ~/mosesdecoder/bin/moses -f ~/working/mert-work/moses.ini

and type in your favorite Korean sentence to test the system. AND~! BAM!!! Your sentence will be translated like a CHARM~!!
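If you'd rather translate a whole file than type sentences interactively, the decoder reads stdin and writes stdout just as in the sample-models run earlier (seg.test.ko here is a hypothetical held-out test file, segmented the same way as the training data):

 ~/mosesdecoder/bin/moses -f ~/working/mert-work/moses.ini \
    < ~/corpus/seg.test.ko > ~/working/out.zh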

Appendix. [Korean Parser] We created a rough morphological-level Korean segmenter with the JHanNanum Korean Morphological Analyzer. Usage: there is a file named krSeg.jar and folders called conf & data. Put them in the same folder. Write the Korean sentences you want to split and save the file as sent.ko.

You can now run it with

 java -jar krSeg.jar

Then BAM! A file seg.ko, with the segmentation already done, will appear in the same directory.
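A minimal end-to-end run of the segmenter looks like this (the example sentence is just an illustration):

 echo "나는 학교에 간다" > sent.ko
 java -jar krSeg.jar
 cat seg.ko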
