Introduction to Baseline SMT (With Moses) (Part 2)

BY WZ & CK.D.LEE 2013.01.30

This document describes how our demo system is built on top of Moses.

1. A brief introduction to the demo system

URL for Chinese to Korean:  http://bertha.postech.ac.kr/smt_ck.html

URL for Korean to Chinese:  http://bertha.postech.ac.kr/smt_kc.html

Enter a source sentence; the system then outputs the segmented source sentence and its translation.

2. Overview of the demo system

1. Moses is installed and configured on the Moses server (cala). On cala, a daemon process implemented by daemon.pl accepts network connections on a given port, performs the translation, and sends back the result.

2. bertha is designated as the web server. It provides the user interface to client users. After the user types in the source sentence and submits the request, a form containing the source sentence is sent to the web server. The web server performs word segmentation on the source sentence and sends the preprocessed input to the Moses server (cala). After receiving the translated sentence from the Moses server and postprocessing it, the web server sends the result back to the user.

PS: Please refer to http://www.statmt.org/moses/?n=Moses.WebTranslation for more details. That document was originally designed for translating web pages, which includes extracting text from the pages. We only referred to its Moses server settings.

3. Detailed instructions

    1. Setting up the Moses server

Before starting, please make sure that you have finished training a translation model and can decode input sentences (Part 1 of this introduction). This part is based on the Chinese-to-Korean translation, under the path: @cala, /u1/wuzhen. The folder mosesdecoder is where Moses is installed. The folder work.ck is where the Chinese-Korean translation model is located.

Go to the folder ~/mosesdecoder/contrib/web/bin and find daemon.pl. This Perl script loads the translation model into memory and listens on the predefined port.

Before running this script, daemon.pl needs a minor modification:

At the 43rd line, the original code is: my $pid = open2 ($MOSES_OUT, $MOSES_IN, $MOSES, '-f', $MOSES_INI, '-t');

This line launches the decoder with the translation model. Remove the last parameter “-t” so that phrase segmentation is not reported in the output. Otherwise you will get output like: “저 는 집 에서 |0-3| 공부 하 ㅂ니다 |4-5| . |6-6| ”
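If editing daemon.pl is not an option, the same markers could instead be stripped on the web server side. The following is only a sketch of that alternative, not what our demo does; the function name strip_phrase_spans is introduced here for illustration, and it assumes the markers always have the form |start-end| as in the sample above.

```python
import re

# Matches one phrase-segmentation marker such as "|0-3|",
# together with any whitespace directly in front of it.
_SPAN_MARKER = re.compile(r"\s*\|\d+-\d+\|")


def strip_phrase_spans(line: str) -> str:
    """Remove "|start-end|" markers and normalise the remaining whitespace."""
    without_markers = _SPAN_MARKER.sub("", line)
    return " ".join(without_markers.split())


print(strip_phrase_spans("저 는 집 에서 |0-3| 공부 하 ㅂ니다 |4-5| . |6-6| "))
# prints: 저 는 집 에서 공부 하 ㅂ니다 .
```

Removing “-t” in daemon.pl remains the simpler fix, because the markers are then never produced in the first place.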

After that we can run the script with the following command:

 ./daemon.pl cala 3400 /u1/wuzhen/working.ck/mert-work/moses.ini  

Three arguments are required here: hostname, port, and moses.ini.

Hostname is the hostname of the Moses server you are using (found by issuing the hostname command).

The port parameter can be any number between 1024 and 49151, but it should not be a standard port for common programs and protocols. Otherwise it will interfere with those programs. You can check the /etc/services file to find which ports are assigned. While choosing the port, I first tried port 279, which is not occupied, but the Perl script could not use it, most likely because ports below 1024 are privileged on Unix-like systems and can only be bound by root. After trying several other ports, I found that port 3400 works. Port 3401 is then used for the Korean-Chinese Moses translation model.
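The port checks described above can be automated. The sketch below is one possible way to do it in Python and is not part of our setup; the function names are hypothetical, and it assumes the check is run on the Moses server itself. It looks the port up in the local services database (normally /etc/services) and then tries to bind it.

```python
import socket


def is_registered_service(port: int) -> bool:
    """Return True if the local services database names this TCP port."""
    try:
        socket.getservbyport(port, "tcp")
        return True
    except OSError:
        return False


def can_bind(port: int, host: str = "") -> bool:
    """Return True if a TCP socket can be bound to the port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
            return True
        except OSError:
            # Covers both "address already in use" and "permission denied".
            return False


def is_candidate_port(port: int) -> bool:
    """Apply the three checks from the text: range, not registered, bindable."""
    if not 1024 <= port <= 49151:
        return False
    return not is_registered_service(port) and can_bind(port)


print(is_candidate_port(279))  # prints: False (below 1024, rejected by the range check)
```

The result for a port such as 3400 depends on the machine, since the contents of /etc/services differ between distributions.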

The last parameter is the path to moses.ini, which specifies the translation model to use. Here we used the tuned Chinese-to-Korean translation model.

PS: In the referenced document, the author used only two arguments (hostname and port). After reading the source code, we found that the last parameter, the path to moses.ini, is necessary.

However, the previous command does not run the process in the background. To launch it in the background, use the following command:

 nohup ./daemon.pl cala 3400 /u1/wuzhen/working.ck/mert-work/moses.ini  &

In this case, the process runs in the background. The program's output can be found in the file nohup.out under the same path. I have written a shell script named run_ckmaster.sh for this command under the path: @cala, /u1/wuzhen/mosesdecoder/contrib/web/bin

At this point, the Moses decoder is set up. To test whether it works, type the following command on the web server (bertha). (Make sure that netcat is installed on the web server.)

 echo "你好" | nc -i 1 cala 3400

This command sends the message “你好” to port 3400 of cala. The interval is set to 1 second. If the Moses server works, you will receive the translated result “안녕 하 세 요 . ” from the Moses server (cala).

PS: The parameter “-i” was not used in the referenced document. Without it, we cannot receive the translation result, because the netcat process stops before the result arrives.
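The same test can be written in Python, which avoids the timing issue with netcat because the client simply waits until a full line has arrived. This is a minimal sketch that assumes the daemon answers each input line with one line of output; translate_line and its timeout parameter are names introduced here for illustration and are not part of Moses.

```python
import socket


def translate_line(sentence: str, host: str, port: int, timeout: float = 10.0) -> str:
    """Send one line to the daemon and return the first line it sends back."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        # The daemon is assumed to read newline-terminated UTF-8 input.
        conn.sendall(sentence.strip().encode("utf-8") + b"\n")
        received = bytearray()
        # Keep reading until a complete line has arrived or the peer closes.
        while b"\n" not in received:
            chunk = conn.recv(4096)
            if not chunk:
                break
            received.extend(chunk)
    first_line = bytes(received).split(b"\n", 1)[0]
    return first_line.decode("utf-8").strip()
```

On bertha, calling translate_line("你好", "cala", 3400) would play the role of the netcat command above.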

    2. Setting up the Web server

bertha is used as the web server in our system. I will first give an introduction based on our Chinese-to-Korean demo system.

As you can see, the user interface is available at the URL http://bertha.postech.ac.kr/smt_ck.html. The file smt_ck.html is actually stored under the path: /usr/local/apache2/htdocs. All HTML documents served by Apache on bertha are stored in the htdocs folder. After the “Translate” button on the web page is clicked, the script “smt_ck.sh” under the path “/usr/local/apache2/cgi-bin” is executed.

“smt_ck.sh” first gets the form from the web page and stores it (using get_form_cksmt.py). The input source sentence is stored in the file “forminput” under the path: /bertha03/wuzhen/smt_ck. All users have been given permission for the folder smt_ck. In the subdirectory “seg”, an in-house Chinese segmenter written by Wu Zhen segments the input sentence and stores the result in the file “seg.zh.tmp”. The segmented sentence is then sent to cala, and the translated result received from cala is stored in the file “translated”. We use the command “cat” to read the results from these files and print them to the returned web page.
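The flow of files described above can be summarised in a few lines of Python. This is only an illustrative sketch, not the actual smt_ck.sh: the function run_demo_pipeline is hypothetical, the segmenter and the call to cala are passed in as plain functions, and placing seg.zh.tmp inside the seg subdirectory is an assumption.

```python
from pathlib import Path
from typing import Callable, Tuple


def run_demo_pipeline(
    work_dir: Path,
    segment: Callable[[str], str],
    translate: Callable[[str], str],
) -> Tuple[str, str]:
    """Read forminput, segment it, translate it, and store both results."""
    # 1. The sentence submitted through the web form.
    source = (work_dir / "forminput").read_text(encoding="utf-8").strip()

    # 2. Word segmentation; the result is kept in seg.zh.tmp.
    segmented = segment(source)
    seg_dir = work_dir / "seg"
    seg_dir.mkdir(exist_ok=True)
    (seg_dir / "seg.zh.tmp").write_text(segmented + "\n", encoding="utf-8")

    # 3. Translation on the Moses server; the result is kept in "translated".
    translated = translate(segmented)
    (work_dir / "translated").write_text(translated + "\n", encoding="utf-8")

    # 4. Both strings are what the returned web page displays.
    return segmented, translated
```

Passing the two steps in as functions keeps the sketch independent of how the segmenter and the network call are actually implemented.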

PS:

1. Modifying the files in the Apache folder requires root permission.

2. When training our system, the Stanford Chinese tagger was used to segment the Chinese training corpus and development corpus. In the demo system, however, we found that the Stanford tagger takes a long time to load its dictionary into memory, which cannot be tolerated in a demo. So we switched to Wu Zhen’s Chinese segmenter. Both segmenters are trained on CTB6 and their performance is similar. The Stanford segmenter is written in Java, whereas ours is written in Python and calls CRF++ commands in its code. So to run our segmentation system, CRF++ must be installed on the web server.

As for the Korean-to-Chinese demo system, we also stored the web page under /usr/local/apache2/htdocs, with the name “smt_kc.html”. In the cgi-bin folder, a shell script “smt_kc.sh” is executed. Preprocessing and postprocessing are executed in the path: /bertha03/wuzhen/smt_kc. A Korean word segmenter is applied to the input. “rm_space.py” is used to remove the white spaces from the translated Chinese sentence.
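The postprocessing step is small enough to show in full. The contents of rm_space.py are not reproduced in this post, so the following is only a guess at what such a script could look like; the function name remove_spaces is hypothetical.

```python
def remove_spaces(sentence: str) -> str:
    """Delete every whitespace character from the decoder output."""
    # str.split() with no argument splits on any run of whitespace,
    # so joining the pieces drops spaces, tabs and newlines alike.
    return "".join(sentence.split())


print(remove_spaces("我 在 家 学习 。"))
# prints: 我在家学习。
```

Note that this also joins any Latin words or numbers in the output, which may or may not be what the real script does.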
