Suppose that ZPar has been downloaded to the directory zpar. To make a phrase-structure parsing system for English, type make english.conparser. This will create a directory zpar/dist/english.conparser containing two files: train and conparser. The file train is used to train a parsing model, and the file conparser is used to parse new texts with a trained parsing model. Similarly, we can make a phrase-structure parsing system for Chinese by typing make chinese.conparser; the train and conparser files are then created under zpar/dist/chinese.conparser. Note that the English and Chinese parsers are designed specifically for the Penn Treebanks.
The input file to the train executable contains a set of parse trees, one for each line. An
example parse tree is as follows:
( S r ( NP r ( NNP t Ms. ) ( NNP t Haag ) ) ( S l* ( VP l ( VBZ t plays ) ( NP s ( NNP t Elianti ) ) ) ( . t . ) ) )
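To make the tree format more concrete, the following Python sketch pulls the words and POS tags out of one tree line. It assumes, based only on the example above, that a terminal node is written as ( TAG t WORD ) with whitespace-separated tokens; the function name terminals is hypothetical and is not part of ZPar.

```python
def terminals(tree_line):
    """Return (word, tag) pairs for the terminal nodes of one ZPar-format tree.

    Assumption (inferred from the example tree, not from ZPar's source):
    a terminal node is the five-token sequence "( TAG t WORD )".
    """
    tokens = tree_line.split()
    pairs = []
    for i in range(len(tokens) - 4):
        if tokens[i] == "(" and tokens[i + 2] == "t" and tokens[i + 4] == ")":
            pairs.append((tokens[i + 3], tokens[i + 1]))
    return pairs


example = (
    "( S r ( NP r ( NNP t Ms. ) ( NNP t Haag ) ) ( S l* ( VP l ( VBZ t "
    "plays ) ( NP s ( NNP t Elianti ) ) ) ( . t . ) ) )"
)
pairs = terminals(example)
```

Joining each pair as word/TAG would give a POS-tagged sentence for English, which can be handy for checking that a training file and a parser input file use the same tokenisation.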
The format is different from the original format used in Penn Treebanks. Here is a
Python script to convert the original Penn Treebank format to the ZPar format. The
usage is
python binarize.py <rule-file> <input-file>
Here rule-file is a file containing head-finding rules (see the example rules for the Penn
Chinese Treebank), and the conversion results are printed to the console. Note that,
for Chinese, the input-file given to binarize.py should be encoded in gb, and the
output will be encoded in utf8. Here is a script that converts files encoded in gb
to the utf8 encoding.
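If the conversion script is not at hand, the same re-encoding can be sketched in a few lines of Python. This is not the script referred to above. It assumes that "gb" means an encoding of the GB family and uses Python's gb18030 codec, which also decodes GB2312 and GBK text; the function name gb_to_utf8 is hypothetical.

```python
def gb_to_utf8(src_path, dst_path, src_encoding="gb18030"):
    """Copy a text file, re-encoding it from a GB-family encoding to UTF-8.

    Assumption: the source file is GB2312, GBK, or GB18030 text.
    newline="" keeps the original line endings untouched.
    """
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
```

A call such as gb_to_utf8("trees.gb.txt", "trees.utf8.txt") would write the UTF-8 copy; both file names here are placeholders.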
The input file to the conparser contains POS-tagged sentences. The formats for English
and Chinese are different.
English:
Ms./NNP Haag/NNP plays/VBZ Elianti/NNP
Chinese:
ZPar_NR 可以_MD 分析_VV 中文_NN 和_CC 英文_NN
For Chinese, inputs to both train and conparser must be encoded in utf8.
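The two input formats differ only in the separator between word and tag: a slash for English and an underscore for Chinese. The sketch below reads one line of either format. It assumes that the tag follows the last separator in each token; the function name split_tagged is hypothetical.

```python
def split_tagged(line, sep):
    """Split a POS-tagged sentence into (word, tag) pairs.

    sep is "/" for the English format and "_" for the Chinese format.
    Assumption: the tag is whatever follows the LAST separator in a token,
    so a word that itself contains the separator is still handled.
    """
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition(sep)
        if not word or not tag:
            raise ValueError("token without a word or tag: %r" % token)
        pairs.append((word, tag))
    return pairs


english = split_tagged("Ms./NNP Haag/NNP plays/VBZ Elianti/NNP", "/")
chinese = split_tagged("ZPar_NR 可以_MD 分析_VV 中文_NN 和_CC 英文_NN", "_")
```

Running such a check over an input file before parsing catches untagged tokens early, instead of leaving them for the parser to trip over.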
To train a model, use
zpar/dist/english.conparser/train <train-file> <model-file> <number of iterations>
For example, using the example train file, you can train a model by
zpar/dist/english.conparser/train train.txt model 1
After training is completed, a new file model will be created in the current directory,
which can be used to parse POS-tagged sentences. The above command performs
training with one iteration (see Section 6) using the training file. The commands for
training Chinese parsing models are the same.
To apply an existing model to parse new texts, use
zpar/dist/english.conparser/conparser <input-file> <output-file> <model>
For example, using the model we just trained, we can parse an example input by
zpar/dist/english.conparser/conparser input.txt output.txt model
The output file contains automatically parsed trees. The commands for parsing Chinese
texts are the same.
To evaluate the quality of the outputs, we can manually specify the gold parse
trees of a sample and compare the outputs against them. Manually specified parse
trees of the input file are given in this example reference file. Refer to evalb
to obtain software that performs automatic evaluation.
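Using the above output.txt and reference.txt, we can evaluate the accuracies by typing
./evalb -p <config.file> reference.txt output.txt
Here config.file sets the running parameters of the evaluation; COLLINS.prm is a widely
used configuration file. Note that evalb takes the gold file first and the test file
second; swapping them exchanges the reported precision and recall, although the f-score
stays the same. Evaluation results will be printed to the console.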
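For readers unfamiliar with the scores involved, the sketch below shows the standard definitions of bracketing precision, recall and f-score computed from bracket counts. It does not reproduce evalb, which computes these figures itself; the function name and argument names are hypothetical.

```python
def bracket_scores(matched, gold_total, test_total):
    """Return (precision, recall, f_score) from bracket counts.

    matched:    brackets in the parser output that also occur in the gold trees
    gold_total: number of brackets in the gold trees
    test_total: number of brackets in the parser output
    """
    precision = matched / test_total if test_total else 0.0
    recall = matched / gold_total if gold_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

For instance, an output with 8 brackets, all of them correct, against 10 gold brackets has precision 1.0 and recall 0.8. The f-score is the harmonic mean of the two.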
The performance of the system after one training iteration may not be optimal. You can train the model for a few more iterations, comparing the performance after each one, and choose the model that gives the highest f-score on your test data. We conventionally call this test file the development test data, because you use it to develop a parsing model. Here is a shell script that automatically trains the parser for 30 iterations and, after the ith iteration, stores the model file as model.i. You can compare the f-scores of all 30 iterations and choose model.k, the one that gives the best f-score, as the final model. In this script, there is a variable called parser. You need to set this variable to the relative path of zpar/dist/english.conparser or zpar/dist/chinese.conparser.
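The iterate-and-snapshot procedure described above can also be sketched in Python. This is not the shell script mentioned in the text. It assumes that running train with an existing model file continues training from that model, which the one-iteration-at-a-time procedure implies but which should be checked against your ZPar version. The function name and the run parameter, which exists only so the loop can be exercised without the real binary, are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def train_iteratively(parser_dir, train_file, model_file="model",
                      iterations=30, run=subprocess.run):
    """Train one iteration at a time, saving a copy of the model after each.

    parser_dir plays the role of the "parser" variable in the text, e.g.
    zpar/dist/english.conparser. After iteration i the current model is
    copied to <model_file>.<i>.
    Assumption: train resumes from model_file when that file already exists.
    """
    train_bin = str(Path(parser_dir) / "train")
    for i in range(1, iterations + 1):
        # Same argument order as in the text: <train-file> <model-file> <iterations>
        run([train_bin, train_file, model_file, "1"], check=True)
        shutil.copyfile(model_file, "%s.%d" % (model_file, i))
```

Calling train_iteratively("zpar/dist/english.conparser", "train.txt") would leave model.1 to model.30 in the current directory. Each of them can then be used to parse the development test data and be scored with evalb.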
The source code for the English phrase-structure parser can be found at
zpar/src/common/conparser/implementation/ENGLISH_CONPARSER_IMPL
where ENGLISH_CONPARSER_IMPL is a macro defined in Makefile, and
specifies a specific implementation for the English phrase-structure parser.
The source code for the Chinese phrase-structure parser can be found at
zpar/src/common/conparser/implementation/CHINESE_CONPARSER_IMPL
where CHINESE_CONPARSER_IMPL is a macro defined in Makefile, and specifies a
specific implementation for the Chinese phrase-structure parser.