Suppose that ZPar has been downloaded to the directory zpar. To build the joint Chinese word segmentor and POS tagger, type make chinese.postagger. This creates the directory zpar/dist/chinese.postagger, which contains two files: train and tagger. The file train trains a joint model of Chinese word segmentation and POS tagging, and the file tagger uses a trained joint model to segment and assign POS tags to new texts.
The input files to the tagger are formatted as a sequence of Chinese characters. An
example input is:
ZPar可以分析中文和英文
The output files contain space-separated words, each followed by an underscore and its POS tag:
ZPar_NN 可以_VV 分析_VV 中文_NN 和_CC 英文_NN
The output format is also the format of training files for the train executable.
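The word_TAG format above can be read back with a few lines of Python. The following is a minimal sketch, not part of ZPar; the function names parse_tagged_line and to_raw_input are hypothetical. It assumes that the tag is whatever follows the last underscore of each token, so that a word which itself contains an underscore is still handled.

```python
def parse_tagged_line(line):
    """Split one line in word_TAG format into a list of (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Assumption: the tag follows the LAST underscore of the token.
        word, sep, tag = token.rpartition("_")
        if not sep:
            raise ValueError(f"token without a POS tag: {token!r}")
        pairs.append((word, tag))
    return pairs


def to_raw_input(line):
    """Recover the unsegmented character sequence (the tagger input format)."""
    return "".join(word for word, _ in parse_tagged_line(line))


example = "ZPar_NN 可以_VV 分析_VV 中文_NN 和_CC 英文_NN"
pairs = parse_tagged_line(example)
raw = to_raw_input(example)
```

Stripping the tags and spaces in this way turns a training or reference file back into a valid input file for the tagger.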
Both input and output files must be encoded in UTF-8. Here is a script that converts files
encoded in GB to UTF-8.
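If the linked conversion script is not at hand, the same conversion can be sketched in Python. This is an illustrative stand-in, not the script distributed with ZPar; the function name convert_gb_to_utf8 is hypothetical. It assumes the source files can be decoded with the gb18030 codec, which is backward compatible with GB2312 and GBK.

```python
def convert_gb_to_utf8(src_path, dst_path, src_encoding="gb18030"):
    """Re-encode a GB-encoded text file as UTF-8, line by line.

    Assumption: gb18030 is used as the source codec because it also
    decodes GB2312 and GBK files. Pass another codec name if needed.
    """
    # newline="" keeps the original line endings unchanged.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
```

For example, convert_gb_to_utf8("input.gb.txt", "input.txt") would write a UTF-8 copy that the tagger can read; both file names here are placeholders.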
To train a model, use
zpar/dist/chinese.postagger/train <train-file> <model-file> <number of iterations>
For example, using the example train file, you can train a model by
zpar/dist/chinese.postagger/train train.txt model 1
After training completes, a new file named model is created in the current directory;
it can be used for joint segmentation and POS tagging of Chinese text. The above
command performs training with one iteration (see Section 6) using the training
file.
To apply an existing model to new texts for joint segmentation and POS tagging, use
zpar/dist/chinese.postagger/tagger <model> [<input-file>] [<output-file>]
where the input file and output file are optional. If the output file is not specified,
segmented and POS-tagged texts will be printed to the console. If the input file
is not specified, raw texts will be read from the console. For example, using
the model we just trained, we can segment and POS-tag an example input by
zpar/dist/chinese.postagger/tagger model input.txt output.txt
The output file contains automatically segmented and POS-tagged texts.
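When the tagger is called from another program, the command line above can be assembled and run with Python's subprocess module. The sketch below is only an illustration; tagger_command and tag_text are hypothetical helper names. It assumes that the optional arguments are positional (so an output file cannot be given without an input file) and that, in console mode, the tagged text arrives on standard output; any extra messages the tagger prints there would need to be filtered out by the caller.

```python
import subprocess


def tagger_command(zpar_dir, model, input_file=None, output_file=None):
    """Build the argument list: tagger <model> [<input-file>] [<output-file>]."""
    if output_file is not None and input_file is None:
        # Assumption: arguments are positional, so the output file
        # can only follow an input file.
        raise ValueError("an output file requires an input file")
    cmd = [f"{zpar_dir}/tagger", str(model)]
    if input_file is not None:
        cmd.append(str(input_file))
    if output_file is not None:
        cmd.append(str(output_file))
    return cmd


def tag_text(zpar_dir, model, text):
    """Pipe raw UTF-8 text through the tagger and return what it prints."""
    result = subprocess.run(
        tagger_command(zpar_dir, model),
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the tagger fails
    )
    return result.stdout.decode("utf-8")
```

Calling tagger_command("zpar/dist/chinese.postagger", "model", "input.txt", "output.txt") reproduces the example command shown above as a list of arguments.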
Automatically segmented and POS-tagged texts may contain errors. To
evaluate the quality of the outputs, we can manually specify the segmentation
and POS tags of a sample, and compare the outputs with this correct sample.
A manually specified segmentation and POS tagging of the input file is given in this
example reference file. Here is a Python script that performs automatic evaluation.
Using the above output.txt and reference.txt, we can evaluate the accuracies by typing
python evaluate.py output.txt reference.txt
You can find the precision, recall, and f-score here. See the explanation of these
measures on Wikipedia.
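To make the three measures concrete, here is a minimal sketch of one common way to score joint segmentation and POS tagging. It is not the evaluate.py script linked above, and the function names spans and evaluate are hypothetical. It assumes that a word counts as correct only when its character boundaries and its tag both match the reference; passing with_tags=False scores the segmentation alone.

```python
def spans(line, with_tags=True):
    """Return the set of (start, end, tag) character spans of one sentence."""
    result = set()
    start = 0
    for token in line.split():
        word, sep, tag = token.rpartition("_")
        if not sep:  # token without a tag: treat the whole token as the word
            word, tag = token, ""
        end = start + len(word)
        result.add((start, end, tag if with_tags else None))
        start = end
    return result


def evaluate(output_lines, reference_lines, with_tags=True):
    """Return (precision, recall, f_score) over paired sentences."""
    if len(output_lines) != len(reference_lines):
        raise ValueError("output and reference differ in number of sentences")
    correct = n_out = n_ref = 0
    for out, ref in zip(output_lines, reference_lines):
        o, r = spans(out, with_tags), spans(ref, with_tags)
        correct += len(o & r)  # words with matching boundaries (and tag)
        n_out += len(o)
        n_ref += len(r)
    precision = correct / n_out if n_out else 0.0
    recall = correct / n_ref if n_ref else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score
```

For an output of 可以_VV 分析_NN against a reference of 可以_VV 分析_VV, one of the two words is fully correct, so the joint precision, recall, and F-score are all 0.5, while the segmentation-only scores are all 1.0.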
The performance of the system after one training iteration may not be optimal. You can train the model for a few more iterations, comparing the performance after each one, and choose the model that gives the highest f-score on your test data. We conventionally call this test file the development test data, because you use it to develop the joint segmentation and POS tagging model.
Here is a shell script that automatically trains the joint segmentor and POS tagger for 30 iterations and, after the ith iteration, stores the model file as model.i. You can compare the f-scores of all 30 iterations and choose model.k, the one that gives the best f-score, as the final model. In this file, there is a variable called zpar. You need to set this variable to the relative path of zpar/dist/chinese.postagger.
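The selection procedure can also be sketched in Python. This is an illustration only, not the linked shell script; best_iteration and train_and_select are hypothetical names, and score is a caller-supplied function that returns the f-score for an output file and a reference file. The sketch rests on one assumption that should be checked against your ZPar version: running train for one iteration on an existing model file continues training from that model.

```python
import shutil
import subprocess


def best_iteration(f_scores):
    """Return the iteration k with the highest f-score (earliest on ties)."""
    return max(sorted(f_scores), key=lambda k: f_scores[k])


def train_and_select(zpar, train_file, dev_input, dev_reference, score,
                     iterations=30, model="model"):
    """Train one iteration at a time, keeping model.i and its f-score.

    zpar is the path of zpar/dist/chinese.postagger.
    Assumption: calling train on an existing model file resumes training.
    """
    f_scores = {}
    for i in range(1, iterations + 1):
        subprocess.run([f"{zpar}/train", train_file, model, "1"], check=True)
        shutil.copy(model, f"{model}.{i}")  # snapshot after the ith iteration
        output = f"output.{i}"
        subprocess.run([f"{zpar}/tagger", model, dev_input, output], check=True)
        f_scores[i] = score(output, dev_reference)
    return best_iteration(f_scores), f_scores
```

Ties are resolved in favour of the earlier iteration, which is a design choice of this sketch rather than something the manual prescribes.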
The source code for the joint segmentor and POS tagger can be found at
zpar/src/chinese/tagger/implementation/CHINESE_TAGGER_IMPL
where CHINESE_TAGGER_IMPL is a macro defined in Makefile that selects a
particular implementation of the joint segmentor and POS tagger.
The Chinese POS-tagger by default performs segmentation and tagging simultaneously.
This means that if the input sentence has been segmented, the system will
resegment the sentence. There is one implementation that performs POS-tagging on
segmented sentences. The name of the implementation is segmented, and you
can compile this system by setting CHINESE_TAGGER_IMPL to segmented
in Makefile. The compilation, training, and usage are the same as the other
taggers.