$Id: index.html 1574 2007-01-26 11:59:13Z taku $;
BACT is a machine learning tool for labeled ordered trees [Kudo & Matsumoto 2004]. Its key characteristic is that each input example x is represented not as a numerical feature vector (bag-of-words) but as a labeled ordered tree.
% make
% make test
Both the training file and the test file must be in a particular format for BACT to work properly. Each line of the input file denotes one tree instance, written as a strict S-expression. Strict means that all nodes, even leaf nodes, must be bracketed. For example, (c(a b)) should be written as (c(a)(b)).
The first column denotes the class label (+1 or -1).
Here is an example of such data.
+1 (S(NP(I))(VP(saw)(NP(a)(girl))(PP(with)(NP(a)(telescope))))(.))
+1 (S(NP(He))(VP(saw)(NP(the)(boy))(PP(with)(NP(this)(camera))))(.))
-1 (S(NP(I))(VP(go)(PP(to)(NP(this)(hotel))))(.))
-1 (S(NP(She))(VP(finds)(NP(a)(mistake))(PP(in)(this)(paper)))(.))
The following are all invalid, since they are not strict S-expressions of labeled ordered trees (some nodes are not bracketed).
+1 (a b c d)
-1 (a (b c d))
+1 (a (b)(c d))
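The strictness rule above can be checked mechanically before training. The following is a minimal Python sketch, not part of BACT: the names `is_strict_sexp` and `check_line` are hypothetical, and it assumes that whitespace between brackets is harmless and that labels contain neither whitespace nor parentheses.

```python
import re

# Tokens: "(", ")", or a run of characters that are neither whitespace nor parentheses.
_TOKEN = re.compile(r"[()]|[^\s()]+")


def is_strict_sexp(text):
    """Return True if text is a single tree in which every node is bracketed."""
    depth = 0      # current nesting depth
    roots = 0      # number of top-level trees seen
    prev = None    # previous token
    for tok in _TOKEN.findall(text):
        if tok == "(":
            if prev == "(":
                return False          # "((": a node without a label
            if depth == 0:
                roots += 1
            depth += 1
        elif tok == ")":
            if prev == "(" or depth == 0:
                return False          # "()" or an unbalanced ")"
            depth -= 1
        elif prev != "(":
            return False              # bare label, e.g. the "b" in "(a b)"
        prev = tok
    return depth == 0 and roots == 1


def check_line(line):
    """Check one input line: a class label (+1 or -1) followed by a strict tree."""
    parts = line.split(None, 1)
    return len(parts) == 2 and parts[0] in ("+1", "-1") and is_strict_sexp(parts[1])
```

Running `check_line` over every line of a training file is a cheap way to catch unbracketed leaves such as the invalid examples above. What bact_learn itself does with malformed input is not covered by this sketch.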
See med.(train|test) and jp.(train|test) as sample files.
./bact_learn [options] training-file model-file
where training-file is a file written in the format described in the previous section, and model-file is a text model file which will be generated by bact_learn.
There are some options to control the behavior of bact_learn.
bact_learn outputs the following message during its training phase:
105 84 1/16861/17117 0.20287958 -0.04862460 -0.12681684 cells
106 85 3/16602/16850 0.20091623 -0.05129108 0.12543952 effect on
a   b  c/d    /e     f          g           h           i
For efficient classification, model files generated by bact_learn should be converted into binary files:
./bact_mkmodel -i model-file -o model-file.bin
where model-file is a text model file generated by bact_learn and model-file.bin is a binary model file used in the actual classification.
If you want to obtain a list of support features, use the -O option:
./bact_mkmodel -i model-file -o model-file.bin -O model-file.O
where model-file.O contains pairs of a feature (subtree) and its weight. All subtrees are written in a string encoding (the details of this encoding are described below). The first line gives the default weight.
You can control the number of iterations by the -T option:
./bact_mkmodel -T 100 -i model-file -o model-file.bin -O model-file.O
It is useful to control this parameter (the number of boosting iterations) in this phase rather than in the training phase.
Use bact_classify:
./bact_classify test-file model-file.bin
where test-file is the test data, written in the same format as the training data, and model-file.bin is a binary model generated by bact_mkmodel. bact_classify prints the accuracy, precision, recall, and F-measure for the given test file.
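The three steps described so far (training, model conversion, classification) can be chained from a script. The following Python sketch only assembles the command lines shown in this document. The function names and the file name med.model are hypothetical, and it assumes the binaries are in the current directory.

```python
import subprocess


def build_commands(train_file, test_file, model_file, iterations=None):
    """Return the argv lists for bact_learn, bact_mkmodel and bact_classify."""
    binary_model = model_file + ".bin"
    mkmodel = ["./bact_mkmodel"]
    if iterations is not None:
        # -T limits the number of boosting iterations at conversion time.
        mkmodel += ["-T", str(iterations)]
    mkmodel += ["-i", model_file, "-o", binary_model]
    return [
        ["./bact_learn", train_file, model_file],
        mkmodel,
        ["./bact_classify", test_file, binary_model],
    ]


def run_pipeline(train_file, test_file, model_file, iterations=None):
    """Run the three commands in order, stopping at the first failure."""
    for cmd in build_commands(train_file, test_file, model_file, iterations):
        subprocess.run(cmd, check=True)
```

For example, `run_pipeline("med.train", "med.test", "med.model", iterations=100)` would train on the sample data, build med.model.bin with 100 iterations, and print the evaluation for med.test.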
For a more detailed analysis of the test data, use the -v option. There are three verbosity levels:
./bact_classify -v1 test-file model-file.bin
-1 -0.0123482
1 0.0110549
-1 -0.0147466
-1 -0.0289773
...
The first column is the true label and the second column is the system output.
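The two columns of the -v1 output are enough to recompute the evaluation measures yourself, for instance to try a different decision threshold. The following Python sketch assumes that a score above the threshold (0 by default) counts as a +1 prediction. This is an assumption about how the score should be read, not something stated by BACT, and the function name `evaluate` is hypothetical.

```python
def evaluate(lines, threshold=0.0):
    """Compute accuracy, precision, recall and F-measure from '-v1' style lines."""
    tp = fp = tn = fn = 0
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            gold = int(parts[0])       # true label: +1 / 1 or -1
            score = float(parts[1])    # system output
        except ValueError:
            continue                   # skip anything that is not "label score"
        predicted_positive = score > threshold   # assumption: sign decides the class
        if gold > 0:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }
```

Lines that do not start with an integer label and a numeric score are ignored, so the output of bact_classify can be passed in without filtering.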
./bact_classify -v2 test-file model-file.bin
-1 -0.0123482 (S1 (SBARQ (WHNP ....
1 0.0110549 (S1 (SBAR (WHNP ..
./bact_classify -v3 test-file model-file.bin
<instance>
-1 -0.0123482 (S1 (SBARQ (WHNP (~WP (!What)))..
rule: 0.001 __DEFAULT__
rule: 0.00827753 WHNP
rule: 0.00439396 VP
rule: 0.00404992 ~WP
rule: 0.00265146 !?
..
</instance>
__DEFAULT__ is the default rule. Each feature (subtree) is written in a string encoding. In this way you can obtain the set of features (subtrees) applied in each classification.
BACT uses a string encoding to represent a labeled ordered tree. The nodes are enumerated in pre-order (depth-first). Whenever the enumeration moves back up the tree, the special symbol ")" is inserted once per level, so sibling nodes A and B are written as "A ) B". As the examples show, trailing ")" symbols at the end of the string are omitted. Here is an example of the string encoding:
A B C -> (A(B(C)))
A B ) C -> (A(B)(C))
A B C ) ) D -> (A(B(C))(D))
All subtrees (features) are written in this format.
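To read the features in model-file.O or in the -v3 output more easily, the string encoding can be converted back to an S-expression. The following Python sketch is inferred from the examples above and is not BACT's own code: it treats each ")" as "go up one level" and assumes trailing ")" symbols are omitted. The names `decode` and `encode` are hypothetical, and a label that is literally ")" is not supported.

```python
import re

# Tokens of a strict S-expression: "(", ")", or a label.
_TOKEN = re.compile(r"[()]|[^\s()]+")


def decode(encoded):
    """Turn a string encoding such as 'A B ) C' into '(A(B)(C))'."""
    out = []
    depth = 0
    for tok in encoded.split():
        if tok == ")":
            if depth == 0:
                raise ValueError("')' with no open node")
            out.append(")")       # close the current node, go up one level
            depth -= 1
        else:
            out.append("(" + tok)  # open a new node below the current one
            depth += 1
    out.append(")" * depth)        # restore the omitted trailing ")"
    return "".join(out)


def encode(sexp):
    """Turn a strict S-expression such as '(A(B)(C))' into 'A B ) C'."""
    out = [tok for tok in _TOKEN.findall(sexp) if tok != "("]
    while out and out[-1] == ")":
        out.pop()                  # trailing ")" carry no information
    return " ".join(out)
```

With these two helpers, `decode("A B C ) ) D")` gives `(A(B(C))(D))`, and `encode` maps it back to the original string.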
taku-ku@is.naist.jp