BACT: a Boosting Algorithm for Classification of Trees

$Id: index.html 1574 2007-01-26 11:59:13Z taku $;

Introduction

BACT is a machine learning tool for labeled ordered trees [Kudo & Matsumoto 2004]. Its key characteristic is that each input example x is represented not as a numerical feature vector (bag-of-words) but as a labeled ordered tree.

Author

Taku Kudo

Download

BACT is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License.
BACT is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
Source
- bact-0.13.tar.gz: HTTP

Installation

Requirements
- C++ compiler (gcc 2.95 or higher)
- POSIX getopt library.
How to make
```
% make
% make test 
```

Usage

Format of input data

Both the training file and the test file must be in a particular format for BACT to work properly. Each line of the input file denotes one tree instance, written as a strict S-expression. Strict means that all nodes, even leaf nodes, must be bracketed. For example, (c(a b)) should be written as (c(a)(b)).

The first column denotes the class label (+1 or -1).

Here is an example of such data.

+1 (S(NP(I))(VP(saw)(NP(a)(girl))(PP(with)(NP(a)(telescope))))(.))
+1 (S(NP(He))(VP(saw)(NP(the)(boy))(PP(with)(NP(this)(camera))))(.))
-1 (S(NP(I))(VP(go)(PP(to)(NP(this)(hotel))))(.))
-1 (S(NP(She))(VP(finds)(NP(a)(mistake))(PP(in)(this)(paper)))(.))

The following are all invalid, since they are not strict S-expressions of labeled ordered trees (some nodes are not bracketed).

+1 (a b c d)
-1 (a (b c d))
+1 (a (b)(c d))
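
The format rules above can be checked mechanically before training. The following is a minimal sketch, not part of BACT: `is_valid_line` and `TOKEN` are hypothetical names, and the check assumes that labels are written exactly as +1 or -1 and that node labels contain no whitespace or parentheses.

```python
import re

# Hypothetical helper, not shipped with BACT.
# Tokens are "(", ")" or a run of characters that is neither space nor paren.
TOKEN = re.compile(r"\(|\)|[^\s()]+")

def is_valid_line(line):
    """Return True if `line` is '<label> <strict S-expression>'."""
    parts = line.strip().split(None, 1)
    if len(parts) != 2 or parts[0] not in ("+1", "-1"):
        return False
    tokens = TOKEN.findall(parts[1])
    depth = 0
    prev = None
    for i, tok in enumerate(tokens):
        if tok == "(":
            depth += 1
        elif tok == ")":
            if prev == "(":          # "()" carries no node label
                return False
            depth -= 1
            if depth < 0:            # unbalanced closing bracket
                return False
            if depth == 0 and i != len(tokens) - 1:
                return False         # more than one root
        else:
            # In strict form every label directly follows its own "(".
            if prev != "(":
                return False
        prev = tok
    return depth == 0 and len(tokens) > 0
```

Under these assumptions the four sample lines above are accepted and the three invalid lines are rejected, because each of them contains a label that does not directly follow an opening bracket.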

See med.(train|test) and jp.(train|test) as sample files.

med.(train|test): MEDLINE sentence classification task where one has to classify sentences into BACKGROUND or CONCLUSIONS.
jp.(train|test): PHS REVIEW classification task where one has to classify sentences into POSITIVE reviews or NEGATIVE reviews. This is a part of the data used in [Kudo & Matsumoto 2004].

Training

./bact_learn [options] training-file model-file

where training-file is a file written in the format described in the previous section, and model-file is a text model file which will be generated by bact_learn.

There are some options to control the behavior of bact_learn.

-T NUM: set the maximum number of boosting iterations (default: 10000).
The number of iterations can also be chosen later, in the classification phase, as long as the chosen number is less than NUM. Treat this parameter as the MAXIMUM number of iterations.
-m NUM: set the feature cutoff threshold (default: 1).
Use features (subtrees) that occur at least NUM times in the training data.
-L NUM: set the maximum tree size (default: no restriction).
If training is slow, use this option to restrict the maximum size of trees (e.g., -L 5 or -L 6).
-p prob: set the probability of using the non-approximated algorithm (default: 1.0).
If training is slow, you can use an approximated algorithm through this option. In the approximated iterations, the optimal rule is selected from a cache that holds the rules explored in previous iterations. This is an approximation and does not give the optimal solution. The -p option controls the probability with which the non-approximated algorithm is invoked. With the default, 1.0, the approximated algorithm is NOT used. If prob is set to 0.1, the non-approximated algorithm is used with probability 0.1.
-s NUM: set the number of sub-iterations (default: 100).
This parameter should be used along with the -p option. The non-approximated algorithm is used in (NUM * prob) iterations.
-t (0|1): set the boosting algorithm (default: 0).
By default, Arc-GV is used. Use the "-t 1" option to employ AdaBoost.

bact_learn outputs messages like the following during its training phase:

105 84 1/16861/17117 0.20287958 -0.04862460 -0.12681684 cells
106 85 3/16602/16850 0.20091623 -0.05129108 0.12543952 effect on
a   b  c/  d  /  e        f          g          h       i

a: # of boosting iterations.
b: # of unique rules
c: # of times when optimal rules are rewritten
d: # of times when pruning conditions are satisfied
e: # of times when pruning conditions are checked
f: error rate
g: margin
h: weight (alpha) of the rule
i: rule (subtree) represented in a string encoding
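
To follow training progress programmatically, each log line can be split into the nine fields a to i listed above. This is a minimal sketch: the `LogLine` type and the `parse_log_line` function are hypothetical names, and the parser assumes the whitespace-separated layout of the two sample lines.

```python
from typing import NamedTuple

class LogLine(NamedTuple):
    """Fields a..i of one bact_learn log line (names are illustrative)."""
    iteration: int      # a: # of boosting iterations
    unique_rules: int   # b: # of unique rules
    rewrites: int       # c: # of times optimal rules are rewritten
    pruned: int         # d: # of times pruning conditions are satisfied
    prune_checks: int   # e: # of times pruning conditions are checked
    error_rate: float   # f
    margin: float       # g
    alpha: float        # h: weight of the rule
    rule: str           # i: rule in string encoding (may contain spaces)

def parse_log_line(line):
    # Split at most 6 times so the rule keeps its internal spaces.
    a, b, counts, f, g, h, rule = line.strip().split(None, 6)
    c, d, e = counts.split("/")
    return LogLine(int(a), int(b), int(c), int(d), int(e),
                   float(f), float(g), float(h), rule)
```

For example, `parse_log_line("106 85 3/16602/16850 0.20091623 -0.05129108 0.12543952 effect on").rule` yields the two-node rule `effect on`.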

Compiling Models

For efficient classification, model files generated by bact_learn should be converted into binary files:

./bact_mkmodel -i model-file -o model-file.bin

where model-file is a text model file generated by bact_learn and model-file.bin is a binary model file used in the actual classification.

If you want to obtain a list of support features, use the -O option:

./bact_mkmodel -i model-file -o model-file.bin -O model-file.O

where model-file.O contains pairs of a feature (subtree) and its weight. All subtrees are written in a string encoding (the details of this encoding are described in the section on the format of rules below). The first line gives the default weight.

You can control the number of iterations by the -T option:

./bact_mkmodel -T 100  -i model-file -o model-file.bin -O model-file.O

It is useful to control the parameter (# of boosting iterations) in this phase rather than in the training phase.
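
Because -T can be applied at this stage, several iteration counts can be compared from a single training run. The sketch below only builds the command lines for such a sweep, using the flags documented above; `sweep_commands` and the `.T<N>.bin` naming scheme are hypothetical, and the file names in the usage example are placeholders.

```python
def sweep_commands(model_file, test_file, iteration_counts):
    """Build (bact_mkmodel, bact_classify) argument lists for each -T value.

    Hypothetical helper: it does not run anything, it only assembles argv
    lists from the options described in this document.
    """
    commands = []
    for t in iteration_counts:
        binary = f"{model_file}.T{t}.bin"  # illustrative naming scheme
        mkmodel = ["./bact_mkmodel", "-T", str(t),
                   "-i", model_file, "-o", binary]
        classify = ["./bact_classify", test_file, binary]
        commands.append((mkmodel, classify))
    return commands

if __name__ == "__main__":
    # Placeholder file names; print the commands instead of executing them.
    for mkmodel, classify in sweep_commands("med.model", "med.test",
                                            [100, 500, 1000]):
        print(" ".join(mkmodel))
        print(" ".join(classify))
```

Each argument list can be passed to `subprocess.run(argv, check=True)` to execute the sweep, provided the BACT binaries are in the current directory.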

Classification

Use bact_classify:

./bact_classify test-file model-file.bin

where test-file is test data written in the same format as the training data, and model-file.bin is a binary model generated by bact_mkmodel. bact_classify prints the accuracy, precision, recall and F-measure for the given test file.

For a more detailed analysis of the test data, use the -v option. There are 3 verbose levels:

-v1: output the labels (confidence measures) of the classifications.

./bact_classify -v1 test-file model-file.bin
-1 -0.0123482
1 0.0110549
-1 -0.0147466
-1 -0.0289773
...

The first column is the true answer and the second column is the system output.

-v2: in addition to the -v1 output, each test instance is also printed.

./bact_classify -v2 test-file model-file.bin
-1 -0.0123482 (S1 (SBARQ (WHNP ....
1 0.0110549 (S1 (SBAR (WHNP ..

-v3: in addition to the -v2 output, the list of support features is printed.

./bact_classify -v3 test-file model-file.bin
<instance>
-1 -0.0123482 (S1 (SBARQ (WHNP (~WP (!What)))..
rule: 0.001 __DEFAULT__
rule: 0.00827753 WHNP
rule: 0.00439396 VP
rule: 0.00404992 ~WP
rule: 0.00265146 !?
..
</instance>

__DEFAULT__ is a default rule. Each feature (subtree) is also formatted in a string encoding.
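
The two-column -v1 output is easy to post-process. The following minimal sketch recomputes accuracy, precision, recall and F-measure from such lines. It assumes that the predicted class is the sign of the system output (positive means +1); this is an assumption for illustration and is not confirmed to be how bact_classify computes its own figures. `score_v1_output` is a hypothetical name.

```python
def score_v1_output(lines):
    """Compute metrics from '<true label> <score>' lines (-v1 format).

    Assumption: a score > 0 is read as a +1 prediction, otherwise -1.
    Lines that do not start with two numeric fields are skipped.
    """
    tp = fp = tn = fn = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            truth = int(fields[0])
            score = float(fields[1])
        except ValueError:
            continue
        predicted_positive = score > 0
        if truth > 0 and predicted_positive:
            tp += 1
        elif truth > 0:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}
```

Since the same two columns also start each line at levels -v2 and -v3, the function can read that output too, as long as the extra lines (instances, rules, tags) do not begin with two numeric fields.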

The format of rules (features), a.k.a. string encoding

You can obtain the set of features (subtrees) applied in each classification. BACT uses a string encoding to represent a labeled ordered tree. The nodes are enumerated in pre-order (depth-first). When two nodes A and B are siblings, the special symbol ")" is inserted between them, e.g., "A ) B". More generally, as the examples show, each ")" steps back up one level of the tree before the next node is written. Here are some examples of the string encoding.

A B C            -> (A(B(C)))
A B ) C          -> (A(B)(C))
A B C ) ) D      -> (A(B(C))(D))

All subtrees (features) are written in this format.
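
To read rules more easily, the string encoding can be converted back into a bracketed tree. This is a minimal sketch inferred from the three examples above, not code taken from BACT: `decode_rule` is a hypothetical name, and it assumes that tokens are separated by whitespace and that no node label is itself the symbol ")".

```python
def decode_rule(encoding):
    """Convert a string-encoded subtree into a strict S-expression.

    Each label opens a new node as a child of the current node; each ")"
    closes the current node and moves back up one level. Nodes still open
    at the end of the string are closed automatically.
    """
    out = []
    depth = 0
    for tok in encoding.split():
        if tok == ")":
            if depth == 0:
                raise ValueError("')' with no open node")
            out.append(")")
            depth -= 1
        else:
            if depth == 0 and out:
                raise ValueError("more than one root node")
            out.append("(" + tok)
            depth += 1
    if not out:
        raise ValueError("empty encoding")
    out.append(")" * depth)
    return "".join(out)
```

With this function, `decode_rule("A B C ) ) D")` returns `(A(B(C))(D))`, which matches the third example, and a single-node rule such as `WHNP` becomes `(WHNP)`.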

Bibliography

Taku Kudo, Yuji Matsumoto (2004)
A Boosting Algorithm for Classification of Semi-Structured Text, EMNLP 2004