$Id: index.html,v 1.1 2003/09/07 10:10:02 taku-ku Exp $;
The algorithm FREQT (FREQuent Tree miner), independently introduced by Asai[1] and Zaki[2], efficiently extracts frequent ordered-sub-trees from a set of ordered-trees (forest database). Frequent means that a sub-tree occurs in no less than N trees, where N is a user given threshold usually called minimum support. FREQT efficiently enumerates frequent sub-trees using right-most expansion.
% make % make test
./freqt [options] < input-data > output-data option: -m NUM: set minimum support (default: 1) -M NUM: set minimum node size (default: 1) -L NUM: set maximum node size (default: 0xffffffff) -W: use "weighted" support instead of "standard" support Even if a sub-tree pattern occurs more than one time in a transaction, the "support" for this transaction is 1. On the other hand, "weighted support" counts all occurrences in a transaction. -e: use internal string encoding format[2] as output -w: output the location where each pattern occurs.
The input file need to be in a particular format for the FREQT to work properly. Each line of the input file denotes one tree transaction represented in strict S-expression. Strict means that all nodes, even leaf nodes, must be bracketed. For example, (c(a b)) should be written as (c(a(b))). If the line begin with ';' (comment character in S-expression), this line is ignored.
Here is an example of such data.
;; this is an example (S(NP(I))(VP(saw)(NP(a)(girl))(PP(with)(NP(a)(telescope))))(.)) (S(NP(He))(VP(saw)(NP(the)(boy))(PP(with)(NP(this)(camera))))(.)) (S(NP(I))(VP(go)(PP(to)(NP(this)(hotel))))(.)) (S(NP(She))(VP(finds)(NP(a)(mistake))(PP(in)(this)(paper)))(.))
See 'data' as sample file.
If you want to handle frequent oriented-tree, just sort siblings by dictionary order.
Orderd-tree: (a(b)(d(e)(c)) Oriented-tree: (a(b)(c)(d(e))
Each line denotes a frequent sub-tree pattern, consisting of four columns, support, weighted support, size of tree (# of nodes), and frequent sub-tree pattern represented in strict S-expression. You can restrict the size of tree by using -L MIN_SIZE and -M MAX_SIZE option. The -e option changes the sub-tree format from S-expression to the internal string-encoding format[2].
Here is a concrete example of output.
12 12 2 (~IN(at)) 11 11 2 (~IN(by)) 23 25 2 (~IN(for)) 11 11 3 (~IN(for)(~NN)) 11 13 2 (~IN(from)) 23 24 2 (~IN(in))
The sub-tree pattern, "(~IN(at))", consists of 2 nodes and has 12 supports and 12 weighted supports.
The -w option allows you to obtain the list of transaction ids where a pattern occurs. Here is an example.
<pattern> <support>12<support> <wsupport>16<wsupport> <where>43 43 53 53 55 58 61 61 63 89 91 91 93 94 95 96<where> <what>($)<what> <pattern> <pattern> <support>10<support> <wsupport>18<wsupport> <where>34 34 41 41 45 46 46 48 48 49 49 49 49 50 91 91 95 96<where> <what>(%)<what> <pattern> <pattern>
One sub-tree pattern is bracketed in a <pattern> tag. The transaction ids where the pattern occurs are listed in <where> tag. The extracted sub-tree pattern, support and weighted support are in <what>, <support>, and <wsupport> tags respectively.
$Id: index.html,v 1.1 2003/01/22 10:10:02 taku-ku Exp $;
taku-ku@is.aist-nara.ac.jp