FREQT: An implementation of FREQT (FREQuent Tree miner)

$Id: index.html,v 1.1 2003/09/07 10:10:02 taku-ku Exp $;

Introduction

The algorithm FREQT (FREQuent Tree miner), independently introduced by Asai[1] and Zaki[2], efficiently extracts frequent ordered-sub-trees from a set of ordered-trees (forest database). Frequent means that a sub-tree occurs in no less than N trees, where N is a user given threshold usually called minimum support. FREQT efficiently enumerates frequent sub-trees using right-most expansion.

Author

Taku Kudo
Nara Institute of Science and Technology, Graduate School of Information Science, Computational Linguistics Laboratory

Download

FREQT is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License.
FREQT is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See GNU General Public License.the for more details.
Source

freqt-0.22.tar.gz: HTTP

Installation

Requirements
- C++ compiler (gcc 2.95 or higher, does not support gcc 2.8 or 2.7)
- POSIX getopt library (most of UNIX-based OS have it)
How to make

% make
% make test

Usage

./freqt [options] < input-data > output-data

option: 
-m NUM:   set minimum support   (default: 1)
-M NUM:   set minimum node size (default: 1)
-L NUM:   set maximum node size (default: 0xffffffff)

-W:       use "weighted" support instead of "standard" support
          Even if a sub-tree pattern occurs more than one time
      in a transaction, the "support" for this transaction is 1. 
      On the other hand, "weighted support" counts all 
      occurrences in a transaction.

-e:       use internal string encoding format[2] as output 

-w:   output the location where each pattern occurs.

Format of input data

The input file need to be in a particular format for the FREQT to work properly. Each line of the input file denotes one tree transaction represented in strict S-expression. Strict means that all nodes, even leaf nodes, must be bracketed. For example, (c(a b)) should be written as (c(a(b))). If the line begin with ';' (comment character in S-expression), this line is ignored.

Here is an example of such data.

;; this is an example
(S(NP(I))(VP(saw)(NP(a)(girl))(PP(with)(NP(a)(telescope))))(.))
(S(NP(He))(VP(saw)(NP(the)(boy))(PP(with)(NP(this)(camera))))(.))
(S(NP(I))(VP(go)(PP(to)(NP(this)(hotel))))(.))
(S(NP(She))(VP(finds)(NP(a)(mistake))(PP(in)(this)(paper)))(.))

See 'data' as sample file.

If you want to handle frequent oriented-tree, just sort siblings by dictionary order.

Orderd-tree: (a(b)(d(e)(c)) 
Oriented-tree: (a(b)(c)(d(e))

Format of output data

Each line denotes a frequent sub-tree pattern, consisting of four columns, support, weighted support, size of tree (# of nodes), and frequent sub-tree pattern represented in strict S-expression. You can restrict the size of tree by using -L MIN_SIZE and -M MAX_SIZE option. The -e option changes the sub-tree format from S-expression to the internal string-encoding format[2].

Here is a concrete example of output.

12 12 2 (~IN(at))
11 11 2 (~IN(by))
23 25 2 (~IN(for))
11 11 3 (~IN(for)(~NN))
11 13 2 (~IN(from))
23 24 2 (~IN(in))

The sub-tree pattern, "(~IN(at))", consists of 2 nodes and has 12 supports and 12 weighted supports.

The -w option allows you to obtain the list of transaction ids where a pattern occurs. Here is an example.

<pattern>
<support>12<support>
<wsupport>16<wsupport>
<where>43 43 53 53 55 58 61 61 63 89 91 91 93 94 95 96<where>
<what>($)<what>
<pattern>
<pattern>
<support>10<support>
<wsupport>18<wsupport>
<where>34 34 41 41 45 46 46 48 48 49 49 49 49 50 91 91 95 96<where>
<what>(%)<what>
<pattern>
<pattern>

One sub-tree pattern is bracketed in a <pattern> tag. The transaction ids where the pattern occurs are listed in <where> tag. The extracted sub-tree pattern, support and weighted support are in <what>, <support>, and <wsupport> tags respectively.

Bibliography

[1] Kenji Abe, Shinji Kawasoe, Tatsuya Asai, Hiroki Arimura, Setsuo Arikawa, Optimized Substructure Discovery for Semi-structured Data, Proc. 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD-2002), LNAI 2431, Springer-Verlag, 1-14, August 2002.
[2] Mohammed J. Zaki, Efficiently Mining Frequent Trees in a Forest, 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 2002.

Links

$Id: index.html,v 1.1 2003/01/22 10:10:02 taku-ku Exp $;

taku-ku@is.aist-nara.ac.jp