bpcfg - Bayesian PCFG for unsupervised grammar induction.
Daichi Mochihashi
The Institute of Statistical Mathematics
Wed Apr 25 22:12:49 JST 2018 daichi@ism.ac.jp
About
This is a C++ implementation of Bayesian unsupervised grammar induction,
based on Johnson+ (2007) and the additional slice sampler proposed in
Blunsom and Cohn (2010).
"Bayesian Inference for PCFGs via Markov chain Monte Carlo",
Mark Johnson, Thomas L. Griffiths, Sharon Goldwater. NAACL 2007.
http://aclweb.org/anthology/N/N07/N07-1018.pdf
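For reference, the Gibbs sampler of Johnson et al. (2007) alternates two
conjugate steps: resample each parse tree given the current rule
probabilities (inside algorithm plus top-down sampling), then resample the
rule probabilities from their Dirichlet posterior. Roughly, with gamma the
rule prior (the -g option below) and n counting how often each rule
A -> beta is used in the current trees t:

    \theta_A \sim \mathrm{Dirichlet}(\gamma), \qquad
    \theta_A \mid \mathbf{t} \sim
        \mathrm{Dirichlet}\bigl(\gamma + n_{A\to\beta}(\mathbf{t})\bigr)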
Requirements
- g++ compiler (works with g++ 5.2.0).
- Boost C++ library.
- GSL (Gnu Scientific Library).
Install
Just type make.
This will create "bpcfg", "bpcfgs", and "bpcfgt".
"bpcfg" is the most basic parser; "bpcfgs" and "bpcfgt" are faster samplers
that use different slice sampling strategies. For the most basic usage, just
use "bpcfg".
To Use
% ./bpcfg
bpcfg, Bayesian PCFG for unsupervised data.
usage: % bpcfg OPTIONS train model
OPTIONS
-K states number of hidden categories (default 3)
-N iter number of MCMC iterations (default 1)
-g gamma Dirichlet prior for state transitions (default 0.1)
-e eta Dirichlet prior for lexical emissions (default 0.01)
-a alpha Beta prior for random slices (default 1)
-b beta Beta prior for random slices (default 4)
$Id: bpcfg.cpp,v 1.25 2015/03/10 01:24:28 daichi Exp $
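Read as priors, the hyperparameters above correspond to the following
(a sketch; the symbol names are ours, not the code's):

    \theta_A \sim \mathrm{Dirichlet}(\gamma)   % state transition (production) probabilities (-g)
    \phi_A   \sim \mathrm{Dirichlet}(\eta)     % lexical emission probabilities (-e)
    u        \sim \mathrm{Beta}(\alpha,\beta)  % auxiliary slice variables (-a, -b)

The Beta-distributed slice variables follow Blunsom and Cohn (2010):
chart entries whose probability falls below the corresponding slice
variable are pruned, so each sampling pass only has to explore a random
subset of the full CYK chart.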
This package also includes sample training data "kyoto" (1K sentences) from the Kyoto
corpus (Kurohashi and Nagao 1996). To try it, type the following:
% ./bpcfg -K 5 -N 10 kyoto model
PCFG initialized.
iter = 10, train = kyoto, model = model
N = 1000, K = 6, V = 3858, gamma = 0.1, eta = 0.01
iter[ 1] : PPL = 18222.94 (accept=100.00%)
iter[ 2] : PPL = 17836.08 (accept=99.00%)
iter[ 3] : PPL = 18062.60 (accept=99.10%)
iter[ 4] : PPL = 17589.22 (accept=98.80%)
iter[ 5] : PPL = 16908.53 (accept=98.80%)
iter[ 6] : PPL = 16505.44 (accept=99.00%)
iter[ 7] : PPL = 16102.63 (accept=98.70%)
iter[ 8] : PPL = 15915.51 (accept=99.00%)
iter[ 9] : PPL = 15415.64 (accept=99.00%)
iter[10] : PPL = 14920.21 (accept=98.90%)
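Here PPL is the training perplexity; under the usual per-word definition
(an assumption, not checked against the code), for M words in total

    \mathrm{PPL} = \exp\Bigl(-\frac{1}{M} \sum_s \log p(\mathbf{w}_s)\Bigr)

and "accept" is presumably the acceptance rate of the Metropolis-Hastings
correction over proposed trees.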
This will create "model.tree", "model.grammar", and "model.dic".
- "model.tree" is a collection of parsed sentences of training data in
an S-expression style.
- "model.grammar" is the model file of counts necessary for building
PCFG production probabilities and emission probabilities.
- "model.dic" is a mapping from words to word ids (0 is reserved for
unknown words).
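As an illustration only, a loader for "model.dic" might look like the
following. It assumes one whitespace-separated "word id" pair per line;
the actual file layout may differ, so check the file first.

    // Hypothetical loader for "model.dic" -- the "word id" per-line
    // format is an assumption, not taken from this package.
    #include <fstream>
    #include <string>
    #include <unordered_map>

    std::unordered_map<std::string,int>
    load_dic (const std::string& path)
    {
        std::unordered_map<std::string,int> dic;
        std::ifstream in (path);
        std::string word;
        int id;
        while (in >> word >> id)
            dic[word] = id;     // id 0 is reserved for unknown words
        return dic;
    }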
The core of the implementation is "pcfg.cpp", which contains the MCMC
sampler based on CYK parsing.
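For orientation, the inside (CYK) pass that such a sampler builds on looks
roughly like this: a minimal sketch for a CNF grammar with K hidden
categories, not the package's actual code ("rule" and "emit" are assumed
dense probability tables):

    #include <vector>
    using std::vector;

    // alpha[i][j][a] = probability that category a yields words i..j.
    vector<vector<vector<double>>>
    inside (const vector<int>& words, int K,
            const vector<vector<vector<double>>>& rule,  // P(a -> b c)
            const vector<vector<double>>& emit)          // P(a -> w)
    {
        int n = (int) words.size();
        vector<vector<vector<double>>> alpha
            (n, vector<vector<double>>(n, vector<double>(K, 0)));
        for (int i = 0; i < n; i++)              // width-1 spans
            for (int a = 0; a < K; a++)
                alpha[i][i][a] = emit[a][words[i]];
        for (int w = 2; w <= n; w++)             // wider spans
            for (int i = 0; i + w <= n; i++) {
                int j = i + w - 1;
                for (int a = 0; a < K; a++)
                    for (int k = i; k < j; k++)  // split point
                        for (int b = 0; b < K; b++)
                            for (int c = 0; c < K; c++)
                                alpha[i][j][a] += rule[a][b][c]
                                    * alpha[i][k][b] * alpha[k+1][j][c];
            }
        return alpha;  // a tree is then sampled top-down from alpha
    }

A tree is drawn by recursively sampling, for each span, a split point and
child categories in proportion to their contribution to that span's inside
probability; slice sampling prunes most of the (k,b,c) combinations in the
loop above.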
Note that learning usually requires hundreds, often thousands, of
iterations before the perplexity stabilizes at around a few hundred.
Without slice sampling, learning is generally very slow ("bpcfgs" will be
much faster).
daichi
Last modified: Sat Apr 22 17:18:19 2023