bpcfg - Bayesian PCFG for unsupervised grammar induction.

Daichi Mochihashi
The Institute of Statistical Mathematics
Wed Apr 25 22:12:49 JST 2018 daichi@ism.ac.jp


This is a C++ implementation of Bayesian unsupervised grammar induction based on Johnson et al. (2007), with an additional slice sampler as proposed in Blunsom and Cohn (2010).
"Bayesian Inference for PCFGs via Markov chain Monte Carlo", Mark Johnson, Thomas L. Griffiths, Sharon Goldwater. NAACL 2007.




To build, just type make.
This will create "bpcfg", "bpcfgs", and "bpcfgt".
"bpcfg" is the most basic parser; "bpcfgs" and "bpcfgt" are faster samplers that use different forms of slice sampling. For the most basic usage, just use "bpcfg".

To Use

% ./bpcfg
bpcfg, Bayesian PCFG for unsupervised data.
usage: % bpcfg OPTIONS train model
 -K states  number of hidden categories (default 3)
 -N iter    number of MCMC iterations (default 1)
 -g gamma   Dirichlet prior for state transitions (default 0.1)
 -e eta     Dirichlet prior for lexical emissions (default 0.01)
 -a alpha   Beta prior for random slices (default 1)
 -b beta    Beta prior for random slices (default 4)
$Id: bpcfg.cpp,v 1.25 2015/03/10 01:24:28 daichi Exp $
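
The -g and -e options above are Dirichlet hyperparameters. As a rough illustration of what such a prior does, the following sketch draws a probability vector from a Dirichlet posterior, Dirichlet(counts + prior), by normalizing Gamma draws. This is not the code in this package: the function name sample_dirichlet and its arguments are hypothetical, and the prior is assumed to be symmetric.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not part of bpcfg): draw theta ~ Dirichlet(counts[i] + prior),
// assuming a symmetric prior such as the value given to -g or -e.
std::vector<double> sample_dirichlet(const std::vector<double>& counts,
                                     double prior, std::mt19937& rng)
{
    assert(!counts.empty() && prior > 0.0);
    std::vector<double> theta(counts.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        // A Dirichlet draw is a vector of independent Gamma draws, normalized.
        std::gamma_distribution<double> g(counts[i] + prior, 1.0);
        theta[i] = g(rng);
        sum += theta[i];
    }
    if (sum <= 0.0) {
        // Every draw underflowed to zero (possible with tiny shapes): use uniform.
        for (double& t : theta) t = 1.0 / theta.size();
        return theta;
    }
    for (double& t : theta) t /= sum;
    return theta;
}
```

With a small prior such as 0.1 or 0.01, almost all of the probability mass goes to outcomes with nonzero counts, so small values favor sparse rule distributions.
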
This package also includes sample training data "kyoto" (1K sentences) from the Kyoto corpus (Kurohashi and Nagao 1996). To try it, type the following:
% ./bpcfg -K 5 -N 10 kyoto model 
PCFG initialized.
iter = 10, train = kyoto, model = model
N = 1000, K = 6, V = 3858, gamma = 0.1, eta = 0.01
iter[ 1] : PPL = 18222.94 (accept=100.00%)
iter[ 2] : PPL = 17836.08 (accept=99.00%)
iter[ 3] : PPL = 18062.60 (accept=99.10%)
iter[ 4] : PPL = 17589.22 (accept=98.80%)
iter[ 5] : PPL = 16908.53 (accept=98.80%)
iter[ 6] : PPL = 16505.44 (accept=99.00%)
iter[ 7] : PPL = 16102.63 (accept=98.70%)
iter[ 8] : PPL = 15915.51 (accept=99.00%)
iter[ 9] : PPL = 15415.64 (accept=99.00%)
iter[10] : PPL = 14920.21 (accept=98.90%)
This will create "model.tree", "model.grammar", and "model.dic". The most important part of the code is "pcfg.cpp", which contains the MCMC sampler based on CYK parsing. Note that the sampler usually needs hundreds, often thousands, of iterations before the perplexity stabilizes at around a few hundred. Without slice sampling, learning is generally very slow (bpcfgs is much faster).
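
To give an idea of why slice sampling speeds things up, here is a minimal sketch of a Beta-distributed slice threshold. It assumes that the threshold for a rule currently used with probability p is drawn as u = p * x with x ~ Beta(alpha, beta), and that only candidate rules with probability above u are kept in the chart. This is an assumption about the general technique, not a description of what bpcfgs or bpcfgt actually do; all function names are hypothetical.

```cpp
#include <cassert>
#include <random>

// Hypothetical helper: draw x ~ Beta(a, b) as Ga / (Ga + Gb) with Gamma draws.
double sample_beta(double a, double b, std::mt19937& rng)
{
    assert(a > 0.0 && b > 0.0);
    std::gamma_distribution<double> ga(a, 1.0), gb(b, 1.0);
    double x = ga(rng);
    double y = gb(rng);
    return x / (x + y);
}

// Assumed slice rule: threshold u = p * Beta(a, b), where p is the probability
// of the rule used in the current parse (so that rule always survives).
double slice_threshold(double p, double a, double b, std::mt19937& rng)
{
    return p * sample_beta(a, b, rng);
}

// A candidate rule with probability q is kept only if it lies above the slice.
bool survives(double q, double u)
{
    return q > u;
}
```

With the defaults alpha = 1 and beta = 4, the Beta draw has mean 0.2, so under this assumption the threshold sits on average at one fifth of the current rule's probability, and rules far below that are pruned before the CYK pass.
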

Last modified: Sat Apr 22 17:18:19 2023