bpcfg - Bayesian PCFG for unsupervised grammar induction.
Daichi Mochihashi
The Institute of Statistical Mathematics
Wed Apr 25 22:12:49 JST 2018 daichi@ism.ac.jp
About
This is a C++ implementation of Bayesian unsupervised grammar induction,
based on Johnson+ (2007) and the additional slice sampler proposed in
Blunsom and Cohn (2010).
"Bayesian Inference for PCFGs via Markov chain Monte Carlo",
Mark Johnson, Thomas L. Griffiths, Sharon Goldwater. NAACL 2007.
http://aclweb.org/anthology/N/N07/N07-1018.pdf
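For reference, the Gibbs sampler of Johnson et al. (2007) alternates two
conjugate steps: resample each parse tree given the current rule
probabilities (inside algorithm plus top-down sampling), then resample the
rule probabilities from their Dirichlet posterior. Roughly, with gamma the
rule prior (the -g option below) and n counting how often each rule
A -> beta is used in the current trees t:

    \theta_A \sim \mathrm{Dirichlet}(\gamma), \qquad
    \theta_A \mid \mathbf{t} \sim
        \mathrm{Dirichlet}\bigl(\gamma + n_{A\to\beta}(\mathbf{t})\bigr)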
Requirements
- g++ compiler (works with g++ 5.2.0).
- Boost C++ library.
- GSL (Gnu Scientific Library).
Install
Just type make.
This will create "bpcfg", "bpcfgs", and "bpcfgt".
"bpcfg" is the most basic parser; "bpcfgs" and "bpcfgt" are faster samplers
that use different slice sampling strategies. For the most basic usage, just
use "bpcfg".
To Use
% ./bpcfg
bpcfg, Bayesian PCFG for unsupervised data.
usage: % bpcfg OPTIONS train model
OPTIONS
-K states number of hidden categories (default 3)
-N iter number of MCMC iterations (default 1)
-g gamma Dirichlet prior for state transitions (default 0.1)
-e eta Dirichlet prior for lexical emissions (default 0.01)
-a alpha Beta prior for random slices (default 1)
-b beta Beta prior for random slices (default 4)
$Id: bpcfg.cpp,v 1.25 2015/03/10 01:24:28 daichi Exp $
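Read as priors, the hyperparameters above correspond to the following
(a sketch; the symbol names are ours, not the code's):

    \theta_A \sim \mathrm{Dirichlet}(\gamma)   % state transition (production) probabilities (-g)
    \phi_A   \sim \mathrm{Dirichlet}(\eta)     % lexical emission probabilities (-e)
    u        \sim \mathrm{Beta}(\alpha,\beta)  % auxiliary slice variables (-a, -b)

The Beta-distributed slice variables follow Blunsom and Cohn (2010):
chart entries whose probability falls below the corresponding slice
variable are pruned, so each sampling pass only has to explore a random
subset of the full CYK chart.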
This package also includes sample training data "kyoto" (1K sentences) from the Kyoto
corpus (Kurohashi and Nagao 1996). To try it, type the following:
% ./bpcfg -K 5 -N 10 kyoto model
PCFG initialized.
iter = 10, train = kyoto, model = model
N = 1000, K = 6, V = 3858, gamma = 0.1, eta = 0.01
iter[ 1] : PPL = 18222.94 (accept=100.00%)
iter[ 2] : PPL = 17836.08 (accept=99.00%)
iter[ 3] : PPL = 18062.60 (accept=99.10%)
iter[ 4] : PPL = 17589.22 (accept=98.80%)
iter[ 5] : PPL = 16908.53 (accept=98.80%)
iter[ 6] : PPL = 16505.44 (accept=99.00%)
iter[ 7] : PPL = 16102.63 (accept=98.70%)
iter[ 8] : PPL = 15915.51 (accept=99.00%)
iter[ 9] : PPL = 15415.64 (accept=99.00%)
iter[10] : PPL = 14920.21 (accept=98.90%)
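Here PPL is the training perplexity; under the usual per-word definition
(an assumption, not checked against the code), for M words in total

    \mathrm{PPL} = \exp\Bigl(-\frac{1}{M} \sum_s \log p(\mathbf{w}_s)\Bigr)

and "accept" is presumably the acceptance rate of the Metropolis-Hastings
correction over proposed trees.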
This will create "model.tree", "model.grammar", and "model.dic".
- "model.tree" is a collection of parsed sentences of training data in
an S-expression style.
- "model.grammar" is the model file of counts necessary for building
PCFG production probabilities and emission probabilities.
- "model.dic" is a mapping from words to word ids (0 is reserved for
unknown words).
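As an illustration only, a loader for "model.dic" might look like the
following. It assumes one whitespace-separated "word id" pair per line;
the actual file layout may differ, so check the file first.

    // Hypothetical loader for "model.dic" -- the "word id" per-line
    // format is an assumption, not taken from this package.
    #include <fstream>
    #include <string>
    #include <unordered_map>

    std::unordered_map<std::string,int>
    load_dic (const std::string& path)
    {
        std::unordered_map<std::string,int> dic;
        std::ifstream in (path);
        std::string word;
        int id;
        while (in >> word >> id)
            dic[word] = id;     // id 0 is reserved for unknown words
        return dic;
    }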
The core of the implementation is "pcfg.cpp", which contains the MCMC
sampler based on CYK parsing.
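For orientation, the inside (CYK) pass that such a sampler builds on looks
roughly like this: a minimal sketch for a CNF grammar with K hidden
categories, not the package's actual code ("rule" and "emit" are assumed
dense probability tables):

    #include <vector>
    using std::vector;

    // alpha[i][j][a] = probability that category a yields words i..j.
    vector<vector<vector<double>>>
    inside (const vector<int>& words, int K,
            const vector<vector<vector<double>>>& rule,  // P(a -> b c)
            const vector<vector<double>>& emit)          // P(a -> w)
    {
        int n = (int) words.size();
        vector<vector<vector<double>>> alpha
            (n, vector<vector<double>>(n, vector<double>(K, 0)));
        for (int i = 0; i < n; i++)              // width-1 spans
            for (int a = 0; a < K; a++)
                alpha[i][i][a] = emit[a][words[i]];
        for (int w = 2; w <= n; w++)             // wider spans
            for (int i = 0; i + w <= n; i++) {
                int j = i + w - 1;
                for (int a = 0; a < K; a++)
                    for (int k = i; k < j; k++)  // split point
                        for (int b = 0; b < K; b++)
                            for (int c = 0; c < K; c++)
                                alpha[i][j][a] += rule[a][b][c]
                                    * alpha[i][k][b] * alpha[k+1][j][c];
            }
        return alpha;  // a tree is then sampled top-down from alpha
    }

A tree is drawn by recursively sampling, for each span, a split point and
child categories in proportion to their contribution to that span's inside
probability; slice sampling prunes most of the (k,b,c) combinations in the
loop above.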
Note that learning usually requires hundreds, often thousands, of
iterations before the perplexity stabilizes at around a few hundred.
Without slice sampling, learning is generally very slow ("bpcfgs" will be
much faster).
daichi
Last modified: Sat Apr 22 17:18:19 2023