lwlm, The Latent Words Language Model.
Daichi Mochihashi
NTT Communication Science Laboratories
$Id: lwlm.html,v 1.1 2010/03/19 10:15:06 daichi Exp $
lwlm is an exact, fully Bayesian implementation of the Latent Words Language
Model (Deschacht and Moens, 2009). It automatically learns synonymous words
to infer a context-dependent "latent word" for each word occurrence,
in a completely unsupervised fashion.
Technically, LWLM is a higher-order hidden Markov model over words in which the
latent dimension equals the size of the lexicon, or equivalently a "class-based"
n-gram model in which the number of classes equals the number of word types.
This high dimensionality and higher-order model structure preclude ordinary
EM algorithms and necessitate a Gibbs sampler for learning.
In this toolkit, I have further extended the model to be fully Bayesian, and
introduced a "sliced" Gibbs sampler to cope with the high dimensionality of
the latent words that must be considered.
Additionally, lwlm includes a complete implementation of the hierarchical
Pitman-Yor language model (HPYLM) (Teh, 2006) as its submodel.
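Written out, the hidden Markov view above corresponds to the following
factorization (notation mine: h_t is the latent word at position t, w_t the
observed word, and n the n-gram order):

p(w_1, \ldots, w_T, h_1, \ldots, h_T) = \prod_{t=1}^{T} p(h_t \mid h_{t-n+1}, \ldots, h_{t-1}) \, p(w_t \mid h_t)

The first factor is the n-gram model over latent words (handled by HPYLM in
this toolkit), and the second is the word translation distribution smoothed by
the Dirichlet hyperparameter described under the -a option below.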
Requirements
- GNU C++ compiler. (developed with g++ 4.1.1)
- Note that the implementation uses <ext/hash_map>, which appears to
have been removed from recent versions of g++.
- Google sparsehash must be installed beforehand. (built with sparsehash-1.6)
Downloads
- Source: lwlm-0.1.tar.gz
- Precomputed Models:
- New York Times model:
- model-nyt.n3.tar.gz
(recommended) [4.9MB]
- This is a model learned from 100K sentences randomly drawn from the
New York Times, January 2006 (about 2.1 million words).
Penn tokenization.
For details, see the enclosed README.
- Japanese "Mainichi" model:
- model-mai2007.n3.tar.gz
(recommended) [4.7MB]
- This is a model built from the 2007 texts of Mainichi, a standard
Japanese newspaper; the data contain 2.4 million words in
100K sentences. For details, see the
enclosed README.
- FSNLP "austen" model:
- model-austen.tar.gz
[3.2MB]
- This is a model built from the text of Jane Austen's novels
available from the
FSNLP (Foundations of Statistical Natural Language
Processing, MIT Press) site for the
language modeling chapter.
It includes both the original text and the model learned
after 400 iterations of the Gibbs sampler.
Note that this model was built on a very small text.
- Japanese "Kyoto" model:
- model-kyoto.tar.gz
[2.2MB]
- This is a model built from the "Kyoto corpus", a standard
Japanese corpus containing 1 million words in
38,400 sentences.
For copyright reasons, only the learned model is included
in this file (the original text and learned latent words are
excluded). Note that this model is still small, but it suffices to
demonstrate the ability of LWLM for natural language processing.
Install
- Install Google sparsehash if necessary.
- Take a glance at the Makefile and edit it accordingly.
- Type "make all". This builds both "lwlm" (the learner) and "lwlm-decoder"
(the decoder).
- If any problem occurs, please inform the author.
Getting started
Prepare an appropriately sized text or simply use
this sample data,
and invoke lwlm as follows (this may take a few minutes):
% ./lwlm -n 3 -N 10 persuation.txt model.austen
latent ngram order = 3, alpha = 0.01
reading persuation.txt.. done.
vocabulary = 5756
parsing 1005/ 1005 ..done.
[10] : sampling 84149/ 84148.. ETA: 0:00:00 (6678 words/sec) PPL = 88.1006
done.
saving model.. done.
Or, just download a precomputed model from above and unpack it:
% tar xvfz model-austen.tar.gz
To decode latent words in a text, invoke lwlm-decode as follows (the =(...) construct below is zsh process substitution):
% ./lwlm-decode -K 5 -o out =(tail -10 persuation.txt) model.austen
This yields a file "out" containing the decoded latent word distributions.
For the complete command-line options, try "lwlm -h" and "lwlm-decode -h", or
see the sections below.
Data format and Model files
lwlm works on plain text files.
Each line of the text represents a sentence, which consists of words
separated by (possibly multiple) white spaces.
Empty lines are ignored.
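As a concrete illustration, the following minimal C++ sketch (not part of lwlm
itself) reads a file in this format into tokenized sentences:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main (int argc, char *argv[])
{
    if (argc < 2) {
        std::cerr << "usage: reader <textfile>\n";
        return 1;
    }
    std::ifstream in (argv[1]);
    std::string line;
    std::vector<std::vector<std::string> > sentences;
    while (std::getline (in, line)) {        // one sentence per line
        std::istringstream tokens (line);
        std::vector<std::string> sentence;
        std::string word;
        while (tokens >> word)               // splits on any run of white spaces
            sentence.push_back (word);
        if (! sentence.empty())              // empty lines are ignored
            sentences.push_back (sentence);
    }
    std::cout << sentences.size() << " sentences read.\n";
    return 0;
}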
Each model directory has the following file structure:
+ model --+-- lexicon Words beginning from internal id = 0.
+-- latent Sampled latent words corresponding to the training text.
+-- table Learned table of hidden -> observed translations.
+-- param Model parameters (order and alpha)
+-- log Log file during computation, esp. PPL (see below)
+-- lm HPYLM
+-- nodes Nodes and CRP restaurants
+-- tree Dependencies between nodes
+-- param HPYLM parameters (order, d, theta, lexicon)
Command line syntax
LWLM Model learner
% ./lwlm -h
lwlm, The Latent Words Language Model.
$Id: lwlm.html,v 1.1 2010/03/19 10:15:06 daichi Exp $
Copyright (C) 2010 Daichi Mochihashi, All rights reserved.
usage: lwlm OPTIONS train model
OPTIONS
-n n-gram order (default 3, trigram)
-a Dirichlet hyperparameter of translation table (default 0.01)
-N number of Gibbs iterations
-h help
- -n [n]
- n-gram order of latent words (default 3). Generally, larger n requires
more data for reliable inference.
- -a [a]
- Dirichlet pseudo count for the word translation matrix. Smaller counts
induce peaky latent word distributions, while larger counts explore
more possibilities for them (see the note after this list).
- -N [N]
- Number of Gibbs iterations. Generally, several hundred iterations
will suffice for approximate convergence.
For assessing convergence, see the Convergence section below.
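Note on -a: a Dirichlet pseudo count alpha corresponds to the usual collapsed
Dirichlet-multinomial predictive for the translation table (this is my reading
of the option, in my own notation):

p(w \mid h) = \frac{n(h, w) + \alpha}{n(h, \cdot) + \alpha V}

where n(h, w) is the number of times the latent word h is currently aligned to
the observed word w, n(h, \cdot) is its total count, and V is the lexicon size.
A small alpha such as the default 0.01 lets the observed alignments dominate,
which is why smaller counts yield peakier distributions.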
LWLM Decoder
% ./lwlm-decode -h
lwlm-decoder, The Latent Words Language Model decoder.
$Id: lwlm.html,v 1.1 2010/03/19 10:15:06 daichi Exp $
Copyright (C) 2010 Daichi Mochihashi, All rights reserved.
usage: lwlm-decode OPTIONS test model
OPTIONS:
-B [b] burn-in iterations (default 100)
-N [n] number of iterations to collect samples (default 100)
-I [i] interval to collect posterior samples (default 1)
-K [K] number of possible candidates (default 5)
-e [eps] threshold for candidate probability (default 0)
-f [fmt] printf format of candidates (default "%s (%.03f) ")
-o [file] filename for output
-h this help
lwlm-decode finds latent words for the input text by running Gibbs iterations
to collect posterior samples.
The available parameters are shown above; if the "-o" option is not specified,
the decoded results are printed to stdout.
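The following sketch (hypothetical names; not the lwlm source) shows how the
-B, -N, -I, -K and -e options would combine for a single word position:

#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One Gibbs sweep over the latent words of the test text.
// (placeholder: the real sampler resamples every latent word from its posterior)
static void gibbs_sweep (std::vector<std::string>& latent) { (void) latent; }

static bool by_prob (const std::pair<std::string,double>& a,
                     const std::pair<std::string,double>& b)
{ return a.second > b.second; }

// Collect posterior samples of the latent word at position i.
std::vector<std::pair<std::string,double> >
decode_position (std::vector<std::string>& latent, size_t i,
                 int B, int N, int I, int K, double eps)
{
    std::map<std::string,int> count;
    for (int iter = 0; iter < B; iter++)        // burn-in sweeps (-B)
        gibbs_sweep (latent);
    for (int s = 0; s < N; s++) {               // posterior samples (-N)
        for (int skip = 0; skip < I; skip++)    // sampling interval (-I)
            gibbs_sweep (latent);
        count[latent[i]]++;                     // record the sampled latent word
    }
    std::vector<std::pair<std::string,double> > result;
    for (std::map<std::string,int>::const_iterator it = count.begin();
         it != count.end(); it++) {
        double p = (double) it->second / N;
        if (p > eps)                            // probability threshold (-e)
            result.push_back (std::make_pair (it->first, p));
    }
    std::sort (result.begin(), result.end(), by_prob);
    if ((int) result.size() > K)                // keep the top K candidates (-K)
        result.resize (K);
    return result;
}

A real decoder would of course accumulate counts for every position from the
same sweeps; the single-position version above is only for brevity.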
Note: to change the output format, e.g. to create a data file
for another processing task, use the "-f format" option.
The format takes two arguments,
a latent word (char *) and its probability (double); thus '-f "%s:%.04f "'
yields output like "foo bar:0.6704 qux:0.2000 quux:0.1250".
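For illustration, the small program below (with made-up candidates) applies
such a two-argument format in the same way:

#include <cstdio>

int main (void)
{
    const char *format = "%s:%.04f ";                    // as given with -f
    const char *candidates[]    = { "bar", "qux", "quux" };
    double      probabilities[] = { 0.6704, 0.2000, 0.1250 };

    printf ("%s ", "foo");                               // observed word, as in the example above
    for (int k = 0; k < 3; k++)                          // its latent candidates
        printf (format, candidates[k], probabilities[k]);
    printf ("\n");
    return 0;
}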
Convergence
To assess convergence, the "log" file created in the model directory records
the training data perplexity (a monotone transform of the negative data
log-likelihood; see below) for each Gibbs iteration.
Generally, the perplexity will increase and saturate after several hundred
iterations.
This is the correct behavior, because LWLM is essentially a model that avoids
overfitting to the training data.
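For reference, the perplexity recorded in the log is the usual per-word quantity

\mathrm{PPL} = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid \mathrm{context}) \right)

so a rising perplexity simply means a lower likelihood of the training data
under the current sample.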
Technical details
Sampling a latent word for each observation is the core of inference in the
Latent Words Language Model.
However, in principle this requires computing probabilities over all words
in the lexicon, which usually means tens of thousands of candidates.
To cope with this difficulty, lwlm leverages slice sampling
(Neal, 2003) to threshold these candidates probabilistically.
However, naive thresholding would still require computing probabilities over
all words, so lwlm maintains a special data structure (record *)
for each n-gram node that allows the candidates to be pruned efficiently.
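The following is a minimal generic illustration of slice sampling for a
discrete variable (not the lwlm implementation): an auxiliary threshold u
restricts the candidates that actually have to be enumerated.

#include <cstdlib>
#include <vector>

// Uniform random number in (0,1).
static double uniform (void) { return (rand() + 1.0) / (RAND_MAX + 2.0); }

// One slice-sampling move for a discrete variable with (unnormalized)
// probabilities p[0..V-1], starting from the current assignment "current".
int slice_sample (const std::vector<double>& p, int current)
{
    double u = uniform() * p[current];       // slice level: u ~ Uniform(0, p[current])
    std::vector<int> candidates;
    for (size_t w = 0; w < p.size(); w++)    // lwlm avoids this full scan with its
        if (p[w] > u)                        // per-node candidate structure (record *)
            candidates.push_back ((int) w);
    return candidates[(size_t)(uniform() * candidates.size())];  // uniform on the slice
}

Resampling uniformly among the candidates whose probability exceeds u leaves
the target distribution invariant; the per-node data structure mentioned above
is what lets lwlm enumerate those candidates without scanning the whole lexicon.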
References
- "Using the Latent Words Language Model for Semi-Supervised Semantic Role
Labeling". Koen Deschacht and Marie-Francine Moens. EMNLP 2009.
- "The Latent Words Language Model". Koen Deschacht and Marie-Francine
Moens. In Proceedings of the 18th Annual Belgian-Dutch Conference on
Machine Learning (Benelearn 09), 2009.
- "Slice sampling". Radford M. Neal.
Annals of Statistics, 31(3), pp.705-767, 2003.
- "A Bayesian Interpretation of Interpolated Kneser-Ney". Yee Whye Teh.
Technical Report TRA2/06, School of Computing, NUS, 2006.
Acknowledgement
Thanks to Daisuke Okanohara for advice on efficient data structures.
daichi<at>cslab.kecl.ntt.co.jp
Last modified: Thu Mar 25 15:44:58 2010