lda, a Latent Dirichlet Allocation package.
Daichi Mochihashi
NTT Communication Science Laboratories
$Id: index.html,v 1.3 2004/12/04 12:47:35 daiti-m Exp $
Overview
lda is a Latent Dirichlet Allocation (Blei et al., 2001) package
written in both MATLAB and C (with a command line interface).
This package provides only the standard variational Bayes estimation
that was first proposed, but it uses a simple textual data format
that is almost the same as that of SVMlight
or TinySVM.
This package can be used as an aid to understanding LDA, or simply as a
regularized alternative to PLSI, which has a severe overfitting problem
due to its maximum likelihood formulation.
Advanced users who wish to benefit from the latest results may consider
npbayes or
MPCA,
though these use data formats different from the one described here.
Requirements
- C version:
  - An ANSI C compiler. Compilation has been confirmed on the systems below:
    - Linux 2.4.20, Red Hat 9, gcc 3.2.2
    - Linux 2.6.5, Fedora Core release 2, gcc 3.3.3
    - FreeBSD 4.8-STABLE, gcc 2.95.4 (GNU make)
    - SunOS 5.8, gcc 2.95.3 (GNU make)
- MATLAB version:
  - A MATLAB environment. The Statistics Toolbox may be needed for the
    psi() function (if it is not installed, consider using Minka's
    Lightspeed MATLAB toolbox); a quick check is shown below.
  - Octave is not supported.
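You can check quickly from the MATLAB prompt whether a psi() implementation
is visible (a suggestion of mine, not part of the package):

>> which psi

If MATLAB reports that 'psi' is not found, install one of the toolboxes above.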
Install
- C version
  - Take a glance at the Makefile, and type make.
  - The C version is not intended for those who are unfamiliar with C.
    The Makefile and source files are very simple, so you can modify them
    as needed if they do not compile. (If you find serious problems,
    please contact the author.)
- MATLAB version
  - Simply add the directory where you unpacked the *.m files to the
    MATLAB path. For example:
% cd ~/work
% tar xvfz lda-0.1-matlab.tar.gz
% cd lda-0.1-matlab/
% matlab
>> addpath ~/work/lda-0.1-matlab
Download
Performance
- The C version runs about 8 times or more faster than the MATLAB version
  (even though the MATLAB code is fully vectorized).
- However, the MATLAB version keeps everything inside the MATLAB
  environment, so it is easy to investigate and manipulate the parameters
  (especially graphically, using plot or surf).
  Moreover, the MATLAB code is simple and easy to understand.
- To estimate the parameters of a 50-class LDA decomposition of the
  standard Cranfield collection (1397 documents, 5177 unique terms),
  - the C version took 1 minute 32 seconds,
  - the MATLAB version took 38 minutes 55 seconds,
  on a Xeon 2.8GHz.
  Both run efficiently in low memory: in the experiment above, they used
  only 6.8MB (C) and 29MB (MATLAB) of memory.
Getting Started
This package contains a sample data file "train", compiled from
the first 100 documents of the Cranfield collection.
Each feature id corresponds to the respective line of the file "train.lex";
that is, feature 20 means the word "accuracy", feature 21 means "accurate",
and so on.
After compilation, you can test it using "train" data as follows.
- C version:
- % lda -N 20 train model
- MATLAB version:
- % matlab
>> [alpha,beta] = ldamain('train',20);
The C version creates two files, "model.alpha" and "model.beta"; the MATLAB
version creates a 1x20-dimensional vector alpha and a 1324x20-dimensional
matrix beta.
The parameters of the resulting models are explained in the sections below.
Data Format
The data format is common to both the C and MATLAB versions.
It is almost the same as that of the widely used
SVMlight, except that
there are no labels, since LDA is an unsupervised method.
A data file is an ASCII text file where each line represents a document.
(NB: "document" is simply a synonym for "a group of data", so you can
interpret it however you like, as long as it means a group of data.)
A typical data file looks like this:
1:1 2:4 5:2
1:2 3:3 5:1 6:1 7:1
2:4 5:1 7:1
- Each line can be at most 65535 bytes (about 820 lines of 80-column text)
  by default. For a standard document this value is sufficient, but if you
  wish to increase the limit, modify BUFSIZE in feature.c as you like.
- Each line consists of pairs of <feature_id>:<count>.
  Here, feature_id is an integer starting from 1 (the same as in
  SVMlight); count can be an integer or a real number, and must be
  positive.
- <feature_id>:<count> pairs are separated by (possibly multiple)
  white spaces.
  The program is coded to work even if there are empty lines, but
  it is preferable not to have such unnecessary lines.
- For a complete specification, please refer to SVMlight's page.
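If your counts already live in MATLAB, a small function like the following
can write a documents-by-features count matrix in this format (a sketch of
my own, not part of the package; writedata and its arguments are
hypothetical names):

function writedata(file, d)
% WRITEDATA  Write a documents-by-features count matrix d, one document
% per line, as <feature_id>:<count> pairs in the format described above.
fid = fopen(file, 'w');
for i = 1:size(d,1)
    v = find(d(i,:));                          % feature ids with nonzero counts
    fprintf(fid, '%d:%g ', [v; full(d(i,v))]); % interleave id:count pairs
    fprintf(fid, '\n');
end
fclose(fid);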
Command Line Syntax
C version
lda is typically invoked simply as:
% lda -N 100 train model
where % is a command prompt.
train is a data file in the format described above, and
model is the basename of the output files for the model parameters.
Specifically, lda writes two outputs,
model.alpha and model.beta,
which represent \alpha and \beta of the LDA model described in
(Blei et al., 2001); that is, \alpha is the parameter of the prior
Dirichlet distribution over the latent classes, and \beta is the set of
class unigram distributions, one for each latent class.
-N 100 is the number of latent classes to assume in the data.
For the standard LDA model, this is the only parameter that must be provided
in advance. In this case, 100 latent classes are assumed.
In addition, there are several rarely used options:
% lda -h
lda, a Latent Dirichlet Allocation package.
Copyright (c) 2004 Daichi Mochihashi, All rights reserved.
usage: lda -N classes [-I emmax -D demmax -E epsilon] train model
- -I emmax
  - Maximum number of iterations of the outer VB-EM algorithm; it exits
    early when converged. (default 100)
- -D demmax
  - Maximum number of iterations of the inner VB-EM algorithm for each
    document; it exits early when converged. (default 20)
- -E epsilon
  - A threshold that determines overall convergence of the estimation:
    a lower bound on the relative increase of the total data likelihood.
    (default 0.0001)
- -h
  - Displays this help.
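For example, to allow more outer iterations and demand a tighter
convergence threshold (the values here are only illustrative):

% lda -N 50 -I 200 -E 0.00001 train model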
MATLAB version
First, load a data file into a MATLAB data structure:
% matlab
>> d = fmatrix('train');
Then run the function lda to estimate the parameters.
The second argument is the number of latent classes you assume
(20 in the example below).
>> help lda
Latent Dirichlet Allocation, standard model.
Copyright (c) 2004 Daichi Mochihashi, all rights reserved.
$Id: index.html,v 1.3 2004/12/04 12:47:35 daiti-m Exp $
[alpha,beta] = lda(d,k,[emmax,demmax])
d : data of documents
k : # of classes to assume
emmax : # of maximum VB-EM iteration (default 100)
demmax : # of maximum VB-EM iteration for a document (default 20)
>> [alpha,beta] = lda(d,20);
Two optional parameters, emmax and demmax, can be passed to lda;
they have the same meanings as in the C version.
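For example, with illustrative values of emmax = 200 and demmax = 30:

>> [alpha,beta] = lda(d,20,200,30);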
If you find it troublesome to load the text data into a MATLAB structure
in advance,
there is a wrapper function ldamain that works exactly like
the C version:
>> [alpha,beta] = ldamain('train.dat',20);
Output Format
MATLAB version
In the example above, alpha is an N-dimensional row vector of \alpha
for the corresponding latent classes,
and beta is a [V,N]-dimensional matrix of \beta,
where beta(v,n) = p(v|n) (n = 1..N, v = 1..V; V is the size of the lexicon).
You can save them to files using the standard MATLAB function save,
for example:
>> [alpha,beta] = ldamain('train',20);
number of documents = 100
number of words = 1324
number of latent classes = 20
iteration 31/100.. PPL = 132.093 ETA: 0:00:15 (0 sec/step)
converged.
>> save('alpha.dat', 'alpha', '-ascii');
>> save('beta.dat', 'beta', '-ascii');
C version
If you invoke lda as follows,
% lda -N 100 train model
two files "model.alpha" and "model.beta" are created.
These two files have exactly the same format as those saved
from MATLAB: "model.alpha" is a space-separated N-dimensional
vector of \alpha, and "model.beta" is a space-separated V x N
matrix of \beta.
These parameters can be loaded into MATLAB in the standard way:
>> beta = load('model.beta');
You can then manipulate these parameters within MATLAB, for example as
sketched below.
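For instance, the following sketch (my own illustration, not part of the
package) prints the ten most probable words of each latent class, assuming
the lexicon file has one word per line in feature id order:

beta = load('model.beta');             % V x N matrix, beta(v,n) = p(v|n)
lex  = textread('train.lex', '%s');    % lexicon: feature id -> word
for n = 1:size(beta,2)
    [p,idx] = sort(-beta(:,n));        % order words by descending probability
    fprintf('class %d:', n);
    fprintf(' %s', lex{idx(1:10)});    % ten most probable words
    fprintf('\n');
end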
Posterior inference
Some users have asked for posterior inference.
I deliberately did not include any posterior inference functions,
in order to urge readers to follow the derivations in the original paper.
It is very easy once the optimal \alpha and \beta have been obtained,
and can be written in various languages.
When you do so, the files vbem.c (C) and vbem.m (MATLAB) in the package
will help you.
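As a rough guide, here is a minimal MATLAB sketch of that computation,
the variational E step of (Blei et al., 2001), for a single document;
the function name posterior and its interface are mine, not the package's:

function gamma = posterior(alpha, beta, w, c)
% POSTERIOR  Variational Dirichlet posterior over latent classes for one
% document under the model (alpha,beta).  alpha is 1 x N, beta is V x N,
% w holds the document's feature ids, and c holds their counts.
N = length(alpha);
gamma = alpha + sum(c) / N;                       % initialize gamma
for j = 1:20                                      % inner VB-EM, cf. demmax
    q = beta(w,:) .* repmat(exp(psi(gamma)), length(w), 1);
    q = q ./ repmat(sum(q,2), 1, N);              % normalize phi over classes
    gamma = alpha + c(:)' * q;                    % Dirichlet parameter update
end

Normalizing the result, gamma / sum(gamma), gives the expected class
proportions of the document; a convergence check, as in vbem.m, can replace
the fixed iteration count.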
References
- David M. Blei, Andrew Y. Ng, and Michael I. Jordan.
  Latent Dirichlet Allocation.
  Advances in Neural Information Processing Systems 14, 2001.
- David M. Blei, Andrew Y. Ng, and Michael I. Jordan.
  Latent Dirichlet Allocation.
  Journal of Machine Learning Research, vol. 3, pp. 993--1022, 2003.
- Daichi Mochihashi.
  A note on a Variational Bayes derivation of full Bayesian Latent
  Dirichlet Allocation.
  Unpublished manuscript, 2004.
Acknowledgements
The digamma and trigamma codes are by
Thomas Minka.
Thanks to Taku Kudo for his plsi tool,
which was a good reference during the development of lda.
Note
Almost at the same time(!), Blei
himself released the
LDA-C package, written in C, to the public.
According to some experiments,
our package runs about 4 to 10 times as fast as his
(this may be due to reusing preallocated buffers in VB-EM),
and it also provides a MATLAB version for easy experimentation.
Of course, the package by the original author is more desirable,
so I hope both will prove equally useful.
daichi <at> cslab.kecl.ntt.co.jp
Last modified: Wed Jan 16 17:18:55 2013