lda, a Latent Dirichlet Allocation package.

Daichi Mochihashi
NTT Communication Science Laboratories
$Id: index.html,v 1.3 2004/12/04 12:47:35 daiti-m Exp $

Overview

lda is a Latent Dirichlet Allocation (Blei et al., 2001) package written in both MATLAB and C (command line interface).
This package provides only the standard variational Bayes estimation that was originally proposed, but it uses a simple textual data format that is almost the same as that of SVMlight or TinySVM.
This package can be used as an aid to understanding LDA, or simply as a regularized alternative to PLSI, which suffers from severe overfitting because of its maximum likelihood formulation.
Advanced users who wish to benefit from more recent results may consider npbayes or MPCA, although their data formats differ from the one used here.

Requirements

C version:

ANSI C compiler.
The package is confirmed to compile on the following systems.
- Linux 2.4.20, Redhat 9, gcc 3.2.2
- Linux 2.6.5, Fedora core release 2, gcc 3.3.3
- FreeBSD 4.8-STABLE, gcc 2.95.4 (GNU make)
- SunOS 5.8, gcc 2.95.3 (GNU make)

MATLAB version:

A MATLAB environment. The Statistics Toolbox may be needed for the psi() function (if it is not installed, consider using Minka's Lightspeed MATLAB toolbox).
Octave is not supported.

Install

C version

Take a glance at the Makefile, then type make.
The C version is not intended for users who are unfamiliar with C.
The Makefile and source files are very simple, so you can modify them as needed if the package does not compile. (If you find severe problems, please contact the author.)

MATLAB version

Simply add the directory where you unpacked the *.m files to the MATLAB path. For example:
% cd ~/work
% tar xvfz lda-0.2-matlab.tar.gz
% cd lda-0.2-matlab/
% matlab
>> addpath ~/work/lda-0.2-matlab

Download

C version: lda-0.2.tar.gz
MATLAB version: lda-0.2-matlab.tar.gz

Performance

The C version runs about 8 times or more faster than the MATLAB version (even though the MATLAB code is fully vectorized).
However, the MATLAB version works entirely within the MATLAB environment, so it is easy to investigate and manipulate the parameters (especially graphically, using plot or surf).
Moreover, the MATLAB code is simple and easy to understand.
To estimate the parameters of a 50-class LDA decomposition of the standard Cranfield collection (1397 documents, 5177 unique terms),
- the C version took 1 minute 32 seconds,
- the MATLAB version took 38 minutes 55 seconds,
on a Xeon 2.8GHz (a ratio of about 25 in this experiment).
Both versions are memory-efficient: in the experiment above, they used only 6.8MB (C) and 29MB (MATLAB) of memory.

Getting Started

This package contains a sample data file "train", which was compiled from the first 100 documents of the Cranfield collection.
Each feature id corresponds to the line with the same number in the file "train.lex"; that is, feature 20 denotes the word "accuracy", feature 21 denotes "accurate", and so on.
After compilation, you can test the package on the "train" data as follows.

C version:
% lda -N 20 train model

MATLAB version:
% matlab
>> [alpha,beta] = ldamain('train',20);

The C version creates two files, "model.alpha" and "model.beta"; the MATLAB version returns a 1x20 vector alpha and a 1324x20 matrix beta.
Parameters of the resulting models are explained in the sections below.

Data Format

The data format is common to both the C and MATLAB versions. It is almost the same as the widely used SVMlight format, except that there is no label, since LDA is an unsupervised method.

A data file is an ASCII text file in which each line represents a document (NB: "document" is simply a synonym for "a group of data", so you can interpret it freely as any such group).
A typical data file looks as follows:

1:1 2:4 5:2
1:2 3:3 5:1 6:1 7:1
2:4 5:1 7:1

Each line can be at most 65535 bytes long (about 820 lines of 80-column text) by default. For a standard document this is sufficient, but if you wish to raise the limit, modify BUFSIZE in feature.c.
Each line consists of <feature_id>:<count> pairs. Here, feature_id is an integer starting from 1 (the same as in SVMlight); count is an integer or a real number and must be positive.
<feature_id>:<count> pairs are separated by (possibly multiple) white spaces. The program is written to work even if the file contains empty lines, but it is preferable to avoid such unnecessary lines.
For a complete specification, please refer to SVMlight's page.
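
To make the format concrete, here is a minimal C sketch of a reader for one document line. It is not the parser in feature.c; the names pair_t and parse_line are hypothetical, and the strictness rules (rejecting ids below 1 and non-positive counts) are assumptions drawn from the description above.

```c
#include <stdlib.h>

/* One <feature_id>:<count> pair (hypothetical type, not from the package). */
typedef struct {
    int id;        /* feature id, starting from 1 */
    double count;  /* positive integer or real count */
} pair_t;

/*
 * Parse one document line such as "1:1 2:4 5:2".
 * Stores at most `max` pairs in `out` and returns the number stored,
 * or -1 if a malformed pair is encountered. An empty line yields 0.
 */
int parse_line(const char *line, pair_t *out, int max)
{
    const char *p = line;
    int n = 0;

    while (n < max) {
        char *end;
        long id;
        double cnt;

        /* skip (possibly multiple) white spaces between pairs */
        while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r')
            p++;
        if (*p == '\0')
            break;

        /* feature id: an integer >= 1 followed by ':' */
        id = strtol(p, &end, 10);
        if (end == p || *end != ':' || id < 1)
            return -1;
        p = end + 1;

        /* count: a positive integer or real number */
        cnt = strtod(p, &end);
        if (end == p || cnt <= 0.0)
            return -1;
        p = end;

        out[n].id = (int)id;
        out[n].count = cnt;
        n++;
    }
    return n;
}
```

Called on the first line of the example file, parse_line("1:1 2:4 5:2", buf, 8) fills three pairs; pairs beyond `max` are silently ignored in this sketch.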

Command Line Syntax

C version

lda is typically invoked simply as:

% lda -N 100 train model

where % is the command prompt.
train is a data file in the format described above, and model is the basename of the output files for the model parameters. Specifically, lda writes two output files, model.alpha and model.beta, which represent \alpha and \beta of LDA as described in (Blei et al., 2001); that is, \alpha is the parameter of the prior Dirichlet distribution over the latent classes, and \beta is the set of unigram distributions, one for each latent class.
-N 100 specifies the number of latent classes to assume in the data. For the standard LDA model, this is the only parameter that must be provided in advance. In this case, 100 latent classes are assumed.

In addition, there are several rarely used options:

% lda -h
lda, a Latent Dirichlet Allocation package.
Copyright (c) 2004 Daichi Mochihashi, All rights reserved.
usage: lda -N classes [-I emmax -D demmax -E epsilon] train model

-I emmax: Maximum number of iterations of the outer VB-EM algorithm, which exits earlier upon convergence. (default 100)
-D demmax: Maximum number of iterations of the inner VB-EM algorithm for each document, which exits earlier upon convergence. (default 20)
-E epsilon: Threshold for judging overall convergence of the estimation: iteration stops when the relative increase in the total data likelihood falls below this value. (default 0.0001)
-h: displays help.
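
The interplay of -I and -E can be sketched as follows. This is only an illustration of a relative-increase stopping rule under stated assumptions: the exact formula used inside lda is not given here, and the names converged and outer_iterations, as well as the idea of passing in a precomputed array of likelihoods, are hypothetical.

```c
/*
 * Return 1 when the relative increase from prev to curr is below epsilon.
 * Assumption: the increase is normalized by |prev|, so the test also
 * works for (negative) log-likelihood values.
 */
static int converged(double prev, double curr, double epsilon)
{
    double scale = prev < 0.0 ? -prev : prev;

    if (scale == 0.0)
        return curr <= prev;  /* no meaningful relative change */
    return (curr - prev) / scale < epsilon;
}

/*
 * lik[i] is the total data likelihood after outer iteration i + 1,
 * for n recorded iterations. Returns how many outer iterations would
 * run: the loop stops at the first iteration whose relative increase
 * is below epsilon, or after emmax iterations, whichever comes first.
 */
int outer_iterations(const double *lik, int n, int emmax, double epsilon)
{
    int limit = n < emmax ? n : emmax;
    int i;

    for (i = 1; i < limit; i++)
        if (converged(lik[i - 1], lik[i], epsilon))
            return i + 1;
    return limit;
}
```

With the defaults (emmax 100, epsilon 0.0001), a run whose likelihood barely moves between two consecutive iterations stops there instead of using all 100 iterations.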

MATLAB version

First, you must load a data file into a MATLAB data structure:

% matlab
>> d = fmatrix('train');

Then run the function lda to estimate the parameters. The second argument is the number of latent classes that you assume (20 in the example below).

>> help lda
  Latent Dirichlet Allocation, standard model.
  Copyright (c) 2004 Daichi Mochihashi, all rights reserved.
  $Id: index.html,v 1.3 2004/12/04 12:47:35 daiti-m Exp $
  [alpha,beta] = lda(d,k,[emmax,demmax])
  d      : data of documents
  k      : # of classes to assume
  emmax  : # of maximum VB-EM iteration (default 100)
  demmax : # of maximum VB-EM iteration for a document (default 20)
>> [alpha,beta] = lda(d,20);

Two optional parameters, emmax and demmax, can be passed to lda; they have the same meaning as in the C version.
If you find it troublesome to load text data into a MATLAB structure in advance, there is a wrapper function ldamain that works exactly like the C version:

>> [alpha,beta] = ldamain('train.dat',20);

Output Format

MATLAB version

In the example above, alpha is an N-dimensional row vector of \alpha for the corresponding latent topics, and beta is a [V,N]-dimensional matrix of \beta where beta(v,n) = p(v|n) (n = 1 .. N, v = 1 .. V; V is the size of the lexicon).
You can save them to files using the standard MATLAB function save, for example:

>> [alpha,beta] = ldamain('train',20);
number of documents      = 100
number of words          = 1324
number of latent classes = 20
iteration 31/100..      PPL = 132.093   ETA: 0:00:15 (0 sec/step)
converged.
>> save('alpha.dat', 'alpha', '-ascii');
>> save('beta.dat', 'beta', '-ascii');

C version

If you invoke lda as follows,

% lda -N 100 train model

two files, "model.alpha" and "model.beta", are created.
These two files have exactly the same format as those saved from MATLAB: "model.alpha" is a space-separated N-dimensional vector of \alpha, and "model.beta" is a space-separated V x N matrix of \beta.
These parameters can be loaded into MATLAB in the standard way:

>> beta = load('model.beta');

And you can manipulate these parameters within MATLAB.
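
The same files can also be read from C. The following is a minimal sketch, assuming that the caller already knows V and N and that "model.beta" holds V rows of N space-separated numbers as described above; read_beta and top_feature are hypothetical helper names, not functions of the package.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Read a space-separated V x N matrix from fp into a newly allocated
 * row-major array, so that beta[v * N + n] corresponds to beta(v+1, n+1).
 * Returns NULL on allocation failure or if the file has too few numbers.
 * The caller frees the result.
 */
double *read_beta(FILE *fp, int V, int N)
{
    double *beta = malloc((size_t)V * (size_t)N * sizeof *beta);
    int i;

    if (beta == NULL)
        return NULL;
    for (i = 0; i < V * N; i++) {
        if (fscanf(fp, "%lf", &beta[i]) != 1) {
            free(beta);
            return NULL;
        }
    }
    return beta;
}

/*
 * Return the 1-based feature id with the largest value in column n
 * (0-based topic index), i.e. the most probable word of that topic.
 */
int top_feature(const double *beta, int V, int N, int n)
{
    int best = 0;
    int v;

    for (v = 1; v < V; v++)
        if (beta[v * N + n] > beta[best * N + n])
            best = v;
    return best + 1;
}
```

The id returned by top_feature can then be looked up as a line number in the lexicon file (for the sample data, "train.lex") to obtain the word itself.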

Posterior inference

Some users have asked for posterior inference. I have deliberately not included any posterior inference functions, in order to urge the reader to follow the derivations in the original paper.
Inference is very easy once the optimal \alpha and \beta have been obtained, and it can be written in various languages. When doing so, the vbem.c (C) and vbem.m (MATLAB) files in the package will be helpful.

References

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Neural Information Processing Systems 14, 2001.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
Daichi Mochihashi. A note on a Variational Bayes derivation of full Bayesian Latent Dirichlet Allocation. Unpublished manuscript, 2004.

Acknowledgements

The digamma and trigamma code is by Thomas Minka.
Thanks to Taku Kudo for his plsi tool, which has been a good reference for the development of lda.

Note

Almost at the same time(!), Blei himself released LDA-C, a package written in C, to the public.
According to some experiments, our package runs about 4 to 10 times as fast as his (this may be due to reusing preallocated buffers in VB-EM), and it also provides a MATLAB version for easy experimentation.
Of course, the package by the original author is the more desirable one, so I hope that both prove equally useful.

daichi <at> cslab.kecl.ntt.co.jp

Last modified: Wed Jan 16 17:18:55 2013