lda.py: LDA in Python.

Daichi Mochihashi
The Institute of Statistical Mathematics, Tokyo
$Id: index.html,v 1.3 2018/12/09 16:14:16 daichi Exp $

lda.py is a Python/Cython implementation of a standard Gibbs sampler for latent Dirichlet allocation (Blei+, 2003). The package is intended mainly for learning and extension; however, since it is written in Cython, it runs much faster than a pure Python implementation and is thus amenable to medium-sized data.
This package is based on part of the code by Ryan P. Adams (Princeton).

Download

Requirements

Install

First, build the Cython module "ldac.so" by simply typing
% make
which executes "./setup.py build_ext --inplace". That's all!

Usage

% ./lda.py -h
usage: % lda.py OPTIONS train model
OPTIONS
 -K topics  number of topics in LDA
 -N iters   number of Gibbs iterations (default 1)
 -a alpha   Dirichlet hyperparameter on topics (default auto)
 -b beta    Dirichlet hyperparameter on words (default auto)
 -h         displays this help
$Id: lda.py,v 1.14 2018/11/17 03:01:53 daichi Exp 
"train" is a data file in the same format as lda or SVMlight (see below), and "model" is the name of the model file to be generated.

Getting started

You can start using lda.py as in the following example, with the training data file "train" enclosed in this package:
% ./lda.py -K 10 -N 100 train model
LDA: K = 10, iters = 100, alpha = 5, beta = 0.01
loading data.. documents = 100, lexicon = 1325, nwords = 16054
initializing..
Gibbs iteration [100/100] PPL = 712.9421
saving model..
done.
This generates a model file "model" in a gzipped pickle format. It can be loaded as follows:
import gzip
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3 has no cPickle module
with gzip.open("model", "rb") as gf:
    model = pickle.load(gf)
Then "model['alpha']" is the scalar alpha parameter used for training, "model['beta']" is a V x K matrix of the beta parameter, and "model['theta']" is an N x K matrix of the estimated topic proportions of each training document. Here V is the lexicon size, N is the number of documents, and K is the number of topics.
In addition to the model file "model", it also yields a log file "model".log in the same directory.
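As a minimal sketch of how the loaded matrices might be inspected, the following lists the highest-weighted word ids of each topic from a V x K matrix such as "model['beta']", and the dominant topic of each document from an N x K matrix such as "model['theta']". The function names and the toy matrices are illustrative and not part of lda.py; the code only assumes that the matrices can be indexed as m[row][column], which holds for nested lists and NumPy arrays alike.

```python
def top_words_per_topic(beta, n_top=5):
    """For each topic k, return the ids of the n_top words with the largest beta[v][k]."""
    n_words = len(beta)
    n_topics = len(beta[0])
    result = []
    for k in range(n_topics):
        # Rank word ids by their weight in column k, largest first.
        ranked = sorted(range(n_words), key=lambda v: beta[v][k], reverse=True)
        result.append(ranked[:n_top])
    return result


def dominant_topics(theta):
    """For each document (row of theta), return the index of its largest topic proportion."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in theta]


# Toy stand-ins for model['beta'] (V = 4, K = 2) and model['theta'] (N = 3, K = 2).
toy_beta = [
    [0.70, 0.05],
    [0.20, 0.05],
    [0.05, 0.30],
    [0.05, 0.60],
]
toy_theta = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
]

top = top_words_per_topic(toy_beta, n_top=2)
dominant = dominant_topics(toy_theta)
print(top)       # [[0, 1], [3, 2]]
print(dominant)  # [0, 1, 0]
```

With a real model, the same calls would take "model['beta']" and "model['theta']" in place of the toy matrices. Mapping word ids back to words requires the lexicon used to build the training file, which is not among the keys described above.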

Data formats

The training data format is almost the same as that of lda or SVMlight, but without a label on each line. A typical data file looks as follows:
0:1 2:4 5:2
1:2 3:3 5:1 6:1 7:1
2:4 5:1 7:1
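The example above can be read with a few lines of Python. This is only a sketch, assuming that each line is one document and that each token is a "word_id:count" pair; the function name is hypothetical and the code is not taken from lda.py itself.

```python
def parse_document(line):
    """Parse one line of 'word_id:count' tokens into a {word_id: count} dict."""
    doc = {}
    for token in line.split():
        word_id, count = token.split(":")
        doc[int(word_id)] = int(count)
    return doc


# The three example lines shown above.
sample = """0:1 2:4 5:2
1:2 3:3 5:1 6:1 7:1
2:4 5:1 7:1"""

# Skip blank lines, parse every remaining line as one document.
docs = [parse_document(line) for line in sample.splitlines() if line.strip()]

n_documents = len(docs)
n_tokens = sum(sum(doc.values()) for doc in docs)  # total word occurrences
print(n_documents, n_tokens)  # 3 21
```

To read a real file, the same list comprehension can iterate over an open file object instead of "sample.splitlines()".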

daichi<at>ism.ac.jp
Last modified: Tue Jun 20 20:14:21 2023