lda.py is a Python/Cython implementation of standard Gibbs sampling
for latent Dirichlet allocation (Blei+, 2003).
This package is intended mainly for learning and extension; however,
since it is written in Cython, it runs much faster than a pure Python
implementation and is thus amenable to medium-sized data.
This package is based in part on code by Ryan P. Adams (Princeton).
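For reference, collapsed Gibbs sampling for LDA resamples the topic of
each word token from its conditional distribution given all other
assignments. The sketch below shows this standard update in plain
Python/numpy; the count arrays n_dk, n_kw, n_k and the function name
are illustrative, not the package's actual Cython code:

import numpy as np

def resample_token (d, w, z_old, n_dk, n_kw, n_k, alpha, beta):
    # remove this token's current assignment from the count arrays
    n_dk[d,z_old] -= 1
    n_kw[z_old,w] -= 1
    n_k[z_old]    -= 1
    # p(z = k | rest) is proportional to
    # (n_dk[d,k] + alpha) * (n_kw[k,w] + beta) / (n_k[k] + V * beta)
    V = n_kw.shape[1]
    p = (n_dk[d] + alpha) * (n_kw[:,w] + beta) / (n_k + V * beta)
    z_new = np.random.choice (len(p), p = p / p.sum())
    # restore the counts with the newly sampled topic
    n_dk[d,z_new] += 1
    n_kw[z_new,w] += 1
    n_k[z_new]    += 1
    return z_new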
% make

This executes "./setup.py build_ext --inplace". That's all!
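(For reference, a minimal setup.py that builds a Cython extension this
way might look like the sketch below; the actual setup.py shipped with
the package will differ, and the file name lda_c.pyx is illustrative.)

from setuptools import setup
from Cython.Build import cythonize

# "lda_c.pyx" is an illustrative file name, not the package's actual module
setup (ext_modules = cythonize ("lda_c.pyx"))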
% ./lda.py -h
usage: % lda.py OPTIONS train model
OPTIONS
 -K topics  number of topics in LDA
 -N iters   number of Gibbs iterations (default 1)
 -a alpha   Dirichlet hyperparameter on topics (default auto)
 -b beta    Dirichlet hyperparameter on words (default auto)
 -h         displays this help
$Id: lda.py,v 1.14 2018/11/17 03:01:53 daichi Exp $

"train" is a data file whose format is the same as that of lda or
SVMlight (see below), and "model" is the name of the model file to be
generated.
% ./lda.py -K 10 -N 100 train model
LDA: K = 10, iters = 100, alpha = 5, beta = 0.01
loading data.. documents = 100, lexicon = 1325, nwords = 16054
initializing..
Gibbs iteration [100/100] PPL = 712.9421
saving model.. done.

This generates a model file "model" in a gzipped pickle format.
It can be loaded as follows:
import gzip
import cPickle as pickle  # on Python 3, use the standard pickle module
with gzip.open ("model", "rb") as gf:
    model = pickle.load (gf)

Then "model['alpha']" is the scalar alpha parameter used for training,
"model['beta']" is a V x K matrix of the beta parameters, and
"model['theta']" is an N x K matrix of the estimated topic proportions
of the training documents.
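For example, the loaded model can be inspected as in the sketch below.
It assumes only the matrix shapes described above; vocab (a list
mapping word ids to word strings) is a hypothetical helper, not
something the package necessarily provides:

import numpy as np

# most probable topic of each training document (theta is N x K)
print (np.argmax (model['theta'], axis = 1))

# top 10 words of topic k (beta is V x K); vocab is hypothetical
k = 0
for v in np.argsort (- model['beta'][:,k])[:10]:
    print ("%s %.4f" % (vocab[v], model['beta'][v,k]))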
The data format is the same as that of SVMlight: each line represents
one document as a sequence of "word_id:count" pairs separated by
spaces. For example:

0:1 2:4 5:2
1:2 3:3 5:1 6:1 7:1
2:4 5:1 7:1
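Such files can be parsed in a few lines of Python; the function below
(load_data is an illustrative name, not part of the package) returns
each document as a list of (word_id, count) pairs:

def load_data (file):
    # returns a list of documents, each a list of (word_id, count) pairs
    docs = []
    with open (file) as fh:
        for line in fh:
            doc = []
            for token in line.split():
                v, c = token.split (':')
                doc.append ((int(v), int(c)))
            docs.append (doc)
    return docs

For the three-line example above, load_data yields three documents
containing 3, 5, and 3 distinct words, respectively.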