lda.py is a Python/Cython implementation of standard Gibbs sampling
for latent Dirichlet allocation (Blei+, 2003).
This package is intended mainly for learning and extension; however,
since it is written in Cython, it runs much faster than a pure Python
implementation and is thus amenable to medium-sized data.
This package is based in part on code by Ryan P. Adams (Princeton).
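For reference, collapsed Gibbs sampling for LDA resamples the topic of
each word token from its conditional distribution given all other
assignments. The sketch below shows this standard update in plain
Python/numpy; the count arrays n_dk, n_kw, n_k and the function name
are illustrative, not the package's actual Cython code:

import numpy as np

def resample_token (d, w, z_old, n_dk, n_kw, n_k, alpha, beta):
    # remove this token's current assignment from the count arrays
    n_dk[d,z_old] -= 1
    n_kw[z_old,w] -= 1
    n_k[z_old]    -= 1
    # p(z = k | rest) is proportional to
    # (n_dk[d,k] + alpha) * (n_kw[k,w] + beta) / (n_k[k] + V * beta)
    V = n_kw.shape[1]
    p = (n_dk[d] + alpha) * (n_kw[:,w] + beta) / (n_k + V * beta)
    z_new = np.random.choice (len(p), p = p / p.sum())
    # restore the counts with the newly sampled topic
    n_dk[d,z_new] += 1
    n_kw[z_new,w] += 1
    n_k[z_new]    += 1
    return z_new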
% make

This executes "./setup.py build_ext --inplace". That's all!
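(For reference, a minimal setup.py that builds a Cython extension this
way might look like the sketch below; the actual setup.py shipped with
the package will differ, and the file name lda_c.pyx is illustrative.)

from setuptools import setup
from Cython.Build import cythonize

# "lda_c.pyx" is an illustrative file name, not the package's actual module
setup (ext_modules = cythonize ("lda_c.pyx"))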
% ./lda.py -h
usage: % lda.py OPTIONS train model
OPTIONS
 -K topics  number of topics in LDA
 -N iters   number of Gibbs iterations (default 1)
 -a alpha   Dirichlet hyperparameter on topics (default auto)
 -b beta    Dirichlet hyperparameter on words (default auto)
 -h         displays this help
$Id: lda.py,v 1.14 2018/11/17 03:01:53 daichi Exp $

"train" is a data file whose format is the same as that of lda or
SVMlight (see below), and "model" is the name of the model file to be
generated.
% ./lda.py -K 10 -N 100 train model
LDA: K = 10, iters = 100, alpha = 5, beta = 0.01
loading data.. documents = 100, lexicon = 1325, nwords = 16054
initializing..
Gibbs iteration [100/100] PPL = 712.9421
saving model.. done.

This generates a model file "model" in a gzipped pickle format.
It can be loaded as follows:
import gzip
import cPickle as pickle  # on Python 3, use the standard pickle module
with gzip.open ("model", "rb") as gf:
    model = pickle.load (gf)

Then "model['alpha']" is the scalar alpha parameter used for training,
"model['beta']" is a V x K matrix of the beta parameters, and
"model['theta']" is an N x K matrix of the estimated topic proportions
of the training documents.
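For example, the loaded model can be inspected as in the sketch below.
It assumes only the matrix shapes described above; vocab (a list
mapping word ids to word strings) is a hypothetical helper, not
something the package necessarily provides:

import numpy as np

# most probable topic of each training document (theta is N x K)
print (np.argmax (model['theta'], axis = 1))

# top 10 words of topic k (beta is V x K); vocab is hypothetical
k = 0
for v in np.argsort (- model['beta'][:,k])[:10]:
    print ("%s %.4f" % (vocab[v], model['beta'][v,k]))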
The data format is the same as that of SVMlight: each line represents
one document as a sequence of "word_id:count" pairs separated by
spaces. For example:

0:1 2:4 5:2
1:2 3:3 5:1 6:1 7:1
2:4 5:1 7:1
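Such files can be parsed in a few lines of Python; the function below
(load_data is an illustrative name, not part of the package) returns
each document as a list of (word_id, count) pairs:

def load_data (file):
    # returns a list of documents, each a list of (word_id, count) pairs
    docs = []
    with open (file) as fh:
        for line in fh:
            doc = []
            for token in line.split():
                v, c = token.split (':')
                doc.append ((int(v), int(c)))
            docs.append (doc)
    return docs

For the three-line example above, load_data yields three documents
containing 3, 5, and 3 distinct words, respectively.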