lightlda.sh: a simple wrapper script for LightLDA.
Daichi Mochihashi
The Institute of Statistical Mathematics
$Id: index.html,v 1.3 2018/10/23 03:09:46 daichi Exp $
Introduction
lightlda.sh is a package containing a wrapper script for
LightLDA
(Yuan+, WWW 2015)
so that it can be used easily.
LightLDA is a very useful program for efficiently training huge topic models,
thanks to a number of mathematical and algorithmic techniques.
However, it employs its own specific data formats for both the input and the output:
lightlda.sh wraps them so that standard input and output formats can be used.
Content
- lightlda.sh
- lightlda2model.py
- svmlight.py
- train (sample data)
Download: lightlda.sh-0.1.tar.gz
Install
Usage
Using lightlda.sh is simple:
% ./lightlda.sh
lightlda.sh: wrapper to execute LightLDA in a standard way.
usage: lightlda.sh K iters train model [alpha] [beta]
$Id: lightlda.sh,v 1.10 2018/10/23 02:57:28 daichi Exp
General usage:
% lightlda.sh K iters train model [alpha] [beta]
- K is the number of topics to assume (e.g. 1000)
- iters is the number of MCMC iterations (e.g. 5000)
- train is a data file in the standard format (described below)
- model is the name of a directory in which to store the trained model
- (optional) alpha is the hyperparameter of the document-topic Dirichlets
(default 0.1)
- (optional) beta is the hyperparameter of the topic-word Dirichlets
(default 0.01)
Currently, this script does not support distributed training of LightLDA:
for advanced usage, please use the original program directly.
Example
For example, using the sample "train" data file included in this package,
you can run lightlda.sh as follows:
% ./lightlda.sh 10 100 train model
alpha = 0.1 beta = 0.01 topics = 10 iters = 100
preparing data at model ..
There are totally 1324 words in the vocabulary
There are maximally totally 16054 tokens in the data set
The number of tokens in the output block is: 16054
Local vocab_size for the output block is: 1323
Elapsed seconds for dump blocks: 0.0150371
docs = 200
vocab = 1400
size = 1
executing LightLDA ..
[INFO] [2018-10-23 12:07:46] INFO: block = 0, the number of slice = 1
[INFO] [2018-10-23 12:07:46] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-10-23 12:07:46] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
...
[INFO] [2018-10-23 12:08:46] Server 0: Dump model...
[INFO] [2018-10-23 12:08:46] Server 0 closed.
[INFO] [2018-10-23 12:08:47] Rank 0/1: Multiverso closed successfully.
converting to standard parameters..
reading model..
saving model..
done.
finished.
Data formats
The training data format is almost the same as that of
lda or SVMlight,
but without a label on each line.
A typical data file looks like this:
1:1 2:4 5:2
1:2 3:3 5:1 6:1 7:1
2:4 5:1 7:1
- Each line consists of pairs of <feature_id>:<count>.
Here, feature_id is an integer starting from 0, and
count is a positive integer.
- <feature_id>:<count> pairs are separated by (possibly
multiple) white spaces.
The program is coded to work even if there are empty lines, but
it is preferable not to include such unnecessary lines.
- For a complete specification, please refer to SVMlight's page.
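As a rough illustration, a corpus held as dictionaries mapping feature ids to
counts could be written in this format with a few lines of Python.
This is only a sketch for explanation; the function name and data layout below
are examples and are not part of this package:

# illustrative sketch: write documents (dicts of feature_id -> count)
# into the SVMlight-like format expected by lightlda.sh
def write_train (docs, filename):
    with open (filename, "w") as f:
        for doc in docs:
            pairs = ["%d:%d" % (fid, cnt) for fid, cnt in sorted (doc.items())]
            f.write (" ".join (pairs) + "\n")

# the three example documents shown above
docs = [{1:1, 2:4, 5:2},
        {1:2, 3:3, 5:1, 6:1, 7:1},
        {2:4, 5:1, 7:1}]
write_train (docs, "train")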
The output is stored in the model directory; in addition to the outputs of
LightLDA, a file "model" is created there for convenience.
"model" is gzipped pickled data that can be loaded in Python
as follows:
import gzip
import cPickle as pickle
with gzip.open ("model", "rb") as gf:
    model = pickle.load (gf)
Then "model['alpha']" is a scalar alpha parameter
used for training, and "model['beta']" is VxK matrix of beta parameter,
and "model['gamma']" is a NxK matrix of Dirichlet posteriors for each of the
training documents.
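For example, assuming that the rows of model['beta'] correspond to feature ids
and its columns to topics (i.e. the VxK layout above), the most strongly
weighted features of each topic and the topic proportions of a document can be
inspected roughly as follows. This is only an illustrative sketch, not part of
the package, and assumes numpy is available:

import gzip
import numpy as np
try:
    import cPickle as pickle   # Python 2, as in the snippet above
except ImportError:
    import pickle              # Python 3 (a Python-2 pickle may need encoding='latin1')

with gzip.open ("model", "rb") as gf:
    model = pickle.load (gf)

beta  = np.asarray (model['beta'])   # VxK topic-word parameters
gamma = np.asarray (model['gamma'])  # NxK document-topic Dirichlet posteriors

# ten most weighted feature ids for each topic
for k in range (beta.shape[1]):
    top = np.argsort (beta[:,k])[::-1][:10]
    print ("topic %d: %s" % (k, " ".join (str(v) for v in top)))

# normalized topic proportions of the first document
theta = gamma[0] / gamma[0].sum()
print ("document 0: %s" % theta)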
daichi<at>ism.ac.jp
Last modified: Tue Oct 23 12:13:20 2018