lightlda.sh: a simple wrapper script for LightLDA.
Daichi Mochihashi
The Institute of Statistical Mathematics
$Id: index.html,v 1.3 2018/10/23 03:09:46 daichi Exp $
Introduction
lightlda.sh is a package containing a wrapper script for
LightLDA
(Yuan+, WWW 2015)
so that it can be used easily.
LightLDA is a very useful program for efficiently training huge topic models,
thanks to a number of mathematical and algorithmic techniques.
However, it employs its own specific data formats for both the input and the output:
lightlda.sh wraps them so that standard input and output formats can be used.
Content
- lightlda.sh
- lightlda2model.py
- svmlight.py
- train (sample data)
Download: lightlda.sh-0.1.tar.gz
Install
Usage
Using lightlda.sh is simple:
% ./lightlda.sh
lightlda.sh: wrapper to execute LightLDA in a standard way.
usage: lightlda.sh K iters train model [alpha] [beta]
$Id: lightlda.sh,v 1.10 2018/10/23 02:57:28 daichi Exp
General usage:
% lightlda.sh K iters train model [alpha] [beta]
- K is the number of topics to assume (e.g. 1000)
- iters is the number of MCMC iterations (e.g. 5000)
- train is a data file in the standard format (described below)
- model is the name of a directory in which to store the trained model
- (optional) alpha is the hyperparameter of the document-topic Dirichlets
(default 0.1)
- (optional) beta is the hyperparameter of the topic-word Dirichlets
(default 0.01)
Currently, this script does not support distributed training of LightLDA:
for advanced usage, please use the original program directly.
Example
For example, using the sample "train" data file included in this package,
you can run lightlda.sh as follows:
% ./lightlda.sh 10 100 train model
alpha = 0.1 beta = 0.01 topics = 10 iters = 100
preparing data at model ..
There are totally 1324 words in the vocabulary
There are maximally totally 16054 tokens in the data set
The number of tokens in the output block is: 16054
Local vocab_size for the output block is: 1323
Elapsed seconds for dump blocks: 0.0150371
docs = 200
vocab = 1400
size = 1
executing LightLDA ..
[INFO] [2018-10-23 12:07:46] INFO: block = 0, the number of slice = 1
[INFO] [2018-10-23 12:07:46] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-10-23 12:07:46] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
...
[INFO] [2018-10-23 12:08:46] Server 0: Dump model...
[INFO] [2018-10-23 12:08:46] Server 0 closed.
[INFO] [2018-10-23 12:08:47] Rank 0/1: Multiverso closed successfully.
converting to standard parameters..
reading model..
saving model..
done.
finished.
Data formats
The training data format is almost the same as that of
lda or SVMlight,
but without a label on each line.
A typical data file looks like this:
1:1 2:4 5:2
1:2 3:3 5:1 6:1 7:1
2:4 5:1 7:1
- Each line consists of pairs of <feature_id>:<count>.
Here, feature_id is an integer starting from 0, and
count is a positive integer.
- <feature_id>:<count> pairs are separated by (possibly
multiple) white spaces.
The program is coded to work even if there are empty lines, but
it is preferable not to include such unnecessary lines.
- For a complete specification, please refer to SVMlight's page.
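As a rough illustration, a corpus held as dictionaries mapping feature ids to
counts could be written in this format with a few lines of Python.
This is only a sketch for explanation; the function name and data layout below
are examples and are not part of this package:

# illustrative sketch: write documents (dicts of feature_id -> count)
# into the SVMlight-like format expected by lightlda.sh
def write_train (docs, filename):
    with open (filename, "w") as f:
        for doc in docs:
            pairs = ["%d:%d" % (fid, cnt) for fid, cnt in sorted (doc.items())]
            f.write (" ".join (pairs) + "\n")

# the three example documents shown above
docs = [{1:1, 2:4, 5:2},
        {1:2, 3:3, 5:1, 6:1, 7:1},
        {2:4, 5:1, 7:1}]
write_train (docs, "train")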
The output is stored in the model directory; in addition to the outputs of
LightLDA, a file "model" is created there for convenience.
"model" is gzipped pickled data that can be loaded in Python
as follows:
import gzip
import cPickle as pickle
with gzip.open ("model", "rb") as gf:
    model = pickle.load (gf)
Then "model['alpha']" is a scalar alpha parameter
used for training, and "model['beta']" is VxK matrix of beta parameter,
and "model['gamma']" is a NxK matrix of Dirichlet posteriors for each of the
training documents.
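For example, assuming that the rows of model['beta'] correspond to feature ids
and its columns to topics (i.e. the VxK layout above), the most strongly
weighted features of each topic and the topic proportions of a document can be
inspected roughly as follows. This is only an illustrative sketch, not part of
the package, and assumes numpy is available:

import gzip
import numpy as np
try:
    import cPickle as pickle   # Python 2, as in the snippet above
except ImportError:
    import pickle              # Python 3 (a Python-2 pickle may need encoding='latin1')

with gzip.open ("model", "rb") as gf:
    model = pickle.load (gf)

beta  = np.asarray (model['beta'])   # VxK topic-word parameters
gamma = np.asarray (model['gamma'])  # NxK document-topic Dirichlet posteriors

# ten most weighted feature ids for each topic
for k in range (beta.shape[1]):
    top = np.argsort (beta[:,k])[::-1][:10]
    print ("topic %d: %s" % (k, " ".join (str(v) for v in top)))

# normalized topic proportions of the first document
theta = gamma[0] / gamma[0].sum()
print ("document 0: %s" % theta)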
daichi<at>ism.ac.jp
Last modified: Tue Oct 23 12:13:20 2018