LDAb.py: Latent Dirichlet Allocation with a background distribution.

Daichi Mochihashi
The Institute of Statistical Mathematics, Tokyo
$Id: index.html,v 1.2 2020/08/03 12:07:07 daichi Exp $

LDAb.py is a Cython implementation of LDA with automatic estimation of a background distribution (i.e. function words), as described in [1] (note that [1] lacks the sampling details needed for implementation).
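
As a rough sketch of the model (our reading of [1]; the per-document switch probability \lambda_d is an assumption, as this package does not document how the switch is drawn), each word token is generated either from a single shared background distribution or from an ordinary LDA topic:

  \theta_d ~ Dir(\alpha),   \beta_k ~ Dir(\beta),   \Omega ~ Dir(\eta)
  x_{di}   ~ Bernoulli(\lambda_d)
  w_{di}   ~ Mult(\Omega)                                      if x_{di} = 0   (background)
  z_{di}   ~ Mult(\theta_d),  w_{di} ~ Mult(\beta_{z_{di}})    if x_{di} = 1   (topic)

where \alpha, \beta and \eta correspond to the -a, -b and -e options below.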

Requirements

All requirements are satisfied by a standard installation of Anaconda 3.

Download

Install

Just type 'make' in this directory. After some warnings, a *.so file is generated by Cython.

How to start

Type 'ldab.py --help' for command-line usage.
% ldab.py --help
LDAb: LDA with a background distribution.
$Id: ldab.py,v 1.1 2020/08/02 14:03:49 daichi Exp
usage: % ldab.py OPTIONS train.txt model [stopwords]
OPTIONS
 -K topics  number of topics in LDA
 -N iters   number of Gibbs iterations (default 1)
 -t freq    threshold of word frequency (default 10)
 -a alpha   Dirichlet hyperparameter on topics (default 50/K)
 -b beta    Dirichlet hyperparameter on words (default 0.01)
 -e eta     Dirichlet hyperparameter on background (default 0.005)
 -h         displays this help
To start, try an initial run with one of the sample texts contained in this package (cran.txt in English, knbc.txt in Japanese):
% ldab.py -K 10 -N 50 cran.txt model stopwords.en
or
% ldab.py -K 20 -N 200 -t 5 knbc.txt model stopwords.ja
where "stopwords.en" or "stopwords.ja" are optional.

To see the learned model "model", use "viewtopic.py" as:

% viewtopic.py model
For complete usage, just type "viewtopic.py".
viewtopic.py lists words by probability for the background distribution, and by normalized PMI [2] for the ordinary topics. To change this behavior, take a look at the code of viewtopic.py.
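
For reference, the normalized PMI of a word v and topic k is npmi(v,k) = pmi(v,k) / (-log p(v,k)), where pmi(v,k) = log p(v,k) - log p(v) - log p(k) [2]. Below is a minimal sketch of this criterion computed from a VxK matrix of word-topic counts (an illustration only; viewtopic.py may compute it differently):

import numpy as np

def npmi(nvk):
    # nvk : V x K matrix of word-topic counts
    N   = nvk.sum()
    pvk = nvk / N                           # joint p(v,k)
    pv  = pvk.sum(axis=1, keepdims=True)    # marginal p(v)
    pk  = pvk.sum(axis=0, keepdims=True)    # marginal p(k)
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi   = np.log(pvk) - np.log(pv) - np.log(pk)
        score = pmi / (- np.log(pvk))       # normalized to [-1, 1]
    return np.nan_to_num(score, nan=-1.0)   # zero-count pairs get the minimum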

Sample text

cran.txt (English)
Cranfield collection for information retrieval. http://ir.dcs.gla.ac.uk/resources/test_collections/cran/
knbc.txt (Japanese)
Kyoto University Blog corpus. http://nlp.ist.i.kyoto-u.ac.jp/kuntt/

Data format

ldab.py is an unsupervised algorithm that requires only raw text of documents. Documents are separated by blank line(s), and words are separated by whitespace. It does not touch the underlying text: it neither lowercases words nor normalizes numerals. If you wish to handle these properly, please preprocess the input beforehand.
If you have data where each line represents a document, double-space it by:
% sed G input.txt > output.txt
then use "output.txt" for learning.

Output

"model" is a gzipped pickle file of Python dictionary, which have keys:
'alpha'
scalar of Dirichlet prior over \theta
'beta'
VxK matrix of \beta parameter of LDA, i.e. {p(v|k)}
'eta'
scalar of Dirichlet prior over background distribution
'theta'
NxK matrix of \theta posterior for each document
'background'
Vx1 vector of background distribution
'stopwords'
Vx1 vector of indicators of hinted background words (1/0)
'lexicon'
dictionary mapping each word to its id
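
Besides viewtopic.py, the saved model can be inspected directly. A minimal sketch, assuming the key layout above, that 'beta' is a NumPy array, and that word ids run from 0 to V-1:

import gzip, pickle
import numpy as np

with gzip.open('model', 'rb') as f:        # "model" is the file written by ldab.py
    model = pickle.load(f)

beta    = model['beta']                    # V x K matrix of p(v|k)
lexicon = model['lexicon']                 # word -> id
words   = dict((v, w) for w, v in lexicon.items())   # id -> word

for k in range(beta.shape[1]):             # 10 most probable words of each topic
    top = np.argsort(beta[:, k])[::-1][:10]
    print('topic %d:' % k, ' '.join(words[v] for v in top))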

References

[1] Chaitanya Chemudugunta, Padhraic Smyth, and Mark Steyvers. "Modeling general and specific aspects of documents with a probabilistic topic model". NIPS 2005, pp. 241-248.
[2] Gerlof Bouma. "Normalized (pointwise) mutual information in collocation extraction". Proceedings of GSCL (2009): 31-40.


daichi@ism.ac.jp
Last modified: Mon Aug 3 21:06:27 2020