LDAb.py:
Latent Dirichlet Allocation with a background distribution.
Daichi Mochihashi
The Institute of Statistical Mathematics, Tokyo
$Id: index.html,v 1.2 2020/08/03 12:07:07 daichi Exp $
LDAb.py is a Cython implementation of LDA with automatic estimation of a
background distribution (i.e. function words), as described in [1]
(although [1] omits the necessary sampling details).
Requirements
- Python 3.x
- Numpy, Scipy
- Cython
All these requirements are satisfied by a standard installation of Anaconda 3.
Download
Install
Just type 'make' in this directory.
After some warnings, a *.so file is generated by Cython.
How to start
Type 'ldab.py --help' for the command-line usage.
% ldab.py --help
LDAb: LDA with a background distribution.
$Id: ldab.py,v 1.1 2020/08/02 14:03:49 daichi Exp
usage: % ldab.py OPTIONS train.txt model [stopwords]
OPTIONS
-K topics number of topics in LDA
-N iters number of Gibbs iterations (default 1)
-t freq threshold of word frequency (default 10)
-a alpha Dirichlet hyperparameter on topics (default 50/K)
-b beta Dirichlet hyperparameter on words (default 0.01)
-e eta Dirichlet hyperparameter on background (default 0.005)
-h displays this help
To start with, try an initial run with the sample texts contained in this package
(cran.txt in English and knbc.txt in Japanese):
% ldab.py -K 10 -N 50 cran.txt model stopwords.en
or
% ldab.py -K 20 -N 200 -t 5 knbc.txt model stopwords.ja
where "stopwords.en" or "stopwords.ja" are optional.
To see the learned model "model", use "viewtopic.py" as:
% viewtopic.py model
For the complete usage, just type "viewtopic.py".
viewtopic.py ranks words by their probability for the background
distribution, and by normalized PMI [2] for the ordinary topics. To change
this behavior, take a look at the code of viewtopic.py.
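For reference, the normalized PMI of [2] for a word v in topic k can be
sketched in a few lines of NumPy. This is only an illustrative sketch (the
variable names and the estimate of p(k) from 'theta' are assumptions), not
the actual ranking code of viewtopic.py:

import numpy as np

def npmi (beta, theta):
    # beta : V x K matrix of p(v|k), theta : N x K document-topic posteriors
    pk  = theta.mean (axis=0)               # rough estimate of p(k) (assumption)
    pvk = beta * pk                          # joint p(v,k)
    pv  = pvk.sum (axis=1, keepdims=True)    # marginal p(v)
    pmi = np.log (pvk / (pv * pk))           # log p(v,k) / (p(v) p(k))
    return pmi / (- np.log (pvk))            # normalized PMI of Bouma [2]

Words with higher values of this score within a topic are more specific to
that topic than globally frequent words.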
Sample text
- cran.txt (English)
- Cranfield collection for information retrieval.
http://ir.dcs.gla.ac.uk/resources/test_collections/cran/
- knbc.txt (Japanese)
- Kyoto University Blog corpus.
http://nlp.ist.i.kyoto-u.ac.jp/kuntt/
Data format
ldab.py is an unsupervised algorithm that requires only the raw text of
documents.
Documents are separated by blank line(s), and words are separated by
whitespace. The program does not normalize the text: it neither lowercases
words nor unifies numerals. If you wish to handle these properly, please
preprocess the input beforehand.
If you have data where each line represents a document, double-space it by:
% sed G input.txt > output.txt
then use "output.txt" for learning.
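If sed is not available, the same double-spacing can be done in Python
(a minimal sketch; the file names are just placeholders):

# double-space a one-document-per-line file (equivalent to "sed G")
with open ('input.txt') as fin, open ('output.txt', 'w') as fout:
    for line in fin:
        fout.write (line.rstrip('\n') + '\n\n')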
- Note: ldab.py uses only words whose frequency is >= the threshold (default 10).
To set this threshold, use the "-t" option.
- You can feed a list of words that should always be stopwords (i.e. belong
to the background distribution) as the last argument. Note that this is only
a hint, and the background distribution may also contain words that are not
specified in the list.
The program runs without any such hint, but generally yields better results
if you provide a candidate list of stopwords.
See "stopwords.en" or "stopwords.ja" for examples.
Output
"model" is a gzipped pickle file of Python dictionary, which have keys:
- 'alpha'
- scalar of Dirichlet prior over \theta
- 'beta'
- VxK matrix of \beta parameter of LDA, i.e. {p(v|k)}
- 'eta'
- scalar of Dirichlet prior over background distribution
- 'theta'
- NxK matrix of \theta posterior for each document
- 'background'
- Vx1 vector of background distribution
- 'stopwords'
- Vx1 vector of indicators of hinted background words (1/0)
- 'lexicon'
- dictionary to map each word to its id
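Because "model" is simply a gzipped pickle, it can also be loaded directly
from Python, for example to print the top words of a topic. This is a minimal
sketch based on the key layout above; the topic index and the number of words
shown are arbitrary:

import gzip, pickle
import numpy as np

with gzip.open ('model', 'rb') as f:
    model = pickle.load (f)

beta    = model['beta']      # V x K matrix of p(v|k)
lexicon = model['lexicon']   # word -> id
words   = { v: w for w, v in lexicon.items() }   # id -> word
k = 0                        # topic to inspect (arbitrary)
for v in np.argsort (- beta[:,k])[:10]:
    print (words[v], beta[v,k])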
References
[1] Chaitanya Chemudugunta, Padhraic Smyth, and Mark Steyvers. "Modeling
general and specific aspects of documents with a probabilistic topic model".
NIPS 2005, pp. 241-248.
[2] Gerlof Bouma. "Normalized (pointwise) mutual information in collocation
extraction". Proceedings of GSCL (2009): 31-40.
daichi@ism.ac.jp
Last modified: Mon Aug 3 21:06:27 2020