LDAb.py:
Latent Dirichlet Allocation with a background distribution.
Daichi Mochihashi
The Institute of Statistical Mathematics, Tokyo
$Id: index.html,v 1.2 2020/08/03 12:07:07 daichi Exp $
LDAb.py is a Cython implementation of LDA with automatic estimation of a
background distribution (i.e. function words), as described in [1]
(although [1] omits the necessary sampling details).
Requirements
- Python 3.x
- Numpy, Scipy
- Cython
All these requirements are satisfied by a standard installation of Anaconda 3.
Download
Install
Just type 'make' in this directory.
After some warnings, a *.so file is generated by Cython.
How to start
Type 'ldab.py --help' for the command-line usage.
% ldab.py --help
LDAb: LDA with a background distribution.
$Id: ldab.py,v 1.1 2020/08/02 14:03:49 daichi Exp
usage: % ldab.py OPTIONS train.txt model [stopwords]
OPTIONS
-K topics number of topics in LDA
-N iters number of Gibbs iterations (default 1)
-t freq threshold of word frequency (default 10)
-a alpha Dirichlet hyperparameter on topics (default 50/K)
-b beta Dirichlet hyperparameter on words (default 0.01)
-e eta Dirichlet hyperparameter on background (default 0.005)
-h displays this help
To start with, try an initial run with the sample texts contained in this package
(cran.txt in English and knbc.txt in Japanese):
% ldab.py -K 10 -N 50 cran.txt model stopwords.en
or
% ldab.py -K 20 -N 200 -t 5 knbc.txt model stopwords.ja
where "stopwords.en" or "stopwords.ja" are optional.
To see the learned model "model", use "viewtopic.py" as:
% viewtopic.py model
For the complete usage, just type "viewtopic.py".
viewtopic.py ranks words by their probability for the background
distribution, and by normalized PMI [2] for the ordinary topics. To change
this behavior, take a look at the code of viewtopic.py.
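For reference, the normalized PMI of [2] for a word v in topic k can be
sketched in a few lines of NumPy. This is only an illustrative sketch (the
variable names and the estimate of p(k) from 'theta' are assumptions), not
the actual ranking code of viewtopic.py:

import numpy as np

def npmi (beta, theta):
    # beta : V x K matrix of p(v|k), theta : N x K document-topic posteriors
    pk  = theta.mean (axis=0)               # rough estimate of p(k) (assumption)
    pvk = beta * pk                          # joint p(v,k)
    pv  = pvk.sum (axis=1, keepdims=True)    # marginal p(v)
    pmi = np.log (pvk / (pv * pk))           # log p(v,k) / (p(v) p(k))
    return pmi / (- np.log (pvk))            # normalized PMI of Bouma [2]

Words with higher values of this score within a topic are more specific to
that topic than globally frequent words.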
Sample text
- cran.txt (English)
- Cranfield collection for information retrieval.
http://ir.dcs.gla.ac.uk/resources/test_collections/cran/
- knbc.txt (Japanese)
- Kyoto University Blog corpus.
http://nlp.ist.i.kyoto-u.ac.jp/kuntt/
Data format
ldab.py is an unsupervised algorithm that requires only the raw text of
documents.
Documents are separated by blank line(s), and words are separated by
whitespace. The program does not normalize the text: it neither lowercases
words nor unifies numerals. If you wish to handle these properly, please
preprocess the input beforehand.
If you have data where each line represents a document, double-space it by:
% sed G input.txt > output.txt
then use "output.txt" for learning.
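If sed is not available, the same double-spacing can be done in Python
(a minimal sketch; the file names are just placeholders):

# double-space a one-document-per-line file (equivalent to "sed G")
with open ('input.txt') as fin, open ('output.txt', 'w') as fout:
    for line in fin:
        fout.write (line.rstrip('\n') + '\n\n')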
- Note: ldab.py uses only words whose frequency is >= the threshold (default 10).
To set this threshold, use the "-t" option.
- You can feed a list of words that should always be stopwords (i.e. belong
to the background distribution) as the last argument. Note that this is only
a hint, and the background distribution may also contain words that are not
specified in the list.
The program runs without any such hint, but generally yields better results
if you provide a candidate list of stopwords.
See "stopwords.en" or "stopwords.ja" for examples.
Output
"model" is a gzipped pickle file of Python dictionary, which have keys:
- 'alpha'
- scalar of Dirichlet prior over \theta
- 'beta'
- VxK matrix of \beta parameter of LDA, i.e. {p(v|k)}
- 'eta'
- scalar of Dirichlet prior over background distribution
- 'theta'
- NxK matrix of \theta posterior for each document
- 'background'
- Vx1 vector of background distribution
- 'stopwords'
- Vx1 vector of indicators of hinted background words (1/0)
- 'lexicon'
- dictionary to map each word to its id
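Because "model" is simply a gzipped pickle, it can also be loaded directly
from Python, for example to print the top words of a topic. This is a minimal
sketch based on the key layout above; the topic index and the number of words
shown are arbitrary:

import gzip, pickle
import numpy as np

with gzip.open ('model', 'rb') as f:
    model = pickle.load (f)

beta    = model['beta']      # V x K matrix of p(v|k)
lexicon = model['lexicon']   # word -> id
words   = { v: w for w, v in lexicon.items() }   # id -> word
k = 0                        # topic to inspect (arbitrary)
for v in np.argsort (- beta[:,k])[:10]:
    print (words[v], beta[v,k])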
References
[1] Chaitanya Chemudugunta, Padhraic Smyth, and Mark Steyvers. "Modeling
general and specific aspects of documents with a probabilistic topic model".
NIPS 2005, pp. 241-248.
[2] Gerlof Bouma. "Normalized (pointwise) mutual information in collocation
extraction". Proceedings of GSCL (2009): 31-40.
daichi@ism.ac.jp
Last modified: Mon Aug 3 21:06:27 2020