bhmm.py: Bayesian HMM in Python.

Daichi Mochihashi
The Institute of Statistical Mathematics, Tokyo
$Id: index.html,v 1.3 2021/10/24 03:03:13 daichi Exp $

bhmm.py is a simple Python implementation of a Bayesian (discrete) hidden Markov model (HMM).
It is written mainly for educational and research purposes, and implements standard forward filtering-backward sampling (the Bayesian counterpart of the forward-backward algorithm; Scott (2002)) as well as direct Gibbs sampling in Python.
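
As a rough illustration (not the code of bhmm.py itself), one forward filtering-backward sampling sweep over a single sequence can be sketched in NumPy as follows, given current samples of a KxK transition matrix A and a KxV emission matrix B. The uniform initial distribution and the variable names are simplifying assumptions; bhmm.py actually uses a special BOS/EOS state (see Output below). In the full sampler, such sweeps alternate with redrawing A and B from their Dirichlet posteriors, governed by the hyperparameters alpha and beta listed under Usage.

import numpy as np

def ffbs(seq, A, B, rng=np.random.default_rng()):
    # seq: array of word ids; A: KxK transition probs; B: KxV emission probs
    K, T = A.shape[0], len(seq)
    alpha = np.zeros((T, K))
    alpha[0] = B[:, seq[0]] / K        # uniform initial state (assumption)
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, seq[t]]   # forward filtering
        alpha[t] /= alpha[t].sum()                   # normalize to avoid underflow
    z = np.zeros(T, dtype=int)
    z[-1] = rng.choice(K, p=alpha[-1])               # backward sampling of states
    for t in range(T-2, -1, -1):
        p = alpha[t] * A[:, z[t+1]]
        z[t] = rng.choice(K, p=p/p.sum())
    return z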

Download

Requirements

Python 3.x (developed with Python 3.8.3).
If you need to run it with Python 2.x, please use bhmm.py-0.1.tar.gz.

Contents

Usage

% ./bhmm.py -h
bhmm.py: Bayesian HMM in Python.
usage: % bhmm.py OPTIONS train model
OPTIONS
 -G         use Gibbs sampling instead of dynamic programming
 -K states  number of states in HMM
 -N iters   number of MCMC iterations
 -a alpha   Dirichlet hyperparameter on transitions (default 0.1)
 -b beta    Dirichlet hyperparameter on emissions (default 0.01)
 -h         displays this help
$Id: bhmm.py,v 1.8 2021/10/24 02:02:42 daichi Exp $
train is a data file: each line is a sequence of integers starting from 0, which usually corresponds to the sequence of word ids in a sentence.
model is the basename for the output files of learned parameters.
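
For example, a tiny hypothetical train file with three sentences could look like this (the integers are illustrative word ids, not actual data):

0 1 2 3 4
5 1 6
2 2 7 0 8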

Output

bhmm.py outputs three files, where model is the basename specified above:
model.zz
Learned hidden states at the end of MCMC iterations.
model.trans
(K+1)x(K+1) state transition matrix, where K is the number of hidden states. Note that the additional (K+1)-th state is a special state corresponding to BOS/EOS.
model.emit
VxK emission matrix for p(v|k), where V is the number of words in the vocabulary.
Note that model.zz is just a dump of the hidden states at the final iteration (which is stochastic, because we use MCMC): if you wish to find the most likely state sequences, run a Viterbi algorithm using the learned parameters, as sketched below.
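
For instance, assuming model.trans and model.emit are saved as plain whitespace-separated matrices and that the BOS/EOS state is the last row/column of model.trans (check the actual output files, since this layout is an assumption), a minimal Viterbi decoder over the learned parameters might look like:

import numpy as np

def viterbi(seq, trans, emit):
    # trans: (K+1)x(K+1), emit: VxK = p(v|k); assumes BOS/EOS is the last index
    K, T = trans.shape[0] - 1, len(seq)
    logA = np.log(trans + 1e-300)
    logB = np.log(emit + 1e-300)
    delta = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    delta[0] = logA[K, :K] + logB[seq[0]]            # transition out of BOS
    for t in range(1, T):
        scores = delta[t-1][:, None] + logA[:K, :K]  # scores[j,k]: best path ending in j, then j->k
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[seq[t]]
    state = int(np.argmax(delta[-1] + logA[:K, K]))  # transition into EOS
    path = [state]
    for t in range(T-1, 0, -1):
        state = back[t, state]
        path.append(state)
    return path[::-1]

# e.g. trans = np.loadtxt('model.trans'); emit = np.loadtxt('model.emit')
# states = viterbi([0, 1, 2], trans, emit)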

Sample data and outputs

Using zsh (it might take around 30 minutes to converge, since the code is written in pure Python):
% bhmm.py -K 7 -N 400 alice.dat alice
% paste =(flatten.awk alice.zz) =(flatten.awk alice.txt) > alice.states
% for n in `seq 0 6`; do
echo $n
awk "\$1==$n" alice.states | sort | uniq -c | sort -nr | head -25 | awk '{printf("%4s  %s\n",$1,$3)}' > alice.$n
done
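
If zsh or flatten.awk is not available, a rough Python equivalent of the pipeline above, assuming alice.zz and alice.txt are parallel files with one sentence per line and states aligned with words, would be:

from collections import Counter

counts = {}
with open('alice.zz') as zf, open('alice.txt') as wf:
    for zline, wline in zip(zf, wf):
        for z, w in zip(zline.split(), wline.split()):
            counts.setdefault(int(z), Counter())[w] += 1
for state in sorted(counts):
    print('state %d' % state)
    for word, c in counts[state].most_common(25):
        print('%4d  %s' % (c, word))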

Result:

state 0
1642  the
 631  a
 162  her
  96  his
  63  this
  59  no
  58  my
  57  its
  57  an
  49  their
  40  very
  37  your
  37  some
  37  one
  31  any
  20  another
  14  that
  12  every
  12  alice's
  10  great
   8  which
   8  these
   8  our
   8  four
   7  those

state 1
 678  and
 257  as
 170  but
 150  that
 115  so
  96  if
  82  what
  79  when
  75  then
  70  or
  54  i'm
  51  how
  50  for
  48  of
  47  well
  45  it's
  41  oh
  40  why
  38  before
  34  that's
  33  now
  32  two
  31  no
  25  up
  25  round

state 2
 729  to
 466  of
 362  in
 341  said
 212  at
 180  with
 168  on
 103  for
  72  about
  70  like
  67  into
  65  all
  58  by
  51  not
  50  quite
  50  down
  46  be
  40  very
  40  over
  36  from
  35  after
  34  such
  31  looking
  26  without
  26  upon

state 3
 357  was
 178  had
 121  said
  94  is
  85  were
  83  went
  77  could
  65  would
  62  know
  61  don't
  56  thought
  56  began
  48  did
  45  looked
  45  got
  44  must
  43  are
  40  came
  38  have
  38  all
  37  never
  35  can
  30  think
  29  replied
  28  might

state 4
 121  little
  68  queen
  61  king
  58  time
  57  turtle
  56  way
  56  mock
  55  hatter
  55  gryphon
  49  head
  48  voice
  47  rabbit
  43  mouse
  40  other
  39  dormouse
  38  duchess
  37  tone
  35  cat
  34  thing
  33  very
  33  march
  31  hare
  30  white
  30  door
  29  moment

state 5
 541  she
 410  i
 277  it
 261  you
 193  and
 161  alice
 130  they
 119  he
  75  there
  57  who
  53  that
  48  what
  31  i've
  30  we
  30  i'll
  28  very
  25  your
  24  which
  20  this
  19  then
  19  not
  17  how
  12  it's
  10  you'd
  10  they're

state 6
 253  it
 225  alice
 117  out
 102  be
  88  you
  88  them
  86  her
  83  herself
  83  again
  79  all
  75  up
  73  off
  68  me
  63  that
  63  not
  59  see
  51  this
  50  one
  50  go
  47  down
  46  come
  44  much
  44  just
  43  him
  42  have


daichi<at>ism.ac.jp
Last modified: Sun Oct 24 12:04:28 2021