mots quotidiens.

前に日記で書いた, 自己組織化二分探索木であるSplay Treeは

struct splay {
        splay *left;
        splay *right;
        void *item;
};

というデータ構造を持っているため, データ構造へのポインタのきっかり3倍の記憶容量を必要とする。
順番は多くの場合関係ないので, こうした動的なデータ構造には本来ハッシュを使えばいいはずだが, 普通のハッシュでは不要なメモリが沢山確保される可能性があるため, スプレー木を使っていた。ハッシュテーブルが1つなら大した問題ではないですが, テーブル自体が何万個もあったりすると, そのロスは膨大なものになります。

最近, 開発環境をCからC++に変えたため(理由はそのうち), Googleの提供している Memory-efficientな Google Sparse Hash が使えるようになったので, 実行速度を比較してみた。研究の都合上文字列の比較が必要になるため, 文字列関数の実行回数の少ない(と思われる)ハッシュを使うべきではないか, という予測もありました。

先に結果だけ書いておくと, スプレー木の圧勝でした。(!) 驚き。

空白で区切られた単語の出現回数を数える, スプレー木のCのコードは, 重要なところだけ書くと

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "splay.h"

typedef struct {
        char *string;
        int count;
} entry;

int
entcmp (entry *a, entry *b)
{
        return strcmp(a->string, b->string);
}

int
main (int argc, char *argv[])
{
        splay *tree = NULL;
        char s[BUFSIZ];
        entry e, *ep;
        FILE *fp;

        // fp をオープン 

        while (read_word(fp, s) != EOF)
        {
                e.string = s;
                if ((ep = splay_find(&tree, &e, (cmpfunc)entcmp)) == NULL)
                {
                        if ((ep = (entry *)malloc(sizeof(entry))) == NULL)
                        {
                                perror("malloc");
                                exit(1);
                        } else {
                                ep->string = strdup(s);
                                ep->count = 1;
                        }
                        tree = splay_insert(tree, ep, (cmpfunc)entcmp);
                } else {
                        ep->count++;
                }
        }
        // splay_print (tree, (prfunc)entprint);
        fclose(fp);
        return 0;
}

のようになります。(速度比較のため, // で結果表示部をコメントアウト)
これに対して, 全く同じことをする Google Sparse Hash のコードは以下のような感じ。別に変なことはしていないと思うのですが..。

#include <iostream>
#include <google/sparse_hash_map>
#include <cstdio>
#include <cstring>

using namespace std;
using namespace __gnu_cxx;
using google::sparse_hash_map;

struct eqstr
{
        bool operator() (const char *a, const char *b) const
        {
                return (a == b) || (a && b && strcmp(a,b) == 0);
        }
};

int
read_word (FILE *fp, const char *buf)
{
        return fscanf(fp, "%s", buf);
}

int
main (int argc, char *argv[])
{
        FILE *fp;
        char s[BUFSIZ];
        sparse_hash_map <const char *,int,hash<const char *>,eqstr> lexicon;
        sparse_hash_map <const char *,int,hash<const char *>,eqstr>::iterator it;
        
        if ((fp = fopen(argv[1], "r")) == NULL)
        {
                perror("fopen");
                exit(1);
        }
        while (read_word(fp, s) != EOF)
        {
                if ((it = lexicon.find(s)) == lexicon.end())
                {
                        lexicon[strdup(s)] = 1;
                } else {
                        it->second++;   // count++;
                }
        }

        fclose(fp);

//      for (it = lexicon.begin(); it != lexicon.end(); it++)
//      {
//              printf("%s\t%d\n", it->first, it->second);
//      }

        return 0;
}

実際には sparse_hash_map はドキュメントによると, メモリ最適化を優先しているため, 速度優先のためには dense_hash_map を使うべきらしい(こちらの方が相当高速だが, メモリ使用量は増える)。コードは上のものを dense_hash_map に変えたもの+1行だけなので省略。

以上のプログラムを, テキストの単語の出現回数を数える場合で比較してみた。テキストは毎日新聞2001年度全文から1000000行(約2/3ぐらい)。最後に比較のため, awkワンライナーで同じことをした場合の結果も載せておきます。

algorithm time memory
splay tree 17.2s 7588K
ext/hash_map 11.4s 7544K
dense_hash_map 147.8s 6972K
sparse_hash_map 505.5s 6152K
(awk) 15.4s 17380K

algorithm	time	memory
splay tree	17.2s	7588K
ext/hash_map	11.4s	7544K
dense_hash_map	147.8s	6972K
sparse_hash_map	505.5s	6152K
(awk)	15.4s	17380K

実際, Google Sparse Hash の方が高性能かと思っていたので, 意外でした。ただし, メモリ量で見ると若干 sparse hash の方が勝っているようです。 *1
こんなに差が出たのは, 言語の場合は単語頻度に強烈なPower lawがあって, 「が」や「も」のような機能語が何度も何度も検索されるため, それらが自動的に木の上に集まって高速にアクセスされるスプレー木が勝った, ということではないか..と思ったのですが, 1..1000000までの数字がランダムに1回ずつ現れるテキストで実験したところ, splay tree(5.4s), dense_hash_map(158.1s), sparse_hash_map(367s)とやはり差は大きかったので, 細かいところまで自分で面倒を見るCのコードの方が高速, ということなのかもしれません。それにしても大きすぎる違いだ..。 *2
というわけで, ∞-gramの実装にハッシュを使わず, スプレー木にしたのは結局正解だったようです。うーん。

: この記憶量のうち, 文字列の確保に必要なメモリは314kbytesくらいなので, 大部分はアルゴリズムの差であることがわかる。
: コンピュータサイエンス的に言うと, 同値かどうかだけを問題にするハッシュは, 比較関数から本来得られる <, > という情報を落としているので, その分効率が落ちているのかもしれません。

2008年07月23日(水) [n年日記]

#1 Google Sparse Hash <-> Splay Tree

追記:

mots quotidiens.
Daichi Mochihashi (持橋大地) daichi <at> ism.ac.jp	by hns, version 2.10-pl1.