一本到高清DVD91日韩伦理影院|无码AV中文一区国产强奸三级簧片|日韩无码色哟哟午夜福利国产一区|丁香激情五月亚洲亚洲影院123区|五月天综合久久国产精品free|亚洲免费专区日韩热在线视频|黄片看视频免费久久偷拍的视频|五月婷桃色网日韩国产一级

服務(wù)熱線02152235399
當(dāng)前位置:博客 > 生物信息

An Implementation of Double-Array Trie 雙數(shù)組Trie的一種實(shí)現(xiàn)

時(shí)間:2018-10-22    |    閱讀量:5441


原文http://linux.thai.net/~thep/datrie/datrie.html

引文:http://quweiprotoss.blog.163.com/blog/static/4088288320091120112155178/

Contents

1. What is Trie?

2. What Does It Take to Implement a Trie?

3. Tripple-Array Trie

4. Double-Array Trie

5. Suffix Compression

6. Key Insertion

7. Key Deletion

8. Double-Array Pool Allocation

9. An Implementation

10. Download

11. References

What is Trie?

Trie is a kind of digital search tree. (See [Knuth1972] for the detail of digital search tree.)[Fredkin1960] introduced the trie terminology, which is abbreviated from "Retrieval".

Trie是一種數(shù)字搜索樹,[Fredkin]引入了trie這個(gè)術(shù)語,它是Retrieval的縮寫。

Trie is an efficient indexing method. It is indeed also a kind of deterministic finite automaton (DFA) (See[Cohen1990], for example, for the definition of DFA). Within the tree structure, each node corresponds to a DFA state, each (directed) labeled edge from a parent node to a child node corresponds to a DFA transition. The traversal starts at the root node. Then, from head to tail, one by one character in the key string is taken to determine the next state to go. The edge labeled with the same character is chosen to walk. Notice that each step of such walking consumes one character from the key and descends one step down the tree. If the key is exhausted and a leaf node is reached, then we arrive at the exit for that key. If we get stuck at some node, either because there is no branch labeled with the current character we have or because the key is exhausted at an internal node, then it simply implies that the key is not recognized by the trie.

Trie是一種高效的索引方法,它實(shí)際上是一種確定有限自動機(jī)(DFA),在樹的結(jié)構(gòu)中,每一個(gè)結(jié)點(diǎn)對應(yīng)一個(gè)DFA狀態(tài),每一個(gè)從父結(jié)點(diǎn)指向子結(jié)點(diǎn)(有向)標(biāo)記的邊對應(yīng)一個(gè)DFA轉(zhuǎn)換。遍歷從根結(jié)點(diǎn)開始,然后從head到tail,由關(guān)鍵詞(本想譯成鍵字符串,感太別扭)的每個(gè)字符來決定下一個(gè)狀態(tài),標(biāo)記有相同字符的邊被選中做移動。注意每次這種移動會從關(guān)鍵詞中消耗一個(gè)字符并走向樹的下一層,如果這個(gè)關(guān)鍵字符串空了,并且走到了葉子結(jié)點(diǎn),那么我們達(dá)到了這個(gè)關(guān)鍵詞的出口。如果我們被困在了一點(diǎn)結(jié)點(diǎn),比如因?yàn)闆]有分枝被標(biāo)記為當(dāng)前我們有的字符,或是因?yàn)殛P(guān)鍵字符串在中間結(jié)點(diǎn)就空了,這表示關(guān)鍵字符串沒有被trie認(rèn)出來。

Notice that the time needed to traverse from the root to the leaf is not dependent on the size of the database, but is proportional to the length of the key. Therefore, it is usually much faster than B-tree or any comparison-based indexing method in general cases. Its time complexity is comparable with hashing techniques.

注意從根遍歷到葉子的時(shí)間不依賴于數(shù)據(jù)量的大小,而與關(guān)鍵詞的長度成比例。也就是它通常比B-tree要快的多,或是比任何基于比較的索引方法在一般情況下都快的多。它的時(shí)間復(fù)雜度可以與hashing技術(shù)相比。

In addition to the efficiency, trie also provides flexibility in searching for the closest path in case that the key is misspelled. For example, by skipping a certain character in the key while walking, we can fix the insertion kind of typo. By walking toward all the immediate children of one node without consuming a character from the key, we can fix the deletion typo, or even substitution typo if we just drop the key character that has no branch to go and descend to all the immediate children of the current node.

除了效率,trie可以當(dāng)關(guān)鍵詞被拼錯(cuò)了時(shí),提供靈活的搜索。比如在遍歷時(shí)忽略一個(gè)特定的字符,我們可以修正多輸入的這種輸入錯(cuò)誤。直接遍歷所有直接的子結(jié)點(diǎn)而不消耗一個(gè)字符,這樣我可以修正少輸?shù)妮斎脲e(cuò)誤,或者甚至替換輸入的錯(cuò)誤,如果我們沒有分支可以向下移動,就向當(dāng)前結(jié)點(diǎn)的所有子結(jié)點(diǎn)直接向下移動。

What Does It Take to Implement a Trie?

In general, a DFA is represented with a transition table, in which the rows correspond to the states, and the columns correspond to the transition labels. The data kept in each cell is then the next state to go for a given state when the input is equal to the label.

通常,一個(gè)DFA是用一個(gè)transition table表示,它的行對應(yīng)狀態(tài)s,它的列對應(yīng)轉(zhuǎn)換標(biāo)簽s。每一個(gè)單元中的數(shù)據(jù)當(dāng)輸入與標(biāo)記相同時(shí),給定一個(gè)狀態(tài)后要到達(dá)的狀態(tài)。

This is an efficient method for the traversal, because every transition can be calculated by two-dimensional array indexing. However, in term of space usage, this is rather extravagant, because, in the case of trie, most nodes have only a few branches, leaving the majority of the table cells blanks.

這是一個(gè)的遍歷高效方法,因?yàn)槊看无D(zhuǎn)換可以通過計(jì)算二維的數(shù)組索引得到。但是,從空間使用的角度看,這是相當(dāng)浪費(fèi)的,因?yàn)?,在trie這種情況中,大多數(shù)結(jié)點(diǎn)只有少量的分枝,表中大多數(shù)單元是空白的。

Meanwhile, a more compact scheme is to use a linked list to store the transitions out of each state. But this results in slower access, due to the linear search.

同時(shí),一個(gè)更緊湊的方式是用鏈表保存每個(gè)狀態(tài)的轉(zhuǎn)移,但這種方式會比較慢,因?yàn)橐M(jìn)行線性搜索

Hence, table compression techniques which still allows fast access have been devised to solve the problem.

所以,建議使用可以快速訪問的表壓縮技術(shù)。

1. [Johnson1975] (Also explained in [Aho+1985] pp. 144-146) represented DFA with four arrays, which can be simplified to three in case of trie. The transition table rows are allocated in overlapping manner, allowing the free cells to be used by other rows.

2. [Aoe1989] proposed an improvement from the three-array structure by reducing the arrays to two.

Tripple-Array Trie

As explained in [Aho+1985] pp. 144-146, a DFA compression could be done using four linear arrays, namely default, base, next, and check. However, in a case simpler than the lexical analyzer, such as the mere trie for information retrieval, the default array could be omitted. Thus, a trie can be implemented using three arrays according to this scheme.

在[Aho+1985]中解釋到,一個(gè)DFA壓縮可以用四個(gè)數(shù)據(jù)來完成,分別是default, base, next, check。但是與語法分析不同,比如用于信息檢索的trie,default可以被忽略。所以,一個(gè)trie可以用三個(gè)數(shù)組來實(shí)現(xiàn)。

Structure

The tripple-array structure is composed of:

1. base. Each element in base corresponds to a node of the trie. For a trie nodes, base[s] is the starting index within the next and check pool (to be explained later) for the row of the node s in the transition table.

2. next. This array, in coordination with check, provides a pool for the allocation of the sparse vectors for the rows in the trie transition table. The vector data, that is, the vector of transitions from every node, would be stored in this array.

3. check. This array works in parallel to next. It marks the owner of every cell in next. This allows the cells next to one another to be allocated to different trie nodes. That means the sparse vectors of transitions from more than one node are allowed to be overlapped.

三數(shù)組結(jié)構(gòu)包括:

1.  base.在base中的每個(gè)元素對應(yīng)trie中的一個(gè)結(jié)點(diǎn),對于一個(gè)trie結(jié)點(diǎn)s,base[s]是next和check pool的起始索引,它們是在轉(zhuǎn)移表中結(jié)點(diǎn)s的行。

2.  next. 這個(gè)數(shù)組,與check配合,提供一個(gè)pool來分配trie轉(zhuǎn)移表中稀疏向量。這個(gè)向量數(shù)據(jù)是從每一個(gè)結(jié)點(diǎn)的轉(zhuǎn)移向量,會被保存在這個(gè)數(shù)組中。

3.  check. 這個(gè)數(shù)組與next平行地工作。它標(biāo)記在next單元的owner。它允許單元s轉(zhuǎn)移到別一個(gè)單元可以分配到不同的trie結(jié)點(diǎn)s。這意味著轉(zhuǎn)移s 的稀疏向量s可以從不同的結(jié)點(diǎn)轉(zhuǎn)移到。

Definition 1. For a transition from state s to t which takes character c as the input, the condition maintained in the tripple-array trie is:

check[base[s] + c] = s

next[base[s] + c] = t

定義 1. 對于從狀態(tài)s將c做為輸入,轉(zhuǎn)移到t,保存在三數(shù)組中trie條件是:

check[base[s] + c] = s

next[base[s] + c] = t

Walking

According to definition 1, the walking algorithm for a given state s and the input character c is:

根據(jù)定義1,對于一個(gè)給定狀態(tài)s和一個(gè)輸入字符c,移動算法如下:

t := base[s] + c;

if check[t] = s then

next state := next[t]

else

fail

endif

Construction

To insert a transition that takes character c to traverse from a state s to another state t, the cellnext[base[s] + c]] must be managed to be available. If it is already vacant, we are lucky. Otherwise, either the entire transition vector for the current owner of the cell or that of the state s itself must be relocated. The estimated cost for each case could determine which one to move. After finding the free slots to place the vector, the transition vector must be recalculated as follows. Assuming the new place begins at b, the procedure for the relocation is:

插入一個(gè)從狀態(tài)s到另一個(gè)狀態(tài)t的轉(zhuǎn)移,單元next[base[s]+c]必須是空閑的。如果它就是空的,那么我們很幸運(yùn)。不然,或是將當(dāng)前 owner的整個(gè)轉(zhuǎn)移向量重新分配,或是狀態(tài)s自己要重新分配。這兩種情況的估計(jì)代價(jià)會決定采用哪用方式。在找到空閑的位置來放這個(gè)向量,轉(zhuǎn)移向量必須重新如下計(jì)算,假設(shè)新的位置開始于b,重新分配的過程如下:

Procedure Relocate(s : state; b : base_index)

{ Move base for state s to a new place beginning at b }

begin

foreach input character c for the state s

{ i.e. foreach c such that check[base[s] + c]] = s }

begin

check[b + c] := s;     { mark owner }

next[b + c] := next[base[s] + c];     { copy data }

check[base[s] + c] := none { free the cell }

end;

base[s] := b

end

Double-Array Trie

The tripple-array structure for implementing trie appears to be well defined, but is still not practical to keep in a single file. The next/check pool may be able to keep in a single array of integer couples, but thebase array does not grow in parallel to the pool, and is therefore usually split.

三數(shù)組結(jié)構(gòu)來實(shí)現(xiàn)trie看真似為是一個(gè)很好的方式,但是將它保存到一個(gè)單一文件還是不現(xiàn)實(shí)的,next/check pool也許可以保存在整型對的一個(gè)單個(gè)數(shù)組,但是base數(shù)組不與pool平行地增長,所以通常分離。

To solve this problem, [Aoe1989] reduced the structure into two parallel arrays. In the double-array structure, the base and next are merged, resulting in only two parallel arrays, namely, base and check.

解決這個(gè)問題,[Aoe1989]減少這個(gè)結(jié)構(gòu)到二個(gè)平行的數(shù)組,在這個(gè)雙數(shù)組結(jié)構(gòu)中,將base和next合并,它只使用兩個(gè)平行的數(shù)組為,base和check。

Structure

Instead of indirectly referencing through state numbers as in tripple-array trie, nodes in double-array trie are linked directly within the base/check pool.

不同于三數(shù)組trie中通過狀態(tài)號間接引用,在雙數(shù)組trie是直接在base / check pool中鏈接。

Definition 2. For a transition from state s to t which takes character c as the input, the condition maintained in the double-array trie is:

check[base[s] + c] = s

base[s] + c = t

定義2. 對于一個(gè)接收字符c從狀態(tài)s移動到t的轉(zhuǎn)移,在雙數(shù)組中保存的條件是:

check[base[s] + c] = s

base[s] + c = t

Walking

According to definition 2, the walking algorithm for a given state s and the input character c is:

根據(jù)定義2,對于從狀態(tài)s接收輸入c的移動算法如下:

t := base[s] + c;

if check[t] = s then

next state := t

else

fail

endif

Construction

The construction of double-array trie is in principle the same as that of tripple-array trie. The difference is the base relocation:

構(gòu)造雙數(shù)組trie在原理上是與三數(shù)組trie相同的,不同點(diǎn)在于base的重新定位。

Procedure Relocate(s : state; b : base_index)

{ Move base for state s to a new place beginning at b }

begin

foreach input character c for the state s

{ i.e. foreach c such that check[base[s] + c]] = s }

begin

check[b + c] := s;     { mark owner }

base[b + c] := base[base[s] + c];     { copy data }

{ the node base[s] + c is to be moved to b + c;

Hence, for any i for which check[i] = base[s] + c, update check[i] to b+ c }

foreach input character d for the node base[s] + c

begin

check[base[base[s] + c] + d] := b + c

end;

check[base[s] + c] := none { free the cell }

end;

base[s] := b

end

Suffix Compression

[Aoe1989] also suggested a storage compression strategy, by splitting non-branching suffixes into single string storages, called tail, so that the rest non-branching steps are reduced into mere string comparison.

[Aoe1989]也建議了一種壓縮策略,通過拆分non-branching后綴到單一字符串保存,稱為tail,這樣剩余的non-branching步驟就減為字符串壓縮了。

With the two separate data structures, double-array branches and suffix-spool tail, key insertion and deletion algorithms must be modified accordingly.

使用兩種不同的數(shù)據(jù)結(jié)構(gòu),雙數(shù)組分枝和suffix-spool tail,關(guān)鍵詞插入和刪除算法也必須相應(yīng)的修改。

Key Insertion

To insert a new key, the branching position can be found by traversing the trie with the key one by one character until it gets stuck. The state where there is no branch to go is the very place to insert a new edge, labeled by the failing character. However, with the branch-tail structure, the insertion point can be either in the branch or in the tail.

插入一個(gè)新的關(guān)鍵詞,分枝位置可以通過關(guān)鍵詞中的字符遍歷trie,直到它困住。這種狀態(tài)就是沒有分枝可以走,要插入一個(gè)新的邊的位置,這個(gè)邊標(biāo)記為那個(gè)失敗的字符。然而,用branch-tail結(jié)構(gòu),插入點(diǎn)可以在branch或是tail。

1. When the branching point is in the double-array structure

Suppose that the new key is a string a1a2...ah-1ahah+1...an, where a1a2...ah-1 traverses the trie from the root to a node sr in the double-array structure, and there is no edge labeled ah that goes out of sr. The algorithm called A_INSERT in [Aoe1989] does as follows:

假設(shè)一個(gè)新的key是一個(gè)字符串a(chǎn)1a2...ah-1ahah+1...an,其中a1a2...ah-1從trie的根遍歷到結(jié)點(diǎn)sr,并且沒有邊標(biāo)記為ah可以走出sr。這個(gè)算法稱為A_INSERT在[Aoe1989]中如下操作:

From sr, insert edge labeled ah to new node st;

Let st be a separate node poining to a string ah+1...an in tail pool.

2. When the branching point is in the tail pool

Since the path through a tail string has no branch, and therefore corresponds to exactly one key, suppose that the key corresponding to the tail is

a1a2...ah-1ah...ah+k-1b1...bm,

where a1a2...ah-1 is in double-array structure, and ah...ah+k-1b1...bm is in tail. Suppose that the substring a1a2...ah-1 traverses the trie from the root to a node sr.

And suppose that the new key is in the form

a1a2...ah-1ah...ah+k-1ah+k...an,

where ah+k <> b1. The algorithm called B_INSERT in [Aoe1989] does as follows:

當(dāng)路徑經(jīng)過tail字符串沒有branch,所以對應(yīng)一個(gè)關(guān)鍵詞,假設(shè)在tail這個(gè)關(guān)鍵詞對應(yīng)的是:

a1a2...ah-1ah...ah+k-1b1...bm,

其中a1a2...ah-1在雙數(shù)組結(jié)構(gòu)中,而ah...ah+k-1b1...bm在tail。假設(shè)子串a(chǎn)1a2...ah-1遍歷trie從根到sr。

并假設(shè)這個(gè)新的關(guān)鍵詞是如下形式:

a1a2...ah-1ah...ah+k-1ah+k...an,

其中ah+k <> b1。這個(gè)算法在[Aoe1989]中被稱為B_INSERT如下操作:

From sr, insert straight path with ah...ah+k-1, ending at a new node st;

From st, insert edge labeled b1 to new node su;

Let su be separate node pointing to a string b2...bm in tail pool;

From st, insert edge labeled ah+k to new node sv;

Let sv be separate node pointing to a string ah+k+1...an in tail pool.

Key Deletion

To delete a key from the trie, all we need to do is delete the tail block occupied by the key, and all double-array nodes belonging exclusively to the key, without touching any node belonging to other keys.

從trie中刪除一個(gè)關(guān)鍵詞,我們僅需要的是刪除被這個(gè)關(guān)鍵詞占有的tail塊,和所有只屬于這個(gè)關(guān)鍵詞的雙數(shù)組結(jié)點(diǎn),不改變其它關(guān)鍵詞的結(jié)點(diǎn)。

Consider a trie which accepts a language K = {pool#, prepare#, preview#, prize#, produce#, producer#, progress#} :

考慮一個(gè)trie,它接收語言K = {pool#, prepare#, preview#, prize#, produce#, producer#, progress#} :

The key "pool#" can be deleted by removing the tail string "ol#" from the tail pool, and node 3 from the double-array structure. This is the simplest case.

關(guān)鍵詞"pool#"可以通過從tail pool移除tail字符串"ol#",即雙數(shù)組結(jié)構(gòu)中的結(jié)點(diǎn)3,這是最簡單的情況。

To remove the key "produce#", it is sufficient to delete node 14 from the double-array structure. But the resulting trie will not obay the convention that every node in the double-array structure, except the separate nodes which point to tail blocks, must belong to more than one key. The path from node 10 on will belong solely to the key "producer#".

移除關(guān)鍵詞"produce#",從雙數(shù)組結(jié)構(gòu)中刪除結(jié)點(diǎn)14就可以了,但是產(chǎn)生的trie不再遵循trie的定義,即除指向tail塊的分離的結(jié)點(diǎn)外,必須屬于多個(gè)關(guān)鍵詞。從結(jié)點(diǎn)10的路徑僅屬于關(guān)鍵詞"producer#"。

But there is no harm violating this rule. The only drawback is the uncompactnesss of the trie. Traversal, insertion and deletion algoritms are intact. Therefore, this should be relaxed, for the sake of simplicity and efficiency of the deletion algorithm. Otherwise, there must be extra steps to examine other keys in the same subtree ("producer#" for the deletion of "produce#") if any node needs to be moved from the double-array structure to tail pool.

但是違反這個(gè)規(guī)則沒有什么壞處。唯一的缺點(diǎn)是trie不緊湊。遍歷,插入,刪除算法都是不改變的。所以,出于刪除算法簡單和高效,這是可以放寬的。不然,必須有額外的步驟來檢查其它在相同子樹的關(guān)鍵詞,看是否其它結(jié)點(diǎn)需要從雙數(shù)組結(jié)構(gòu)刪除到 tail pool。

Suppose further that having removed "produce#" as such (by removing only node 14), we also need to remove "producer#" from the trie. What we have to do is remove string "#" from tail, and remove nodes 15, 13, 12, 11, 10 (which now belong solely to the key "producer#") from the double-array structure.

進(jìn)一步假設(shè)刪除的除了"produce#"(只刪除結(jié)點(diǎn)14),我們還需要從trie中刪除"producer#"。我們需要做的是從tail中刪除"#",并刪除15, 13, 12, 11, 10(它們僅屬于關(guān)鍵詞"producer#")。

We can thus summarize the algorithm to delete a key k = a1a2...ah-1ah...an, where a1a2...ah-1 is in double-array structure, and ah...an is in tail pool, as follows :

我們可以總結(jié)這個(gè)算法,其中刪除了一個(gè)關(guān)鍵詞k = a1a2...ah-1ah...an,其中a1a2...ah-1是在雙數(shù)組結(jié)構(gòu)中,且ah...an在tail pool中, 如下:

Let sr := the node reached by a1a2...ah-1;

Delete ah...an from tail;

s := sr;

repeat

p := parent of s;

Delete node s from double-array structure;

s := p

until s = root or outdegree(s) > 0.

Where outdegree(s) is the number of children nodes of s.

其中outdegree(s)是s的子結(jié)點(diǎn)數(shù)。

Double-Array Pool Allocation

When inserting a new branch for a node, it is possible that the array element for the new branch has already been allocated to another node. In that case, relocation is needed. The efficiency-critical part then turns out to be the search for a new place. A brute force algorithm iterates along the check array to find an empty cell to place the first branch, and then assure that there are empty cells for all other branches as well. The time used is therefore proportional to the size of the double-array pool and the size of the alphabet.

當(dāng)對一個(gè)結(jié)點(diǎn)插入一個(gè)新的分枝,有可能數(shù)組中為新的分枝的位置已經(jīng)分配給了其它結(jié)點(diǎn),在這種情況下,需要重新定位,影響效率的部分就是尋找一個(gè)新的位置,一種brute force(蠻力)算法迭代check數(shù)組找出一個(gè)空的單元來放置第一個(gè)分枝,然后假設(shè)還有空的單元來放其它的分枝。所以時(shí)間復(fù)雜性與雙數(shù)組的大小和字母集的大小成正比。

Suppose that there are n nodes in the trie, and the alphabet is of size m. The size of the double-array structure would be n + cm, where c is a coefficient which is dependent on the characteristic of the trie. And the time complexity of the brute force algorithm would be O(nm + cm2).

假設(shè)在trie中有n個(gè)結(jié)點(diǎn),字母集的大小為m。雙數(shù)組結(jié)構(gòu)的大小為n+cm,其中c是一個(gè)依賴trie字符形式的一個(gè)系數(shù)。Brute force算法的時(shí)間復(fù)雜性是O(nm + cm2)。

[Aoe1989] proposed a free-space list in the double-array structure to make the time complexity independent of the size of the trie, but dependent on the number of the free cells only. The check array for the free cells are redefined to keep a pointer to the next free cell (called G-link) :

[Aoe1989]提出了在雙數(shù)組結(jié)構(gòu)中一個(gè)free-space list使時(shí)間復(fù)雜性與trie的大小無關(guān),但僅與空閑單元的方法。Check數(shù)組的空閑單元被重定義為保存一下指向下一個(gè)空閑單元的指針(稱為G-link):

Definition 3. Let r1, r2, ... , rcm be the free cells in the double-array structure, ordered by position. G-link is defined as follows :

check[0] = -r1

check[ri] = -ri+1 ; 1 <= i <= cm-1

check[rcm] = -1

定義3. 令r1, r2, ... , rcm為雙數(shù)組結(jié)構(gòu)中的空閑單元,以位置排序。G-link定義如下:

check[0] = -r1

check[ri] = -ri+1 ; 1 <= i <= cm-1

check[rcm] = -1

By this definition, negative check means unoccupied in the same sense as that for "none" check in the ordinary algorithm. This encoding scheme forms a singly-linked list of free cells. When searching for an empty cell, only cm free cells are visited, instead of all n + cm cells as in the brute force algorithm.

根據(jù)這個(gè)定義,負(fù)check意味著空閑與在普通算法中的"none" check是同一個(gè)意思,這種編碼方式形成了一個(gè)空閑單無單一鏈接的鏈表。當(dāng)搜索一個(gè)空閑單元,僅cm空閑單元被訪問,而不是在brute force算法中的所有n+cm。

This, however, can still be improved. Notice that for those cells with negative check, the correspondingbase's are not given any definition. Therefore, in our implementation, Aoe's G-link is modified to be doubly-linked list by letting base of every free cell points to a previous free cell. This can speed up the insertion and deletion processes. And, for convenience in referencing the list head and tail, we let the list be circular. The zeroth node is dedicated to be the entry point of the list. And the root node of the trie will begin with cell number one.

這仍然是可以提高的,注意那些是負(fù)check的單元,對應(yīng)的base單元沒有任何給定的定義,所以,在我們的實(shí)現(xiàn)中,Aoe’s G-link被修改為雙鏈接的列表,使base中的每一個(gè)空閑單元指到前一個(gè)空閑單元。這可以提高插入和刪除的速度。并且為了方便引用列表的頭和尾,我們使用了循環(huán)鏈表,第0個(gè)結(jié)點(diǎn)被指定為列表的入口。并且trie的根結(jié)點(diǎn)開始于第1個(gè)單元。

Definition 4. Let r1, r2, ... , rcm be the free cells in the double-array structure, ordered by position. G-link is defined as follows :

check[0] = -r1

check[ri] = -ri+1 ; 1 <= i <= cm-1

check[rcm] = 0

base[0] = -rcm

base[r1] = 0

base[ri+1] = -ri ; 1 <= i <= cm-1

定義4. 令r1, r2, ... , rcm為雙數(shù)組結(jié)構(gòu)中的空閑單元,以位置排序,G-link的定義如下:

check[0] = -r1

check[ri] = -ri+1 ; 1 <= i <= cm-1

check[rcm] = 0

base[0] = -rcm

base[r1] = 0

base[ri+1] = -ri ; 1 <= i <= cm-1

Then, the searching for the slots for a node with input symbol set P = {c1, c2, ..., cp} needs to iterate only the cells with negative check :

那么,為輸入符號集P = {c1, c2, ..., cp}尋找一個(gè)結(jié)點(diǎn)的單元,只需要在有負(fù)check的單元中迭代。

{find least free cell s such that s > c1}

s := -check[0];

while s <> 0 and s <= c1 do

s := -check[s]

end;

if s = 0 then return FAIL;  {or reserve some additional space}

{continue searching for the row, given that s matches c1}

while s <> 0 do

i := 2;

while i <= p and check[s + ci - c1] < 0 do

i := i + 1

end;

if i = p + 1 then return s - c1;  {all cells required are free, so return it}

s := -check[-s]

end;

return FAIL;  {or reserve some additional space}

The time complexity for free slot searching is reduced to O(cm2). The relocation stage takes O(m2). The total time complexity is therefore O(cm2 + m2) = O(cm2).

搜索slot的時(shí)間復(fù)雜度下降到O(cm2)。重定位步驟占O(m2)。所以全部的時(shí)間復(fù)雜度是O(cm2 + m2) = O(cm2)。

It is useful to keep the free list ordered by position, so that the access through the array becomes more sequential. This would be beneficial when the trie is stored in a disk file or virtual memory, because the disk caching or page swapping would be used more efficiently. So, the free cell reusing should maintain this strategy :

將free list以位置排序是有好處的,這樣訪問位置可以是順序訪問,當(dāng)trie保存在磁盤文件或是虛擬內(nèi)存中,這會是有好處的,因?yàn)榇疟P緩存和交換頁的使用會更有效。所以,空閑單元重用應(yīng)保持這個(gè)策略。

t := -check[0];

while check[t] <> 0 and t < s do

t := -check[t]

end;

{t now points to the cell after s' place}

check[s] := -t;

check[-base[t]] := -s;

base[s] := base[t];

base[t] := -s;

Time complexity of freeing a cell is thus O(cm).

釋放一個(gè)單元的時(shí)間復(fù)雜度是O(cm)。

An Implementation

In my implementation, I exploit the concept of virtual memory to manage large and permenent trie. A base class called VirtualMem divides the data file into pages, and swaps them in as needed. Memory economy and disk caching is thus achieved. Different memory accessing methods are provided so that the page swapping mechanism is hidden from the class users. Meanwhile, byte order is internally managed on-the-fly inside the methods to achieve data portability.

Double-array structure and tail pool are then built upon the VirtualMem base class. Trie class is then created to contain the two data structures, with basic interfaces for trie manipulation.

Download

Update: The double-array trie implementation has been simplified and rewritten from scratch in C, and is now named libdatrie. It is now available under the terms of GNU Lesser General Public License (LGPL):

· libdatrie-0.2.2 (29 April 2009)

· libdatrie-0.2.1 (5 April 2009)

· libdatrie-0.2.0 (24 March 2009)

· libdatrie-0.1.3 (28 January 2008)

· libdatrie-0.1.2 (25 August 2007)

· libdatrie-0.1.1 (12 October 2006)

· libdatrie-0.1.0 (18 September 2006)

SVN: svn co http://linux.thai.net/svn/software/datrie

The old C++ source code below is under the terms of GNU Lesser General Public License (LGPL):

· midatrie-0.3.3 (2 October 2001)

· midatrie-0.3.3 (16 July 2001)

· midatrie-0.3.2 (21 May 2001)

· midatrie-0.3.1 (8 May 2001)

· midatrie-0.3.0 (23 Mar 2001)

References

1. [Knuth1972] Knuth, D. E. The Art of Computer Programming Vol. 3, Sorting and Searching. Addison-Wesley. 1972.

2. [Fredkin1960] Fredkin, E. Trie Memory. Communication of the ACM. Vol. 3:9 (Sep 1960). pp. 490-499.

3. [Cohen1990] Cohen, D. Introduction to Theory of Computing. John Wiley & Sons. 1990.

4. [Johnson1975] Johnson, S. C. YACC-Yet another compiler-compiler. Bell Lab. NJ. Computing Science Technical Report 32. pp.1-34. 1975.

5. [Aho+1985] Aho, A. V., Sethi, R., Ullman, J. D. Compilers : Principles, Techniques, and Tools. Addison-Wesley. 1985.

6. [Aoe1989] Aoe, J. An Efficient Digital Search Algorithm by Using a Double-Array Structure. IEEE Transactions on Software Engineering. Vol. 15, 9 (Sep 1989). pp. 1066-1077.

7. [Virach+1993] Virach Sornlertlamvanich, Apichit Pittayaratsophon, Kriangchai Chansaenwilai. Thai Dictionary Data Base Manipulation using Multi-indexed Double Array Trie. 5th Annual Conference. National Electronics and Computer Technology Center. Bangkok. 1993. pp 197-206. (in Thai)