GSP algorithm

GSP algorithm (Generalized Sequential Pattern algorithm) is an algorithm used for sequence mining. The algorithms for solving sequence mining problems are mostly based on the a priori (level-wise) algorithm. One way to use the level-wise paradigm is to first discover all the frequent items in a level-wise fashion. It simply means counting the occurrences of all singleton elements in the database. Then, the transactions are filtered by removing the non-frequent items. At the end of this step, each transaction consists of only the frequent elements it originally contained. This modified database becomes an input to the GSP algorithm. This process requires one pass over the whole database.

GSP algorithm makes multiple database passes. In the first pass, all single items (1-sequences) are counted. From the frequent items, a set of candidate 2-sequences are formed, and another pass is made to identify their frequency. The frequent 2-sequences are used to generate the candidate 3-sequences, and this process is repeated until no more frequent sequences are found. There are two main steps in the algorithm.

Algorithm

   F1 = the set of frequent 1-sequence
   k=2,
   do while F(k-1)!= Null;
       Generate candidate sets Ck (set of candidate k-sequences);
           For all input sequences s in the database D
               do
           Increment count of all a in Ck if s supports a
           Fk = {a ∈ Ck such that its frequency exceeds the threshold}
               k = k+1;
           Result = Set of all frequent sequences is the union of all Fks
               End do
   End do

The above algorithm looks like the Apriori algorithm. One main difference is however the generation of candidate sets. Let us assume that:

A → B and A → C

are two frequent 2-sequences. The items involved in these sequences are (A, B) and (A,C) respectively. The candidate generation in a usual Apriori style would give (A, B, C) as a 3-itemset, but in the present context we get the following 3-sequences as a result of joining the above 2- sequences

A → B → C, A → C → B and A → BC

The candidategeneration phase takes this into account. The GSP algorithm discovers frequent sequences, allowing for time constraints such as maximum gap and minimum gap among the sequence elements. Moreover, it supports the notion of a sliding window, i.e., of a time interval within which items are observed as belonging to the same event, even if they originate from different events.

See also

Sequence mining

References

External links

This article is issued from Wikipedia - version of the 9/7/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.