In this work we employ Bayesian surprise to detect interesting or anomalous patterns in discrete sequence data.
Discrete sequential data arise in many domains, including DNA analysis, online transactions, web click-stream navigation, cyber-attacks, financial transactions and, in particular, sociological life-course studies. The difficulty is that each data set has its own characteristics, and many anomalies defy categorization. Because anomalies are by nature infrequent and elusive, we often do not have enough data for a supervised approach. Novelty and surprise, however, play a fundamental role in human and animal behavior for survival, attention and adaptation, which motivates their use here.
We use three sequence data sets (Swiss Health, Sepsis, and BioFamilies). Their sequences are composed of simpler motifs, from which we build Probabilistic Suffix Trees (PSTs) that can capture complex relationships based on motif location and frequency of occurrence. New data that deviate from established motifs in location of appearance, frequency of appearance, or motif composition may represent recurring patterns that differ from the established ones. Bayesian surprise is the result of mismatches between our expectations and actual results; hence the degree of surprise, or anomalousness, attached to a pattern varies with these differences.
Each data set is assessed using Bayesian surprise and several other criteria, which indicate why certain patterns are interesting and why others are not.
Bayesian surprise can detect data with properties that would be missed by information-theoretic measures such as Shannon surprise and entropy.
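
To make the contrast with Shannon surprise concrete, the following is a minimal sketch rather than the method used in this work. It assumes the common formulation of Bayesian surprise as the Kullback-Leibler divergence from prior to posterior beliefs, taken here over a small, finite set of candidate next-symbol models. The two toy models, the uniform prior, and the function names are all hypothetical and chosen only for illustration.

```python
import math

def bayesian_surprise(prior, likelihoods):
    """KL(posterior || prior) in bits, over a finite set of hypotheses.

    Assumption: surprise is measured as the shift in belief over candidate
    models caused by one observation.
    """
    evidence = sum(p * l for p, l in zip(prior, likelihoods))
    posterior = [p * l / evidence for p, l in zip(prior, likelihoods)]
    return sum(q * math.log2(q / p) for q, p in zip(posterior, prior) if q > 0)

def shannon_surprise(prior, likelihoods):
    """-log2 of the prior-predictive probability of the observation, in bits."""
    evidence = sum(p * l for p, l in zip(prior, likelihoods))
    return -math.log2(evidence)

# Two hypothetical next-symbol models (illustrative values only).
models = [
    {"A": 0.89, "B": 0.10, "C": 0.01},
    {"A": 0.10, "B": 0.89, "C": 0.01},
]
prior = [0.5, 0.5]  # uniform prior belief over the two models

def score(symbol):
    """Return (Bayesian surprise, Shannon surprise) for one observed symbol."""
    lik = [m[symbol] for m in models]
    return bayesian_surprise(prior, lik), shannon_surprise(prior, lik)

if __name__ == "__main__":
    for s in "ABC":
        b, h = score(s)
        print(f"{s}: Bayesian = {b:.3f} bits, Shannon = {h:.3f} bits")
```

In this toy setting the symbol C is rare under both models, so its Shannon surprise is high, yet observing it does not change the belief over models and its Bayesian surprise is zero. Observing A is individually less improbable but shifts belief strongly toward the first model. The two measures can therefore rank the same observations differently.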