Combination of Bayesian and Latent Semantic Analysis with Domain Specific Knowledge
Shen Lu, Richard S. Segall
With the development of information technology, electronic publications have become popular. However, retrieving information from electronic publications is challenging because of the large number of words and the problems of synonymy and polysemy. In this paper, we introduce a new algorithm called Bayesian Latent Semantic Analysis (BLSA). We chose to model text not on individual terms but on associations between words. In addition, the significance of interesting features was improved by expanding the number of similar terms with glossaries. Latent Semantic Analysis (LSA) was chosen to discover significant features, Bayesian posterior probability was used to discover segmentation boundaries, and the Dirichlet distribution was chosen to represent the vector of topic distributions and to calculate the maximum probability of the topics. Experimental results showed that both Pk [8] and WindowDiff [27] decreased by 10% when using BLSA in comparison to Lexical Cohesion on the original data.
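To make the pipeline concrete, the following is a minimal, illustrative Python sketch, not the authors' implementation: it embeds sentences in a low-rank latent space via truncated-SVD LSA, places segmentation boundaries at similarity valleys in that space (a simplified stand-in for the Bayesian posterior test with a Dirichlet prior described above), and scores the result with WindowDiff [27]. All names and parameters here (n_topics, threshold, the toy corpus) are hypothetical.

import numpy as np

def build_term_matrix(sentences):
    # Term-by-sentence count matrix (rows = vocabulary terms).
    vocab = sorted({w for s in sentences for w in s.split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(sentences)))
    for j, s in enumerate(sentences):
        for w in s.split():
            A[index[w], j] += 1.0
    return A

def lsa_embed(A, n_topics=2):
    # Project sentences into a latent space via truncated SVD (LSA).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (np.diag(s[:n_topics]) @ Vt[:n_topics]).T  # one row per sentence

def detect_boundaries(X, threshold=0.5):
    # Place a boundary wherever adjacent sentences are dissimilar in the
    # latent space; a simplified stand-in for the Bayesian posterior step.
    boundaries = []
    for i in range(len(X) - 1):
        a, b = X[i], X[i + 1]
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        if cos < threshold:
            boundaries.append(i + 1)
    return boundaries

def window_diff(ref, hyp, n, k):
    # WindowDiff (Pevzner & Hearst, 2002): fraction of sliding windows of
    # size k in which reference and hypothesis boundary counts disagree.
    count = lambda b, i: sum(1 for x in b if i < x <= i + k)
    errors = sum(count(ref, i) != count(hyp, i) for i in range(n - k))
    return errors / (n - k)

# Toy usage: two fruit sentences followed by two car sentences, with the
# reference boundary after sentence 2.
sents = ["apple fruit sweet", "banana fruit yellow",
         "engine car fast", "wheel car road"]
X = lsa_embed(build_term_matrix(sents), n_topics=2)
print(window_diff(ref=[2], hyp=detect_boundaries(X), n=len(sents), k=2))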
[8] Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K. and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407.
[27] Pevzner, L. and Hearst, M.A. (2002). A critique and improvement of an evaluation metric for text segmentation. Computational Linguistics, vol. 28, no. 1, pp. 19-36.