Using text mining and sentiment analysis for online forums hotspot detection and forecast
Introduction
In the Internet and information age, online data usually grows explosively, at an exponential rate. The majority of this web data is in unstructured text format that is difficult to decipher automatically. Besides static web pages, unstructured or loosely formatted text often appears in a variety of tangible or intangible dynamic interacting networks [2], [4], [16], [34]. Today these interacting networks are embodied in a variety of heterogeneous online communities, societies and forums. Faced with tremendous amounts of information from various online forums, information seekers usually find it very difficult to extract accurate information that is useful to them. This has motivated research on identifying online forum hotspots, where useful information is quickly exposed to those seekers. Our research aims to provide a comprehensive and timely description of the interacting structural natural groupings of various forums, which will dynamically enable efficient detection of hotspot forums and thus benefit Internet social network members in their decision-making process.
As efficient business intelligence methods, data mining and machine learning provide alternative tools to dynamically process the large amounts of data available online. A more recent technique, sentiment analysis, also referred to as emotional polarity computation, has frequently been employed alongside them in online text mining. The purpose of text sentiment analysis is to determine the attitude of a speaker or writer with respect to a specific topic. The attitude may be a judgment or evaluation, the emotional state of the author when writing, or the intended emotional communication. It is recognized that the performance of sentiment classifiers depends on the domain or topic [22].
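To make the idea of emotional polarity computation concrete, the following is a minimal lexicon-based sketch in Python. It is not the algorithm developed in this paper: the English word list, the weights, and the negation rule are all illustrative assumptions, and the function names `polarity_value` and `describe` are hypothetical. It only shows how a text can be mapped to a signed number whose sign indicates polarity and whose magnitude indicates strength.

```python
# Hypothetical lexicon (word -> signed weight). These entries are
# illustrative assumptions, not the lexicon used in the paper.
LEXICON = {
    "great": 2.0, "win": 1.0, "good": 1.0,
    "bad": -1.0, "lose": -1.0, "terrible": -2.0,
}
NEGATORS = {"not", "never", "no"}


def polarity_value(text: str) -> float:
    """Sum signed word weights; a negator flips the next sentiment word."""
    score = 0.0
    negate = False
    for token in text.lower().split():
        word = token.strip(".,!?;:")
        if word in NEGATORS:
            negate = True
            continue
        weight = LEXICON.get(word)
        if weight is not None:
            score += -weight if negate else weight
            negate = False
    return score


def describe(text: str):
    """Return (polarity label, strength) derived from the signed value."""
    value = polarity_value(text)
    if value > 0:
        label = "positive"
    elif value < 0:
        label = "negative"
    else:
        label = "neutral"
    return label, abs(value)


print(describe("A great win!"))
print(describe("Not good, terrible."))
```

A real system for Chinese forum posts would additionally need word segmentation and a domain-specific lexicon, both of which are outside the scope of this sketch.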
In this paper, we study online forum hotspot detection and forecasting using sentiment analysis and text mining approaches. We develop the approach in two stages: emotional polarity computation, and integrated sentiment analysis based on K-means clustering and support vector machines (SVM). The proposed unsupervised text mining approach groups the forums into clusters, with the center of each cluster representing a hotspot forum within the current time span. Data are collected from the Sina sports forums (website: http://bbs.sports.sina.com.cn/treeforum/App/list.php?bbsid=33&subid=0), which comprise 31 different topic forums and 220,053 posts. Computation indicates that, within the same time window, SVM forecasting achieves results highly consistent with K-means clustering.
The rest of the paper is organized as follows. Section 2 discusses work related to our study. Section 3 presents the models and methodology. Empirical results and discussion are given in Section 4. Finally, Section 5 concludes the paper.
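The clustering stage can be illustrated with a small, self-contained sketch of K-means (Lloyd's algorithm) in Python. Everything specific here is an assumption made for illustration: the forum names, the two features per forum (a normalized post volume and a mean sentiment value), the choice of k = 2, and the deterministic seeding with the first k points. The sketch reports, for each cluster, the forum closest to the cluster center as its representative.

```python
import math


def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm. The first k points seed the centers,
    an assumption chosen only to keep the example deterministic."""
    centers = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, new_assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return centers, assignment


def representatives(names, points, centers, assignment):
    """For each cluster, return the forum closest to the cluster center."""
    result = {}
    for c, center in enumerate(centers):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            best = min(members, key=lambda i: math.dist(points[i], center))
            result[c] = names[best]
    return result


# Hypothetical forums with (normalized post volume, mean sentiment value).
NAMES = ["forum_1", "forum_2", "forum_3", "forum_4", "forum_5", "forum_6"]
POINTS = [(0.9, 0.7), (0.1, -0.2), (1.0, 0.9),
          (0.8, 0.5), (0.2, 0.0), (0.0, -0.4)]

centers, assignment = kmeans(POINTS, k=2)
reps = representatives(NAMES, POINTS, centers, assignment)
print(assignment)
print(reps)
```

Seeding with the first k points keeps the run reproducible; a practical implementation would more likely use several random restarts or a k-means++ style initialization.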
Section snippets
Related work
This section investigates three streams of related work: dynamic cluster analysis of online forums, sentiment analysis of web documents and web text mining using machine learning.
Models and methodology
Our approach is mainly composed of the following steps: data collection and cleansing, text sentiment calculation and marking, hotspot detection based on K-means clustering, and hotspot forecast based on SVM classification. Fig. 1 depicts the conceptual diagram of our approach, in which three modules are defined to integrate text sentiment calculation, K-means and SVM for analyzing forum hotspots.
Module 1 converts Chinese texts into value-based data through text sentiment computation and
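The forecasting step listed above can be sketched as a linear SVM trained by sub-gradient descent on the L2-regularized hinge loss. This is a simplified stand-in written in plain Python, not the SVM configuration used in our experiments: the feature vectors, the +1/-1 hotspot labels (assumed here to come from an earlier clustering step), the learning rate, the regularization strength, and the function names are all hypothetical.

```python
def train_linear_svm(samples, labels, epochs=200, lr=0.1, reg=0.01):
    """Sub-gradient descent on the L2-regularized hinge loss.
    Labels must be +1 (hotspot) or -1 (non-hotspot)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularization term shrinks w on every step.
            w = [wi - lr * reg * wi for wi in w]
            if margin < 1:
                # Margin violated: push the hyperplane toward this sample.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 for a forecast hotspot, -1 otherwise."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical training window: (normalized post volume, mean sentiment),
# labeled +1 if the forum was marked a hotspot in that window.
SAMPLES = [(0.9, 0.7), (1.0, 0.9), (0.8, 0.5),
           (0.1, -0.2), (0.2, 0.0), (0.0, -0.4)]
LABELS = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(SAMPLES, LABELS)

# Forecast for two hypothetical forums in the next time window.
print(predict(w, b, (0.95, 0.8)))
print(predict(w, b, (0.05, -0.3)))
```

Comparing such forecasts with the cluster memberships obtained for the same time window is one way to check how consistent the two methods are.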
Data preparation
Data preparation for the empirical studies primarily includes three tasks: data downloading, data cleansing and data statistics. The data sets used in our experiments are crawled and compiled from the Internet by an automated Java crawler, which consists of two major modules: the target URL list generating module and the HTML page parsing module. We choose to conduct our experiments on the Sina sports community because this is the most popular and prestigious online sports
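The two-module crawler structure described above (a target URL list generator and an HTML page parser) can be sketched as follows. This is a Python illustration using only the standard library, not the Java program used in our experiments. The assumption that `subid` enumerates sub-forums, the `post-title` CSS class, and the sample HTML are all hypothetical; the sketch performs no network access.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

BASE_URL = "http://bbs.sports.sina.com.cn/treeforum/App/list.php"


def generate_urls(bbsid, subids):
    """Module 1: build the target URL list.
    Treating `subid` as a sub-forum index is an assumption."""
    return [
        f"{BASE_URL}?{urlencode({'bbsid': bbsid, 'subid': s})}"
        for s in subids
    ]


class PostTitleParser(HTMLParser):
    """Module 2: collect the text of <a class="post-title"> elements.
    The class name is hypothetical; real pages will differ."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and ("class", "post-title") in attrs:
            self._in_title = True
            self.titles.append("")

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_title = False


def parse_titles(html):
    """Run the parser over one page and return cleaned post titles."""
    parser = PostTitleParser()
    parser.feed(html)
    parser.close()
    return [t.strip() for t in parser.titles]


SAMPLE_PAGE = (
    '<ul><li><a class="post-title" href="/p/1">Match report</a></li>'
    '<li><a href="/about">About</a></li>'
    '<li><a class="post-title" href="/p/2"> Transfer news </a></li></ul>'
)

print(generate_urls(33, [0, 1]))
print(parse_titles(SAMPLE_PAGE))
```

Keeping URL generation separate from parsing means the page layout can change without touching the code that decides which pages to fetch.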
Conclusions and discussions
We have developed an algorithm to automatically analyze the emotional polarity of a text, based on which a value for each piece of text is obtained. The absolute value of this score represents the text's influential power, and its sign denotes the text's emotional polarity. This algorithm is combined with K-means clustering and SVM classification to develop an integrated approach for online sports forums cluster analysis. We apply an unsupervised clustering algorithm to group the forums into various
References (45)
- et al. Mining customer product ratings for personalized marketing. Decision Support Systems (2003)
- et al. Feature selection for classification. Intelligent Data Analysis (1997)
- et al. HealthMap: global infectious disease monitoring through automated classification and visualization of internet media reports. Journal of the American Medical Informatics Association (2008)
- et al. Recurring local sequence motifs in proteins. Journal of Molecular Biology (1995)
- et al. An empirical study of sentiment analysis for Chinese documents. Expert Systems with Applications (2008)
- et al. Incorporating critique and argumentation in DSS. Decision Support Systems (1999)
- Performance evaluation: an integrated method using data envelopment analysis and fuzzy preference relations. European Journal of Operational Research (2009)
- et al. Using DEA-neural network approach to evaluate branch efficiency of a large Canadian bank. Expert Systems with Applications (2006)
- et al. Automatic online news monitoring and classification for syndromic surveillance. Decision Support Systems (2009)
- et al. Visualising sentiments in financial texts? Proceedings of the Ninth International Conference on Information Visualisation (2005)
- The Influence Model: A Tractable Representation for the Dynamics of Networked Markov Chains
- Movie review mining: a comparison between supervised and unsupervised classification approaches
- Iterative structure discovery in graph-based data. International Journal of Artificial Intelligence Techniques
- An approach to text classification using dimensionality reduction and combination of classifiers. Proceedings of the 2004 IEEE International Conference on Information Reuse and Integration
- Fast and exact out-of-core k-means clustering. Fourth IEEE International Conference on Data Mining
- A scalable algorithm for clustering protein sequences
- Global properties of the mapping between local amino acid sequence and local structure in proteins. Proceedings of the National Academy of Sciences of the United States of America
- Predicting the semantic orientation of adjectives
- Dialect classification on printed text using perplexity measure and conditional random fields. IEEE International Conference on Acoustics, Speech and Signal Processing
- Text categorization with SVM: learning with many relevant features
- Relationship algebra for computing in social networks and social network based applications
- Network environment and financial risk using machine learning and sentiment analysis. Human and Ecological Risk Assessment
Nan Li is a Ph.D. candidate at the Department of Computer Science, University of California, Santa Barbara. Her research mainly focuses on business data mining, text mining and sentiment analysis. Her work has been published in or accepted by journals such as Human and Ecological Risk Assessment.
Desheng Dash Wu is an affiliated Professor in RiskLab at the University of Toronto and the Director of the RiskChina Research Center at the University of Toronto. His research interests focus on enterprise risk management, business data mining, and performance evaluation in the financial industry. He is a coauthor of the book Enterprise Risk Management and co-editor in chief of the International Journal of Services Sciences. His work has appeared in journals such as International Journal of Production Research, European J. of Operational Research, IEEE Transactions on Knowledge and Data Engineering, Annals of Operations Research, J. of OR Society, International J. of Production Economics, Expert Systems with Applications, Computers and Operations Research, Human and Ecological Risk Assessment, and International Journal of System Science. He has published more than forty journal papers and coauthored two books. He has served as Editor, Guest Editor or Chair for several journals and conferences. The special issues he has edited include those for Annals of Operations Research, Human and Ecological Risk Assessment, and Production Planning and Control. He is a Member of the Professional Risk Managers' International Association (PRMIA) Academic Advisory Committee.