Elsevier

Decision Support Systems

Volume 48, Issue 2, January 2010, Pages 354-368
Using text mining and sentiment analysis for online forums hotspot detection and forecast

https://doi.org/10.1016/j.dss.2009.09.003

Abstract

Text sentiment analysis, also referred to as emotional polarity computation, has become a flourishing frontier in the text mining community. This paper studies online forum hotspot detection and forecast using sentiment analysis and text mining approaches. First, we create an algorithm to automatically analyze the emotional polarity of a text and to obtain a value for each piece of text. Second, this algorithm is combined with K-means clustering and support vector machine (SVM) to develop an unsupervised text mining approach. We use the proposed text mining approach to group the forums into various clusters, with the center of each representing a hotspot forum within the current time span. The data sets used in our empirical studies are acquired and formatted from Sina sports forums, and span 31 different topic forums and 220,053 posts. Experimental results demonstrate that SVM forecasting achieves highly consistent results with K-means clustering. The top 10 hotspot forums listed by SVM forecasting match 80% of the K-means clustering results. Both SVM and K-means achieve the same results for the top 4 hotspot forums of the year.

Introduction

In the Internet and information age, online data usually grows exponentially. The majority of this web data is in unstructured text format that is difficult to decipher automatically. Beyond static web pages, unstructured or loosely formatted text often appears in a variety of tangible or intangible dynamic interacting networks [2], [4], [16], [34]. Today these interacting networks take the form of heterogeneous online communities, societies and forums. When faced with tremendous amounts of online information from various online forums, information seekers usually find it very difficult to obtain accurate information that is useful to them. This has motivated research on the identification of online forum hotspots, where useful information is quickly exposed to those seekers. Our research aims to provide a comprehensive and timely description of the natural structural groupings of interacting forums, which enables efficient, dynamic detection of hotspot forums and thus benefits Internet social network members in the decision making process.

As efficient business intelligence methods, data mining and machine learning provide alternative tools to dynamically process large amounts of data available online. A more recent technique, sentiment analysis, also referred to as emotional polarity computation, is often employed alongside them in online text mining. The purpose of text sentiment analysis is to determine the attitude of a speaker or a writer with respect to some specific topic. The attitude can be any form of judgment or evaluation, the emotional state of the author when writing, or the intended emotional communication. It is recognized that the performance of sentiment classifiers depends on the domain or topic [22].

In this paper, online forum hotspot detection and forecast are studied using sentiment analysis and text mining approaches. We develop this approach in two stages: emotional polarity computation and integrated sentiment analysis based on K-means clustering and support vector machine (SVM). The proposed unsupervised text mining approach is used to group the forums into various clusters, with the center of each representing a hotspot forum within the current time span. Data are collected from Sina sports forums (website: http://bbs.sports.sina.com.cn/treeforum/App/list.php?bbsid=33&subid=0), which include 31 different topic forums and 220,053 posts. Computation indicates that within the same time window, SVM forecasting achieves highly consistent results with K-means clustering.

The rest of the paper is organized as follows. Section 2 discusses work related to our study. Section 3 presents models and methodology. Empirical results and discussion are given in Section 4. Finally, Section 5 concludes the paper.

Section snippets

Related work

This section investigates three streams of related work: dynamic cluster analysis of online forums, sentiment analysis of web documents and web text mining using machine learning.

Models and methodology

Our approach is mainly composed of the following steps: data collection and cleansing, text sentiment calculation and marking, hotspot detection based on K-means clustering and hotspot forecast based on SVM classification. Fig. 1 depicts the conceptual diagram of our approach, where three modules are defined to integrate text sentiment calculation, K-means and SVM for analyzing forum hotspots.
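The hotspot detection step can be illustrated with a minimal sketch. It is not the authors' implementation: the per-forum features (post count, mean absolute sentiment value), the forum names, the deterministic seeding and the rule "the hotspot is the forum nearest the cluster center" are all assumptions made for illustration, and the SVM forecasting step is not covered.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def hotspot_per_cluster(names, points, centroids, labels):
    """For each cluster, return the forum whose vector lies closest to the centroid."""
    hotspots = {}
    for c, center in enumerate(centroids):
        members = [(math.dist(p, center), n)
                   for n, p, lab in zip(names, points, labels) if lab == c]
        if members:
            hotspots[c] = min(members)[1]
    return hotspots

# Hypothetical per-forum features: (post count, mean absolute sentiment value).
forums = {
    "forum_a": (120.0, 0.9),
    "forum_b": (110.0, 0.8),
    "forum_c": (100.0, 0.85),
    "forum_d": (10.0, 0.1),
    "forum_e": (12.0, 0.2),
    "forum_f": (14.0, 0.15),
}
names = list(forums)
points = [forums[n] for n in names]
centroids, labels = kmeans(points, k=2)
hotspots = hotspot_per_cluster(names, points, centroids, labels)
print(labels, hotspots)
```

In this toy input the post count dominates the distance because the features are not rescaled; a real analysis would likely normalize each feature first.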

Module 1 is to convert Chinese texts into value based data through text sentiment computation and

Data preparation

The data preparation for the empirical studies primarily includes three tasks: data downloading, data cleansing and data statistics. The data sets used in our experiments are crawled and compiled from the Internet by an automatic Java crawling program, which consists of two major modules: the target URL list generating module and the HTML page parsing module. We choose to conduct our experiments on the Sina sports community because this is the most popular and prestigious online sports
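The two-module crawler layout described above can be sketched as follows. This is an illustrative Python outline, not the authors' Java program: the base URL, the `forum` and `page` query parameters and the `class="post"` markup are hypothetical, and the parser is fed a literal HTML string so no network access is involved.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

def build_url_list(base_url, forum_ids, pages):
    """Target-URL module: one listing URL per (forum, page) pair."""
    return [f"{base_url}?{urlencode({'forum': fid, 'page': page})}"
            for fid in forum_ids
            for page in range(1, pages + 1)]

class PostTitleParser(HTMLParser):
    """Page-parsing module: collect the text of <a class="post"> elements."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_post_link = False

    def handle_starttag(self, tag, attrs):
        # Only links marked with the (hypothetical) "post" class are of interest.
        if tag == "a" and dict(attrs).get("class") == "post":
            self._in_post_link = True

    def handle_data(self, data):
        if self._in_post_link and data.strip():
            self.titles.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_post_link = False

# Hypothetical base URL and identifiers, used only to show the output shape.
urls = build_url_list("http://forum.example/list", forum_ids=[1, 2], pages=2)

sample_html = (
    '<ul><li><a class="post" href="/p/1">Match report</a></li>'
    '<li><a href="/about">About</a></li></ul>'
)
parser = PostTitleParser()
parser.feed(sample_html)
print(urls)
print(parser.titles)
```

Keeping URL generation separate from page parsing means either module can be tested on its own, as the offline sample above does for the parser.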

Conclusions and discussions

We have developed an algorithm to automatically analyze the emotional polarity of a text, based on which a value for each piece of text is obtained. The absolute value of the text represents the influential power and the sign of the text denotes its emotional polarity. This algorithm is combined with K-means clustering and SVM classification to develop an integrated approach for online sports forums cluster analysis. We apply an unsupervised clustering algorithm to group the forums into various
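The reading of a signed text value (sign as polarity, absolute value as influential power) can be illustrated with a small sketch. The scoring rule here is an assumption, not the paper's algorithm: it simply sums weights from a toy English lexicon with hypothetical values and splits on whitespace, whereas the Chinese forum posts in the study would require word segmentation and the authors' own scoring scheme.

```python
# Toy lexicon with hypothetical weights: positive words > 0, negative words < 0.
LEXICON = {
    "great": 2.0, "win": 1.0, "good": 1.0,
    "bad": -1.0, "lose": -1.0, "terrible": -2.0,
}

def sentiment_value(text):
    """Sum the lexicon weights of the words in `text`; unknown words count as 0."""
    return sum(LEXICON.get(word, 0.0) for word in text.lower().split())

def polarity(value):
    """Sign of the value: +1 positive, -1 negative, 0 neutral."""
    return (value > 0) - (value < 0)

def influence(value):
    """Magnitude of the value, read as the influential power of the text."""
    return abs(value)

post = "terrible match and a bad referee"
value = sentiment_value(post)
print(value, polarity(value), influence(value))
```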

Nan Li is a Ph.D. candidate at the Department of Computer Science, University of California, Santa Barbara. Her research mainly focuses on business data mining, text mining and sentiment analysis. Her work has been published in or accepted by journals such as Human and Ecological Risk Assessment.

References (45)

  • C. Asavathiratham

    The Influence Model: A Tractable Representation for the Dynamics of Networked Markov Chains

  • P. Chaovalit et al.

    Movie review mining: a comparison between supervised and unsupervised classification approaches

  • J. Coble et al.

    Iterative structure discovery in graph-based data

    International Journal of Artificial Intelligence Techniques

    (2005)
  • J. Gaurav et al.

    An approach to text classification using dimensionality reduction and combination of classifiers

    Proceedings of the 2004 IEEE International Conference on Information Reuse and Integration

    (2004)
  • A. Goswami et al.

    Fast and exact out-of-core k-means clustering

    Fourth IEEE International Conference on Data Mining

    (2004)
  • V. Guralnik et al.

    A scalable algorithm for clustering protein sequences

  • K.F. Han et al.

    Global properties of the mapping between local amino acid sequence and local structure in proteins

    Proceedings of the National Academy of Sciences of the United States of America

    (1996)
  • V. Hatzivassiloglou et al.

    Predicting the semantic orientation of adjectives

  • R.Q. Huang et al.

    Dialect classification on printed text using perplexity measure and conditional random fields

    IEEE International Conference on Acoustics, Speech and Signal Processing

    (2007)
  • T. Joachims

    Text categorization with SVM: learning with many relevant features

  • J.I. Khan et al.

    Relationship algebra for computing in social networks and social network based applications

  • N. Li et al.

    Network environment and financial risk using machine learning and sentiment analysis

    Human and Ecological Risk Assessment

    (2009)
    Desheng Dash Wu is an affiliated Professor in RiskLab at the University of Toronto and the Director of RiskChina Research Center at the University of Toronto. His research interests focus on enterprise risk management, business data mining, and performance evaluation in the financial industry. He is a coauthor of the book Enterprise Risk Management and co-editor in chief of International Journal of Services Sciences. His work has appeared in journals such as International Journal of Production Research, European J. of Operational Research, IEEE Transactions on Knowledge and Data Engineering, Annals of Operations Research, J. of OR Society, International J. of Production Economics, Expert Systems with Applications, Computers and Operations Research, Human and Ecological Risk Assessment, and International Journal of System Science. He has published more than forty journal papers and coauthored 2 books. He has served as Editor/Guest Editor/Chair for several journals/conferences. The special issues he edited include those for Annals of Operations Research, Human and Ecological Risk Assessment, and Production Planning and Control. He is a Member of the Professional Risk Managers' International Association (PRMIA) Academic Advisory Committee.
