Hello, everyone! Today I’m going to introduce some basic ideas about the sentiment analysis.
We can judge from the gesture or facial expression (Facial Recognition) and we can also talk with them, give them a call or email/text them (Natural Language). We can extract different kind of information from different information channels.
In sentiment analysis, we are supposed to deal with the natural language (unstructured data). We extract other peoples’ opinion from words, sentences or documents.
It’s a computational treatment to identify, extract, quantify and study affective states and subjective information, from which we can take use of some models and algorithms to compute.1
Affective state means the moods, feelings or attitudes.
Subjective information means the information based on personal feelings.
Before I describe the details, I’d like to explain quintuple terminologies.3
- The target entity is the object that we describe, such as the opinion on Michael Jackson.
- Aspect or feature of the entity is an attribute or component of an entity, such as the screen of a cell phone.
- Opinion holder is the person who expresses the opinion.
- The time is the when the opinion is expressed.
- The polarity or sentiment value* can be defined as positive, neutral and negative. Sometimes, we can use score to assess the sentiment.
As World Wide Web was invented in 1989, there emerged numerous information on the web where we can easily get. With the birth of social media in around 2004, we entered into the era of Web 2.0. More and more user-generated contents, which are also opinion rich contents, become accessible: the online reviews, the microblogging, the comments on items and so on. We can say that it is a really challenging task to mining the opinion online with such kind of large amount of data.
However, with the development of some technologies such as distributed system, and the increasing computational capacity of computers, it become achievable to mining opinion online.
The basic task of sentiment analysis is classify the opinion of the opinion holder on some entities. Based on the classification task, we can group the sentiment analysis as such several types.1
- If the sentiment value is binary or ternary, such as positive, negative and neutral, we can say that it’s polarity type of sentiment analysis.
- If the sentiment value is diverse (for example, angry, sad and happy), we can say that it’s emotional state type. In figure 2, there are really lots of level of emotional state. Generally, we use 6~20 distinctive emotional states.
- If we would like to classify the text into objective or subjective, we can say that it is a subjectivity and objectivity identification type.
Sentiment analysis is really powerful and there are already some applications.5
The review-related websites are the websites extracting real-time opinions about product, political issues online. For example, the Epinions.com is a website established in 1999, where visitors could read new and old reviews about a variety of items to help them decide on a purchase.
What’s more, sentiment analysis can also be a component in other technologies. For example:
- Recommendation system - the website can recommend some product or music to you based on your history of opinions.
- Detection of “flames” - the email system can detect the emails with overly heated or sensitive or inappropriate content.
- In Question answering system, opinion-oriented questions may require different treatment. For example, if you ask Siri, Do you like doing presentation in the morning of Sunday. It would give you the answer.
For the Business Intelligence & Government Intelligence, sentiment analysis can be used for:
- Reputation management
- Trend prediction
- Public opinion monitoring.
It can help the corporations or governments to extract the public mood and attitude.
Here are some examples that you can have a taste on sentiment analysis
The #1~#4 is obvious. But the #5 and #6 might be a little bit tricky. Based on English grammar, they are double negation. But in spoken English, it can still express negation.
For the #11, there is negative word in the positive sentence. If we just use lexicon to count the positive word or negative words, sometimes we might get into trouble.
For the #12, decadent means
moral or cultural decline, which is a negative word. However, the decadent dessert is a kind of dessert. Like this figure.
And for the #14, there is a multilingual problem.
koide9 is homophonic with ‘Quoi de neuf’ in French, which means ‘What’s up’ in English. It is a very popular Internet slang on Twitter even in the English environment. However, newly minted terms can be highly attitudinal but volatile in polarity and often out of known vocabulary.
The roots of sentiment analysis is said to be in the studies on public opinion analysis at the beginning of 20th century and performed by the computational linguistics community in 1990’s. Even though, studies on sentiment analysis has developed for around 30 years. There are still lots of challenges. We can have a taste on the difficulties of sentiment analysis from previous task.
We generalized the challenges at present as such 8 categories:
- Huge lexicon
- Extracting features
- NLP Overheads (Ambiguity/Short Abbreviations/Sacasm…)
- World knowledge
- Domain dependence
- Spam and fake
First of all, I’ll introduce one of the mainly used techniques, which is knowledge-based techniques. Based on the knowledge we have, we can artificially classify some affect word and label them, such as positive or negative.
Tokenization is splitting a string into its desired constituent parts. (Read more on Wikipedia: Lexical analysis)
In Bag of word model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
(1) John likes to watch movies. Mary likes movies too. (2) John also likes to watch football games.
We construct a lexicon contains all the words after tokenization:
[ "John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games" ]
And based on the
#occurrence of the word in the lexicon, we get a vector.
(1) [1, 2, 1, 1, 2, 1, 1, 0, 0, 0] (2) [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
And then we can use some model such navie bayes classifier to classify the text. It is successfully applied in email filtering.
After the tokenization and vectorization, we get the vectors of the text. Then generally we would use machine learning techniques.
Supervised learning is the method with specific labeled data for training. For example, if we have a bunch of documents and we already manually labeled the sentiment values of the text, we can take use of models such as Support Vector Machine to do the classification task.
Unsupervised Learning is the method for data without labels, such as Latent Semantic Analysis (LSA). LSA is similar to Principle Component Analysis and can get the eigenvectors of the text vis Singular Value Decomposition. After selecting the main top eigenvectors, we can analyze a set of concepts (such as
topics) related to the text.
- AFINN: -5 to 5
- SentiWordNet: Ternary
- Twitter Sentiment Corpus
- Twitter Sentiment Analysis Training Corpus
- Multi-Domain Sentiment Dataset
- Stanford Sentiment Treebank
- Kaggle: UMICH SI650 - Sentiment Classification
- Kaggle: Bag of Words Meets Bags of Popcorn
- SemEval-2017 Task 4: Sentiment Analysis in Twitter
- SemEval-2017 Task 5: Fine-Grained Sentiment Analysis on Financial Microblogsand News
5. Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2), 1-135. ↩
6. Hussein, D. M. E. D. M. (2016). A survey on sentiment analysis challenges. Journal of King Saud University-Engineering Sciences. ↩