Self-Improving Bayesian Sentiment Analysis for Twitter

August 27, 2010

That’s quite the mouthful.

Let me start with a huge caveat: I’m not an expert on this, and much of it may be incorrect. I studied Bayesian statistics about fifteen years ago in university, but have no recollection of it (that sounds a bit like Bill Clinton: “I experimented with statistics but didn’t inhale the knowledge”).

Even so, given the increasing quantity of real-time content on the Internet, I find the automated analysis of it fascinating, and hope that something in this post might pique your interest.

Naive Bayes classifier

Bayesian probability, and in particular the Naïve Bayes classifier, is successfully used in many parts of the web, from IMDB ratings to spam filters.

The classifier examines the independent features of an item, and compares those against the features (and classification) of previous items to deduce the likely classification of the new item.

It is ‘naïve’ because the features are assessed independently. For example, we may have hundreds of data points that classify animals. If we have a new data point:

  • 4 legs
  • 65kg weight
  • 60cm height

Each feature might be independently classified as:

  • Dog
  • Human
  • Dog

Although the overall result (“probably a dog”) is likely correct, note that the algorithm didn’t discount “human” as a possible classification for the weight feature when it saw that the animal had 4 legs (even though no human in the previous data had been classified with 4 legs) – that is the “naivety” of the algorithm: each feature is assessed in isolation.

Perhaps surprisingly, this naïve algorithm tends to give pretty good results. The accuracy of those results, though, depends entirely on the volume and the accuracy of classification of the previous dataset against which new data is compared.

Classifying Sentiment

My classification needs were simple: I wanted to classify tweets about customer service as either ‘positive’ or ‘negative’.

In this instance, the ‘features’ we use for comparison are the words of the sentence. Our evidence base might point to ‘awesome’ as a word that is more likely to appear in a ‘positive’ tweet, and ‘fail’ as one more likely to appear in a ‘negative’ tweet.
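To make this concrete, here’s a minimal sketch of how per-word counts can be turned into a naive Bayes decision for a new tweet. This isn’t my production code – the word counts and class totals are invented purely for illustration:

    <?php
    // How many positive/negative tweets each word has appeared in
    // (illustrative numbers only).
    $wordCounts = array(
        'awesome' => array('positive' => 40, 'negative' => 2),
        'fail'    => array('positive' => 1,  'negative' => 55),
        'phone'   => array('positive' => 20, 'negative' => 25),
    );
    $tweetTotals = array('positive' => 4000, 'negative' => 5000);

    function classify($tweet, $wordCounts, $tweetTotals) {
        $words  = preg_split('/\W+/', strtolower($tweet), -1, PREG_SPLIT_NO_EMPTY);
        $scores = array();
        foreach ($tweetTotals as $class => $total) {
            // Work in log space to avoid underflow; start with the class prior.
            $score = log($total / array_sum($tweetTotals));
            foreach ($words as $word) {
                $count = isset($wordCounts[$word]) ? $wordCounts[$word][$class] : 0;
                // Laplace smoothing so unseen words don't zero out a class.
                $score += log(($count + 1) / ($total + 2));
            }
            $scores[$class] = $score;
        }
        arsort($scores);
        return key($scores); // the class with the highest score
    }

    echo classify('My phone is awesome', $wordCounts, $tweetTotals); // positive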

I started with Ian Barber’s excellent PHP class for simple Bayesian classification, but wanted to improve the quality of its results.

The simplest way to do this was to remove all ‘noise’ words from the tweets and classification process – those words that do not imply positivity or negativity, but that may falsely skew the results.

There are plenty of noise word lists around, so I took one of those and removed any words that are relevant to sentiment analysis (e.g. ‘unfortunately’, which appears in the MySQL stopword list, may be useful for identifying negative tweets).
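As a rough sketch, the filtering step is little more than this (the noise-word list here is just a tiny illustrative sample, not the full list I use):

    <?php
    // Tiny illustrative sample of a noise-word list.
    $noiseWords = array('the', 'a', 'is', 'was', 'to', 'of', 'and', 'my', 'for');

    function stripNoiseWords($tweet, $noiseWords) {
        $words = preg_split('/\W+/', strtolower($tweet), -1, PREG_SPLIT_NO_EMPTY);
        // Keep only the words that are not in the noise list.
        return array_values(array_diff($words, $noiseWords));
    }

    print_r(stripNoiseWords('The service was a total fail', $noiseWords));
    // Array ( [0] => service [1] => total [2] => fail )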

This filtering improved things substantially, and I spent quite a lot of time analysing which words were contributing towards each score and adding to the noise word list as appropriate.

Next, I added further noise words that were specific to my context: the words ‘customer’ and ‘service’, for example, appeared in most tweets (searching for that phrase was one of the ways I found relevant tweets to classify), so these were added.

Also, I needed to add the names of all businesses/companies to the list (this is an ongoing task). It turns out that when a company has many, many negative tweets about its customer service, the ‘probability’ that any future tweet mentioning the same name is negative becomes huge. This causes incorrect classification when people tweet about “switching to X from Y”, “X could teach Y a thing or two”, or the occasional genuinely positive tweet about the business. I’m looking at you, Verizon.

I decided to make it a little less ‘naïve’, too, by taking account of some negative prefixes – i.e. using the relationships between certain words. I noticed some false negatives/positives caused by phrases like “is not good” or “isn’t bad”, so I used a regular expression to merge words such as “isnt” or “not” with the word that follows them (so in my code, ‘isntbad’ and ‘notgood’ are treated as features in their own right). This seemed to have a small but noticeable impact on the quality.
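The merging step is roughly the following – the list of negation prefixes here is a simplified subset of what I actually match on:

    <?php
    // Merge a negation word with the word that follows it, so that
    // "not good" becomes the single feature "notgood".
    function mergeNegations($text) {
        $text = strtolower(str_replace("'", '', $text)); // "isn't" -> "isnt"
        return preg_replace('/\b(not|isnt|dont|wont|cant|never)\s+(\w+)/', '$1$2', $text);
    }

    echo mergeNegations("The service isn't bad, but the queue is not good");
    // the service isntbad, but the queue is notgood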

Stemming and N-grams

Some attempted improvements didn’t have an impact.

I tried stemming words – in my case, with a Porter Stemmer PHP class. Stemming reduces words to their root form, so that different tenses and variations of a word are ‘normalized’: ‘recommended’, ‘recommending’ and ‘recommend’ would all be stemmed to the same root.
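For reference, applying the stemmer is a one-liner per word; this assumes the common PorterStemmer PHP class with a static Stem() method (your copy of the class may differ):

    <?php
    require_once 'PorterStemmer.php'; // the Porter Stemmer PHP class

    $words = array('recommended', 'recommending', 'recommend');
    foreach ($words as $word) {
        // All three variations reduce to the same root.
        echo PorterStemmer::Stem($word) . "\n";
    }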

Stemming actually reduced the quality of my results.

Perhaps, where every character matters (in tweets), the chosen variation of a word has significance. For example, in my data (at the time of writing), “recommend” seems to be a neutral word (neither negative nor positive), but “recommended” is positive.

Next, to take my earlier experiment with word relationships further (i.e. the improvement gains by combining ‘not’ and ‘isnt’), I tried to include bigrams (two-word combinations) as classification features, not just unigrams (single words).

This means, for example, that the sentence, “Service is exceptionally bad” is tokenized into the following one- and two-word features:

‘Service’, ‘is’, ‘exceptionally’, ‘bad’, ‘Service is’, ‘is exceptionally’, ‘exceptionally bad’
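A minimal sketch of that tokenization step, for anyone wanting to reproduce it:

    <?php
    // Tokenize a sentence into unigram and bigram features.
    function tokenize($sentence) {
        $words    = preg_split('/\W+/', $sentence, -1, PREG_SPLIT_NO_EMPTY);
        $features = $words; // unigrams
        for ($i = 0; $i < count($words) - 1; $i++) {
            $features[] = $words[$i] . ' ' . $words[$i + 1]; // bigrams
        }
        return $features;
    }

    print_r(tokenize('Service is exceptionally bad'));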

In theory, this should produce more accurate results than my rudimentary isn’t/not method, but the results were not improved. My guess is that as the existing dataset grows larger (I currently have only about 4,000–5,000 tweets each in the positive and negative sets), the bigrams will become more accurate and useful, as the same combinations of words become more frequent and their correlation with classification (negative/positive) more meaningful.

Self-Improving the Dataset

[Screenshot: refining the Twitter sentiment data]

To create the ‘prior’ dataset of 4–5k tweets (the data that new tweets are compared against), I created a small interface (above) that pulls tweets from Twitter and uses any existing data to make a best guess at the negative/positive sentiment. I could quickly tweak/fix the results, submit them back, and get a new set that should be slightly more accurate, based on the newly improved data.

There’s only so much time I can dedicate to building up this corpus though.

As soon as the analysis became fairly accurate at guessing the sentiment of new data, I built in an algorithm to calculate a subjective confidence for each classification. This was based largely on the variation and strength of the words in a tweet.

Each word (‘feature’) has a strength of positive/negative sentiment, based on the number of positive/negative tweets it has previously featured in. For example, in my dataset, the word ‘new’ is fairly positive, but the word ‘kudos’ is extremely positive. By counting the strong words and the variation in positive/negative words, a confidence can be calculated (e.g. a tweet that includes 5 negative words, 1 positive word, 2 extremely negative words and no extremely positive words can be confidently assumed to be negative).
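Sketched out, the confidence check looks something like this – the strength thresholds and the ‘high/low’ cut-off below are invented for illustration, and my real numbers are still being tuned:

    <?php
    // $wordCounts holds per-word positive/negative tweet counts, as before.
    function confidence($words, $wordCounts) {
        $counts = array('positive' => 0, 'negative' => 0,
                        'strongPositive' => 0, 'strongNegative' => 0);
        foreach ($words as $word) {
            if (!isset($wordCounts[$word])) {
                continue; // unseen words contribute nothing
            }
            $pos   = $wordCounts[$word]['positive'];
            $neg   = $wordCounts[$word]['negative'];
            $ratio = ($pos + 1) / ($neg + 1); // relative positive/negative strength
            if ($ratio > 1) { $counts['positive']++; } else { $counts['negative']++; }
            if ($ratio > 5)   { $counts['strongPositive']++; } // "extremely" positive
            if ($ratio < 0.2) { $counts['strongNegative']++; } // "extremely" negative
        }
        // Confident only when one polarity clearly dominates.
        $gap       = abs($counts['positive'] - $counts['negative']);
        $strongGap = abs($counts['strongPositive'] - $counts['strongNegative']);
        return ($gap >= 3 || $strongGap >= 2) ? 'high' : 'low';
    }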

After a few test runs, I was “confident in my confidence” – tweets that were being rated with a high confidence were being classified as negative/positive sentiment with almost 100% accuracy.

What I’ve now done is set up an automated script that checks Twitter every hour for new customer service/support tweets. Each tweet is run through the classifier, and any high-confidence classifications are automatically added to the corpus. This gradually improves the accuracy without manual input, which should in turn make the classifier more confident and increase the rate at which high-confidence tweets are detected and added to the corpus.
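The hourly job is essentially the loop below. fetchTweets() and addToCorpus() are placeholders for the real Twitter search and corpus-storage code; the other functions are the sketches from earlier in this post:

    <?php
    // Run hourly via cron.
    $tweets = fetchTweets('"customer service" OR "customer support"');

    foreach ($tweets as $tweet) {
        $words = stripNoiseWords(mergeNegations($tweet), $noiseWords);
        $class = classify(implode(' ', $words), $wordCounts, $tweetTotals);

        // Only high-confidence classifications feed back into the corpus,
        // so the classifier trains itself on its own most certain guesses.
        if (confidence($words, $wordCounts) === 'high') {
            addToCorpus($tweet, $class);
        }
    }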

It’s learning all by itself!

Next step: Skynet. In PHP.


30 Responses to “Self-Improving Bayesian Sentiment Analysis for Twitter”

  1. steve on August 27, 2010 at 6:38 am

    Hey, I’d done this a while ago, and regret not using it with a bayes class. What i ended up doing is getting a “subjectivity lexicon” database from a university site (sorry, can’t remember) which had a list of about 8,000 words scored as positive, negative or neutral, and basically the same algorithm you used where i throw out stopwords, reverse polarity of words suffixed or prefixed by negatives (e.g. not bad, isn’t good), then add up the points to get a sentiment.

    The one thing i found that really can throw off scores are names and titles of things. Think about this phrase: “i saw the movie Bad Santa today and loved it”. That’d end up being neutral, so what you do is if you’re searching on a topic like the title of something, add related info like actor’s names, similar movies, books etc, and the title itself to the throw-out words.

    Anyway, sounds like you’re off to an amazing start. hit me up if you want that sentiment database. i added some slang words to it too which can also throw you off (e.g. “bad ass”)

  2. Dan on August 27, 2010 at 6:44 am

    Thanks Steve! Some excellent advice.

  3. Chris on August 27, 2010 at 9:05 am

    Great stuff. Any chance you might consider open-sourcing this work? I’m sure there are a lot of people who would love to use this, or assist in improving it?

  4. Matt on August 27, 2010 at 9:31 am

    Very interesting read; I was actually thinking about this exact topic last night (out of the blue), so it was quite creepy to wake up this morning and find it on HN!

  5. Mike Pearce on August 27, 2010 at 10:57 am

    Thoroughly enjoyed that post, and yes, it has piqued my interest – what else can I apply this to in the name of science!

    - Mike

  6. Karthik on August 27, 2010 at 11:34 am

    Brilliant!! Thanks for sharing.

    Any plans to open source the PHP or the database of slang words?

  7. Alexandre Passos on August 27, 2010 at 12:14 pm

    Hi,

    There’s a principled way to do this, either using EM or Gibbs sampling. The clearest introduction is this: http://www.cs.cmu.edu/~tom/pubs/NigamEtAl-bookChapter.pdf . It’s actually quite easy to implement sampling, and I’ve got code lying around for that, if you want.

  8. Chris Nicholls on August 27, 2010 at 12:47 pm

    Hey, this is really great stuff for someone new to the area!

    I did my masters thesis in this field and now I am doing work in it. It is awesome to see that you have come to some of the same conclusions that I have. One thing that I also have had no luck with is stemming. It seems like, as far as text classification problems go, stemming is an artifact from days when memory and processing resources were extremely expensive. It was a way to reduce your feature space without losing too much information. As a general rule I try to avoid stemming (IR is a different story).

    Something to look into is feature selection: it is a way of automatically cutting out words that don’t contribute much weight to any particular class. I have seen significant improvements from it.

    One piece of advice is that if you train/evaluate your classifier and then manually go through and remove the problem words, and then re-train/re-evaluate on the same data, it is ‘cheating’ a little bit. The improvements you see might just be tuning to your specific data. The procedure for this would be to partition your data into two sets, train/tune on one and evaluate on the second.

    But again, great work!

    cheers

  9. srw on August 27, 2010 at 1:19 pm

    I found SVM more precise than Naive Bayes, but it depends on the domain of your texts.
    Also, a suggestion is to use attribute selection to increase precision, for example including only attributes that give more information to the machine learning algorithm.

  10. Michele on August 27, 2010 at 2:36 pm

    This is really cool!

    Have you tried inputting your scores into other types of classifiers yet, like a neural network or support vector machine, like LibSVM?

    That would just be for fun since you already did the hard part, the scoring system.

    Thanks for posting this.

  11. Andrew on August 27, 2010 at 3:38 pm

    This is very cool. The “self-improving” aspect of this actually falls into the area of machine learning known as semi-supervised learning (SSL). More specifically, what you are describing is pretty much exactly “self-training,” a popular and easy to implement SSL technique that has been successful in a number of natural language processing tasks. You may want to search the literature for “self-training” to find tips and tricks to improve it further.

  12. Dan on August 27, 2010 at 4:27 pm

    Thank you all so much for the comments, information and encouragement to pursue this further. I appreciate the time it takes you all to comment here.

    I’ll be traveling for the next few days (so, new commenters, I might not be able to approve comments immediately), but just to let you all know that I will read and research all of your suggestions. And yes, I hope to make the (amateur-ish!) code available soon – it’s based on other people’s code, so it’s only fair to make it available.

  13. Luis Cosio on August 27, 2010 at 7:57 pm

    Any chance you are planning on putting your code in Github?

    Would love to take a peek at it.

  14. » links for 2010-08-27 (Dhananjay Nene) on August 27, 2010 at 8:02 pm

    [...] Self-Improving Bayesian Sentiment Analysis for Twitter RT @newsycombinator: Self-Improving Bayesian Sentiment Analysis for Twitter http://j.mp/99sTIM (tags: via:packrati.us) [...]

  15. Zimbra Hungry on August 27, 2010 at 9:57 pm

    The reason why bigrams may have failed to increase accuracy is it introduces too much correlation into your feature set.

    Naive Bayes cannot handle highly correlated features — remember the naivety assumption assumes independence between features. In the worst case, two perfectly correlated features will skew the judgment according to the square of their influence. Bigrams will of course be highly correlated with the unigrams they contain and with other bigrams that they partly overlap with.

  16. Boris Gorelik on August 29, 2010 at 6:55 pm

    Indeed, SVM is a very powerful tool that can crunch almost any data you provide it; however, it suffers from two major drawbacks (compared to a Naive Bayes classifier):
    1. First, adding new observations to the classifier to fine-tune it to the ever-changing reality is more computationally expensive. You may ignore this problem in your specific case, but one needs to keep it in mind when choosing a classifier.

    2. (The most important one.) SVM, when used with any kernel but a linear one, is a huge black box. You need to trust it blindly. You can never “look inside” the classifier and learn what happens there. I’m not talking only about the fine-tuning Dan was talking about (which I agree is problematic). For example, in Dan’s classifier he might notice that “time” is one of the “negative” features. This might shed light on the fact that many customer service problems are caused by insufficient delivery time. You cannot do this with SVM. Even more, you can fix the problem, and the customers will start praising your delivery time, turning this keyword from “negative” to “positive” (which recursively brings us to #1 above :-])

  17. Benjamin on August 31, 2010 at 9:55 am

    If you try to include bigrams, perhaps one could try to generate a corpus of meaningful two-word phrases.

    So the sentence “Service is exceptionally bad” would be tokenized as follows: first identify phrases, in this case “exceptionally bad”, and only then tokenize the rest:

    Service, is, “exceptionally bad”

    This way one would avoid overcounting?

  18. Dan on August 31, 2010 at 3:47 pm

    @Benjamin – great idea.

  19. Abhi on September 6, 2010 at 7:12 am

    Great explanation, Dan. I was wondering if you are planning on making this open source any time soon. I believe that would be a great help for all of us. Twitgraph has already made theirs open source (however, it is in Python :)) http://code.google.com/p/twitgraph/source/browse/trunk/biz/tweets_analyzer.py

  20. Chris on September 6, 2010 at 12:05 pm

    Hey Dan,

    Just wondered if you had a timeframe for open-sourcing this?

    Thanks,
    Chris.

  21. Abhi on September 9, 2010 at 12:36 am

    bayesian sentiment analysis PHP code: http://www.devcomments.com/Bayesian-Opinion-Mining-i16509.htm (you can just change spam/non-spam to positive/negative)

  22. Abhi on September 9, 2010 at 2:14 am
  23. Jacob on September 9, 2010 at 2:39 am

    You may be interested in my article on how eliminating the low information words can improve classifier accuracy: http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/. It’s in Python with NLTK, but it shouldn’t be too hard to implement the frequency counting & word scoring in PHP.

    I also did some bigram testing, and it did improve accuracy, but the improvement went away once I started using only the high information features: http://streamhacker.com/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/

  24. Leon Derczynski on September 13, 2010 at 1:04 pm

    If you’re only adding to your corpus examples where you’re really confident about the classification, does this improve performance on the more difficult, low-confidence cases?

  25. [...] for each image – data such as the image URL, alt text, and title. The filter could then build a self-improving probability classification system for image [...]

  26. Tom Ott on October 19, 2010 at 9:06 pm

    Hi Dan,

    A friend of mine, whom I’m working with on text mining Twitter, FWD’d me your post. It sounds like you’re on the right track, as I’m following a similar attack strategy.

    I use the Text Mining plugin for Rapidminer and it does all the things you mention in the post above (stemming, n-grams) and allows you to use Bayesian learners to classify the data. You can also use Support Vector Machines to weight neg/pos words too.

    I would suggest you check out Rapidminer; it’s very flexible and open source, and fairly easy to use too.

  27. Bryan Hunsinger on January 25, 2011 at 7:16 pm

    Great article Dan. Any thoughts on sharing this?

  28. marian on January 27, 2011 at 3:58 pm

    Hi
    Have a look at http://www.opfine.com; that is what I do. I don’t do Twitter, but generic financial sentiment news.
    Overall your approach is very similar to mine, though there are a few pieces of advice I would dare to give you based on my system:

    1. I use 1-grams
    2. I don’t use 2-grams
    3. I use 3-grams and higher ( I go to 7-grams)
    4. I analyse 1-grams and 3,4,5,6,7-grams independently (then I combine two analyses and create final ‘sentiment score’)

    I have no idea why 2-grams do not work, but really from 3-grams and higher it starts to work much better.
    I do understand though that Twitter is much shorter text data, so the approach might be different.

    good luck

  29. Mark on June 3, 2011 at 6:54 am
  30. cyhex on July 13, 2011 at 9:27 am

    Take a look at the Twitter sentiment analysis tool http://smm.streamcrab.com; it’s written in Python and uses a Naive Bayes classifier with semi-supervised machine learning.
