That’s quite the mouthful.
Let me start with a huge caveat: I’m not an expert on this, and much of it may be incorrect. I studied Bayesian statistics about fifteen years ago in university, but have no recollection of it (that sounds a bit like Bill Clinton: “I experimented with statistics but didn’t inhale the knowledge”).
Even so, given the increasing quantity of real-time content on the Internet, I find the automated analysis of it fascinating, and hope that something in this post might pique your interest.
Naive Bayes classifier
Bayesian probability, and in particular the Naïve Bayes classifier, is successfully used in many parts of the web, from IMDB ratings to spam filters.
The classifier examines the independent features of an item, and compares those against the features (and classification) of previous items to deduce the likely classification of the new item.
It is ‘naïve’ because the features are assessed independently. For example, we may have hundreds of data points that classify animals. If we have a new data point:
- 4 legs
- 65kg weight
- 60cm height
Each feature might be independently classified as:
Although the overall result (“probably a dog”) is likely correct, note that it didn’t remove/discount “human” from the classification of weight when it saw that it had 4 legs (and no human had been classified with 4 legs in previous data) – because of the “naivety” of the algorithm.
Perhaps surprisingly, this naïve algorithm tends to give pretty good results. The accuracy of the results, though, depends entirely on the volume and accurate classification of the previous dataset, to which new data is compared against.
My classification needs were simple: I wanted to classify tweets about customer service as either ‘positive’ or ‘negative’.
In this instance, the ‘features’ that we use for comparison are the words of the sentence. Our evidence-base might point to ‘awesome’ as being a word that is more likely to result in a ‘positive’ tweet, and ‘fail’ as a negative tweet.
I started with Ian Barber’s excellent PHP class for simple Bayesian classification, but wanted to improve the basic quality.
The simplest way to do this was to remove all ‘noise’ words from the tweets and classification process – those words that do not imply positivity or negativity, but that may falsely skew the results.
There are plenty of noise word lists around, so I took one of those and removed any words that are relevant to sentiment analysis (e.g. ‘unfortunately’, that appears in the MySQL list, may be useful for identifying negative tweets).
It improved things substantially, and I spent quite a lot of time analysing which words were contributing towards each score, and adding to the noise word list as appropriate.
Next, I included additional noise words that were specific to my context: the words ‘customer’ and ‘service’ for example appeared in most tweets (I was using this as one of the ways of searching for relevant tweets to classify), so these were added.
Also, I needed to add the names of all businesses/companies to the list (this is an ongoing task). It turns-out that when a company has many, many negative tweets about their customer service, the ‘probability’ that any future tweet mentioning the same name is negative becomes huge. This causes incorrect classification when people tweet about “switching to X from Y”, “X could teach Y a thing or two”, or the occasional positive tweet about the business. I’m looking at you, Verizon.
I decided to make it a little less ‘naïve’, too, by trying to take account of some negative prefixes – i.e. using the relationships between some words. I noticed some false negatives/positives that were affected by phrases like “is not good” or “isn’t bad”, so used a regular expression to combine any words that started with “isnt” or “not”, etc (so in my code, ‘isntbad’ and ‘notgood’ are treated as separate words). This seemed to have a small but noticeable impact on the quality.
Stemming and N-grams
Some attempted improvements didn’t have an impact.
I tried stemming words: in my case, with a Porter Stemmer PHP Class. Stemming reduces all words to their root form, so that different tenses and variations of words are ‘normalized’ to the same root. So, ‘recommended’, ‘recommending’ and ‘recommend’ would all be stemmed to the same root.
This reduced the quality of my results.
Perhaps, where every character matters (in tweets), the chosen variation of a word has significance. For example, in my data (at the time of writing), “recommend” seems to be a neutral word (neither negative nor positive), but “recommended” is positive.
Next, to take my earlier experiment with word relationships further (i.e. the improvement gains by combining ‘not’ and ‘isnt’), I tried to include bigrams (two-word combinations) as classification features, not just unigrams (single words).
This means, for example, that the sentence, “Service is exceptionally bad” is tokenized into the following one- and two-word features:
Service, is, exceptionally, bad, Service is, is exceptionally, exceptionally bad
In theory, this should produce more accurate results than my rudimentary isn’t/not method, but the results were not improved. My guess is that as the existing dataset grows larger (I currently only have about 4,000-5,000 positive and negative tweets each), the bigrams will become more accurate and useful, as the same combinations of words become more frequent and their correlation with classification (negative/positive) more meaningful.
Self-Improving the Dataset
To create the ‘prior’ 4-5k tweet dataset (that new data is compared against), I created a small interface (above) that pulls tweets from Twitter and uses any existing data to best guess the negative/positive sentiment. I could quickly tweak/fix the results, submit them back, and get a new set that should be slightly more accurate based on the new improved data.
There’s only so much time I can dedicate to building up this corpus though.
As soon as the analysis became fairly accurate at guessing the sentiment of new data, I built in an algorithm to calculate a subjective confidence of the probability. This was built largely on the variation and strength of words in a tweet.
Each word (‘feature’) has a strength of positive/negative sentiment, based on the number of positive/negative tweets it is previously featured in. For example, in my dataset, the word ‘new’ is fairly positive, but the word ‘kudos’ is extremely positive. By calculating the number of strong words and variation in positive/negative words, a confidence can be calculated (e.g. a tweet that includes 5 negative words, 1 positive word, 2 extremely negative words and no extremely positive words can be confidently assumed to be negative).
After a few test runs, I was “confident in my confidence” – tweets that were being rated with a high confidence were being classified as negative/positive sentiment with almost 100% accuracy.
What I’ve now done is set up an automated script that checks Twitter every hour for new customer service/support tweets. Each tweet is run through the classifier, and any high-confidence classifications are automatically added to the corpus, thereby gradually improving the accuracy without manual input, which should also allow it to gradually become more confident and increase the rate at which high-confidence tweets are detected and added to the corpus.
It’s learning all by itself!
Next step: Skynet. In PHP.