On What Types of Website Does Amazon Affiliates Make Sense?

October 2, 2010

I’m still snowed under with work at the moment so don’t have time for a long post, but thought I’d quickly post this little thing.

I was conversing with Carl Morris of the excellent Sleeveface (amongst other things) last week, and we got to talking about when it makes sense to put Amazon Associates on a site. I’ve tried Amazon Associates on a number of websites (some of which no longer exist, and others which do: 1, 2, 3, 4, 5, 6, 7), and based on my experience, this was my take on it (slightly edited):

I think they only really work if:

  1. The items you’re linking to are high-value (there’s no point trying to earn 40 cents per DVD sale, it’s just not worth it) – think $100+
  2. You can make the most of any Amazon special referral rates (rather than the dismal 4% standard) – think Kindle hardware and seasonal promotions.
  3. You have a website or a specific piece of content that ties in very closely to the Amazon product you’re selling (e.g. product reviews), and
  4. The item(s) aren’t easy to purchase ‘off-the-shelf’ in your local supermarket or city mall.

That’s it for now!

Facebook App 760px/12-Col Design Grid (Fireworks/PNG)

September 26, 2010

Facebook App Design Grid

Apologies for the lack of updates recently; we’re spending every waking hour on project work, so have had little time for writing and sharing (and replying to emails/comments – sorry if you’re still waiting for a response).

In the meantime, here’s something I needed to create for a Facebook game we’re building that’s worth sharing: a 12-column grid at the full app width of 760px (45px columns with 20px margins), inside the Facebook app ‘chrome’ (menu and ad-bar). You’ll need to play around with this to get the height you need (Facebook apps/games can be any height), but hopefully the horizontal grid will help you out if you’re starting to build an app with the Facebook iframe/canvas.

Download the Fireworks/PNG Facebook App Grid Here.

Self-Improving Bayesian Sentiment Analysis for Twitter

August 27, 2010

That’s quite the mouthful.

Let me start with a huge caveat: I’m not an expert on this, and much of it may be incorrect. I studied Bayesian statistics about fifteen years ago in university, but have no recollection of it (that sounds a bit like Bill Clinton: “I experimented with statistics but didn’t inhale the knowledge”).

Even so, given the increasing quantity of real-time content on the Internet, I find the automated analysis of it fascinating, and hope that something in this post might pique your interest.

Naive Bayes classifier

Bayesian probability, and in particular the Naïve Bayes classifier, is successfully used in many parts of the web, from IMDB ratings to spam filters.

The classifier examines the independent features of an item, and compares those against the features (and classification) of previous items to deduce the likely classification of the new item.

It is ‘naïve’ because the features are assessed independently. For example, we may have hundreds of data points that classify animals. If we have a new data point:

  • 4 legs
  • 65kg weight
  • 60cm height

Each feature might be independently classified as:

  • Dog
  • Human
  • Dog

Although the overall result (“probably a dog”) is likely correct, note that the classifier didn’t discount “human” for the weight feature when it saw the 4 legs (even though no human in the previous data had been classified with 4 legs) – that’s the “naivety” of the algorithm.
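
To put it another way, the classifier simply multiplies the independent probabilities together: P(dog | 4 legs, 65kg, 60cm) ∝ P(dog) × P(4 legs | dog) × P(65kg | dog) × P(60cm | dog). Each feature’s probability is looked up on its own, which is why the 65kg weight can still point towards ‘human’ even after the 4 legs have been counted.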

Perhaps surprisingly, this naïve algorithm tends to give pretty good results. The accuracy of the results, though, depends entirely on the volume and accurate classification of the previous dataset against which new data is compared.

Classifying Sentiment

My classification needs were simple: I wanted to classify tweets about customer service as either ‘positive’ or ‘negative’.

In this instance, the ‘features’ that we use for comparison are the words of the sentence. Our evidence base might point to ‘awesome’ as a word that is more likely to appear in a ‘positive’ tweet, and ‘fail’ in a negative one.
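
To make that concrete, here’s a stripped-down sketch of the idea in PHP – this isn’t the class I actually used (more on that in a moment), and the word counts are invented purely for illustration:

// A minimal, hypothetical Naive Bayes sentiment scorer (illustration only).
// $counts holds how many positive/negative tweets each word has appeared in.
$counts = array(
    'awesome' => array('positive' => 40, 'negative' => 2),
    'fail'    => array('positive' => 1,  'negative' => 55),
    'waiting' => array('positive' => 5,  'negative' => 20),
);
$totals = array('positive' => 5000, 'negative' => 5000); // tweets per class

function classify($tweet, $counts, $totals) {
    $words  = preg_split('/\W+/', strtolower($tweet), -1, PREG_SPLIT_NO_EMPTY);
    $scores = array();
    foreach (array('positive', 'negative') as $class) {
        // Start with the prior, then add log P(word | class) for each word
        // (logs turn the multiplication into addition and avoid underflow).
        $score = log($totals[$class] / array_sum($totals));
        foreach ($words as $word) {
            $seen   = isset($counts[$word][$class]) ? $counts[$word][$class] : 0;
            $score += log(($seen + 1) / ($totals[$class] + 2)); // add-one smoothing
        }
        $scores[$class] = $score;
    }
    return ($scores['positive'] > $scores['negative']) ? 'positive' : 'negative';
}

echo classify('Still waiting, what a fail', $counts, $totals); // negative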

I started with Ian Barber’s excellent PHP class for simple Bayesian classification, but wanted to improve the basic quality.

The simplest way to do this was to remove all ‘noise’ words from the tweets and classification process – those words that do not imply positivity or negativity, but that may falsely skew the results.

There are plenty of noise word lists around, so I took one of those and removed any words that are relevant to sentiment analysis (e.g. ‘unfortunately’, which appears in the MySQL stopword list, may be useful for identifying negative tweets).

It improved things substantially, and I spent quite a lot of time analysing which words were contributing towards each score, and adding to the noise word list as appropriate.

Next, I included additional noise words that were specific to my context: the words ‘customer’ and ‘service’ for example appeared in most tweets (I was using this as one of the ways of searching for relevant tweets to classify), so these were added.

Also, I needed to add the names of all businesses/companies to the list (this is an ongoing task). It turns out that when a company has many, many negative tweets about their customer service, the ‘probability’ that any future tweet mentioning the same name is negative becomes huge. This causes incorrect classification when people tweet about “switching to X from Y”, “X could teach Y a thing or two”, or the occasional positive tweet about the business. I’m looking at you, Verizon.
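
The filtering itself is trivial – something along these lines (the word lists here are just short examples, not the real ones):

// Hypothetical noise-word filter: generic stopwords, context words and company names.
$noise = array_merge(
    array('the', 'a', 'is', 'to', 'and'),      // generic stopwords
    array('customer', 'service', 'support'),   // words that appear in almost every tweet
    array('verizon', 'comcast')                // company names (the ongoing task)
);

function remove_noise($words, $noise) {
    return array_values(array_diff($words, $noise));
}

$words = preg_split('/\W+/', strtolower('The customer service from Verizon is a fail'), -1, PREG_SPLIT_NO_EMPTY);
print_r(remove_noise($words, $noise)); // leaves: from, fail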

I decided to make it a little less ‘naïve’, too, by trying to take account of some negative prefixes – i.e. using the relationships between some words. I noticed some false negatives/positives caused by phrases like “is not good” or “isn’t bad”, so I used a regular expression to combine “isnt”, “not”, etc with the word that follows them (so in my code, ‘isntbad’ and ‘notgood’ are treated as words in their own right, separate from ‘bad’ and ‘good’). This seemed to have a small but noticeable impact on the quality.
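
In sketch form, the combining step looks something like this (a simplification of the expression I actually use):

// Glue 'not'/'isn't' onto the following word, so the pair becomes its own
// feature ('notgood', 'isntbad') rather than adding weight to 'good'/'bad'.
function combine_negations($text) {
    $text = str_replace("isn't", 'isnt', strtolower($text));
    return preg_replace('/\b(not|isnt)\s+(\w+)/', '$1$2', $text);
}

echo combine_negations("The queue isn't bad but the food is not good");
// the queue isntbad but the food is notgood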

Stemming and N-grams

Some attempted improvements didn’t have an impact.

I tried stemming words: in my case, with a Porter Stemmer PHP Class. Stemming reduces all words to their root form, so that different tenses and variations of words are ‘normalized’ to the same root. So, ‘recommended’, ‘recommending’ and ‘recommend’ would all be stemmed to the same root.

This reduced the quality of my results.

Perhaps, where every character matters (in tweets), the chosen variation of a word has significance. For example, in my data (at the time of writing), “recommend” seems to be a neutral word (neither negative nor positive), but “recommended” is positive.
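
For reference, the stemming step was just a case of running every token through the stemmer before classification – assuming a stemmer class that exposes a static Stem() method, which is how the common PHP Porter stemmer port is written:

require_once 'PorterStemmer.php'; // whichever Porter stemmer class you're using

foreach (array('recommended', 'recommending', 'recommend') as $word) {
    echo PorterStemmer::Stem($word) . "\n"; // all three are reduced to 'recommend'
}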

Next, to take my earlier experiment with word relationships further (i.e. the improvement gains by combining ‘not’ and ‘isnt’), I tried to include bigrams (two-word combinations) as classification features, not just unigrams (single words).

This means, for example, that the sentence, “Service is exceptionally bad” is tokenized into the following one- and two-word features:

Service, is, exceptionally, bad, Service is, is exceptionally, exceptionally bad
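
Generating those features is straightforward – a quick sketch:

// Tokenize a sentence into unigrams and bigrams (two-word combinations).
function tokenize($sentence) {
    $words    = preg_split('/\s+/', trim($sentence));
    $features = $words; // the unigrams
    for ($i = 0; $i < count($words) - 1; $i++) {
        $features[] = $words[$i] . ' ' . $words[$i + 1]; // the bigrams
    }
    return $features;
}

print_r(tokenize('Service is exceptionally bad'));
// Service, is, exceptionally, bad, Service is, is exceptionally, exceptionally bad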

In theory, this should produce more accurate results than my rudimentary isn’t/not method, but the results were not improved. My guess is that as the existing dataset grows larger (I currently only have about 4,000-5,000 positive and negative tweets each), the bigrams will become more accurate and useful, as the same combinations of words become more frequent and their correlation with classification (negative/positive) more meaningful.

Self-Improving the Dataset

Refining the Twitter sentiment data

To create the ‘prior’ 4-5k tweet dataset (that new data is compared against), I created a small interface (above) that pulls tweets from Twitter and uses any existing data to best guess the negative/positive sentiment. I could quickly tweak/fix the results, submit them back, and get a new set that should be slightly more accurate based on the new improved data.

There’s only so much time I can dedicate to building up this corpus though.

As soon as the analysis became fairly accurate at guessing the sentiment of new data, I added an algorithm to calculate a subjective confidence for each classification, based largely on the variation and strength of the words in a tweet.

Each word (‘feature’) has a strength of positive/negative sentiment, based on the number of positive/negative tweets it has previously appeared in. For example, in my dataset, the word ‘new’ is fairly positive, but the word ‘kudos’ is extremely positive. By counting the strong words and the variation in positive/negative words, a confidence can be calculated (e.g. a tweet that includes 5 negative words, 1 positive word, 2 extremely negative words and no extremely positive words can be confidently assumed to be negative).
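
The confidence calculation isn’t anything sophisticated; in spirit it’s something like this (the thresholds and weighting here are invented purely for illustration):

// Hypothetical confidence measure based on counts of strong and weak sentiment words.
// Each strength is between -1 (extremely negative) and 1 (extremely positive),
// derived from how often the word has appeared in each class.
function confidence($strengths) {
    $pos = $neg = $strong_pos = $strong_neg = 0;
    foreach ($strengths as $s) {
        if ($s > 0)    { $pos++; }
        if ($s < 0)    { $neg++; }
        if ($s > 0.8)  { $strong_pos++; }
        if ($s < -0.8) { $strong_neg++; }
    }
    // Lots of words pointing one way and few the other => high confidence.
    $spread = abs($pos - $neg) + 2 * abs($strong_pos - $strong_neg);
    return min(1, $spread / (max(1, $pos + $neg) + 2));
}

// 5 negative words (2 of them extreme), 1 positive word:
echo confidence(array(-0.9, -0.9, -0.5, -0.4, -0.3, 0.2)); // 1 (confidently negative)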

After a few test runs, I was “confident in my confidence” – tweets that were being rated with a high confidence were being classified as negative/positive sentiment with almost 100% accuracy.

What I’ve now done is set up an automated script that checks Twitter every hour for new customer service/support tweets. Each tweet is run through the classifier, and any high-confidence classifications are automatically added to the corpus. This gradually improves the accuracy without any manual input, which in turn should make the classifier more confident and increase the rate at which new high-confidence tweets are detected and added.
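
The hourly script itself is nothing clever – roughly this, with the function names standing in for my own classifier and confidence code:

// Hypothetical hourly cron job: classify new tweets and feed confident
// guesses straight back into the training corpus.
$tweets = fetch_recent_tweets('"customer service"'); // search the Twitter API
foreach ($tweets as $tweet) {
    $sentiment  = classify_sentiment($tweet);        // the Bayesian classifier
    $confidence = classification_confidence($tweet); // the measure described above
    if ($confidence > 0.9) {
        // High confidence: trust the guess and add it to the corpus, so the
        // next run has slightly better data to work from.
        add_to_corpus($tweet, $sentiment);
    }
    // Low-confidence tweets are left alone (or queued for manual review).
}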

It’s learning all by itself!

Next step: Skynet. In PHP.

Migrating from Twitter Basic Authentication to OAuth Credentials

August 13, 2010

At the end of August 2010, all Twitter apps that use Basic Authentication to post/query the API will no longer work. Apps need to migrate to OAuth authentication, but this can be a little tricky. I’ve created something for my particular use-case that you might find useful too.

I have a number of automated ‘bot’ accounts that use the Twitter API to post status updates, including @freelondon, @freenewyork, @reboundfinder, @twitexperiment and 17 accounts that I follow/unfollow automatically via the API.

Now, I don’t really need the full security benefits of OAuth, because I’m not authenticating third-party Twitter accounts – I’m just posting to MY accounts, so OAuth is a little over-the-top. I already know MY username/password, so passing them to the Twitter API is no big deal. But I have to switch to OAuth, because everyone has to.

OAuth is a pain-in-the-posterior because you have to ‘authenticate’ all of your accounts against registered ‘applications’. This means that I don’t just post my username/password to the API for my 17 accounts in order to follow/unfollow, but I have to set up a registered ‘app’ and go through the authentication process for each account. I don’t have the power to ask Twitter for an XAuth account, so this also entails a browser-based authentication process. For each account. And then updating the code to use the new credentials.

Anyway, I have a bunch of ‘apps’ that I need to validate my accounts against, so I built a very simple form that allows me to input the details (token/secret) of each registered app, and then authenticate each (logged in) user, to grab the new token/secret authentication details for each user/app combination. I can then use these in my updated code to make OAuth based requests to the API.

If you need to get OAuth tokens/secrets for users too, so that you can change from Basic to OAuth, you may find that my form saves you a little time – just input your app token/secret, and authenticate each user in turn. You’ll get the token/secret back for each user, which you can use instead of the un/pw to access the API via OAuth.

You can find the form here:

http://danzambonini.com/convert-to-oauth/index.php

However, you may (and should) find the idea of handing over your app secret/token to a third-party site a little dodgy. So, I’ve also made the source available, so you can install the form on your own server:

http://danzambonini.com/convert-to-oauth/index.txt

Note that you’ll also need to download Abraham’s PHP Twitter OAuth library, and put it in an ./oauth/ directory in the same directory as the script.

What this form does is allow you to enter your app’s token/secret, and then authenticate the currently logged-in Twitter user, returning the token/secret for that particular user. You can then update your bot code so that it uses Abraham’s OAuth library with something like:

$connection = new TwitterOAuth(CONSUMER_KEY, CONSUMER_SECRET, USER_TOKEN, USER_SECRET);
$connection->post('statuses/update', array('status' => 'This is the new updated status text'));

As you can see, once you have the key/secret for both an app and each user, and you use Abraham’s library, it’s not that much more difficult than Basic Authentication – it’s just that you have to authenticate each of your users to get their token/secret, rather than using their un/pw.
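
If you’d rather do the authentication dance yourself instead of using my form, with Abraham’s library it looks roughly like this (this is from memory, so treat it as a sketch – and CALLBACK_URL is just a placeholder for whatever callback URL you’ve registered for your app):

require_once './oauth/twitteroauth.php'; // Abraham's library
session_start();

// Step 1: ask Twitter for a request token and send the user off to authorize the app.
$connection    = new TwitterOAuth(CONSUMER_KEY, CONSUMER_SECRET);
$request_token = $connection->getRequestToken(CALLBACK_URL);
$_SESSION['oauth_token']        = $request_token['oauth_token'];
$_SESSION['oauth_token_secret'] = $request_token['oauth_token_secret'];
header('Location: ' . $connection->getAuthorizeURL($request_token['oauth_token']));

// Step 2 (on the callback page): swap the request token for the user's permanent
// token/secret – these are the values you store and re-use in your bot code.
$connection   = new TwitterOAuth(CONSUMER_KEY, CONSUMER_SECRET,
    $_SESSION['oauth_token'], $_SESSION['oauth_token_secret']);
$access_token = $connection->getAccessToken($_REQUEST['oauth_verifier']);
echo $access_token['oauth_token'] . ' / ' . $access_token['oauth_token_secret'];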

On a final note, it’s worth mentioning the single access-token short-cut if your use-case involves a single Twitter account (thanks ffffelix for this).

Guess The Website By The Palette

August 7, 2010

As part of another project I’m working on, I’ve built something that extracts palettes from websites – not just hex codes from the CSS, but the proportion of ‘on screen’ colours, including images.
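
The pixel-counting part is simple enough with GD – a stripped-down sketch (this assumes you already have a screenshot of the page saved as a truecolour PNG; the real version does a fair bit more than this):

// Count on-screen colours in a screenshot, quantized so near-identical shades group together.
$img     = imagecreatefrompng('screenshot.png');
$width   = imagesx($img);
$height  = imagesy($img);
$colours = array();

for ($x = 0; $x < $width; $x++) {
    for ($y = 0; $y < $height; $y++) {
        $rgb = imagecolorat($img, $x, $y);
        $r = (($rgb >> 16) & 0xFF) & ~31; // round each channel down to a multiple of 32
        $g = (($rgb >> 8)  & 0xFF) & ~31;
        $b = ( $rgb        & 0xFF) & ~31;
        $key = sprintf('#%02x%02x%02x', $r, $g, $b);
        $colours[$key] = isset($colours[$key]) ? $colours[$key] + 1 : 1;
    }
}

arsort($colours);
// Each count divided by the total number of pixels gives that colour's on-screen proportion.
foreach (array_slice($colours, 0, 10, true) as $hex => $count) {
    printf("%s: %.1f%%\n", $hex, 100 * $count / ($width * $height));
}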

I’ve built a palette – with proportions – for a number of popular websites. Can you guess which is which? Here are the websites (in no particular order) – you can check the results by looking at the filename of each image (don’t hover over an image unless you want the answer!).

  • Amazon.com (note that the website featured a large ‘letter from Bezos’ when I performed the analysis)
  • Apple.com
  • BBC.co.uk
  • Facebook.com (logged out)
  • Flickr.com
  • Google.com
  • IMDB.com
  • Twitter.com (logged out)
  • Wikipedia.org

Note that the proportions are taken with the assumption of a 1024px-wide browser window, and the full height of each home page.