How many stars are enough to rate an item?

Some time back, I was faced with this question: is there a right number of stars for a star rating system used to rate an item? Different websites use different numbers of stars. For instance, IMDb uses 10 stars to rate a movie; Amazon, TripAdvisor, and a million more use 5 stars; Facebook and many others use 1 star. In practice, 10 seems to be the maximum number of stars used and 1 the minimum, with most rating sites using 5 stars. These scales are often referred to as Likert scales [1], after the psychologist Rensis Likert, who used them to gauge the “intensity” of responses to questions. Likert scales typically range from 2 to 10 points, with 5 or 7 being the most common. Rating scales are a big thing in psychological testing, and there are multiple tests which employ anywhere from 3 stars to more than 15 stars. I suspect measurement theory also has a lot to say about rating scales for “items”; however, I have not researched it, nor do I intend to.

A cursory Wikipedia search reveals that there are 4 different types of scales: nominal (e.g. gender, nationality, ethnicity), ordinal (rank 1, rank 2, etc., signifying an order), interval (temperature in degrees Celsius, calendar dates, etc., where differences between values are meaningful) and ratio (mass, length, etc., where there is a true zero and ratios between values are meaningful) [2]. Generally scales are bipolar, the “poles” being “strongly agree” and “strongly disagree”, whereas the actual stars represent the distance of the response from one of those two poles. A neutral response is equidistant from both poles. Neutral responses do not provide much information, since that is the expectation. The most interesting responses are the ones which are not near the expectation. However, bipolar scales may suffer from different kinds of biases [1].

But I digress. The question I wanted answered was: does it make sense to use 3 stars for rating an item, or 5 stars, or for that matter 10 stars? What if we had a choice of using 1000 stars? Wouldn’t the higher resolution of the response be better for post-scoring analysis? Well, yes, but intuitively using 1000 stars doesn’t make sense at all. That is because even if 1000 stars were used, my brain would end up splitting the stars into 5 categories of 200 stars each, with the first 200 stars for the lowest category and the last 200, up to 1000, for the highest. Thus the question is: what is the correct number of categories for a star system such that the cognitive effort [3] incurred by my brain does not impede my ability to score correctly? Judging from the ample examples all around us, that number of categories has come to be 5.
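To make the 1000-star intuition concrete, here is a minimal sketch of how a very fine scale collapses into a handful of mental categories. The function name `star_category` and its parameters are hypothetical, chosen only for illustration; the equal-width split into bands of 200 stars is the assumption described above, not a measured result.

```python
def star_category(stars, max_stars=1000, categories=5):
    """Collapse a rating in 1..max_stars into 1..categories equal-width bands.

    With the defaults, stars 1-200 fall in category 1 and stars 801-1000
    fall in category 5. Integer arithmetic avoids floating-point surprises
    at the band edges.
    """
    if not 1 <= stars <= max_stars:
        raise ValueError("stars must be between 1 and max_stars")
    # Ceiling division of stars by the band width (max_stars / categories).
    return (stars * categories + max_stars - 1) // max_stars


if __name__ == "__main__":
    for s in (1, 200, 201, 650, 1000):
        print(s, "stars -> category", star_category(s))
```

However many stars are offered, the output only ever takes 5 distinct values, which is the sense in which the extra resolution is lost.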

Sentiment analysis of reviews or comments

This question can possibly be answered by analyzing another response which is usually collected from users, i.e. the review/comment. Sites like Amazon, eBay, etc. also tend to collect responses like comments, reviews, and discussions about the products being sold, the sellers of those products, or even the buyers, in conjunction with the star rating. There are companies like Bazaarvoice whose sole business is to analyze these comments and reviews for qualitative insights into what the commenter or reviewer actually meant beyond the star rating that she gave. As expected, knowing exactly what the customer wants has tremendous commercial value for the producer. Generally, comments/reviews give a higher resolution of intensity than star ratings.

The rationale behind using sentiment analysis of comments to answer the stated question is this: if I were to write a review/comment on an item which I have given 4 stars, I will probably use words which are more “positive”. If I were to give some item 1 star, I will probably use “negative” words, and if I give some item 2.5 stars, I will mostly use “neutral” words. Therefore, if it is observed that people on a site tend to use more positive words overall, it suggests they are also giving star ratings which are higher than average. If negative words are used overall to describe products, it suggests overall ratings are below average. If this rationale holds, we can then reason about how many stars offer the same expressiveness as comments/reviews.

How?

SentiWordNet [4] is a WordNet [5] with sentiment values attached to each word. The sentiment values are computed by training a machine learning algorithm on a large corpus of text. The output is a dictionary of words, each labelled with a positivity value, a negativity value and a neutrality value. There are also algorithms to combine the values of a word used in different contexts. It is really a pretty nifty thing. However, the more you smoosh the values in order to simplify them and make them usable, the more they lose their predictive power. Anyway, here is what I did. I collected all the comments ever given on the Spigit website and found the 40,000 most frequently used words in that corpus. Then I matched those words against the SentiWordNet dictionary and generated another dictionary of the words common to SentiWordNet and the Spigit corpus. Then, for every comment, I counted the number of words which were present in the dictionary that I created and recorded the sentiment values of those words.
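The steps above can be sketched roughly as follows. This is not the code I ran; it is a minimal illustration under stated assumptions. `TOY_LEXICON` is a made-up stand-in for SentiWordNet, already collapsed to a single score per word in the range -1 to +1, and `tokenize`, `build_dictionary` and `score_comment` are hypothetical helper names.

```python
import re
from collections import Counter

# Hypothetical toy lexicon (word -> sentiment score in [-1, 1]).
# It stands in for SentiWordNet; the scores are invented for illustration.
TOY_LEXICON = {
    "great": 0.6, "good": 0.4, "fine": 0.1,
    "idea": 0.0, "slow": -0.2, "bad": -0.5,
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def build_dictionary(comments, lexicon, top_n=40000):
    """Keep the top_n most frequent corpus words that also appear in the lexicon."""
    counts = Counter(word for c in comments for word in tokenize(c))
    frequent = {word for word, _ in counts.most_common(top_n)}
    return {word: lexicon[word] for word in frequent if word in lexicon}


def score_comment(comment, dictionary):
    """Return (number of matched words, their sentiment values) for one comment."""
    matched = [w for w in tokenize(comment) if w in dictionary]
    return len(matched), [dictionary[w] for w in matched]


if __name__ == "__main__":
    comments = ["Great idea, good work", "Bad and slow"]
    dictionary = build_dictionary(comments, TOY_LEXICON)
    for c in comments:
        print(c, "->", score_comment(c, dictionary))
```

Collapsing the positivity and negativity values into one number per word is exactly the kind of smooshing that costs predictive power, so treat the single score here as a convenience.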

Results

For figure 1, the no. of words is the number of matching words from the dictionary, whereas the total count is the total number of times those words were repeated in the comments. The sentiment ranges from -1 for the most negative words to +1 for the most positive words, with 0 for neutral words. It is clear that mostly neutral words are used, with slight positivity or slight negativity. The largest group of words used is the slightly positive one. But the usage of words which are more positive than 0.2 or more negative than -0.2 drops pretty dramatically. People are generally pleasant.
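For readers who want to reproduce the two quantities in figure 1, here is a minimal sketch of the binning, assuming sentiment bins of width 0.2 (the width is my reading of the ranges discussed in this post, and `bin_index` and `sentiment_histogram` are hypothetical names). For each bin it reports the number of matching words and the total count of their occurrences.

```python
import math
from collections import defaultdict


def bin_index(score, width=0.2):
    """Index k of the bin [k*width, (k+1)*width) that contains score.

    Rounding before flooring guards against float noise such as
    0.6 / 0.2 evaluating to 2.9999999999999996.
    """
    return math.floor(round(score / width, 9))


def sentiment_histogram(word_counts, dictionary, width=0.2):
    """Map bin index -> (number of matching words, total occurrences)."""
    bins = defaultdict(lambda: [0, 0])
    for word, count in word_counts.items():
        if word in dictionary:
            k = bin_index(dictionary[word], width)
            bins[k][0] += 1       # one more matching word in this bin
            bins[k][1] += count   # add how often that word was used
    return {k: tuple(v) for k, v in sorted(bins.items())}


if __name__ == "__main__":
    # Invented counts and scores, for illustration only.
    word_counts = {"good": 10, "fine": 7, "ok": 3, "bad": 2, "zzz": 5}
    dictionary = {"good": 0.4, "fine": 0.1, "ok": 0.1, "bad": -0.5}
    for k, (n_words, total) in sentiment_histogram(word_counts, dictionary).items():
        print(f"[{k * 0.2:+.1f}, {(k + 1) * 0.2:+.1f}): words={n_words}, total={total}")
```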

[Figure: words and total count across sentiments]

Figure 2 shows the same data represented as a “rate”. The rate of using a neutral or slightly positive word is greater than the rate of using a word of any other sentiment “category”. Slightly negative words are used more often than slightly positive words. But positive words between 0.2 and 0.4 are used far more often than negative words between -0.2 and -0.4.

Inference about star ratings

Does the above analysis say anything about star ratings? Well, it could mean that people mostly use words in the -0.6 to 0.6 sentiment range to describe their sentiment, and rarely venture into negative words between -0.6 and -1 or positive words between 0.6 and 1. Therefore, a star rating system comprising 6 stars (-0.6 to -0.4 = *, -0.4 to -0.2 = **, -0.2 to 0 = ***, 0 to 0.2 = ****, 0.2 to 0.4 = *****, 0.4 to 0.6 = ******) may be enough for them to express whatever they want to. Maybe even the 5 star rating originated the same way!
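The 6-star mapping above can be written down as a small function. This is only a sketch: `sentiment_to_stars` is a hypothetical name, the choice to treat each bin as including its lower edge is my assumption (the ranges above do not say which side a boundary belongs to), and scores outside -0.6 to 0.6 are simply clamped to the nearest end of the scale.

```python
import math


def sentiment_to_stars(score, low=-0.6, high=0.6, stars=6):
    """Map a sentiment score to 1..stars using equal-width bins over [low, high].

    Scores outside the range are clamped, so anything below low gets 1 star
    and anything above high gets the maximum number of stars.
    """
    score = max(low, min(high, score))
    width = (high - low) / stars
    # Round before flooring so that bin edges such as 0.0 are not pushed
    # into the wrong bin by floating-point noise.
    index = math.floor(round((score - low) / width, 9))
    return min(index, stars - 1) + 1


if __name__ == "__main__":
    for s in (-0.9, -0.5, -0.3, -0.1, 0.0, 0.1, 0.3, 0.5, 1.0):
        print(f"{s:+.1f} -> {'*' * sentiment_to_stars(s)}")
```

Changing `stars` to 5 with the same range gives bins of width 0.24, which is one way to see how a 5 star scale could cover the same span of sentiment.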

References:

  1. http://en.wikipedia.org/wiki/Likert_scale
  2. http://en.wikipedia.org/wiki/Level_of_measurement
  3. http://en.wikipedia.org/wiki/Cognitive_load
  4. http://sentiwordnet.isti.cnr.it/
  5. http://wordnet.princeton.edu/