A quick NLTK tutorial in Python: sentiment analysis of movie reviews

2014/04/28

# http://www.nyu.edu/projects/politicsdatalab/workshops/NLTK_presentation%20_code.py
# http://www.nltk.org/howto/corpus.html

# We have python installed:

$ python
Python 2.6.6 ...

# And we try to use NLTK:

import nltk

	ImportError: No module named nltk

# NLTK does not seem to be installed. Let's look at the module search path:

import sys
for pth in sys.path:
    print pth

/usr/lib64/python26.zip
/usr/lib64/python2.6
/usr/lib64/python2.6/plat-linux2
/usr/lib64/python2.6/lib-tk
/usr/lib64/python2.6/lib-old
/usr/lib64/python2.6/lib-dynload
/usr/lib64/python2.6/site-packages
/usr/lib64/python2.6/site-packages/gtk-2.0
/usr/lib/python2.6/site-packages
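
Reading the search path by eye only shows where Python looks, not what it finds there. A more direct check is to ask the import machinery whether a module can be located. The sketch below is written for Python 3 (importlib.util.find_spec is not available in the Python 2.6 session above), and is_installed is a hypothetical helper name, not part of NLTK or the standard library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located on the current sys.path."""
    # find_spec returns None when no finder can locate the module
    return importlib.util.find_spec(module_name) is not None


# 'json' ships with Python; 'nltk' is only found once it has been installed
print("json installed:", is_installed("json"))
print("nltk installed:", is_installed("nltk"))
```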


# So we first need to install NLTK:

exit()

# As root:

$ yum install python-nltk
...Complete!

# Ok, let's try again:

$ python
Python 2.6.6 ...
import nltk
nltk.download()
d movie_reviews
	...
	Downloading package 'movie_reviews' to /home/jips/nltk_data...
	Unzipping corpora/movie_reviews.zip.
q

# The corpus is now downloaded. Next step: use it.

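import random
from nltk.corpus import movie_reviews

# Each document is a (list of words, category) pair; the categories are 'pos' and 'neg'
documents = [(list(movie_reviews.words(fileid)), category)
        for category in movie_reviews.categories()
        for fileid in movie_reviews.fileids(category)]
# Shuffle so that the train/test split mixes both categories
random.shuffle(documents)

# Use the 2000 most frequent words of the corpus as features.
# This relies on keys() being ordered by decreasing frequency, as in the NLTK 2 used here;
# in NLTK 3 use [w for (w, c) in all_words.most_common(2000)] instead.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = all_words.keys()[:2000]
print word_features[:100]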
	[',', 'the', '.', 'a', 'and', 'of', 'to', "'", 'is', 'in', 's', '"', 'it', 'that', '-', ')', '(', 'as', 'with', 'for', 'his', 'this', 'film', 'i', 'he', 'but', 'on', 'are', 't', 'by', 'be', 'one', 'movie', 'an', 'who', 'not', 'you', 'from', 'at', 'was', 'have', 'they', 'has', 'her', 'all', '?', 'there', 'like', 'so', 'out', 'about', 'up', 'more', 'what', 'when', 'which', 'or', 'she', 'their', ':', 'some', 'just', 'can', 'if', 'we', 'him', 'into', 'even', 'only', 'than', 'no', 'good', 'time', 'most', 'its', 'will', 'story', 'would', 'been', 'much', 'character', 'also', 'get', 'other', 'do', 'two', 'well', 'them', 'very', 'characters', ';', 'first', '--', 'after', 'see', '!', 'way', 'because', 'make', 'life']
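
The feature list is simply the most frequent tokens, which is why it is dominated by punctuation and function words. As a minimal sketch of the same idea without NLTK, the standard library's collections.Counter can rank tokens by frequency. The sentence below is made-up toy data, and the code is written for Python 3.

```python
from collections import Counter

# Toy corpus (invented for illustration, not taken from movie_reviews)
tokens = "the movie was good and the plot was good and the cast was the best".split()

# Count lowercased tokens, as the FreqDist call does for the real corpus
counts = Counter(w.lower() for w in tokens)

# Keep the N most frequent words as features (N = 2 here, 2000 in the tutorial)
top_words = [word for word, count in counts.most_common(2)]
print(top_words)
```

Taking the words from most_common() makes the frequency ordering explicit rather than relying on the order in which keys() happens to return them.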

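def document_features(document):
    # Map each of the 2000 feature words to True/False: does the document contain it?
    document_words = set(document)  # set membership tests are much faster than list ones
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features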

print document_features(movie_reviews.words('pos/cv957_8737.txt'))
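
The real output is a dictionary with 2000 entries, so it is hard to read. To show its shape, here is a small sketch of the same function applied to a three-word feature list. Passing word_features as a second argument is an assumption made only to keep the example self-contained, and the code is written for Python 3.

```python
def document_features(document, word_features):
    """Return a {'contains(word)': bool} dict, one entry per feature word."""
    document_words = set(document)
    return dict(("contains(%s)" % word, word in document_words)
                for word in word_features)


# Hypothetical miniature feature list and document
toy_features = ["good", "bad", "plot"]
feats = document_features(["a", "good", "plot"], toy_features)
print(feats)
```

Every feature word gets an entry, including the ones that are absent from the document: for a naive Bayes classifier, a missing word is evidence too.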

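# Turn every document into a (feature dict, category) pair
featuresets = [(document_features(d), c) for (d,c) in documents]
# Hold out the first 100 reviews for testing and train on the rest
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
# Fraction of the held-out reviews that are labelled correctly
print nltk.classify.accuracy(classifier, test_set)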

	0.86
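
The accuracy figure is just the fraction of test examples whose predicted label matches the true one. A minimal sketch of that computation follows, assuming a predict function that stands in for the trained classifier. toy_predict and the four test examples are invented for illustration, and the code is written for Python 3.

```python
def accuracy(predict, labelled_examples):
    """Fraction of (features, label) pairs for which predict(features) == label."""
    correct = sum(1 for features, label in labelled_examples
                  if predict(features) == label)
    return correct / len(labelled_examples)


def toy_predict(features):
    # Stand-in for a trained classifier: 'pos' whenever the word 'good' is present
    return "pos" if features.get("contains(good)") else "neg"


toy_test_set = [
    ({"contains(good)": True}, "pos"),
    ({"contains(good)": False}, "neg"),
    ({"contains(good)": True}, "neg"),   # misclassified by toy_predict
    ({"contains(good)": False}, "neg"),
]
print(accuracy(toy_predict, toy_test_set))
```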

classifier.show_most_informative_features(5)

	Most Informative Features
	       contains(damon) = True              pos : neg    =     11.2 : 1.0
	 contains(outstanding) = True              pos : neg    =     10.6 : 1.0
	       contains(mulan) = True              pos : neg    =      8.8 : 1.0
	      contains(seagal) = True              neg : pos    =      8.4 : 1.0
	 contains(wonderfully) = True              pos : neg    =      7.4 : 1.0
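
Each ratio above compares how likely a feature is in one category with how likely it is in the other. The sketch below shows one way such a ratio can be computed from document counts. The add-0.5 smoothing and all the counts are assumptions chosen for illustration, so the result is not meant to reproduce NLTK's numbers exactly. The code is written for Python 3.

```python
def likelihood_ratio(n_with_a, n_a, n_with_b, n_b, gamma=0.5):
    """Ratio P(word present | category a) / P(word present | category b).

    n_with_x: documents of category x that contain the word
    n_x:      all documents of category x
    gamma:    smoothing constant (an assumption), spread over the two
              outcomes 'present' and 'absent'
    """
    p_a = (n_with_a + gamma) / (n_a + 2 * gamma)
    p_b = (n_with_b + gamma) / (n_b + 2 * gamma)
    return p_a / p_b


# Hypothetical counts: a word seen in 20 of 100 positive reviews
# and in 2 of 100 negative reviews
ratio = likelihood_ratio(20, 100, 2, 100)
print("pos : neg = %.1f : 1.0" % ratio)
```

Smoothing keeps the ratio finite even when a word never appears in one of the categories.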

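So, it works. By training a naive Bayes classifier with Python and the NLTK library, we can find out which words are most indicative of a positive or a negative review. In this run the classifier labelled 86% of the 100 held-out reviews correctly; the exact figure varies from run to run because the documents are shuffled randomly. With the trained classifier it is now easy to score a new movie comment and decide whether it is positive or negative. What do we call that? Sentiment analysis.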
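
With NLTK, scoring a new comment comes down to tokenizing it and calling classifier.classify(document_features(tokens)). To make visible what happens behind that call, here is a minimal from-scratch sketch of a naive Bayes classifier over word-presence features. It is not NLTK's implementation: the function names, the add-0.5 smoothing, the vocabulary, and the four training sentences are all assumptions made for illustration. The code is written for Python 3.

```python
import math
from collections import defaultdict


def train(labelled_docs, vocabulary, gamma=0.5):
    """Estimate P(label) and P(word present | label) with add-gamma smoothing."""
    doc_counts = defaultdict(int)
    word_counts = defaultdict(lambda: defaultdict(int))
    vocab = set(vocabulary)
    for words, label in labelled_docs:
        doc_counts[label] += 1
        for w in set(words) & vocab:
            word_counts[label][w] += 1
    total = sum(doc_counts.values())
    model = {}
    for label, n in doc_counts.items():
        prior = n / total
        present = dict((w, (word_counts[label][w] + gamma) / (n + 2 * gamma))
                       for w in vocabulary)
        model[label] = (prior, present)
    return model


def score(model, words):
    """Log-score per label: log prior plus log-likelihood of every feature."""
    words = set(words)
    scores = {}
    for label, (prior, present) in model.items():
        total = math.log(prior)
        for w, p in present.items():
            # Absent words count as evidence too, with probability 1 - p
            total += math.log(p if w in words else 1.0 - p)
        scores[label] = total
    return scores


def classify(model, words):
    """Return the label with the highest log-score."""
    scores = score(model, words)
    return max(scores, key=scores.get)


# Invented toy training data, two reviews per category
toy_docs = [
    ("a wonderfully acted and outstanding film".split(), "pos"),
    ("outstanding story and a good cast".split(), "pos"),
    ("a dull plot and bad acting".split(), "neg"),
    ("bad script and a boring film".split(), "neg"),
]
vocabulary = ["outstanding", "wonderfully", "good", "bad", "dull", "boring"]
model = train(toy_docs, vocabulary)

print(classify(model, "an outstanding and good movie".split()))
print(classify(model, "a bad and boring movie".split()))
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow when there are thousands of features, as in the 2000-word setup of the tutorial.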
