Fast tutorial to NLTK using Python. Example of Sentiment Analysis for movie reviews
2014/04/28
# http://www.nyu.edu/projects/politicsdatalab/workshops/NLTK_presentation%20_code.py
# http://www.nltk.org/howto/corpus.html

# We have python installed:
$ python
Python 2.6.6 ...

# And we try to use NLTK:
import nltk
ImportError: No module named nltk

# It seems there is no nltk. Let's verify it:
import sys
for pth in sys.path:
    print pth
/usr/lib64/python26.zip
/usr/lib64/python2.6
/usr/lib64/python2.6/plat-linux2
/usr/lib64/python2.6/lib-tk
/usr/lib64/python2.6/lib-old
/usr/lib64/python2.6/lib-dynload
/usr/lib64/python2.6/site-packages
/usr/lib64/python2.6/site-packages/gtk-2.0
/usr/lib/python2.6/site-packages

# So, we need first to install nltk
exit()

# Go root:
$ yum install python-nltk
...Complete!

# Ok, let's try again:
$ python
Python 2.6.6 ...
import nltk
nltk.download()
d
movie_reviews
...
Downloading package 'movie_reviews' to /home/jips/nltk_data...
Unzipping corpora/movie_reviews.zip.
q

# We have now the corpus downloaded. Next step, use it:
import random
from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = all_words.keys()[:2000]
print word_features[:100]
[',', 'the', '.', 'a', 'and', 'of', 'to', "'", 'is', 'in', 's', '"', 'it', 'that', '-', ')', '(',
 'as', 'with', 'for', 'his', 'this', 'film', 'i', 'he', 'but', 'on', 'are', 't', 'by', 'be', 'one',
 'movie', 'an', 'who', 'not', 'you', 'from', 'at', 'was', 'have', 'they', 'has', 'her', 'all', '?',
 'there', 'like', 'so', 'out', 'about', 'up', 'more', 'what', 'when', 'which', 'or', 'she', 'their',
 ':', 'some', 'just', 'can', 'if', 'we', 'him', 'into', 'even', 'only', 'than', 'no', 'good', 'time',
 'most', 'its', 'will', 'story', 'would', 'been', 'much', 'character', 'also', 'get', 'other', 'do',
 'two', 'well', 'them', 'very', 'characters', ';', 'first', '--', 'after', 'see', '!', 'way',
 'because', 'make', 'life']

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

print document_features(movie_reviews.words('pos/cv957_8737.txt'))
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
0.86
classifier.show_most_informative_features(5)
Most Informative Features
    contains(damon) = True          pos : neg = 11.2 : 1.0
    contains(outstanding) = True    pos : neg = 10.6 : 1.0
    contains(mulan) = True          pos : neg =  8.8 : 1.0
    contains(seagal) = True         neg : pos =  8.4 : 1.0
    contains(wonderfully) = True    pos : neg =  7.4 : 1.0
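A note for readers on newer setups: the session above runs on Python 2.6 with the NLTK package that ships for it. On Python 3 with NLTK 3, `all_words.keys()[:2000]` no longer works, because `keys()` returns a view that cannot be sliced and is not ordered by frequency; `most_common(2000)` is the usual replacement. Here is a minimal sketch of the same feature-extraction step. To keep it runnable without any download it uses `collections.Counter` (the class that NLTK 3's `FreqDist` subclasses) and a made-up three-review toy corpus, so the data, the `toy_documents` name and the extra `vocabulary` parameter are assumptions for illustration only.

```python
from collections import Counter

# Toy stand-in for the movie_reviews corpus (hypothetical data, not from NLTK).
toy_documents = [
    (["a", "wonderfully", "acted", "film"], "pos"),
    (["a", "dull", "film"], "neg"),
    (["outstanding", "film", "a", "joy"], "pos"),
]

# Count lower-cased words over the whole corpus, as nltk.FreqDist does.
all_words = Counter(w.lower() for words, _ in toy_documents for w in words)

# most_common(n) returns (word, count) pairs sorted by decreasing count.
# The tutorial keeps 2000 words; the toy corpus only needs 2.
word_features = [word for word, _ in all_words.most_common(2)]


def document_features(document, vocabulary):
    """Map a token list to {'contains(word)': bool} for each vocabulary word."""
    document_words = set(document)  # a set makes each membership test cheap
    return {"contains(%s)" % word: (word in document_words) for word in vocabulary}


features = document_features(["a", "great", "movie"], word_features)
print(sorted(features.items()))
```

With real data, the same two lines would read `all_words = nltk.FreqDist(...)` and `word_features = [w for w, _ in all_words.most_common(2000)]`.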
So, it works: the classifier reaches an accuracy of 0.86 on the 100 reviews held out for testing. By training a naive Bayes classifier with Python and the NLTK library, we can find out which words are most indicative of a positive or a negative review. With the trained classifier it is then easy to score a new movie comment and decide whether it is positive or negative. What do we call that? Sentiment Analysis.
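How is the score of a single comment calculated? With NLTK itself you would pass the comment's featureset to `classifier.classify(...)` for the label, or to `classifier.prob_classify(...)` for the probabilities. The sketch below is not NLTK's implementation; it is a small pure-Python stand-in that shows the naive Bayes arithmetic on binary `contains(...)` features. The add-one smoothing, the function names and the four-review training set are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def train_naive_bayes(labeled_featuresets, smoothing=1.0):
    """Estimate P(label) and P(feature=value | label) with add-k smoothing."""
    label_counts = defaultdict(int)
    value_counts = defaultdict(int)  # (label, feature, value) -> count
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name, value)] += 1
    total = sum(label_counts.values())

    def log_prob(features, label):
        # log P(label) + sum over features of log P(feature=value | label)
        score = math.log(label_counts[label] / total)
        for name, value in features.items():
            count = value_counts.get((label, name, value), 0)
            # Each feature is True or False, hence 2 * smoothing below.
            denominator = label_counts[label] + 2 * smoothing
            score += math.log((count + smoothing) / denominator)
        return score

    return sorted(label_counts), log_prob


def classify(features, labels, log_prob):
    """Return the most probable label and the normalized probabilities."""
    scores = {label: log_prob(features, label) for label in labels}
    top = max(scores.values())
    # Subtract the maximum before exponentiating to avoid underflow.
    weights = {label: math.exp(s - top) for label, s in scores.items()}
    norm = sum(weights.values())
    probs = {label: w / norm for label, w in weights.items()}
    return max(probs, key=probs.get), probs


# Hypothetical toy training data in the tutorial's featureset format.
train_set = [
    ({"contains(outstanding)": True, "contains(seagal)": False}, "pos"),
    ({"contains(outstanding)": True, "contains(seagal)": False}, "pos"),
    ({"contains(outstanding)": False, "contains(seagal)": True}, "neg"),
    ({"contains(outstanding)": False, "contains(seagal)": False}, "neg"),
]
labels, log_prob = train_naive_bayes(train_set)
comment = {"contains(outstanding)": True, "contains(seagal)": False}
label, probs = classify(comment, labels, log_prob)
print(label, probs)
```

Working in log space keeps the product of 2000 small probabilities from underflowing, which is why the sketch sums logarithms instead of multiplying.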
Categories: CentOS, Snippets of Code
Tags: Natural Language, NLTK, python, Sentiment Analysis