Fast tutorial to NLTK using Python. Example of Sentiment Analysis for movie reviews

# http://www.nyu.edu/projects/politicsdatalab/workshops/NLTK_presentation%20_code.py
# http://www.nltk.org/howto/corpus.html

# We have python installed:

$ python
Python 2.6.6 ...

# And we try to use NLTK:

import nltk

	ImportError: No module named nltk

# It seems NLTK is not installed. Let's check the module search path:

import sys
for pth in sys.path:
    print pth
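
A more direct check is to attempt the import and catch the error. The sketch below is only an illustration; the helper name `is_installed` is an assumption of this example, not part of NLTK or the standard library.

```python
def is_installed(module_name):
    """Return True if the named module can be imported (hypothetical helper)."""
    try:
        __import__(module_name)
        return True
    except ImportError:
        return False


print(is_installed("sys"))   # a standard-library module
print(is_installed("nltk"))  # depends on whether NLTK is installed
```

If the second call prints `False`, the interpreter cannot find NLTK on any of the paths listed by `sys.path`.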


# So we first need to install NLTK.


# As root:

$ yum install python-nltk

# Ok, let's try again:

$ python
Python 2.6.6 ...
import nltk
nltk.download()
Downloader> d movie_reviews
	Downloading package 'movie_reviews' to /home/jips/nltk_data...
	Unzipping corpora/movie_reviews.zip.

# The corpus is now downloaded. Next step: use it.

import random
from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)), category)
        for category in movie_reviews.categories()
        for fileid in movie_reviews.fileids(category)]
# Shuffle so that the train/test split below mixes both categories
random.shuffle(documents)

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
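# In NLTK 2.x, FreqDist.keys() is sorted by decreasing frequency.
# With NLTK 3, use [w for (w, _) in all_words.most_common(2000)] instead.
word_features = all_words.keys()[:2000]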
print word_features[:100]
	[',', 'the', '.', 'a', 'and', 'of', 'to', "'", 'is', 'in', 's', '"', 'it', 'that', '-', ')', '(', 'as', 'with', 'for', 'his', 'this', 'film', 'i', 'he', 'but', 'on', 'are', 't', 'by', 'be', 'one', 'movie', 'an', 'who', 'not', 'you', 'from', 'at', 'was', 'have', 'they', 'has', 'her', 'all', '?', 'there', 'like', 'so', 'out', 'about', 'up', 'more', 'what', 'when', 'which', 'or', 'she', 'their', ':', 'some', 'just', 'can', 'if', 'we', 'him', 'into', 'even', 'only', 'than', 'no', 'good', 'time', 'most', 'its', 'will', 'story', 'would', 'been', 'much', 'character', 'also', 'get', 'other', 'do', 'two', 'well', 'them', 'very', 'characters', ';', 'first', '--', 'after', 'see', '!', 'way', 'because', 'make', 'life']
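
Most of the top entries above are punctuation and stop words. To see what the frequency ranking does without downloading the corpus, here is a small sketch that uses `collections.Counter` from the standard library (Python 2.7 or later) in place of `nltk.FreqDist`. The sample token list and the `top_words` helper are made up for illustration.

```python
from collections import Counter


def top_words(tokens, n):
    """Return the n most frequent lowercased tokens, most frequent first."""
    counts = Counter(token.lower() for token in tokens)
    return [word for word, _count in counts.most_common(n)]


# Made-up sample tokens, not taken from the movie_reviews corpus.
sample_tokens = ["The", "film", "is", "good", ",", "the", "film", "the"]
print(top_words(sample_tokens, 2))
```

Lowercasing before counting is what makes "The" and "the" fall into the same bucket, as in the `all_words` line above.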

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

print document_features(movie_reviews.words('pos/cv957_8737.txt'))
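
The full output is a dictionary with 2000 entries, so it is hard to read. The same function can be tried on a toy vocabulary. In this sketch the three-word `word_features` list and the sample review are invented for the example.

```python
# Toy vocabulary standing in for the 2000 most frequent corpus words.
word_features = ["good", "bad", "film"]


def document_features(document):
    """Map each vocabulary word to True/False depending on its presence."""
    document_words = set(document)  # set lookup is faster than list lookup
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features


review = ["a", "good", "film"]
print(document_features(review))
```

Every document gets the same keys, one per vocabulary word, and only the True/False values differ. That fixed shape is what the classifier is trained on.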

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, test_set)

classifier.show_most_informative_features(5)

	Most Informative Features
	          contains(damon) = True              pos : neg    =     11.2 : 1.0
	    contains(outstanding) = True              pos : neg    =     10.6 : 1.0
	          contains(mulan) = True              pos : neg    =      8.8 : 1.0
	         contains(seagal) = True              neg : pos    =      8.4 : 1.0
	    contains(wonderfully) = True              pos : neg    =      7.4 : 1.0
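
Each line in this listing is a likelihood ratio: how much more often the feature appears in reviews of one label than in reviews of the other. The sketch below computes such a ratio from raw counts. The counts are invented, and the add-0.5 smoothing is an assumption made for this example, so the numbers are not meant to match the output above.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of P(feature | label a) to P(feature | label b).

    Uses add-0.5 smoothing over two outcomes (feature present or absent),
    which is an assumption made for this sketch.
    """
    p_a = (count_a + 0.5) / (total_a + 1.0)
    p_b = (count_b + 0.5) / (total_b + 1.0)
    return p_a / p_b


# Invented counts: the word appears in 20 of 1000 positive reviews
# and in 2 of 1000 negative reviews.
ratio = likelihood_ratio(20, 1000, 2, 1000)
print("pos : neg = %.1f : 1.0" % ratio)
```

The smoothing keeps the ratio finite when a word never occurs under one of the labels.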

So, it works. By training a naive Bayes classifier with Python and the NLTK library, we can find out which words are most indicative of a positive or a negative movie review. With the trained classifier it is then easy to score a movie comment and decide whether it is positive or negative. What do we call that? Sentiment Analysis.
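
To make the idea of a score concrete, here is a rough sketch that sums the logarithms of the five ratios from the listing above for the words found in a comment. This is a simplification and not what `classifier.classify` does: the trained classifier also uses the label priors and all 2000 features, including the absent ones. The names `WORD_WEIGHTS`, `comment_score` and `comment_label` are hypothetical.

```python
import math

# Ratios copied from the "Most Informative Features" listing.
# Words leaning positive get a positive weight, words leaning negative
# get a negative one.
WORD_WEIGHTS = {
    "damon": math.log(11.2),
    "outstanding": math.log(10.6),
    "mulan": math.log(8.8),
    "seagal": -math.log(8.4),
    "wonderfully": math.log(7.4),
}


def comment_score(comment):
    """Sum the log-ratios of the known words that occur in the comment."""
    words = set(comment.lower().split())
    return sum(weight for word, weight in WORD_WEIGHTS.items() if word in words)


def comment_label(comment):
    """Return 'pos' for a positive score, 'neg' for a negative one, else None."""
    score = comment_score(comment)
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return None


print(comment_label("An outstanding and wonderfully acted film"))
print(comment_label("Another Seagal action movie"))
```

With the real classifier from the tutorial, the equivalent call would be `classifier.classify(document_features(words))` on the tokenized comment.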

  1. yash789
    2017/02/20 at 1:35 pm

    I want the code for product_reviews_1 and product_reviews_2 present in corpus for doing sentiment analysis
