Project on Text Classification
using pandas (Python)
# import library and read dataset
import pandas as pd
df_reviews = pd.read_csv('IMDB Dataset.csv')
df_reviews
       review                                              sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...    ...                                                 ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[50000 rows x 2 columns]
# take a sample of 10,000 rows to speed up processing and create an imbalanced dataset
# 9000 positives
df_positive = df_reviews[df_reviews['sentiment']=='positive'][:9000]
# 1000 negatives
df_negative = df_reviews[df_reviews['sentiment']=='negative'][:1000]
df_reviews_imd = pd.concat([df_positive,df_negative])
df_reviews_imd.value_counts('sentiment')
sentiment
positive    9000
negative    1000
Name: count, dtype: int64
Dealing with Imbalanced Classes
# make a barplot to show how data is distributed
df_reviews_imd.value_counts('sentiment').plot(kind='bar')
<Axes: xlabel='sentiment'>
Balancing data with .sample()
      review                                              sentiment
0     Drum scene is wild! Cook, Jr. is unsung hero o...  positive
1     I am a big fan of Lonesome Dove and all the bo...  positive
2     It does come out of left field, and REALLY isn...  positive
3     I enjoyed the innocence of this film and how t...  positive
4     The second "Mr. Eko" episode has somewhat less...  positive
...   ...                                                 ...
1995  Stranded in Space (1972) MST3K version - a ver...  negative
1996  I happened to catch this supposed "horror" fli...  negative
1997  waste of 1h45 this nasty little film is one to...  negative
1998  Warning: This could spoil your movie. Watch it...  negative
1999  Quite what the producers of this appalling ada...  negative

[2000 rows x 2 columns]
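The code that produced this balanced frame is not shown. A minimal sketch of one way to do it with .sample(), assuming the majority class is downsampled to the size of the minority class and the index is reset afterwards. The small frame df_imb and the names df_pos_down and df_balanced are hypothetical stand-ins, not the original variables:

```python
import pandas as pd

# Hypothetical stand-in for the imbalanced data: 90 positive, 10 negative rows
df_imb = pd.DataFrame({
    "review": [f"review {i}" for i in range(100)],
    "sentiment": ["positive"] * 90 + ["negative"] * 10,
})

df_pos = df_imb[df_imb["sentiment"] == "positive"]
df_neg = df_imb[df_imb["sentiment"] == "negative"]

# Randomly draw as many positive rows as there are negative rows
df_pos_down = df_pos.sample(n=len(df_neg), random_state=42)

# Stack both classes and renumber the rows from 0
df_balanced = pd.concat([df_pos_down, df_neg]).reset_index(drop=True)

counts = df_balanced["sentiment"].value_counts()
print(counts)
```

Fixing random_state makes the draw reproducible; without it, each run selects a different subset of positive reviews.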
Balancing data with RandomUnderSampler
sentiment
negative    1000
positive    1000
Name: count, dtype: int64
Splitting data into train and test
       review                                              sentiment
11301  Refreshing `lost' gem! Featuring effective dia...  positive
681    Never saw the original movie in the series...I...  negative
7513   Once upon a time, in Sweden, there was a poor ...  positive
1821   At the beginning of the film, you might double...  negative
549    Another Spanish movie about the 1936 Civil War...  positive
...    ...                                                 ...
207    I have seen most, if not all of the Laurel & H...  negative
7616   D.W. Griffith could have made any film he want...  positive
13483  Cardiff, Wales. A bunch of 5 mates are deeply ...  positive
1051   I rented this movie with my friend for a good ...  negative
1028   Jim Carrey is one of the funniest and most gif...  negative

[660 rows x 2 columns]
sentiment
negative    675
positive    665
Name: count, dtype: int64
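The split itself is not shown. The row counts above (660 rows in one frame, 675 + 665 = 1340 in the other) are consistent with holding out 33% of the 2000 balanced rows as a test set. A minimal pandas-only sketch under that assumption follows. The frame df_balanced, the names train and test, and the seed are hypothetical, and scikit-learn's train_test_split is a common alternative:

```python
import pandas as pd

# Hypothetical stand-in for the balanced data: 100 rows per class
df_balanced = pd.DataFrame({
    "review": [f"review {i}" for i in range(200)],
    "sentiment": ["positive"] * 100 + ["negative"] * 100,
})

# Hold out 33% of the rows at random as the test set
test = df_balanced.sample(frac=0.33, random_state=42)

# Everything that was not drawn for the test set becomes the training set
train = df_balanced.drop(test.index)

print(len(train), len(test))
print(train["sentiment"].value_counts())
```

Because the rows are drawn at random, the class counts in each part are only roughly equal, which matches the slightly uneven 675/665 split shown above.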