Using LIWC for Document-Level Sentiment Analysis

The Text Files

You can download the .txt files for positive and negative sentiment at the following links (click them to view the contents, or right click and choose “Save Link As…” to download)1:

Converting Into Python Regular Expressions

Using these files in their raw format is a bit tricky, however, since the entries are not individual words but wildcard expressions, each of which matches an entire family of positive or negative words. For example, 032-negemo.txt contains the entry troubl*, which matches the words trouble, troubles, troubling, and so on.

So, to work with these files in Python, we’ll need to load the .txt files and then convert each entry into a regular expression. This can be done using the following collection of functions:

import re

def load_liwc_list(filepath):
    """
    :return: A list of words loaded from the file at `filepath`
    """
    with open(filepath, 'r', encoding='utf-8') as infile:
        words = infile.read().split()
    return words

def liwc_to_regex(liwc_list):
    """
    Converts a LIWC expression list into a Python regular expression
    """
    wildcard_reg = [w.replace('*', r'[^\s]*') for w in liwc_list]
    reg_str = r'\b(' + '|'.join(wildcard_reg) + r')\b'
    return reg_str

def num_matches(reg_str, test_str):
    num_matches = len(re.findall(reg_str, test_str))
    return num_matches

def file_to_regex(filepath):
    liwc_list = load_liwc_list(filepath)
    liwc_regex = liwc_to_regex(liwc_list)
    return liwc_regex

# You can call the following helper function if
# you'd like to see the full regular expression
def print_regex(regex_str, wrap_col=70):
    import textwrap
    print(textwrap.fill(regex_str, wrap_col))
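To make the wildcard-to-regex conversion concrete, here is a minimal sketch of the same idea on a tiny, hypothetical entry list (troubl* and hate are stand-ins here, not the contents of the real LIWC files):

```python
import re

# Tiny stand-in for a LIWC entry list (hypothetical entries, not the real file)
liwc_entries = ["troubl*", "hate"]

# Replace each '*' wildcard with a regex that matches any run of
# non-whitespace characters, then join the entries into one alternation
patterns = [w.replace('*', r'[^\s]*') for w in liwc_entries]
regex = r'\b(' + '|'.join(patterns) + r')\b'

print(regex)
# \b(troubl[^\s]*|hate)\b

# re.findall returns the captured group for each match, so every word
# in the "troubl" family is picked up by the single wildcard entry
print(re.findall(regex, "trouble troubles troubling"))
# ['trouble', 'troubles', 'troubling']
```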
We can use these functions to load the .txt files and generate a regex (Regular Expression) string from each of them:
= "./assets/liwc/031-posemo.txt"
pos_fpath = file_to_regex(pos_fpath)
pos_regex = "./assets/liwc/032-negemo.txt"
neg_fpath = file_to_regex(neg_fpath)
neg_regex # Uncomment this line to see the full regular expression
#print_regex(neg_regex)
140]) print_regex(neg_regex[:
\b(dismay[^\s]*|ignorant|poorest|tragic|disreput[^\s]*|ignore|poorly|t
rauma[^\s]*|abandon[^\s]*|diss|ignored|poorness[^\s]*|trembl[^\s]*|abu
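One caveat worth noting: the generated pattern is case-sensitive as written, so a capitalized word like Trouble will not match troubl[^\s]*. If your text contains capitalized words, you can pass re.IGNORECASE when matching. A quick sketch, using a hypothetical two-entry excerpt of the pattern:

```python
import re

# Hypothetical two-entry excerpt of the generated pattern
demo_regex = r'\b(troubl[^\s]*|hate)\b'

# Case-sensitive by default: capitalized "Trouble" does not match
print(len(re.findall(demo_regex, "Trouble ahead")))                 # 0

# Passing re.IGNORECASE makes the match case-insensitive
print(len(re.findall(demo_regex, "Trouble ahead", re.IGNORECASE)))  # 1
```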
Using the Regular Expressions to Generate Sentiment Scores
And now we can use these generated regular expressions to count the number of times “positive” and “negative” words appear in our string! Here we provide two final helper functions for accomplishing this:
def extract_sentiment_data(text):
    # First compute positive sentiment using pos_regex
    pos_count = num_matches(pos_regex, text)
    # Then negative sentiment using neg_regex
    neg_count = num_matches(neg_regex, text)
    # And finally the overall sentiment score as the difference
    sentiment = pos_count - neg_count
    return {
        'pos': pos_count,
        'neg': neg_count,
        'sentiment': sentiment
    }

def compute_sentiment(text):
    full_results = extract_sentiment_data(text)
    # Return just the overall sentiment score
    return full_results['sentiment']
And here we test these helper functions out by creating positive, negative, and neutral test strings and checking the results for these strings:
= "Python is terrible, I hate Python, I despise Python"
neg_test_str = extract_sentiment_data(neg_test_str)
neg_str_results print(f"{neg_test_str}\n{neg_str_results}")
= "Python is wonderful, I love Python, I adore Python"
pos_test_str = extract_sentiment_data(pos_test_str)
pos_str_results print(f"{pos_test_str}\n{pos_str_results}")
= "Python is ok, Python is mid, I guess I can do Python maybe"
neutral_test_str = extract_sentiment_data(neutral_test_str)
neutral_str_results print(f"{neutral_test_str}\n{neutral_str_results}")
Python is terrible, I hate Python, I despise Python
{'pos': 0, 'neg': 3, 'sentiment': -3}
Python is wonderful, I love Python, I adore Python
{'pos': 3, 'neg': 0, 'sentiment': 3}
Python is ok, Python is mid, I guess I can do Python maybe
{'pos': 1, 'neg': 0, 'sentiment': 1}
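Notice that the "neutral" string still scores pos=1, presumably because one of its words (likely ok) matches an entry in the positive list. When a score surprises you, it helps to look at which tokens matched rather than just counting them. A self-contained sketch of this, using a small hypothetical subset of the positive pattern since the real file isn't loaded here:

```python
import re

# Hypothetical subset of the positive-emotion pattern (not the real file)
demo_pos_regex = r'\b(ok|wonderful|love[^\s]*|ador[^\s]*)\b'

text = "Python is ok, Python is mid, I guess I can do Python maybe"

# re.findall returns the matched tokens themselves, so we can see
# exactly which words contributed to the count
print(re.findall(demo_pos_regex, text))
# ['ok']
```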
Computing Sentiment Scores for a DataFrame Column
Above we printed the full results of each sentiment computation by calling extract_sentiment_data(), which returns a dictionary containing the results. If you instead have a DataFrame with a text column that you'd like to perform sentiment analysis on, you can use the simpler compute_sentiment() function to obtain a single number, as in the following code:
import pandas as pd
text_df = pd.DataFrame({
    'text_id': [1, 2, 3],
    'text': [neg_test_str, pos_test_str, neutral_test_str]
})
text_df
|  | text_id | text |
| --- | --- | --- |
| 0 | 1 | Python is terrible, I hate Python, I despise P... |
| 1 | 2 | Python is wonderful, I love Python, I adore Py... |
| 2 | 3 | Python is ok, Python is mid, I guess I can do ... |
text_df['sentiment'] = text_df['text'].apply(compute_sentiment)
text_df
|  | text_id | text | sentiment |
| --- | --- | --- | --- |
| 0 | 1 | Python is terrible, I hate Python, I despise P... | -3 |
| 1 | 2 | Python is wonderful, I love Python, I adore Py... | 3 |
| 2 | 3 | Python is ok, Python is mid, I guess I can do ... | 1 |
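If you also want the pos and neg counts as DataFrame columns (not just the overall score), you can apply extract_sentiment_data() instead and expand each resulting dictionary into columns via pd.Series. A sketch, with small hypothetical patterns standing in for the regexes built from the LIWC files:

```python
import re
import pandas as pd

# Hypothetical stand-ins for the patterns built from the LIWC files
pos_regex = r'\b(wonderful|love[^\s]*|ador[^\s]*|ok)\b'
neg_regex = r'\b(terrible|hate[^\s]*|despis[^\s]*)\b'

def extract_sentiment_data(text):
    pos_count = len(re.findall(pos_regex, text))
    neg_count = len(re.findall(neg_regex, text))
    return {'pos': pos_count, 'neg': neg_count, 'sentiment': pos_count - neg_count}

text_df = pd.DataFrame({'text': ["I love Python", "I hate Python"]})

# Each dict becomes a pd.Series, and apply() stacks them into columns
text_df[['pos', 'neg', 'sentiment']] = text_df['text'].apply(
    lambda t: pd.Series(extract_sentiment_data(t))
)
print(text_df)
```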