Sentiment Analysis in Financial News
ML Modeling with Multinomial Logistic Regression, CNN, and LSTM
Introduction
Sentiment analysis is an application of Natural Language Processing (NLP) that quantifies subjective information: models extract information from opinion-based statements and determine the sentiment, or emotion, expressed in the statement [7]. In particular, models usually identify and label positive, negative, and neutral sentiment in statements and documents.
Source: Symeon Symeonidis
As with other news headlines, financial news headlines generally carry the same sentiment as the stories they summarize. Furthermore, financial news headlines usually correlate closely with investor confidence [2]. Thus, identifying the sentiment of these headlines can aid predictions of market volatility, trading patterns, and stock prices. A model that can consistently label the sentiment of financial news headlines with high accuracy would have many further applications, including measuring general investor sentiment about the market and predicting expected economic trends [6].
Purpose
The purpose of this project is to build a model that can accurately determine positive, neutral, or negative sentiment in financial news headlines. Our group applies a supervised learning model, multinomial logistic regression, to achieve this goal. Furthermore, our group applies deep learning models, including convolutional neural networks (CNNs) and long short-term memory (LSTM) networks, to conduct financial sentiment analysis. We then compare the results of all three models and discuss the strengths of each.
Choosing our Models
We first tested a conventional supervised learning model to see what accuracy could be achieved on sentiment analysis without deep learning. We chose multinomial logistic regression because it requires relatively few weights to train and is well documented in published papers [13].
The remaining models were deep learning models. Convolutional neural networks use convolutional layers, which slide a kernel over the input data. CNNs have found applications in NLP because word embeddings make it possible for convolutional layers to capture semantic information and relations between individual words [7]. Through these convolutional layers, partial context can be captured, so we expected a CNN to outperform the conventional supervised learning model we tested [7].
Source: Sumit Saha
We also tested an LSTM model, which is a type of recurrent neural network (RNN). Recurrent neural networks are neural networks into which information is fed sequentially. For text and NLP, the RNN produces a hidden state from each word, and the next hidden state is generated from the previous hidden state and the next word [11]. LSTM models are a subtype of RNN that improve on the basic RNN by selectively choosing which information from previous states to remember and which to forget. We used an LSTM model because LSTMs and RNNs capture sequential information, and, since text occurs sequentially, these models may best capture prior context in order to correctly determine sentiment [4].
Source: Pedro Torres Perez
Dataset
For the financial news dataset, we used the Financial PhraseBank dataset, which is available in a refined form on Kaggle: https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news. The dataset contains 4,846 datapoints, each consisting of one of three labels (positive, negative, or neutral) and the headline text. Note that the texts are considerably shorter than those in other sentiment analysis datasets, which may affect the results of our models.
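The snippet below is a minimal sketch of loading the dataset with pandas; the file name `all-data.csv` and the latin-1 encoding are assumptions based on the Kaggle download and may differ in other copies of the data.

```python
import pandas as pd

# Load the Kaggle CSV (no header row; file name and encoding assumed).
df = pd.read_csv("all-data.csv", encoding="latin-1", names=["label", "text"])

print(len(df))                      # 4846 datapoints
print(df["label"].value_counts())  # distribution of neutral/positive/negative
```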
Below are the first three entries of the dataset:
According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
Technopolis plans to develop in stages an area of no less than 100,000 square meters in order to host companies working in computer technologies and telecommunications , the statement said .
The international electronic industry company Elcoteq has laid off tens of employees from its Tallinn facility ; contrary to earlier layoffs the company contracted the ranks of its office workers , the daily Postimees reported .
Word cloud of the stemmed dataset
The graph below shows the distribution of labels. Because the data is unbalanced, the results of every model could be affected.
Distribution of labels
Multinomial Logistic Regression
Pre-processing
To preprocess our data, we tokenized and stemmed each headline. We treated each word in the dataset as a token, and we used the common Porter stemmer to reduce words to their stems, discarding inflections. Thus, the headlines are converted into lists of root words, with tenses and plurality removed for consistency [10]. Afterwards, we removed common stopwords from our data using a preset list of English stopwords.

We then applied the common bag-of-words model, using a count vectorizer to vectorize the words in our training dataset by the number of times each word appeared. We used a minimum frequency of 5, so the model ignored words that appeared fewer than 5 times across all training headlines.

In addition to raw word counts, we processed the headlines using a term frequency-inverse document frequency (TF-IDF) matrix. TF-IDF assigns each word a weight proportional to the number of times it appears in a headline and inversely proportional to the number of headlines in which it appears. TF-IDF thus reflects the importance of a word to a document and to the corpus as a whole; the inverse relationship ensures that common words present in all documents carry little information [9]. Comparatively, a word with high frequency in one headline but relatively low frequency in the other headlines provides more information about that headline's sentiment.
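A minimal sketch of this pipeline with nltk and scikit-learn is shown below; variable names such as `headlines` are placeholders rather than our exact notebook code.

```python
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(headline):
    # Tokenize, drop stopwords and non-alphabetic tokens, then stem.
    tokens = nltk.word_tokenize(headline.lower())
    return " ".join(stemmer.stem(t) for t in tokens
                    if t.isalpha() and t not in stop_words)

stemmed = [preprocess(h) for h in headlines]  # `headlines`: list of raw strings

# Bag-of-words counts, ignoring words seen fewer than 5 times in training.
vectorizer = CountVectorizer(min_df=5)
X_counts = vectorizer.fit_transform(stemmed)

# Re-weight the counts with TF-IDF.
X_tfidf = TfidfTransformer().fit_transform(X_counts)
```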
Below are some visualizations of the pre-processing steps we took for our logistic regression model.
Porter Stemmer and Count Vectorizer
| Before | Porter Stemmer | After | Count Vectorizer | Key |
| --- | --- | --- | --- | --- |
| According | ⭢ | accord | ⭢ | 784 |
| to | ⭢ | to | ⭢ | N/A |
| Gran | ⭢ | gran | ⭢ | 3211 |
| the | ⭢ | the | ⭢ | N/A |
| company | ⭢ | compani | ⭢ | 1826 |
TF-IDF Matrix
| | accord | area | compani | comput | develop | gran | grow |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Sentence 1 | 0.342369 | 0 | 0.48719673 | 0 | 0 | 0.342369 | 0.342369 |
| Sentence 2 | 0 | 0.24244659 | 0.17250275 | 0.24244659 | 0.24244659 | 0 | 0 |
Model
To create our model, we used the logistic regression implementation provided by the sklearn library [13]. Our code was run in Google Colab and is available on the GitHub repository. We trained the model on 80 percent of our data and used the remaining 20 percent to test its accuracy. We ran 5-fold cross validation on models trained with several different regularization levels, achieving the best cross-validation accuracy of 70 percent with an inverse regularization strength of 0.26.
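The sketch below illustrates this selection step, assuming `X_counts` and `y` come from the pre-processing above; the grid of C values is illustrative rather than our verbatim notebook code.

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(
    X_counts, y, test_size=0.2, random_state=0)

# 5-fold cross validation over several inverse regularization strengths C.
for C in [0.01, 0.1, 0.26, 1.0, 10.0]:
    clf = LogisticRegression(C=C, max_iter=1000)
    print(C, cross_val_score(clf, X_train, y_train, cv=5).mean())

# Refit with the best value found, C = 0.26, and score on the held-out 20%.
best = LogisticRegression(C=0.26, max_iter=1000)
best.fit(X_train, y_train)
print("test accuracy:", best.score(X_test, y_test))
```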
Results
We compared the number of correctly classified financial news headlines with the total size of the test set. Our model predicted 37% of the test headlines correctly. When the headlines were processed with TF-IDF, accuracy decreased slightly to 36%. Below we charted the 25 most positively and negatively influential words for each sentiment.
Most influential neutral words
Most influential negative words
Most influential positive words
Discussion
We believe the relatively low accuracy of the logistic regression model can be attributed to the small sample size and to the majority of the dataset being labeled neutral. In particular, the model may have suffered because the training set produced a vocabulary of only 2,340 words, so many words appearing in the test headlines were unknown to the model and their information was lost. The TF-IDF pre-processing may have had a negligible effect due to the nature of the financial corpus: word frequencies in financial text differ from those in general text, so the re-weighting added little information. Finally, the bag-of-words model likely captured insufficient information about the relationships and semantic meaning of the text, leading to low overall accuracy [5].
Convolutional Neural Network
Pre-processing
To pre-process the text of the dataset, we first removed punctuation and lowercased each token. Then, we removed the common list of English stopwords provided by the nltk.corpus library.
For the labels, each label is treated as a one-hot three-dimensional vector, depending on its value.
Word2vec
Afterwards, we used the common word-to-vector embedding word2vec. Word2vec is a word-vector embedding developed by Google and trained on a Google News dataset. The general word2vec algorithm takes a word as input and outputs, for every word in the corpus, the probability that it appears in the input word's surroundings. As a result, words that appear in similar contexts have word vectors with high cosine similarity, since they produce similar probability distributions. Word2vec is therefore able to capture the semantic meaning of words and serves as a viable input to our CNN model [5].
Word2vec algorithm
Source: petamind
We used the pre-trained word2vec model to pre-process our dataset. The embedding maps every word to a 300-dimensional vector. The maximum sequence length in our dataset is 47, so we set the maximum input length to 50. For each word in a datapoint, if a word2vec mapping existed, we appended the word's vector to our input; if no mapping existed, we filled that position with random entries. Finally, each datapoint was padded with 0's so that every sequence has length 50. Thus, the input for every datapoint has size 50x300.
After pre-processing, the dataset becomes an Nx50x300 input, where N is the number of datapoints. We split the dataset into 80% training, 10% validation, and 10% testing.
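A minimal sketch of this embedding step with gensim is below; it assumes the pre-trained GoogleNews vectors have been downloaded locally, and names like `tokenized_headlines` are placeholders.

```python
import numpy as np
from gensim.models import KeyedVectors

# Pre-trained Google News word2vec (300-dimensional vectors).
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

MAX_LEN, DIM = 50, 300

def embed(tokens):
    # Map each in-vocabulary token to its vector; use a random vector for
    # out-of-vocabulary words, then zero-pad the sequence to length 50.
    vecs = [w2v[t] if t in w2v else np.random.uniform(-0.25, 0.25, DIM)
            for t in tokens[:MAX_LEN]]
    vecs += [np.zeros(DIM)] * (MAX_LEN - len(vecs))
    return np.stack(vecs)

X = np.stack([embed(t) for t in tokenized_headlines])  # shape (N, 50, 300)
```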
Model
For our model, we used the Keras library with a TensorFlow backend. As with our other models, it was written in a Jupyter notebook on Google Colab; all notebooks are available on the GitHub repository.
Our CNN model took inspiration from conventional CNN models in other fields, such as computer vision, as well as from models in published papers [7]. However, because the maximum sequence length is only 50, our model consists of three 1-D convolutional layers rather than four. We chose 1-D convolutional layers because they capture information across words in a sequence; 2-D convolutional layers would mostly capture information within the same word vector, reducing the importance of relationships between words. These layers use the ReLU activation function and a kernel size of 5, and the convolutions span the entire sequence to capture as much contextual information as possible. Between the convolutional layers are max-pooling layers with a pool size of 2. Finally, the data passes through a flatten layer and a 128-dimensional fully-connected layer before reaching the output layer, which applies the softmax function to produce 3 outputs representing the 3 labels of the dataset. Categorical cross entropy is our loss function. The model has a total of 264,419 trainable parameters.
To improve the accuracy of the model, we varied the kernel size and max-pooling size and added and removed convolutional layers. We used the Adam optimizer with a learning rate of 0.00025, as it converged faster and produced better results than stochastic gradient descent. The model below proved the most effective at consistently achieving high accuracies.
Our CNN model
To reduce overfitting, we added dropout, regularization, and early stopping to the model. Dropout with a rate of 0.3 before the convolutional layers reduced overfitting the most, and another dropout layer was added between the fully-connected layers at the end of the model. L2 kernel and bias regularizers were added to the fully-connected layers; accuracy was highest with a factor of 0.04. Because each data point is short, training stopped early once the validation loss had not decreased for 3 epochs.
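Below is a hedged Keras sketch of this architecture. The layer structure and hyperparameters follow the description above, but the filter counts and the final dropout rate are assumptions, so the parameter count will not match 264,419 exactly.

```python
from tensorflow.keras import Sequential, callbacks, layers, optimizers, regularizers

reg = regularizers.l2(0.04)
model = Sequential([
    layers.Dropout(0.3, input_shape=(50, 300)),
    layers.Conv1D(64, kernel_size=5, activation="relu"),  # filter counts assumed
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.Flatten(),
    layers.Dense(128, activation="relu",
                 kernel_regularizer=reg, bias_regularizer=reg),
    layers.Dropout(0.3),  # rate between the fully-connected layers assumed
    layers.Dense(3, activation="softmax"),
])
model.compile(optimizer=optimizers.Adam(learning_rate=0.00025),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Stop once validation loss has not improved for 3 epochs.
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                     restore_best_weights=True)
```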
Results
Our evaluation metric was the accuracy of our sentiment predictions. For the test set, we compared the number of correctly classified financial news headlines with the total size of the test set. Though we did not initially set a target accuracy, after seeing the accuracy of our multinomial logistic regression model, we hoped to exceed 66%. In the end, we achieved an accuracy of 70.103091%.
Below are visualizations that plot the loss and the accuracy versus epochs during the training of the model.
Discussion
We speculate that there are several reasons for this accuracy and several ways to improve our model. First, as mentioned in the discussion of the dataset, the imbalance of labels could skew our results and model [7]. Furthermore, the financial corpus differs from a general text corpus; since we used the pre-trained Google word2vec embedding, our mapping of words into vector space was likely insufficient for capturing semantic relationships between words in the financial dataset. A word embedding pre-trained on financial text, such as FinBERT embeddings, would likely improve our results [3].
Long Short-Term Memory
Pre-processing
The LSTM model required much less pre-processing than the other two models. We used the texts_to_sequences method of the Keras Tokenizer to pre-process the dataset. This method converts each text into a sequence of integers, taking into account only the words known to the tokenizer.
For the labels, each label is treated as a one-hot three-dimensional vector, depending on its value.
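A minimal sketch of this step is below; the vocabulary cap of 2,000 words and the padded length of 50 are assumptions rather than values stated in the text (the 2,000-word cap is chosen because it reproduces the parameter count reported for the model in the next section).

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

# Fit the tokenizer on the headlines; only the most frequent words are kept.
tokenizer = Tokenizer(num_words=2000)
tokenizer.fit_on_texts(headlines)  # `headlines`: list of raw strings

# Convert each headline to a sequence of integer word indices, then pad.
X = pad_sequences(tokenizer.texts_to_sequences(headlines), maxlen=50)

# One-hot encode the three labels (`labels`: integer-coded 0, 1, 2).
y = to_categorical(labels, num_classes=3)
```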
Model
As mentioned earlier, the LSTM model is a type of recurrent neural network, in which each hidden state is taken as additional input when computing the next hidden state. The LSTM improves upon the RNN because, as a sequence grows, RNNs tend to forget information from hidden states that occurred very early in the sequence; this is known as catastrophic forgetting [8]. The LSTM mitigates this issue by selectively choosing which information to keep and which to forget [8].
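Concretely, this gating can be written in the standard LSTM formulation [11], where the forget gate decides what to discard from the cell state and the input gate decides what to add:

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{(input gate)}\\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(updated cell state)}\\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{(output gate)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(new hidden state)}
\end{aligned}
$$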
Our model was written in a Jupyter notebook and is available on the GitHub repository. It consists of an embedding layer that converts the integer sequences into dense embeddings, two LSTM layers, and a fully-connected layer that outputs softmaxed probabilities for the three labels of each datapoint. The embedding dimension is 128 and the LSTM dimension is 196. We use categorical cross entropy as our loss function and the Adam optimizer, which provided better results than stochastic gradient descent. Overall, the model has 819,503 trainable parameters. To improve our model, we experimented with adding fully-connected layers and with adding and removing LSTM layers. In the end, the model below proved the most effective at achieving high accuracies.
To prevent overfitting, a spatial dropout layer was added before the LSTM layers [4]; a rate of 0.4 worked best. Furthermore, dropout and recurrent dropout were added to the LSTM layers. With two LSTM layers, the dropout of each layer worked best at a rate of about 0.4, and the recurrent dropout worked best at a rate of about 0.5. Early stopping was not added to this model, since it required only a few epochs to train and generalized well.
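A hedged Keras sketch of this architecture is below. The vocabulary size of 2,000 is an assumption carried over from the pre-processing sketch; with that value, the layer dimensions stated above yield exactly the 819,503 trainable parameters reported.

```python
from tensorflow.keras import Sequential, layers

VOCAB_SIZE = 2000  # assumed; matches the tokenizer sketch above

model = Sequential([
    # 128-dimensional embedding over integer sequences of length 50.
    layers.Embedding(VOCAB_SIZE, 128, input_length=50),
    layers.SpatialDropout1D(0.4),
    layers.LSTM(196, dropout=0.4, recurrent_dropout=0.5,
                return_sequences=True),
    layers.LSTM(196, dropout=0.4, recurrent_dropout=0.5),
    layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()  # 819,503 trainable parameters under the assumed vocabulary
```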
Results
As with our other models, our evaluation metric was the accuracy of our sentiment predictions. We aimed to improve upon the accuracy achieved by our CNN model. In the end, we achieved a test accuracy of 73.814434%.
Below are the visualizations that plot the loss and the accuracy versus epochs during the training of the model.
Discussion
The LSTM model consistently achieved above 72% accuracy, indicating the effectiveness of recurrent neural networks on NLP tasks due to their naturally recurrent, sequential structure. However, we speculate that the LSTM may not be especially effective on this dataset because the headlines are short, with a maximum sequence length of only 47; its selective memory may therefore provide only a negligible improvement over a plain RNN. Furthermore, since the LSTM is trained on text in only one direction, relationships that run in reverse are ignored, which could also impact the accuracy of our model [4].
Conclusion / Comparing our models
If the labels were perfectly balanced, randomly guessing the correct sentiment would give a baseline accuracy of 33%. Because of the label imbalance in our dataset, random guessing would likely yield a somewhat different accuracy. Regardless, our multinomial logistic regression model performed only slightly better than random guessing on a balanced dataset, at around 36%. The model did not generalize well, and overall the conventional (non-deep-learning) supervised model provided the worst results on our dataset. We conclude that a bag-of-words model with TF-IDF pre-processing is not enough to capture the semantic information and contextual relationships needed to consistently determine the correct label for a sequence of text.
The CNN model drastically improved upon the results of the logistic regression, achieving above 70% accuracy. The model converged and trained quickly, requiring only about 12 epochs on average. It generalized well, and we can reasonably conclude that the convolutions over the word embeddings captured contextual and semantic information that aided in predicting the sentiment of each datapoint. Though not perfect, the convolutions retained more information than our supervised logistic regression model.
Finally, the LSTM model's improvement is not negligible, as we achieved almost 74% accuracy. Fewer epochs were required to train the LSTM, though it is worth noting that, due to the sequential nature of RNNs, each epoch took considerably longer than an epoch of the CNN. The recurrent structure of our LSTM proved more effective at predicting the sentiment of each datapoint, likely because text is sequential and LSTM and RNN models take advantage of that sequential structure. In the end, our LSTM was the best model, with the highest accuracy of the three, suggesting that recurrence is very effective for NLP tasks such as sentiment analysis.
In the end, our models, especially the deep learning models, produced good results that indicate the strength of CNNs and RNNs in sentiment analysis. Based on our results, we conclude that deep learning models such as the CNN and LSTM are effective at predicting the sentiment of financial text: CNNs capture relationships between words through convolutional kernels, and RNNs capture sequential information through their recurrent architecture. Future work should focus on improving these models and achieving higher accuracies before applying them to other applications in the financial field.
Addendum: Where to go from here?
Recent advances in sentiment analysis indicate the strength of a newer architecture. In the past few years, the transformer, which uses multi-head self-attention to capture relationships between features in a sequence, has proven extremely effective at capturing the contextual information of a sequence, much more so than any other architecture. Through self-attention, the model builds an internal representation of a sequence that is extremely effective at capturing relationships between words [12].
Self-attention capturing relationships within a sequence, such as pronoun-antecedent relationships
Source: Vaswani et al.
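For reference, the scaled dot-product attention at the heart of the transformer computes, for query, key, and value matrices $Q$, $K$, and $V$ with key dimension $d_k$ [12]:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$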
Thus, transfer learning, in which a transformer language model is pre-trained and then fine-tuned on downstream tasks such as sentiment analysis, is the focus of much current and future work [3]. Moreover, financial sentiment analysis has already seen vast improvements, reaching 86% accuracy through transformer architectures and transfer learning [1].
References
[1] Araci, D. (2019). FinBERT: Financial Sentiment Analysis with Pre-trained Language Models. ArXiv, abs/1908.10063.
[2] Baker, Malcolm, and Jeffrey Wurgler. “Investor sentiment in the stock market.” Journal of economic perspectives 21.2 (2007): 129-152.
[3] Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT.
[4] D. Li and J. Qian, “Text sentiment analysis based on long short-term memory,” 2016 First IEEE International Conference on Computer Communication and the Internet (ICCCI), Wuhan, 2016, pp. 471-475, doi: 10.1109/CCI.2016.7778967.
[5] Karani, Dhruvil. “Introduction to Word Embedding and Word2Vec.” Medium, Towards Data Science, 2 Sept. 2018, towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa.
[6] Krishnamoorthy, S. (2018). Sentiment analysis of financial news articles using performance indicators. Knowledge & Information Systems, 56(2), 373–394. https://doi.org/10.1007/s10115-017-1134-1
[7] Nedjah, N., Santos, I. & de Macedo Mourelle, L. Sentiment analysis using convolutional neural network via word embeddings. Evol. Intel. (2019). https://doi.org/10.1007/s12065-019-00227-4
[8] Schak M., Gepperth A. (2019) A Study on Catastrophic Forgetting in Deep LSTM Networks. In: Tetko I., Kůrková V., Karpov P., Theis F. (eds) Artificial Neural Networks and Machine Learning – ICANN 2019: Deep Learning. ICANN 2019. Lecture Notes in Computer Science, vol 11728. Springer, Cham
[9] Stecanella, Bruno. “What Is TF-IDF?” MonkeyLearn Blog, 14 July 2020, monkeylearn.com/blog/what-is-tf-idf/.
[10] Stemming and Lemmatization, nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html.
[11] “Understanding LSTM Networks.” Understanding LSTM Networks – Colah’s Blog, colah.github.io/posts/2015-08-Understanding-LSTMs/.
[12] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., & Polosukhin, I. (2017). Attention is All you Need. ArXiv, abs/1706.03762.
[13] W. P. Ramadhan, S. T. M. T. Astri Novianty and S. T. M. T. Casi Setianingsih, “Sentiment analysis using multinomial logistic regression,” 2017 International Conference on Control, Electronics, Renewable Energy and Communications (ICCREC), Yogyakarta, 2017, pp. 46-49, doi: 10.1109/ICCEREC.2017.8226700.