Using NLP to Detect Fake News 📰

Arjun Mohnot
6 min read · Mar 14, 2020
Fake News Everywhere 📒

Introduction

Information is now easily accessible from anywhere 🌏. In this age of information, anyone can follow events around the world from the comfort of their own home. The same ease of publishing has also led to inaccurate and irrelevant information being spread, which is commonly known as fake news 😨. Since a large proportion of the population relies on social media for news, delivering accurate and unbiased information is of utmost importance. With the growing number of social media users, anybody can publish news quickly, and its credibility is compromised.

Because fake news is written to mislead readers, it is difficult to detect from the content of the news alone. News content is also diverse in style and subject, so an efficient detection system is essential. Fake news detection has recently garnered much attention from researchers 👨‍🔬 and developers alike. This work proposes to detect fake news efficiently from the available modalities using deep learning algorithms such as the Convolutional Neural Network 🕸️ and Long Short-Term Memory.

Source: Statista, World Economic Forum

Methodology

A. Fake News Classification with Deep Learning

  • Deep learning is a subset of machine learning whose models learn features directly from data and often outperform other learning algorithms on large datasets. The performance of a deep learning model generally keeps improving as more data is fed to it, whereas older learning algorithms tend to plateau (Fig. 1). A Convolutional Neural Network and a Long Short-Term Memory network were used to create 🔧 the fake-news detection model.
Fig. 1. Deep Learning Model Performance with Amount of Data

B. Convolutional Neural Network

  • A Convolutional Neural Network (CNN) is a type of artificial neural network used for tasks such as image processing and language processing 👨🏻‍🚀. CNNs are a class of deep neural networks also known as shift-invariant or space-invariant artificial neural networks, because the same filter weights are applied at every position of the input. They can be seen as a regularized version of the multilayer perceptron: where a multilayer perceptron connects each node in a layer to all the nodes in the next layer, a convolutional layer connects each node only to a small local region of its input.
  • CNNs need relatively little pre-processing of the data, because the filters that would otherwise have to be hand-engineered are learned by the network itself during training.
Fig. 2. CNN Architecture
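
To make the idea of a learned, position-independent filter concrete, here is a minimal pure-Python sketch of a one-dimensional convolution over a sequence of word vectors, followed by max-over-time pooling. The embeddings and filter weights are made-up numbers chosen for illustration; in a real model they would be learned, and this is not the code from the notebook.

```python
# Toy illustration only: the embeddings and filter weights are made-up numbers.

def conv1d(sequence, kernel, bias=0.0):
    """Slide `kernel` (a list of k weight vectors) over `sequence`
    (a list of word vectors) and return one ReLU activation per window."""
    k = len(kernel)
    outputs = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        total = bias
        for word_vec, weight_vec in zip(window, kernel):
            total += sum(x * w for x, w in zip(word_vec, weight_vec))
        outputs.append(max(0.0, total))  # ReLU non-linearity
    return outputs


def max_over_time(feature_map):
    """Keep the strongest response of the filter, wherever it occurred."""
    return max(feature_map)


# Four "words", each a 2-dimensional embedding (hypothetical values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# One filter spanning two consecutive words; the same weights are reused
# at every position, which is what makes the layer shift-invariant.
bigram_filter = [[1.0, -1.0], [0.5, 0.5]]

feature_map = conv1d(sentence, bigram_filter)
pooled = max_over_time(feature_map)
print(feature_map, pooled)
```

A text CNN typically applies many such filters of different widths and concatenates the pooled values before the final classification layer.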

C. Long Short-Term Memory

  • Long short-term memory (LSTM) 🧬 is a recurrent neural network architecture used to process sequences of data points such as speech, audio, video and text. An LSTM unit consists of a cell and three gates: an input gate, a forget gate and an output gate. Unlike standard feedforward networks, an LSTM has feedback connections, and the gates regulate the flow of information into and out of the cell.
  • The architecture is designed to remember long-term dependencies in the data presented to it, and it mitigates the vanishing gradient problem that arises when training traditional Recurrent Neural Networks. These models can be trained in both a supervised and an unsupervised manner.
Fig. 3. LSTM Architecture
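
The role of the cell and the three gates can be sketched as a single time step of a scalar LSTM (one input feature, one hidden unit), following the standard LSTM update equations. The weights below are hypothetical values picked for illustration, not trained parameters, and the code is not taken from the notebook.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell.
    `params[gate]` holds (input weight, recurrent weight, bias) for that gate."""
    def pre_activation(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre_activation("forget"))        # how much old cell state to keep
    i = sigmoid(pre_activation("input"))         # how much new content to write
    o = sigmoid(pre_activation("output"))        # how much cell state to expose
    g = math.tanh(pre_activation("candidate"))   # proposed new content

    c = f * c_prev + i * g     # updated cell state (long-term memory)
    h = o * math.tanh(c)       # updated hidden state, fed back at the next step
    return h, c


# Hypothetical weights, for illustration only.
params = {
    "forget": (0.5, 0.1, 1.0),
    "input": (0.6, 0.2, 0.0),
    "output": (0.4, 0.3, 0.0),
    "candidate": (0.9, 0.5, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, 0.5, -1.0]:   # a toy input sequence
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

Because the cell state is updated additively (`f * c_prev + i * g`), a forget gate close to 1 lets information pass through many steps almost unchanged, which is how the architecture preserves long-term dependencies.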

D. Word Vector

  • The text in the dataset is converted into word vectors ⚔️ using techniques such as a co-occurrence matrix, a count vectorizer or TF-IDF vectorizer, or the continuous bag-of-words (CBOW) and skip-gram models of Word2vec. Each sentence consists of words that are converted to vectors using these embedding techniques. Pre-trained word embedding models such as GloVe, ELMo, fastText and BERT can also be used to obtain a vector representation of words.
  • GloVe, for example, is based on the observation that ratios of word-word co-occurrence probabilities can encode meaning. It is trained on the non-zero entries of a word-word co-occurrence matrix, which records how frequently words co-occur with each other.
Fig. 4. Euclidean representation of words using Word Embedding
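
As a small illustration of the co-occurrence idea, the sketch below counts which words appear near each other and compares the resulting count vectors with cosine similarity. The three-sentence corpus and the window size are assumptions made up for the example; real embeddings such as GloVe are trained on far larger corpora and compress these counts into dense vectors.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """For every word, count how often each other word appears within
    `window` positions of it. Each row of counts acts as a sparse word vector."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Made-up toy corpus.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
]
vecs = cooccurrence_vectors(corpus)
print(cosine(vecs["cat"], vecs["dog"]))    # words used in similar contexts
print(cosine(vecs["cat"], vecs["milk"]))   # words used in different contexts
```

Words that occur in similar contexts ("cat" and "dog") end up with more similar vectors than words that merely appear together once.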

E. Fakeddit Dataset

  • The dataset is called Fakeddit, a name derived from Fake News + Reddit. It is a novel dataset comprising around 800,000 examples from different categories of fake news. Each example is labeled according to 2-way, 3-way, and 5-way classification categories. The dataset contains features such as text, clean title, number of upvotes, comments, score and upvote ratio.
  • The text from the dataset 📅 is fed into the model, where the words and sentences are converted into vectors and passed through layers with a decreasing number of nodes, to be finally classified as real or fake in the output layer. In this model, we use the feature "clean title" from the dataset as input. The training dataset has 69,954 entries in each column. (Link for the dataset: https://github.com/entitize/Fakeddit)
Fig. 5. Word Cloud for the Fakeddit Dataset
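
Before the titles reach an embedding layer, each one has to become a fixed-length sequence of integer word ids. The sketch below shows one common way to do that in plain Python. The reserved ids, the `max_len` value and the sample titles are assumptions for illustration and are not taken from the notebook.

```python
PAD, UNK = 0, 1  # reserved ids: padding and out-of-vocabulary words


def build_vocab(titles, min_count=1):
    """Map every word seen at least `min_count` times to an integer id."""
    counts = {}
    for title in titles:
        for word in title.lower().split():
            counts[word] = counts.get(word, 0) + 1
    vocab = {}
    for word, count in counts.items():
        if count >= min_count:
            vocab[word] = len(vocab) + 2  # ids 0 and 1 are reserved
    return vocab


def encode(title, vocab, max_len):
    """Turn a title into exactly `max_len` ids: truncate long titles,
    pad short ones, and map unknown words to UNK."""
    ids = [vocab.get(word, UNK) for word in title.lower().split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))


# Made-up titles standing in for the "clean title" column.
train_titles = ["breaking news today", "news at ten"]
vocab = build_vocab(train_titles)
encoded = encode("breaking news tonight", vocab, max_len=5)
print(encoded)  # [2, 3, 1, 0, 0]
```

The vocabulary is built from the training titles only, so words that first appear in validation or test data fall back to the `UNK` id.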
  • The most frequent words in the real and fake examples of the dataset have been plotted to visualize trends that indicate whether a given sentence is fake or real.
Fig. 6. Scatter Plot between Fake v/s Real Words in the Fakeddit Dataset
  • Fakeddit provides a large quantity of text+image samples with multiple labels for various levels of fine-grained classification.
  • With so many data points, the dataset allows a model to produce more general results and helps to assess the credibility of news more reliably.
Fig. 7. Top 8 words in real and fake news respectively in Fakeddit Dataset
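
A per-class word count like the one behind Fig. 7 can be sketched with `collections.Counter`. The titles, the labels and the encoding (1 = real, 0 = fake) below are made-up assumptions for illustration; they are not rows from Fakeddit.

```python
from collections import Counter


def top_words(titles, labels, target_label, n=8, stopwords=frozenset()):
    """Return the `n` most frequent words among titles with `target_label`."""
    counts = Counter()
    for title, label in zip(titles, labels):
        if label == target_label:
            counts.update(
                word for word in title.lower().split() if word not in stopwords
            )
    return counts.most_common(n)


# Made-up rows; assumed label encoding: 1 = real, 0 = fake.
titles = [
    "city council approves new budget",
    "council votes on budget cuts",
    "aliens approve new budget",
    "aliens land on city hall",
]
labels = [1, 1, 0, 0]

print("real:", top_words(titles, labels, target_label=1, n=2))
print("fake:", top_words(titles, labels, target_label=0, n=1))
```

Passing a set of common words as `stopwords` keeps words such as "on" or "the" from dominating the counts.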

Code

The Jupyter notebook and the code 👨‍💻 can be found on my GitHub account: https://github.com/Arjun009/Fake-News-Detection

  • The fastText word embedding implementation in Python.
  • The basic implementation of the LSTM model in Python.
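
As a rough illustration of the idea behind fastText embeddings (not the notebook's code): fastText represents a word through its character n-grams, so related word forms share parts of their representation and unseen words can still be given a vector. The sketch below only extracts the n-grams. The range of 3 to 6 characters is an assumption that mirrors a range commonly used with fastText, and the hashing and training steps of the real library are left out.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word`, with '<' and '>' marking
    the word boundaries, for every n from n_min to n_max."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams


trigrams = char_ngrams("where", n_min=3, n_max=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']

# Related word forms share many n-grams, and therefore part of their vectors.
shared = set(char_ngrams("detect")) & set(char_ngrams("detection"))
print(sorted(shared))
```

In the full model, each n-gram has its own vector and a word's vector is built from the vectors of its n-grams.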

Results

The fake news detection model was successfully 💯 built, with an accuracy of more than 90% on the training set and more than 80% on the validation and test sets.

Fig. 8. Accuracy and Loss v/s no. of epoch for Training and Testing for LSTM
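
For reference, accuracy is simply the fraction of examples whose predicted label matches the true label. A minimal sketch, using made-up toy labels rather than the model's actual predictions:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)


# Toy labels for illustration only (1 = real, 0 = fake is an assumed encoding).
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
score = accuracy(y_true, y_pred)
print(score)
```

Computing the same metric separately on the training, validation and test sets shows how well the model generalizes: a training accuracy well above the validation accuracy is a common sign of overfitting.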

Conclusion

In this digital age 💆‍♂️, where hoax news is present everywhere on digital platforms, there is a pressing need for fake news detection, and this model serves that need.

In the future, a multi-modal fake news detection model can be built that detects fake news based on both images and captions.

⭐ Thank You!

Thank you 🙏, and I am open to your suggestions 👍.

If you love my article, please don't forget to give it a round of applause 👏.
