Fake News Detection

AI Code for Business

In this tutorial write-up, we provide a simple approach to detecting fake news. Do you trust all the news you see on social media? Not all news is real, right? How can you find an impartial judge of online news?

Fake news affects not only individuals but society at large. Public health and health policy decisions may be based on political and economic concerns rather than well-rounded, research-based information. Health providers must also engage in critical and evidence-based evaluation of health information. Incorrect or misleading medical information can adversely affect health and delay proper treatment. Advertising and commercial search engine optimization techniques can push results from sources with economic interests to the top of the page, ahead of the most accurate and well-researched information. Because misinformation is so widespread today, fake news detection is useful to businesses of every kind.

How Can You Detect Fake News?

The answer is with Python. By working through this Python project for detecting fake news, you will have a readily available tool that differentiates between real and fake news.

Before moving ahead, let's first get familiar with two related terms: TfidfVectorizer and PassiveAggressiveClassifier.

What is a TfidfVectorizer?

TF (Term Frequency): The number of times a word appears in a document is its Term Frequency. A higher value means the term appears more often in that document, so the document is a good match when the term is one of the search terms.

IDF (Inverse Document Frequency): Words that occur many times in a document, but also occur many times in many other documents, may be irrelevant. IDF is a measure of how significant a term is across the entire corpus: the fewer documents a term appears in, the higher its IDF.

The TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF features.
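To make the two definitions concrete, here is a minimal pure-Python sketch of a TF-IDF calculation on a toy corpus. The corpus and the helper names (`idf`, `tfidf`) are made up for illustration. The formula is the smoothed variant that scikit-learn documents as its default, followed by L2 normalization; the real TfidfVectorizer also handles tokenization, stop words, and sparse storage, which this sketch leaves out.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased; each word appears once per document.
docs = [
    "senate passes budget bill",
    "aliens endorse budget bill",
    "senate delays budget vote",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
df = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    tf = Counter(doc)                                   # term frequency in this document
    weights = {t: tf[t] * idf(t) for t in tf}           # TF times IDF
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}    # L2-normalize the vector

vectors = [tfidf(doc) for doc in tokenized]

# Print the second document's terms from most to least distinctive.
for term, weight in sorted(vectors[1].items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

In the second document, "aliens" and "endorse" appear nowhere else and receive the largest weights, while "budget", which appears in every document, receives the smallest.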

What is a PassiveAggressiveClassifier?

Passive-aggressive algorithms are online learning algorithms: they update the model one example at a time. Such an algorithm remains passive when an example is classified correctly and turns aggressive on a misclassification, adjusting its weights. Unlike many batch algorithms, it does not aim to converge to a fixed solution; it keeps adjusting as new examples arrive. Each update is just large enough to correct the loss on the current example, so the norm of the weight vector changes as little as possible.
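The passive and aggressive behavior can be sketched in a few lines. This is the basic textbook update for two classes coded as -1 and +1, not scikit-learn's implementation; the names `dot` and `pa_update` are hypothetical. scikit-learn's classifier additionally caps the step size with its `C` parameter.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pa_update(w, x, y):
    """One passive-aggressive step for a label y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * dot(w, x))   # hinge loss on this example
    if loss == 0.0:
        return w                           # passive: the margin is already satisfied
    tau = loss / dot(x, x)                 # aggressive: smallest step that fixes the margin
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]                 # start with zero weights
x, y = [1.0, 2.0], +1          # one training example with label +1

w = pa_update(w, x, y)         # margin was 0, so the weights move
margin_after = y * dot(w, x)   # the margin on this example is now 1

w_again = pa_update(w, x, y)   # same example again: loss is zero, nothing changes
print(w, margin_after, w_again)
```

The second call leaves the weights untouched, which is the "passive" half of the name.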

Detecting Fake News with Python

The objective is to build a model to accurately classify a piece of news as REAL or FAKE.

About Detecting Fake News with Python

This Python project classifies news stories as fake or real. Using sklearn, a data science library, we build a TfidfVectorizer on our dataset. Then, we initialize a PassiveAggressiveClassifier and fit the model. In the end, the accuracy score and the confusion matrix tell us how well our model fares. We then apply the trained model to two news stories. Are they REAL or FAKE?

Fake News Detection Project Code

The dataset we’ll use for this Python project (news.csv) consists of 7795 entries. The first column identifies the news, the second and third are the title and text, and the fourth column has labels denoting whether the news is REAL or FAKE. This dataset will be used to train the prediction model.

Follow the steps below and run the code on the colab notebook linked here. (To run the code, click on the round ▶️ next to each cell)

Cell 1: Imports the Python libraries needed for this project.

import numpy as np

import pandas as pd

import itertools

from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import PassiveAggressiveClassifier

from sklearn.metrics import accuracy_score, confusion_matrix

Cell 2: Prior to running this cell, make sure to download the news.csv spreadsheet and upload it to the notebook. The code reads in the spreadsheet and displays the first 5 news items.

data = pd.read_csv('news.csv')

data.head()

Cell 3: Splits the data, using 80% to train the machine learning model and the remaining 20% to test its accuracy. So, 80% of the news texts and their corresponding REAL or FAKE labels are stored in the x_train and y_train variables, and the remaining 20% are stored in the x_test and y_test variables.

x_train, x_test, y_train, y_test = train_test_split(data.text.values, data.label.values, test_size=0.2, random_state=42)

Cell 4: Next, we initialize a TfidfVectorizer with stop words from the English language and a maximum document frequency of 0.7 (terms that appear in more than 70% of the documents are discarded). Stop words are the most common words in a language, which are filtered out before processing natural language data. The TfidfVectorizer turns a collection of raw documents into a matrix of TF-IDF features. It is fitted on the training text only, and the test text is then transformed with the same vocabulary. Finally, the PassiveAggressiveClassifier is initialized and the model is trained on the x_train and y_train training data.

tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)

tfidf_train = tfidf_vectorizer.fit_transform(x_train)

tfidf_test = tfidf_vectorizer.transform(x_test)

pac = PassiveAggressiveClassifier(max_iter=50)

pac.fit(tfidf_train, y_train)
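As a rough illustration of what the stop-word list and the max_df=0.7 setting do to the vocabulary, here is a pure-Python sketch. The mini-corpus and the four-word stop list are hypothetical stand-ins; scikit-learn's built-in English stop list is much longer, and its tokenizer is more careful than a whitespace split.

```python
from collections import Counter

# Hypothetical mini-corpus; the real input would be the news texts in x_train.
docs = [
    "officials said the vote was fair",
    "sources said the vote was rigged",
    "experts said the moon is cheese",
    "officials said taxes will rise",
]
stop_words = {"the", "was", "is", "will"}   # tiny stand-in for the English stop list
max_df = 0.7

# Drop stop words first, as the vectorizer would.
tokenized = [[t for t in d.split() if t not in stop_words] for d in docs]

# Count the number of documents each remaining term appears in.
df = Counter(term for doc in tokenized for term in set(doc))

# Keep a term only if it appears in at most max_df of the documents.
limit = max_df * len(docs)
vocabulary = sorted(t for t, count in df.items() if count <= limit)
print(vocabulary)
```

Here "said" occurs in all four documents, which is above the 70% limit, so it is dropped even though it is not a stop word. A word that appears almost everywhere says little about whether a particular story is real or fake.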

Cell 5: Testing the trained model against the test data shows that it achieves an accuracy of 93.61%. In other words, out of every 100 news stories, the model correctly labels about 93 as REAL or FAKE.

y_pred = pac.predict(tfidf_test)

score = accuracy_score(y_test, y_pred)

print(f'Accuracy: {round(score*100,2)}%')

Cell 6: The confusion matrix provides a table layout of the performance of the model on the test data with 588 fake stories correctly predicted as fake, 40 fake stories wrongly predicted as real, 41 real stories wrongly predicted as fake, and 598 real stories correctly predicted as real.

confusion_matrix(y_test, y_pred, labels=['FAKE', 'REAL'])
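The four counts above are enough to recompute the accuracy from Cell 5 and to derive per-class measures. The sketch below uses only those reported numbers; choosing FAKE as the class of interest is an assumption made for illustration.

```python
# Counts reported for the test set, with rows as actual labels and columns as predictions:
# [[FAKE predicted FAKE, FAKE predicted REAL],
#  [REAL predicted FAKE, REAL predicted REAL]]
cm = [[588, 40],
      [41, 598]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total            # correct predictions over all stories

# Treating FAKE as the class of interest:
fake_precision = cm[0][0] / (cm[0][0] + cm[1][0])   # of stories flagged FAKE, share truly fake
fake_recall = cm[0][0] / (cm[0][0] + cm[0][1])      # of truly fake stories, share flagged

print(f"Test stories: {total}")
print(f"Accuracy: {accuracy:.2%}")
print(f"FAKE precision: {fake_precision:.2%}, recall: {fake_recall:.2%}")
```

The accuracy comes out at 93.61%, matching the score printed in Cell 5.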

Cells 7, 8, 9, 10, 11, 12: Now that we have the model finished and working, we can use it to detect whether a news story is genuine. Replace the news story in between the single quotes with another, then run this cell and the next one to get the model's verdict on whether the news story is FAKE or REAL. Make sure to add a \ at the end of a line when splitting the story onto the next line, as shown. Also, if there are any single quotes (apostrophes) in the news story, they need to be removed.
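The code for these cells lives in the notebook and is not reproduced in this write-up. As a hedged sketch of what such a step could look like, the example below wraps the prediction in a hypothetical helper, `classify_story`. The four toy stories and their labels are invented so the block runs on its own; in the notebook, the fitted tfidf_vectorizer and pac from Cell 4 would be passed in instead. It also uses a triple-quoted string, which is one way to keep line breaks and apostrophes in the story text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

def classify_story(story, vectorizer, model):
    """Return the predicted label for one news story given as a plain string."""
    features = vectorizer.transform([story])   # reuse the fitted vocabulary; do not refit
    return model.predict(features)[0]

# Hypothetical stand-in data so the sketch runs on its own.
toy_texts = [
    "Senate committee approves budget after public hearing",
    "City council publishes audited annual report",
    "Miracle pill cures every disease overnight, doctors stunned",
    "Secret moon base controls weather, insider reveals",
]
toy_labels = ["REAL", "REAL", "FAKE", "FAKE"]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_model = PassiveAggressiveClassifier(max_iter=50, random_state=0)
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

# A triple-quoted string can span several lines and contain apostrophes as-is.
story = """The council's audited report was published
after a public hearing on the city's budget."""

label = classify_story(story, toy_vectorizer, toy_model)
print(label)
```

The key design point is that the story goes through `transform`, not `fit_transform`, so it is scored against the vocabulary learned from the training data. A model trained on four toy sentences is only a placeholder and its prediction carries no weight.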

Conclusion

This project presents a quick and simple method to detect and combat questionable news. The machine learning model built here labels a news item as REAL or FAKE from its text alone, without the personal considerations that can sway us humans.

Check out the other articles to see more applications and related code on machine learning. If you need support or would like to find out more, get in touch via the contact link.