A persistent problem for any major website today is how to handle virulent and divisive content. Quora wants to tackle this problem by providing a platform where users can safely share their knowledge with the world.
Quora is a platform that empowers people to learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions: those founded on false premises, or those intended to make a statement rather than to seek helpful answers.
The goal is to develop a Naive Bayes classification model that identifies and flags insincere questions.
The dataset can be downloaded from here. Once you have downloaded the train and test data, load both files and inspect the first few rows to confirm they were read correctly.
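A minimal sketch of the load-and-check step, using only the Python standard library. The file names train.csv and test.csv and the column names qid, question_text and target are assumptions about how the downloaded data is laid out; load_questions and preview are hypothetical helper names.

```python
import csv

def load_questions(path):
    """Read a CSV file into a list of row dictionaries keyed by column name."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def preview(rows, n=5):
    """Print the row count, the column names, and the first n rows."""
    print(f"{len(rows)} rows")
    if rows:
        print("columns:", list(rows[0].keys()))
    for row in rows[:n]:
        print(row)

# Example usage (file names are assumptions, adjust to your download):
# train = load_questions("train.csv")
# test = load_questions("test.csv")
# preview(train)
# preview(test)
```

Note that csv.DictReader returns every field as a string, so a label column such as target would need converting to int before training.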
The next step is to preprocess the text before splitting the dataset into a training set and a test set. The preprocessing steps are: removing numbers, removing punctuation, removing stop words, stemming, and lemmatization.
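The preprocessing steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the exact pipeline: STOP_WORDS is a tiny made-up list, and toy_stem is a crude suffix stripper standing in for a real stemmer such as NLTK's PorterStemmer. Lemmatization is left out here because it needs a lexical resource (for example NLTK's WordNetLemmatizer).

```python
import re
import string

# Tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "and", "do", "i", "why"}

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def toy_stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip if at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop numbers, punctuation and stop words, then stem."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)       # remove numbers
    text = text.translate(PUNCT_TABLE)     # remove punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # remove stop words
    return [toy_stem(t) for t in tokens]   # stem what is left

print(preprocess("Why are 3 dogs barking in the park?"))
# ['dog', 'bark', 'park']
```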
Constructing a Naive Bayes Classifier:
Apply all the preprocessing steps, then build a dictionary that maps each word to its count in the training data.
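One way to build such a dictionary is with collections.Counter. The sketch below assumes each question has already been preprocessed into a list of tokens; build_vocabulary is a hypothetical helper name.

```python
from collections import Counter

def build_vocabulary(tokenized_questions):
    """Count how often each word occurs across all training questions."""
    counts = Counter()
    for tokens in tokenized_questions:
        counts.update(tokens)
    return counts

word_counts = build_vocabulary([["dog", "bark"], ["dog", "park"]])
print(word_counts["dog"])  # 2
```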
Calculate the probability of each word in the text and filter out the words whose probability falls below a chosen threshold. Words below the threshold are treated as irrelevant.
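The filtering step might look like the sketch below. It assumes that a word's probability means its relative frequency in the training data (its count divided by the total word count), which is one reasonable reading of the description. The threshold value is purely illustrative and would need tuning.

```python
def filter_rare_words(word_counts, threshold):
    """Keep only words whose relative frequency is at least the threshold."""
    total = sum(word_counts.values())
    if total == 0:
        return {}
    probabilities = {word: count / total for word, count in word_counts.items()}
    return {word: p for word, p in probabilities.items() if p >= threshold}

# Illustrative counts and threshold (both are assumptions).
kept = filter_rare_words({"dog": 6, "bark": 3, "zyx": 1}, threshold=0.2)
print(sorted(kept))  # ['bark', 'dog']
```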
Then, for each word in the dictionary, compute the probability of that word appearing in insincere questions and the probability of it appearing in sincere questions. Use these to find the conditional probabilities that the Naive Bayes classifier needs.
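A minimal sketch of this last step, under a few assumptions that the description above does not spell out: labels are 1 for insincere and 0 for sincere, add-one (Laplace) smoothing is used so that unseen words do not produce zero probabilities, and log probabilities are summed to avoid numeric underflow. The function names are hypothetical.

```python
import math
from collections import Counter

def train_naive_bayes(tokenized_questions, labels, vocabulary):
    """Estimate log priors and smoothed per-class word log probabilities.

    Assumes labels are 1 (insincere) or 0 (sincere) and that both
    classes occur at least once in the training data.
    """
    class_word_counts = {0: Counter(), 1: Counter()}
    class_doc_counts = Counter(labels)
    for tokens, label in zip(tokenized_questions, labels):
        class_word_counts[label].update(t for t in tokens if t in vocabulary)

    log_prior = {c: math.log(class_doc_counts[c] / len(labels)) for c in (0, 1)}
    log_likelihood = {}
    for c in (0, 1):
        # Add-one smoothing: every vocabulary word gets one extra count.
        denom = sum(class_word_counts[c].values()) + len(vocabulary)
        log_likelihood[c] = {
            w: math.log((class_word_counts[c][w] + 1) / denom) for w in vocabulary
        }
    return log_prior, log_likelihood

def predict(tokens, log_prior, log_likelihood):
    """Return the class with the highest posterior score; unknown words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in tokens if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy data for illustration only.
questions = [["you", "idiot"], ["learn", "python"], ["idiot", "people"], ["learn", "cook"]]
labels = [1, 0, 1, 0]
vocab = {w for q in questions for w in q}
prior, likelihood = train_naive_bayes(questions, labels, vocab)
print(predict(["idiot"], prior, likelihood))            # 1
print(predict(["learn", "python"], prior, likelihood))  # 0
```

On real data the classes are likely to be imbalanced, so the priors matter and the decision rule may need adjusting beyond a plain argmax.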