Reddit SFW Plugin
Motivation

Functionality

Our Reddit SFW plugin offers the following functionality:
○ You can create or delete a profile.
○ In each profile, you can add subreddits to that profile's blacklist or whitelist.
○ You can switch Safesearch on or off. When it is on, the plugin runs a spaCy model on the backend over the comments of each post and classifies every post that Reddit has not explicitly marked as Top, NSFW, or Controversial.
○ You can load a profile and choose whether to follow its blacklist or its whitelist.
This is the UI of our plugin. It lets you create a new profile, select a profile, and select the blacklist or whitelist. The currently chosen profile is "home", with its whitelist selected; the whitelist contains r/news, r/politics, and r/AskReddit. As Figure 2 shows, only posts from these subreddits appear in the feed.
Approach
Our approach is as follows. We first collect top, controversial, and NSFW posts from Reddit that are already marked as such, and extract their comments using PRAW, the Python Reddit API Wrapper. Many posts on Reddit contain no text, only images or links, so for these posts we scrape the comments and analyse them to classify the post. We collected 100K comments from each category (Top, Controversial, NSFW) and used these as the dataset to train our classifier model.
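A rough sketch of how the comment collection could look with PRAW is shown below. The function names, the choice of listings, and the post limit are assumptions made for illustration; the original collection script is not shown in this post, and the credentials are placeholders.

```python
from typing import List


def make_reddit(client_id: str, client_secret: str, user_agent: str):
    """Build a read-only PRAW client (credentials come from a Reddit app)."""
    import praw  # third-party package: pip install praw

    return praw.Reddit(client_id=client_id,
                       client_secret=client_secret,
                       user_agent=user_agent)


def collect_comments(reddit, subreddit_name: str, listing: str = "top",
                     post_limit: int = 100) -> List[str]:
    """Return the comment bodies of posts from one listing of a subreddit.

    `reddit` is expected to behave like a praw.Reddit instance.
    """
    subreddit = reddit.subreddit(subreddit_name)
    if listing == "top":
        submissions = subreddit.top(limit=post_limit)
    elif listing == "controversial":
        submissions = subreddit.controversial(limit=post_limit)
    else:
        raise ValueError(f"unsupported listing: {listing}")

    bodies = []
    for submission in submissions:
        # Drop "load more comments" placeholders instead of fetching them.
        submission.comments.replace_more(limit=0)
        for comment in submission.comments.list():
            bodies.append(comment.body)
    return bodies
```

A call such as `collect_comments(make_reddit(...), "AskReddit", "controversial")` would then yield comment texts that can be labelled with the category of the listing they came from.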
We collected NSFW posts from some well-known NSFW subreddits such as r/nsfw, r/porn, and r/morbidreality. The subreddits used for top and controversial posts were r/funny, r/AskReddit, r/news, and r/politics. These subreddits are followed by a large number of people and carry a wide variety of posts, which helps build a dataset that spans many kinds of features. These are the subreddits that define Reddit itself.
For our current training approach, we used spaCy's TextCategorizer model to train on the extracted data. It is a CNN-based model that assigns a position-sensitive vector to each word in the document. The document tensor is then summarized by concatenating max and mean pooling, and a multilayer perceptron predicts an output vector whose length equals the number of classes.
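To make the pooling and prediction step concrete, here is a small pure-Python sketch. It is not spaCy's implementation: the random weights, the layer sizes, the ReLU hidden layer, and the per-class logistic output are assumptions chosen only to show the shapes involved, and the random matrix stands in for the per-word vectors a CNN would produce.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def pool_document(token_vectors: Matrix) -> List[float]:
    """Concatenate column-wise max pooling and mean pooling."""
    columns = list(zip(*token_vectors))
    max_pool = [max(col) for col in columns]
    mean_pool = [sum(col) / len(col) for col in columns]
    return max_pool + mean_pool


def dense(x: List[float], weights: Matrix, bias: List[float]) -> List[float]:
    """One fully connected layer: weights @ x + bias."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def mlp_scores(features, w_hidden, b_hidden, w_out, b_out) -> List[float]:
    """Hidden ReLU layer, then one logistic score per class."""
    hidden = [max(0.0, h) for h in dense(features, w_hidden, b_hidden)]
    logits = dense(hidden, w_out, b_out)
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
width, hidden_size, n_classes = 4, 8, 3  # 3 classes: Top, Controversial, NSFW

doc = random_matrix(5, width, rng)       # 5 words, one vector per word
features = pool_document(doc)            # length 2 * width
scores = mlp_scores(
    features,
    random_matrix(hidden_size, 2 * width, rng), [0.0] * hidden_size,
    random_matrix(n_classes, hidden_size, rng), [0.0] * n_classes,
)
```

Whatever the number of words in the document, `features` has a fixed length of twice the vector width, which is what lets a fixed-size perceptron sit on top of variable-length text.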
We achieved an F-score of 0.65 after 15 epochs of training. The dataset was not human-annotated; its labels rest only on the assumption that the comments of a post can accurately identify it as NSFW or controversial. Given that, we consider the results quite remarkable.
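For readers unfamiliar with the metric, the sketch below shows one common way to compute a per-class F-score and its macro average. The post does not say which averaging was used for the reported figure, so the macro average here is an assumption, and the labels are toy values rather than the plugin's data.

```python
from typing import Dict, List


def f_scores(gold: List[str], predicted: List[str],
             labels: List[str]) -> Dict[str, float]:
    """Per-class F1: harmonic mean of precision and recall."""
    scores = {}
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


def macro_f(scores: Dict[str, float]) -> float:
    """Unweighted mean of the per-class F1 values."""
    return sum(scores.values()) / len(scores)


# Toy labels for illustration only.
labels = ["top", "nsfw", "controversial"]
gold = ["top", "top", "nsfw", "nsfw", "controversial", "controversial"]
pred = ["top", "top", "nsfw", "nsfw", "controversial", "top"]

per_class = f_scores(gold, pred, labels)
overall = macro_f(per_class)
```

In this toy case one controversial post is mistaken for a top post, which lowers the precision of "top" and the recall of "controversial" while leaving "nsfw" untouched.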
Demo
We also have a demo of our plugin here. In the final plugin, NSFW and controversial posts are collapsed in the feed when safe search is on; for demo purposes we have only tagged them (in red, at the top right). The post “Trump retweets... ‘for treason’” has been marked as Controversial. A post from Statue gropers, in which a man gropes a statue of Wonder Woman, has been marked as NSFW. Reddit does not explicitly mark these posts, but after analysing their comments our model judges them to be potentially controversial and NSFW, respectively.