Yahoo's Not Safe For Work (NSFW)

To run this tutorial you will need a GPU. You may use AWS P2 GPU instances, which use NVIDIA Tesla K80 GPUs. Refer to "How to set up AWS infrastructure for machine learning".

  1. Code discussed here can be found on my GitHub repository.

  2. All safe and unsafe images can be found here [for educational purpose only] (password - nsfw)

  3. Final model generated after training can be found here. You may use this model for further training and fine tuning on your custom data-set.

  4. If you get any error, please check the requirements.txt file to verify your Python package versions.

  5. Code compatibility: Python 2.7, tested on Ubuntu 16.04 with Theano as backend.

Figure 1. An illustrative image of NSFW (source - Salt Lake magazine)

Not Safe For Work (NSFW) is a class of content that is unsuitable for minors or for viewing in public places. NSFW mainly involves pornographic/adult content. At present, NSFW content is mostly filtered through source-based approaches, which means blocking the URL or feed that is a known source of such content. On 15 April 2011 the International Foundation for Online Responsibility (IFFOR) initiated a domain called .xxx for adult content, so that such content can be blocked according to the user's preference. Wider acceptance of such a domain will take time, so it is better to apply science to this problem.

Present approaches to detecting nudity depend on the following:

  1. URL Name based blocking.

  2. File Name based blocking.

  3. Page Name based blocking.

  4. Keyword based blocking.

However, these approaches often fail for the following reasons:

  1. Ambiguous URLs.

  2. Ambiguous File Name/ Page Name.

  3. With the ever-changing nature of the Internet and billions of digital devices pushing data continuously, it becomes even more difficult to track such content.
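
To make these failure modes concrete, here is a minimal sketch of keyword-based URL blocking. The keyword list, the function name is_blocked and the example URLs are all hypothetical and chosen only for illustration; real filters are more elaborate.

```python
# Hypothetical keyword blocklist; not taken from any real filter.
BLOCKED_KEYWORDS = ["xxx", "adult", "porn"]

def is_blocked(url):
    """Return True when the URL contains any blocked keyword."""
    lowered = url.lower()
    return any(keyword in lowered for keyword in BLOCKED_KEYWORDS)

# A harmless page is blocked because its URL happens to contain a keyword.
false_alarm = is_blocked("http://example.com/adult-education-courses")
# An unsafe page with a neutral-looking URL passes straight through.
missed = is_blocked("http://example.com/gallery/img_0042.jpg")
```

The first URL is blocked although it is harmless, while the second is let through whatever the image actually shows. A model that looks at the pixels does not depend on the name at all.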

On 30 September 2016 Yahoo open-sourced a model that is capable of differentiating NSFW content from Safe For Work (SFW) content. In this blog post I will walk you through a step-by-step procedure to train a network to differentiate NSFW and SFW content.

The tutorial is divided into the following sub-parts for easy understanding:

1) Data Collection

2) Peeping inside collected data

3) Data Preprocessing

4) Data set size and Distribution

5) Model Definition

6) Examining results

1) Data Collection

Yahoo doesn't provide the data-set that was used for their research work, so I made one of my own. Here is how I did it.

A) NSFW images collection (+ve) :

You may go to any site containing adult content and scrape it. SIMPLE!! Isn't it?? I will neither describe the actual procedure nor provide any program for it, because of the known copyright/abuse troubles.

B) SFW images collection (-ve) :

In fact this part is very simple. In Mozilla Firefox, download a plugin named "FlashGot". Now go to any popular image search site like Tumblr, Google Images or Bing Image Search and search with keywords like "couple", "public kissing", "love couple". Download all images from the given page using FlashGot.

To download the NSFW and SFW images that I used for this tutorial, click here [password: nsfw]

Note that the provided images are for educational purposes only; there is no monetary intention behind them.

2) Peeping inside collected data

It is essential to check the quality of the data, since it determines the quality of the resulting model. The NSFW images are very clean, with very little noise, as I collected them from an adult site. The SFW images are considerably noisy, as they were collected from search engines and sometimes images not at all related to our purpose were also captured. Anyway, we have to move ahead with this data, since we cannot get clean data without manual curation. Impurity in the data will lower our accuracy, but this exercise will still provide an intuition that our approach really works. [You may go further with cleaner data on your own.]

Figure 2. Images totally unrelated to our purpose (junk) collected among the SFW images

Figure 3. NSFW images collected among the SFW images

Figures 2 and 3 show that our SFW image data-set is not clean, but let's move ahead with this data anyway.

3) Data Pre-processing

Download the data-set mentioned above and put it in your working directory before going ahead with the coding.

A) Importing requirements - while experimenting I imported many packages that are not needed; you may remove them.

import os  # used by the resizing and loading functions below
import os.path as path
import traceback

import matplotlib.pyplot as plt
import numpy as np
import theano
import keras
from PIL import Image  # used by imageResize below
from scipy import ndimage
from scipy.misc import imread, imresize, imsave
from sklearn.cross_validation import train_test_split
from keras import backend as K
from keras.models import Sequential
from keras.layers.core import Flatten, Dense, Dropout, Activation
from keras.layers.convolutional import Convolution2D, MaxPooling2D, ZeroPadding2D, UpSampling2D, UpSampling1D, Cropping2D
from keras.optimizers import SGD
from keras.utils import np_utils
from keras.utils.np_utils import to_categorical

# IPython/Jupyter magic; remove this line when running as a plain script
%matplotlib inline

# use the Theano ordering: (channels, height, width)
K.set_image_dim_ordering('th')

B) Resizing Images - The collected images are of various sizes and extensions. In this step we will convert the images to 224x224 dimensions. After invoking the functions given below, each image will be resized and placed in the SAFE_resized or UNSAFE_resized folder.

def imageResize(basename, imageName):
    """
    resize image to 224x224 and save it under <basename>_resized/
    basename : eg. /home/sunil/fishes/bet
    imagename : xyz.jpg
    """
    new_width = 224
    new_height = 224
    outDir = basename + '_resized/'
    # create the output folder on first use instead of relying on a failed save
    if not os.path.isdir(outDir):
        os.mkdir(outDir)
    img = Image.open(basename + "/" + imageName)  # image extension *.png,*.jpg
    img = img.resize((new_width, new_height), Image.ANTIALIAS)
    img.save(outDir + imageName)

def resizer(folderPath):
    """
    to resize all .jpg files present in a folder
    resizer('/home/sunil/imageTagging/data/allCats_resized/')
    """
    for subdir, dirs, files in os.walk(folderPath):
        for fileName in files:
            try:
                # print os.path.join(subdir, file)
                filepath = subdir + os.sep + fileName
                if filepath.endswith(".jpg"):
                    imageResize(subdir, fileName)
            except Exception:
                # unreadable or corrupt image: log the error and delete the file
                traceback.print_exc()
                os.remove(subdir + "/" + fileName)

# Actually applying resizing to images
resizer('/home/sunil/imageTagging/downloadData/SAFE')
resizer('/home/sunil/imageTagging/downloadData/UNSAFE')

C) Loading images as Numpy array

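def load_image(infilename):
    """
    load image from disk
    :param infilename: path of the image file
    :return: int32 array of shape (3, height, width)
    """
    img = ndimage.imread(infilename)
    data = np.asarray(img, dtype="int32")
    # note: reshape reinterprets the pixel buffer as (channels, height, width)
    # without moving the channel axis; np.transpose(data, (2, 0, 1)) would give
    # a true channels-first image, but the code that writes test images to disk
    # undoes this same reshape, so the two must be changed together
    resized = data.reshape(data.shape[2], data.shape[0], data.shape[1])
    return resized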

def turnToNumpy(folderPath):
    """
    turn stored images on disk to numpy
    turnToNumpy('/home/sunil/imageTagging/
    """
    temp = []
    for subdir, dirs, files in os.walk(folderPath):
        for fileName in files:
            try:
                filepath = subdir + os.sep + fileName
                image = load_image(filepath)  # load once and reuse
                if image.shape == (3, 224, 224):
                    # print filepath
                    temp.append(image)
            except Exception:
                # unreadable or non-RGB file: remove it from the data-set
                os.remove(subdir + "/" + fileName)
    return np.asarray(temp)

SAFE = turnToNumpy('/home/sunil/imageTagging/downloadData/SAFE_resized')
UNSAFE = turnToNumpy('/home/sunil/imageTagging/downloadData/UNSAFE_resized')
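
A note on load_image: reshape keeps the flat pixel buffer in its original order and only reinterprets the shape, so it is not the same as moving the channel axis to the front. The small sketch below, on a made-up 2x2 RGB array, shows the difference and the np.transpose alternative. Whether the network would train better on transposed input is not something this tutorial tested.

```python
import numpy as np

# Made-up 2x2 RGB image in (height, width, channel) layout.
image = np.arange(12).reshape(2, 2, 3)

# What load_image does: same buffer order, new shape.
reshaped = image.reshape(3, 2, 2)
# A true channels-first array: axis 2 (channel) moved to the front.
transposed = np.transpose(image, (2, 0, 1))

# Both have shape (3, 2, 2), but only the transpose keeps each
# colour channel together in one plane.
red_plane = image[:, :, 0]
same_shape = reshaped.shape == transposed.shape
reshape_keeps_red = np.array_equal(reshaped[0], red_plane)
transpose_keeps_red = np.array_equal(transposed[0], red_plane)

# Reshaping back restores the original image, which is why a
# reshape-based round trip is still lossless.
round_trip_ok = np.array_equal(reshaped.reshape(2, 2, 3), image)
```

Because the round trip is lossless, images written back to disk with the inverse reshape look correct even though the array fed to the network mixes the colour channels.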

4) Data set size and Distribution

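I have 5805 safe images and 8081 unsafe images. The experiment uses about 11610 images in total (note that 11610 = 2 x 5805, so the two classes appear to be balanced at 5805 images each). Of these, a random 70% (8127 images) go to the training set and the remaining 30% (3483 images) go to the test set.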
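
The split code itself is not shown in this post. One way it could be done is sketched below with plain NumPy. The function name make_split, the label convention (0 = safe, 1 = unsafe) and the fixed seed are assumptions for illustration; the train_test_split function imported earlier could be used instead.

```python
import numpy as np

def make_split(safe, unsafe, test_fraction=0.3, seed=0):
    """Shuffle both classes together and split them into train and test."""
    X = np.concatenate([safe, unsafe])
    # Assumed label convention: 0 = safe (negative), 1 = unsafe (positive).
    labels = np.concatenate([np.zeros(len(safe), dtype="int32"),
                             np.ones(len(unsafe), dtype="int32")])
    # One-hot rows, as expected by a 2-unit softmax output layer.
    y = np.eye(2, dtype="float32")[labels]

    order = np.random.RandomState(seed).permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Tiny stand-in arrays; in the tutorial these would be SAFE and UNSAFE.
demo_safe = np.zeros((10, 3, 4, 4), dtype="int32")
demo_unsafe = np.ones((10, 3, 4, 4), dtype="int32")
X_train, X_test, y_train, y_test = make_split(demo_safe, demo_unsafe)
```

With 11610 images and test_fraction=0.3 the same arithmetic gives 3483 test images and 8127 training images, matching the figures above.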

5) Model Definition

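Once again I have taken the popular VGG-16 network, which is known to perform well on images. Note that the fully connected layers used here (128 and 64 units, followed by a 2-unit softmax) are much smaller than those of the original VGG-16.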

Figure 4. VGG16 network architecture.

# defining model
model = Sequential()

# block 1: two 3x3 convolutions with 64 filters
model.add(ZeroPadding2D((1,1), input_shape=(3,224,224)))
model.add(Convolution2D(64, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(64, 3, 3, activation='relu'))
model.add(MaxPooling2D((2,2)))

# block 2: two 3x3 convolutions with 128 filters
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(128, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(128, 3, 3, activation='relu'))
model.add(MaxPooling2D((2,2)))

# block 3: three 3x3 convolutions with 256 filters
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(256, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(256, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(256, 3, 3, activation='relu'))
model.add(MaxPooling2D((2,2)))

# block 4: three 3x3 convolutions with 512 filters
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(MaxPooling2D((2,2)))

# block 5: three 3x3 convolutions with 512 filters
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(MaxPooling2D((2,2)))

# classifier: two small dense layers and a 2-class softmax output
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
# model.summary()
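
To see what this stack produces, the feature-map size can be traced by hand: each ZeroPadding2D((1,1)) followed by a 3x3 convolution keeps the height and width, and each MaxPooling2D((2,2)) halves them. The sketch below only does this arithmetic in plain Python and does not build the Keras model. The names BLOCKS and trace_shapes are made up for illustration.

```python
# (number of 3x3 conv layers, number of filters) for each of the five blocks.
BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

def trace_shapes(size=224, channels=3):
    """Return the (channels, height, width) shape after each block."""
    shapes = []
    for n_convs, n_filters in BLOCKS:
        for _ in range(n_convs):
            # pad by 1 on every side, then an unpadded 3x3 conv: size unchanged
            size = (size + 2) - 3 + 1
            channels = n_filters
        size = size // 2  # 2x2 max pooling halves height and width
        shapes.append((channels, size, size))
    return shapes

shapes = trace_shapes()
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
```

The last block yields a 512 x 7 x 7 volume, so Flatten() passes 25088 values to the first Dense layer.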

The output layer gives us one of the two classes, 1 - SFW and 2 - NSFW. I have taken stochastic gradient descent as the optimizer. Categorical cross-entropy was chosen as the loss function, as it performs well on multi-class classification.
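
As a worked example of the loss, categorical cross-entropy for one sample is the negative log of the probability that the network assigns to the true class. The probabilities below are made up for illustration, and the helper is a plain Python sketch rather than the Keras implementation.

```python
import math

def categorical_crossentropy(one_hot_target, predicted_probs):
    """Cross-entropy of one sample: -sum(t * log(p)) over the classes."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, predicted_probs))

target = [0.0, 1.0]  # the true class is the second one
confident_right = categorical_crossentropy(target, [0.1, 0.9])
unsure = categorical_crossentropy(target, [0.5, 0.5])
confident_wrong = categorical_crossentropy(target, [0.9, 0.1])
```

The loss is small (about 0.105) when the network is confidently right, about 0.693 when it is undecided, and large (about 2.303) when it is confidently wrong. That growing penalty is what drives the training.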

# compiling and fitting model
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])

model.fit(X_train, y_train, batch_size=10, nb_epoch=10, verbose=1, validation_data=(X_test, y_test))
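
Once training finishes, class labels can be read off the softmax output by taking the index of the larger probability. The sketch below uses a made-up probability array in place of model.predict(X_test) so that it runs on its own. The names pr and actual are the ones used by the code that writes test images to disk, and the class convention (0 = safe, 1 = unsafe) is an assumption.

```python
import numpy as np

# Stand-ins for model.predict(X_test) and y_test (one-hot rows).
probabilities = np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.1, 0.9]])
y_test_demo = np.array([[1, 0], [0, 1], [0, 1], [0, 1]])

pr = np.argmax(probabilities, axis=1)    # predicted class index per image
actual = np.argmax(y_test_demo, axis=1)  # true class index per image
accuracy = float(np.mean(pr == actual))  # fraction of matching labels
```

In this toy example three of the four predictions match, so the accuracy is 0.75.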

6) Examining results

Although accuracy was recorded continuously as the performance metric, we need to visually inspect some of the images to actually get an idea of the performance. Accuracy on the entire data-set was found to be 72%. The code snippet given below saves all test data-set images to disk with their actual and predicted class labels.

# writing test images to disk
# the name of the file would be imageNumber_actualClass_predictedClass.png
# pr and actual hold the predicted and the true class index of every test image
# (1 = unsafe/positive, 0 = safe/negative); the folders tp/, tn/, fp/ and fn/ must already exist
for imageNumber in range(0, len(X_test)):
    # undo the reshape done in load_image to get back a (224, 224, 3) image
    tempX = X_test[imageNumber].reshape(X_test[imageNumber].shape[1], X_test[imageNumber].shape[2], X_test[imageNumber].shape[0])
    # plt.show() # to show image here as well
    predicted = pr[imageNumber]
    fileName = str(imageNumber) + "_" + str(actual[imageNumber]) + "_" + str(predicted) + ".png"
    if (int(actual[imageNumber]) == 1 and int(predicted) == 1):
        imsave("tp/" + fileName, tempX)  # true positive
    if (int(actual[imageNumber]) == 0 and int(predicted) == 0):
        imsave("tn/" + fileName, tempX)  # true negative
    if (int(actual[imageNumber]) == 0 and int(predicted) == 1):
        imsave("fp/" + fileName, tempX)  # false positive
    if (int(actual[imageNumber]) == 1 and int(predicted) == 0):
        imsave("fn/" + fileName, tempX)  # false negative

Figure 5. Images as actually classified (True Positive, True Negative, False Positive and False Negative)

The image above summarizes the performance of our model on the test data-set. Accuracy on the test data-set was found to be ~72%, which suggests that the algorithm performs reasonably well on the test data-set. Let's examine where we are missing:

1) False Negative - these are samples which were originally positive (unsafe) and classified as negative (safe). When you look at the images, you will find that the algorithm did quite a good job: all images classified as safe are actually safe. As we scraped this data without manual curation, these safe images were labelled unsafe by mistake.

2) False Positive - these are samples which were originally negative (safe) and classified as positive (unsafe). When you look at the images, you will find that the algorithm did quite a good job: all images classified as unsafe are actually unsafe. As we scraped this data without manual curation, these unsafe images were labelled safe by mistake.
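
The four groups can also be counted rather than only inspected by eye. This is a minimal sketch in plain Python with made-up labels, again assuming 1 = unsafe (positive) and 0 = safe (negative). The function name confusion_counts is hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 0:
            counts["tn"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1  # safe image flagged as unsafe
        else:
            counts["fn"] += 1  # unsafe image let through as safe
    return counts

demo_actual = [1, 1, 0, 0, 1, 0, 0, 1]
demo_predicted = [1, 0, 0, 1, 1, 0, 0, 1]
counts = confusion_counts(demo_actual, demo_predicted)
accuracy = (counts["tp"] + counts["tn"]) / float(len(demo_actual))
```

Accuracy is simply (TP + TN) divided by the number of images, so mislabelled training and test images pull it down even when the prediction itself is sensible.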

Scope of improvement:

1) More data, the most required thing. I conducted the experiment with about 11000 images and it still performs well. If a similar experiment is repeated with millions of images, it should perform much better.

2) Quality of images - As we have seen, we are losing accuracy due to mis-tagged images. If a training set of good quality is used, it should improve the results.

Here I have trained the network from scratch; next time I will demonstrate a much better model using transfer learning.

If you like this tutorial, please share it with your colleagues. Discuss doubts and ask for changes on GitHub. It's free, no charges for anything. Let me get inspired by your responses and deliver even better.
