Sarcasm Detection Engine for Twitter Sentiment Analysis using Textual and Emoji Feature

Twitter is a social media platform used to express sentiments about events, topics, individuals, and groups. Sentiments in tweets can be classified as positive or negative expressions. However, a sentiment is sometimes expressed as the opposite of what is actually meant, and this is called sarcasm. Sarcasm in a tweet is challenging to detect automatically, and is difficult even for humans. In this research, we propose a weighting scheme based on the inconsistency between the sentiment of Indonesian tweets and their emoji usage. The weighting scheme for detecting sarcasm can be used when determining the sentiment about an event, topic, individual, group, or product review. The proposed method calculates the distance between the textual feature polarity score obtained from a Convolutional Neural Network and the emoji polarity score of a tweet. This distance is used to find the boundary value between tweets that contain sarcasm and those that do not. In experiments, the developed model obtained an f1-score of 87.5%, a precision of 90.5%, and a recall of 84.8%.


Introduction
One widely used social media platform is Twitter, which is used to discuss sentiments about events, topics, individuals, and groups [1]. Sentiment analysis (opinion mining) techniques analyze opinionated text, which contains people's opinions toward entities such as products, organizations, individuals, and events [2]. Sentiment snippets are essential for both companies and individuals looking to monitor their reputation [3]. They can be used as a convenient tool for feedback on their products and actions. The sentiment of a tweet can be classified as a positive or a negative response. The sentiments contained in tweets attract companies, organizations, and individuals who want to extract information from them. Because the number of characters in a tweet is limited, people express their opinions using slang, special characters, and so on, and these are not always understood in the same way by different people [4].
Some expressions of sentiment contradict what they literally say. This difference between meaning and expression is called sarcasm [5]. Sarcasm in tweets is challenging to detect automatically, and even for humans, because textual data lacks tonal and gestural cues such as vocal stress, eye movements, and hand movements [6]. The content of a tweet consists of textual features, which contain sentences or words, and non-textual features, namely emojis. When users write sarcastic tweets, their emoji usage deviates from the text: a tweet with positive text sentiment is paired with negative-valued emojis, and vice versa. Therefore, the sarcasm in a tweet can be assessed from the sentiment of its sentences together with the sentiment of its emojis.
Sentiment analysis is a part of Natural Language Processing (NLP) concerned with finding the intent of the opinions in a piece of text about the topic being discussed [6]. Sentiment analysis identifies the sentiment in an expression and then classifies it based on its polarity score [7].
Several studies have examined sarcasm in textual data. Kumar's research [8] proposed a novel approach to sarcasm classification using a content-based feature selection method. The data consists of Amazon reviews. Feature selection is carried out in two stages. In the first stage, features are selected by comparing chi-square, information gain, and mutual information. In the second stage, the k-means algorithm is used to group related features and choose those that best represent each group. The text classification results of the SVM and random forest methods are then compared. The study in [9] focused on a score for detecting sarcasm: the sarcasm score obtained by comparing tweets with a sarcasm corpus. In [8] and [9], sarcasm detection is based on textual features, which give good results only if the text is long enough, whereas tweets are short. Therefore, we assume that sarcasm is difficult to detect from the sentiment of the text alone. In addition, research on sarcasm detection in Indonesian tweets is still rare, so we focus on tweets in Indonesian.
In this research, we propose a weighting scheme based on the inconsistency between the sentiment of tweets in Indonesian and their emoji usage. The proposed method calculates the distance between the polarity of the textual features, obtained from a convolutional neural network, and the non-textual (emoji) polarity score of a tweet. This distance is used to find the boundary value between tweets that contain sarcasm and those that do not.

Methodology
In our proposed method, the model detects sarcasm in tweets using two features: textual features and non-textual features such as emojis. A polarity score is calculated for each of the two features, which is then labeled positive, negative, or neutral. The neutral label is not needed afterwards, because it does not affect the rest of the process.
After each feature has a label, filtering removes the features with a neutral label. The labels of the two features in a tweet are then compared. If the label of one feature is the opposite of the other, the tweet's sarcasm label is positive; otherwise it is negative. The labeled tweet dataset is used as training and testing data to build the sarcasm detection engine. An example of the dataset is shown in Table 1. The sarcasm detection engine has two main components: a text sentiment classifier using a CNN and an emoji sentiment classifier. The input to the text sentiment classifier is the text feature of the training and testing datasets; the training dataset is used to train the CNN, and the testing dataset is used to obtain sentiment scores. The input to the emoji sentiment classifier is the emoji feature of the testing dataset, from which the emoji polarity score is calculated. After the sentiment value of each feature has been obtained from the testing dataset, sarcasm classification calculates the distance between the text and emoji values. The optimal threshold for sarcasm labeling is the one with the highest f1-score over a predetermined interval. Figure 1 shows the steps of the proposed method, each of which is explained in the following sections.

Data Preparation
At the data preparation stage, tweets were retrieved from 3 November 2019 to 10 November 2019. We could only retrieve data within a seven-day window because of the limitations of the free (unpaid) public Twitter API. During this period, political topics were trending and involved many controversial matters raised by political figures, so we used only political topics. The keywords used to collect tweets were 'Jokowi', 'Prabowo', 'Fachrul Razi', and 'Anis Baswedan'. Table 2 shows the raw data obtained for each keyword; 77961 tweets were obtained in total. At the filtering stage, tweets that do not contain emojis are removed automatically using an emoji dictionary [10]. The filtered dataset (footnote 1) contains 6478 tweets.

Data Preprocessing
The tweet data that has been obtained needs to be preprocessed. The preprocessing stages in this research remove HTML encoding, mentions, hashtags, web links, punctuation, and stopwords. After that, case folding, stemming, and replacement of slang and unknown words are applied to each word in the tweet. Preprocessing is needed because tweet data is unstructured and noisy.
Stopword removal eliminates words that carry no meaning. The stopword dictionary comes from the NLTK and Sastrawi libraries. Stemming reduces words to their base form by removing suffixes, infixes, prefixes, and confixes. Slang and unknown words are replaced using a custom dictionary of slang and unknown words. This dictionary is obtained by looking up every word in the dataset in the Kamus Besar Bahasa Indonesia (KBBI). If a word is not in the KBBI, it is a candidate slang or unknown word. The dictionary entries are annotated manually. Table 3 shows a sample from the slang and unknown word dictionary. In this study, there were 6306 slang and unknown words.
Footnote 1: https://intip.in/SRCSMP
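As an illustration, the cleaning steps described above can be sketched in Python as follows. This is a minimal sketch and not the implementation used in this research: the stopword set and the slang dictionary are tiny hypothetical stand-ins for the NLTK/Sastrawi lists and the manually annotated dictionary, stemming is omitted, emojis are assumed to have been extracted separately, and the order of the steps is a choice made for the sketch.

```python
import html
import re
import string

# Hypothetical stand-ins: the NLTK/Sastrawi stopword lists and the manually
# annotated slang dictionary from the paper are not reproduced here.
STOPWORDS = {"yang", "di", "dan"}
SLANG = {"gk": "tidak", "bgt": "banget"}


def preprocess(tweet):
    """Clean one tweet and return its tokens (stemming is omitted)."""
    text = html.unescape(tweet)                         # decode HTML entities
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove web links
    text = re.sub(r"[@#]\w+", " ", text)                # remove mentions, hashtags
    text = text.lower()                                 # case folding
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [SLANG.get(tok, tok) for tok in text.split()]   # slang replacement
    return [tok for tok in tokens if tok not in STOPWORDS]   # stopword removal


print(preprocess("@user Gk suka &amp; kecewa bgt!! https://t.co/x #politik"))
```

Case folding is applied before the dictionary lookups in this sketch so that capitalized slang and stopwords are also matched.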

Data Sentiment Labeling
At the data labeling stage, the preprocessed dataset is labeled. Each tweet in the dataset has three labels: a sentiment label, an emoji label, and a sarcasm label. Sentiment labeling is done using the SentiWord dictionary. SentiWord is a lexicon-based sentiment resource that is generally used for sentiment analysis and provides a high-precision, high-coverage lexicon [13]. The SentiWord dictionary is built from a collection of positive, negative, and neutral values. In this research, we use Barasa SentiWord (footnote 2), which belongs to David Moeljadi, to label the sentiment value of a tweet. Equation 3 is the rule for the sentiment label of a tweet. In equation 1, the positive ratio is the number of positive words in a tweet divided by the total number of words in the tweet. In equation 2, the negative ratio is the number of negative words in a tweet divided by the total number of words in the tweet. The words used in SentiWord are nouns, verbs, adverbs, and adjectives. The results of sentiment labeling are shown in Table 4. After sentiment labeling, the dataset is filtered by removing tweets with a neutral sentiment label.
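The two ratios can be illustrated with a short Python sketch. The word sets below are hypothetical placeholders for the Barasa SentiWord entries. The decision rule in `sentiment_label` (the larger ratio wins, a tie is neutral) is an assumption made for this sketch, since the rule itself is not reproduced in the text.

```python
# Hypothetical mini-lexicon standing in for the Barasa SentiWord dictionary.
POSITIVE_WORDS = {"bagus", "hebat", "senang"}
NEGATIVE_WORDS = {"buruk", "kecewa", "gagal"}


def sentiment_ratios(tokens):
    """Return (positive ratio, negative ratio): word counts over total words."""
    total = len(tokens)
    if total == 0:
        return 0.0, 0.0
    pos = sum(tok in POSITIVE_WORDS for tok in tokens)
    neg = sum(tok in NEGATIVE_WORDS for tok in tokens)
    return pos / total, neg / total


def sentiment_label(tokens):
    """Assumed rule (not taken from the paper): larger ratio wins, tie is neutral."""
    pos_ratio, neg_ratio = sentiment_ratios(tokens)
    if pos_ratio > neg_ratio:
        return "positive"
    if neg_ratio > pos_ratio:
        return "negative"
    return "neutral"


print(sentiment_ratios(["bagus", "hebat", "gagal", "sekali"]))
print(sentiment_label(["bagus", "hebat", "gagal", "sekali"]))
```

Tweets for which `sentiment_label` returns "neutral" correspond to the ones removed in the filtering step.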

Data Emoji Labeling
Emojis are graphical representations of user feelings, generally in the form of character combinations or Unicode characters. Emojis are very effective in describing a person's feelings [10]. Emoji labeling is done using the emoji polarity dictionary [14], which contains positive and negative polarity values for each emoji. Emoji labeling is defined in equation 4.
Equation 4 uses two counts: the number of positive-value emojis and the number of negative-value emojis in a tweet. Table 5 shows the results of emoji labeling. Tweets that have a neutral emoji label are discarded.
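A possible reading of the emoji labeling step is sketched below in Python. The polarity values are invented for illustration and are not taken from the emoji polarity dictionary [14]. Two further assumptions are made: an emoji counts as positive-value when its positive polarity exceeds its negative polarity, and the tweet takes the label of the larger count, with a tie labeled neutral.

```python
# Hypothetical (positive, negative) polarity values for illustration only;
# the paper takes these from the emoji polarity dictionary [14].
EMOJI_POLARITY = {
    "\U0001F600": (0.57, 0.10),  # grinning face
    "\U0001F60D": (0.77, 0.05),  # smiling face with heart-eyes
    "\U0001F621": (0.17, 0.56),  # pouting face
    "\U0001F62D": (0.34, 0.43),  # loudly crying face
}


def emoji_label(emojis):
    """Label a tweet by comparing its positive and negative emoji counts."""
    n_pos = n_neg = 0
    for emoji in emojis:
        pos, neg = EMOJI_POLARITY.get(emoji, (0.0, 0.0))
        if pos > neg:
            n_pos += 1  # assumed meaning of a "positive-value" emoji
        elif neg > pos:
            n_neg += 1
    if n_pos > n_neg:
        return "positive"
    if n_neg > n_pos:
        return "negative"
    return "neutral"    # tie or no known emoji; such tweets are discarded


print(emoji_label(["\U0001F600", "\U0001F60D", "\U0001F621"]))
```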

Sarcasm Sentiment Labeling
The sarcasm sentiment labeling stage is the last step in the data labeling step. Sarcasm labeling is done by using the rules described in equation 5 below.
In equation 5, when the sentiment label differs from the emoji label, the sarcasm label is positive; if the two labels are the same, the sarcasm label is negative. In other words, a tweet is labeled as sarcasm when its emojis deviate from the sentiment of its text, and as non-sarcasm when the emojis match the sentiment conveyed by the text [15].
Table 6 shows the results of the sarcasm labeling process: 2018 tweets are labeled as sarcasm and 2460 as non-sarcasm. However, it is necessary to balance the dataset by removing tweets with a neutral sentiment or emoji label. The final dataset is described in Table 7. It contains 1200 tweets with a positive sentiment label, of which 103 are sarcastic and 1097 are not, and 1200 tweets with a negative sentiment label, of which 1094 are sarcastic and 106 are not.
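The labeling rule described for equation 5 is simple enough to state as code. The sketch below is illustrative only; the function name and the choice of 1/0 for sarcasm/non-sarcasm are assumptions, as is returning `None` for tweets that would already have been filtered out as neutral.

```python
def sarcasm_label(sentiment, emoji):
    """Return 1 (sarcasm) when the two labels disagree, 0 when they agree."""
    if "neutral" in (sentiment, emoji):
        return None  # neutral tweets are removed before this step
    return int(sentiment != emoji)


# Positive text paired with negative emojis is labeled as sarcasm.
print(sarcasm_label("positive", "negative"))
print(sarcasm_label("negative", "negative"))
```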

Word Embedding Creation
Word embedding is a topic in natural language processing that aims to build vector representations of words from a variety of texts. Word embedding provides a more expressive and efficient representation by preserving each word's context in a low-dimensional vector. One well-known method, Global Vectors (GloVe), was proposed by Pennington et al. [11].
At the word embedding stage, the final dataset of 2400 tweets is used. The text of each tweet is tokenized and stored as a corpus. The GloVe model is built with the parameters described in Table 8. The trained GloVe model produces a document that maps each unique word to a 100-dimensional vector. This vector document is then used for the embedding layer of the CNN architecture.
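To show how such a vector document can feed an embedding layer, the following Python sketch parses vectors and builds an embedding matrix. It assumes the common plain-text layout of one word per line followed by its space-separated vector components; the function names, the reserved padding row, and the zero vector for unknown words are choices made for this sketch, and a dimension of 3 is used instead of 100 to keep the example short.

```python
def load_vectors(lines, dim):
    """Parse 'word v1 ... v_dim' lines into a dict mapping word -> float list."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(vocab, vectors, dim):
    """Row 0 is reserved for padding; words without a vector stay all-zero."""
    matrix = [[0.0] * dim for _ in range(len(vocab) + 1)]
    for index, word in enumerate(vocab, start=1):
        if word in vectors:
            matrix[index] = vectors[word]
    return matrix


sample = ["bagus 0.1 0.2 0.3", "buruk -0.1 -0.2 -0.3"]
vectors = load_vectors(sample, dim=3)
print(build_embedding_matrix(["bagus", "tidak", "buruk"], vectors, dim=3))
```

The resulting matrix has one row per vocabulary index, which is the shape an embedding layer expects for its initial weights.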

Sarcasm Detection Engine
The development of the Sarcasm Detection Engine is the last stage of this research. The Sarcasm Detection Engine has two main components: a text sentiment classifier using a CNN and an emoji sentiment classifier. The concept of CNN was refined by Yann LeCun, a researcher at AT&T Bell Laboratories in Holmdel, New Jersey, USA, whose CNN model LeNet was used to recognize digits and handwriting [12].
CNN is one of the methods applied in deep learning. Like neural networks in general, it is trained with backpropagation. A CNN has several types of layers, namely convolution layers, subsampling/pooling layers, and fully connected layers, and uses activation functions such as ReLU and sigmoid. Figure 3 shows the CNN architecture used. In this research, we did not use reference parameters from existing research. We carried out several experiments, including changing the form of the CNN architecture and its layers, and took the best results. These results are not necessarily optimal, because this research has not yet covered all the parameters.
Detailed parameters for each layer are given in Tables 9, 10, 11, 12, 13, 14, and 15. The input to the text sentiment classifier is the text feature of the training and testing datasets. The training dataset is used to train the CNN, and the testing dataset is used to obtain sentiment scores.
In the emoji polarity score calculation, the polarity score of the emojis in a tweet is calculated using equation 6.
The input to the emoji sentiment classifier is the emoji feature of the testing dataset. In equation 6, one term is the sum of the positive emoji polarity scores and the other is the sum of the negative emoji polarity scores. The function produces values in the range [-1,1], which are then normalized with MinMax normalization to the range [0,1]. After the polarity value of each feature has been obtained from the testing dataset, sarcasm is classified by calculating the distance between the text and emoji polarity values. The optimal distance threshold for sarcasm labeling is the one with the highest f1-score among the interval values evaluated by the pseudocode in Figure 4.
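The scoring and distance steps can be sketched in Python under stated assumptions. The form of the polarity function below, (P - N) / (P + N), is an assumption chosen only because it maps the two sums to the range [-1,1]; it is not taken from equation 6. MinMax normalization is applied with the known bounds -1 and 1, and the distance is taken as the absolute difference between the text score (assumed to be a CNN output in [0,1]) and the normalized emoji score.

```python
def emoji_polarity(scores):
    """scores: one (positive, negative) polarity pair per emoji in the tweet."""
    pos = sum(p for p, _ in scores)  # sum of positive polarity scores
    neg = sum(n for _, n in scores)  # sum of negative polarity scores
    if pos + neg == 0:
        return 0.0
    return (pos - neg) / (pos + neg)  # assumed form; result lies in [-1, 1]


def min_max(value, low=-1.0, high=1.0):
    """MinMax normalization from [low, high] to [0, 1]."""
    return (value - low) / (high - low)


def polarity_distance(text_score, emoji_score):
    """Distance between the text and emoji polarity values, both in [0, 1]."""
    return abs(text_score - emoji_score)


raw = emoji_polarity([(0.1, 0.7), (0.2, 0.6)])  # hypothetical negative emojis
emoji_score = min_max(raw)
print(polarity_distance(0.9, emoji_score))      # 0.9: hypothetical CNN output
```

A large distance means the text and the emojis point in opposite directions, which is the inconsistency the weighting scheme looks for.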

Result and Analysis
To get optimal results from the sarcasm detection engine, we conducted several experiments on its components.
The first experiment tuned the hyperparameters of the CNN model. This experiment uses the architecture described for the proposed method. It was conducted with cross-validation: we divided the training data into two parts, namely 1800 tweets for training and 200 tweets for validation.
The experiment aims to find a CNN model that neither underfits nor overfits. Across the experiments, conducted with test data of 400 tweets, the highest accuracy obtained was 87.5%. The second experiment compared word embedding models. The model we propose uses GloVe word embeddings, but we also experimented with Word2Vec CBOW word embeddings. This experiment aims to find the best word embedding model for the CNN embedding layer. The first parameter tested is the output dimension, 100 or 300. The second parameter is the use of additional training data from 379,557 documents in Indonesian Wikipedia. The final parameter is whether the embedding layer is trainable during the CNN training phase. Table 15 shows that pre-trained embeddings that use the additional Wikipedia training data can increase the accuracy of the CNN model when the embedding layer is not trainable during the CNN training phase.
Based on the two experiments, we chose the CNN model using GloVe word embeddings trained only on the tweet dataset, with an output vector length of 100. This CNN architecture obtained the best accuracy, 87.5%.
The optimal threshold for the sarcasm detection engine is selected by finding the highest f1-score over the candidate threshold values. In this experiment, the threshold is increased in steps of 0.01 over the range [0,1]. Table 17 shows the four threshold values with the highest f1-scores. The best thresholds for the sarcasm detection engine range from 0.37 to 0.38, with an f1-score of 87.59%, a precision of 90.58%, and a recall of 84.80%. To validate the model, we submitted the tweets labeled by the model to three experts for their judgment. Table 18 shows a sample of tweets with the sarcasm label assigned by the system and the experts' judgment, which is used as the ground truth; label 1 indicates that the emojis in the tweet match the sentiment label. Based on the sample, the proposed system works well, as indicated by the agreement between the system labels and the ground truth labels.
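The threshold selection described in this section can be sketched in Python as a sweep over [0,1] in steps of 0.01. This is an illustrative sketch, not the pseudocode of Figure 4: the function names are hypothetical, a tweet is assumed to be predicted as sarcasm when its distance exceeds the threshold, and the first threshold reaching the highest f1-score is kept.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and f1 for the positive (sarcasm = 1) class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def best_threshold(distances, y_true, step=0.01):
    """Sweep thresholds over [0, 1]; return (threshold, f1) with the best f1."""
    best_f1, best_thr = 0.0, 0.0
    for i in range(int(round(1 / step)) + 1):
        threshold = i * step
        # Assumed decision rule: distance above the threshold means sarcasm.
        y_pred = [int(d > threshold) for d in distances]
        f1 = precision_recall_f1(y_true, y_pred)[2]
        if f1 > best_f1:
            best_f1, best_thr = f1, threshold
    return best_thr, best_f1


# Toy distances and labels, invented for illustration.
print(best_threshold([0.105, 0.205, 0.605, 0.805], [0, 0, 1, 1]))
```

As a consistency check on the reported figures, the harmonic mean of a precision of 90.58% and a recall of 84.80% is approximately 87.6%, which agrees with the reported f1-score.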

Conclusion
This research has built a sarcasm detection engine for Indonesian tweets that detects sarcasm based on textual and emoji features. We proposed a supervised machine learning approach that uses a Convolutional Neural Network to calculate the sentiment polarity of the text and emoji weighting to calculate the emoji polarity score. The proposed method thus focuses on textual features and emojis for finding sarcastic tweets. We also conducted experiments on a component of the sarcasm detection engine, namely the Convolutional Neural Network.
The Convolutional Neural Network architecture that we built includes an embedding layer using GloVe with a vector length of 100, trained on the tweet dataset. The accuracy of the model was 87.5%, which shows that it can determine sentiment polarity well.
The sarcasm detection engine that we built has an f1-score of 87.59%, which indicates a good level of performance. This is supported by expert validation, in which the system's results matched the experts' judgment.
This research shows that textual and emoji features together can be used to determine whether an expression is sarcastic.
We recognize that the model is not perfect, so further research on the sarcasm detection engine is needed. In the future, the engine can be integrated with sarcasm detection based on textual features only, in which a word whose polarity value differs greatly from that of its closest neighbor is categorized as an expression of sarcasm. This should make the results of the engine more accurate. Expert linguists should also annotate the dataset so that its annotations are more valid.