Bot Spammer Detection in Twitter Using Tweet Similarity and Time Interval Entropy

The popularity of Twitter has attracted spammers who disseminate large amounts of spam messages. Preliminary studies have shown that most spam messages are produced automatically by bots. Therefore, bot spammer detection can significantly reduce the number of spam messages on Twitter. However, to the best of our knowledge, few studies have focused on detecting Twitter bot spammers. This paper therefore proposes a novel approach to differentiate between bot spammer and legitimate user accounts using time interval entropy and tweet similarity. Timestamp collections are used to calculate the time interval entropy of each user, and uni-gram matching-based similarity is used to calculate tweet similarity. The datasets are crawled from Twitter and contain both normal and spammer accounts. Experimental results show that legitimate users may exhibit the same regular tweet-posting behavior as bot spammers, and that several legitimate users also post similar tweets. Detecting bot spammers with only one of these features is therefore suboptimal, whereas combining both features gives a better classification result. The precision, recall, and f-measure of the proposed method reach 85.71%, 94.74%, and 90%, respectively. The proposed method outperforms methods that use only time interval entropy or only tweet similarity in recall and f-measure, while its precision is slightly below that of time interval entropy alone.


Introduction
Due to the rapid development of internet connectivity, the number of users of Online Social Networking (OSN) websites is also increasing. Nowadays, OSN has become part of many people's daily routine. People may spend a significant amount of time on popular OSN sites, where they store and share personal information. Among the various OSN sites, Twitter is considered one of the most popular. In the last quarter of 2012, Global Web Index reported Twitter as the fastest-growing website, with a growth rate in active users of 714% since July 2009 [1]. Moreover, Twitter was among the top 10 most viewed websites in November 2014 [2]. Twitter is a micro-blogging service that was founded in 2006. Twitter users communicate with each other by producing text-based posts, better known as tweets. The tweet size is limited to 140 characters. In total, Twitter users publish 500 million tweets per day. Its simplicity has attracted a huge number of people to join. Currently it has up to 284 million monthly active users [3].
However, the popularity of Twitter has also attracted many spammers, who use it to disseminate large amounts of spam messages. They try to exploit the network of trust among Twitter users for their own benefit, such as promoting personal blogs, spreading advertisements, phishing, and scams. The misuse of Twitter can worsen with the use of automated programs.
An automated program, better known as a bot (short for robot), does not require a human operator to do its job. Preliminary studies have indicated that most spam messages on Twitter are generated automatically by bots [4] and only very few are posted manually by humans [5]. A bot spammer can automatically generate spam messages at a given time interval using a job scheduler [6]. Using bots reduces the high cost of managing spam accounts manually, which makes it easier for spammers to generate more spam messages on Twitter.
The increasing number of spam messages can degrade the experience of legitimate users on Twitter. It pollutes real-time information sharing on Twitter and wastes legitimate users' resources [7]. Therefore, more rigorous efforts are required to stop the further growth of spammers on Twitter. Twitter itself provides a mechanism to curb spam by inviting users to actively report spam messages and accounts. However, this takes much time and many resources because some reports are fake. Mistakenly labeling a legitimate user account as spam can harm users' trust in Twitter [8]. Several studies have addressed automation (bot) and spam detection to help fight spam, particularly on Twitter.
This paper proposes a novel approach that combines entropy and tweet similarity to identify bot spammers. Time interval entropy is used to capture the regularity of tweeting behavior, which indicates automation. Entropy has been widely used to detect automation, and several studies [5,10] use it to distinguish between bot and human behavior. In addition, tweet similarity is used to indicate how likely a Twitter account is to be a bot spammer, since many spammers repeatedly tweet the same or similar posts to increase the probability of luring legitimate users into visiting. Their tweets tend to be highly homogeneous [4,9]. Instead of using cosine similarity as in [4], we use uni-gram matching-based similarity to overcome the shortcomings of cosine similarity on short texts, as in [11].
The rest of the paper is organized as follows. Related work is briefly reviewed in Section 2. The proposed method is elaborated in Section 3. Section 4 covers the experiments, including data collection, results, and discussion. The last section presents the conclusion and future work.

Related Work
Several studies have addressed automation (bot) and spam detection to help fight spam, particularly on Twitter.
Chu et al. [5] propose classifying Twitter users into three categories: human, cyborg, and bot. Entropy, spam detection, and account properties are used to identify bots and the other categories. Among those features, entropy produces the highest classification accuracy. Entropy effectively captures the timing behavior that distinguishes each category.
Zhang and Paxson [6] use Pearson's χ² test to identify automation on Twitter from users' timestamp collections. Among the observed users, 16% exhibit highly automated behavior. In addition, keywords associated with spam generally have higher automation rates than other keywords.
Amleshwaram et al. [4] introduce CATS, which stands for Characterizing Automation of Twitter Spammers. They use various features to detect spam accounts, including tweet similarity measured with cosine similarity.
Rather than detecting spam accounts directly, Stringhini et al. [9] create honey-profiles on three popular OSN websites (Facebook, MySpace, and Twitter) to lure spam accounts and analyze their behavior. Various features are then used to identify spam accounts on these OSN sites, including message similarity, since the observed spammers tweet very similar messages in size, content, and advertised websites.

Methods
In this paper, we propose a novel approach to distinguish between bot spammer and legitimate user accounts. Because of its importance, spam detection has been widely researched; however, few studies have focused on bot spammer detection, even though preliminary studies have indicated that most spam messages are produced by bots.
Our proposed method uses not only a behavior-based feature (time interval entropy) but also a content-based feature (tweet similarity). For each user k, the time interval entropy and the tweet similarity are calculated and combined to determine the class of the user account. The flow of the overall system is presented in Figure 1.
First, we collect the timestamps of each user account, which give the time intervals between the account's tweets. The time interval entropy is calculated using equation (1) and equation (2), as used in [5]. The time interval between consecutive tweets is denoted by ΔT, whereas P(ΔT_i) denotes the probability of observing time interval ΔT_i. The entropy component can detect periodic or regular timing, which is a strong indication of automation. A lower entropy value indicates more regular behavior.
Since spammers tend to tweet similar messages, we calculate tweet similarity using uni-gram matching-based similarity, as presented in equation (3) and equation (4).
In these equations, the intersection of two tweets counts the words that match between them, and the size of tweet i is the number of words it contains. The tweet similarity of user k is thus the average of the pairwise tweet similarities over that user's tweets, where the number of pairs is determined by the number of tweets of user k. Before the similarity is calculated, each tweet has to be preprocessed. Preprocessing covers four steps: cleaning, stop-word removal, tokenizing, and stemming. The cleaning step removes several parts of the tweet, including URLs, mentioned user accounts, hashtags, and the RT marker. Stop-words are then removed using the stop-list from [13]. Afterwards, each tweet is tokenized and its words are reduced to root words using the stemming algorithm proposed by Arifin and Setiono [14].
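The two features described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the entropy is plain Shannon entropy over binned intervals (the bin width `bin_seconds` is hypothetical), the pairwise similarity is assumed to take the Dice form 2·|matches| / (|tweet i| + |tweet j|), the stop-list is a three-word placeholder rather than the list from [13], and stemming is left as an identity function instead of the algorithm of [14]. Tokenizing is also done before stop-word removal here for simplicity.

```python
import math
import re
from collections import Counter
from itertools import combinations

URL_RE = re.compile(r"https?://\S+")
MENTION_OR_HASHTAG_RE = re.compile(r"[@#]\w+")
RT_RE = re.compile(r"\bRT\b")
STOP_WORDS = frozenset({"yang", "dan", "di"})  # placeholder, NOT the stop-list of [13]


def interval_entropy(timestamps, bin_seconds=60):
    """Shannon entropy (bits) of the binned gaps between consecutive tweets.

    `timestamps` are seconds since any fixed origin; the binning is an assumption.
    """
    ordered = sorted(timestamps)
    gaps = [int((b - a) // bin_seconds) for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        return 0.0
    total = len(gaps)
    # p * log2(1 / p) summed over the observed interval bins
    return sum((c / total) * math.log2(total / c) for c in Counter(gaps).values())


def preprocess(tweet, stop_words=STOP_WORDS, stem=lambda word: word):
    """Clean (URL, mention, hashtag, RT), tokenize, drop stop-words, stem."""
    text = URL_RE.sub(" ", tweet)
    text = MENTION_OR_HASHTAG_RE.sub(" ", text)
    text = RT_RE.sub(" ", text)
    tokens = re.findall(r"\w+", text.lower())
    return [stem(tok) for tok in tokens if tok not in stop_words]


def pair_similarity(words_a, words_b):
    """Uni-gram matching similarity of two token lists (assumed Dice form)."""
    if not words_a and not words_b:
        return 0.0
    matches = sum((Counter(words_a) & Counter(words_b)).values())
    return 2 * matches / (len(words_a) + len(words_b))


def tweet_similarity(tweets):
    """Average pairwise similarity over all tweets of one user."""
    docs = [preprocess(t) for t in tweets]
    pairs = list(combinations(docs, 2))
    if not pairs:
        return 0.0
    return sum(pair_similarity(a, b) for a, b in pairs) / len(pairs)
```

An account that tweets every ten minutes gets an entropy of 0 under this sketch, and an account that repeats the same wording with different links gets a similarity of 1, which matches the intuition that low entropy and high similarity both point toward a bot spammer.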
Last, both values are combined using equation (5) to classify each user account into its designated class.
For each user k, the time interval entropy and tweet similarity values are multiplied by weighting factors to obtain the final value. There is one weighting factor for time interval entropy and one for tweet similarity, and the two must sum to 1. The final score of user k equals the sum of the weighted time interval entropy and tweet similarity divided by the sum of the weighted maximum time interval entropy and tweet similarity.
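The combination step can be sketched as below. The sketch takes the verbal description literally: a weighted sum of the two features divided by the same weighted sum of their maxima. The names `alpha` and `beta` for the weighting factors are hypothetical, and whether the entropy term is inverted or rescaled before it is combined is not stated in the text, so no such transformation is applied here.

```python
def combined_score(entropy, similarity, max_entropy, max_similarity,
                   alpha=0.5, beta=0.5):
    """Weighted combination of the two features, normalized by their maxima.

    alpha and beta are hypothetical names for the two weighting factors;
    they must sum to 1, as required in the text.
    """
    if abs(alpha + beta - 1.0) > 1e-9:
        raise ValueError("weighting factors must sum to 1")
    denominator = alpha * max_entropy + beta * max_similarity
    if denominator == 0:
        return 0.0
    return (alpha * entropy + beta * similarity) / denominator


# Example with made-up feature values and equal weights (ratio 1:1).
score = combined_score(entropy=1.0, similarity=0.5,
                       max_entropy=2.0, max_similarity=1.0)
print(score)  # 0.5
```

Because the denominator uses the maxima over the observed accounts, the score stays between 0 and 1 as long as both features are non-negative, which makes a single threshold applicable to all accounts.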

Results and Analysis
In this section, we first describe the data collection. The detailed experiments are presented afterwards.

Data Collection
The datasets are crawled from Twitter using the Twitter Streaming API, which gives third parties access to Twitter's global stream of tweet data [12]. In total, 56 accounts that tweet in Bahasa Indonesia are observed, comprising both normal and spam accounts. Approximately 2000 tweets are collected from each account. Because no ground truth is available, we manually check each account profile and classify it as bot spammer or legitimate user. A user is classified as a spammer after its tweet content is checked: a tweet that contains unsolicited advertisement is considered spam. In addition, we check the following-to-follower ratio of each profile. According to preliminary studies [5,7,9], spammers tend to follow many accounts and have few followers. In total, the dataset consists of 38 bot spammers and 18 legitimate user accounts.

Discussion
To evaluate the performance of the proposed method quantitatively, precision, recall, and f-measure are used. Precision, or positive predictive value, is the fraction of retrieved instances that are relevant. Recall, or sensitivity, is the fraction of relevant instances that are successfully retrieved. F-measure is an accuracy measure that considers both precision and recall. Precision, recall, and f-measure are presented in equation (6), equation (7), and equation (8), and are calculated from the numbers of true positives, false negatives, and false positives. In this paper, true positive refers to the number of correctly classified bot spammers. False positive is the number of legitimate users incorrectly classified as bot spammers, whereas false negative is the number of bot spammers incorrectly classified as legitimate users.
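For reference, the three measures are commonly written as below, with TP, FP, and FN abbreviating the true positive, false positive, and false negative counts (abbreviations introduced here for compactness). The f-measure is assumed to be the balanced form, with which the percentages reported in this section are consistent.

```latex
\begin{align}
\text{precision} &= \frac{TP}{TP + FP} \tag{6} \\
\text{recall} &= \frac{TP}{TP + FN} \tag{7} \\
\text{f-measure} &= \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{8}
\end{align}
```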
In the first experiment, we classify each user account using time interval entropy. A low entropy value indicates regular behavior. Therefore, a user whose entropy is lower than the threshold is classified as a bot spammer. In this experiment we use 0.2 as the threshold. The threshold is determined using an exhaustive search algorithm that maximizes the f-measure; the search itself is not reported here. The initial value of the threshold is 1, and it is increased in steps of 0.5.
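The exhaustive threshold search is not reported in detail, so the following is only a sketch of how such a search could look. The candidate grid, the function names, and the toy scores are all hypothetical; the only parts taken from the text are that the f-measure is maximized and that, for entropy, an account is labeled a bot spammer when its value is below the threshold.

```python
def f_measure(tp, fp, fn):
    """Balanced f-measure from raw counts; 0.0 when nothing is retrieved."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(scores, is_bot, candidates, bot_if_below=True):
    """Try every candidate threshold and keep the one with the highest f-measure.

    bot_if_below=True matches the entropy experiment (low value means bot);
    set it to False for features where a high value means bot.
    """
    best_value, best_f = None, -1.0
    for threshold in candidates:
        predicted = [(s < threshold) if bot_if_below else (s > threshold)
                     for s in scores]
        tp = sum(p and y for p, y in zip(predicted, is_bot))
        fp = sum(p and not y for p, y in zip(predicted, is_bot))
        fn = sum((not p) and y for p, y in zip(predicted, is_bot))
        current = f_measure(tp, fp, fn)
        if current > best_f:  # ties keep the earliest candidate
            best_value, best_f = threshold, current
    return best_value, best_f


# Toy data: three regular (bot-like) accounts and two irregular ones.
toy_scores = [0.05, 0.1, 0.15, 0.5, 0.9]
toy_labels = [True, True, True, False, False]
toy_grid = [i / 10 for i in range(1, 10)]
print(best_threshold(toy_scores, toy_labels, toy_grid))  # (0.2, 1.0)
```

The same routine applies to the other two experiments by passing `bot_if_below=False`, since for tweet similarity and for the combined value an account is labeled a bot spammer when its value exceeds the threshold.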
In the second experiment, instead of time interval entropy, we use the tweet similarity of each user account for classification. A user whose similarity is higher than the threshold is classified as a bot spammer. In this experiment we use 0.6 as the threshold. The same exhaustive search algorithm is used to determine the threshold.
In the last experiment, we implement the proposed method, which combines time interval entropy and tweet similarity to classify Twitter user accounts. A series of experiments, not reported here, was conducted beforehand to determine the ratio of the two weighting factors. According to those experiments, the best ratio is 1:1. The optimum threshold for this experiment, derived using the exhaustive search algorithm, is 0.75. A user account is classified as a bot spammer if its combined value is higher than the threshold. The classification results of all methods are presented in Table 1.
According to the classification results in Table 1, several legitimate users are misclassified as bot spammers because their entropy is lower than the threshold. It can be inferred that legitimate users can also exhibit regular tweet-posting behavior. Thus, time interval entropy alone is inadequate to distinguish bot spammers from legitimate users. Precision, recall, and f-measure for classification using time interval entropy are 86.49%, 84.21%, and 85.33%, respectively.
As presented in Table 1, when tweet similarity is used, most bot spammers are correctly classified, but several legitimate users are misclassified as bot spammers. Those legitimate users tend to post tweets on similar topics and therefore have high tweet similarity. On the contrary, several bot spammers publish tweets that are quite heterogeneous: even though they promote similar links, they use different wording. Those bot spammers therefore cannot be detected. Precision, recall, and f-measure for classification using tweet similarity are 75%, 63.16%, and 68.57%, respectively.
The proposed method achieves a precision, recall, and f-measure of 85.71%, 94.74%, and 90%, respectively. A comparison of all experiments is presented in Figure 2.
TIE, TS, and PM are abbreviations of Time Interval Entropy, Tweet Similarity, and Proposed Method, respectively. As presented in Figure 2, the proposed method performs better than classification using tweet similarity only. However, its precision is slightly lower than that of the classification method that uses time interval entropy. The proposed method increases the number of true positives and decreases the number of false negatives. Thus, in general, the proposed method still outperforms the classification methods that use only time interval entropy or only tweet similarity.
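The reported percentages can be cross-checked with a short calculation. The counts below are not stated in the text; they are back-calculated here from the reported precision and recall together with the 38 bot spammer accounts in the dataset, so they should be read as implied values rather than as the contents of Table 1.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, f-measure) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


# (true positives, false positives, false negatives), inferred, not reported.
implied_counts = {
    "TIE": (32, 5, 6),
    "TS": (24, 8, 14),
    "PM": (36, 6, 2),
}

for method, (tp, fp, fn) in implied_counts.items():
    p, r, f = precision_recall_f(tp, fp, fn)
    print(f"{method}: precision={p:.2%} recall={r:.2%} f-measure={f:.2%}")
# TIE: precision=86.49% recall=84.21% f-measure=85.33%
# TS: precision=75.00% recall=63.16% f-measure=68.57%
# PM: precision=85.71% recall=94.74% f-measure=90.00%
```

Under these implied counts, the proposed method finds 36 of the 38 bot spammers, compared with 32 for time interval entropy and 24 for tweet similarity, at the cost of one more false positive than time interval entropy alone.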

Conclusion
In this paper, a novel approach to detect bot spammers using a combination of time interval entropy and tweet similarity has been proposed. A series of experiments has been conducted to evaluate the performance of the proposed method.
It can be inferred from the experimental results that time interval entropy as a behavioral feature is not sufficient to identify bot spammers. Even though entropy can capture the automation behavior of a Twitter account, it cannot by itself differentiate between bot spammer and legitimate user accounts.
Therefore, tweet similarity as a content-based feature is a good complement to it. Using both features improves the overall system performance.
Further research is needed to investigate the use of URLs and URL shortening in spammer detection. Since Twitter limits each tweet to no more than 140 characters, spammers may use shortened website URLs to lure legitimate users. In addition, several bot spammers are also found to exploit trending topics on Twitter to spread spam messages: they put trending topics into their tweets even though the tweets have no relation to those topics.

Jurnal Ilmu Komputer dan Informasi (Journal of Computer Science and Information), Volume 8, Issue 1, February 2015

Figure 1. Flow mechanism of proposed method.

Figure 2. Performance evaluation of proposed method in comparison with other methods.