SPAMMER DETECTION BASED ON ACCOUNT, TWEET, AND COMMUNITY ACTIVITY ON TWITTER

Spammers are users who abuse Twitter to spread spam. They imitate the behavior patterns of legitimate users to avoid being caught by spam detectors, create large numbers of fake accounts, and collaborate with each other to form communities. This collaboration makes spammer accounts difficult to detect. This research proposes feature extraction based on hashtags and community activities for detecting spammer accounts on Twitter. Hashtags are used by spammers to increase popularity. Community activities are used as features so that the activities of spammers within a community are given weight. The experimental results show that the proposed method achieves the best performance, with accuracy, recall, precision, and g-mean of 90.55%, 88.04%, 3.18%, and 16.74%, respectively. The accuracy and g-mean of the proposed method surpass those of the previous method by 4.23% and 14.43%. This shows that the proposed method can overcome the problem of detecting spammers on Twitter with better performance than the state of the art.


Introduction
Twitter is a rapidly developing online social media platform. Established in 2006, it has become the most popular microblogging platform, where users share news, media, memes, points of view, and updates in the form of tweets. A tweet is a post containing text and HTTP URLs, limited to 280 characters [1]. Unfortunately, the growth of social interaction on Twitter has attracted cybercriminals who exploit the trust relationships among users to distribute malicious content to a large number of victims in the network. The best-known type of spamming on Twitter exploits trending topics [2]. Whenever an event occurs, users express opinions or share information about it using a hashtag or the same keywords. If the topic is the most tweeted topic of the day, it is shown to all Twitter users on their home page as a trending topic. Spammers use the same hashtag to reach a large user base after a trending event, but attach unsolicited URLs that lead to unrelated websites. Because of Twitter's 280-character limit, spammers usually share URLs through URL shortener services.
Jurnal Ilmu Komputer dan Informasi (Journal of Computer Science and Information), volume 13, issue 2, June 2020
Spammers usually imitate the behavior patterns of legitimate users to avoid being caught by spam detection techniques, and they develop tools and techniques to evade existing detection methods. In addition, current research on spam detection either faces complexity obstacles or has weaknesses that spammers can circumvent. It is therefore necessary to detect and block spammers on social networks such as Twitter in order to save resources and human effort. This calls for stronger features that are harder to imitate, and for the use of user interactions inside and outside the community structure to build a spam classification model that is difficult for spammers to evade. Spammers create many fake accounts and collaborate with one another to form tight communities that increase their credibility. Therefore, spammer accounts tend to be socially connected with the highest classification coefficient [3].
Various methods have been used to detect spammers on Twitter. According to [4], bot spammers have five characteristics: their spam contains active links, their spam promotes a certain product, consecutive tweets are highly similar, the account is new, and the spam frequently uses hashtags. Aditya et al. [5] detected bot spammers by looking at posting time characteristics and tweet sentiment. Inuwa-Dutse et al. [6] detected spammers by optimizing a set of features derived from tweet history and user account information. Their analysis shows that spammers tend to be selective in following other users, which leads to connections among spammers. In addition, most spam accounts automatically post at least 12 tweets per day at well-defined periods.
Bhat and Abulaish [7] identified spammers on Facebook using community features. The community features used in that research are total out-degree, total reciprocity, total in/out ratio, community memberships, foreign out-degree, and foreign in/out ratio. They concluded that combining community and non-community features significantly improves spammer detection results. Sarlati et al. [8] adopted community features to detect spammers and used Principal Component Analysis to reduce the number of features. Chen et al. [9] found that coordination among spammers makes detection difficult. Bindu et al. [10] found that spammer communities work collectively to spread spam and evade spammer detection techniques on Twitter. Spammers collaborate and coordinate through the hashtag information in their tweets. Therefore, detecting spammers using hashtag and community activity features is expected to increase the success of detection.
This research proposes feature extraction based on hashtags and community activity for detecting spammer accounts on Twitter. Hashtags are used by spammers to increase popularity. Community activity is used as a feature for spammer detection so that spammer activity within a given community is given weight. The community activities considered include tweets that use hashtags, URLs, and others.

Related Work
Perdana et al. [11] detected spammers by considering the similarity between an account's tweets and the time intervals between them. Tweet similarity is considered because spammers post highly similar tweets, which disrupts the spread of information on Twitter. However, spammers are becoming smarter and now write tweets that differ from one another, stringing certain words together so that the tweets look legitimate. Time interval entropy is considered because spammers tend to post at nearly regular intervals. However, some spammers post without regulating the interval, so their tweets look natural.
Priyatno et al. [12] detected spammers using the Time Interval Entropy feature and Global Vectors for Word Representation (GloVe). Classification used a convolutional neural network. Tweets were used as input without removing hashtags, because spammers include hashtags in their tweets to achieve certain purposes. Time Interval Entropy was used because spammers post at regulated times, so the intervals do not vary widely. However, some spammers post spam only rarely, which complicates detection.
Aditya et al. [5] detected spammers using a sentiment analysis feature and time interval entropy (TIE). Sentiment analysis detects the expression or opinion contained in a tweet. It combined knowledge-based and machine learning-based methods to identify neutral tweets, that is, tweets without sentiment, which frequently appear among spam tweets. TIE was used to capture the regularity of posting, which indicates that tweets are posted automatically.
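As an illustration of how a time-interval-entropy feature can capture posting regularity, the sketch below computes the Shannon entropy of the gaps between consecutive posts. The cited works do not give their exact formula in this text, so the binning scheme, the `bin_seconds` parameter, and the function name are assumptions made for the example.

```python
import math
from collections import Counter


def time_interval_entropy(timestamps, bin_seconds=60):
    """Shannon entropy (in bits) of the binned gaps between consecutive posts.

    `timestamps` are posting times in seconds. `bin_seconds` is an
    illustrative choice, not a value taken from the cited papers.
    """
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        return 0.0
    # Group gaps into coarse bins so near-identical intervals count as equal.
    counts = Counter(int(gap // bin_seconds) for gap in gaps)
    total = len(gaps)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# A scheduled account posting every 10 minutes: every gap falls in one bin.
scheduled = [i * 600 for i in range(6)]
# An account whose gaps are 1, 7, 30, 2, and 95 minutes: every gap differs.
irregular = [0, 60, 480, 2280, 2400, 8100]

print(time_interval_entropy(scheduled))  # low entropy: regular posting
print(time_interval_entropy(irregular))  # higher entropy: irregular posting
```

Under this definition, a low entropy suggests automated, regularly timed posting, while a spammer who deliberately varies the interval would score closer to a legitimate user.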
Inuwa-Dutse et al. [6] detected spammers using the User Profile Feature (UPF), Account Information Feature (AIF), and Tweet Feature. UPF includes information about the user such as username, screen name, location, and user description. AIF consists of information such as the account creation time (account age) and the account verification flag (verified or not verified).
Chen et al. [9] describe three kinds of spammer behavior: coordinated spam, machine-based template spam, and passive spam. Coordinated spam behavior complicates the spammer detection process.

Proposed Method
The proposed method consists of several steps, namely: community detection on Twitter, feature extraction, feature selection using recursive feature elimination (RFE), and classification using multi-layer perceptron. These stages can be seen in Figure 1.

Community Detection on Twitter
The community detection process, shown in Figure 2, starts with data collection on Twitter from August 1st to September 10th, 2019. This process collects tweets from the timeline of the Corruption Eradication Commission (KPK) account, @kpk_ri. The tweets come not only from the KPK account itself but also from accounts that mention @kpk_ri. After the tweets for this time interval are collected, the next stage collects the usernames associated with the tweets. For each username, the accounts it follows (its following list) are then retrieved, as shown in Figure 3; an example of this retrieval is shown in Figure 4. The following list of each username is saved in CSV format with two headers, source and target. The per-username results are then merged into a single CSV list in which source holds the username whose following list was retrieved and target holds each account it follows. This single CSV list is called the edge list. Community detection aims to identify the communities that exist among the accounts. Once the following lists of all accounts have been obtained, communities are formed using the Louvain method [13].
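The merging step described above can be sketched with the Python standard library alone. The dictionary layout, the function names, and the usernames below are hypothetical; only the two CSV headers, source and target, come from the text.

```python
import csv
import io
from collections import defaultdict


def build_edge_list(following_by_user):
    """Flatten {username: [accounts it follows]} into (source, target) rows."""
    edges = []
    for source, targets in following_by_user.items():
        for target in targets:
            edges.append((source, target))
    return edges


def write_edge_list(edges, handle):
    """Write the edges as CSV with the two headers named in the text."""
    writer = csv.writer(handle)
    writer.writerow(["source", "target"])
    writer.writerows(edges)


def in_degree(edges):
    """Count how many collected accounts follow each target account."""
    counts = defaultdict(int)
    for _, target in edges:
        counts[target] += 1
    return dict(counts)


# Hypothetical following lists for three made-up usernames.
following = {
    "user_a": ["user_b", "user_c"],
    "user_b": ["user_c"],
    "user_c": ["user_a"],
}
edges = build_edge_list(following)
buffer = io.StringIO()  # stands in for the merged CSV file
write_edge_list(edges, buffer)
print(buffer.getvalue())
print(in_degree(edges))
```

An edge list in this form can be loaded into a graph library that implements the Louvain method, for instance the `louvain_communities` function in NetworkX. The text does not state which implementation the authors used, so that choice is an assumption.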

Feature Extraction
Feature extraction produces three large groups of features, namely account features, tweet features, and community features, which are listed in Table 1. Account features describe the account information and activity information of users [6], [14], [10]. Tweet features give information about tweet activities [15], [16], [6], [14], [10]. Community features [14], [7], [8] give information about the joint activities of Twitter users.
The hashtag-related features are defined as follows. Total unique community hashtags (JHUK) is the number of unique hashtags in one community (K). Total account hashtags (HA) is the number of hashtags (H) used by one account. Total community hashtags (JHK) is the number of hashtags used by all community members. Community hashtag ratio (RHK) is the account hashtag ratio of a community divided by the total number of community members (JAK). Account hashtag ratio (RHA) is the hashtag character length (kH) divided by the total number of tweet characters (JKT). Unique ratio of account hashtags (RUHA) is the number of unique hashtags divided by the total number of account hashtags. Unique ratio of community hashtags (RUHK) is the total of the unique ratios of account hashtags divided by the total number of community members.
The URL-related and graph-related community features are defined similarly. Total unique community URLs (JUUK) is the total number of unique account URLs divided by the total number of community members. Total community URLs (JUK) is the number of URLs in the community. Ratio of community URLs (RUK) is the total URL ratio in the community. Unique ratio of community URLs (RUUK) is the number of unique community URLs divided by the total number of community URLs. Total community eigenvector (JEK) is the total eigenvector value of the community divided by the total number of community members.
The spam-word features are as follows. Spam word ratio of an account (RKSA) is the total number of spam characters (kS) divided by the total number of tweet characters. Total spam words of an account (JKSA) is the number of spam words found in an account. Unique spam word ratio of an account (RUKSA) is the number of unique spam words divided by the total number of spam words of the account.
The next step is data cleaning on all features, which removes records with empty or incomplete feature values. After cleaning, the data are normalized so that every feature lies in the range 0.1 to 0.9. After normalization, recursive feature elimination is applied to the data to obtain the optimal features.
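To make some of these definitions concrete, the following sketch computes HA, RHA, and RUHA for one account, RUHK for a community, and the normalization to the range 0.1 to 0.9. It is an illustration under stated assumptions rather than the authors' implementation: the input layout (a list of text and hashtag pairs), the case-insensitive hashtag comparison, counting the `#` character in the hashtag length, and the handling of constant columns are all choices made for the example.

```python
def account_hashtag_features(tweets):
    """Return HA, RHA and RUHA for one account.

    `tweets` is a list of (text, hashtags) pairs; this layout is assumed.
    """
    hashtags = [tag.lower() for _, tags in tweets for tag in tags]
    total_chars = sum(len(text) for text, _ in tweets)   # JKT
    hashtag_chars = sum(len(tag) for tag in hashtags)    # kH
    ha = len(hashtags)
    rha = hashtag_chars / total_chars if total_chars else 0.0
    ruha = len(set(hashtags)) / ha if ha else 0.0
    return {"HA": ha, "RHA": rha, "RUHA": ruha}


def community_unique_hashtag_ratio(member_features):
    """RUHK: total of the members' RUHA divided by the member count (JAK)."""
    if not member_features:
        return 0.0
    return sum(f["RUHA"] for f in member_features) / len(member_features)


def scale_to_range(values, low=0.1, high=0.9):
    """Min-max scale one feature column into [low, high]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Constant column: mapping to the midpoint is an assumption.
        return [(low + high) / 2 for _ in values]
    span = high - low
    return [low + span * (v - v_min) / (v_max - v_min) for v in values]


# Hypothetical tweets for one made-up account.
tweets = [
    ("Stop #korupsi now #KPK", ["#korupsi", "#KPK"]),
    ("promo #korupsi", ["#korupsi"]),
]
features = account_hashtag_features(tweets)
print(features)
print(community_unique_hashtag_ratio([features, {"RUHA": 1.0}]))
print(scale_to_range([0, 5, 10]))
```

The other ratio features in Table 1 follow the same pattern of a count or character length divided by a total, so they could be computed with analogous helpers.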

Feature Selection using Recursive Feature Elimination (RFE)
Feature selection is conducted to obtain the optimal features. It uses support vector machine-recursive feature elimination (SVM-RFE), adopted from [17], [18]. SVM-RFE performs feature selection in a backward manner.
SVM-RFE starts by training a support vector machine to obtain the training weights. A ranking criterion is then computed from these weights for each dimension of the dataset. The feature with the smallest criterion is identified and used to update the feature set, after which the ranking of the remaining features is updated. The process continues by deleting the feature with the smallest criterion until the best features are obtained. The feature set is considered optimal when the change in value becomes insignificant.
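The backward elimination loop can be sketched in plain Python as follows. This is a minimal illustration, not the procedure of [17], [18]: the linear SVM is trained with a simple batch subgradient descent on the regularized hinge loss, the hyperparameters (`epochs`, `lr`, `reg`) are arbitrary, the squared weight is used as the ranking criterion, and the loop ranks every feature instead of stopping when the change becomes insignificant.

```python
def train_linear_svm(X, y, epochs=200, lr=0.05, reg=0.01):
    """Batch subgradient descent on the L2-regularized hinge loss.

    Labels must be +1 or -1. Returns (weights, bias).
    """
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [reg * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # this sample violates the margin
                for j, xj in enumerate(xi):
                    grad_w[j] -= yi * xj / len(X)
                grad_b -= yi / len(X)
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def svm_rfe(X, y, feature_names):
    """Return feature names ordered from most to least important."""
    remaining = list(range(len(feature_names)))
    eliminated = []
    while remaining:
        X_sub = [[row[j] for j in remaining] for row in X]
        w, _ = train_linear_svm(X_sub, y)
        # Ranking criterion: squared weight of each surviving feature.
        scores = [wj * wj for wj in w]
        worst = scores.index(min(scores))
        eliminated.append(remaining.pop(worst))
    # The feature eliminated last is the most important one.
    return [feature_names[j] for j in reversed(eliminated)]


# Toy data: the first column separates the classes, the others do not.
X = [[1.0, 0.2, 0.0], [0.8, -0.2, 0.0], [-1.0, 0.1, 0.0], [-0.8, -0.1, 0.0]]
y = [1, 1, -1, -1]
print(svm_rfe(X, y, ["informative", "noisy", "constant"]))
```

In practice a library implementation, such as recursive feature elimination wrapped around a linear SVM in scikit-learn, would typically replace the hand-written trainer; the sketch only shows the train, rank, and remove cycle.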
Figure 5 shows the feature selection results for a 70:30 ratio of training data to test data. The optimal account features are account age, length of screen name, username and screen name similarity, following ratio, and account activeness. The optimal tweet features are average tweet length, URL ratio, mention ratio, lexrichoutuu, unique URL ratio, total account hashtags, account hashtag ratio, unique ratio of account hashtags, and total spam words of an account. The optimal community features are total indegree, unique ratio of community hashtags, ratio of community URLs, and unique ratio of community URLs.
Figure 6 shows the feature selection results for an 80:20 ratio. The optimal account features are eigenvector, account age, length of screen name, username and screen name similarity, following ratio, and account activeness. The optimal tweet features are average tweet length, unique URL ratio, total account hashtags, account hashtag ratio, unique ratio of account hashtags, and total spam words of an account. The optimal community features are total community members, ratio of community URLs, unique ratio of community URLs, and total community eigenvector.
Figure 7 shows the feature selection results for a 90:10 ratio. The optimal account features are eigenvector, account age, length of screen name, username and screen name similarity, follower ratio, interestingness, account activeness, name ratio, and indegree. The optimal tweet features are average tweet length, unique URL ratio, total account hashtags, account hashtag ratio, unique ratio of account hashtags, and total spam words of an account. The optimal community features are total indegree, total community members, ratio of community URLs, and unique ratio of community URLs.
Overall, the features selected by recursive feature elimination are as follows. The optimal account features are eigenvector, account age, length of username, length of screen name, username and screen name similarity, following ratio, interestingness, account activeness, name ratio, and indegree. The optimal tweet features are average tweet length, URL ratio, mention ratio, lexrichoutuu, unique URL ratio, total account hashtags, account hashtag ratio, unique ratio of account hashtags, and total spam words of an account. The optimal community features are total indegree, total community members, ratio of community URLs, unique ratio of community URLs, unique ratio of community hashtags, and total community eigenvector. The list of optimal features can be seen in Table 2.

Spammer Detection
Spammer detection is performed using a multi-layer perceptron (MLP). We adopt the approach of Hans et al. [19], which uses an MLP as the classifier. The MLP process has three main stages: the forward pass, the backward pass, and the weight update. The number of inputs equals the number of features produced by the feature selection process. There are two hidden layers with 15 nodes each (15, 15). The learning rates used are 0.1, 0.01, and 0.001, the maximum number of epochs is 1000, and the minimum error threshold is 0.0001.
The MLP takes as input the features obtained from feature selection. The forward pass propagates the input through the hidden layers to the output layer, applying the binary sigmoid activation function. Backpropagation then computes the error as the difference between the output layer and the ground truth, and propagates it backward through all layers, starting from the output layer. Once the error of every layer is available, the weights are updated based on the error in each layer. This process repeats until a stopping condition is met, either the minimum error or the maximum number of iterations. When training is complete, the MLP yields a trained model, which is used for testing. The testing data come from splitting the main data into two parts, training data and testing data. Testing uses only the forward propagation phase.
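The three stages can be illustrated with a small pure-Python network that uses the stated configuration: two hidden layers of 15 nodes, sigmoid activations, a maximum of 1000 epochs, and an error threshold of 0.0001. Everything else is an assumption made for the sketch, including the uniform weight initialization, the per-sample weight updates, the half squared error, and the class and function names.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class MLP:
    def __init__(self, n_inputs, hidden=(15, 15), seed=0):
        rng = random.Random(seed)
        sizes = [n_inputs, *hidden, 1]
        # weights[l][j] holds the incoming weights of node j in layer l;
        # the last entry of each node is its bias weight.
        self.weights = [
            [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]
            for n_in, n_out in zip(sizes, sizes[1:])
        ]

    def forward(self, x):
        """Forward pass: return the activations of every layer."""
        activations = [list(x)]
        for layer in self.weights:
            prev = activations[-1] + [1.0]  # constant input for the bias
            activations.append(
                [sigmoid(sum(w * a for w, a in zip(node, prev))) for node in layer]
            )
        return activations

    def predict(self, x):
        return self.forward(x)[-1][0]

    def train(self, X, y, lr=0.1, max_epochs=1000, min_error=1e-4):
        mean_err = float("inf")
        for _ in range(max_epochs):
            total_error = 0.0
            for x, target in zip(X, y):
                acts = self.forward(x)
                out = acts[-1][0]
                total_error += 0.5 * (target - out) ** 2
                # Backward pass: error term of the output node first.
                deltas = [(out - target) * out * (1.0 - out)]
                for l in range(len(self.weights) - 1, -1, -1):
                    prev = acts[l] + [1.0]
                    # Propagate the error before this layer's weights change.
                    back = [
                        sum(d * node[i] for d, node in zip(deltas, self.weights[l]))
                        for i in range(len(acts[l]))
                    ]
                    # Weight update for layer l.
                    for node, d in zip(self.weights[l], deltas):
                        for i, a in enumerate(prev):
                            node[i] -= lr * d * a
                    deltas = [b * a * (1.0 - a) for b, a in zip(back, acts[l])]
            mean_err = total_error / len(X)
            if mean_err < min_error:  # stop at the minimum error threshold
                break
        return mean_err


def mean_error(model, X, y):
    """Mean half squared error of the model on a dataset (forward pass only)."""
    return sum(0.5 * (t - model.predict(x)) ** 2 for x, t in zip(X, y)) / len(X)
```

A model for the account features would be created with `MLP(n_inputs=k)`, where `k` is the number of selected account features, trained with `train`, and then evaluated on the test split with `predict`, which runs only the forward pass. Separate models for the tweet and community features would be built the same way.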
The merging process adds up the outputs of the multi-layer perceptrons after multiplying each by its weight: the account feature weight (α) times the MLP output for the account features (A), the tweet feature weight (β) times the MLP output for the tweet features (B), and the community feature weight (γ) times the MLP output for the community features (C). The sum of δ and γ is one, where δ is the sum of α and β. The merged score is then classified using a threshold: an account is considered a spammer if its score is smaller than the threshold and legitimate if its score is larger than the threshold.
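The merging rule amounts to a weighted sum followed by a threshold test, which the sketch below writes out. The weight values, the threshold of 0.5, and the function names are hypothetical, since the text does not report the values used; the direction of the threshold test follows the description above.

```python
def fuse_scores(account, tweet, community, alpha, beta, gamma):
    """Weighted sum alpha*A + beta*B + gamma*C of the three MLP outputs."""
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must equal 1")
    return alpha * account + beta * tweet + gamma * community


def classify(score, threshold=0.5):
    """Scores below the threshold are labeled spammer, as stated in the text."""
    return "spammer" if score < threshold else "legitimate"


# Hypothetical MLP outputs and weights for one account.
score = fuse_scores(account=0.2, tweet=0.4, community=0.1,
                    alpha=0.3, beta=0.5, gamma=0.2)
print(score, classify(score))
```

Keeping the weights as explicit parameters makes it straightforward to search over α, β, and γ on a validation set while holding their sum at one.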

Experiment and Analysis
This research uses Twitter data collected from the account of the Corruption Eradication Commission (KPK), @kpk_ri, with tweets about "corruption" as the target topic. Data were collected on Twitter from August 1st to September 10th, 2019. Data collection did not use the official Twitter API but the Python library GetOldTweets3, because the official API only returns tweets from the last 7 days. A total of 22,281 tweets were obtained; an example tweet is shown in Figure 8. After the tweets about corruption were collected from the KPK account, the next step was to extract the usernames involved in these tweets, which yielded 10,961 usernames. The following list of each username was then retrieved using the Python library twint; an example of this process is shown in Figure 4. The following lists contain 4,995,357 usernames in total, of which 1,392,841 are unique. For the unique usernames, tweets and account information were retrieved and communities were detected. The account attributes retrieved are name, username, bio, join date, total tweets, total following, total followers, and verified status. The tweet attributes retrieved are username, date, time, tweet, mentions, URLs, hashtags, and retweet. Only Indonesian-language tweets are used, so accounts that tweet in other languages were removed. This leaves 575,851 accounts, of which 2,312 are spammer accounts and 573,539 are legitimate accounts.
The proposed strategy is evaluated using accuracy, recall, precision, and g-mean [20], all computed from the confusion matrix shown in Table 3. Accuracy measures the success in detecting spammers (true positives) and legitimate accounts (true negatives) over all data, and is calculated using Equation 3. Recall measures the success in detecting spammers (true positives) among all actual spammers (actual positives), and is calculated using Equation 4. Precision measures how many of the accounts flagged as spammers are actually spammers, and is calculated using Equation 5. G-mean [21] measures the relative balance of classification performance on the positive and negative classes; it uses recall and precision and is calculated using Equation 6.
Table 4 shows the evaluation results. With the 70:30 split, the best accuracy, recall, precision, and g-mean are 90.55%, 91.21%, 3.14%, and 16.74%, respectively. All of these best results are obtained by the proposed strategy, which shows that it improves spammer detection in terms of accuracy, recall, precision, and g-mean. With the 80:20 split, the accuracy, recall, precision, and g-mean are 89.35%, 88.96%, 3.14%, and 16.37%, respectively. The proposed method succeeds in detecting spammer and legitimate accounts according to all four measures. With the 90:10 split, the proposed method obtains accuracy, recall, precision, and g-mean of 89.24%, 88.74%, 3.08%, and 16.17%, respectively. The best recall at the 90:10 split is obtained by the account features, which shows that account features can also detect spammers among all spammer data. However, the account features alone perform worse in g-mean, precision, and accuracy. For detecting both spammers and legitimate accounts, the proposed method is the best in terms of accuracy and g-mean, and this also holds for precision and recall.
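The four measures can be computed from the confusion matrix as sketched below. Equations 3 to 6 are not reproduced in this text, so the formulas are the standard definitions, and g-mean is taken as the geometric mean of recall and precision because the text states that it uses those two quantities. That reading is consistent with the reported tweet-feature results (recall 86.31%, precision 2.82%, g-mean 15.61%). The function name and the example counts are hypothetical.

```python
import math


def evaluate(tp, fp, fn, tn):
    """Accuracy, recall, precision and g-mean from confusion-matrix counts.

    Spammers are the positive class. G-mean is assumed here to be
    sqrt(recall * precision).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    g_mean = math.sqrt(recall * precision)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "g_mean": g_mean,
    }


# Hypothetical counts for a heavily imbalanced test set.
print(evaluate(tp=88, fp=2700, fn=12, tn=27200))

# Consistency check against the reported tweet-feature recall and precision.
print(math.sqrt(0.8631 * 0.0282))
```

With so few spammers relative to legitimate accounts, even a small false positive rate yields many more false positives than true positives, which is why precision, and with it g-mean, stays low while accuracy and recall are high.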
The experimental results show that the proposed method obtains the best performance, with accuracy, recall, precision, and g-mean of 90.55%, 88.04%, 3.18%, and 16.74%, respectively. The accuracy and g-mean of the proposed method exceed those of the previous method by 4.23% and 14.43%. This shows that the proposed method can overcome the spammer detection problem on Twitter with better performance.
The best g-mean obtained by the account features in spammer detection is 9.90%, with accuracy, recall, and precision of 69.89%, 86.15%, and 1.14%. The features used are account age, length of screen name, username and screen name similarity, following ratio, account activeness, eigenvector, follower ratio, interestingness, name ratio, and indegree. These features are drawn from all of the data splits. The account features common to the data splits are account age, length of screen name, username and screen name similarity, following ratio, account activeness, and eigenvector. This shows that the selected account features are appropriate features to use.
The tweet features detect spammers with a g-mean of 15.61%, and with accuracy, recall, and precision of 88.01%, 86.31%, and 2.82%. The features used are tweet length, URL ratio, mention ratio, lexrichoutuu, unique URL ratio, total account hashtags, account hashtag ratio, unique ratio of account hashtags, and total spam words of an account. Average tweet length, unique URL ratio, total account hashtags, account hashtag ratio, unique ratio of account hashtags, and total spam words of an account appear in every data split. Three of the tweet features are related to hashtags: total account hashtags, account hashtag ratio, and unique ratio of account hashtags. This shows that hashtag-based features have an effect in detecting spammers.
The community features detect spammers with a g-mean of 5.71%, and with accuracy, recall, and precision of 43.40%, 67.87%, and 0.48%. The optimal features used are indegree, unique ratio of community hashtags, ratio of community URLs, and unique ratio of community URLs. The community features selected in all data splits are ratio of community URLs and unique ratio of community URLs. The optimal features in the other data splits are total community members and total community eigenvector. One of the optimal community features is hashtag-based, which further supports the finding that hashtags have an effect in spammer detection. Therefore, feature extraction based on hashtags and community activity, combined with this detection strategy, can increase the success and accuracy of spammer account detection on Twitter.

Conclusion
This research proposes feature extraction based on hashtags and community activities for detecting spammer accounts on Twitter. Hashtags are used by spammers to increase their popularity. Community activity is used as a feature for spammer detection so that spammer activity within a given community is given weight. The experimental results show that the proposed method achieves the best performance, with accuracy, recall, precision, and g-mean of 90.55%, 88.04%, 3.18%, and 16.74%, respectively. The accuracy and g-mean of the proposed method surpass those of the previous method by 4.23% and 14.43%. This shows that the proposed method can overcome the problem of detecting spammers on Twitter with better performance than the state of the art.