Coverage, Diversity, and Coherence Optimization for Multi-document Summarization

A good summarization of multiple documents on similar topics can help users obtain useful information. A good summary must have extensive coverage, minimum redundancy (high diversity), and smooth connections among its sentences (high coherence). Therefore, a multi-document summarization method that considers the coverage, diversity, and coherence of the summary is needed. In this paper we propose a novel multi-document summarization method that optimizes the coverage, diversity, and coherence of the summary's sentences simultaneously. It integrates the self-adaptive differential evolution (SaDE) algorithm to solve the optimization problem. A sentence ordering algorithm based on the topical closeness approach is performed within the SaDE iterations to improve coherence among the summary's sentences. Experiments have been performed on the Text Analysis Conference (TAC) 2008 data sets. The experimental results show that the proposed method generates summaries whose average coherence is 29-41.2 times higher and whose ROUGE scores are 46.97-64.71% better than those of a method that considers only coverage and diversity.


Introduction
The contents of a document can be long. A document presents several pieces of information on a specified topic. Current technological developments make it easier than before for people to find related documents on a similar topic, and these other documents can also have long contents. This means there is a massive quantity of available data or information on similar topics.
The massive quantity of data available on the Internet today has reached such a huge volume that it has become humanly unfeasible to obtain useful information from the Internet efficiently [1]. Thus, automatic methods are needed to extract useful information from documents efficiently.
Jurnal Ilmu Komputer dan Informasi (Journal of Computer Science and Information), Volume 8, Issue 1, February 2015

Document summarization is one method of processing information automatically. It creates a compressed version of documents that provides useful information covering all relevant information in the original documents. Document summarization can be classified based on the number of documents processed simultaneously, i.e., single-document and multi-document summarization. Single-document summarization processes only one document into a summary, whereas multi-document summarization processes more than one document with a similar topic into a summary.
Various kinds of algorithms have been proposed for the multi-document summarization problem. These algorithms include ontology-based, clustering, and heuristic approaches. An example of a document summarization method that uses an ontology-based approach is the method proposed in [2]. It performs multi-document summarization by utilizing the Yago ontology to capture the intent and context of sentences in documents. It can choose the exact meaning of sentences containing ambiguous words based on Yago ontology scores.
Multi-document summarization methods based on a clustering approach have also been proposed, for example, the method proposed in [3]. It generates a summary from a set of sentences that have been clustered based on inter-sentence similarity. Another multi-document summarization method, proposed in [1], also contains a clustering stage.
In contrast, multi-document summarization methods based on a heuristic approach utilize an optimization algorithm to select the summary's sentences properly. One method that uses this approach is Optimization of Coverage and Diversity for Summarization using Self-adaptive Differential Evolution (OCDsum-SaDE), proposed in [4]. In that method, an optimal summary is searched for by considering the coverage and diversity of the summary's sentences.
Multi-document summarization cannot be separated from the sentence ordering process. This process is needed to obtain a composition of the summary's sentences that allows users to get information easily. Several methods for ordering a summary's sentences have been proposed in [5][6][7]. They consider a variety of approaches, i.e., chronological, probabilistic, topical closeness, precedence, succession, semantic, and text entailment approaches. The ordering process is generally carried out after the document summarization process completes; thus, the results of sentence ordering depend on the summary.
A good summary is expected to meet three factors: 1) extensive coverage; 2) high diversity or minimum redundancy; 3) high coherence among the summary's sentences [4]. A summary with extensive coverage has summarized all information from the original documents. Summary sentences with high diversity or minimum redundancy indicate that the summary is able to present information without being convoluted. On the other hand, smooth connectivity between the summary's sentences may help users understand and absorb information from the summary easily.
The process of obtaining the best summary can be considered an optimization problem [8]. Therefore, the process of generating a summary with a high level of coverage, diversity, and coherence among its sentences can also be considered an optimization problem. Thus, a multi-document summarization method that optimizes those factors simultaneously needs to be studied in order to generate a good summary.
In this paper, we propose a novel method for multi-document summarization that considers the coverage, diversity, and coherence of the summary. This method is inspired by the self-adaptive differential evolution (SaDE) algorithm from [4] and the sentence ordering algorithm using the topical closeness approach in [6]. The SaDE algorithm is used to solve the coverage, diversity, and coherence optimization problem, whereas the topical closeness approach integrated into the SaDE iterations helps find a summary solution with optimal coherence. Thus, this method can generate summaries with extensive coverage, minimum redundancy, and high coherence among the summary's sentences.

Summary's Quality Factors
In this section, we describe the three factors of summary quality (i.e., coverage, diversity, and coherence) that are optimized in our proposed method.

Coverage
Let $N$ denote the number of sentences from the documents to be summarized, $M$ the number of distinct terms in the documents, $s_n$ the $n$-th sentence, which has normalized form $\tilde{s}_n$, $t_m$ the $m$-th distinct term, $tf_{mn}$ the number of occurrences of $t_m$ in $\tilde{s}_n$, $isf_m$ the inverse sentence frequency of $t_m$, and $n_m$ the number of sentences containing $t_m$. The weight $w_{mn}$ of term $t_m$ in $\tilde{s}_n$ can be calculated using the term frequency-inverse sentence frequency (TF-ISF) scheme in equations (1) and (2):

$$w_{mn} = tf_{mn} \cdot isf_m \quad (1)$$

$$isf_m = \log\frac{N}{n_m} \quad (2)$$

Each $\tilde{s}_n$ is represented as a vector with $M$ components such that $\tilde{s}_n = [w_{1n}, \ldots, w_{Mn}]$. The similarity between two sentences can be calculated using the cosine measure in equation (3):

$$sim(\tilde{s}_i, \tilde{s}_j) = \frac{\sum_{m=1}^{M} w_{mi}\, w_{mj}}{\sqrt{\sum_{m=1}^{M} w_{mi}^2}\,\sqrt{\sum_{m=1}^{M} w_{mj}^2}} \quad (3)$$

The summary's coverage value reflects the coverage of the summary's contents with respect to the contents of the original documents. It can be calculated by considering the similarity between the main content of the original documents and the main content of a candidate summary [4]. Radev et al. [9] describe that the main content of a document set is reflected by its centroid, i.e., the mean of its term weights.
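The TF-ISF weighting and cosine measure above can be sketched as follows (a minimal illustration; the function names are ours, and a natural logarithm is assumed):

```python
import math

def tfisf_weight(tf_mn, N, n_m):
    """Weight of term t_m in sentence s_n, equations (1)-(2).
    tf_mn: occurrences of t_m in s_n; N: total sentences;
    n_m: number of sentences containing t_m."""
    isf_m = math.log(N / n_m)
    return tf_mn * isf_m

def cosine_sim(a, b):
    """Cosine similarity between two M-component weight vectors, equation (3)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)
```

Note that a term occurring in every sentence gets weight zero, since $\log(N/N) = 0$.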
The centroids of the original documents and of a candidate summary are represented as vectors with $M$ components. Let $S_p(t)$ denote the set of sentences in the $p$-th candidate summary in the $t$-th generation and $N_p(t)$ the number of sentences in $S_p(t)$. Each component $c_m$ of the original documents' centroid $c$ and each component $c_{m,p}(t)$ of the $p$-th candidate summary's centroid in the current generation $c_p(t)$ can be calculated using equations (4) and (5), respectively:

$$c_m = \frac{1}{N} \sum_{n=1}^{N} w_{mn} \quad (4)$$

$$c_{m,p}(t) = \frac{1}{N_p(t)} \sum_{\tilde{s}_n \in S_p(t)} w_{mn} \quad (5)$$

Alguliev et al. [4] also describe that by considering the similarity between the main content of the original documents and the main content of the summary, we learn the importance of the summary with respect to the original documents. Moreover, by considering the similarity between the main content of the original documents and each summary sentence, we learn the importance of each summary sentence with respect to the original documents. A greater similarity between the main content of the original documents and a summary sentence reflects a greater importance of that sentence. Therefore, a greater coverage value reflects a better summary. The formulation to calculate the summary's coverage value $f_{cov}$ is shown in equation (6). In equation (6), $B_p(t)$ denotes the binary form of the solution vector for the $p$-th candidate summary in the $t$-th generation and $b_{p,n}(t)$ denotes the $n$-th component of $B_p(t)$. The process of generating this vector is described in the next section.
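Since equation (6) is not reproduced here, the sketch below follows the description literally: the coverage of a candidate summary is taken as the similarity between the documents' centroid and the summary's centroid, scaled by the summed similarity of the centroid to each selected sentence. This exact combination is our assumption, not the paper's verbatim formula.

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Component-wise mean of sentence weight vectors, as in equations (4)-(5)."""
    return [sum(v[m] for v in vectors) / len(vectors)
            for m in range(len(vectors[0]))]

def coverage(sentence_vectors, selected):
    """Hedged reading of equation (6); selected is the binary vector B_p(t).
    Assumes at least one sentence is selected."""
    c = centroid(sentence_vectors)  # documents' centroid
    c_p = centroid([v for v, b in zip(sentence_vectors, selected) if b])
    return cosine_sim(c, c_p) * sum(
        b * cosine_sim(c, v) for v, b in zip(sentence_vectors, selected))
```

As expected, a candidate covering the dominant content scores higher than one built from an outlier sentence.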

Diversity
The summary's diversity value reflects the diversity of the summary's sentences. It can be assessed by calculating the similarity between each pair of summary sentences. If the summary has a high total value of sentence similarity, then it has low diversity; otherwise, if it has a low total value of sentence similarity, then it has high diversity among its sentences [4].
A summary with low diversity among its sentences tends to be poor because its sentences tend to discuss redundant information. Therefore, in order to get a good summary, a combination of summary sentences with high diversity has to be found. In other words, a combination of summary sentences with a low total value of sentence similarity has to be found, because it can present the information with minimum redundancy.
In this paper, the summary's diversity value is defined as the total value of its sentence similarities. The diversity value is therefore inversely related to the diversity of the sentences: the lower the diversity value, the more diverse the sentences and the better the summary.
The formulation to calculate the summary's diversity value $f_{div}$ is shown in equation (7). Equation (7) only sums the similarities between summary sentences and ignores sentences not in the summary [4]:

$$f_{div}(B_p(t)) = \sum_{n=1}^{N-1} \sum_{q=n+1}^{N} b_{p,n}(t)\, b_{p,q}(t)\, sim(\tilde{s}_n, \tilde{s}_q) \quad (7)$$
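Equation (7) can be sketched directly from this description; the binary selection variables zero out every pair that involves an unselected sentence:

```python
def diversity_value(sim_matrix, selected):
    """Equation (7): sum of pairwise similarities over selected sentences only.
    A lower value means a more diverse (less redundant) summary.
    sim_matrix: N x N sentence similarity matrix; selected: binary list."""
    N = len(selected)
    total = 0.0
    for n in range(N - 1):
        for q in range(n + 1, N):
            total += selected[n] * selected[q] * sim_matrix[n][q]
    return total
```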

Coherence
The summary's coherence value reflects the degree of coherence of the summary's sentences. It corresponds to the smooth connectivity between summary sentences, and thus to the readability of the information in the summary. A summary with a higher degree of coherence is expected to make it easier for readers to understand the information it presents.
Generally, a summary simplifies readers' understanding of the information if its sentences are ordered such that two adjacent sentences discuss similar content or topics. This is the same principle as the topical closeness approach presented in [6]. The closeness between sentence topics can be assessed using the similarity value between the sentences: a greater similarity between adjacent sentences reflects that they have similar contents or topics.
Based on this description, we can conclude that a good summary has a high degree of coherence between its adjacent sentences. However, a good summary also has to present information about the original documents' contents in a simple form (i.e., with a small number of sentences). Therefore, the summary's coherence value $f_{coh}$ in this paper is formulated as the mean similarity between adjacent summary sentences, as shown in equation (8):

$$f_{coh}(O_p(t)) = \frac{1}{N_p(t)-1} \sum_{n=1}^{N_p(t)-1} sim(o_{p,n}(t),\, o_{p,n+1}(t)) \quad (8)$$

In equation (8), $O_p(t)$ denotes the ordered form of the solution vector for the $p$-th candidate summary in the $t$-th generation and $o_{p,n}(t)$ denotes the $n$-th component of $O_p(t)$. The process of generating this vector is described in the next section.
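The mean-of-adjacent-similarities reading of equation (8) can be sketched as follows (the similarity function is passed in as a callable; names are ours):

```python
def coherence_value(ordered_sentences, sim):
    """Equation (8): mean similarity between adjacent sentences in the
    ordered summary. sim(a, b) returns the similarity of two sentences."""
    pairs = len(ordered_sentences) - 1
    if pairs < 1:
        return 0.0  # a one-sentence summary has no adjacent pairs
    return sum(sim(ordered_sentences[n], ordered_sentences[n + 1])
               for n in range(pairs)) / pairs
```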
In order to improve coherence among the summary's sentences, a sentence ordering process is performed. In this paper we propose two types of sentence ordering algorithm, as described in Algorithms 1 and 2. The proposed algorithms are inspired by the topical closeness approach presented in [6]. The first type (Type A) is an algorithm that maximizes the similarity between adjacent sentences, whereas the second type (Type B) emphasizes that the two sentences with the most similar topics should be at the beginning of the summary.
Example. Let S1, S2, S3, S4, and S5 be five summary sentences to be ordered by sentence ordering algorithms Type A and Type B. Assume that they have the similarities shown in Figure 1. Their ordering processes using algorithms Type A and Type B are shown in Table 1.
In both algorithms, the pair of sentences with the highest similarity is chosen as the start of the ordering result. Therefore the pair S3 and S5, which has the highest similarity (0.9), is chosen in the first iteration of each algorithm. In algorithm Type A, after the initial sentences are chosen, each sentence is labeled head or tail; thus in this iteration S3 and S5 are labeled head and tail, respectively.
On the second iteration, S1 is chosen to pair

Summary's Quality Factors Optimization
The coverage, diversity, and coherence optimization process in our proposed method consists of a preprocessing phase and a main process phase. The main process implements the self-adaptive differential evolution (SaDE) algorithm inspired by [4], with the addition of a sentence ordering step. For convenience, we denote our proposed method the CoDiCo method, which stands for the three factors to be optimized, i.e., coverage, diversity, and coherence. Figure 2 depicts the flowchart of the CoDiCo method.

Preprocessing Phase
The preprocessing phase prepares the data used in the main process. It consists of the following steps: 1) sentence extraction; 2) sentence normalization; 3) distinct term extraction; 4) term weight matrix preparation; 5) sentence similarity matrix preparation.
Sentence extraction takes each sentence from documents with the same topic in the dataset. The process produces $N$ sentences. Each extracted sentence $s_n$ is represented as a single line of data in the sentence list $D$ such that $D = [s_1, \ldots, s_N]$. After the extraction process, each sentence $s_n$ is normalized into $\tilde{s}_n$ using stop-word removal, punctuation removal, and stemming. We use the 571 stop-words from the Journal of Machine Learning Research stop-word list for the stop-word removal process. For the stemming process, we use the Porter Stemmer algorithm.
In the next step we perform distinct term extraction from each $\tilde{s}_n$. This process produces $M$ distinct terms. Each extracted term $t_m$ is stored in the term list $T$ such that $T = [t_1, \ldots, t_M]$.
Based on the $N$ normalized sentences and $M$ distinct terms, we generate a term weight matrix $W$ with dimensions $M \times N$. Each component of $W$ stores the weight $w_{mn}$ of term $t_m$ in normalized sentence $\tilde{s}_n$. The weights are calculated using the TF-ISF scheme in equations (1) and (2). Each term weight is then used to calculate the sentence similarities. The similarity between $\tilde{s}_i$ and $\tilde{s}_j$ for $i, j = 1, \ldots, N$ can be calculated using the cosine measure in equation (3). This process produces a sentence similarity matrix with dimensions $N \times N$.
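The preprocessing outputs can be sketched end-to-end as follows. This is an illustration only: it assumes the sentences are already normalized and tokenizes by whitespace, whereas the actual pipeline applies the JMLR stop-word list and the Porter stemmer first.

```python
import math

def build_matrices(sentences):
    """Build the distinct term list T, the M x N TF-ISF weight matrix W
    (equations (1)-(2)), and the N x N cosine similarity matrix
    (equation (3)) from already-normalized sentences."""
    N = len(sentences)
    tokens = [s.split() for s in sentences]
    terms = sorted({t for toks in tokens for t in toks})  # term list T
    # n_m: number of sentences containing term t_m
    n_m = {t: sum(1 for toks in tokens if t in toks) for t in terms}
    # W[m][n] = tf_mn * isf_m
    W = [[toks.count(t) * math.log(N / n_m[t]) for toks in tokens]
         for t in terms]
    # column n of W is the weight vector of sentence n
    cols = [[W[m][n] for m in range(len(terms))] for n in range(N)]

    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (na * nb) if na and nb else 0.0

    S = [[cos(cols[i], cols[j]) for j in range(N)] for i in range(N)]
    return terms, W, S
```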

Main Process Phase
As shown in Figure 2, the main process phase of the CoDiCo method consists of initialization, binarization, ordering, evaluation, mutation, crossover, stopping criterion, and output steps. The binarization and ordering steps are each applied in two places: to target vectors (solution vectors generated by the initialization and selection steps) and to trial vectors (solution vectors generated by the crossover step). A brief description of each step is given in the following subsections.

Initialization
Initialization provides a set of solutions $U$ that is used to find the optimal summarization solution. Let $P$ and $t$ denote the number of generated solutions and the current generation, respectively, such that $U(t) = [U_1(t), \ldots, U_P(t)]$ for $t = 0$. Each solution in $U$ is referred to as a target vector. Each target vector $U_p(t)$ for $p = 1, \ldots, P$ is represented as a vector with $N$ components such that $U_p(t) = [u_{p,1}(t), \ldots, u_{p,N}(t)]$, where $u_{p,n}(t)$ denotes the $n$-th component of the $p$-th target vector.
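Initialization can be sketched as below; the bounds u_min and u_max are placeholders, since their actual values are specified elsewhere and not reproduced in this section:

```python
import random

def initialize_population(P, N, u_min=-5.0, u_max=5.0, seed=None):
    """Generate P target vectors of N real-valued components, each drawn
    uniformly from [u_min, u_max]. The bound values are illustrative."""
    rng = random.Random(seed)
    return [[u_min + rng.random() * (u_max - u_min) for _ in range(N)]
            for _ in range(P)]
```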

Binarization
Binarization encodes the real value of $u_{p,n}(t)$ into a binary value. The binary values indicate which sentences from $D$ are used as sentences of the $p$-th candidate summary $S_p(t)$. If $b_{p,n}(t) = 1$, the sentence $s_n$ in $D$ is selected as a sentence of $S_p(t)$; otherwise, if $b_{p,n}(t) = 0$, the sentence $s_n$ is not a sentence of $S_p(t)$.
Alguliev et al. [4] describe that the encoding of the real value $u_{p,n}(t)$ into the binary value $b_{p,n}(t)$ can be performed by comparing a random value $r_{p,n}$ with the sigmoid value of $u_{p,n}(t)$. The formulation of this process is shown in equations (10) and (11):

$$sig(u_{p,n}(t)) = \frac{1}{1 + e^{-u_{p,n}(t)}} \quad (10)$$

$$b_{p,n}(t) = \begin{cases} 1, & r_{p,n} < sig(u_{p,n}(t)) \\ 0, & \text{otherwise} \end{cases} \quad (11)$$

The $r_{p,n}$ in this step has the same value as the one used in the initialization step.
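Equations (10)-(11) can be sketched as follows; reusing the r values drawn at initialization, as the text states, is modeled by passing them in:

```python
import math
import random

def binarize(u, r=None, rng=random):
    """Encode a real-valued solution vector into the binary vector B_p(t):
    component n becomes 1 when r_n falls below sigmoid(u_n), else 0."""
    if r is None:
        r = [rng.random() for _ in u]
    return [1 if r_n < 1.0 / (1.0 + math.exp(-u_n)) else 0
            for u_n, r_n in zip(u, r)]
```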

Ordering
In this step, the $N_p(t)$ sentences of each $S_p(t)$ derived from the solution $U_p(t)$ are ordered using the sentence ordering algorithm described in Subsection 2.3. The ordered form of $U_p(t)$ is denoted $O_p(t)$.

Solutions Evaluation
The evaluation step calculates a fitness value for each summarization solution. Evaluations are performed for each $U_p(t)$ that has been encoded into binary form ($B_p(t)$) and ordered form ($O_p(t)$). In line with our purpose in this paper, the fitness value $F(U_p(t))$ is calculated by considering the three factors of summary quality, i.e., the coverage, diversity, and coherence values. The formulation is shown in equation (12).
The best and worst solutions in the current generation are determined using each solution's fitness value. The best solution in the current generation (local best) $U_{best}(t)$ is the target vector with the highest fitness value, and the worst solution (local worst) $U_{worst}(t)$ is the target vector with the lowest fitness value. In this step we can also update the global best $U_{gbest}(t)$, i.e., the best solution up to the current generation, using the rule formulated in equation (13).
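The local best/worst selection and the global-best update rule of equation (13), as described, can be sketched with the fitness function passed in as a callable (its exact formula is given by equation (12) and is not assumed here):

```python
def select_best_worst(population, fitness):
    """Pick the local best (highest fitness) and local worst (lowest fitness)
    solutions from the current generation."""
    best = max(population, key=fitness)
    worst = min(population, key=fitness)
    return best, worst

def update_global_best(global_best, local_best, fitness):
    """Equation (13), as described: keep whichever of the previous global
    best and the current local best has the higher fitness."""
    if global_best is None or fitness(local_best) > fitness(global_best):
        return local_best
    return global_best
```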

Mutation
Mutation generates a set of mutant vectors $V$ from the set of target vectors $U$. The mutation of $U_p(t)$ involves the $U_{best}(t)$ vector, the $U_{worst}(t)$ vector, a randomly selected vector $U_{r1}(t)$ where $r1 = 1, \ldots, P$ and $r1 \neq p$, and a mutation factor for the current generation $F(t)$. The formulation to generate the $p$-th mutant vector in the current generation $V_p(t)$ is shown in equation (14), and the formulation to calculate the $F(t)$ value is shown in equation (16). In equation (16), $t_{max}$ denotes the maximum generation, which is specified in the initialization step [4].
One or more components $v_{p,n}(t)$ of $V_p(t)$ may violate the boundary constraints, i.e., their values can be less than $u_{min}$ or greater than $u_{max}$. Each $v_{p,n}(t)$ whose value violates the boundary constraints has to be reflected back. The rule to reflect back the $v_{p,n}(t)$ value is formulated in equation (15).
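One common reading of a reflect-back rule is to mirror an out-of-bounds component back inside the violated bound. Equation (15) itself is not reproduced in the text, so treat this as a sketch of that common rule, not the paper's exact formulation:

```python
def reflect_back(v, u_min, u_max):
    """Mirror each component that crosses a bound back inside it:
    x < u_min  ->  2*u_min - x;   x > u_max  ->  2*u_max - x."""
    out = []
    for x in v:
        if x < u_min:
            x = 2 * u_min - x
        elif x > u_max:
            x = 2 * u_max - x
        out.append(x)
    return out
```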

Crossover
Crossover generates a set of trial vectors $Z$. Each trial vector $Z_p(t)$ has $N$ components $z_{p,n}(t)$, whose values are derived from the values of $v_{p,n}(t)$ or $u_{p,n}(t)$ [4]. The purpose of this operation is to increase the diversity of the solution vectors in order to expand the search space.
Alguliev et al. [4] describe that to generate the $Z_p(t)$ vector, the relative distance $d_p(t)$ between the $V_p(t)$ vector and the $U_p(t)$ vector has to be calculated first. The $d_p(t)$ is then used to calculate the crossover rate $CR_p(t)$. Equations (17)-(19) show the formulations to calculate $d_p(t)$ and $CR_p(t)$.
The rule to determine the trial vector component $z_{p,n}(t)$ value is formulated in equation (20). In equation (20), $k$ is a randomly selected integer value for $n = 1, \ldots, N$. It ensures that at least one component of the trial vector is obtained from the mutant vector, which in turn ensures that the solutions in the next generation differ from the solutions in the current generation [4].
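The crossover rule of equation (20), as described (a mutant component with probability $CR_p(t)$, the target component otherwise, with a random index k always taken from the mutant), can be sketched as:

```python
import random

def crossover(mutant, target, CR, rng=None):
    """Binomial crossover: each trial component comes from the mutant with
    probability CR, otherwise from the target; index k is always taken from
    the mutant so the trial vector differs from its target."""
    rng = rng or random.Random()
    N = len(target)
    k = rng.randrange(N)
    return [mutant[n] if (n == k or rng.random() < CR) else target[n]
            for n in range(N)]
```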

Stopping Criterion
In this step, it is determined whether the iteration of the optimal solution search process should be stopped. The stopping criterion in this paper is a specified number of generations. If the iteration has reached the maximum generation $t_{max}$, the iteration is stopped; otherwise, it is continued.

Output
This is the final step of the main process of the CoDiCo method. In this step, the global best solution of summarization in the last generation $U_{gbest}(t_{max})$ has been acquired. Its binary form $B_{gbest}(t_{max})$ denotes the indices of the sentences in $D$ selected as summary sentences, whereas its ordered form $O_{gbest}(t_{max})$ stores the order of the summary's sentence indices. Finally, the sentence set indicated in $O_{gbest}(t_{max})$ is returned as the summary.
ROUGE-N is computed by dividing the maximum number of n-grams co-occurring in the candidate summary and the set of reference summaries by the total number of n-grams occurring in the reference summaries. ROUGE-L is a ROUGE variant that considers the longest common subsequence (LCS) between the candidate summary and a reference summary; it is computed as the ratio between the LCS's length and the reference summary's length. ROUGE-SU, on the other hand, considers skip-bigrams and unigrams in the candidate and reference summaries as counting units [4,10]. The formulas and a complete explanation of the ROUGE method can be found in [10].
The ROUGE scores of the CoDiCo-A, CoDiCo-B, and OCDsum-SaDE methods are presented in Table 2, which compares the ROUGE scores of the tested methods. The highest score for each ROUGE type is indicated in bold. We also evaluate our proposed method by considering the average coherence value of the summaries generated by each tested method. The comparison of average coherence values is shown in Table 3, with the highest value indicated in bold.

Discussion
A series of experiments has been conducted to evaluate our proposed methods (CoDiCo-A and CoDiCo-B) against the compared method (OCDsum-SaDE). Based on the evaluation results shown in Table 2, the CoDiCo-A method using threshold 0.7 has higher ROUGE-1, ROUGE-L, and ROUGE-SU scores than the other methods, whereas on ROUGE-2 the CoDiCo-B method using threshold 0.8 has the highest score. From Table 2 we also see that the lowest average ROUGE score of the CoDiCo method is 0.3370, obtained by CoDiCo-A using threshold 0.9, and the highest average ROUGE score is 0.3777, obtained by CoDiCo-A using threshold 0.7, whereas the average ROUGE score of the OCDsum-SaDE method only reaches 0.2293. This means all CoDiCo variants perform better than the compared method.
It should be noted that in the CoDiCo method, considering the coherence of sentences while selecting the best solution adjusts the coverage and diversity factors simultaneously in finding the optimal solution. This produces a different summary than a method that considers only the coverage and diversity factors, but one that is more similar to summaries created manually by humans. This is why the ROUGE scores of the CoDiCo method are greater than those of the compared method.
Comparing the average ROUGE scores of the CoDiCo variants with that of the OCDsum-SaDE method, the CoDiCo methods achieve average ROUGE scores 46.97-64.71% higher than the compared method. This shows that a multi-document summarization method that considers coverage, diversity, and coherence simultaneously can produce better summaries than a method that considers only coverage and diversity.
Based on the evaluation of the average summary coherence values shown in Table 3, the CoDiCo-B method without a threshold reaches a higher average coherence value than the others. The lowest average coherence value among the CoDiCo variants is 0.145, obtained by the CoDiCo-B method using threshold 0.7; nevertheless, this value is still higher than the average coherence value of OCDsum-SaDE. Compared with the OCDsum-SaDE method, CoDiCo-B without a threshold produces summaries with an average coherence value about 41.2 times higher, whereas CoDiCo-B using threshold 0.7 reaches an average coherence value about 29 times higher. This shows that the CoDiCo method, which involves an ordering step in the optimization process, can produce summaries with better coherence, or smoother connectivity among sentences, than a method that does not consider the ordering of the summary's sentences.
Comparing the two proposed sentence ordering algorithms at the same threshold value using their ROUGE scores shows that CoDiCo-B is better than CoDiCo-A. As shown in Table 2, CoDiCo-B has a higher average ROUGE score than CoDiCo-A when using no threshold, threshold 0.9, and threshold 0.8, whereas CoDiCo-A only has a higher average ROUGE score than

Algorithm 1. Sentence ordering Type A
1. From the sentences in the candidate summary, choose the two sentences ($s_i$ and $s_j$) with the highest similarity $sim(s_i, s_j)$ and use them as the initial ordering result $R = [s_i, s_j]$.
2. Change the status of $s_i$ and $s_j$ to head and tail, respectively.
3. Among the sentences not yet in the ordering result, choose the sentence ($s_k$) with the highest similarity when paired with the head or the tail.
4. Do one of the following:
   a. If $sim(head, s_k) \geq sim(tail, s_k)$, put $s_k$ in front of the head and change the status of $s_k$ to head: $R = [s_k, s_i, s_j]$.
   b. If $sim(head, s_k) < sim(tail, s_k)$, put $s_k$ behind the tail and change the status of $s_k$ to tail: $R = [s_i, s_j, s_k]$.
5. Repeat steps 3-4 until all sentences are in the ordering result.

Algorithm 2. Sentence ordering Type B
1. From the sentences in the candidate summary, choose the two sentences ($s_i$ and $s_j$) with the highest similarity $sim(s_i, s_j)$ and use them as the initial ordering result $R = [s_i, s_j]$.
2. Choose another sentence ($s_k$) with the highest similarity when paired with one of the sentences in the ordering result ($s_i$ or $s_j$).
3. Do one of the following:
   a. If $sim(s_i, s_k) \geq sim(s_j, s_k)$, put $s_k$ beside $s_i$ and set the statuses of $s_k$ and $s_j$ to head and tail, respectively: $R = [s_k, s_i, s_j]$.
   b. If $sim(s_i, s_k) < sim(s_j, s_k)$, put $s_k$ beside $s_j$ and set the statuses of $s_i$ and $s_k$ to head and tail, respectively: $R = [s_i, s_j, s_k]$.
4. Among the sentences not yet in the ordering result, choose the sentence ($s_k$) with the highest similarity when paired with the tail.
5. Put $s_k$ behind the tail and change the status of $s_k$ to tail.
6. Repeat steps 4-5 until all sentences are in the ordering result.
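The two ordering algorithms can be sketched in Python as follows (indices into a similarity matrix stand in for sentences; the helper names are ours):

```python
def order_type_a(sim):
    """Type A: grow the ordering at both ends, always placing the remaining
    sentence most similar to the current head or tail.
    sim: N x N similarity matrix; returns a list of sentence indices."""
    N = len(sim)
    # step 1: the two most similar sentences start the ordering
    i, j = max(((a, b) for a in range(N) for b in range(a + 1, N)),
               key=lambda p: sim[p[0]][p[1]])
    order, rest = [i, j], set(range(N)) - {i, j}
    while rest:
        head, tail = order[0], order[-1]
        # step 3: best match against head or tail
        k = max(rest, key=lambda s: max(sim[head][s], sim[tail][s]))
        if sim[head][k] >= sim[tail][k]:
            order.insert(0, k)   # step 4a: put k in front of the head
        else:
            order.append(k)      # step 4b: put k behind the tail
        rest.remove(k)
    return order

def order_type_b(sim):
    """Type B: keep the most similar pair near the beginning; after placing
    the third sentence, grow only at the tail."""
    N = len(sim)
    i, j = max(((a, b) for a in range(N) for b in range(a + 1, N)),
               key=lambda p: sim[p[0]][p[1]])
    rest = set(range(N)) - {i, j}
    if rest:
        # steps 2-3: place the third sentence beside its closer partner
        k = max(rest, key=lambda s: max(sim[i][s], sim[j][s]))
        order = [k, i, j] if sim[i][k] >= sim[j][k] else [i, j, k]
        rest.remove(k)
    else:
        order = [i, j]
    while rest:
        # steps 4-5: append the sentence most similar to the current tail
        tail = order[-1]
        k = max(rest, key=lambda s: sim[tail][s])
        order.append(k)
        rest.remove(k)
    return order
```

On the same similarity matrix the two variants can produce different orders, since Type A may keep inserting in front of the head while Type B extends only the tail after its third placement.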