SUMMARIZATION of JEWISH LAW ARTICLES in HEBREW

Yaakov HaCohen-Kerner, Eylon Malin, Itschack Chasson
Department of Computer Science, Jerusalem College of Technology (Machon Lev), 21 Havaad Haleumi St., P.O.B. 16031, 91160 Jerusalem, Israel
[email protected] [email protected] [email protected]

Abstract

With the explosive growth of online information, there is a growing need for summaries. This paper describes a summarization model for Jewish law articles in Hebrew based on selection of the most relevant sentences. Our research has a few unique aspects. The first is testing the relevance of all the traditional methods on texts that differ from magazine articles in both style and length. Another unique aspect is that our system is probably the first developed for summarizing Hebrew articles. We have developed a hybrid summarization method that achieves better results than all the other traditional summarization methods.

Keywords: Hebrew, Jewish law articles, sentence extraction, text summarization

1 INTRODUCTION

Summaries can be read with limited effort in a shorter reading time. Therefore, people prefer to read a summary before deciding whether they are going to read the whole text. Humans have an incredible ability to condense huge amounts of information and are known as excellent summarizers. However, creating summaries by hand costs time and money. Therefore, there has been an increase in research and development in the domain of automatic text summarization.

Text summarization is the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks) [9]. Basic and classical articles in text summarization appear in "Advances in Automatic Text Summarization" [9]. A literature survey on information extraction and text summarization is given by Zechner [13]. Various text summarization sources, e.g. books, papers, conferences, workshops, projects, and systems, are available at several web sites, e.g. http://www.summarization.com and http://www.site.uottawa.ca/tanka/ts.html.

Many summarization systems use one of two main approaches: Natural Language Processing (NLP) [1] and sentence extraction [2]. The NLP approach is based on understanding the sentences of the documents. NLP has some very sophisticated models, which require large databases and very long processing times. The sentence extraction approach, on the other hand, is based on gathering the most relevant sentences from the original text; these sentences are presented in the order of their appearance in the original text. Our model belongs to the second approach.

In contrast to many summarization models that were designed and tested mostly on English articles from magazines and newspapers, our model deals with articles on Jewish law written in Hebrew. These articles discuss religious problems that have arisen over the last years due to sociological and technological developments. Examples of such problems are:
1. When is a person considered dead?
2. Are animals that are not mentioned in ancient Jewish writings kosher?

The purpose of these articles is not only to answer such questions. Each answer must be based on both ancient Jewish writings and answers given by previous rabbinical authorities over the years. Moreover, arguments contradicting the author's answer should also be addressed, and the author should give an acceptable explanation that resolves them.

This paper is arranged as follows. Section 2 gives background on text summarization based on sentence extraction. Section 3 describes the model we designed for our article corpus. Section 4 presents the experiments that were carried out, followed by their results.
Section 5 summarizes the research and proposes a few directions for future work.

2 TEXT SUMMARIZATION BASED on SENTENCE EXTRACTION

A study by Kupiec et al. [4] showed that 79% of the sentences in the man-made abstracts of their corpus are extremely similar to sentences from the original articles; some sentences were even extracted verbatim from the original article. Therefore, sentences extracted directly from the original text, without being revised or rephrased, can form quite an appropriate abstract. Summarization systems that work on the basis of sentence extraction usually rate sentences according to various features. We now present a quick survey of the most frequent methods for rating sentences.

2.1 Proposed and baseline methods

1) Term frequency (TF): This method scores a sentence according to the key words that appear in it. First, in order to distinguish significant key words from other terms, the system passes through the text and scores each term according to its number of occurrences. Words that have a purely grammatical role (e.g. of, the, I, am) are excluded from the key-word list according to a ready-made stop list. Once the system has a database of the key words and their occurrence counts, the score of each sentence is calculated from the frequencies of the key words that occur in it:

TF(s) = Σ_{t ∈ s} f(t)

where {t ∈ s} is the set of terms in a certain sentence s, and f(t) is the number of occurrences of the term t throughout the whole text [7, 2].

2) Cue words: This method scores a sentence according to the appearance of words and terms that indicate the importance of the sentence, e.g. "the meaning of this is", "for conclusion", etc.
The more cue words occur in the sentence, the higher the score the sentence is given:

CW(s) = |{c ∈ s}| / C

where {c ∈ s} is the set of cue words appearing in a certain sentence s, and C is the total number of terms defined as cue words [2].

3) Sentence length: It is most probable that very short sentences are not included in a summary [12]. This method scores each sentence by dividing its number of words by the number of words in the longest sentence (in order to normalize the score):

SL(s) = length(s) / length(s_max)

where s is the current sentence and s_max is the longest sentence [5].

4) Negative score: Some phrases clearly indicate that the sentences in which they occur do not belong in the summary. These phrases are defined as negative phrases, and they grant the sentences in which they appear a negative score. Examples of such phrases are "for example" or "it could be that". The negative score is calculated as follows:

NS(s) = −|{n ∈ s}| / N

where {n ∈ s} is the set of negative phrases appearing in a certain sentence s, and N is the total number of negative phrases [10].

5) Sentence position: This method scores a sentence by its position within its paragraph, and according to the relative position of its paragraph in the article. The sentence position score is calculated as follows:

SP(s) = val(pos, par)

where pos is the position of the sentence in its paragraph, par is the paragraph number in the article, and val is a function that returns the score given these two parameters. The return values of val are determined by statistical results [2, 8, 6].

6) Centrality: It is assumed that a sentence that summarizes several other sentences has a high probability of being part of the summary. Taking this into consideration, a sentence is scored by the number of sentences it resembles, divided by the number of sentences in the article.
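As a rough illustration, the centrality idea can be sketched in Python. The word-overlap test below is our own stand-in for the paper's resemblance function res(si, sj), and the 0.5 threshold is an arbitrary illustrative choice:

```python
def resembles(s1, s2, threshold=0.5):
    """Toy stand-in for res(si, sj): two sentences 'resemble' each other
    when they share at least `threshold` of the shorter sentence's words
    (stop-word removal is omitted here for brevity)."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    shorter = min(len(w1), len(w2))
    return shorter > 0 and len(w1 & w2) / shorter >= threshold

def centrality(sentences, i):
    """Score sentence i by the fraction of the other sentences it resembles."""
    others = [s for j, s in enumerate(sentences) if j != i]
    if not others:
        return 0.0
    return sum(resembles(sentences[i], s) for s in others) / len(others)
```

For example, given the sentences "the cat sat on the mat", "the cat sat", and "dogs bark loudly", the first sentence resembles only the second, so its centrality score is 0.5.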
The centrality score is calculated as follows:

C(s_i) = Σ_{s_j ∈ S∖{s_i}} res(s_i, s_j) / (|S| − 1)

where res(s_i, s_j) is a function that checks the resemblance between the sentences s_i and s_j according to various parameters, and S is the set of sentences in the article [11].

7) Resemblance to title: This method scores a sentence according to its resemblance to the title; sentences that resemble the title are granted a higher score. The resemblance-to-title score is calculated as follows:

TR(s) = res(s, t)

where res(s, t) is a function that checks the resemblance between a sentence s and the title t [2, 11].

8) Term frequency inverse sentence frequency (TF-ISF): Key words occurring in fewer sentences are more likely to belong in the summary. This method extends the TF method (no. 1) by also taking the ISF property into consideration:

ISF(t) = |S| / |{s : t ∈ s}|

where |{s : t ∈ s}| is the number of sentences containing the term t and |S| is the total number of sentences. The fewer sentences a key word occurs in, the higher its rank. Since this feature is a weaker indicator than the term frequency, the term frequency is multiplied by log₂(ISF(t)) and not by the ISF score itself. The TF-ISF score is finally calculated as follows:

TF-ISF(s) = Σ_{t ∈ s} f(t) · log₂(ISF(t)) [11].

We have also developed other methods, which are discussed in Section 3.

2.2 The Hebrew language

Most of the models designed in this field were developed for the English language. In this sub-section we point out six properties of the Hebrew language that make the implementation of the model much harder:

1) Tenses – most verbs in the English language differ from the base form only by one or more letters added at the end of the word. This makes words much easier to compare: truncating all characters after the fifth [3] or sixth [12] character of the word does quite well.
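This truncation heuristic for English can be sketched in a few lines (keeping k = 6 characters, following [12]; the sample words are our own):

```python
def truncate_stem(word, k=6):
    """Crude stemming for English: keep only the first k characters,
    so most inflected forms of a verb map to the same key."""
    return word.lower()[:k]

# Different inflections of the same verb collapse to one key:
stems = {truncate_stem(w) for w in ("summarize", "summarized", "summarizing")}
# stems == {"summar"}
```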
In Hebrew, however, such a simple process is of little help, since the various inflections change the base form of the word in various ways. In some cases the same base form can have over 7,000 forms for different tenses and persons. This feature of the Hebrew language makes it nearly impossible to compare two words without a morphological analysis. For example, the Hebrew words mesukam ("summarized", passive) and sikamty ("I summarized") both derive from the same root.

2) Word suffixes – there are 5 letters in Hebrew that are written differently when they appear at the end of a word. This feature of the Hebrew language also makes it harder to compare two words. In the previous example, the final letter of mesukam and the corresponding letter of sikamty both derive from the same letter of the root, although in the first word it is written with a different character since it is positioned at the end of the word.

3) Preposition letters – unlike English, which has dedicated words to express relations between objects (such as in, at, and, from, since, etc.), Hebrew has 8 letters that are concatenated to the beginning of a word, where each letter expresses a different relation. For example, the Hebrew word mehasikum means "from the summary": one prefix letter expresses the determiner "the" and another expresses the preposition "from".

4) Pronoun letters – English has dedicated words to express ownership (such as her, his, etc.), whereas Hebrew concatenates letters to the end of the word to express such ownership. For example, the Hebrew word maamar means "article", whereas maamari means "my article".

5) Variant spellings – words that can be written in different ways, i.e. in plene spelling or in deficient spelling, are very frequent in Hebrew. For instance, the Hebrew word o'hel ("tent") can be written in either plene or deficient spelling.
Similarly, the Hebrew word limed ("taught") can be written in either plene or deficient spelling.

6) Initials – initials are much more frequent in Hebrew than in English, and due to their frequency, ambiguous initials are not so rare. For example, one common pair of initials has many interpretations, e.g. ea efshar ("impossible"), ani omer ("I say"), and amar Avraham ("Abraham said").

3 OUR SUMMARIZATION MODEL

Our model summarizes Jewish law articles written in Hebrew; the generated summaries should be conclusive. We implemented most of the methods mentioned in Section 2.1. Implementing these methods on articles in the Hebrew language was much harder. The difficulties arise mostly in the methods based on word comparison (e.g. TF, centrality, title resemblance), since it is hard to identify two words that have different surface forms but the same root. In addition, many terms in the jargon of this corpus are written as initials, and the pronoun and preposition letters concatenated to Hebrew words cause numerous problems as well; comparing terms is far more complicated under these circumstances. Such problems are even more pronounced in the methods based on sentence similarity, whose implementation also required coping with tenses and inflected forms.

After performing experiments with the various methods, we found the combination of methods that gives the best results. During the experimental phase we checked the results both of each method individually and of various combinations of methods. We also tested a new method, which is based on associative words.
At first, the system finds the text's domain by seeking the most frequent key words and determining which domain they belong to (we built a word list for each domain for that purpose). Once the domain is determined, key words belonging to this domain receive a higher score. For example, under the domain "constitution and government", key words such as democracy, liberality, and president are given a higher score than other key words.

After running each method individually, we took the 5 best methods and combined them into a hybrid method (defined below). Each method was given a different weight, and each sentence was given a weighted score in which each method's score is multiplied by its weight. The final score combines the following methods: TF-ISF, position, cue words, section title, and domain, according to the following hybrid equation:

Score(s) = α·TF_ISF(s) + β·POS(s) + γ·CUE(s) + δ·ST(s) + ε·D(s)

where TF_ISF(s) is the score of the TF-ISF method and α is its weight, POS(s) is the score of the position method and β is its weight, and so on. Note that 0 ≤ α, β, γ, δ, ε ≤ 1 and that α + β + γ + δ + ε = 1.

4 EXPERIMENTAL RESULTS

In order to measure the success of the summarization methods mentioned above, a few comparison functions have been suggested. Mani and Bloedorn [8] suggested an automatic procedure for generating reference summaries for articles with author-provided summaries. The main idea of the procedure is to take the sentences that most closely resemble (according to the cosine measure) the sentences in the author-provided summary, and to present them as an extract. Obviously, one of the most significant components of such a procedure is the sentence comparison function. Our results have been evaluated by the recall and precision measures.
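The two measures can be sketched over sets of extracted sentence indices; the function name and sample data below are ours, a minimal illustration rather than the system's actual evaluation code:

```python
def recall_precision(generated, reference):
    """Sentence-level recall and precision: `generated` and `reference`
    are collections of the sentence indices chosen for each summary."""
    correct = len(set(generated) & set(reference))
    recall = correct / len(reference) if reference else 0.0
    precision = correct / len(generated) if generated else 0.0
    return recall, precision
```

For instance, recall_precision([0, 3, 5, 9], [3, 5, 7]) gives a recall of 2/3 (two of the three reference sentences were found) and a precision of 0.5 (two of the four generated sentences are correct).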
The recall measure is defined as the number of correct sentences in the generated summary divided by the total number of sentences in the reference summary. The precision measure is defined as the number of correct sentences in the generated summary divided by the total number of sentences in the generated summary.

Our corpus contains 60 articles, each with its own author-provided summary. We compared the summaries of our system with the reference summaries of the procedure assisted by the author-provided summaries. The results were very low: the highest precision/recall result was 0.14/0.25. However, when we read both kinds of summaries, we found that ours were much more indicative. It seems that the cosine measure does not take some very significant factors into consideration: its main problem is that it considers the similarity of any two words (excluding stop-list words) without regard to their importance to the text and its domain. Therefore, we needed to develop a new method for checking the resemblance between sentences. The method we developed takes the following factors into consideration:

1) Similar words that belong to the text's issue are given a higher matching score. This factor is calculated as follows:

I(s1, s2) = |{w : w ∈ s1, w ∈ s2, w is an issue word}| / ((|s1| + |s2|) / 2)

where s1 and s2 are the two sentences being compared.

2) We defined special Jewish rabbinical conclusive cue words as words that indicate a conclusion, e.g. "must", "forbidden", etc. Conclusive key words that belong to the text's issue are given a higher matching score. This factor is calculated in the same way:

C(s1, s2) = |{w : w ∈ s1, w ∈ s2, w is a conclusive key word}| / ((|s1| + |s2|) / 2)

3) Regular cue words that indicate the importance of the sentence are also given a higher matching score.
This factor is calculated in the same way:

R(s1, s2) = |{w : w ∈ s1, w ∈ s2, w is a regular cue word}| / ((|s1| + |s2|) / 2)

where s1 and s2 are the two sentences being compared.

Besides the aforementioned similarity factors, we have also taken the cosine measure into consideration. Each of these factors was multiplied by a coefficient, in the following way:

Sim(s1, s2) = α·I + β·C + γ·R + δ·D

where I is the issue-words factor, C is the Jewish rabbinical conclusive cue-words factor, R is the regular cue-words factor, and D is the plain cosine measure. The coefficients satisfy α + β + γ + δ = 1 and 0 ≤ α, β, γ, δ ≤ 1.

This comparison function yielded not only a much higher similarity between the summaries of our system and the reference summaries of the procedure assisted by author-provided summaries; it also yielded more indicative reference summaries.

Table 1 presents the recall and precision results of the different summarization methods tested on our corpus. The length of each summary is 10% of the length of the article, a common summarization ratio. Our hybrid method gave the best recall/precision results, 0.42/0.21, since it best suits our conclusive summarization task. These results are regarded as reasonable compared to those of Neto [11], 0.42/0.4. The recall and precision results are presented again as histograms in Figures 1 and 2, respectively.

Table 1. Recall and precision results of summarization methods on our corpus

Method          Recall   Precision
TF              0.30     0.16
Cue words       0.18     0.09
Length          0.09     0.18
Position        0.19     0.10
Centrality      0.04     0.02
Title           0.04     0.02
Section Title   0.17     0.10
TF-ISF          0.30     0.16
Domain words    0.37     0.18
Hybrid          0.42     0.21

Fig. 1. Recall results of summarization methods

Fig. 2. Precision results of summarization methods

5 SUMMARY and FUTURE RESEARCH

In this paper we have described a new method for conclusive summarization using sentence extraction. This method gave the best results for the texts we have dealt with. We have also developed a system that is compatible with the Hebrew language; as far as we know, no existing program is able to summarize Hebrew articles.

Future directions for research are: (1) some rabbinical authorities are taken more seriously by all authors than others, so we suggest giving higher scores to sentences in which those rabbinical authorities are cited; (2) it is known that certain authors take some rabbinical authorities into consideration more than others, so the importance of the different rabbinical authorities should be computed relative to the discussed author. More general research proposals are: (1) developing an automatic learning technique for tuning the coefficients of the similarity and hybrid functions, which might improve the extraction results; (2) elaborating the model to summarize other kinds of Hebrew articles; (3) experimenting with the model on a larger data set of various kinds of Hebrew articles.

ACKNOWLEDGEMENTS

Thanks to Ari Cirota and two anonymous referees for many valuable comments on earlier versions of this paper.

6 REFERENCES

[1] C. Aone, M. E. Okurowski, J. Gorlinsky and B. Larsen, "A Scalable Summarization System Using Robust NLP," Proc. of the ACL Workshop on Intelligent Scalable Text Summarization, pp. 66-73, 1997.
[2] H.P. Edmundson, "New Methods in Automatic Extraction," Journal of the ACM 16(2): pp. 264-285, 1969. Reprinted in Advances in Automatic Text Summarization, I. Mani and M.T. Maybury (eds.), Cambridge, Massachusetts: MIT Press, pp. 21-42, 1999.
[3] Y. HaCohen-Kerner, "Automatic Extraction of Keywords from Abstracts," Proc.
of the Seventh International Conference on Knowledge-Based Intelligent Information & Engineering Systems, Lecture Notes in Artificial Intelligence 2773, Berlin: Springer-Verlag, pp. 843-849, 2003.
[4] J. Kupiec, J. Pedersen and F. Chen, "A Trainable Document Summarizer," Proc. of the 18th Annual International ACM SIGIR Conference, pp. 68-73, 1995.
[5] C-Y. Lin, "Training a Selection Function for Extraction," Proc. of the 8th International Conference on Information and Knowledge Management (CIKM 99), Kansas City, Missouri, pp. 55-62, 1999.
[6] C-Y. Lin and E.H. Hovy, "Identifying Topics by Position," Proc. of the Applied Natural Language Processing Conference (ANLP-97), pp. 283-290, 1997.
[7] H. P. Luhn, "The Automatic Creation of Literature Abstracts," IBM Journal of Research and Development, 2(2): pp. 159-165, 1958. Reprinted in Advances in Automatic Text Summarization, I. Mani and M.T. Maybury (eds.), Cambridge, Massachusetts: MIT Press, pp. 15-21, 1999.
[8] I. Mani and E. Bloedorn, "Machine Learning of Generic and User-Focused Summarization," Proc. of AAAI-98, pp. 821-826, 1998.
[9] I. Mani and M. T. Maybury (eds.), Advances in Automatic Text Summarization, Cambridge, MA: MIT Press, pp. ix-xv, 1999.
[10] S. Myaeng and D. Jang, "Development and Evaluation of a Statistically Based Document Summarization System," in Advances in Automatic Text Summarization, I. Mani and M.T. Maybury (eds.), Cambridge, Massachusetts: MIT Press, pp. 137-154, 1999.
[11] J. L. Neto, A. A. Freitas and C. A. A. Kaestner, "Automatic Text Summarization Using a Machine Learning Approach," Proc. of the 16th Brazilian Symposium on Artificial Intelligence (SBIA-2002), Porto de Galinhas/Recife, Brazil, pp. 205-215, 2002.
[12] K. A. Zechner, "Fast Generation of Abstracts from General Domain Text Corpora by Extracting Relevant Sentences," Proc. of the 16th International Conference on Computational Linguistics, pp. 986-989, 1996.
[13] K. A.
Zechner, “A Literature Survey on Information Extraction and Text Summarization,” Term Paper, Carnegie Mellon University, 1997.