
UDC 81:004.93
V. Vysotska
Lviv Polytechnic National University
Information Systems and Networks Department
GENERATIVE REGULAR GRAMMARS APPLICATION TO MODELING
THE SEMANTICS OF SENTENCES IN NATURAL LANGUAGE
© Vysotska V., 2014
(Lviv Polytechnic National University Institutional Repository, http://ena.lp.edu.ua)
This paper presents the application of generative grammars to linguistic modelling. Modelling of sentence syntax is applied to automate the analysis and synthesis of natural-language texts.
Key words: generative grammar, structural sentence scheme, computer linguistic system.
The application of generative grammars in linguistic modelling is presented; sentence-syntax modelling is applied to automate the analysis and synthesis of natural-language texts.
Key words: generative grammars, structural sentence scheme, computer linguistic system.
Introduction
At the present stage of development, the need for general and specialized linguistic systems drives the active use of applied linguistics and computer information technology. The development of mathematical models for computer language-support systems makes it possible to solve such applied-linguistics tasks as the analysis and synthesis of spoken or written text content, description and indexing of text content, text translation, creation of lexicographic databases, etc. [1-2, 6-8]. Linguistic analysis of text content consists of several stages: graphemic, morphological, syntactic and semantic analysis [3-5, 9-17]. For each of these stages, appropriate models and algorithms have been developed [3-5, 9-17]. An effective tool for linguistic modelling at the syntactic and semantic levels of language is the main part of combinatorial linguistics, the theory of generative grammars, which originates in the work of the American linguist Noam Chomsky [9-17]. He used formal analysis of the grammatical structure of a phrase to extract its syntactic (constituent) structure regardless of its meaning. Chomsky's ideas were developed by the Soviet linguist A. Gladky [3-5], who used the concepts of dependency trees and constituent systems to model language syntax. He proposed a method of syntax modelling based on syntactic groups, which produce phrase constituents as units for building a dependency tree; such a representation combines the advantages of the immediate-constituents method and of dependency trees [3-5].
The advantage of modelling with generative grammars is the ability to describe not only the syntactic level of a language (the rules for forming sentences from word forms) but also the morphemic level (the rules for forming words from morphs) and the semantic level (the rules for forming meaningful sentences and texts). This is used to automate inflection and derivation, categorization, keyword identification and the formation of text-content digests. For example, in automatic morphological synthesis a computer system creates the required word forms based on the requirements placed on word forms and on morpheme databases.
Relation of the presented issue with important scientific and practical tasks
The development of methods and tools for automatic processing of commercial text content is important and topical in modern information technology [1-2, 6-8] (for example, systems for information
retrieval, machine translation; semantic, statistical, optical and acoustic analysis and synthesis of speech; automated editing; knowledge extraction from text content; abstracting, annotation and indexing of text content; training and didactic systems; management of linguistic corpora; instrumental tools for compiling dictionaries of various types, etc.). Specialists are actively seeking new description models and methods for automatic processing of text content [1-2, 6-8]. One of these directions is the development of general principles for lexicographic systems of the syntactic type, and the construction, on these principles, of text-content processing systems for specific languages [1-2, 6-8].
Research by linguists in morphology, morphonology and structural linguistics has identified various patterns for describing word forms [1-17]. With the development of generative grammar theory, linguists began to focus not only on describing finished word forms but also on the process of their synthesis. Ukrainian linguists have produced fruitful research in such functional areas as theoretical problems of morphological description; the classification of the morphemic and word-formative structure of derivatives in the Ukrainian language; regularities of affix combinatorics; modelling the word-formation mechanism of modern Ukrainian in dictionaries of the integral type; the principles of the internal organization of words; the structural organization of suffixes of various verbs and nouns; problems of word-formative motivation in the formation of derivatives; the laws governing morphonological phenomena in Ukrainian word formation; morphonological modifications in verb inflection; morphonological processes in word formation and adjective inflection of the modern Ukrainian literary language; analysis and processing of textual content, etc.
This dynamic approach of modern linguistics to the analysis of the morphological level of language, with the researcher's attention focused on developing morphonological rules, allows the results of theoretical research to be used effectively in practice for constructing computer linguistic systems that process textual content for various purposes [1-17]. One of the first attempts to apply generative grammar theory to linguistic modelling belongs to A. Gladky and I. Melchuk [3-5]. The experience and research of Noam Chomsky [9-17], A. Gladky [3-5], M. Gross, A. Lentin, A. Anisimov, Y. Apresyan, N. Bilhayeva, I. Volkova, T. Rudenko, E. Bolshakova, E. Klyshynsky, D. Lande, A. Noskov, A. Peskova, E. Yahunova, A. Herasymov, B. Martynenko, A. Pentus, M. Pentus, E. Popov and V. Fomichev [2, 8] are applicable to developing tools for textual-content processing such as information retrieval systems, machine translation, textual-content annotation, morphological, syntactic and semantic analysis of textual content, educational-didactic systems for textual-content processing, linguistic support of specialized linguistic software systems, etc. [1-17].
Recent research and publications analysis
Linguistic analysis of content consists of three stages: morphological, syntactic and semantic [2-5]. The purpose of morphological analysis is to obtain stems (word forms without inflections) together with the values of grammatical categories (e.g. part of speech, gender, number, case) for each word form [2-5]. There are exact and approximate methods of morphological analysis [2]. The exact methods use dictionaries of word stems or of word forms. The approximate methods use experimentally established links between fixed letter combinations of word forms and their grammatical meanings [2-5]. Using a word-form dictionary simplifies exact morphological analysis. For example, in the Ukrainian language it solves the problem of vowel and consonant alternation under changing conditions of word use [2]. The word stems and grammatical attributes are then found by searching the dictionary and selecting the appropriate values; morphological analysis proper is applied only when the desired word form cannot be located in the dictionary. With sufficiently complete thematic dictionaries the speed of textual-content processing is high, but the required memory is several times larger than with a stem dictionary [2-5]. Morphological analysis with a stem dictionary is based on inflectional analysis and precise selection of the word stems. The main problem here is homonymy of word stems; for disambiguation, the compatibility of the selected stem with its inflection is checked [2-5].
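A minimal sketch of the exact, dictionary-based analysis described above, with a hypothetical toy word-form dictionary standing in for a real lexicon (all entries and attribute names are illustrative assumptions, not the author's data):

```python
WORDFORM_DICT = {
    # word form -> (stem, grammatical attributes); entries are hypothetical
    "books": ("book", {"pos": "noun", "number": "plural"}),
    "book":  ("book", {"pos": "noun", "number": "singular"}),
    "ran":   ("run",  {"pos": "verb", "tense": "past"}),
}

def exact_analysis(word_form):
    """Look the word form up in the dictionary; return (stem, attributes) or None."""
    return WORDFORM_DICT.get(word_form.lower())

print(exact_analysis("Books"))  # ('book', {'pos': 'noun', 'number': 'plural'})
```

When the lookup returns None, the approximate method described below would take over.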
The approximate methods of morphological analysis determine the grammatical class of a word by its final letters and letter combinations [2]. First, stemming separates the stem from the word. From the
ending of the word, letters are removed one at a time, and the obtained letter combinations are compared with the inflection list of the appropriate grammatical class [2-5]. When the final part coincides with a listed inflection, the remainder of the word is taken as its stem [2]. Morphological analysis gives rise to ambiguities in determining grammatical information, which disappear after parsing [2]. The task of syntactic analysis is to parse sentences based on data from the dictionary [2-5]. At this stage nouns, verbs, adjectives, etc. are identified, and the links between them are indicated in the form of a dependency tree. Any syntactic-analysis tool consists of two parts: a knowledge base about a particular natural language, and a syntactic-analysis algorithm (a set of standard operators for processing text content with this knowledge) [2-5]. The sources of grammatical knowledge are the data from morphological analysis and various tables of concepts and linguistic units [2]. These tables result from experts' empirical processing of natural-language textual content in order to highlight the basic regularities used in syntactic analysis. The tables of linguistic units constitute configuration or valency sets (syntactic and semantic-syntactic dependencies) [2]: a list or dictionary of lexical units with, for each of them, instructions listing all possible links with other units of expression in the natural language [2-5]. An implementation of syntactic analysis should achieve full independence of the table-transformation rules from the table contents, so that a change of content does not require restructuring the algorithm.
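The approximate morphological method described earlier (stripping endings one letter at a time and matching them against an inflection list) can be sketched as follows; the inflection table is an illustrative assumption, not a real Ukrainian paradigm:

```python
# Hypothetical inflection table: ending -> grammatical attributes
INFLECTIONS = {
    "s":   {"pos": "noun", "number": "plural"},
    "ed":  {"pos": "verb", "tense": "past"},
    "ing": {"pos": "verb", "aspect": "progressive"},
}

def approximate_analysis(word):
    """Try progressively longer endings; return (stem, attributes) on a match."""
    for cut in range(1, min(len(word), 4)):
        ending = word[-cut:]
        if ending in INFLECTIONS:
            return word[:-cut], INFLECTIONS[ending]
    return word, {}  # no known inflection: treat the whole word as the stem

print(approximate_analysis("walked"))  # ('walk', {'pos': 'verb', 'tense': 'past'})
```

The ambiguities the text mentions arise exactly here: several endings may match, and only parsing can choose among the candidate analyses.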
The vocabulary V is a finite non-empty set of lexical units [2]. An expression over V is a finite-length string of lexical units from V. The empty string contains no lexical units and is denoted by Λ. The set of all strings over V is denoted V′. A language over V is a subset of V′. A language can be specified by listing all its expressions, or by defining a criterion that the expressions belonging to the language must satisfy [2]. Another important method of specifying a language is a generative grammar. A grammar consists of a set of lexical units of various types and a set of rules, or productions, for constructing expressions. A grammar has a vocabulary V, the set of lexical units from which the language's expressions are built. Some lexical units of the vocabulary (terminals) cannot be replaced by other lexical units.
A generative grammar G is a four-tuple G = (V, T, S, P), where V is a finite non-empty set, the alphabet; T is a subset of V whose elements are the terminal (main) lexical units, or terminals; S is the initial symbol (S ∈ V); and P is a finite set of productions (conversion rules) ξ→η, where ξ and η are strings over V. The set V \ T is denoted by N; its elements are the non-terminal lexical units, or non-terminals [1-17]. Grammars are classified by the types of their productions, on which certain restrictions are imposed.
1. Unrestricted grammar G0. Here ξ is an arbitrary string containing at least one non-terminal symbol, and η is an arbitrary string over V.
2. Context-sensitive grammar G1. Every production in P has the form γξδ→γηδ with |ξ| ≤ |η| (and not merely ξ→η), so ξ may be replaced by η only when surrounded by the strings γ…δ, i.e. in the appropriate context.
3. Context-free grammar G2. A non-terminal A on the left side of a production A→η can be replaced by the string η in any environment whenever it occurs, i.e. regardless of context.
4. Regular grammar G3. Only productions of the form A→aB, A→a and S→λ may occur, where A and B are non-terminals, a is a terminal, and λ is the empty string.
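The classification above can be sketched programmatically. The representation of productions as pairs of strings of single-character symbols is a simplifying assumption, and the type-1 check is the standard non-contracting condition rather than the full γξδ→γηδ form:

```python
def classify(productions, nonterminals):
    """Return the highest Chomsky type (3..0) whose restrictions all productions satisfy."""
    def is_regular(l, r):
        # A -> aB, A -> a, or A -> empty (right-linear form)
        return (len(l) == 1 and l in nonterminals and
                (r == "" or
                 (len(r) == 1 and r not in nonterminals) or
                 (len(r) == 2 and r[0] not in nonterminals and r[1] in nonterminals)))
    def is_context_free(l, r):
        return len(l) == 1 and l in nonterminals
    def is_noncontracting(l, r):
        return 0 < len(l) <= len(r)
    if all(is_regular(l, r) for l, r in productions):
        return 3
    if all(is_context_free(l, r) for l, r in productions):
        return 2
    if all(is_noncontracting(l, r) for l, r in productions):
        return 1
    return 0

print(classify([("S", "aS"), ("S", "b")], {"S"}))  # 3
print(classify([("AB", "BA")], {"A", "B"}))        # 1
```

Each type's restriction strictly contains the next, so the checks are tried from the most restrictive type downward.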
Terminal lexical units are the word forms of the natural language, non-terminal lexical units are syntactic categories, and terminal strings are the correct expressions of the language [9-17]. A derivation of an expression is then naturally interpreted as its syntactic structure, given in terms of the generative grammar. The set of natural-language expressions has a number of specific properties. When natural-language expressions are analysed in the theory of formal grammars, they are treated as strings of word forms or morphemes acting as terminal lexical units. For a set of expressions, a recognition algorithm decides whether a submitted string is an expression of the language. Sets for which recognition algorithms exist are recursive. To generate the natural-language expressions and only them, a restriction is imposed on productions: in a production A→B, the string B is no shorter than the string A, so strings do not shrink during derivation. Grammar G0 does not meet this restriction: it may contain productions that shorten strings [3-5]. Nevertheless, the language L(G0) is recursive. Languages generated by non-contracting grammars are easily recognizable.
In a context-sensitive grammar G1 there are productions γAδ→γηδ in which at least one of the strings γ, δ differs from Λ, and the non-terminal A is replaced by the string η only when surrounded by γ and δ, i.e. in context. A language is context-sensitive if at least one context-sensitive grammar generates it. The term "formation rules" is borrowed from mathematical logic, where it refers to the rules for constructing correct formulas. In logic a different type of rule is also considered, the rules of transformation, which establish certain correlations between valid formulas. Production rules are needed to describe natural languages. Introducing transformation rules means moving to a higher level of language examination, namely the semantic level. Knowing a language necessarily implies the ability not only to build correct phrases but also to switch from one sentence to another that is either completely synonymous with it or differs from it in meaning by a certain amount: for example, to make an affirmative sentence negative or interrogative, to change active voice into passive, to change the stylistic colouring of a text, to express the same thought in different ways, and so on. These capabilities cannot be stated in terms of the grammars above, which raises the question of developing a formal system of transformation rules for natural languages. The corresponding problem was first stated in the works of Noam Chomsky [9-17]. His concept quickly gained fame under the name of transformational grammar: the introduction of a semantic level of language description. Indeed, every meaning-invariant transformation is of interest: a transformation is a change that preserves meaning. Thus transformation theory is a theory of language synonymy, and since the description of synonymy must take a central place in linguistics, the primary role of transformations becomes clear. However, transformations do not belong to the same level as the grammar: the grammar G0 relates to syntax, transformations to semantics. That G0 alone is insufficient to describe the meaning of a language does not discredit G0: it simply does not cover the semantic level, while at the syntactic level such a grammar is fundamentally sufficient. Generative grammar is treated within a formal theory; for transformations this level of formalization has not yet been achieved, as transformation rules are not formulated in terms of simple operations. The task of further formalizing transformations is therefore very important. In the works of Noam Chomsky and several other authors [1-17] the term "generative grammar" is used in two senses: broadly, to refer to any system of formal rules describing a language, including transformational and morphonological components; and narrowly, to refer to the grammars defined above. In this work the term is always used in the narrow sense; in this usage, transformation rules lie beyond the generative grammar.
Statement of purpose
Representation of syntactic structure in terms of generative grammars is often used in linguistics and has been studied many times in many different ways. It has won the right to exist both in theoretical terms and in experimental work (automatic translation, abstracting, etc.). Grammars, while generating terminal strings such as the expressions of a natural language, also give their structure. An unrestricted non-contracting grammar, however, no longer has the property of pairing expressions with their constituent structure: at each step such a grammar may replace not one lexical unit but a group of them, so in a derivation the parent of each lexical unit cannot be determined uniquely, and the application of a production rule is not converted into a context-dependent structure.
Linguistic software is used in many information systems. Improving human-machine communication is an important current task, which is addressed through text analysis and synthesis at the linguistic level. For this purpose, we consider the process of modelling natural-language phrases by generative grammars.
For automated retrieval and processing of textual content, the presence or absence and the frequency of occurrence of particular categories of linguistic units in the studied content array are of great importance. Quantitative calculation allows objective conclusions about the orientation of the content from the number of analysis units (key quotations) in the investigated arrays, for example the number of positive or negative reviews of a certain type of product. Qualitative analysis allows objective conclusions about the presence of the desired linguistic units in the content array and the orientation of their context. Content search is performed not over the full text content but over its brief characteristics, the retrieval images (RIm), in which the main text of the content is represented in terms of a specialized information retrieval
language. The procedure for building the RIm involves indexing, semantic analysis of the main text content, and its translation into the information retrieval language (Table 1).
Table 1
The main stages of a content-search operation

Operation name: Operation description
RIm formation: Creation, input and storage of the RIm in the RIm module.
Query generation and SA: Creation, input and storage of a user's query and its SA in the queries module.
Content search: Comparison of the content's RIm with the SA of the user's request.
Content analysis: Quantitative and qualitative analysis of the text content.
Result formation: The result of applying content analysis; positive content lies in the range (0.7; 1] or (0.5; 1].
Decision making: The decision on issuing the content according to the result of the content analysis.
Content presentation: Issuing the content that corresponds to the user's information request.
Table 2
The main elements of an information retrieval language

Element name: Language unit characteristics
Alphabet: A set of graphic characters for recording the words and expressions of the language.
Vocabulary: A set of interrelated linguistic units (the words used in speech).
Grammar: A set of rules for combining linguistic units into words; the most effective means of building sentences.
Paradigm: A lexico-semantic group of words with subject-logical links based on semantic criteria.
Paradigmatic relations: Basic and analytical relations between words, independent of the context in which they are used, generated by logical rather than linguistic relationships.
Syntagmatic relations: Linear relationships between words that are established when words are combined into phrases.
Identification rules index: The paradigmatics (vocabulary) and syntagmatics (grammar) of the language.
Statement: A statement is a sentence of the natural language, but the reverse is not true.
Interphrase unity: A fragment united semantically and syntactically. The core of an interphrase unity is a statement that is not subordinated to any other statement and retains its sense when extracted from the context.
Blocks-fragments: A set of interphrase unities that ensure the integrity of the text through semantic and thematic links.
The module stores not the text content itself but its RIm. Searching the indexed content uses content analysis of information requests. An information request translated into the information retrieval language and supplemented with additional search data is called a search available (SA). The degree of detail with which the RIm presents the content's central theme or subject and related themes or subjects is the depth of indexing. Automating this process ensures its unification and frees part of the staff from the unproductive labour of indexing content. Content search is provided by a set of semantic tools: the information retrieval language and the methods of content indexing, querying and search. The basic semantic tool is the information retrieval language, a specialized artificial language designed to describe the central themes or subjects and the formal characteristics of the content, as well as to describe queries and searches. In practice, one language is used for indexing content and another for indexing information requests. Content formatting is the process of indexing, semantic analysis, determination of the content's basic definition, and conversion into XML format. The formatting of content is performed manually by a moderator or automatically by means of content analysis. During indexing, the text content is examined, and its central theme is determined and described in terms of the information retrieval language. Section titles in the content, as a rule, reveal the central theme and subject, but the content cannot always be identified by its name. Natural language itself is not used as an information retrieval language because of its numerous grammatical inclusions, lack of structuring, ambiguity and high redundancy,
which for the Ukrainian language amounts in particular to 75-80 %. Among the major elements of an information retrieval language (Table 2), synonyms and homonyms are not used because of their semantic ambiguity.
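As an illustration of the retrieval-image idea (not the author's implementation), here is a minimal sketch in which the RIm is simply a set of index terms and the SA is a set of query terms; the stopword list and the scoring by overlap fraction are illustrative assumptions:

```python
STOPWORDS = {"the", "a", "of", "and", "in", "is"}

def build_rim(text):
    """Reduce text content to its retrieval image: a set of index terms."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def match(rim, query_terms):
    """Fraction of query terms found in the RIm (a crude relevance score)."""
    return len(rim & query_terms) / len(query_terms)

rim = build_rim("Generative grammars in the modelling of sentence syntax")
print(match(rim, {"generative", "grammars"}))  # 1.0
```

A threshold on this score corresponds to the acceptance ranges (0.7; 1] or (0.5; 1] mentioned in Table 1.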
The feasibility of using information retrieval languages depends on the purpose of the search tools, the technical means, and the automation of information procedures and management. When designing an information retrieval language, attention is paid to the following points: the nature of the industry or subject area for which the language is being developed; the features of the texts in the search array of content; and the nature of the information needs of the users of electronic content commerce.
Research results analysis
Text content (an article, comment, book, etc.) contains a large amount of natural-language data, some of which is abstract. A text is considered as a sequence of sign fragments unified by content, whose basic properties are informational, structural and communicative coherence and integrity, reflecting the content-related and structural nature of the text. The method of text-content processing is linguistic analysis of the content (e.g. comments, forums, articles, etc.). The text-content elaboration process divides the content into tokens using finite automata for the linguistic analysis of natural-language texts (Fig. 1). As a functional, semantic and structural unity, text content has rules of construction and exhibits patterns of meaningful and formal connections between its constituent units. The connectivity of text content is shown by external structural indicators and the formal dependence of the text components, while the integrity of text content shows through the thematic, conceptual and modal dependence of the textual information. The integrity of text content leads to the meaningful and communicative organization of the text; its connectivity leads to the formal, structural organization of the textual information. Let us consider a non-contracting grammar G with linguistic units, and let p be the number of terminal linguistic units of this grammar. The language L(G) is an easily recognizable set, by an algorithm (Alg. 1) that builds all constructions in the grammar starting from the initial symbol S. The number of production applications in a derivation is its length.
Fig. 1. Block diagram of the linguistic analysis of text content in content formation
Algorithm 1. Recognition of the language L(G)
Phase 1: Alternately apply the productions of the grammar, starting from S.
Phase 2: Check whether the built string equals x.
Step 1: If yes, go to Phase 3.
Step 2: If not, go to Phase 1.
Phase 3: Output the derivation of x from S.
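Algorithm 1 can be made terminating for a non-contracting grammar by pruning sentential forms longer than the target string, which is the property exploited below. A minimal breadth-first sketch, under the simplifying assumptions that symbols are single characters and uppercase letters are non-terminals:

```python
from collections import deque

def recognizes(productions, start, x):
    """Breadth-first search over sentential forms of a non-contracting grammar."""
    seen, queue = {start}, deque([start])
    while queue:
        s = queue.popleft()
        if s == x:
            return True
        for lhs, rhs in productions:
            i = s.find(lhs)
            while i != -1:
                t = s[:i] + rhs + s[i + len(lhs):]
                # non-contracting: forms longer than x can never derive x
                if len(t) <= len(x) and t not in seen:
                    seen.add(t)
                    queue.append(t)
                i = s.find(lhs, i + 1)
    return False

P = [("S", "aSb"), ("S", "ab")]      # generates a^n b^n
print(recognizes(P, "S", "aaabbb"))  # True
print(recognizes(P, "S", "aab"))     # False
```

The pruning bound |t| ≤ |x| is exactly why non-contracting grammars yield recursive languages.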
In this algorithm the derivation process can be infinite. To avoid this, a finite set of derivations is defined (Alg. 2).
Algorithm 2. Recognition of the language L(G) using the set of derivations M.
Phase 1: Analysis of strings x of length n derived from the initial symbol S of the grammar G.
Step 1: Calculate the number of derivation sequences from S to x (no string is repeated). Since the grammar G is non-contracting, none of the strings in such a sequence is longer than x (all have length ≤ n). The number of distinct strings of length ≤ n over p lexical units is

  p^n + p^(n-1) + p^(n-2) + ... + p^2 + p^1 + p^0 = (p^(n+1) - 1)/(p - 1) < p^(n+1)

for p > 1 (the number of strings of length n over p lexical units equals p^n, of length n-1 equals p^(n-1), and so on; the number of strings of 0 symbols equals p^0 = 1). With P = p^(n+1) distinct strings, the number of such sequences is no more than

  P! + C_P^1 (P-1)! + C_P^2 (P-2)! + ... + C_P^(P-2) 2! + C_P^(P-1) 1!

Here the sum consists of P summands; its k-th summand equals C_P^k (P-k)! = P!/(k!(P-k)!) (P-k)! = P!/k! ≤ P!, where C_P^k = P!/(k!(P-k)!). The total sum is therefore no more than P! P < (P+1)! = (p^(n+1) + 1)! < (p^(n+2))!.
Step 2: Generate the derivation sequences from S to x. From at most (p^(n+2))! derivation sequences, the set M is obtained.
Phase 2: Construction of the derivation set M for an arbitrary string x.
Phase 3: Check the resulting finite set of derivations M.
Step 1: Find an appropriate derivation in the set M or prove that no such derivation exists. If a derivation does not end in the string x, go to Step 3.
Step 2: If a derivation ends in the string x, go to Phase 4.
Step 3: If the set of derivations is exhausted, go to Phase 5; otherwise return to Step 1.
Phase 4: Formulate the positive answer: x is derivable from S. Go to Phase 6.
Phase 5: Formulate the negative answer: x is not derivable from S.
Phase 6: Display the results.
The number of steps needed to form the set M and find an appropriate derivation does not exceed (p^(n+2))!, which is enormous for natural languages. This makes the algorithm extremely resource-intensive. It is therefore necessary to choose sets for which the number of recognition steps grows moderately with the length of the string, and to identify a rather narrow class of recursive sets. Assuming that the set of expressions is infinite, their processing rules are then more homogeneous, which can reveal significant patterns of derivation.
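The counting bound used above can be checked numerically; a small sketch (the parameter values are arbitrary examples):

```python
def count_strings(p, n):
    """Number of distinct strings of length at most n over an alphabet of p symbols."""
    return sum(p**k for k in range(n + 1))

p, n = 3, 4
exact = count_strings(p, n)
closed_form = (p**(n + 1) - 1) // (p - 1)          # geometric-series closed form
print(exact, closed_form, exact < p**(n + 1))      # 121 121 True
```

The direct sum and the closed form agree, and both stay strictly below p^(n+1), as the derivation requires.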
For a more precise derivation of a string x from S, an additional constraint must be added: in each production X → Y, the left side X has the form Z1CZ2 (C a lexical unit) and the right side Y has the form Z1WZ2 (W a non-empty string). Then at every derivation step only one lexical unit may be replaced. For any non-contracting grammar an equivalent context-sensitive grammar can be constructed; for example,

  P1 = {AB → BA} is equivalent to P2 = {AB → 1B, 1B → 12, 12 → B2, B2 → BA}.

Consistent use of these rules is equivalent to using the rule AB → BA, and replacing the latter by them does not lead to extra contractions, since the lexical units 1 and 2 are new. In the grammar G, a rule replaces only one lexical unit (C), but the left part of a production (X) need not consist of that lexical unit alone: other lexical units (the context) may surround C, i.e. X = Z1CZ2. The production Z1CZ2 → Z1WZ2 then permits replacing C by W only in the context Z1…Z2 and without changing its location.
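That the four productions of P2 reproduce the effect of AB → BA can be verified directly; a minimal sketch:

```python
P2 = [("AB", "1B"), ("1B", "12"), ("12", "B2"), ("B2", "BA")]

def apply_first(s, lhs, rhs):
    """Replace the first occurrence of lhs in s by rhs (raises if lhs is absent)."""
    i = s.index(lhs)
    return s[:i] + rhs + s[i + len(lhs):]

s = "xABy"
for lhs, rhs in P2:      # apply the four rules in order
    s = apply_first(s, lhs, rhs)
print(s)  # xBAy -- the net effect of AB -> BA
```

Because the intermediate symbols 1 and 2 occur nowhere else, no unintended contractions can appear along the way.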
In the linguistic aspect the grammar has the following important property. The terminal symbols are interpreted as word forms of the natural language; the auxiliary symbols as syntactic categories (e.g. V verb, S noun, A adjective, Ṽ verb group, S̃ noun group); the initial symbol as R (sentence); and the derived terminal strings as correct sentences of the language. Unrestricted grammars of type 0 are a special case of the general concept of generative grammar, and they are certainly adequate for describing any natural language in full. Every natural language (as a set of correct sentences) is an easily recognizable set, which means that a fairly simple algorithm exists for recognizing the correctness of phrases.
If a natural language is recognized by an algorithm with the specified restriction on the amount of memory, then a generative grammar for it exists in which any derived terminal string of length n has a derivation in which no intermediate string exceeds length Kn (K a constant). Such a grammar is a grammar with bounded stretching, whose space (capacity) signalling function is at most linear. For any grammar with bounded stretching an equivalent grammar G0 can be built. It is able to describe the set of correct sentences of any natural language, i.e. to generate all correct phrases of the language while generating no false ones. Both constructions usually presented as examples of the unsuitability of context-free grammars are easily described by a G0 grammar. The disadvantages of the generative-grammar method can be described in three points.
1) It cannot naturally describe phrases with discontinuous constituents.
2) The grammar G0 contains only rules for forming linguistic expressions such as word forms or phrases; it specifies the correct expressions in contrast to the incorrect ones.
3) The grammar G0 builds sentences with exactly one fixed word order (the one the sentence has in its final form). The generated sentence is matched with a syntactic structure in the form of an ordered tree, i.e. a tree whose nodes carry, besides the subordination relation defined by the tree, a relation of linear order (left to right). Thus the syntactic structure of a G0 grammar does not separate two relationships that are absolutely different in nature, though interconnected: syntactic subordination and linear position. It is, however, possible to describe the syntactic structure showing only the relation of syntactic subordination; the relation of linear order describes not the structure but the phrase itself.
The word order depends on the syntactic structure: it is determined from the structure and is thus, in relation to it, somewhat derivative and secondary. It is therefore expedient to modify the concept of generative grammar so that the left and right parts of the substitution rules are not linearly ordered strings but, for example, trees (with no linear ordering) depicting the syntactic relations [3-5]. The rules then rewrite one tree fragment into another: a node with subordinates attached by labelled syntactic relations is replaced by a larger tree fragment over the same labels. (The original figure presents several such tree-rewriting rules; the labelled lines represent syntactic relations of different types, and the letters A, B, C, ... represent syntactic categories.) NB: the relative position of symbols on one level of subordination plays no role and is arbitrary in this scheme; B <-x- A -y-> C means the same thing as C <-y- A -x-> B [3-5, 9-17].
As a result, a calculus of the syntactic structures (not of the phrases) of the language is obtained; this calculus is part of the generative grammar [3-5]. The rest of the grammar is a calculus that specifies, for any given syntactic structure, all its possible linear word sequences (taking into account any other factors, e.g. the mandatory accounting of logical emphasis in the Ukrainian language, etc.). The problem of discontinuous constituents is then removed [3-5]. From a sentence derived in a regular grammar one cannot obtain a natural representation of the immediate-constituents structure of that sentence. That is, regular grammars, like all immediate-constituent grammars, provide some constituent structure, but these constituents are usually purely formal. The analysis explores the multilevel structure of textual content: a linear sequence of symbols; a linear sequence of morphological structures; a linear sequence of sentences; and a network of interconnected unities (Alg. 3).
Algorithm 3. The linguistic analysis of the textual content.
Stage 1. Grammatical analysis of the textual content.
Step 1. Separation of the commercial textual content into sentences and paragraphs.
Step 2. Separation of symbol strings into words.
Step 3. Extraction of digits, numbers, dates, set phrases and abbreviations.
Step 4. Removal of non-text symbols.
Step 5. Formation and analysis of a linear sequence of words with special characters.
50
Lviv Polytechnic National University Institutional Repository http://ena.lp.edu.ua
Stage 2. Morphological analysis of the textual content.
Step 1. Obtaining stems (word forms with their endings cut off).
Step 2. Assigning to each word form the values of its grammatical categories (the set of grammatical meanings: gender, case, declension, etc.).
Step 3. Formation of a linear sequence of morphological structures.
Stage 3. Syntactic analysis (parsing) of the textual content.
Stage 4. Semantic analysis of the textual content.
Step 1. Words are correlated with semantic classes from the dictionary.
Step 2. Selection of the morphological-semantic alternatives needed for the sentence.
Step 3. Linking words into a single structure.
Step 4. Formation of an ordered set of superposition records from basic lexical functions and semantic classes. The accuracy of the result is determined by the completeness and correctness of the dictionary.
Stage 5. Referential analysis to form inter-phrasal unities.
Step 1. Contextual analysis of the content. It resolves local references (this, that, it) and selects an expression as the unity core.
Step 2. Thematic analysis. Separating expressions by topic identifies the thematic structure, using, for example, content categorization and digest formation.
Step 3. Determination of regular frequency, synonyms and re-nomination of keywords; reference identification, i.e. relating words to the image of the object; detection of implications based on situational relations.
Stage 6. Structural analysis of the text. Its prerequisite is a high degree of coincidence among the notions of unity, discourse unit, sentence of the semantic language, expression and elementary discourse unit.
Step 1. Identification of the basic set of rhetorical relations between content unities.
Step 2. Construction of a nonlinear network of unities. The openness of the set of links implies its enlargement and adaptation for the analysis of the textual structure.
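The six stages of Algorithm 3 can be read as a sequential pipeline. The sketch below is a skeleton under that reading: each stage is a deliberately naive stub (the real grammatical, morphological, syntactic, semantic, referential and structural analyzers are assumed, not implemented), and the function names are invented for illustration.

```python
# Minimal pipeline sketch of Algorithm 3; each stage is a stub that
# records its name, standing in for the real analyzer described above.

def grapheme_stage(text):
    # Stage 1: split into words, drop non-text symbols (naively)
    words = [w.strip(".,;") for w in text.split() if any(c.isalnum() for c in w)]
    return {"words": words, "trace": ["grammatical"]}

def morphological_stage(state):
    # Stage 2: naive stemming stands in for real morphological analysis
    state["stems"] = [w.lower().rstrip("s") for w in state["words"]]
    state["trace"].append("morphological")
    return state

def later_stages(state):
    # Stages 3-6 (syntactic, semantic, referential, structural) as stubs
    for name in ("syntactic", "semantic", "referential", "structural"):
        state["trace"].append(name)
    return state

result = later_stages(morphological_stage(grapheme_stage("Cats chase mice.")))
```

The point is only the staged data flow: each stage enriches a shared state that the next stage consumes.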
The text realizes structurally organized activity through its subject, object, process, purpose, means and results, which appear in content-related, structural, functional and communicative criteria and parameters. The units of the internal organization of the text structure are the alphabet, vocabulary (paradigmatics), grammar (syntagmatics), paradigms, paradigmatic relations, syntagmatic relations, identification rules, expressions, inter-phrasal unities, fragments and blocks. On the compositional level, sentences, paragraphs, sections, chapters, subchapters, pages, etc. are isolated; these (except the sentence) are only indirectly related to the internal structure and are therefore not considered.
Content analysis for compliance with thematic requests is C_Ct = Categorization(KeyWords(C, U_K), U_Ct), where KeyWords(C, U_K) is the keyword identification operator, Categorization is the operator of content categorization according to the identified keywords, U_K is the set of keyword identification conditions, U_Ct is the set of categorization conditions, and C_Ct is the set of content relevant to the rubrics. The digest set C_D is formed by the dependence C_D = BuDigest(C_Ct, U_D), where BuDigest is the digest-forming operator, U_D is the set of conditions for digest formation, and C_Ct is the set of content relevant to the rubrics. Terms are searched for with the help of a database (of terms/morphemes and structural parts of speech) and defined rules of text analysis. Parsers operate in two stages: identifying the lexemes of the content and creating a parsing tree (Alg. 4).
Algorithm 4. Parser of textual commercial content.
Stage 1. Identification of the lexemes of content C from the set V.
Step 1. Definition of term strings over V as a sentence.
Step 2. Identification of noun groups using the stem dictionary V′.
Step 3. Identification of verb groups using the stem dictionary V′.
Stage 2. Creation of a parsing tree from left to right. Each derivation step is the expansion of one symbol of the previous string, or its replacement by others, while the remaining symbols are rewritten without change. A constituent tree, or syntactic structure, is obtained if, in the process of expansion, replacement or rewriting, the symbols (parents) are connected by lines directly with the symbols that result from the expansion, replacement or rewriting (descendants).
Step 1. Expansion of the noun group; expansion of the verb group.
Step 2. Realization of syntactic categories as word forms.
Stage 3. Determination of the keyword set.
Step 1. Determination of Noun terms (nouns, noun word combinations, or adjective-plus-noun combinations) among the word set of the commercial textual content.
Step 2. Calculation of the Unicity (uniqueness) value for Noun terms.
Step 3. Calculation of the NumbSymb value (the number of characters without spaces) for Noun terms with Unicity ≥ 80.
Step 4. Calculation of the UseFrequency value (the frequency of keyword use). For terms with NumbSymb ≤ 2000, UseFrequency lies within (6; 8] %; for terms with NumbSymb ≥ 3000, within [2; 4) %; for terms with 2000 < NumbSymb < 3000, within [4; 6] %.
Step 5. Calculation of the values BUseFrequency (the frequency of keyword use at the beginning of the text), IUseFrequency (in the middle of the text) and EUseFrequency (at the end of the text).
Step 6. Comparison of the values BUseFrequency, IUseFrequency and EUseFrequency to define priorities. Keywords with higher BUseFrequency values have higher priority than keywords with higher IUseFrequency values.
Step 7. Sorting of keywords according to their priorities.
Stage 4. Filling the database of search images for the content, i.e. the attributes KeyWords (keywords), Unicity (keyword uniqueness ≥ 80), Noun (term), NumbSymb (the number of characters without spaces), UseFrequency (the frequency of keyword use), BUseFrequency (the frequency of keyword use at the beginning of the text), IUseFrequency (in the middle of the text) and EUseFrequency (at the end of the text).
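Steps 4-6 of Stage 3 can be read as a band check plus a positional ranking rule. A minimal sketch under that reading follows; the band boundaries come from the text, while the function names and sample data are assumptions.

```python
# Sketch of Algorithm 4, Stage 3, steps 4-6: a term's admissible
# use-frequency band depends on its length (NumbSymb), and the
# B/I/E frequencies rank terms (beginning > middle > end).

def frequency_band(numb_symb):
    """Admissible UseFrequency interval in percent for a term,
    given its character count, per the text."""
    if numb_symb <= 2000:
        return (6.0, 8.0)     # (6; 8] %
    if numb_symb >= 3000:
        return (2.0, 4.0)     # [2; 4) %
    return (4.0, 6.0)         # [4; 6] %

def priority(b_use, i_use, e_use):
    """Beginning-of-text frequency outranks middle, which outranks
    end, as step 6 prescribes (lexicographic comparison)."""
    return (b_use, i_use, e_use)

# (term, BUseFrequency, IUseFrequency, EUseFrequency) - invented data
terms = [("grammar", 5.0, 1.0, 1.0), ("content", 3.0, 4.0, 2.0)]
ranked = sorted(terms, key=lambda t: priority(*t[1:]), reverse=True)
```

The lexicographic tuple makes the priority rule explicit: a higher BUseFrequency wins regardless of the middle and end frequencies.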
Based on the rules of the generative grammar, term correction is performed according to the rules of the term's use in context. The sentence defines the limits of action of punctuation marks and links. The semantics of the text is determined by the communicative task of information transfer. The textual structure is determined by the internal organization of textual units and the regularities of their relationships. During parsing, the text is transformed into a data structure (e.g. a tree that corresponds to the syntactic structure of the input sequence and is best suited for further processing). After analyzing a textual block and a term, a new term is synthesized as a keyword of the content topic, using the base of terms and their morphemes. Next, terms for a new keyword are synthesized using the base of structural parts of speech. The principle of keyword (term) detection in content is based on Zipf's law: it reduces to choosing words with an average frequency of occurrence. The most frequent words are ignored by means of stop-dictionaries, and rare words are not included. Accordingly, the meaningful analysis of the content corresponds to the process of extracting grammatical data from the word by grapheme analysis and correcting the results of morphological analysis through the grammatical context analysis of linguistic units (Alg. 5).
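The Zipf-based keyword selection just described — discard the most frequent words (via a stop-dictionary) and the rarest words, keep the middle band — can be sketched as follows. The stop-word list and the minimum-count threshold are illustrative assumptions, not values from the text.

```python
from collections import Counter

# Sketch of mid-frequency keyword selection per Zipf's law: drop
# stop words (most frequent) and hapaxes (rarest), keep the middle.

STOP_WORDS = {"the", "a", "of", "and", "is"}   # stand-in stop-dictionary

def zipf_keywords(tokens, min_count=2):
    counts = Counter(t.lower() for t in tokens)
    return sorted(w for w, c in counts.items()
                  if c >= min_count and w not in STOP_WORDS)

text = "the grammar of the content and the grammar model model rare"
keywords = zipf_keywords(text.split())
```

Here "the" is filtered as a stop word despite its high count, and "rare" is filtered as a hapax, leaving only the mid-frequency terms.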
Algorithm 5. The categorization of textual commercial content.
Stage 1. Division of the commercial content into blocks.
Step 1. Presentation of a block to the input of the construction of the tree of commercial content blocks.
Step 2. Creation of a new block in the block table.
Step 3. Accumulation of newline characters.
Step 4. Checking for a full stop before the newline character. If there is one, go to step 5; if not, save the sequence in the table, start parsing the new block, and go to step 3.
Step 5. Checking for the end of the text. If the end of the text is reached, go to step 6; if not, save the accumulated sequence in the table, start parsing the new block, and go to step 2.
Step 6. Obtaining the block tree at the output as a table.
Stage 2. Division of blocks into sentences with preservation of structure.
Step 1. The input is the block table. Creation of the sentence table with a link through the field ID_section of n-to-1 type to the block table.
Step 2. Creation of a new sentence in the sentence table.
Step 3. Accumulation of symbols up to a full stop, semicolon or newline character.
Step 4. Checking for separators. If a separator exists, go to step 5; if not, save the sequence in the table, start parsing a new sentence, and go to step 2.
Step 5. Checking for the end of the text block. If the end of the block is reached, go to step 6; if not, save the sequence in the table, start parsing new sentences, and go to step 2.
Step 6. Obtaining the sentence tree at the output as a table.
Step 7. Checking for the end of the text. If the end of the text is reached, go to step 8; if not, start parsing a new block and go to step 1.
Step 8. Obtaining the sentence tree at the output in the form of tables.
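Stages 1-2 above amount to splitting the content into blocks, splitting blocks into sentences at full stops, semicolons and newlines, and recording a parent link from each sentence row to its block row. A tabular sketch of that reading follows; the field names ID_section and Text follow the text, while the splitting heuristics (blank-line block boundaries, naive separators) are simplified assumptions.

```python
import re

# Sketch of Algorithm 5, stages 1-2: split content into blocks at
# blank lines, then blocks into sentences, keeping an ID_section
# link from each sentence row back to its block row (n-to-1).

def split_blocks(text):
    return [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]

def split_sentences(block):
    # a full stop, semicolon or newline ends a sentence, per step 3
    return [s.strip() for s in re.split(r"[.;\n]", block) if s.strip()]

def build_tables(text):
    blocks, sentences = [], []
    for block_id, block in enumerate(split_blocks(text)):
        blocks.append({"ID_section": block_id, "Text": block})
        for s in split_sentences(block):
            sentences.append({"ID_section": block_id, "Text": s})
    return blocks, sentences

blocks, sentences = build_tables("First. Second;\nThird.\n\nFourth block.")
```

Both tables are plain lists of dictionaries here; in the described system they would be database tables joined on ID_section.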
Stage 3. Division of sentences into lexemes with an indication of their belonging to sentences.
Step 1. Formation of the lexeme table based on the sentence table, with the fields ID_lexeme (unique identifier), ID_sentence (the code of the sentence containing the lexeme), Lexemes_number (the number of lexemes in the sentence) and Text (the lexeme text).
Step 2. Presentation of a sentence from the sentence table to the input for parsing into lexemes.
Step 3. Creation of a new lexeme in the lexeme table.
Step 4. Accumulation of symbols up to a full stop, space or end of sentence, and saving them in the lexeme table.
Step 5. Checking for the end of the sentence. If yes, go to step 6; if not, save the accumulated sequence in the table, start parsing a new lexeme, and go to step 3.
Step 6. Conducting syntactic parsing based on the data obtained at the output (Alg. 4).
Step 7. Conducting morphological analysis based on the data obtained at the output.
Stage 4. Determination of topics for the commercial content.
Step 1. Construction of the hierarchical structure of properties for each lexical unit of the text, including grammatical and semantic information.
Step 2. Formation of the lexicon with a hierarchical organization of property types, where each descendant type inherits and overrides the properties of its ancestor.
Step 3. Unification as the basic mechanism for constructing syntactic structures.
Step 4. Identification of the KeyWords set for the commercial content (Alg. 4).
Step 5. Definition of the values TKeyWords (the thematic keywords in the KeyWords set of the commercial content), Topic (the theme of the commercial content) and Category (the commercial content category).
Step 6. Definition of the values FKeyWords (the frequency of keyword use in the textual commercial content) and QuantitativeryTKey (the frequency of thematic keyword use in the textual commercial content).
Step 7. Definition of the value Comparison (the comparison of keyword use across various topics). Calculation of the values CofKeyWords (the coefficient of thematic content keywords), Static (the coefficient of statistical term importance) and Addterm (the coefficient of additional term availability). Comparison of the content keyword set with the key concepts of the topics. If there is a match, go to step 9; if not, go to step 8.
Step 8. Formation of a new category with the set of key concepts of the analyzed content.
Step 9. Assignment of the analyzed commercial content to the designated section.
Step 10. Calculation of the value Location (the coefficient of content location in the thematic section).
Stage 5. Filling the base of search images with the attributes Topic (the theme of the content), Category (the content category), Location (the coefficient of content location in the thematic section), CofKeyWords (the coefficient of thematic keywords of the textual content), Static (the coefficient of statistical significance of terms),
Addterm (the coefficient of additional term availability), TKeyWords (the thematic keywords), FKeyWords (the frequency of keyword use), Comparison (the comparison of keyword use across different themes) and QuantitativeryTKey (the frequency of thematic keyword use in the text of the commercial content).
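The operators C_Ct = Categorization(KeyWords(C, U_K), U_Ct) and C_D = BuDigest(C_Ct, U_D) introduced earlier can be given a toy reading: extract keywords under conditions U_K, assign rubrics under conditions U_Ct, then build digests under conditions U_D. The sketch below is hypothetical; the rubric table, the minimum term length and the digest length cap are invented for illustration.

```python
# Toy reading of the categorization/digest operators from the text:
# KeyWords(C, U_K) -> Categorization(..., U_Ct) -> BuDigest(..., U_D).

RUBRICS = {"linguistics": {"grammar", "syntax"},      # U_Ct: rubric -> key concepts
           "transport": {"traffic", "network"}}

def key_words(content, min_len=5):                    # U_K: minimal term length
    return {w.lower().strip(".,") for w in content.split()
            if len(w.strip(".,")) >= min_len}

def categorization(keywords, rubrics):
    # a rubric is relevant if it shares at least one key concept
    return sorted(r for r, concepts in rubrics.items() if concepts & keywords)

def bu_digest(content, rubrics_found, max_chars=40):  # U_D: digest length cap
    return {r: content[:max_chars] for r in rubrics_found}

content = "Generative grammar models syntax of sentences."
c_ct = categorization(key_words(content), RUBRICS)
c_d = bu_digest(content, c_ct)
```

The condition sets U_K, U_Ct and U_D appear here as plain parameters; in the described system they would be configurable rule sets.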
The process of categorization through the automatic indexing component of commercial content is divided into sequential blocks: morphological analysis, syntactic analysis, semantic-syntactic analysis of linguistic structures, and meaningful variation of writing in the textual content.
Text construction is determined by the theme of the expressed information, the terms of communication, the task of the message and the presentation style. The stylistic characteristics of content are related to its grammatical, semantic and compositional structure; they depend on the individuality of the author and are subordinate to the thematic and stylistic dominants of the text. The main stages of determining the morphological characteristics of textual units are: definition of the grammatical classes of words (parts of speech) and of the principles of their classification; separation of the part of word semantics that constitutes the unit of morphology, and justification of the set of morphological categories and their nature; description of the set of formal means attached to parts of speech and their morphological categories. The following methods of expressing grammatical meaning are used: synthetic, analytical, and analytical-synthetic.
Grammatical meanings are generalized through characteristics of the same type and are subject to the separation of partial meanings. The concept of a grammatical category is used to designate classes of similar grammatical meanings. Morphological meanings include the categories of gender, number, case, person, tense, mood, class and aspect, united in paradigms for text classification. The object of morphological analysis is the structure of words, inflectional forms, and the ways of expressing grammatical meanings. The morphological features of textual units are research tools for the links between the lexicon, grammar, their use in speech, paradigms (case forms of words) and syntagmatics (linear relationships of words, expressions). The implementation of the automatic coding of words in a text (i.e. assigning them codes of grammatical classes) is associated with grammatical classification. Morphological analysis includes the following steps: localization of stems in word forms; searching for the stem in the stem dictionary; comparison of the word-form structure with the data in the dictionaries of stems, roots, prefixes, suffixes and inflections. The analysis identifies the meanings of words and the syntagmatic relations between content words. The analysis tools are: dictionaries of stems, inflections and homonyms; statistical and syntactic combinations of words for removing lexical homonymy; semantic analysis for nouns in non-prepositional constructions; tables for the semantic-syntactic combination of nouns/adjectives and components of prepositional structures; analysis algorithms for determining the sequence of checks and look-ups in the dictionaries and tables; a system for dividing the words of the text into inflection and stem; a thesaurus of equivalences for replacing equivalent words by one or several concept numbers that serve as content identifiers instead of word stems; a thesaurus as a hierarchy of concepts for finding generic/associated concepts for a given concept; and a dictionary service system.
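The morphological steps listed above (stem localization, dictionary look-up, comparison of the word-form structure with stem and inflection dictionaries) can be sketched with a toy dictionary pair; the entries and the longest-stem-first heuristic are invented for illustration.

```python
# Sketch of dictionary-based morphological analysis: localize the stem
# in a word form, look it up, and read off the inflection's meanings.
# The tiny dictionaries below are invented for illustration only.

STEMS = {"book": "noun", "walk": "verb"}
INFLECTIONS = {"": {}, "s": {"number": "plural"},
               "ed": {"tense": "past"}, "ing": {"aspect": "progressive"}}

def analyze(word_form):
    """Try the longest stem first; return (stem, class, grammatical info)
    or None if no stem + inflection decomposition is found."""
    for split in range(len(word_form), 0, -1):
        stem, ending = word_form[:split], word_form[split:]
        if stem in STEMS and ending in INFLECTIONS:
            return stem, STEMS[stem], INFLECTIONS[ending]
    return None

analysis = analyze("walked")
```

A real system would add the root/prefix/suffix dictionaries and homonymy resolution mentioned in the text; the sketch shows only the stem-plus-inflection decomposition step.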
Conclusion
Research into the application of mathematical methods to the analysis and synthesis of textual information in natural language is required for the development of mathematical algorithms and software for textual content processing. The theory of generative grammars (proposed by Noam Chomsky) models processes at the syntactic level of language. Structural elements of sentences describe syntactic constructions in textual content regardless of their meaning. The article shows the features of sentence synthesis in different languages using generative grammars, and considers the influence of the norms and rules of a language on the course of grammar construction. The use of generative grammars has great potential in the development and creation of automated systems for textual content processing, in the linguistic support of computer linguistic systems, etc. In natural languages there are situations in which phenomena that depend on context are described as independent of context (i.e. in terms of context-free grammars); in this case the description is complicated by the formation of new categories and rules. The article describes the features of introducing new restrictions on classes of data through the introduction of new grammar rules. If the number of symbols on the right-hand side of each rule is not lower than on the left-hand side,
then a non-contracting grammar is obtained. If, in addition, only one symbol is replaced at a time, a context-sensitive grammar is obtained; if the left-hand side of each rule contains only one symbol, a context-free grammar is obtained. No further natural constraints can be applied to the left-hand sides of the rules. Applying the theory of generative grammars to problems of applied and computational linguistics at the level of morphology and syntax makes it possible to create systems for speech and text synthesis, to create practical morphology textbooks and inflection tables, to compile lists of morphemes (affixes, roots), and to determine the productivity and frequency of morphemes and the frequency with which different grammatical categories (gender, case, number, etc.) are realized in texts of specific languages. The developed models based on generative grammars are used in linguistically supported computer systems designed for the analytical and synthetic processing of textual content in information retrieval systems, etc. It is useful to introduce ever new restrictions on such grammars, obtaining ever narrower classes of them. When describing a complex range of phenomena, the set of descriptive means is deliberately limited, since means stated only in general terms are obviously insufficient. Research begins with minimal means; whenever they prove insufficient, new means are introduced gradually, in small portions. In this way it is possible to determine exactly which means can or cannot be used in describing a phenomenon, and thereby to understand its nature.
By introducing a new restriction (the right-hand side of any production contains no more than two lexical units), a new class of context-free grammars is formed, in which the syntactic structure of expressions, when a tree is constructed, has no more than two branches at each node. That is, an expression is always divided into two parts (e.g. nominal group + verb group), each of these halves is again divided in half, and so on. But binary representations of natural language expressions are not always satisfactory and natural from the point of view of meaningful linguistic interpretation. The criteria for selecting the appropriate description lie beyond the theory: the choice is made on the basis of considerations related to the specific objectives and nature of the task. Since the number of lexical units on the right-hand side of a production is already minimal, restrictions can instead be imposed on the nature of the replacing lexical units (e.g. the right-hand side of each production consists of one lexical unit or has the form bB, where b is a terminal lexical unit and B is a syntactic category). This restriction specifies the string derivation but requires considerable computational power.
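A grammar whose productions all have the form A → b or A → bB is exactly a right-linear (regular) grammar, and generation then emits one terminal per derivation step. A minimal sketch of such a derivation follows; the sample rules and the seeded random choice are invented for illustration.

```python
import random

# Sketch of derivation in a right-linear (regular) grammar: every
# production is A -> b or A -> b B, so each step emits one terminal
# and continues from at most one nonterminal.

RULES = {
    "S": [("the", "N")],
    "N": [("cat", "V"), ("dog", "V")],
    "V": [("sleeps", None)],          # terminal-only production A -> b
}

def generate(start="S", rng=random.Random(0)):
    out, category = [], start
    while category is not None:
        terminal, category = rng.choice(RULES[category])
        out.append(terminal)
    return " ".join(out)

sentence = generate()
```

Because each rule carries at most one nonterminal, the derivation is a simple left-to-right chain, which is precisely why such grammars cannot express the nested constituent structure discussed above.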
The paper considers the known methods and approaches to solving the problem of automatic processing of textual content, and highlights the advantages and disadvantages of existing approaches and results in the field of the syntactic aspects of computational linguistics. The general conceptual framework for modeling the inflectional processes of creating text arrays is formed. A syntactic model and an inflectional classification of the lexical structure of sentences are proposed. Lexicographical rules of the syntactic type for the automated processing of such sentences are also developed. The proposed technique achieves higher reliability parameters in comparison with known analogues. It also demonstrates high efficiency in applied use for the construction of new linguistic information technologies and for the study of inflectional effects in natural language. The work has practical value, since the proposed models and rules make it possible to effectively organize the process of creating lexicographical systems for the syntactic processing of textual content. The commercial content formation model is implemented in the form of content-monitoring complexes for content collection from various data sources and provides a content database according to the users' information needs. As a result of content harvesting and primary processing, the content is converted to a single format, classified according to the categories, and assigned tags with keywords.
1. Berko A. Electronic content commerce systems / A. Berko, V. Vysotska, V. Pasichnyk. – L.: NULP,
2009. – 612 p. 2. Большакова Е. И. Автоматическая обработка текстов на естественном языке и
компьютерная лингвистика: учеб. пособие / Е. И. Большакова, Э. С. Клышинский, Д. В. Ландэ,
А. А. Носков, О. В. Пескова, Е. В. Ягунова. – М.: МИЭМ, 2011. – 272 с. 3. Gladky A. Sintaksicheskie
struktury estestvennogo yazyka v avtomatizirovannyh sistemah obscheniya / A. Gladky. – M.: Nauka Publ.,
1985. – 144 p. 4. Gladky A. Elementy matematicheskoy lingvistiki / A. Gladky, I. Melchuk. – M.: Nauka
Publ., 1969. – 192 p. 5. Gladky A. Formalnye grammatiki i yaziki / A. Gladky. – M.: Nauka Publ., 1973. –
368 p. 6. Lande D. Osnovy modelirovaniya i otsenki elektronnyh potokov / D. Lande, V. Furashev,
S. Braychevsky, O. Grigorev. – K.: Іnzhinіring Publ., 2006. – 348 p. 7. Lande D. Osnovy integratsii
informatsionnyh potokov / D. Lande. – Kiev: Іnzhinіring Publ., 2006. – 240 p. 8. Pasichnyk V.
Mathematical linguistic / V. Pasichnyk, Y. Scherbyna, V. Vysotska, T. Shestakevych. – L: Novy svit, 2012. –
359 p. 9. Chomsky N. Three models for the description of language / N. Chomsky // I.R.E. Trans. PGIT 2, 1956. – P. 113–124. 10. Chomsky N. On certain formal properties of grammars; A note on phrase structure grammars / N. Chomsky // Information and Control 2, 1959. – P. 137–167, 393–395. 11. Chomsky N. On the notion "Rule of Grammar" / N. Chomsky // Proc. Symp. Applied Math. 12, Amer. Math. Soc., 1961. 12. Chomsky N. Context-free grammars and pushdown storage / N. Chomsky // Quarterly Progress Report № 65, Research Laboratory of Electronics, M.I.T., 1962. 13. Chomsky N. Formal properties of grammars / N. Chomsky // Handbook of Mathematical Psychology, 2, ch. 12, Wiley, 1963. – P. 323–418. 14. Chomsky N. The logical basis of linguistic theory / N. Chomsky // Proc. IX-th Int. Cong. Linguists, 1962. 15. Chomsky N. Finite state languages / N. Chomsky, G.A. Miller // Information and Control 1, 1958. – P. 91–112. 16. Chomsky N. Introduction to the formal analysis of natural languages / N. Chomsky, G.A. Miller // Handbook of Mathematical Psychology 2, Ch. 12, Wiley, 1963. – P. 269–322. 17. Chomsky N. The algebraic theory of context-free languages / N. Chomsky, M.P. Schützenberger // Computer programming and formal systems. – Amsterdam: North-Holland, 1963. – P. 118–161.
UDC 681.3:656.1
V. Mazur
Lviv Polytechnic National University, CAD Department
IMPROVEMENT OF CITY TRAFFIC NETWORK BASED
ON AN ANALYSIS OF ITS FEATURES
УДОСКОНАЛЕННЯ ТРАНСПОРТНОЇ МЕРЕЖІ МІСТА
НА ОСНОВІ АНАЛІЗУ ЇЇ ОСОБЛИВОСТЕЙ
© Mazur V., 2014
The article deals with the features of the city transport network and proposes measures for its improvement.
Key words: city transport network, 3-D model, typical element.
Розглянуто особливості транспортної мережі міста та запропоновано заходи з її
вдосконалення.
Ключові слова: транспортна мережа міста, 3-D модель, типовий елемент.
Introduction
A further increase of the traffic flow intensity in conditions of the limited space and inadequate city
transport network leads to the exasperation of traffic problems. The solution to these problems is
particularly difficult in the central historical part of the cities of Western Ukraine, sated by a large number
of intersections and narrow streets. Dense housing system and historical value of the buildings virtually
eliminates the modernization of the road network on the basis of known standard approaches. Therefore,
the development of measures to improve the transport network should be based primarily on the
identification and consideration of its specific features. The development of specific methods and
approaches to improving transport networks of the ancient city, is given, in our opinion, insufficient
attention that causes the relevance of this work [1].
The objective of this work is the improvement of the transport network on the basis of its specific features. To achieve this goal, the following tasks are solved: identification of transport network models and determination of its critical cross-sections; development of measures to improve the transport