Computational Linguistic Text Processing – Lexicon, Grammar, Parsing and Anaphora Resolution

DELMONTE, Rodolfo
2008-01-01

Abstract

The topic of this book is the theoretical foundations, the implementation, and the results of a system for text analysis and understanding called GETARUN, developed at the University of Venice, Laboratory of Computational Linguistics, Department of Language Sciences. The main tenet of the theory supporting the construction of the system is that it is possible to reduce access to domain world knowledge by means of contextual reasoning, i.e. reasoning triggered independently by contextual or linguistic features of the text under analysis. In other words, it adopts what could be termed the Shallow Processing Hypothesis: access to WordNet is reduced and, whenever links are missing, substituted by inferences drawn from the hand-coded lexical and grammatical knowledge given to the system, which is worked out in a fully general manner. In exploring this possibility we make one fundamental assumption: the psychological processes needed for language analysis and understanding are controlled by a processing device which is completely separate from that of language generation, though the two share a common lexicon. In our approach there is no statistical processing, only algorithms based on symbolic rules, although we use FSA (finite state automata) to help tag disambiguation. The reason for this choice is twofold. The first is a practical one: statistical language models need linguistic resources, which are in turn very time-consuming and error-prone to produce. On a more general level, one needs to consider that highly sophisticated linguistic resources are always language- and genre-dependent, besides having to comply with requirements of statistical representativeness. No such limitations apply to symbolic algorithms, which on the contrary are more general and easily portable from one language to another. Differences in genre can also be easily accounted for by scaling the rules adequately. It is sensible to assume that when understanding a text a human reader or listener makes use of his or her encyclopaedic knowledge parsimoniously. Contextual reasoning is the only way in which a system for Natural Language Understanding should tap external knowledge of the domain. In other words, a system should be allowed to perform an inference on the basis of domain world knowledge when needed, and only then. In this way, the system can simulate actual human behaviour, in that access to extralinguistic knowledge is triggered by contextual factors independently present in the text and detected by the system itself. This is required only for implicit linguistic relations, as can happen with bridging descriptions, in order to cope with anaphora resolution phenomena. It is also our view that humans understand texts only when all the relevant information is supplied and available. Descriptive and narrative texts are usually self-explanatory (literary texts are not), in order to allow even naive readers to grasp their meaning. Note that we are not dealing here with spoken dialogues, where much of what is meant can be left unsaid or must be implicitly understood. In the best current systems for natural language, the linguistic components are kept separate from the knowledge representation, and work which could otherwise be done directly by the linguistic analysis is duplicated by the inferential mechanism.
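To make the Shallow Processing Hypothesis concrete, the following is a minimal sketch in Python, not the GETARUN implementation: all names and data structures are hypothetical. A WordNet-style hypernym table is consulted first, and a hand-coded lexical relation is used as a fallback inference only when the link is missing.

    # Hypothetical sketch of the WordNet-fallback idea described above.
    # None of these tables or names come from GETARUN itself.
    from typing import Optional

    # Toy stand-in for WordNet hypernym links (word -> set of hypernyms).
    WORDNET_HYPERNYMS = {
        "car": {"vehicle"},
        "vehicle": {"artifact"},
    }

    # Hand-coded lexical knowledge, used only when a WordNet link is missing.
    HAND_CODED_RELATIONS = {
        ("engine", "car"): "part_of",
        ("driver", "car"): "agent_of",
    }

    def is_hypernym(word: str, candidate: str) -> bool:
        """Follow the toy hypernym links transitively."""
        seen, frontier = set(), {word}
        while frontier:
            current = frontier.pop()
            if current in seen:
                continue
            seen.add(current)
            parents = WORDNET_HYPERNYMS.get(current, set())
            if candidate in parents:
                return True
            frontier |= parents
        return False

    def relate(word: str, antecedent: str) -> Optional[str]:
        """Prefer the WordNet-style link; fall back to hand-coded inference."""
        if is_hypernym(word, antecedent) or is_hypernym(antecedent, word):
            return "isa"
        return HAND_CODED_RELATIONS.get((word, antecedent))

    if __name__ == "__main__":
        print(relate("car", "vehicle"))   # isa (found via the hypernym table)
        print(relate("engine", "car"))    # part_of (hand-coded fallback)

The point of the sketch is only the control flow: the external taxonomic resource is consulted first, and the system's own lexical knowledge fills the gap when no link is found.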
The linguistic representation is usually mapped onto a logical representation, which is in turn fed into the knowledge representation of the domain in order to understand and validate a given utterance or query. We shall comment on and discuss some such systems in the book. The domain world model must thus be built in advance, usually in view of a given task the system is set out to perform. This modelling is domain- and task-limited, and generality can only be achieved from coherent lexical representations, as will be discussed in the book. In some of these systems, the main issue is how to make the two realms interact as early as possible, in order to take advantage of the inferential mechanism to reduce ambiguities present in the text or to allow for reasoning on linguistic data which could not otherwise be understood. We assume that an integration between linguistic information and knowledge of the world can be carried out at all levels of linguistic description, and that contextual reasoning can thus be performed on the fly rather than sequentially. This does not imply that external knowledge of the world is useless and should not be provided at all: it simply means that access to this knowledge must be filtered through the analysis of the linguistic content of surface forms and of the abstract representations of the utterances making up the text.
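As an illustration of how access to world knowledge can be gated by the linguistic analysis rather than consulted unconditionally, the hypothetical Python sketch below queries a toy knowledge table only when a definite noun phrase has no direct antecedent in the discourse, i.e. in the bridging-description case mentioned above. The data and function names are invented for illustration.

    # Hypothetical sketch: world knowledge is tapped only when the linguistic
    # analysis itself signals the need (a definite NP with no direct
    # antecedent), mirroring the contextual triggering described above.
    from typing import List, Optional

    # Toy domain knowledge: which discourse entities a head noun can bridge to.
    WORLD_KNOWLEDGE = {
        "engine": {"car", "truck"},   # an engine is part of a car or truck
        "waiter": {"restaurant"},     # a waiter is associated with a restaurant
    }

    def resolve_definite_np(head: str, discourse_entities: List[str]) -> Optional[str]:
        """Resolve 'the <head>' against the entities already in the discourse."""
        # 1. Direct (coreferential) antecedent: no external knowledge needed.
        if head in discourse_entities:
            return head
        # 2. Bridging description: only now consult the domain knowledge.
        anchors = WORLD_KNOWLEDGE.get(head, set())
        for entity in reversed(discourse_entities):  # prefer the most recent anchor
            if entity in anchors:
                return entity
        return None  # unresolved; a fuller system would accommodate a new entity

    if __name__ == "__main__":
        # "John entered a restaurant. The waiter greeted him."
        discourse = ["john", "restaurant"]
        print(resolve_definite_np("waiter", discourse))      # restaurant (bridging)
        print(resolve_definite_np("restaurant", discourse))  # restaurant (direct)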


Use this identifier to cite or link to this item: https://hdl.handle.net/10278/19380