
the dependency relationships between the words in a sentence. It processes the text one sentence at a time, and thus only needs a sentence splitter as a prerequisite. It works by identifying linguistic constructions and categories such as apposition, relative clauses, subjects and objects of verbs, and determiners, and how they relate to each other. Apposition is the construction where two noun phrases next to each other refer to the same thing, e.g., “my brother John,” or “Paris, the capital of France.” Relative clauses typically start with a relative pronoun (such as “who,” “which,” etc.) and modify a preceding noun, e.g., “who was wearing the hat” in the phrase “the man who was wearing the hat.”

      In contrast to dependency parsers, constituency parsers are based on the idea of constituency relations, and may draw on a number of different Constituency Grammar theories such as Phrase-Structure Grammars, Categorial Grammars and Lexical Functional Grammars, amongst others. The constituency relation is hierarchical and derives from the subject-predicate division of Latin and Greek grammars, where the basic clause structure is divided into the subject (noun phrase) and predicate (verb phrase). Each of these is then further subdivided at a more fine-grained level.

      A good example of a constituency parser is the Shift-Reduce Constituency Parser, which is part of the Stanford CoreNLP Tools (see footnote 8). Shift-and-reduce operations have long been used for dependency parsing with high speed and accuracy, but only more recently have they been used for constituency parsing. The Shift-Reduce parser aims to improve on older constituency parsers, which used chart-based algorithms (dynamic programming) to find the highest scoring parse and were accurate but very slow. The latest Shift-Reduce Constituency parser is faster than the previous Stanford parsers, while being more accurate than almost all of them.
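      The shift and reduce operations themselves can be illustrated with a minimal sketch. This is not the Stanford implementation, which chooses its actions using a trained model; the toy grammar in RULES, the greedy "reduce whenever possible" strategy, and the function names are all assumptions made for illustration.

```python
# Toy grammar (hypothetical): right-hand side of labels -> new label.
RULES = [
    (("DT", "NN"), "NP"),
    (("VBD", "NP"), "VP"),
    (("NP", "VP"), "S"),
]


def shift_reduce_parse(tagged_words):
    """Greedy shift-reduce parse of (word, tag) pairs; returns the final stack."""
    stack, buffer = [], list(tagged_words)
    while True:
        for rhs, lhs in RULES:
            n = len(rhs)
            top_labels = tuple(node[0] for node in stack[-n:])
            if len(stack) >= n and top_labels == rhs:
                # REDUCE: replace the top n nodes with one parent node.
                children = stack[-n:]
                del stack[-n:]
                stack.append((lhs, children))
                break
        else:
            # No rule matched the top of the stack.
            if not buffer:
                return stack
            # SHIFT: move the next word from the buffer onto the stack.
            word, tag = buffer.pop(0)
            stack.append((tag, word))


def bracket(node):
    """Render a tree node in bracketed notation."""
    label, body = node
    if isinstance(body, str):  # leaf: (tag, word)
        return f"({label} {body})"
    return f"({label} " + " ".join(bracket(child) for child in body) + ")"


sentence = [("the", "DT"), ("dog", "NN"), ("chased", "VBD"),
            ("the", "DT"), ("cat", "NN")]
result = shift_reduce_parse(sentence)
print(bracket(result[0]))
```

      A stack holding a single S node indicates a complete parse under this toy grammar. A greedy strategy like this one can commit to a wrong reduction on ambiguous input, which is one reason practical parsers score the candidate actions instead.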

      Figure 2.6 shows a parse tree generated using a dependency grammar, while Figure 2.7 shows one generated using a constituency grammar for the same sentence.

      Figure 2.6: Parse tree showing dependency relation.

      Figure 2.7: Parse tree showing constituency relation.
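      The difference between the two kinds of analysis can also be seen in how they might be stored as data. The following is a minimal sketch: the example sentence is hypothetical (it is not necessarily the one used in the figures), and the relation labels (det, nsubj, obj) and the tuple layouts are assumptions chosen for illustration rather than the output format of any particular parser.

```python
words = ["the", "dog", "chased", "the", "cat"]

# Dependency analysis: flat list of (head_index, relation, dependent_index).
# A head index of -1 marks the root of the sentence.
dependencies = [
    (1, "det", 0),     # the   <- dog
    (2, "nsubj", 1),   # dog   <- chased
    (-1, "root", 2),   # chased is the root
    (4, "det", 3),     # the   <- cat
    (2, "obj", 4),     # cat   <- chased
]

# Constituency analysis: nested (label, child, child, ...) tuples.
constituency = (
    "S",
    ("NP", ("DT", "the"), ("NN", "dog")),
    ("VP", ("VBD", "chased"),
           ("NP", ("DT", "the"), ("NN", "cat"))),
)


def dependents_of(head_index):
    """Words directly governed by the word at head_index."""
    return [words[dep] for head, _, dep in dependencies if head == head_index]


def leaves(tree):
    """Words of a constituency tree, read left to right."""
    _, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]
    found = []
    for child in children:
        found.extend(leaves(child))
    return found


print(dependents_of(2))      # direct dependents of "chased"
print(leaves(constituency))  # the words covered by the S node
```

      The dependency structure links words directly to one another, whereas the constituency structure introduces intermediate phrase nodes (NP, VP) that group words hierarchically.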

      The RASP statistical parser [18] is a domain-independent, robust parser for English. It comes with its own tokenizer, POS tagger, and morphological analyzer, and as with Minipar, requires the text to be already segmented into sentences. RASP is available under the LGPL license and can therefore also be used in commercial applications.

      The Stanford statistical parser [19] is a probabilistic parsing system. It provides either a dependency output or a phrase structure output. The latter can be viewed in its own GUI or through the user interface of GATE Developer. The Stanford parser comes with data files for parsing Arabic, Chinese, English, and German and is licensed under the GNU GPL.

      The SUPPLE parser is a bottom-up parser that can produce a semantic representation of sentences, called simplified quasilogical form (SQLF). It has the advantage of being very robust, since it can still produce partial syntactic and semantic results for fragments even when a full sentence parse cannot be determined. This makes it particularly applicable for deriving semantic features for the machine learning–based extraction of semantic relations from large volumes of real text.

      Parsing algorithms can be computationally expensive and, like many linguistic processing tools, tend to work best on text similar to that on which they have been trained. Because parsing is a much more difficult task than some of the lower-level processing tasks, such as tokenization and sentence splitting, performance is also typically much lower, and this can have knock-on effects on any subsequent processing modules such as Named Entity recognition and relation finding. Sometimes it is therefore better to sacrifice the increased knowledge provided by a parser for something more lightweight but reliable, such as a chunker, which performs a shallower kind of analysis. Chunkers, also sometimes called shallow parsers, recognize sequences of syntactically correlated words such as Noun Phrases, but unlike parsers, do not provide details of their internal structure or their role in the sentence.

      Tools for chunking can be subdivided into Noun Phrase (NP) Chunkers and Verb Phrase (VP) Chunkers. They vary less than parsing algorithms because the analysis is at a more coarse-grained level: they identify the relevant “chunks” of text but do not try to analyze them. However, they may differ in what they consider to be relevant for the chunk in question. For example, a simple Noun Phrase might consist of a consecutive string containing an optional determiner, zero or more adjectives, and one or more nouns, as shown in Figure 2.8. A more complex Noun Phrase might also include a Prepositional Phrase or Relative Clause modifying it. Some chunkers include such things as part of the Noun Phrase, as shown in Figure 2.9, while others do not (Figure 2.10). This kind of decision is highly dependent on what the chunks will be used for later. For example, if they are used as input for a term recognition tool, it should be considered whether a term that contains a Prepositional Phrase is relevant or not. For ontology generation, such a term is probably not required, but for use as a target for sentiment analysis, it might be useful.


      Figure 2.9: Complex NP chunking excluding PPs.


      Figure 2.10: Complex NP chunking including PPs.
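      The simple Noun Phrase pattern described above (an optional determiner, any adjectives, then one or more nouns) can be sketched as a scan over POS-tagged tokens. This is a minimal illustration assuming Penn Treebank-style tags (DT, JJ*, NN*); the function name and the example sentence are hypothetical, and Prepositional Phrases are deliberately left outside the chunks.

```python
def chunk_noun_phrases(tagged):
    """Return simple NP chunks from a list of (word, tag) pairs.

    Pattern: optional determiner, zero or more adjectives, one or more nouns.
    """
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":                          # optional determiner
            j += 1
        while j < n and tagged[j][1].startswith("JJ"):    # adjectives
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):    # nouns
            k += 1
        if k > j:
            # At least one noun was found, so tokens i..k-1 form a chunk.
            chunks.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            # No noun followed; move on by one token and try again.
            i += 1
    return chunks


sentence = [("the", "DT"), ("big", "JJ"), ("black", "JJ"), ("dog", "NN"),
            ("chased", "VBD"), ("a", "DT"), ("cat", "NN"),
            ("in", "IN"), ("the", "DT"), ("garden", "NN")]
print(chunk_noun_phrases(sentence))
```

      Here “a cat” and “the garden” come out as separate chunks because the preposition “in” ends the first one. A chunker that includes Prepositional Phrases would need an additional rule to join them into “a cat in the garden.”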

      Verb Phrase chunkers delimit verbs, which may consist of a single word such as “bought” or a more complex group comprising modals, infinitives and so on (for example “might have bought” or “to buy”). They may even include negative elements such as “might not have bought” or “didn’t buy.” An example of chunker output combining both noun and verb phrase chunking is shown in Figure 2.11.

      Figure 2.11: Complex VP chunking.
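      A verb group of this kind can be sketched in the same rule-based way. The following is an illustration only, assuming Penn Treebank-style tags (MD for modals, TO for “to,” VB* for verbs, RB for adverbs) and a small hypothetical list of negation words; it is not the method of any particular chunker.

```python
NEGATIONS = {"not", "n't"}  # hypothetical, deliberately small list


def in_verb_group(word, tag):
    """True if the token may belong to a verb group."""
    return (tag in ("MD", "TO")
            or tag.startswith("VB")
            or (tag == "RB" and word.lower() in NEGATIONS))


def chunk_verb_groups(tagged):
    """Return verb-group chunks from a list of (word, tag) pairs."""
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        while j < n and in_verb_group(*tagged[j]):
            j += 1
        # Trim trailing tokens that are not verbs, e.g. the "to" in "went to".
        end = j
        while end > i and not tagged[end - 1][1].startswith("VB"):
            end -= 1
        if end > i:
            chunks.append(" ".join(word for word, _ in tagged[i:end]))
        i = max(j, i + 1)
    return chunks


sentence = [("She", "PRP"), ("might", "MD"), ("not", "RB"),
            ("have", "VB"), ("bought", "VBN"), ("it", "PRP")]
print(chunk_verb_groups(sentence))
```

      Requiring the chunk to end in a verb keeps a stray modal, negation, or prepositional “to” from being reported as a verb group on its own.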

      Some tools also provide additional chunks; for example, the TreeTagger [21] (trained on the Penn Treebank) can also generate chunks for prepositional phrases, adjectival phrases, adverbial phrases, and so on. These can be useful for building up a representation of the whole sentence without the requirement for full parsing.

      As we have already seen, linguistic processing tools are not infallible, even assuming that the components they rely on have generated perfect output. It may seem simple to create an NP chunker based on grammatical rules involving POS tags, but it can easily go wrong. Consider the two sentences “I gave the man food” and “I bought the baby food.” In the first case, “the man” and “food” are independent NPs which are respectively the indirect and direct objects of the verb “gave.” We can rephrase this sentence as “I gave food to the man” without any change in meaning, where it is clear these NPs are independent. In the second example, however, “the baby food” could be either a single NP which contains the compound noun “baby food,” or follow the same structure as the previous example (“I bought food for the baby”). An NP chunker which used the seemingly sensible
