
discourse information per tweet, and threaded structure is fragmented across multiple documents, flowing in multiple directions. NERC from social media will be discussed explicitly in Chapter 8.

      In general, NERC performance is lower than the performance of NLP pre-processing tasks, such as POS tagging, but can still reach F1 scores above 90%. NERC performance depends on a variety of factors, including the type of text (e.g., newswire, social media), the NE type (e.g., PER, LOC, ORG), the size of the available training corpus and, most notably, how different the corpus the NERC system was developed on is from the text it is applied to [69]. In NERC evaluation campaigns, the task is typically to train and test systems on different splits of the same corpus (also called in-domain performance), meaning the test corpus is very similar to the training corpus.

      To give an indication of such in-domain NERC performance, the current state-of-the-art result on CoNLL 2003, the most popular newswire corpus with NERC annotations, is an F1 of 90.10%. The best-performing system is currently [70]. On the other hand, the winning tool for the social media domain in the 2015 shared task WNUT [71, 72] achieved only 56.41% F1 for NERC, and 70.63% F1 for NER. It is clear that NERC is much more difficult than NER, and that NERC for existing social media corpora is more challenging than for newswire corpora. Notably, the corpora also differ in size, which is fairly typical. Large NERC-annotated corpora exist for the newswire genre, but are still largely lacking for the social media genre. This is a major reason why performance on social media corpora is so much worse [69].
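      The gap between the NER and NERC scores above can be illustrated with a minimal sketch of entity-level F1. It assumes exact-match scoring, in which a predicted entity counts as correct only if its boundaries (and, for NERC, its type) match the gold annotation exactly; the official scorers of individual shared tasks may differ in detail. The function names and the toy annotations are hypothetical and chosen only for illustration.

```python
from typing import Set, Tuple

# An entity is (start_token, end_token, type); toy data, not from any corpus.
Entity = Tuple[int, int, str]


def f1_score(gold: set, predicted: set) -> float:
    """Exact-match F1 between two sets of annotations."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def strip_types(entities: Set[Entity]) -> Set[Tuple[int, int]]:
    """Keep only the boundaries, for scoring recognition without classification."""
    return {(start, end) for start, end, _ in entities}


gold = {(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")}
predicted = {
    (0, 2, "PER"),   # correct boundaries and type
    (5, 6, "ORG"),   # correct boundaries, wrong type
    (9, 10, "ORG"),  # wrong boundaries
}

nerc_f1 = f1_score(gold, predicted)                           # boundaries + type
ner_f1 = f1_score(strip_types(gold), strip_types(predicted))  # boundaries only

print(f"NERC F1: {nerc_f1:.2f}, NER F1: {ner_f1:.2f}")
```

      In this toy case the second prediction is counted as correct for NER but not for NERC, so the NERC score can never exceed the NER score under this scoring scheme.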

      In real-world or application scenarios, the in-domain setting described above typically does not apply. Even if a hand-annotated NERC corpus is created for the specific application at some point, the test data might change. Typically, the greater the time difference between the creation of a training corpus and the test data, the less useful the training corpus is for extracting NEs from that test data [69]. This is particularly true for the social media genre, where entities change very quickly. In practice, this means that after a couple of years, training data can be rendered almost useless.
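      One simple way to get a feel for this kind of drift is to check how many of the entities in the test data were never seen in the training data. The sketch below is an illustrative diagnostic only, not the analysis performed in [69]; the function name, the case-insensitive matching, and the example entity lists are all assumptions made for the example.

```python
from typing import Iterable


def unseen_entity_rate(train_entities: Iterable[str],
                       test_entities: Iterable[str]) -> float:
    """Fraction of distinct test entity strings absent from the training data.

    Matching is case-insensitive on the surface form, which is a
    simplifying assumption.
    """
    train_forms = {entity.lower() for entity in train_entities}
    test_forms = {entity.lower() for entity in test_entities}
    if not test_forms:
        return 0.0
    return len(test_forms - train_forms) / len(test_forms)


# Hypothetical entity mentions from an older training set and newer test data.
train = ["London", "BBC", "Barack Obama"]
test = ["london", "TikTok", "Barack Obama", "Threads"]

rate = unseen_entity_rate(train, test)
print(f"Unseen entity rate: {rate:.0%}")
```

      A high rate suggests that the training corpus covers few of the entities currently being discussed, which is one plausible symptom of the ageing effect described above.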

      In this chapter, we have described the task of Named Entity Recognition and Classification and its two subtasks of boundary identification and classification into entity types. We have shown why the linguistic techniques described in the previous chapter are required for the task, and how they are used in both rule-based and machine-learning approaches. As with most of the NLP tasks described in the rest of the book, this is the point at which tasks begin to get more complicated. The linguistic pre-processing tasks all have essentially the same goal and definition, which do not vary according to what their output will be used for. NE recognition and other tasks, such as relation extraction and sentiment analysis, vary enormously in their definition, depending on why they are required. For example, the classes of NEs may range from the standard MUC types of Person, Organization, and Location to a much more fine-grained classification involving many more classes, which makes the task very different. From there, one can go a stage further and perform a more semantic form of annotation, linking entities to external data sources such as DBpedia and Freebase, as will be described in Chapter 5. Despite this, methods for NERC are typically reusable (at least to some extent) even when the task itself varies substantially, although some kinds of learning methods, for example, may work better for particular levels of classification. In the following chapter, we look at how named entities can be connected via relations, such as authors and their books, or employees and their organizations.

       1 http://nerd.eurecom.fr/ontology

