World Library  
Flag as Inappropriate
Email this Article

Named-entity recognition


Named-entity recognition

Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify elements in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Most research on NER systems has been structured as taking an unannotated block of text, such as this one:

Jim bought 300 shares of Acme Corp. in 2006.

And producing an annotated block of text that highlights the names of entities:

[Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.

In this example, a person name consisting of one token, a two-token company name and a temporal expression have been detected and classified.

State-of-the-art NER systems for English produce near-human performance. For example, the best system entering MUC-7 scored 93.39% of F-measure while human annotators scored 97.60% and 96.95%.[1][2]


  • Problem definition 1
    • Formal evaluation 1.1
  • Approaches 2
  • Problem domains 3
  • Current challenges and research 4
  • Software 5
  • See also 6
  • References 7
  • External links 8

Problem definition

In the expression named entity, the word named restricts the task to those entities for which one or many rigid designators, as defined by Kripke, stands for the referent. For instance, the automotive company created by Henry Ford in 1903 is referred to as Ford or Ford Motor Company. Rigid designators include proper names as well as certain natural kind terms like biological species and substances.[3]

Full named-entity recognition is often broken down, conceptually and possibly also in implementations,[4] as two distinct problems: detection of names, and

  • Named entity recognition for Arabic - Issues and challenges in morphologically rich languages such as Arabic
  • CoNLL Language-independent NER shared tasks (2002) and (2003): NER data sets and methods for Spanish, Dutch, English and German
  • Chemical compound and drug name recognition - Community challenge on the recognition of chemical compound and drug entity mentions in text

External links

  1. ^ Elaine Marsh, Dennis Perzanowski, "MUC-7 Evaluation of IE Technology: Overview of Results", 29 April 1998 PDF
  2. ^ MUC-07 Proceedings (Named Entity Tasks)
  3. ^ [1]Nadeau, David; Sekine, Satoshi (2007). A survey of named entity recognition and classification. Lingvisticae Investigationes. 
  4. ^ Carreras, Xavier; Màrquez, Lluís; Padró, Lluís (2003). A simple named entity extractor using AdaBoost. CoNLL. 
  5. ^ a b Tjong Kim Sang, Erik F.; De Meulder, Fien (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. CoNLL. 
  6. ^ Named Entity Definition. Retrieved on 2013-07-21.
  7. ^ Brunstein, Ada. "Annotation Guidelines for Answer Types". LDC Catalog. Linguistic Data Consortium. Retrieved 21 July 2013. 
  8. ^ a b Sekine's Extended Named Entity Hierarchy. Retrieved on 2013-07-21.
  9. ^ Ritter, A.; Clark, S.; Mausam; Etzioni., O. (2011). Named Entity Recognition in Tweets: An Experimental Study (PDF). Proc. Empirical Methods in Natural Language Processing. 
  10. ^ Esuli, Andrea; Sebastiani, Fabrizio (2010). Evaluating Information Extraction (PDF). Cross-Language Evaluation Forum (CLEF). pp. 100–111. 
  11. ^ a b Lin, Dekang; Wu, Xiaoyun (2009). Phrase clustering for discriminative learning (PDF). Annual Meeting of the  
  12. ^ Nothman, Joel; et al. (2013). "Learning multilingual named entity recognition from  
  13. ^ Jenny Rose Finkel; Trond Grenager; Christopher Manning (2005). Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling (PDF). 43rd Annual Meeting of the  
  14. ^ Poibeau, Thierry; Kosseim, Leila (2001). "Proper Name Extraction from Non-Journalistic Texts". Language and Computers 37 (1): 144–157. 
  15. ^ Krallinger, M; Leitner, F; Rabal, O; Vazquez, M; Oyarzabal, J; Valencia, A. "Overview of the chemical compound and drug name recognition (CHEMDNER) task". Proceedings of the Fourth BioCreative Challenge Evaluation Workshop vol. 2. pp. 6–37. 
  16. ^ Turian, J., Ratinov, L., & Bengio, Y. (2010, July). Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp. 384-394). Association for Computational Linguistics. PDF
  17. ^ Ratinov, L., & Roth, D. (2009, June). Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (pp. 147-155). Association for Computational Linguistics.
  18. ^ Frustratingly Easy Domain Adaptation.
  19. ^ Fine-Grained Named Entity Recognition Using Conditional Random Fields for Question Answering.
  20. ^ Web 2.0-based crowdsourcing for high-quality gold standard development in clinical Natural Language Processing
  21. ^ Eiselt, Andreas; Figueroa, Alejandro (2013). A Two-Step Named Entity Recognizer for Open-Domain Search Queries. IJCNLP. pp. 829–833. 
  22. ^ Linking Documents to Encyclopedic Knowledge.
  23. ^ Learning to link with WorldHeritage.
  24. ^ Local and Global Algorithms for Disambiguation to WorldHeritage.


See also

  • GATE supports NER across many languages and domains out of the box, usable via graphical interface and also Java API
  • NETagger includes the Java based Illinois Named Entity Recognition tool, trained for the standard 4 types, as well as for an extended set of entities.
  • OpenNLP includes rule based and statistical named entity recognition
  • Stanford CoreNLP includes a Java-based CRF named entity recognition tool


 Michael Jordan  is a professor at   Berkeley 

A recently emerging task of identifying "important expressions" in text and cross-linking them to WorldHeritage[22] [23][24] can be seen as an instance of extremely fine-grained named entity recognition, where the types are the actual WorldHeritage pages describing the (potentially ambiguous) concepts. Below is an example output of a Wikification system:

Despite the high F1 numbers reported on the MUC-7 dataset, the problem of Named Entity Recognition is far from being solved. The main efforts are directed to reducing the annotation labor by employing semi-supervised learning,[11][16] robust performance across domains[17][18] and scaling up to fine-grained entity types.[8][19] In recent years, many projects have turned to a crowdsourcing, which is a promising solution to obtain high-quality aggregate human judgments for supervised and semi-supervised machine learning approaches to NER.[20] Another challenging task is devising models to deal with linguistically complex contexts such as Twitter and search queries.[21]

Current challenges and research

Early work in NER systems in the 1990s was aimed primarily at extraction from journalistic articles. Attention then turned to processing of military dispatches and reports. Later stages of the automatic content extraction (ACE) evaluation also included several types of informal text styles, such as weblogs and text transcripts from conversational telephone speech conversations. Since about 1998, there has been a great deal of interest in entity identification in the molecular biology, bioinformatics, and medical natural language processing communities. The most common entity of interest in that domain has been names of genes and gene products. There has been also considerable interest in the recognition of chemical entities and drugs in the context of the CHEMDNER competition, with 27 teams participating in this task.[15]

Research indicates that even state-of-the-art NER systems are brittle, meaning that NER systems developed for one domain do not typically perform well on other domains.[14] Considerable effort is involved in tuning NER systems to perform well in a new domain; this is true for both rule-based and trainable statistical systems.

Problem domains

Many different classifier types have been used to perform machine-learned NER, with conditional random fields being a typical choice.[13]

NER systems have been created that use linguistic grammar-based techniques as well as statistical models, i.e. machine learning. Hand-crafted grammar-based systems typically obtain better precision, but at the cost of lower recall and months of work by experienced computational linguists. Statistical NER systems typically require a large amount of manually annotated training data. Semisupervised approaches have been suggested to avoid part of the annotation effort.[11][12]


Evaluation models based on a token-by-token matching have been proposed.[10] Such models are able to handle also partially overlapping matches, yet fully rewarding only exact matches. They allow a finer grained evaluation and comparison of extraction systems, taking into account also the degree of mismatch in non-exact predictions.

It follows from the above definition that any prediction that misses a single token, includes a spurious token, or has the wrong class, is a hard error and does not contribute to either precision or recall.

  • Precision is the number of predicted entity name spans that line up exactly with spans in the gold standard evaluation data. I.e. when [Person Hans] [Person Blick] is predicted but [Person Hans Blick] was required, precision for the predicted name is zero. Precision is then averaged over all predicted entity names.
  • Recall is similarly the number of names in the gold standard that appear at exactly the same location in the predictions.
  • F1 score is the harmonic mean of these two.

In academic conferences such as CoNLL, a variant of the F1 score has been defined as follows:[5]

To evaluate the quality of a NER system's output, several measures have been defined. While accuracy on the token level is one possibility, it suffers from two problems: the vast majority of tokens in real-world text are not part of entity names as usually defined, so the baseline accuracy (always predict "not an entity") is extravagantly high, typically >90%; and mispredicting the full span of an entity name is not properly penalized (finding only a person's first name when their last name follows is scored as ½ accuracy).

Formal evaluation

Certain hierarchies of named entity types have been proposed in the literature. BBN categories, proposed in 2002, is used for Question Answering and consists of 29 types and 64 subtypes.[7] Sekine's extended hierarchy, proposed in 2002, is made of 200 subtypes.[8] More recently, in 2011 Ritter used a hierarchy based on common Freebase entity types in ground-breaking experiments on NER over social media text.[9]

Temporal expressions and some numerical expressions (i.e., money, percentages, etc.) may also be considered as named entities in the context of the NER task. While some instances of these types are good examples of rigid designators (e.g., the year 2001) there are also many invalid ones (e.g., I take my vacations in “June”). In the first case, the year 2001 refers to the 2001st year of the Gregorian calendar. In the second case, the month June may refer to the month of an undefined year (past June, next June, June 2020, etc.). It is arguable that the named entity definition is loosened in such cases for practical reasons. The definition of the term named entity is therefore not strict and often has to be explained in the context it is used.[6]

. chunking). The first phase is typically simplified to a segmentation problem: names are defined to be contiguous spans of tokens, with no nesting, so that "Bank of America" is a single name, disregarding the fact that inside this name, the substring "America" is itself a name. This segmentation problem is formally similar to [5]

This article was sourced from Creative Commons Attribution-ShareAlike License; additional terms may apply. World Heritage Encyclopedia content is assembled from numerous content providers, Open Access Publishing, and in compliance with The Fair Access to Science and Technology Research Act (FASTR), Wikimedia Foundation, Inc., Public Library of Science, The Encyclopedia of Life, Open Book Publishers (OBP), PubMed, U.S. National Library of Medicine, National Center for Biotechnology Information, U.S. National Library of Medicine, National Institutes of Health (NIH), U.S. Department of Health & Human Services, and, which sources content from all federal, state, local, tribal, and territorial government publication portals (.gov, .mil, .edu). Funding for and content contributors is made possible from the U.S. Congress, E-Government Act of 2002.
Crowd sourced content that is contributed to World Heritage Encyclopedia is peer reviewed and edited by our editorial staff to ensure quality scholarly research articles.
By using this site, you agree to the Terms of Use and Privacy Policy. World Heritage Encyclopedia™ is a registered trademark of the World Public Library Association, a non-profit organization.

Copyright © World Library Foundation. All rights reserved. eBooks from World eBook Library are sponsored by the World Library Foundation,
a 501c(4) Member's Support Non-Profit Organization, and is NOT affiliated with any governmental agency or department.