The Bulgarian Sense-annotated Corpus (BulSemCor) (Bulgarian: Български семантично анотиран корпус (БулСемКор)) is a structured corpus of Bulgarian texts in which each lexical item is assigned a sense tag. BulSemCor was created by the Department of Computational Linguistics[1] at the Institute for Bulgarian Language of the Bulgarian Academy of Sciences.
Structure
BulSemCor was created as part of a nationally funded project titled "BulNet – A lexico-semantic network for the Bulgarian Language" (2005–2010). It follows the general methodology of SemCor,[2] supplemented by principles specific to the project.[3] The corpus for annotation consists of 101,791 tokens covering an excerpt from the Bulgarian "Brown" Corpus,[4] which is modelled on the Brown Corpus (Francis & Kucera 1979). An important feature of BulSemCor is that the samples are selected using heuristics that provide optimal coverage of ambiguous lexis.
BulSemCor is manually sense-annotated according to the Bulgarian WordNet. Its size is comparable to that of other contemporary semantically annotated corpora or pools of acceptable linguistic components. The semantic annotation consists in associating each lexical item in the corpus with exactly one synonym set (synset) in the Bulgarian WordNet, namely the one that best describes its sense in the particular context. The best match among the suggested candidates is selected on the basis of a set of criteria, such as the other synset members, the synset gloss (explanatory definition) and the position of a given candidate in the WordNet structure.
Scale
The number of annotated tokens is 99,480; this is lower than the 101,791 tokens of the initial corpus because some tokens are not linguistic items. Of the annotated tokens, 86,842 are simple words and 12,638 belong to 5,797 multiword expressions (MWEs).
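The figures above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch in Python; the variable names are illustrative, and the derived "annotated units" count assumes that each simple word and each multiword expression carries exactly one sense tag, which the text implies but does not state as a total.

```python
# Token counts as reported for BulSemCor in the text.
INITIAL_TOKENS = 101_791    # tokens in the corpus selected for annotation
ANNOTATED_TOKENS = 99_480   # tokens that received an annotation
SIMPLE_WORDS = 86_842       # single-token annotated items
MWE_COUNT = 5_797           # multiword expressions
MWE_TOKENS = 12_638         # tokens that make up those expressions

# Simple words and MWE tokens together account for all annotated tokens.
assert SIMPLE_WORDS + MWE_TOKENS == ANNOTATED_TOKENS

# Tokens left unannotated because they are not linguistic items.
non_linguistic = INITIAL_TOKENS - ANNOTATED_TOKENS

# Derived figure (assumption: one sense tag per simple word and per MWE).
annotated_units = SIMPLE_WORDS + MWE_COUNT

# Average number of tokens per multiword expression.
avg_mwe_length = MWE_TOKENS / MWE_COUNT

print(non_linguistic, annotated_units, round(avg_mwe_length, 2))
```

On these numbers, 2,311 tokens are non-linguistic and a multiword expression spans a little over two tokens on average.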
Specific features
All words in BulSemCor are assigned a sense, whereas established practice is to annotate only simple content words or selected content word classes (typically nouns and verbs). Since 2000, the development of language resources has broadened to include annotation of function words and multiword expressions, covering particular senses or types of words and expressions. In this respect, BulSemCor's annotation is more exhaustive and hence provides greater opportunities for linguistic observation and natural language processing (NLP) applications.
Annotated items inherit the linguistic information associated with the corresponding synset, which along with morphological and semantic tags may include annotation on one or more of the following additional levels:[5]
- Partial information about the syntactic structure of MWE types – particularly, information about syntactic heads and their dependents;
- Information about the category of the named entities – names, locations, organisations, dates, numbers, etc.;
- Information about the taxonomic category of adverbs, such as time, place, manner, degree, quantity, etc.;
- Information about the type of the syntactic relationships – coordination or subordination – expressed by conjunctions;
- Information about the original part-of-speech of substantivised words (non-nouns that act as nouns in a particular context);
- Stylistic/register, grammatical and other information about synsets or individual synset members.
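The way an annotated item inherits information from its synset can be illustrated with a small data model. This is only a sketch under stated assumptions: the class names, field names, the synset identifier and the example entry are all hypothetical and do not reflect BulSemCor's actual file format or the Bulgarian WordNet's real identifiers.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Synset:
    """A synonym set; all field names here are hypothetical."""
    synset_id: str                  # placeholder ID, not a real WordNet ID
    pos: str                        # part of speech
    members: tuple[str, ...]        # literals belonging to the synset
    gloss: str                      # explanatory definition
    # Optional additional levels, e.g. named-entity category,
    # adverb category, conjunction type, register notes.
    extra: dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class AnnotatedItem:
    """A lexical item in the corpus linked to exactly one synset."""
    tokens: tuple[str, ...]         # one token, or several for an MWE
    synset: Synset

    @property
    def is_mwe(self) -> bool:
        # An item spanning more than one token is a multiword expression.
        return len(self.tokens) > 1

    def inherited_info(self) -> dict[str, str]:
        # The item inherits the linguistic information of its synset.
        return {"pos": self.synset.pos, **self.synset.extra}


# Hypothetical example: a temporal adverb with its taxonomic category.
today = Synset(
    synset_id="hypothetical-0001",
    pos="adverb",
    members=("днес",),
    gloss="on this day",
    extra={"adverb_category": "time"},
)
item = AnnotatedItem(tokens=("днес",), synset=today)
print(item.is_mwe, item.inherited_info())
```

Keeping the additional levels in an open-ended mapping reflects the fact that each synset may carry one, several or none of them.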
References
- ^ Department of Computational Linguistics. Archived May 18, 2015, at the Wayback Machine.
- ^ Miller 1995.
- ^ Koeva 2010.
- ^ Bulgarian "Brown" Corpus. Archived May 18, 2015, at the Wayback Machine; Koeva, Leseva & Todorova 2006.
- ^ Todorova, Kukova & Leseva 2014.
- Koeva, Svetla (2010). "Balgarskiyat semantichno anotiran korpus" [The Bulgarian Sense-annotated Corpus].
- Koeva, Svetla; Leseva, S.; Todorova, M. (May 23, 2006). Bulgarian Sense Tagged Corpus. 5th SALTMIL Workshop on Minority Languages: Strategies for Developing Machine Translation for Minority Languages. pp. 79–87.
- Miller, G. A. (1995). "Building Semantic Concordances: Disambiguation vs. Annotation" (PDF). AAAI Technical Report SS-95-01: 92–94.
- Todorova, M.; Kukova, H.; Leseva, S. (2014). "Semantichno anotirani resursi za balgarskiya ezik – BulSemCor" [Semantically-annotated Resources for Bulgarian – BulSemCor]. Language Resources and Technologies for Bulgarian. Academic Publishing House. pp. 80–104. ISBN 978-954-322-797-6.
- Francis, N.; Kucera, H. (1979), Manual of Information to Accompany a Standard Sample of Present-day Edited American English, for Use with Digital Computers, Providence, Rhode Island: Department of Linguistics, Brown University, archived from the original on May 18, 2014, retrieved July 7, 2013