Registered language resources: 3330
Showing items 2021 to 2023 of 2023
-
C-005083: NUM 5M Mongolian written corpus
This is a corpus of Mongolian text drawn mostly from online and printed daily newspapers, literature, and laws.
The collected raw text was reduced from 5 million to 4.8 million words after cleaning. The cleaned corpus comprises:
- laws: 144 texts;
- literature: 278 stories, 8 novelettes, and 4 novels;
- newspapers: 597 news items, 505 interviews, 302 reports, 578 essays, 469 stories, and 1,258 editorials.
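As a quick consistency check, the counts listed above can be tallied by source. The Python sketch below is illustrative only: the grouping labels and names such as `COUNTS` and `totals_by_source` are assumptions made for this example and are not part of the corpus distribution.

```python
# Text counts as listed in the corpus description, grouped by source.
# The dictionary keys are labels chosen for this sketch.
COUNTS = {
    "laws": {"texts": 144},
    "literature": {"stories": 278, "novelettes": 8, "novels": 4},
    "newspapers": {
        "news": 597,
        "interviews": 505,
        "reports": 302,
        "essays": 578,
        "stories": 469,
        "editorials": 1258,
    },
}


def totals_by_source(counts):
    """Return the number of texts per source group."""
    return {source: sum(kinds.values()) for source, kinds in counts.items()}


per_source = totals_by_source(COUNTS)
grand_total = sum(per_source.values())

for source, n in per_source.items():
    print(f"{source}: {n} texts ({n / grand_total:.1%} of the total)")
print(f"all sources: {grand_total} texts")
```

By this tally the list covers 4,143 texts, of which 3,709 come from newspapers.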
Part of this corpus, about 2,800 sentences with 100,000 words, has been POS-tagged manually and stored in TEI format.
-
C-005084: Metalogue Multi-Issue Bargaining Dialogue
*Introduction*
Metalogue Multi-Issue Bargaining Dialogue was developed by the Metalogue Consortium under the European Community's Seventh Framework Programme for Research and Technological Development. This release consists of approximately 2.5 hours of semantically annotated English dialogue data that includes speech and transcripts.
The goal of the Metalogue project was to develop a dialogue system with flexible dialogue management, so that the system can set goals, choose strategies, and monitor various processes. Participants were involved in a multi-issue bargaining scenario in which a representative of a city council and a representative of small business owners negotiated the implementation of new anti-smoking regulations. The negotiation involved four issues, each with four or five options. Participants received a preference profile for each scenario and negotiated for an agreement with the highest value based on their preference information. Negotiators were not allowed to accept an agreement with a negative value or to share their preference profiles with other participants.
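The scoring rule in this scenario can be illustrated with a small Python sketch. Everything concrete here is an assumption made for illustration: the issue and option labels, the numeric values, the additive scoring (an agreement's value taken as the sum of the values of the chosen options), and the function names `agreement_value` and `may_accept`. The release documentation should be consulted for the actual preference profiles.

```python
# Hypothetical preference profile for one negotiator:
# issue -> option -> value. Four issues with four or five options each,
# as in the scenario description; labels and numbers are invented.
PROFILE = {
    "issue_1": {"a": 4, "b": 2, "c": 0, "d": -3},
    "issue_2": {"a": -2, "b": 1, "c": 3, "d": 5},
    "issue_3": {"a": 3, "b": 1, "c": -1, "d": -4, "e": -5},
    "issue_4": {"a": -4, "b": -1, "c": 2, "d": 4},
}


def agreement_value(profile, agreement):
    """Sum the negotiator's value of the chosen option on every issue.

    Assumes additive scoring, which is an assumption of this sketch.
    """
    if set(agreement) != set(profile):
        raise ValueError("an agreement must settle every issue exactly once")
    return sum(profile[issue][option] for issue, option in agreement.items())


def may_accept(profile, agreement):
    """An agreement with a negative value must not be accepted."""
    return agreement_value(profile, agreement) >= 0


deal = {"issue_1": "b", "issue_2": "c", "issue_3": "a", "issue_4": "b"}
bad_deal = {"issue_1": "d", "issue_2": "a", "issue_3": "e", "issue_4": "a"}

print(agreement_value(PROFILE, deal), may_accept(PROFILE, deal))
print(agreement_value(PROFILE, bad_deal), may_accept(PROFILE, bad_deal))
```

Because each negotiator sees only their own profile, the same agreement generally has a different value for each side, which is what makes the bargaining non-trivial.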
*Data*
Six unique subjects (undergraduates between 19 and 25 years of age) participated in the collection. The dialogue speech was captured with two headset microphones and saved in 16 kHz, 16-bit mono linear PCM FLAC format. Speech signal files are of two types: full dialogue sessions, and segmented speech signals cut per speaker and roughly per turn.
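The stated audio format fixes the uncompressed data rate, which can help when planning storage or decoding buffers. The Python sketch below is a rough estimate under stated assumptions: the 2.5-hour duration is the approximate figure given in the introduction, it is applied to a single mono stream, and the function name `pcm_bytes` is hypothetical. The FLAC files themselves will be smaller, since FLAC compresses losslessly.

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz, as stated
BITS_PER_SAMPLE = 16      # 16-bit linear PCM
CHANNELS = 1              # mono


def pcm_bytes(duration_s, sample_rate=SAMPLE_RATE_HZ,
              bits=BITS_PER_SAMPLE, channels=CHANNELS):
    """Size in bytes of uncompressed linear PCM audio of the given duration."""
    return int(duration_s * sample_rate * (bits // 8) * channels)


bytes_per_second = pcm_bytes(1)
# Assumes roughly 2.5 hours of audio on one mono stream.
approx_total = pcm_bytes(2.5 * 3600)

print(f"{bytes_per_second} bytes per second")
print(f"{approx_total / 1e6:.0f} MB decoded for about 2.5 hours")
```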
Transcripts were produced semi-automatically, using an automatic speech recognizer followed by manual correction.
Seven types of annotation were performed manually using the Anvil tool: dialogue act annotations; discourse structure acts; contact management acts; task management dialogue acts; negotiation moves; rhetorical relations; and disfluencies in speech production. More information about the annotation process is included in the documentation.
All text is presented in UTF-8 as either plain text or XML.
-
C-005085: UCLA High-Speed Laryngeal Video and Audio
*Introduction*
UCLA High-Speed Laryngeal Video and Audio was developed by the UCLA Speech Processing and Auditory Perception Laboratory. It consists of high-speed laryngeal video recordings of the vocal folds and synchronized audio recordings from nine subjects, collected between April 2012 and April 2013. Speakers were asked to sustain the vowel /i/ for approximately ten seconds while holding voice quality, fundamental frequency, and loudness as steady as possible.
In the field of speech production theory, data such as those contained in this release may be used to study the relationship between vocal fold vibration and the resulting voice quality.
*Data*
None of the subjects had a history of a voice disorder. There was no native language requirement for recruiting subjects; participants were native speakers of various languages, including English, Mandarin Chinese, Taiwanese Mandarin, Cantonese and German.
Audio data is presented as 16 kHz, 16-bit FLAC, and video is in AVI format at 5 fps (frames per second).