Language Resource Search - SHACHI: Language Resource Metadata Database


C-004555: Spoken Portuguese Corpus
Desktop/Microphone
The Spoken Portuguese corpus was collected among sociolinguistically diverse speakers having Portuguese as mother tongue or as second language. In a total of 86 recordings, the texts exemplify the Portuguese spoken in Portugal (30), in Brazil (20), in the African countries with Portuguese as their official language: Angola, Cape Verde, Guinea-Bissau, Mozambique and Sao Tome and Principe (5 each), in Macao (5), in Goa (3) and in East Timor (3), corresponding to a total of 8h44m of recording.
The corpus was recorded in a situation of spontaneous oral communication, on different themes of everyday life, with speakers of different ages and social and professional backgrounds.
The recordings cover a period that goes from 1970 to 2001, and approximately 70% of them fall within the nineties. The corpus contains 153,588 tokens.

The corpus consists of audio files in .wav format, aligned transcriptions in XML Exmaralda format and transcriptions in plain text. The plain text files also have automatically assigned POS-tag information. The transcriptions of the corpus are also available in html format. The characters have been encoded in UTF-8.
C-004556: Fundamental Portuguese Corpus
Desktop/Microphone
The Fundamental Portuguese Corpus is a corpus of spoken language, collected between 1970 and 1974, composed of 1800 recordings (500 hours) made in Continental Portugal and the Islands. Of these 1800 conversations, a sample was selected and transcribed.

The corpus consists of audio files in .wav format, aligned transcriptions in XML Exmaralda format and transcriptions in plain text. The plain text files also have automatically assigned POS-tag information. The transcriptions of the corpus are also available in html format. The characters have been encoded in UTF-8.
C-004576: Quaero Broadcast News Extended Named Entity corpus
Broadcast Resources
The Quaero Broadcast News Extended Named Entity corpus consists of the manual annotation of (i) the ESTER 2 corpus (see ELRA-S0338) and (ii) the Quaero Speech Recognition Evaluation corpus (manual and automatic transcriptions coming from 3 different ASR systems). The first part is the training corpus and the second one is the test corpus.

The corpus is fully manually annotated according to the Quaero extended and structured named entity definition, which differentiates entity "types" and "components". The training part of the corpus is only composed of broadcast news data and contains 188 shows, 1,291,225 words, 113,885 types and 146,405 components. The test corpus is composed of both broadcast news and broadcast conversations data and contains 18 shows, 108,010 words, 5,523 types and 8,902 components.

The Quaero Broadcast News Extended Named Entity Corpus consists of:
- a manually transcribed and fully annotated radio broadcast news and broadcast conversation corpus amounting to about 1.5 million words,
- a sub-corpus serving as a mini-reference corpus for quality evaluation purposes,
- tools developed for annotation and evaluation,
- guidelines.
C-004578: CHIL 2007+ Evaluation Package
Multimodal/Multimedia Resources
The CHIL2007+ includes 1) CHIL 2007 Evaluation Package (see ELRA-E0033) and 2) additional annotations which have been created within the scope of the Metanet4u Project (ICT PSP No 270893), sponsored by the European Commission.

The CHIL 2007 Evaluation Package was produced within the CHIL Project (Computers in the Human Interaction Loop), in the framework of an Integrated Project (IP 506909) under the European Commission's Sixth Framework Programme. The objective of this project is to create environments in which computers serve humans who focus on interacting with other humans as opposed to having to attend to and being preoccupied with the machines themselves. Instead of computers operating in an isolated manner, and Humans [thrust] in the loop [of computers] we will put Computers in the Human Interaction Loop (CHIL).

In this context, the CHIL project produced CHIL Seminars. The CHIL Seminars are scientific presentations given by students, faculty members or invited speakers in the field of multimodal interfaces and speech processing. During the talks, videos of the speaker and the audience from 4 fixed cameras, frontal close-ups of the speaker, and close-talking and far-field microphone data of the speaker's voice and ambient sounds were recorded.

The CHIL 2007 Evaluation Package consists of the following contents:
1) A set of audiovisual recordings of interactive seminars. The number of people present in the recording was fixed to be between 3 and 7. The recordings were done between June and September 2006 according to the CHIL Room Setup specification.
2) Video annotations: the 3-D coordinates of each participant.
3) Orthographic transcriptions of speech, identity of the speaker, and acoustic events.

The additional annotations have been designed as a complementary extension of the CHIL 2007 Evaluation Package. The list of annotation categories included in that database has been largely extended so that: 1) the database can also be used with other speech technologies, and 2) it includes richer information about human activity.

The set of additional annotations includes:
1) Movement
2) Individual focus of attention
3) Hand gestures
4) Head gestures
5) Spatial role labeling
6) Activity
7) Emotion
8) Named entities
9) Topics
10) Links between tiers, when more than one modality is required to resolve ambiguities.

The resultant extended database is called CHIL2007+.
C-004593: aGender
Telephone
aGender contains speech sample recordings over public telephone lines with read and (semi-)spontaneous speech. Native German speakers called a voice portal from their private phones, read text and answered some open questions. The purpose of the corpus is the automatic detection of gender and/or age (7 mixed classes ranging from 7 - 80 years). The corpus contains the voices of 945 German speakers (approx. minimum of 100 speakers per class), each delivering 18 speech items in up to six different sessions. Neither the time and date of the individual recording sessions nor the total number of sessions per speaker was controlled.

The audio signal was recorded over standard cell phones (GSM standard) and landline connections at 8000 Hz in 8-bit A-law format. Data were then expanded to 8000 Hz, 16-bit PCM (all 16 bits are valid!).
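
The expansion from 8-bit A-law to 16-bit PCM can be illustrated with a minimal sketch of the standard G.711 A-law decoding rule. This is not the tool that was used to prepare the corpus; the function names alaw_to_linear and expand_alaw are hypothetical and chosen only for this example.

```python
import struct


def alaw_to_linear(byte: int) -> int:
    """Decode one 8-bit A-law code into a signed 16-bit PCM sample (G.711)."""
    byte ^= 0x55                      # undo the even-bit inversion applied by the encoder
    mantissa = (byte & 0x0F) << 4     # 4-bit quantisation step within the segment
    segment = (byte & 0x70) >> 4      # 3-bit segment (exponent) number
    if segment == 0:
        magnitude = mantissa + 8
    else:
        magnitude = (mantissa + 0x108) << (segment - 1)
    # In A-law the sign bit is set for positive samples.
    return magnitude if byte & 0x80 else -magnitude


def expand_alaw(data: bytes) -> bytes:
    """Expand a buffer of A-law bytes to little-endian 16-bit PCM (hypothetical helper)."""
    return struct.pack("<%dh" % len(data), *(alaw_to_linear(b) for b in data))


if __name__ == "__main__":
    pcm = expand_alaw(bytes([0xD5, 0x55, 0xAA, 0x2A]))
    print(struct.unpack("<4h", pcm))
```

Each input byte becomes two output bytes, so the expanded data is twice the size of the A-law original at the same 8000 Hz sampling rate.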

The selection of speakers is approximately evenly distributed over the seven target classes, with class 1 also being balanced for gender. The read material consists of an altered version of the SpeechDat text material, containing short fixed and free text typical for automated call centers.

A typical utterance is about 2 seconds in length, but some utterances are between 3 and 6 seconds long. In total, the corpus consists of 47 hours of speech. Two sets were defined on that data: a training set (81.5%) and a test set (175 speakers, 25 per class, 18.5%), with disjoint speaker sets. For the test set no class information is given in this corpus.

Number of speakers in training/development set: 770
Number of speakers in test set: 175
Number of sessions in train/devel: 3625
Number of utterances: 65241
Number of training/development utterances: 53076
Number of test utterances: 12165
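
The figures above can be cross-checked with a few lines of arithmetic. The sketch below only uses numbers quoted in this description; the variable names and the share helper are illustrative. It suggests that the 81.5% / 18.5% split refers to speakers rather than utterances.

```python
# Figures copied from the aGender description above.
TOTAL_SPEAKERS = 945
TOTAL_UTTERANCES = 65241
TRAIN_SPEAKERS, TEST_SPEAKERS = 770, 175
TRAIN_UTTERANCES, TEST_UTTERANCES = 53076, 12165
CLASSES, TEST_SPEAKERS_PER_CLASS = 7, 25


def share(part: int, whole: int) -> float:
    """Percentage of `whole` taken by `part`, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


checks = {
    "speakers add up": TRAIN_SPEAKERS + TEST_SPEAKERS == TOTAL_SPEAKERS,
    "utterances add up": TRAIN_UTTERANCES + TEST_UTTERANCES == TOTAL_UTTERANCES,
    "test speakers per class": CLASSES * TEST_SPEAKERS_PER_CLASS == TEST_SPEAKERS,
}

if __name__ == "__main__":
    for name, ok in checks.items():
        print(name, "OK" if ok else "MISMATCH")
    print("training share of speakers (%):", share(TRAIN_SPEAKERS, TOTAL_SPEAKERS))
    print("test share of speakers (%):", share(TEST_SPEAKERS, TOTAL_SPEAKERS))
```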

For general information, see also:
Felix Burkhardt, Martin Eckert, Wiebke Johannsen, Joachim Stegmann (2010): A Database of Age and Gender Annotated Telephone Speech. In: Proceedings of the LREC 2010, Malta.
C-004597: LECTRA (LECture TRAnscriptions in European Portuguese)
Desktop/Microphone
This corpus is composed of the audio and the manual transcriptions of the LECTRA Corpus: classroom LECture TRAnscriptions in European Portuguese. The corpus includes seven 1-semester University courses. All lectures were taught at Technical University of Lisbon (IST), recorded in the presence of students, except IICT, recorded in another university and in a quiet office environment, targeting an Internet audience. The corpus contains a total of 28 hours of audio speech that were manually transcribed by several trained annotators.

The corpus is comprised of technical University lectures: Production of Multimedia Contents (PMC), Economic Theory I (ETI), Linear Algebra (LA), Introduction to Informatics and Communication Techniques (IICT), Object Oriented Programming (OOP), Accounting (CONT), Graphical Interfaces (GI).

Two files per lecture are provided:
a) a RAW file: audio file
b) a TRS file: containing the manual transcriptions. The TRS format is a kind of XML format that a standard transcription software such as Transcriber can open. Annotations in the TRS files are at word-level. They are fine-grained transcriptions that include disfluencies. The characters in the text files are encoded in ISO-8859-1 (Latin1).
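
As a rough illustration of working with such TRS files, the sketch below reads speaker turns and their word tokens using only the Python standard library. It assumes the usual Transcriber layout, with Turn elements carrying speaker, startTime and endTime attributes and Sync markers between stretches of text; the embedded sample is invented for this example and is not taken from the corpus, and the read_turns function is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample in the assumed Transcriber layout, encoded as ISO-8859-1
# to mirror the encoding stated in the corpus description.
SAMPLE = """<?xml version="1.0" encoding="ISO-8859-1"?>
<Trans>
<Episode>
<Section type="report" startTime="0" endTime="4.0">
<Turn speaker="spk1" startTime="0" endTime="4.0">
<Sync time="0"/>
bom dia
<Sync time="1.5"/>
a aula de hoje é sobre matrizes
</Turn>
</Section>
</Episode>
</Trans>
""".encode("iso-8859-1")


def read_turns(trs_bytes: bytes):
    """Return (speaker, start, end, words) for every Turn element."""
    root = ET.fromstring(trs_bytes)  # the XML declaration tells the parser the encoding
    turns = []
    for turn in root.iter("Turn"):
        # Sync markers are empty elements, so itertext() yields only the transcript text.
        words = "".join(turn.itertext()).split()
        turns.append((turn.get("speaker"),
                      float(turn.get("startTime")),
                      float(turn.get("endTime")),
                      words))
    return turns


if __name__ == "__main__":
    for speaker, start, end, words in read_turns(SAMPLE):
        print(speaker, start, end, len(words), "tokens:", " ".join(words))
```

Passing the raw bytes rather than a decoded string lets the parser honour the declared ISO-8859-1 encoding. Whitespace splitting is only an approximation of word tokens: a real file may contain additional event or comment elements that need separate handling.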

The TRS files have a total of 220K word tokens (Training set: 179K word tokens, Development set: 21K word tokens, Test set: 20K word tokens). The whole resource occupies 3.3 GB.

For a complete description of the corpus and the report of Automatic Speech Recognition results, the reader may refer to:
(Trancoso et al., 2008) Isabel Trancoso, Rui Martins, Helena Moniz, Ana Isabel Mata da Silva, Maria do Céu Guerreiro Viana Ribeiro, The LECTRA Corpus - Classroom Lecture Transcriptions in European Portuguese, In LREC 2008 - Language Resources and Evaluation Conference, Marrakesh, Morocco, May 2008.
(Pellegrini et al., 2012) Thomas Pellegrini, Helena Moniz, Fernando Batista, Isabel Trancoso, Ramon Fernandez Astudillo, Extension of the LECTRA corpus: classroom LECture TRAnscriptions in European Portuguese, In SPEECH AND CORPORA, Belo Horizonte, March 2012.
C-004598: CORAL Corpus
Desktop/Microphone
The CORAL corpus was collected in the framework of a national project sponsored by the PRAXIS XXI program, by a consortium formed by INESC, CLUL, FLUL (Faculdade de Letras da Universidade de Lisboa), and FCSH-UNL (Faculdade de Ciências Sociais e Humanas da Universidade Nova de Lisboa). The purpose of this project is the collection of a spoken dialogue corpus in European Portuguese, with several levels of labelling: orthographic, phonetic, phonological, syntactic and semantic.

- Linguistic Contents:
56 dialogues about a predetermined subject: maps. One of the participants (giver) has a map with some landmarks and a route drawn between them; the other (follower) also has landmarks, but no route, and consequently must reconstruct it. In order to elicit conversation, there are small differences between the two maps: one of the landmarks is duplicated in one map and single in the other; some landmarks are only present in one of the maps; and some have slightly different names in the two maps (e.g. curvas perigosas vs. troço sinuoso). In the 16 different maps, the names of the landmarks were chosen to allow the study of some connected speech phenomena:
o Sequences with /l/ favouring or not its velarization (e.g. sala malva, sal amargo)
o Sequences with /s/ in word final position followed by another coronal fricative (e.g. barcos salva-vidas)
o Sequences of plosives formed across word boundaries (e.g. clube de tiro)
o Sequences of obstruents formed within and across word boundaries (e.g. bairros degradados)

The last three items were designed to allow a more comprehensive study of consonant clusters formed within and across word boundaries and should, therefore, be jointly investigated.

- Number and Type of Speakers:
The original 32 speakers were divided into 8 quartets and, in each quartet, organized to take part in 8 dialogues. The available database contains 7 quartets, corresponding to 28 speakers. Given the reduced number of speakers, they were chosen to achieve an adequate balance of sexes, but were restricted in terms of age (under-graduate or graduate students) and accent (Lisbon area). Speakers were chosen in pairs who know each other, so that half of the conversations take place between "friends" and half between people who do not know each other.

- Data Collection:
The recordings take place in a sound proof room, with no visual contact between the speakers. They wear close-talking microphones and the recordings are made in stereo directly to DAT and later down-sampled to 16 kHz per channel. No monitoring is done once the dialogues start, after adjusting recording levels.

- Annotation:
Only orthographic transcription was done for the whole corpus. A pilot recording was annotated at several levels.
Four files per dialogue are provided:
a) two RAW files: audio file
b) two TRS files: containing the manual transcriptions. The TRS format is a kind of XML format that a standard transcription software such as Transcriber can open. Annotations in the TRS files are at word-level. They are fine-grained transcriptions that include disfluencies. The characters in the text files are encoded in ISO-8859-1 (Latin1).
The corpus consists of 112 TRS and corresponding WAV files, and contains about 57K word tokens. The disk size is about 1.5 MB for the TRS files and 1.2 GB for the WAV files.
C-004601: Nepali Spoken Corpus
Desktop/Microphone
The Nepali Spoken Corpus is one of the 3 resources that constitute the Nepali National Corpus. The Nepali National Corpus was produced in 2006 in the framework of the project Bhasha Sanchar (language communication), also known as Nelralec, for Nepali Language Resources and Localization for Education and Communication, funded by the EU Asia IT&C programme, reference number ASIE/2004/091-777.

The design of the Nepali Spoken Corpus (NSC) is based on the Goteborg Spoken Language Corpus (GSLC). The data are taken from spoken Nepali used in different social activities. The basic assumption of the NSC is that spoken language differs from written language and that, like written language, it has different genres.

NSC contains audio recordings from different social activities within their natural settings as much as possible, with phonologically transcribed and annotated texts, and information about the participants. A total of 17 types of activity were recorded. The total temporal duration of the recorded material is 31 hours and 26 minutes.

The description of the Nepali Spoken Corpus is provided below:
Recorded Activity types: 17
Recorded Activity occurrences (files): 115
Total time (duration): 31 hours 26 minutes
Total transcribed words (assumed): 260,000
Total transcribed files: 115
Completely checked: 115

As can be seen above, 115 activity occurrences have been recorded belonging to 17 activity types. For instance, the activity type shopping has four recorded occurrences and the activity type discussion has 16 recorded instances.
- isPartOf: Nepali National Corpus
C-004602: CLIPS_MT_MANUAL
Desktop/Microphone
CLIPS_MT_MANUAL is a sub-corpus of the original Italian CLIPS corpus (Corpora e Lessici dell'Italiano Parlato e Scritto). This corpus contains 3228 inspected and partially repaired WAV signal files, each containing one dialogue turn (*.wav), 3228 corrected original CLIPS annotation files (*.acs, *.phn, *.std, *.wrd), 3228 BAS Partitur files containing the annotation tiers ORT, KAN and SAP (*.par), 3228 EMU database annotation files (*.vot, *.hlb) covering 30 maptask dialogues performed by 30 speakers (each speaker pair performing two different map tasks) recorded in 15 different locations in Italy in 2000-2004.
C-004604: PortMedia French and Italian corpus
Telephone
The PortMedia French and Italian corpus was produced by ELDA, with the same paradigm and specifications as the MEDIA speech database (ELRA-S0272) but on a different domain.

The method chosen for the corpus construction process is that of a Wizard of Oz (WoZ) system. This consists of simulating a natural language man-machine dialogue. The scenario was built in the domain of touristic information and reservation (ticket reservation within the 2010 Festival d'Avignon for French and hotel reservation for Italian).

The corpus contains 700 transcribed dialogues from about 140 French speakers and 604 transcribed dialogues from about 150 Italian speakers (several dialogues per speaker).

The database is formatted following the SpeechDat conventions and it includes the following items:
- 700 recorded sessions for French and 604 sessions for Italian. The signals are stored in a stereo wave file format. Each of the two speech channels is recorded at 8 kHz with 16-bit quantization with the least significant byte first (lohi or Intel format) as signed integers.
- Manual transcription of each session in HTML format. Label files were created with the free transcription tool Transcriber (TRS files).
- A manual semantic annotation of the corpus. It has been produced with Semantizer, which is also provided with the data.
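
The signal format described above (stereo wave, 8 kHz, 16-bit signed samples, least significant byte first) can be read with the Python standard library alone. The sketch below is illustrative only: the split_stereo function is hypothetical, and the description does not state which channel carries which party in the dialogue, so the two channels are simply returned in file order. The demonstration builds a tiny in-memory file rather than using corpus data.

```python
import io
import struct
import wave


def split_stereo(wav_file):
    """Return (sample_rate, first_channel, second_channel) from a 16-bit stereo WAV."""
    with wave.open(wav_file, "rb") as w:
        if w.getnchannels() != 2 or w.getsampwidth() != 2:
            raise ValueError("expected a stereo file with 16-bit samples")
        rate = w.getframerate()
        raw = w.readframes(w.getnframes())
    # "<h" = little-endian (least significant byte first) signed 16-bit integer.
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    # Stereo frames are interleaved: ch0, ch1, ch0, ch1, ...
    return rate, list(samples[0::2]), list(samples[1::2])


def make_demo_wav() -> io.BytesIO:
    """Build a three-frame stereo WAV in memory (invented data, for demonstration)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(2)
        w.setsampwidth(2)
        w.setframerate(8000)
        w.writeframes(struct.pack("<6h", 1, -1, 2, -2, 3, -3))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    rate, first, second = split_stereo(make_demo_wav())
    print(rate, first, second)
```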
