Overview of SHACHI
The National Institute of Information and Communications Technology (NICT) and Nagoya University are jointly constructing SHACHI, a large-scale metadata database, in order to support the efficient development of language resources (LRs). The project extensively collects detailed metadata on LRs in Western and Asian countries, such as tag sets, formats, and recorded contents, and stores it systematically.
SHACHI contains metadata for more than 2,000 compiled language resources, such as corpora, dictionaries, thesauruses, and lexicons, forming a large-scale archive of language resource metadata. Its metadata set is an extended version of the OLAC metadata set, which conforms to Dublin Core, and the detailed metadata have been collected semiautomatically. To build and maintain such an archive, it is indispensable for us to work in cooperation with language resource consortia in Japan and abroad and to take the initiative in contributing to Asian language resources.
Purpose of the Construction of SHACHI
The purpose of the construction of the database is fourfold.
- To investigate the actual state of the tags and format types of language resources in Japan and abroad.
- To systematically obtain and store metadata on international language resources, based on the findings of that investigation. (This leads to the construction of a language resource ontology.)
- To investigate how language resources can be combined organically. (This leads to the strategic development of language resources.)
- To promote the distribution of language resources. (Tools such as Facet Search have been developed for this purpose.)
Metadata for some 2,000 resources has been collected in the database so far, and the collection will be enlarged by a further 3,000 by December 2009.
Additionally, this database differs from those of other language resource consortia in that all of our detailed metadata are entered manually. The database is notably characterized by its attempt to build an ontology of language resources throughout the world: the affinity between language resource types and the affinity between their tag sets are analyzed by applying natural language processing techniques to the detailed metadata. We expect that realizing this ontology will contribute to reducing research and development costs and to establishing a language resource infrastructure that meets present-day needs as an on-demand service.
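As a rough illustration of what analyzing the affinity between tag sets could look like, the sketch below scores each pair of resources by the Jaccard overlap of their tag sets. This is an assumption made for illustration: the text does not state which natural language processing techniques SHACHI applies, and the resource names and tags are hypothetical.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Return the share of tags common to both sets, out of all tags in either."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_affinity(sets: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Score every unordered pair of resources, keyed by sorted resource names."""
    return {
        (x, y): jaccard(sets[x], sets[y])
        for x, y in combinations(sorted(sets), 2)
    }


# Hypothetical tag sets; these are not taken from the SHACHI database.
tag_sets = {
    "corpus_a": {"noun", "verb", "adjective", "adverb"},
    "corpus_b": {"noun", "verb", "adjective", "particle"},
    "lexicon_c": {"headword", "pronunciation", "noun"},
}

affinity = pairwise_affinity(tag_sets)
for pair, score in affinity.items():
    print(pair, round(score, 3))
```

Pairs with a high score share most of their tags and would be natural candidates for grouping under a common node in an ontology. A real analysis would likely also need to normalize tag names across resources before comparing them.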
Design for Collecting Metadata
Among organizations that store and distribute language resources, several consortia fulfill this function: LDC, ELRA, OLAC, and Chinese-LDC in Western countries and China, and mainly GSK in Japan.
As for websites, two attempt to systematically amass metadata on language resources and promote their distribution: Language Technology World (LTW 2007), owned by Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI: http://www.dfki.de/lt//publications_show.php?id=148), and a page owned by OLAC (http://www.language-archives.org/).
To return the benefits of information processing technologies to society, it is highly desirable that research be carried out in cooperation among the various language resource consortia and be enhanced by the mutual exchange of information. SHACHI makes this possible because its metadata set extends the OLAC metadata set, which allows more detailed metadata to be collected while remaining in accordance with OLAC. OLAC is creating a worldwide virtual library of language resources by developing consensus on the best current practice for the digital archiving of language resources, and by developing a network of interoperating repositories and services. The OLAC metadata set is based on the complete Dublin Core metadata element set, part of which has been extended.
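The layering described above, in which Dublin Core elements form the base and further fields are added on top, can be sketched as follows. The fifteen element names are those of the Dublin Core Metadata Element Set. The extension field names (tag_set, annotation_level) and the sample record are hypothetical, since the actual SHACHI element names are not given here, and a flat dictionary is a simplification of the XML form in which such metadata is normally exchanged.

```python
# The fifteen elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = frozenset({
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
})


def split_record(record: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Separate a flat record into Dublin Core fields and extension fields."""
    core = {k: v for k, v in record.items() if k in DUBLIN_CORE_ELEMENTS}
    extension = {k: v for k, v in record.items() if k not in DUBLIN_CORE_ELEMENTS}
    return core, extension


# Hypothetical record; the values and extension field names are illustrative only.
record = {
    "title": "Example Spoken Corpus",
    "language": "ja",
    "type": "corpus",
    "format": "text/xml",
    "tag_set": "part-of-speech",           # hypothetical extension field
    "annotation_level": "morphological",   # hypothetical extension field
}

core, extension = split_record(record)
print("Dublin Core fields:", sorted(core))
print("Extension fields:", sorted(extension))
```

Keeping the base elements separate from the extensions means that a service which understands only Dublin Core can still read the core part of a record, while the extension fields carry the more detailed information.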
Specification of SHACHI
This specification may be revised in the future.