Access to resources and services
One of the goals of Linguateca is to significantly improve the conditions for NLP of Portuguese.
This page links the main resources, services or programs developed under the scope of Linguateca.
Main goals of the AC/DC project
- make existing resources more widely available
- foster development and public availability of others
- provide programs to get corpora on-the-fly on the Internet
- create sufficiently big corpora that can be used as a reference
- make Portuguese corpus processing in general easier
- create publicly available programs that can be reused by other researchers or developers
The corpora were annotated with Eckhard Bick's PALAVRAS parser, from the VISL project.
CETEMPúblico (Corpus de Extractos de Textos Electrónicos MCT/Público) is a corpus containing some 180 million words in European Portuguese, built by the project Computational Processing of Portuguese following an agreement between the Portuguese Ministry for Science and Technology (MCT) and the newspaper PÚBLICO.
CETENFolha (Corpus de Extractos de Textos Electrónicos NILC/Folha de São Paulo) is a corpus containing some 24 million words in Brazilian Portuguese, built by the project Computational Processing of Portuguese from the texts of Folha de S. Paulo belonging to the corpus NILC/São Carlos, compiled by Núcleo Interinstitucional de Lingüística Computacional (NILC).
A Portuguese-English parallel corpus project, including a novel interface, DISPARA, in collaboration with Ana Frankenberg-Garcia.
COMPARA is an open-ended collection of Portuguese-English and English-Portuguese translations. One can use COMPARA to find out how translators have translated words and expressions from Portuguese into English and from English into Portuguese.
Corpógrafo was created by the CLUP/FLUP node of Linguateca to facilitate the creation of specialized, "do-it-yourself" corpora. The system offers text preprocessing, terminology extraction and help in defining concepts. A toolbox is provided that allows users to manage their own texts and terminological databases.
Esfinge is a general-domain question answering system that answers questions in Portuguese based on the Web.
This project, in collaboration with the VISL project, aims to create a manually revised, syntactically annotated treebank for Portuguese, to advance computational syntax and to create a resource for future evaluation tasks of tools for Portuguese.
- provide one place where access to all corpora is given
- further improve the information associated with these corpora
- develop a good user interface
METRA is a meta translator: a service that submits a piece of text to several different commercial translation engines on the Web and presents the results together. It handles the English-Portuguese and Portuguese-English translation pairs.
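The meta-translation idea described above can be illustrated with a minimal sketch. This is not METRA's actual implementation: the engine functions, the `meta_translate` name and the pair codes `en-pt` and `pt-en` are all hypothetical, and the engines are local stubs rather than real Web services. It only shows the pattern of sending one text to several engines and collecting the results side by side.

```python
from typing import Callable, Dict

# Hypothetical pair codes for the two directions mentioned in the text.
SUPPORTED_PAIRS = {"en-pt", "pt-en"}

# Hypothetical local stubs standing in for remote translation engines.
def engine_a(text: str, pair: str) -> str:
    return f"[A:{pair}] {text}"

def engine_b(text: str, pair: str) -> str:
    return f"[B:{pair}] {text}"

def meta_translate(
    text: str,
    pair: str,
    engines: Dict[str, Callable[[str, str], str]],
) -> Dict[str, str]:
    """Submit the same text to every engine and return the results together."""
    if pair not in SUPPORTED_PAIRS:
        raise ValueError(f"unsupported translation pair: {pair}")
    results = {}
    for name, engine in engines.items():
        try:
            results[name] = engine(text, pair)
        except Exception as exc:
            # One failing engine should not hide the others' output.
            results[name] = f"<failed: {exc}>"
    return results

if __name__ == "__main__":
    engines = {"A": engine_a, "B": engine_b}
    for name, output in meta_translate("good morning", "en-pt", engines).items():
        print(f"{name}: {output}")
```

Keeping each engine behind the same two-argument callable means engines can be added or removed without touching the aggregation loop.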
PAPEL is a dictionary-based lexical ontology for Portuguese, derived from Porto Editora's Dicionário da Língua Portuguesa and developed mainly at the Coimbra node of Linguateca. It will be made publicly available.
REPENTINO is a repository of textual named entity instances, i.e. a set of proper nouns, each denoting a specific entity and written in Portuguese with at least one capital letter, classified according to the kind of entity they denote (e.g., company, book title, place name). REPENTINO is organized in several major categories, in turn subdivided into subcategories.
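As a rough illustration of the two-level organization described above, the sketch below stores entity instances under a category and a subcategory. It does not reflect REPENTINO's real format or category names: the `EntityRepository` class and the labels `organization`, `location`, `company` and `place name` are assumptions made only for this example.

```python
from collections import defaultdict
from typing import List, Tuple

class EntityRepository:
    """Hypothetical store: category -> subcategory -> set of entity names."""

    def __init__(self) -> None:
        self._entries = defaultdict(lambda: defaultdict(set))

    def add(self, name: str, category: str, subcategory: str) -> None:
        # Mirror the criterion in the text: at least one capital letter.
        if not any(ch.isupper() for ch in name):
            raise ValueError(f"not a capitalized instance: {name!r}")
        self._entries[category][subcategory].add(name)

    def classify(self, name: str) -> List[Tuple[str, str]]:
        """Return every (category, subcategory) under which the name is stored."""
        return sorted(
            (category, subcategory)
            for category, subcategories in self._entries.items()
            for subcategory, names in subcategories.items()
            if name in names
        )

if __name__ == "__main__":
    repo = EntityRepository()
    repo.add("Porto Editora", "organization", "company")
    repo.add("Lisboa", "location", "place name")
    print(repo.classify("Lisboa"))
```

A name may appear under more than one subcategory, which is why `classify` returns a list rather than a single label.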
This space provides a kind of electronic Web shelf for all NLP resources for Portuguese that people want us to make available. We give access to IR collections, MT lexicons and corpora of summaries, among others.
WebJspell is a Web interface to Jspell, a morphological analyser and spell checker developed by Natura for Portuguese and English. Through WebJspell it is also possible to spellcheck entire Web pages by simply submitting their URL, and to propose new entries for the dictionaries. WebJspell was created by the Braga node of Linguateca.
The WPT 03 is a collection of Web pages created from a crawl of the entire Portuguese Web in the year 2003. As far as we know, the WPT 03 is the first and only freely available research collection that spans the entire Web of a country.
The WPT 03 is the result of a Web crawl made between March and June of 2003 by the crawlers of tumba!, a Web search engine for the Portuguese community. In addition, the log of the queries to tumba! in the period from 1st October 2003 is also provided, after having been run through an anonymization procedure. The WPT 03 was created by the XLDB group and made available here.
See also the Language Resource Catalog.
Last update: 18 March 2010