
ATILF
12 Projects, page 1 of 3
Project, from 2022
Partners: Laboratoire d'Informatique Fondamentale et Appliquée de Tours; LLF; ATILF; AMU; CNRS; UTLN; Laboratoire Interdisciplinaire des Sciences du Numérique; UL; University of Paris; INSHS; Laboratoire d'informatique et des systèmes
Funder: French National Research Agency (ANR)
Project Code: ANR-21-CE23-0033
Funder Contribution: 678,191 EUR

Despite great enthusiasm for deep learning in NLP, concern is rising about its limitations. First, neural models are often black boxes, and their behavior is hard to interpret. Second, benchmark-based evaluation overlooks biases, questioning the robustness and coverage of the resulting generalisations, yielding a landscape of overall diversity. The goal of the SELEXINI project is to address these issues by developing **weakly supervised methods to induce semantic lexicons** from raw corpora, which will then be **seamlessly integrated with semantic text processing models**. Lexical units are seen as useful abstractions that allow representing complex phenomena (e.g. polysemy, similarity, multiword units) associated with interpretable labels, avoiding the overhead and opaqueness of contextualized embeddings (one vector per occurrence). Moreover, our lexicon will combine continuous data (embeddings, clusters) and symbolic data (labels). We will model single and multiword units, their senses, and their semantic frames (arguments, roles). Hence, we propose a new "by-construction" view on interpretability, which can be seen as an alternative to methods trying to dissect complex neural models. For extrinsic evaluation of interpretability and diversity, the induced lexicon will be integrated into standard deep learning models in downstream tasks requiring semantic information: machine reading comprehension and multiword expression identification.
We will develop an experimental protocol to assess the lexicon-corpus complementarity on diverse linguistic phenomena, and to assess the lexicon's usefulness for non-expert end users requiring interpretable results. We expect that this original approach will increase both the interpretability of models and the coverage of diverse phenomena (e.g. rare/unseen items in training data).

Project, from 2014
Partners: Centre de Recherche Moyen-Orient et Méditerranée; ATILF; MSH; UL; CNRS; INSHS; Maison des Sciences de l'Homme Lorraine
Funder: French National Research Agency (ANR)
Project Code: ANR-13-BSH3-0009
Funder Contribution: 239,948 EUR

In the ninth century, the rich Arabic tradition of adab found its way to Spain, to al-Andalus, which then played a central role in the exchange of knowledge from the Orient; this knowledge was in turn relayed to the West by monasteries in the north of the Iberian Peninsula in the 11th and 12th centuries. In al-Andalus, adab literature met the Jewish sapiential tradition of midrashic literature. New collections were composed, including original works in the 10th and 11th centuries, and from the 12th century on, exempla and philosophers’ sayings were translated into Hebrew, Latin, and Romance languages. Much of this complex heritage is found in the extensive Spanish paremiological literature, which reached its height in the 16th and 17th centuries, and in current Spanish, Judeo-Spanish and Maghrebian collections of proverbs. Although the main lines of these exchanges are known, we lack specific information on the circulation of these short sapiential statements (our basic research unit), on the successive choices made by the translators, on cultural reinterpretations, and on the weight of one borrowing relative to another. If sapiential textual filiations and translation sequences should be treated cautiously, this is particularly true for the sapiential statements contained in these texts.
Because they are difficult to interpret, these volatile elements, whose categorization varies with the period and culture considered, have never been the subject of a comprehensive textual study recounting their sources, circulation and evolution across the spoken and written languages of the three cultures of the Iberian Peninsula during the Middle Ages. Paremiological studies have mainly produced compilations of proverbs (thesauri), editions, and erudite studies dedicated to a single work, language or culture, with the exception of D. Gutas’ groundbreaking work on the Philosophical Quartet (1975). The few existing databases cover contemporary corpora of “paremiae”, most often monolingual or approached from a translation studies perspective. The aim of the ALIENTO project is therefore to compute matches, including partial ones, and close or distant connections, in order to reassess intertextual relations by comparing a large quantity of data and cross-referencing encoded texts written in different languages. This is why the project, which requires close interdisciplinary collaboration between computational researchers (ATILF) and linguists and literary specialists (MSH Lorraine, INALCO and the international network of collaborators), will develop software transferable to other similar texts, using a large reference corpus of 8 related texts that circulated in the Iberian Peninsula (in Latin, Arabic, Hebrew, Spanish and Catalan), representing 582 pages and an estimated 9,570 sapiential statements. The software will extract and connect brief sapiential units through matches generated from a purpose-built encoding system documented in an XML-TEI encoding manual.
The choice and type of annotations result from collaborative reflection among the project members (specialists in linguistic paremiology and ancient texts, design engineers of textual databases, and computational researchers) during dedicated scientific sessions. The annotation scheme will evolve collaboratively during the matching processes. At the end we will have:
- a multilingual corpus of texts, digitized, tagged in XML/TEI and publicly accessible, linked to a set of data on each text and its author;
- a set of brief sapiential units with their XML/TEI annotations, accessible free of charge;
- a trilingual query interface that displays the matched statements contained in these works, with information that can be used to study them regardless of the language;
- an encoding methodology and data-matching software transferable to other similar corpora.

Project, from 2023
Partners: CNRS; INSHS; Laboratoire d'Ecologie, Systématique et Evolution; UL; Université de Paris; ATILF
Funder: French National Research Agency (ANR)
Project Code: ANR-22-CE38-0002
Funder Contribution: 412,460 EUR

The CODIM project focuses on the two main linguistic resources for organizing monologues or conversations in human languages: discourse markers (DMs), such as therefore/donc or well/ben, bon in English/French, and prosody (in particular intonation). It will evaluate their status with respect to two major views on communication: compositionality (the possibility of combining meaningful expressions into more complex meaningful expressions) and pattern- or construction-based approaches (the idea that language users exploit partly ‘frozen’ strings of words). We will compare the semantic and prosodic properties of simple and complex French DMs (e.g. ah + bon) found in corpora of written and spoken French, using a variety of complementary approaches for DM identification (category-driven text mining), clustering (statistics and machine learning) and research in prosody (ToBI representation, speech analysis/synthesis). This will foster or reinforce strong collaborations between linguists and computer scientists.

Partners: UL; INSHS; CNRS; EHESS; ATILF; CAK; MNHN
Funder: French National Research Agency (ANR)
Project Code: ANR-12-CORP-0017
Funder Contribution: 240,000 EUR

This project starts from the consideration that it is time to combine inquiry into a scientific corpus with information technology research tools. Until now, information technology infrastructures for scientific corpora have mainly been built to make images and transcriptions of the texts available. AMPERE2014 starts from the existing electronic resource “@Ampère et l’histoire de l’électricité” (www.ampere.cnrs.fr) and intends to exploit Ampère’s corpus, which is impressive in both quality and quantity. Ampère wrote thousands of pages and discussed the most important scientific subjects of the first third of the nineteenth century, proposed a vast inquiry into philosophy, and analyzed both knowledge and its creative process. The main aim of AMPERE2014 is to strengthen analysis of and research on Ampère’s corpus through IT applications. The project is designed to analyze, compare and connect elements across the different texts of Ampère’s corpus (publications, manuscripts, correspondence, private writings). The process will run from indexing, through the production of semantic interrelations, to actual research: a real synthesis between scholarship and information technology, which is in fact a novelty in the history of science and in inquiry into scientific corpora.

Project, from 2024
Partners: INSHS; Centre de traitement automatique du langage, Louvain-La-Neuve; University of Strasbourg; AMU; CNRS; ATILF; LPL; UL; Linguistique, Langues et Parole (EA 1339 - UR 1339 depuis 01.01.2020)
Funder: French National Research Agency (ANR)
Project Code: ANR-23-CE38-0007
Funder Contribution: 536,351 EUR

Classes of language learners are very often heterogeneous in level, and handling this heterogeneity is a major problem for language teachers, who should provide personalised resources to each learner. The STAR-FLE project therefore aims to propose innovative digital solutions from the Natural Language Processing (NLP) area that may improve the text comprehension of French L2 learners and help teachers handle multiple learner levels. We propose context-based aids for understanding lexical difficulties as well as multiword expressions (MWEs) found in original texts. Our system provides MWE identification, generation of definitions addressed to a specific learner’s profile, synonym search, word sense disambiguation, and the possibility to choose simpler synonyms for better comprehension of a text. In addition, we build original NLP resources such as an annotated CEFR corpus and lexicons, and an MWE-annotated corpus.