The report presents a text mining method applied to scientific publications at scale and implemented at the national level to track the production of research datasets and software across publications. The results obtained for these research outputs enrich the French Open Science Monitor (Baromètre de la science ouverte).

 

 

Large-scale Machine-Learning analysis of scientific PDF for monitoring the production and the openness of research data and software in France

Aricia Bassinet (Université de Lorraine)
Laetitia Bracco (Université de Lorraine)
Anne L’Hôte (Ministère de l’enseignement supérieur et de la recherche)
Eric Jeangirard (Ministère de l’enseignement supérieur et de la recherche)
Patrice Lopez (science-miner)
Laurent Romary (Inria)

June 2023

Read the report on HAL

There is currently no standard way to reference research datasets and software in scientific communication. Emerging editorial workflows and supporting infrastructures dedicated to research datasets and software are still poorly adopted in current publishing practices and are highly fragmented. To better follow the production of research datasets and software, we present a text mining method applied to scientific publications at scale and implemented at the French national level. Our approach relies on state-of-the-art Machine Learning and document engineering techniques to ensure reliable accuracy across multiple research areas and document types. The annotations produced by our system are used by the French Open Science Monitor (BSO) platform to follow the production and the openness of research datasets and software in the context of the second French National Plan for Open Science. The source code and the data of the French Open Science Monitor, as well as the associated tools and training data, are all available under open licenses.

 

1. Introduction

1.1     Motivations

1.2     The French Open Science Monitor

1.3     Quality criteria for Open Science indicators

1.4     Existing Open Science indicators for research datasets and software

2. Identification of countrywide research publications

3. Full text harvesting

4. Machine Learning for mention detection and characterization

4.1     Advantages of text mining publications

4.2     Limitation of rule-based tools

4.3     Machine Learning for mention detection and characterization

4.4     Research software

4.5     Research datasets

4.6     Characterization of mention contexts

4.7     Recognition of availability statements

4.8     Architecture of the mention recognizers

5. Application to the French Open Science Monitor

5.1     Monitoring indicators

5.2     Infrastructure and runtime

5.3     Dataset and software mention extraction

5.4     French Open Science Monitor indicators and dashboards

5.5     Local versions of the Open Science Monitor

6. Limitations and future work

7. Conclusion

References