Data Management Plan

Navigating the Intersection of Open Science and Ethical Responsibility

Effective data management underpins the integrity of research in digital humanities and artificial intelligence. The URRACA project, which aims to transform historical research through AI, follows the FAIR principles (Findable, Accessible, Interoperable, Reusable) while maintaining ethical standards. This Data Management Plan describes how we will manage the project’s diverse data, balancing the ethos of open science with the ethical responsibilities that come with AI development and historical research.

Types of Data/Research Outputs

The URRACA project will generate a diverse range of data types:

Textual data from manuscripts, books, journals, and databases
Numeric and statistical data from archaeological, geographic, and spatial databases
Visual and 3D data from paintings, photographs, and images of artifacts and archaeological sites
3D objects generated from various 3D modeling and virtual reality software

The entire dataset is estimated at 10-20 terabytes.

Findability

Data will be assigned persistent, unique identifiers such as Digital Object Identifiers (DOIs). We will use trusted repositories, including those federated in the European Open Science Cloud (EOSC), to ensure long-term storage and findability. DOIs can also be assigned to AI-generated analyses and datasets so that they remain accessible and citable.

For instance, consider an AI-generated analysis that identifies patterns of cultural exchange in medieval Iberia from textual and archaeological data. Once the analysis is complete and peer-reviewed, it can be deposited in a trusted repository registered with EOSC, which assigns it a unique DOI. The DOI serves as a permanent link that can be cited in academic papers, so future researchers can find and reference the analysis. It also supports versioning: if the analysis is updated or corrected, the new version can be linked to the original DOI. This practice makes the AI-generated work easier to find and cite, in line with the FAIR principles.
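As an illustration of the versioning described above, the following Python sketch builds two minimal metadata records and links the corrected version to the original. The field names are loosely modeled on the DataCite metadata schema, which defines the IsNewVersionOf relation type. The DOIs, titles, publisher, year, and the helper functions make_record and new_version are hypothetical placeholders, not part of the URRACA plan or of any repository’s API.

```python
from copy import deepcopy

# Fields treated as mandatory in this sketch (modeled on DataCite's required properties).
REQUIRED_FIELDS = ("doi", "title", "creators", "publisher", "publication_year", "resource_type")


def make_record(**fields):
    """Return a metadata record, rejecting it if a required field is missing or empty."""
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError(f"missing required metadata: {', '.join(missing)}")
    record = dict(fields)
    record.setdefault("version", "1.0")
    record.setdefault("related_identifiers", [])
    return record


def new_version(original, doi, version):
    """Copy a record under a new DOI and point it back to the original record."""
    updated = deepcopy(original)
    updated["doi"] = doi
    updated["version"] = version
    updated["related_identifiers"] = [
        {"relation_type": "IsNewVersionOf", "identifier": original["doi"]}
    ]
    return updated


# Placeholder values only: 10.1234 is not a real URRACA DOI prefix.
analysis_v1 = make_record(
    doi="10.1234/urraca.example.1",
    title="Patterns of cultural exchange in medieval Iberia (AI-generated analysis)",
    creators=["URRACA project"],
    publisher="Example repository",
    publication_year=2024,
    resource_type="Dataset",
)
analysis_v2 = new_version(analysis_v1, doi="10.1234/urraca.example.2", version="1.1")
print(analysis_v2["related_identifiers"])
```

In this sketch each version receives its own DOI and the newer record points back to the older one. That is one common way for repositories to handle versions, and it is an assumption here rather than a documented URRACA workflow.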

Accessibility

Data will be made openly accessible in phases, subject to ethical requirements, AI safety protocols, and Intellectual Property Rights (IPR) considerations. Because AI-generated analyses can be misused or misinterpreted, safety measures will be applied before any release. Restricted data, particularly AI algorithms and models, will be available for third-party verification under controlled conditions that comply with established safety and ethical guidelines. This approach protects the integrity of AI-generated data and ensures that it is handled responsibly.
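The phased release described above can be pictured as a simple decision rule. The Python sketch below is only an illustration under assumed criteria: the three review flags, the ReleaseCandidate type, and the rule that all reviews must pass before open release are hypothetical and do not represent the project’s actual procedure.

```python
from dataclasses import dataclass
from enum import Enum


class AccessLevel(Enum):
    OPEN = "open"              # released publicly
    RESTRICTED = "restricted"  # third-party verification under controlled conditions


@dataclass(frozen=True)
class ReleaseCandidate:
    name: str
    ethics_cleared: bool
    ipr_cleared: bool
    ai_safety_cleared: bool


def access_level(item: ReleaseCandidate) -> AccessLevel:
    """Open an item only when every review has cleared it (assumed rule)."""
    if item.ethics_cleared and item.ipr_cleared and item.ai_safety_cleared:
        return AccessLevel.OPEN
    return AccessLevel.RESTRICTED


# Hypothetical items, for illustration only.
transcriptions = ReleaseCandidate("manuscript transcriptions", True, True, True)
model = ReleaseCandidate("trained AI model", True, True, False)

print(access_level(transcriptions).value)  # open
print(access_level(model).value)           # restricted
```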

Interoperability

Data will be stored in standard formats to ensure interoperability. Metadata will adhere to standards drawn from the FAIRsharing portal, ELIXIR, CESSDA, and DARIAH:

FAIRsharing: an online platform that provides information and resources on data standards, databases, and policies across scientific disciplines to promote FAIR data
ELIXIR: a European research infrastructure that coordinates computational resources, tools, and services for life sciences research
CESSDA (Consortium of European Social Science Data Archives): a European infrastructure that networks social science data archives and offers research data and related services
DARIAH (Digital Research Infrastructure for the Arts and Humanities): a European research infrastructure that supports digitally enabled research and teaching across the arts and humanities

Reusability

Data will be released under licenses that promote sharing and reuse, such as Creative Commons licenses for textual data and Open Data Commons licenses for numerical data. Tools and models will also be made openly available once AI safety requirements have been met.
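A minimal Python sketch of how this licensing rule could be recorded alongside the data follows. It maps only the two data types named above to their license families. The function name license_family and the "to be decided" fallback are hypothetical, and no specific license or version is implied.

```python
# License families named in the plan; specific licenses and versions are not chosen here.
LICENSE_FAMILY = {
    "textual": "Creative Commons",
    "numerical": "Open Data Commons",
}


def license_family(data_type: str) -> str:
    """Return the license family for a data type, or flag it as undecided."""
    return LICENSE_FAMILY.get(data_type.strip().lower(), "to be decided")


print(license_family("Textual"))  # Creative Commons
print(license_family("3D"))       # to be decided
```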

European Open Science Cloud (EOSC)

Integrating the European Open Science Cloud (EOSC) into the URRACA project supports the project’s objectives of responsible data management and open science. Storing URRACA’s datasets, from textual archives to 3D models, in EOSC’s trusted repositories keeps them aligned with the FAIR principles, which strengthens the project’s credibility and makes data easier to discover and share across disciplines. EOSC’s data management tools and guidelines can help URRACA maintain a data management plan that complies with European and international standards, which is particularly important for a project dealing with sensitive historical and societal data. EOSC also offers long-term data preservation services, so the project’s contributions to historical research remain accessible and reusable for future scholarship. Finally, EOSC’s support for open access publishing matches URRACA’s commitment to making its research findings widely accessible while adhering to ethical considerations.

Additional Resources

We will consult the indicators of the Research Data Alliance (RDA) FAIR Data Maturity Model Working Group, use the DMPonline and Argos tools for developing DMPs, and refer to the Science Europe Practical Guide to the International Alignment of Research Data Management.

Curation and Storage Costs

The URRACA project allocates €70,000 to data curation and storage. The budget covers the following items:

Secure cloud storage for the project’s datasets, from textual archives to 3D models
Backup and redundancy solutions to mitigate the risk of data loss
Licenses for specialized database software to manage and query complex datasets
Encryption protocols for sensitive data
Personnel costs, which make up a significant share: a dedicated data manager and the technical support staff who maintain the infrastructure
Data curation, including metadata creation and data validation, to ensure the dataset’s quality and findability
Compliance measures, including necessary certifications and security audits, to align the project with legal and ethical standards
Staff training in data management best practices
Legal and ethical consultations
A contingency reserve for unforeseen expenditures

Together, these provisions support both the FAIR principles and the project’s ethical guidelines, balancing open science with responsible data management.

Data Management Team

A dedicated team will oversee data management, working closely with the project’s ethics board to ensure compliance with FAIR principles and ethical standards.

By adhering to these data management practices, the URRACA project aims to balance open science with its ethical responsibilities, in line with the guidelines and resources provided by the European Commission and EOSC.