RO-Crate Metadata
Table of contents
- RO-Crate uses Linked Data principles
- Common principles for RO-Crate entities
- Base metadata standard: Schema.org
- Additional metadata standards
- Summary of Coverage
- Future coverage
- Recommended Identifiers
RO-Crate aims to capture and describe the Research Object using structured metadata.
The RO-Crate Metadata Descriptor contains the metadata that describes the RO-Crate and its content, in particular:
- Root Data Entity - the RO-Crate Dataset itself, a gathering of data
- Data Entities - the data payload, in the form of files and folders
- Contextual Entities - related things in the world (e.g. people, organizations, places), providing provenance for the data entities and the RO-Crate.
This machine-readable metadata can also be represented for human consumption in the RO-Crate Website, linking to data and Web resources.
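To make the three kinds of entity concrete, the following is a minimal sketch of a metadata document built as a Python dictionary. Every identifier, file name and name value is made up for illustration. The descriptor entity (its `ro-crate-metadata.json` identifier, `conformsTo` and `about` properties) and the 1.1 context URL are assumptions drawn from the wider RO-Crate specification rather than from this section; substitute the version you target.

```python
import json

# All identifiers, names and the context version here are illustrative assumptions.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {   # Metadata descriptor: points at the Root Data Entity
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # Root Data Entity: the RO-Crate Dataset itself
            "@id": "./",
            "@type": "Dataset",
            "name": "Example measurements",
            "author": {"@id": "https://orcid.org/0000-0002-1825-0097"},
            "hasPart": [{"@id": "data/readings.csv"}],
        },
        {   # Data Entity: a file in the payload
            "@id": "data/readings.csv",
            "@type": "File",
            "name": "Sensor readings",
        },
        {   # Contextual Entity: a person related to the data
            "@id": "https://orcid.org/0000-0002-1825-0097",
            "@type": "Person",
            "name": "Josiah Carberry",
        },
    ],
}


def find_root(crate):
    """Return the Root Data Entity by following the descriptor's `about` reference."""
    by_id = {entity["@id"]: entity for entity in crate["@graph"]}
    descriptor = by_id["ro-crate-metadata.json"]
    return by_id[descriptor["about"]["@id"]]


root = find_root(crate)
print(json.dumps(root, indent=2))
```

The entities sit side by side in one flat `@graph` list and refer to each other only through `{"@id": ...}` objects, which is why a simple dictionary keyed by `@id` is enough to navigate the crate.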
RO-Crate uses Linked Data principles
RO-Crate makes use of the Linked Data principles for its description. In particular:
- (Meta)data should be made available as Open Data on the web.
- (Meta)data should be machine-readable in a structured format.
- (Meta)data should not require proprietary software packages.
- (Meta)data should use open standards from W3C, such as RDF and SPARQL.
- (Meta)data should link to other people’s data to provide context, using URIs as global identifiers
RO-Crate realizes these principles using a particular set of technologies and best practices:
- The RO-Crate Metadata Document can be stored in an RO-Crate Metadata File. The RO-Crate Metadata File and an RO-Crate Website can be directly published on the web together with the RO-Crate payload. In addition, a data package (e.g. BagIt Zip archive) that contains the RO-Crate can also be published on the web.
- The RO-Crate Metadata Document is based on the structured data format JSON-LD.
- Multiple open source tools/libraries are available for JSON and for JSON-LD.
- The RO-Crate Website is HTML 5, and the RO-Crate Metadata Document is JSON-LD, one of the W3C RDF 1.1 formats.
- The RO-Crate Metadata Document reuses common vocabularies like Schema.org, and this specification recommends identifiers it should link to.
Common principles for RO-Crate entities
For all entities listed in an RO-Crate Metadata Document the following principles apply:
- The entity MUST have a @id (see Describing entities in JSON-LD).
- The entity MUST have a @type, which MAY be an array.
- The @type SHOULD include at least one Schema.org type that accurately describes the entity. Thing or CreativeWork are valid fallbacks if no alternative external or ad-hoc term is found (see Extending RO-Crate).
- The entity SHOULD have a human-readable name, in particular if its @id does not resolve to a human-readable Web page.
- The properties used on the entity SHOULD be applicable to the @type (or superclass) according to their definitions. For instance, the property publisher can be used on a Dataset as it applies to its superclass CreativeWork.
- Property references to other entities (e.g. author property to a Person entity) SHOULD use the { "@id": "..."} object form (see JSON-LD appendix).
- The entity SHOULD be ultimately referenceable from the root data set (possibly through another reachable data or contextual entity).
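Several of these principles can be checked mechanically. The sketch below is a hypothetical helper, not part of any RO-Crate library: it reports entities that lack an @id, @type or name, and entities that cannot be reached from the root by following `{"@id": ...}` references. It assumes the root is identified by `./` and does not attempt to judge whether properties suit the @type.

```python
def check_entities(graph, root_id="./"):
    """Return a list of warnings for a flattened @graph list of entities."""
    problems = []
    by_id = {}
    for entity in graph:
        if "@id" not in entity:
            problems.append(f"MUST: entity without @id: {entity!r}")
            continue
        by_id[entity["@id"]] = entity
        if "@type" not in entity:
            problems.append(f"MUST: {entity['@id']} has no @type")
        if "name" not in entity:
            problems.append(f"SHOULD: {entity['@id']} has no name")

    def references(value):
        """Yield every @id referenced from a property value."""
        if isinstance(value, dict):
            if "@id" in value:
                yield value["@id"]
        elif isinstance(value, list):
            for item in value:
                yield from references(item)

    # Walk outwards from the root, following references between entities.
    seen = set()
    stack = [root_id]
    while stack:
        current = stack.pop()
        if current in seen or current not in by_id:
            continue
        seen.add(current)
        for key, value in by_id[current].items():
            if not key.startswith("@"):
                stack.extend(references(value))

    for entity_id in by_id:
        if entity_id not in seen:
            problems.append(f"SHOULD: {entity_id} is not reachable from {root_id}")
    return problems


example_graph = [
    {"@id": "./", "@type": "Dataset", "name": "Root", "author": {"@id": "#alice"}},
    {"@id": "#alice", "@type": "Person", "name": "Alice"},
    {"@id": "#orphan", "@type": ["Thing"]},
]
for warning in check_entities(example_graph):
    print(warning)
```

An entity that describes the metadata file itself points at the root rather than being referenced by it, so a fuller checker would probably need to exempt it from the reachability test.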
Base metadata standard: Schema.org
Schema.org is the base metadata standard for RO-Crate. Schema.org was chosen because it is widely used on the World Wide Web and supported by search engines, on the assumption that discovery is likely to be maximized if search engines index the content.
As far as we know there is no alternative, well-maintained linked-data schema for research data with the coverage needed for this project - i.e. a single standard for expressing all the examples presented in this specification.
RO-Crate relies heavily on Schema.org, using a constrained subset of JSON-LD, and this specification gives opinionated recommendations on how to represent the metadata using existing linked data best practices.
The main principle of RO-Crate is to use a Schema.org term whenever possible, even if its official definition may seem broad or related to everyday objects. For instance, IndividualProduct can describe scientific equipment and instruments (see Provenance of entities). RO-Crate implementers are free to use additional properties and types beyond this specification (see also the appendix Extending RO-Crate).
Differences from Schema.org
Generally, the standard type and property names (terms) from Schema.org should be used. However, RO-Crate uses variant names for some elements, specifically:
- File is mapped to http://schema.org/MediaObject which was chosen as a compromise as it has many of the properties that are needed to describe a generic file. Future versions of Schema.org or a research data extension may re-define File.
- Journal is mapped to http://schema.org/Periodical.
JSON-LD examples given on the Schema.org website may not be in flattened form; any nested entities in RO-Crate JSON-LD SHOULD be described as separate contextual entities in the flat @graph list.
To simplify processing and avoid confusion with string values, the RO-Crate JSON-LD Context requires URIs and entity references to be given in the form "author": {"@id": "http://example.com/alice"}, even where Schema.org for some properties otherwise permits shorter forms like "author": "http://example.com/alice".
See the appendix RO-Crate JSON-LD for details.
Additional metadata standards
RO-Crate also uses the Portland Common Data Model (PCDM version https://pcdm.org/2016/04/18/models) to describe repositories or collections of digital objects and imports these terms:
- RepositoryObject mapped to http://pcdm.org/models#Object
- RepositoryCollection mapped to http://pcdm.org/models#Collection
- RepositoryFile mapped to http://pcdm.org/models#File
- hasMember mapped to http://pcdm.org/models#hasMember
- hasFile mapped to http://pcdm.org/models#hasFile
The terms RepositoryObject and RepositoryCollection are renamed to avoid collision between other vocabularies and the PCDM terms Collection and Object. The term RepositoryFile is renamed to avoid a clash with RO-Crate’s File mapping to http://schema.org/MediaObject.
RO-Crate uses the Profiles Vocabulary to describe profiles using these terms:
- ResourceDescriptor mapped to http://www.w3.org/ns/dx/prof/ResourceDescriptor
- ResourceRole mapped to http://www.w3.org/ns/dx/prof/ResourceRole
- Profile mapped to http://www.w3.org/ns/dx/prof/Profile
- hasArtifact mapped to http://www.w3.org/ns/dx/prof/hasArtifact
- hasResource mapped to http://www.w3.org/ns/dx/prof/hasResource
- hasRole mapped to http://www.w3.org/ns/dx/prof/hasRole
From Dublin Core Terms RO-Crate uses:
- conformsTo mapped to http://purl.org/dc/terms/conformsTo
- Standard mapped to http://purl.org/dc/terms/Standard
From the IANA link relations registry:
- cite-as mapped to http://www.iana.org/assignments/relation/cite-as (defined by RFC 8574)
These terms are being proposed by the Bioschemas profiles ComputationalWorkflow 1.0-RELEASE and FormalParameter 1.0-RELEASE to be integrated into Schema.org:
- ComputationalWorkflow mapped to https://bioschemas.org/ComputationalWorkflow
- FormalParameter mapped to https://bioschemas.org/FormalParameter
- input mapped to https://bioschemas.org/properties/input
- output mapped to https://bioschemas.org/properties/output
To support geometry in Places, RO-Crate uses these terms from the GeoSPARQL ontology:
- Geometry mapped to http://www.opengis.net/ont/geosparql#Geometry
- asWKT mapped to http://www.opengis.net/ont/geosparql#asWKT
In this specification the proposed Bioschemas terms use the temporary https://bioschemas.org/ namespace; future releases of RO-Crate may reflect mapping to the http://schema.org/ namespace.
From CodeMeta 3.0:
- buildInstructions mapped to https://codemeta.github.io/terms/buildInstructions
- developmentStatus mapped to https://codemeta.github.io/terms/developmentStatus
- continuousIntegration mapped to https://codemeta.github.io/terms/continuousIntegration
- embargoEndDate mapped to https://codemeta.github.io/terms/embargoEndDate
- hasSourceCode mapped to https://codemeta.github.io/terms/hasSourceCode
- isSourceCodeOf mapped to https://codemeta.github.io/terms/isSourceCodeOf
- issueTracker mapped to https://codemeta.github.io/terms/issueTracker
- readme mapped to https://codemeta.github.io/terms/readme
- referencePublication mapped to https://codemeta.github.io/terms/referencePublication
- softwareSuggestions mapped to https://codemeta.github.io/terms/softwareSuggestions
As of 2024-05-23, the CodeMeta URIs do not resolve correctly, but are used here to match the CodeMeta JSON-LD context https://w3id.org/codemeta/3.0 (issue #275). The CodeMeta terms maintainer and funding are not mapped, as these are already defined by Schema.org.
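The effect of these mappings can be sketched as a lookup from short term to full URI. The table below copies a subset of the mappings listed above. The `expand` helper and its fallback rule (treat any unlisted term as a Schema.org term) are simplifying assumptions for this example, not a description of how a JSON-LD processor resolves a context.

```python
# Term-to-URI mappings copied from the lists above (a subset, for brevity).
ADDITIONAL_TERMS = {
    "File": "http://schema.org/MediaObject",
    "Journal": "http://schema.org/Periodical",
    "RepositoryObject": "http://pcdm.org/models#Object",
    "RepositoryCollection": "http://pcdm.org/models#Collection",
    "RepositoryFile": "http://pcdm.org/models#File",
    "hasMember": "http://pcdm.org/models#hasMember",
    "hasFile": "http://pcdm.org/models#hasFile",
    "Profile": "http://www.w3.org/ns/dx/prof/Profile",
    "conformsTo": "http://purl.org/dc/terms/conformsTo",
    "cite-as": "http://www.iana.org/assignments/relation/cite-as",
    "ComputationalWorkflow": "https://bioschemas.org/ComputationalWorkflow",
    "Geometry": "http://www.opengis.net/ont/geosparql#Geometry",
    "asWKT": "http://www.opengis.net/ont/geosparql#asWKT",
    "readme": "https://codemeta.github.io/terms/readme",
}

SCHEMA_ORG = "http://schema.org/"


def expand(term):
    """Expand a short term to a full URI.

    Assumption: any term not listed above is treated as a Schema.org term.
    """
    return ADDITIONAL_TERMS.get(term, SCHEMA_ORG + term)


for term in ("File", "RepositoryFile", "Person"):
    print(term, "->", expand(term))
```

The example shows why RepositoryFile needed its own name: File and RepositoryFile expand to different URIs, so the two concepts stay distinct in the same document.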
Summary of Coverage
RO-Crate is simply a way to make metadata assertions about a set of files and folders that make up a Dataset. These assertions can be made at two levels:
- Assertions at the RO-Crate level: for an RO-Crate to be useful, some metadata should be provided about the dataset as a whole (see minimum requirements for different use-cases below). In the RO-Crate Metadata Document, we distinguish the Root Data Entity which represents the RO-Crate as a whole, from other Data Entities (files and folders contained in the RO-Crate) and Contextual Entities, e.g. a person, organisation, place related to an RO-Crate Data Entity
- Assertions about files and folders contained in the RO-Crate: in addition to providing metadata about the RO-Crate as a whole, RO-Crate allows metadata assertions to be made about any other Data Entity
This document has guidelines for ways to represent common requirements for describing data in a research context, e.g.:
- Contact information for a data set.
- Descriptive information for a dataset and the files within it and their contexts such as an abstract, spatial and temporal coverage.
- Associated publications.
- Funding relationships.
- Provenance information of various kinds; who (people and organizations) and what (instruments and computer programs) created or contributed to the data set and individual files within it.
- Workflows that operate on the data using standard workflow descriptions, including ‘single step workflows’; executable files or environments such as Singularity containers or Jupyter notebooks.
However, as RO-Crate uses the Linked Data principles, adopters of RO-Crate are free to supplement RO-Crate using Schema.org metadata and/or assertions using other Linked Data vocabularies.
Future coverage
A future version of this specification aims to cater for variable-level assertions: in some cases, e.g. for tabular data, additional metadata may be provided about the structure and variables within a given file. See the use case Describe a tabular data file directly in RO-Crate metadata for work-in-progress.
Recommended Identifiers
RO-Crate JSON-LD SHOULD use the following IDs where possible:
- For a Root Data Entity, an identifier which is RECOMMENDED to be a https://doi.org/ URI.
- For a Person participating in the research process: ORCID identifiers, e.g. https://orcid.org/0000-0002-1825-0097
- For Organizations including funders, Research Organization Registry URIs, e.g. https://ror.org/0384j8v12
- For entities of type Place, a GeoNames URL, e.g. http://sws.geonames.org/8152662/
- For file formats, a PRONOM URL, e.g. https://www.nationalarchives.gov.uk/PRONOM/fmt/831
In the absence of the above, RO-Crates SHOULD contain stable persistent URIs to identify all entities wherever possible.
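A simple prefix test can flag entities that do not follow these recommendations. The sketch below is an assumption-laden illustration: the function names are hypothetical, the lookup by @type is a design choice made here, and a prefix match says nothing about whether the identifier actually exists. File formats are left out because this section does not say which @type a format entity carries.

```python
# Prefixes taken from the recommendations above; the lookup-by-@type design is an assumption.
RECOMMENDED_ID_PREFIX = {
    "Person": "https://orcid.org/",
    "Organization": "https://ror.org/",
    "Place": "http://sws.geonames.org/",
}


def uses_recommended_id(entity):
    """Return True/False for types with a recommendation, None for all other types."""
    types = entity["@type"]
    if not isinstance(types, list):
        types = [types]
    for type_name in types:
        prefix = RECOMMENDED_ID_PREFIX.get(type_name)
        if prefix is not None:
            return entity["@id"].startswith(prefix)
    return None


def root_has_doi(root):
    """Check whether the root's identifier (a string or an @id object) is a https://doi.org/ URI."""
    identifier = root.get("identifier")
    if isinstance(identifier, dict):
        identifier = identifier.get("@id")
    return isinstance(identifier, str) and identifier.startswith("https://doi.org/")


person = {"@id": "https://orcid.org/0000-0002-1825-0097", "@type": "Person"}
organization = {"@id": "https://ror.org/0384j8v12", "@type": "Organization"}
local_person = {"@id": "#alice", "@type": ["Person"]}
print(uses_recommended_id(person), uses_recommended_id(organization), uses_recommended_id(local_person))
```

Since these are SHOULD-level recommendations, a False result is best treated as a hint to look for a persistent identifier rather than as a validation failure.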