Scopes and Profiles

Research Objects can be used to capture academic outputs in a wide range of scopes, from detailed traces of software execution to consortium-wide results from a 5-year research project.

Research Object extensions and serialization formats can be applied together with domain-specific annotations to form a specific type of Research Object with a particular scope.

Profiles help define the shape and form of a domain- or application-specific research object. Loosely, a profile defines a serialization format (e.g. Research Object Bundle), sets expectations for what kinds of resources should be present, and links to any specific vocabularies or further specifications that should be used in its annotations. In a way, a profile defines the general purpose of that type of Research Object, and documents the assumptions a consumer can rely on when processing such Research Objects.
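To make this concrete, a profile's expectations can be thought of as a machine-checkable checklist. The sketch below is illustrative only (plain Python; the requirement names are hypothetical, not taken from any published profile):

```python
# Illustrative only: a profile's expectations expressed as a checklist.
# The requirement names here are hypothetical.
REQUIRED = {"workflow-definition", "execution-trace", "author-attribution"}

def satisfies_profile(declared_resources):
    """Return (ok, missing) for the resource kinds an RO declares."""
    missing = REQUIRED - set(declared_resources)
    return (not missing, missing)

ok, missing = satisfies_profile({"workflow-definition", "author-attribution"})
print(ok, missing)  # False {'execution-trace'}
```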

A profile may mandate a minimal information model checklist to formally specify its requirements, although a profile may also be just a text document aimed at a human reader. Below is a list of known Research Object Profiles - feel free to suggest changes to this list.

Scientific Workflows

Scientific workflows are computational experiments composed of a set of coordinated computational tasks. Each task takes some data inputs and produces some data outputs, which are then consumed by subsequent tasks according to the workflow definition, i.e. the experiment protocol. Scientific workflows provide the means for automating computational scientific experiments, enabled by established workflow execution engines like Apache Taverna, Kepler, VisTrails and Galaxy.
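To make the dataflow idea concrete, here is a minimal, engine-agnostic sketch in plain Python (not tied to any of the engines named above) of tasks wired together by their inputs and outputs:

```python
# A toy dataflow: each task consumes outputs of earlier tasks, mirroring
# how a workflow definition coordinates computational steps.
def clean(raw_records):
    return [r.strip() for r in raw_records if r.strip()]

def transform(records):
    return [r.upper() for r in records]

def summarise(records):
    return {"count": len(records)}

# The "workflow definition" is the wiring between the tasks:
records = clean([" a ", "", "b"])
print(summarise(transform(records)))  # {'count': 2}
```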

A workflow-centric research object (https://doi.org/10.1016/j.websem.2015.01.003) contains the application-specific workflow definition, annotated with wfdesc, combined with a PROV-based execution trace of one or more workflow runs, including their inputs and outputs. As an example of an application-specific profile, see Apache Taverna’s data bundle from TavernaProv (https://doi.org/10.5281/zenodo.51314).
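As a minimal sketch of what consuming such a research object can look like (assuming a ZIP serialization laid out as a Research Object Bundle, with its JSON-LD manifest at .ro/manifest.json; the file name bundle.zip is a placeholder):

```python
import json
import zipfile

# Open a Research Object Bundle (a ZIP) and read its JSON-LD manifest.
with zipfile.ZipFile("bundle.zip") as bundle:
    manifest = json.loads(bundle.read(".ro/manifest.json"))

# Each aggregated resource may declare what it conforms to, e.g. a
# wfdesc workflow definition or a PROV execution trace.
for resource in manifest.get("aggregates", []):
    print(resource.get("uri"), "conformsTo:", resource.get("conformsTo"))
```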

Building on that work, CWLProv (https://doi.org/10.5281/zenodo.1215611) is a recent development that extends the reference implementation of the Common Workflow Language to capture workflow run provenance (what we can call retrospective provenance) using PROV, wfprov and ro-bagit BDBags. Here the provenance capture rewrites the CWL workflow definition (which may have used local file paths) so that it is self-contained within the Research Object, and provides mechanisms for re-running the workflow execution. Because CWL is a vendor-neutral community project with multiple implementations, and because the captured workflows rely on Docker containers and BioConda packages, these Research Objects also achieve portability across platforms and workflow engines.
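Because a CWLProv research object is serialized as a BagIt bag, its completeness and checksums can be verified with standard BagIt tooling. A minimal sketch using the bagit-python library (cwlprov_ro/ is a placeholder for a directory produced by a CWL runner with provenance capture enabled):

```python
import bagit  # bagit-python: pip install bagit

# Verify that the bag's payload matches its BagIt manifests.
bag = bagit.Bag("cwlprov_ro/")
bag.validate()   # raises bagit.BagValidationError on any mismatch
print(bag.info)  # metadata from bag-info.txt, e.g. external identifiers
```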

A slightly different kind of workflow research object aims to capture just the recipe for execution, what we can call a plan or prospective provenance. Here the challenge is to provide all the resources required for executing the workflow, as well as the history of the workflow definition itself. Early work in this area includes the SCUFL2 Workflow Bundle, used by Apache Taverna Language. More recently, in the Common Workflow Language Viewer, workflow definitions can be downloaded as a Research Object Bundle that captures the runtime requirements of a CWL workflow, in addition to the provenance of how the definition itself was created, derived by examining the corresponding GitHub log (https://doi.org/10.7490/f1000research.1114375.1).

ISA model

The ISA model (Investigation/Study/Assay) is commonly used in systems biology, life sciences, environmental and biomedical domains to structure research outputs. ISA defines a top-level investigation, consisting of studies, which contain several assays (experiments or tests) that produce data files. In SEEK for Science, used by the FAIRDOM data management project, a complete investigation can be downloaded as a Research Object Bundle, which is structured according to the ISA model to include all contained resources and their annotations. These research objects correspond to temporal snapshots of the whole investigation and its resources, which can be assigned DOIs like https://doi.org/10.15490/fairdomhub.1.investigation.162.1 - from where you can download the corresponding Research Object Bundle.
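A minimal sketch of the ISA hierarchy as plain data types (the attribute names below are illustrative, not the normative ISA-Tab/ISA-JSON terms):

```python
from dataclasses import dataclass, field

@dataclass
class Assay:                      # an experiment or test producing data
    measurement: str
    data_files: list = field(default_factory=list)

@dataclass
class Study:                      # groups related assays
    title: str
    assays: list = field(default_factory=list)

@dataclass
class Investigation:              # the top-level container
    title: str
    studies: list = field(default_factory=list)

inv = Investigation("Example investigation", [
    Study("Example study", [Assay("transcriptomics", ["counts.tsv"])]),
])
print(inv.studies[0].assays[0].data_files)  # ['counts.tsv']
```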

The SOAPdenovo2 reproducibility study (https://doi.org/10.1371/journal.pone.0127612) showed how to recreate a scientific workflow (using Galaxy) from the original SOAPdenovo2 paper using a combination of the ISA model, Nanopublications, and Research Objects (RO). As the resulting dataset (https://doi.org/10.5524/100148) includes all of these serializations, they can be compared for completeness and complexity.

Digital Preservation

In digital libraries, preservation of source artifacts commonly uses the BagIt format for archive serialization, capturing digital resources like audio recordings, document scans and their transcriptions, along with provenance and annotations. The Research Object BagIt archive is a profile for describing a BagIt archive and its content as a Research Object, structuring the metadata and relating the captured resources. It is used by the NIH-funded Big Data for Discovery Science (BDDS) project to capture Big Data bags (BDBag) of large, complex datasets from genomics workflows. BDDS also developed the bdbag utility and Python library to create, inspect and manipulate such research objects (https://doi.org/10.1109/BigData.2016.7840618).
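As a minimal sketch, a directory of resources can be serialized as a BagIt bag with bagit-python (archive/ is a placeholder directory; the metadata keys follow bag-info.txt conventions):

```python
import bagit  # bagit-python: pip install bagit

# Turn archive/ into a bag in place: the payload moves under data/, and
# checksum manifests plus bag-info.txt are generated.
bag = bagit.make_bag("archive/", {
    "Source-Organization": "Example Library",
    "External-Description": "Scans and transcriptions with annotations",
}, checksums=["sha256"])
print(bag.entries)  # payload files mapped to their checksums
```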

A key aspect of BDBag is the ability to use Minimal Viable Identifiers (minid) for referencing potentially large data sources held in multiple remote repositories, effectively making a “Big Data” Research Object for large-scale workflows (https://doi.org/10.1101/268755). A bag of bags (minid:b9vx04) may look tiny at 182 kB, but it is a metadata skeleton which can be completed with tools like bdbag, downloading the big data from alternative repositories using efficient methods like GridFTP and verifying cryptographically strong hashes against the BagIt manifests. Alternatively, the bags’ Research Object manifests can be consumed independently, linking to the remote resources. Further work on BDBags is taking place as part of the NIH Data Commons.
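The mechanism behind this is the standard BagIt fetch.txt, which lists remote payload files as "URL LENGTH FILENAME" lines; tools like bdbag retrieve them later and verify the downloads against the bag's checksum manifests. A minimal sketch of reading such a file (mybag/fetch.txt is a placeholder path):

```python
# Parse a BagIt fetch.txt: one "URL LENGTH FILENAME" entry per line,
# where LENGTH may be "-" if unknown.
def parse_fetch(path="mybag/fetch.txt"):
    entries = []
    with open(path) as fh:
        for line in fh:
            if not line.strip():
                continue
            url, length, filename = line.rstrip("\n").split(None, 2)
            entries.append({
                "url": url,
                "length": None if length == "-" else int(length),
                "filename": filename,
            })
    return entries

for entry in parse_fetch():
    print(entry["filename"], "<-", entry["url"])
```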

Simulation Experiments

A simulation study usually consists of multiple modules: a computational model describes a real-world system, a simulation description defines the parametrisation, and some documentation explains how to use and run the study. The experiment can then be simulated in silico to study the system’s behaviour under different conditions. This approach helps in understanding the encoded system and allows for making predictions about what will happen in the real system, without spending money and time on wet-lab experiments.

A simulation object (a research object encoding a simulation experiment) needs to contain all files necessary to reproduce the results of a certain simulation experiment. It must be annotated with information about the creators and authors of the files/modules shipped with the study. Moreover, it should make use of standard formats (such as SBML, CellML and SED-ML) as far as possible. See also the COMBINE Archive (https://doi.org/10.1186/s12859-014-0369-z).
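A COMBINE archive is itself a ZIP file (OMEX) whose manifest.xml declares the format of each file, so its contents can be listed with standard tooling. A minimal sketch (study.omex is a placeholder filename):

```python
import xml.etree.ElementTree as ET
import zipfile

# The OMEX manifest uses this XML namespace for its elements.
NS = "{http://identifiers.org/combine.specifications/omex-manifest}"

with zipfile.ZipFile("study.omex") as omex:
    manifest = ET.fromstring(omex.read("manifest.xml"))

# Each <content> entry maps a file location to its declared format,
# e.g. SBML for the model or SED-ML for the simulation description.
for content in manifest.iter(NS + "content"):
    print(content.get("location"), "->", content.get("format"))
```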

Computational Jobs

The STELAR project uses research objects as part of its computational job management, capturing analysis code in a Snapshot Research Object and submitting a Job Research Object that aggregates all of the content required to execute the analysis code, e.g. the Snapshot Research Object and input data files. Annotations include the metadata required for execution: command line parameters, environment variables and the necessary computational resources (e.g. CPU and memory). Results are returned as Execution Research Objects that aggregate the Job Research Object along with the outputs of the process: output files, standard input, standard output and standard error (https://doi.org/10.1136/thoraxjnl-2015-206781).
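As an illustration only (the field names are hypothetical, not STELAR's actual vocabulary), the execution metadata of a Job Research Object and the shape of the resulting Execution Research Object might look like:

```python
# Hypothetical annotation carried by a Job Research Object.
job_annotation = {
    "command_line": ["analyse.py", "--cohort", "inputs/cohort.csv"],
    "environment": {"ANALYSIS_MODE": "batch"},
    "resources": {"cpus": 4, "memory_gb": 16},
}

# An Execution Research Object aggregates the job plus its outputs.
execution = {
    "aggregates": ["job-ro/", "outputs/results.csv", "stdout.txt", "stderr.txt"],
    "annotations": [job_annotation],
}
print(execution["aggregates"])
```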

Health Informatics

For sharing public health datasets within the Farr Institute of Health Informatics Research (https://doi.org/10.17061/phrp2541541), the Farr Commons defines a Research Object profile with a series of requirements on identifiers, versioning and licensing of datasets. Availability aspects include privacy regulations, and domain-specific annotations include cohort and clinical codes.

Regulatory Science

BioCompute Objects (BCO) is a community-driven project backed by the FDA (US Food and Drug Administration) and George Washington University to standardize the exchange of High-Throughput Sequencing workflows for regulatory submissions between the FDA, pharma, bioinformatics platform providers and researchers (https://doi.org/10.1101/191783).

Regulatory bodies like the FDA face a particular challenge in areas like personalized medicine: to review and approve the bioinformatics side, they need to inspect and in some cases replicate the computational analytical workflow. The challenge here is not just the usual reproducibility concern of packaging software and providing the required datasets, but also supporting human understanding of what has been done, by expressing the higher-level steps of the workflow, their parameter spaces and algorithm settings.
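As an illustrative sketch only (the structure below is simplified; the normative schema is defined by the BCO specification's domains), the kind of higher-level step description in question might be recorded like this:

```python
# Simplified, hypothetical rendering of workflow steps with their
# algorithm settings, in the spirit of a BioCompute Object.
pipeline_steps = [
    {"step": 1, "name": "trim", "version": "0.39",
     "parameters": {"min_quality": 20}},
    {"step": 2, "name": "align", "version": "2.4.2",
     "parameters": {"reference": "GRCh38"}},
]

bco_like = {
    "usability": "variant calling pipeline for regulatory review",
    "execution": {"steps": pipeline_steps},
    "parametric": [{"step": 1, "param": "min_quality", "value": 20}],
}
print(bco_like["execution"]["steps"][1]["name"])  # align
```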

To this end, Research Objects complement BioCompute Objects: the RO captures aggregation, provenance and attribution, while the BCO captures the domain-specific aspects. We are exploring this approach in the ELIXIR implementation study Enabling the reuse, extension, scaling, and reproducibility of scientific workflows.