Modeling
Introduction
Data modeling captures the structure of the research project. A model corresponds to an entity in the experiment. Each entity has many properties, which are described by attributes of the model. For example, we may have a sample model with two attributes:
sample { identifier :name, parent :patient, file :picture }
The data attaching to a model is a set of records for each entity in the experiment. For example, we may have three sample records:
{ name: "sample1", patient: "patient1", picture: "metis://sample1.png" }
{ name: "sample2", patient: "patient1", picture: "metis://sample2.png" }
{ name: "sample3", patient: "patient2", picture: "metis://sample3.png" }
As the set of records for a model may often be represented in a tabular format, with records in rows and attributes in columns, a model may sometimes be referred to as a “table”.
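As a rough illustration of the tabular view, the three sample records above can be held as plain Python dictionaries and laid out with one row per record and one column per attribute. The `to_table` helper is hypothetical, written only for this sketch; it is not part of the library.

```python
# The three sample records from the example above, as plain dictionaries.
records = [
    {"name": "sample1", "patient": "patient1", "picture": "metis://sample1.png"},
    {"name": "sample2", "patient": "patient1", "picture": "metis://sample2.png"},
    {"name": "sample3", "patient": "patient2", "picture": "metis://sample3.png"},
]


def to_table(records, attributes):
    """Return a header row followed by one row of values per record."""
    header = list(attributes)
    rows = [[str(record.get(attr, "")) for attr in header] for record in records]
    return [header] + rows


table = to_table(records, ["name", "patient", "picture"])
for row in table:
    print("\t".join(row))
```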
Attributes
Broadly there are two classes of attributes:
- data attributes (string, integer, float, boolean, matrix, date_time, shifted_date_time, file, file_collection)
- link attributes (parent, link, child, collection, table). Links must be two-way (e.g., a patient has a collection :sample attribute, and a sample has a parent :patient attribute).
The “parent” attribute is required for each model except the project model, producing a model hierarchy.
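The two rules above (links are two-way, and every model except the project has a parent) can be sketched as a small consistency check. This is only an illustration under stated assumptions: the dictionary layout and the `find_problems` function are hypothetical and do not reflect how the library stores or validates models. The check only confirms that some link points back; it does not verify which attribute types may be paired.

```python
# Link attribute types named in the text above.
LINK_TYPES = {"parent", "link", "child", "collection", "table"}

# Hypothetical layout: model name -> list of (attribute type, target model).
models = {
    "project": [("collection", "patient")],
    "patient": [("parent", "project"), ("collection", "sample")],
    "sample": [("parent", "patient")],
}


def find_problems(models, root="project"):
    """Return a list of messages for missing parents or one-way links."""
    problems = []
    for name, attributes in models.items():
        links = [(kind, target) for kind, target in attributes if kind in LINK_TYPES]
        # Every model except the root needs a parent attribute.
        if name != root and not any(kind == "parent" for kind, _ in links):
            problems.append(f"{name} has no parent attribute")
        # Every link needs some link attribute pointing back from the target.
        for kind, target in links:
            points_back = any(
                back_kind in LINK_TYPES and back_target == name
                for back_kind, back_target in models.get(target, [])
            )
            if not points_back:
                problems.append(f"{name} has {kind} :{target} but {target} has no link back")
    return problems


print(find_problems(models))
```

For the three models shown, the check finds no problems; removing the parent :patient attribute from sample would report both a missing parent and a one-way link.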
Data Model Hierarchy
Although models may have non-hierarchical relationships (e.g. belonging to multiple collections) or no relationships to other models, the library enforces the requirement that models must have parents, and thus form a tree with the project at the root. The main purpose of requiring a hierarchy is to allow a conceptually simpler representation of the dataset, with the project as the root entity collecting all other pieces of data.
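Because every model has exactly one parent, any model can be traced back to the project by following parent attributes. The sketch below illustrates this with a hypothetical parent map; `path_to_root` is an illustrative helper, not a library function.

```python
# Hypothetical parent map for a small project: model name -> parent model name.
parents = {
    "patient": "project",
    "sample": "patient",
}


def path_to_root(model, parents, root="project"):
    """Follow parent links from a model up to the root and return the path."""
    path = [model]
    while path[-1] != root:
        parent = parents.get(path[-1])
        if parent is None:
            raise ValueError(f"{path[-1]} has no parent, so it is not in the tree")
        if parent in path:
            raise ValueError(f"cycle detected at {parent}")
        path.append(parent)
    return path


# Print the path from the root down to the model.
print(" > ".join(reversed(path_to_root("sample", parents))))
```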
How to Data Model
Setting up a new set of data models describing a project can be a daunting task. Fortunately, we can make good use of pre-existing work and conventions to make this task easier. Here we will describe one approach you may take to arriving at a set of models you can use to begin linking data.
The work of modeling data in the library is done via the Map page of the Timur application. This page allows us to visualize the model hierarchy and to inspect the attributes on each model, as well as the details of each attribute. In addition, the Map page contains a number of modeling tools (available if you are an admin on your project) that allow you to add, remove or reparent models, and to add, remove or edit attributes, either singly or by copying them in bulk from another model. We will employ these modeling tools to build our project map.
Use a Template Project
An invaluable resource when updating your map is an existing map from another project. The data library may contain one or more “template” projects (e.g. in the UCSF Data Library the “coprojects_template” project) suitable for copying from.
Naming Models
When adding models to your project, although you have the latitude to name models as you see fit, it is best to adhere to convention and use model names taken from the template project.
Creating a Backbone
At first your new project will contain only a single model, the “project” model. Initially we wish to add the principal entities of the project that have produced the data we intend to ingest and collate. Although experiments may in principle have considerable variety in their structure, in practice a common backbone does well to describe the majority of projects:
project > cohort > subject > timepoint > biospecimen > compartment
Projects may not require all of the models in this hierarchy, e.g., if there is a single cohort, or only one timepoint of collection, those models may be excluded from the backbone. Other projects may require models not present here, e.g. some projects might include a study “arm” above a cohort, or more detailed models describing sample aliquoting.
After determining which models in the backbone our project requires, we may add the models to our project using the “Add Model” button, first adding a model below the project model, then selecting the new model and adding a model below that, etc.
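The order of additions can be planned before touching the Map page. The sketch below starts from the common backbone described above, drops the models a project does not need, and lists each remaining model together with the model it should be added below. The `plan_backbone` helper is hypothetical and only produces a checklist; the models themselves are still added with the “Add Model” button.

```python
# The common backbone described above, from root to leaf.
BACKBONE = ["project", "cohort", "subject", "timepoint", "biospecimen", "compartment"]


def plan_backbone(excluded=()):
    """Return (parent, child) pairs in the order the models would be added."""
    if "project" in excluded:
        raise ValueError("the project model is the root and cannot be excluded")
    kept = [model for model in BACKBONE if model not in excluded]
    # Each kept model is added below the kept model that precedes it.
    return list(zip(kept, kept[1:]))


# Example: a project with a single cohort and a single collection timepoint.
for parent, child in plan_backbone(excluded={"cohort", "timepoint"}):
    print(f"add {child} below {parent}")
```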
Adding data models
Now that the main structure of the project is determined, we may add more models to hold data. Roughly, data models can be broken down into “clinical” and “assay” categories.
We may add data models as leaves to the backbone by using the “Add Model” button (e.g. to add an rna_seq model below a biospecimen model). Once the empty model has been added, we may fill it with attributes using the “Copy Attributes” button. Select the template project and model you wish to copy from, then select the attributes you want to copy.
Clinical data models
Clinical data usually attaches to the patient or timepoint model, and usually originates in a patient’s electronic medical record (EMR). This kind of data often has issues relating to patient privacy; one should be cautious about including unconstrained string attributes or date_time fields, as these can contain compromising information. The use of the shifted_date_time attribute allows the library to safely anonymize entered dates while still maintaining a relative patient-centric timeline for dates.
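One common way to anonymize dates while keeping a per-patient timeline is to shift every date belonging to a patient by the same secret offset. The sketch below illustrates that general idea only; it is an assumption about the approach, not a description of how the library implements shifted_date_time. The function names, the `secret` parameter and the 365-day range are all hypothetical.

```python
import hashlib
from datetime import datetime, timedelta


def patient_offset(patient_id, secret, max_days=365):
    """Derive a stable offset of 1..max_days days from a secret and a patient id."""
    digest = hashlib.sha256(f"{secret}:{patient_id}".encode()).digest()
    days = int.from_bytes(digest[:4], "big") % max_days + 1
    return timedelta(days=days)


def shift_date(value, patient_id, secret):
    """Move a date earlier by the patient's offset (same offset for all of that patient's dates)."""
    return value - patient_offset(patient_id, secret)


first_visit = datetime(2020, 3, 1)
second_visit = datetime(2020, 3, 15)
shifted_first = shift_date(first_visit, "patient1", secret="example-secret")
shifted_second = shift_date(second_visit, "patient1", secret="example-secret")

# The 14-day interval between the two visits is unchanged by the shift.
print(shifted_second - shifted_first)
```

Because the offset is constant within a patient, intervals between that patient's events survive the shift, while the absolute calendar dates do not.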
Many clinical data models are relatively simple and consist of name/value attributes; names or values may often come from standard medical ontologies defining medications, treatments or diseases.
Assay models
Assays are usually leaf models at the very bottom of the data hierarchy, attaching to the biospecimen or compartment models. Assay models might involve large, high-throughput datasets generated by complex protocols and machines. In addition to containing the raw output from the machine (e.g. a fastq file from a sequencer), these models are typically structured to match the output of a corresponding processing workflow, which operates on the raw data to produce an intermediate more suitable for analysis.
Pool models
Often individual assays may be grouped into batches for processing, e.g. a plate of RNA sequencing, a microarray of tissue cross-sections, or a pool of single cells. These pool entities may combine information from multiple subjects, and so do not attach into the “backbone” models of the project; instead they are parented directly to the project model and collect the assay model, while the corresponding assay contains a link to the pool model but remains parented to the backbone. E.g., for an sc_seq_pool of sc_seq assays, we might have:
sc_seq_pool { parent :project, collection :sc_seq }
sc_seq { parent :subject, link :sc_seq_pool }
To create a pool or batch model, use “Add Model” to add the new model below the project model. Then use “Add Attribute” to add a collection attribute pointing to the relevant assay, with a reciprocal “link” attribute in the assay model. Finally, use “Copy Attributes” to copy the attributes for the pool model from the corresponding template project model.
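To see how the pool relationship plays out in records, the sketch below takes a few hypothetical sc_seq records, each parented to a subject and linked to an sc_seq_pool, and derives the collection each pool would hold. The record names and the `collect_by_pool` helper are invented for illustration.

```python
from collections import defaultdict

# Hypothetical sc_seq records: each stays parented to a subject and links to a pool.
sc_seq_records = [
    {"name": "sc_seq1", "subject": "subject1", "sc_seq_pool": "pool1"},
    {"name": "sc_seq2", "subject": "subject2", "sc_seq_pool": "pool1"},
    {"name": "sc_seq3", "subject": "subject2", "sc_seq_pool": "pool2"},
]


def collect_by_pool(records, link_attribute="sc_seq_pool"):
    """Group assay record names by the pool each record links to."""
    pools = defaultdict(list)
    for record in records:
        pools[record[link_attribute]].append(record["name"])
    return dict(pools)


pools = collect_by_pool(sc_seq_records)
print(pools)

# A single pool can span several subjects, which is why it hangs off the project.
subjects_in_pool1 = {r["subject"] for r in sc_seq_records if r["sc_seq_pool"] == "pool1"}
print(sorted(subjects_in_pool1))
```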
Adding additional attributes
Project models, even when copied from a template, may be added to. Each project has its idiosyncrasies requiring novel data fields not present in the template, such as alias identifiers from external collaborators, project-specific sample or assay QC, etc. We may add attributes to a model at any time using the “Add Attribute” button.
Creating a new data model
Science is endlessly inventive, and you will inevitably encounter a new data type unrepresented by a template model. Creating a new assay model can be challenging and involves understanding the format of the raw assay data, how it is processed, and what pieces of processed data are relevant for analysis. Subsequently you can define the list of attributes capturing the relevant data.