Magma

Data Warehouse

Organization

The purpose of Magma is to define a data graph for each Etna project: a set of models for each of the entities in the project dataset.

Each model has a set of attributes, which fall broadly into two groups: value types, which hold data such as an integer or a reference to a binary file, and link types (e.g., parent or collection), which define relationships between the models in Magma.

Attached to each model is a collection of records, each containing a data value for every attribute, including links to other records and to files stored on Metis.

Models

The shape of the data graph (and thus the relationships defined in each of the models) has some constraint:

Here is a sketch of what the graph for the “olympics” project might look like:

project {
  identifier project_name
  collection event
  collection athlete
}

event {
  parent project
  identifier event_name
  table entry
}

entry {
  parent event
  link athlete
  integer placement
  integer score
}

athlete {
  parent project
  identifier name
  collection entry
}

Here all links are reciprocal, and every model descends from the project. However, using the link attribute we may indicate other one-to-one or one-to-many relationships, which allows the graph to be more like a directed acyclic graph (DAG) than a tree.
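As a rough illustration of "every model descends from the project", the sketch below encodes the olympics models as a plain Ruby hash and walks the parent attributes up to the root. The hash layout and the helper names (parent_of, ancestry) are assumptions made for this example; they are not Magma's internal representation.

```ruby
# Hypothetical in-memory encoding of the "olympics" sketch above:
# each model maps attribute names to attribute types.
OLYMPICS = {
  'project' => { 'project_name' => :identifier, 'event' => :collection, 'athlete' => :collection },
  'event'   => { 'project' => :parent, 'event_name' => :identifier, 'entry' => :table },
  'entry'   => { 'event' => :parent, 'athlete' => :link, 'placement' => :integer, 'score' => :integer },
  'athlete' => { 'project' => :parent, 'name' => :identifier, 'entry' => :collection }
}.freeze

# Return the name of the model's parent attribute, or nil for the root.
def parent_of(models, model_name)
  models.fetch(model_name).key(:parent)
end

# Follow parent attributes up to the root and return the path taken.
def ancestry(models, model_name)
  path = [model_name]
  while (parent = parent_of(models, path.last))
    raise "cycle detected at #{parent}" if path.include?(parent)
    path << parent
  end
  path
end

OLYMPICS.each_key do |name|
  puts ancestry(OLYMPICS, name).join(' -> ')
end
```

Only parent attributes are followed here; the link from entry to athlete is what makes the full graph a DAG rather than a tree.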

Models may also specify a dictionary model and mapping to be used for more complex validations (see below on dictionary validation).

Attributes

Each attribute has, at least, a unique attribute name within its model, and a distinct attribute type.

The Magma attribute types are:

Value types

Outgoing link types

Incoming link types

Other types

In addition to its type, each attribute may set several other fields:

Records

A record holds a value for each attribute in its model. The set of records for a project forms a data graph with a single project record at the root.

Validation

Magma models may define validations, which help ensure the integrity of data as it enters Magma (invalid data is rejected). Magma has two basic forms of validation. The first is attribute validation, which adds matchers to each attribute on a model, e.g.:

class MyModel < Magma::Model
  attribute :att1, type: String, match: /[r]egexp/
  attribute :att2, type: String, match: [ 'list', 'of', 'options' ]
end
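To make the two matcher forms concrete, here is a minimal sketch of how a value could be tested against a Regexp or a list of options. The helper name matches? is hypothetical and this is not Magma's own validation code; it only shows the intended semantics.

```ruby
# Hypothetical helper: test a value against a Regexp or an Array matcher,
# mirroring the two match: forms shown above.
def matches?(matcher, value)
  case matcher
  when Regexp then matcher.match?(value.to_s)   # pattern match
  when Array  then matcher.include?(value)      # membership in a list
  else raise ArgumentError, "unsupported matcher: #{matcher.inspect}"
  end
end

puts matches?(/[r]egexp/, 'regexp')
puts matches?(%w[list of options], 'other')
```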

Dictionaries

An alternative method of validation is via a Magma::Dictionary, which allows a model to be validated using data (records) from another model.

You may define a dictionary relation as follows:

class MyModel < Magma::Model
  attribute :att1
  attribute :att2

  dictionary DictModel, att1: :dict_att1, att2: :dict_att2
end

A my_model record is valid if there is a matching entry in the dictionary model, i.e., where my_record.att1 matches dict_record.dict_att1 and my_record.att2 matches dict_record.dict_att2. Here ‘match’ might mean ‘equality’, but a dictionary may also include a ‘match’ attribute.

class DictModel < Magma::Model
  attribute :dict_att1
  match :dict_att2
end

A match attribute contains JSON data of the form { type, value }. This allows us to construct more complex entries:

# match any value in this Array
{ type: 'Array', value: [ 'x', 'y', 'z' ] }
# match within this Range
{ type: 'Range', value: [ 0, 100 ] }
# match this Regexp
{ type: 'Regexp', value: '^something$' }
# match an ordinary value
{ type: 'String', value: 'something' }
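The following sketch shows one way such { type, value } entries could be evaluated against a candidate value. The function entry_matches? is hypothetical, and treating a Range as inclusive of both endpoints is an assumption; Magma's actual behavior may differ.

```ruby
# Hypothetical evaluator for { type, value } match entries.
def entry_matches?(entry, candidate)
  value = entry.fetch(:value)
  case entry.fetch(:type)
  when 'Array'  then value.include?(candidate)
  when 'Range'  then Range.new(*value).cover?(candidate) # assumed inclusive
  when 'Regexp' then Regexp.new(value).match?(candidate.to_s)
  else value == candidate                                 # ordinary value
  end
end

puts entry_matches?({ type: 'Array', value: %w[x y z] }, 'y')
puts entry_matches?({ type: 'Range', value: [0, 100] }, 101)
```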

Usage

Clients

The main way to interact with Magma is via the API. In the simplest case you can POST with curl or wget, provided you set the Authorization: Etna <your token> header; any HTTP client will suffice. Visit your Janus instance to get a current token, or make use of other ways to authenticate with Janus.

To use Etna::Client to connect to Magma in Ruby, you may gem install etna and then create a new client:

require 'etna'

e = Etna::Client.new('https://magma.example.org', ENV['TOKEN'])

payload = e.retrieve(project_name: 'labors', model_name: 'monster', record_names: [ 'Nemean Lion', 'Lernean Hydra' ], attribute_names: "all")

API

The main way to interact with Magma directly is via its API (you may also perform a great many of the same operations using the data browser Timur).

There are four main endpoints: update, retrieve, query, and update_model. All of them expect a POST in JSON format with a valid Etna authorization header (i.e., Authorization: Etna <valid janus token>).

/update

Parameters

Examples

The basic revision format looks like this:

{
  "project_name" : "labors",
  "revisions" : {
    "monster" : {
      "Nemean Lion" : {
        "species" : "lion"
      },
      "Lernean Hydra" : {
        "species" : "hydra"
      }
    }
  }
}
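For readers without Etna::Client, a revision like this could be posted with Ruby's standard library. The sketch below only builds the request; the host is a placeholder, the helper name magma_post is hypothetical, and the token is assumed to come from Janus via the TOKEN environment variable.

```ruby
require 'net/http'
require 'json'
require 'uri'

# Build (but do not send) a JSON POST for a Magma endpoint.
def magma_post(host, endpoint, token, payload)
  uri = URI.join(host, endpoint)
  request = Net::HTTP::Post.new(uri)
  request['Authorization'] = "Etna #{token}"
  request['Content-Type'] = 'application/json'
  request.body = JSON.generate(payload)
  [uri, request]
end

revision = {
  project_name: 'labors',
  revisions: {
    monster: {
      'Nemean Lion'   => { species: 'lion' },
      'Lernean Hydra' => { species: 'hydra' }
    }
  }
}

# 'https://magma.example.org' is a placeholder host.
uri, request = magma_post('https://magma.example.org', '/update',
                          ENV.fetch('TOKEN', 'placeholder-token'), revision)

# To actually send it, uncomment:
# response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
puts request.body
```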

/retrieve

Parameters

Required parameters:

Optional parameters:

Examples

A basic request for a record looks like this:

{
  "project_name"    : "labors",
  "model_name"      : "labor",
  "record_names"    : [ "Nemean Lion" ],
  "attribute_names" : [ "name", "number", "completed" ]
}

The output is in “payload” format, containing a hash { models } keyed by model_name, and returning for each model { documents, template }. The template is a complete description of the model sufficient for import into another Magma instance. The returned documents are keyed by the record identifiers, with each record containing values for the attributes requested in attribute_names.

{
  "models": {
    "labor": {
      "documents": {
        "Nemean Lion": {
          "name": "Nemean Lion",
          "number": 1,
          "completed": true
        }
      },
      "template": {
        "name": "labor",
        "attributes": {
          "name": {
            "name": "name",
            "type": "String",
            "attribute_class": "Magma::Attribute",
            "display_name": "Name",
            "shown": true
          }
          // etc. for ALL attributes, not just requested
        },
        "identifier": "name",
        "parent": "project"
      }
    }
  }
}
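Once parsed, a payload can be walked model by model. The sketch below uses a trimmed-down payload in the shape described above; the helper each_document is hypothetical and is not part of Etna::Client.

```ruby
require 'json'

# A trimmed-down payload in the { models => { documents, template } } shape.
payload = JSON.parse(<<~JSON)
  {
    "models": {
      "labor": {
        "documents": {
          "Nemean Lion": { "name": "Nemean Lion", "number": 1, "completed": true }
        },
        "template": { "name": "labor", "attributes": {} }
      }
    }
  }
JSON

# Hypothetical helper: yield model name, record identifier and values
# for every document in the payload.
def each_document(payload)
  return enum_for(:each_document, payload) unless block_given?

  payload.fetch('models').each do |model_name, model|
    model.fetch('documents', {}).each do |record_name, values|
      yield model_name, record_name, values
    end
  end
end

each_document(payload) do |model_name, record_name, values|
  puts "#{model_name} / #{record_name}: #{values.inspect}"
end
```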

A few special cases exist. Here is the “template” query, which will retrieve all of the project templates but no documents:

{ "project_name": "labors", "model_name": "all", "record_names":[], "attribute_names": "all" }

The “identifier” query will retrieve all of the project identifiers at once:

{ "project_name": "labors", "model_name": "all", "record_names": "all", "attribute_names": "identifier" }
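Both special cases differ from a full retrieval in a single parameter, which a small helper can make explicit. The function retrieve_params is a hypothetical convenience for this example, not part of any Etna library.

```ruby
require 'json'

# Hypothetical helper: fill in /retrieve parameters, defaulting each to "all".
def retrieve_params(project_name, model_name: 'all', record_names: 'all', attribute_names: 'all')
  {
    project_name: project_name,
    model_name: model_name,
    record_names: record_names,
    attribute_names: attribute_names
  }
end

# The "template" query: every template, no documents.
template_query = retrieve_params('labors', record_names: [])

# The "identifier" query: every identifier in the project.
identifier_query = retrieve_params('labors', attribute_names: 'identifier')

puts JSON.generate(template_query)
puts JSON.generate(identifier_query)
```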

/query

The Magma Query API lets you pull data out of Magma through an expressive query interface.

Parameters

A general form of the query is:

[ *predicate_args, *verb_args, *predicate_args, *verb_args, ... ]

A basic query might look like this:

[ 'labor', '::all', 'name' ]

A breakdown of the terms:

labor - specifies the model we wish to search, yielding a model predicate
::all - a verb argument to the model predicate, iterating across all of the items in the model, and yielding a record predicate
name - a verb argument to the record predicate specifying an attribute name, yielding a value and terminating the query

While the query must eventually terminate in a value (or array of values if an array argument is passed to a record predicate), via records we might traverse through the graph first:

[ 'labor', '::all', 'monster', 'victim', '::first', 'city' ]

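The response to the basic query [ 'labor', '::all', 'name' ] pairs each record identifier with the requested value: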

{
   "answer" : [ [ "Nemean Lion", "Nemean Lion" ], [ "Lernean Hydra", "Lernean Hydra" ] ],
   "format" : [ "labors::labor#name", "labors::labor#name" ]
}

The format describes the returned values. If the format is an array, the answer contains a list of items, each laid out as the format describes. Individual entries are usually written as project_name::model_name#attribute_name.
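A client may want to split a format label into its parts. The sketch below assumes labels follow the usual project_name::model_name#attribute_name pattern and returns nil for anything else; parse_format and FORMAT_PATTERN are hypothetical names.

```ruby
# Matches labels such as "labors::labor#name".
FORMAT_PATTERN = /\A(?<project>\w+)::(?<model>\w+)#(?<attribute>\w+)\z/

# Hypothetical helper: split a format label into project, model and attribute.
def parse_format(label)
  match = FORMAT_PATTERN.match(label)
  return nil unless match

  match.named_captures.transform_keys(&:to_sym)
end

p parse_format('labors::labor#name')
p parse_format('not a label')
```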

A more advanced query might include a filter:

[ 'monster', [ '::has', 'stats' ], '::all', 'name' ]

Filters may be applied to any model we traverse through:

[ 'labor', '::all', 'prize', [ 'worth', '::>', '200' ], '::first', 'name' ]

Predicates

There are a handful of predicate types, each of which takes various arguments.

Start

The first predicate initiates the query and usually takes a model name as an argument:

<model_name> - a string specifying the model to be searched

You may also pass the following as initial arguments:

::predicates - return a list of the available predicates and their verbs
::model_names - return a list of model names for the project being queried

Model

A Model predicate is our query starting point and specifies a set of records. Model predicates can accept an arbitrary number of filter [] arguments, followed by:

::first - reduce this model to a single item
::all - return a vector of values for this model, labeled with this model's identifiers
::attribute_names - return a list of attribute names for this model (only following the start predicate)

Record

A Record predicate follows after a Model predicate. The valid arguments are:

<attribute_name> - a string specifying an attribute on this model
::has, <attribute_name> - a boolean test for the existence of <attribute_name> (i.e., the data is not null)
::identifier - an alias for the attribute_name of this Model's identifier. E.g., if a Sample has identifier attribute 'sample_name', '::identifier' will return the same value as 'sample_name'

Column

Column attributes usually just return their value. However, you may optionally follow them with arguments to apply a boolean test.

string

::equals, <string> - A boolean test for equality, e.g. [ 'sample_name', '::equals', 'Dumbo' ]
::in, [ list of strings ] - A boolean test for membership, e.g., [ 'sample_name', '::in', [ 'ant', 'bear', 'cat' ] ]
::matches, <string> - A boolean test for a regular expression match, e.g., [ 'sample_name', '::matches', '[GD]umbo' ]

integer, date_time

::<= - less than or equals
::< - less than
::>= - greater than or equals
::> - greater than
::= - equals

boolean

::true - is true
::false - is false

file, image

::url - a URL to retrieve this file resource
::path - the filename/path for this file resource

matrix

::slice - retrieve a subset of columns from the matrix

Example Queries

Using the examples above, you could formulate a query using a POST request and the following JSON payload:

{
  "query": [ "labor", "::all", "monster", "name" ],
  "project": "labors"
}

Results in something like:

{
    "answer" : [ [ "Nemean Lion", "Nemean Lion" ], [ "Lernean Hydra", "Lernean Hydra" ] ],
    "format" : [ "labors::labor#name", "labors::monster#name" ]
}

To get a TSV back, you could add the format=tsv parameter:

{
  "query": [ "labor", "::all", "monster", "name" ],
  "project": "labors",
  "format": "tsv"
}

Results in:

labors::labor#name\tlabors::monster#name
Nemean Lion\tNemean Lion
Lernean Hydra\tLernean Hydra
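On the client side, a TSV response like this can be turned into one hash per row. The sketch below assumes the body is plain tab-separated text with a header row and no quoted or escaped fields; parse_tsv is a hypothetical helper.

```ruby
# Example response body, with real tab characters between columns.
tsv = "labors::labor#name\tlabors::monster#name\n" \
      "Nemean Lion\tNemean Lion\n" \
      "Lernean Hydra\tLernean Hydra\n"

# Hypothetical helper: map each data row to a hash keyed by the header row.
def parse_tsv(text)
  header, *rows = text.each_line.map { |line| line.chomp.split("\t", -1) }
  rows.map { |row| header.zip(row).to_h }
end

parse_tsv(tsv).each { |record| p record }
```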

As noted above, this results in a two-column response. If you want to provide alternate column labels for the TSV, you can supply user_columns:

{
  "query": [ "labor", "::all", "monster", "name" ],
  "project": "labors",
  "format": "tsv",
  "user_columns": ["Labor", "Roar"]
}

Results in:

Labor\tRoar
Nemean Lion\tNemean Lion
Lernean Hydra\tLernean Hydra

Transposing the request:

{
  "query": [ "labor", "::all", "monster", "name" ],
  "project": "labors",
  "format": "tsv",
  "user_columns": ["Labor", "Roar"],
  "transpose": true
}

Results in:

Labor\tNemean Lion\tLernean Hydra
Roar\tNemean Lion\tLernean Hydra
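The transposed layout is an ordinary row/column swap of the untransposed table. As a small illustration (done client-side here, whereas the transpose parameter does it on the server), Ruby's Array#transpose produces the same shape:

```ruby
# The untransposed table, header row first.
rows = [
  ['Labor', 'Roar'],
  ['Nemean Lion', 'Nemean Lion'],
  ['Lernean Hydra', 'Lernean Hydra']
]

# Swap rows and columns: each former column becomes a row.
transposed = rows.transpose

transposed.each { |row| puts row.join("\t") }
```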

And expanding a matrix attribute (contributions):

{
  "query": [ "labor", "::all", "contributions", "::slice", [ "Athens", "Sparta" ] ],
  "project": "labors",
  "format": "tsv",
  "user_columns": ["Labor", "Share"],
  "expand_matrices": true
}

Results in:

Labor\tShare.Athens\tShare.Sparta
Nemean Lion\t10\t20
Lernean Hydra\t11\t21

Unexpanded, the data for the matrix attribute will nest into a single cell:

{
  "query": [ "labor", "::all", "contributions", "::slice", [ "Athens", "Sparta" ] ],
  "project": "labors",
  "format": "tsv",
  "user_columns": ["Labor", "Share"]
}

Results in:

Labor\tShare
Nemean Lion\t[10,20]
Lernean Hydra\t[11,21]
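The relationship between the nested and expanded layouts can be sketched client-side: each nested cell is spliced into one column per sliced name, and the header is relabeled as Label.Column. The helper expand_matrix is hypothetical and only illustrates the transformation; it is not how Magma implements expand_matrices.

```ruby
# Hypothetical helper: splice a nested matrix cell into separate columns.
def expand_matrix(header, rows, matrix_index, column_names)
  label = header[matrix_index]

  expanded_header = header.dup
  expanded_header[matrix_index, 1] = column_names.map { |name| "#{label}.#{name}" }

  expanded_rows = rows.map do |row|
    expanded = row.dup
    expanded[matrix_index, 1] = row[matrix_index] # splice the nested values in place
    expanded
  end

  [expanded_header, expanded_rows]
end

header = ['Labor', 'Share']
rows = [
  ['Nemean Lion', [10, 20]],
  ['Lernean Hydra', [11, 21]]
]

expanded_header, expanded_rows = expand_matrix(header, rows, 1, %w[Athens Sparta])
table = [expanded_header] + expanded_rows
table.each { |row| puts row.join("\t") }
```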

You can also transpose expanded matrices:

{
  "query": [ "labor", "::all", "contributions", "::slice", [ "Athens", "Sparta" ] ],
  "project": "labors",
  "format": "tsv",
  "user_columns": ["Labor", "Share"],
  "expand_matrices": true,
  "transpose": true
}

Results in:

Labor\tNemean Lion\tLernean Hydra
Share.Athens\t10\t11
Share.Sparta\t20\t21

/update_model

Coming soon.

Setup

Installation

Start with a basic git checkout:

$ git clone https://github.com/mountetna/magma.git

Magma is a Rack application, which means you can run it using any Rack-compatible server (e.g. Puma or Passenger).

Configuration

Magma has a single YAML config file, config.yml; DO NOT TRACK this file, as it will hold all of your secrets. It uses the Etna::Application configuration syntax. See config.yml.template for an example configuration.

Migrations

Magma attempts to keep its models and the database schema strictly in step by suggesting migrations. These are written in the Sequel ORM’s migration language, not pure SQL, so they are fairly straightforward to amend when Magma plans incorrectly.

To plan a new set of migrations, the first step is to amend your models. This also works in the case of entirely new models. Simply sketch them out as described above, setting out the attributes each model requires and creating links between them.

Once you’ve defined your models, you can execute bin/magma plan to create a new migration. To restrict the plan to a single project, run bin/magma plan <project_name>. Magma will output Ruby code for a migration using the Sequel ORM; you can save this in your project’s migration folder (e.g. project/my_project/migration/01_initial_migration.rb).

After your migrations are in place, you can try to run them using bin/magma migrate, which will attempt to run migrations that have not been run yet. If you change your mind, you can roll backwards (depending on how reversible your migration is) using bin/magma migrate <migration version number>.