README
forgiving-data
What is this?
Algorithms for processing and relating data in ways which promote ownership, representation and inclusion, maintaining rather than effacing provenance. See slide 2 in presentation Openly authored data pipelines for a recap of Project WeCount's pluralistic data infrastructure goals.
How to use it?
This is early-stage work. The file pipelines/WeCount-ODC.json5 contains a very simple three-element
data pipeline which loads two CSV files from git repositories to use as inputs for a "forgiving data merge".
Merged outputs together with provenance information linking back to the source data will be written
into directory dataOutput
by means of the overlay pipe pipelines/WeCount-ODC-fileOutput.json5.
Pipeline pipelines/WeCount-ODC-synthetic.json5 interposes into these pipelines via
an open composition process to interpolate synthetic accessibility data into the joined dataset, whilst recording
into the provenance file that this data is synthetic — any genuine data resulting from the join is preserved,
along with noting its provenance as genuine.
Run the sample pipeline by running
node demoDriver.js
What does it contain?
This package contains two primary engines - firstly, the "tangled mat" structure which is used to track the provenance in overlaid data structures, and secondly the data pipeline which orchestrates tasks consuming and producing CSV files also tracking provenance.
There is also a sample pipeline consuming data from covid-assessment-centres, performing an outer right join, and synthesizing accessibility data suitable for visualisation with covid-data-monitor.
Since it is the most directly usable work, the data pipeline system is described first.
The data pipeline
A data pipeline schedules a free graph of activities, starting from sources of data which may consist of, e.g. HTTP or GitHub fetches, transformations and syntheses of this data and then finally writing it to sinks. This is broadly consistent with the industry-standard ETL (Extract, transform, load) pattern of activity, but our implementation in the pluralistic data infrastructure, rather than being optimised for a large bulk of homogeneous data processed by a scalable infrastructure, is more tuned towards diverse, heterogeneous data derived from a variety of different kinds of sources, which require whole-dataset synthesis and transformation, as well as tracking the provenance of data as it passes through the pipeline on a cell-by-cell basis. This provenance is written to the data sinks as "companion structures" (currently as CSV, also the primary format in which data is accepted into the system) alongside the output data.
The pipeline is configured by a JSON5 structure with top-level elements
{
type: {String} The name of the overall pipeline - a grade name
parents: {String|String[]} Any parent pipelines that this pipeline should be composited together with
require: {String|String[]} Any module-qualified references to files holding JavaScript code that should be loaded
elements: { // Free hash of pipeline elements, where each element contains fields
<element-key>: {
type: <String - global function name>
parents: <String|String[] - parent element grades>
<other type-dependent properties>
}
}
}
The name grade refers to a construct from Fluid's Infusion framework in which the pipeline is implemented, but this isn't a detail relevant to most users. A grade can be thought of as a block of JSON configuration with a global name, which can be merged together with other similar configuration blocks.
An example pipeline can be seen at WeCount-ODC.json5.
Data handled by pipeline elements
Simple pipeline elements are JavaScript functions registered in the global Infusion namespace. These functions accept
and return tabular data as records of the following triple, of type ProvenancedTable
:
{
value: {CSVValue} The data value itself
provenance: {CSVValue} Isomorphic to `value`, with a provenance string for each data value
provenanceMap: {Object} A map of provenance strings to records resolving the provenance - derived from an
element's options minus its data references, plus any runtime data such as datestamps or
commit hashes identifying the particular fetch of the data
}
A CSVValue
takes the following structure:
{
headers: {String[]} An array of strings forming the CSV file's headers
data: {Object[]} CSV data as an array, with each row represented by a hash of header names to cell contents
}
Referring to data produced from other pipeline elements
Elements may refer to the data output by other pipelines by referring to them with Infusion-style context-qualified references in their options - e.g.
joined: {
type: "fluid.forgivingJoin",
left: "{WeCount}.data",
right: "{ODC}.data",
...
refers to the data output by two other elements in the pipeline named WeCount
and ODC
. Note that these references
do not perfectly follow Infusion's (1.x-5.x) established scoping rules, since they will prioritise siblings over parents.
See slide 16 of presentation https://docs.google.com/presentation/d/12vLg_zWS6uXaHRy8LWQLzfNPBYa1E6L-WWyLqH1iWJ4 for
details.
You can also refer to any of the options configured into another pipeline element by referring to its definition in the same context-qualified way.
I/O pipeline elements
Pipeline elements reading and writing data are as follows:
fluid.fetchGitCSV
Loads a single CSV file from a GitHub repository given its coordinates in an options structure. These are encoded in its options as follows:
{
repoOwner: {String} The repo owner.
repoName: {String} The repo name.
[branchName]: {String} [optional] The name of the remote branch to operate.
filePath: {String} The location of the file including the path and the file name.
octokit: {fluid.octokit} The octokit instance holding GitHub credentials - the user does not need to supply
this option, it is filled in automatically by the pipeline
}
For example,
WeCount: {
type: "fluid.fetchGitCSV",
repoOwner: "inclusive-design",
repoName: "covid-assessment-centres",
filePath: "WeCount/assessment_centre_data_collection_2020_09_02.csv"
}
The top-level member data
of this element will resolve to the loaded ProvenancedTable
referenced by its GitHub coordinates.
The provenance key will be derived from the element's name in the pipeline, and the provenance structure will be
filled in from the GitHub commit info.
fluid.fetchUrlCSV
Loads a single CSV file from an HTTP or HTTPS URL. Currently only one option is accepted:
{
url: {String} The URL from which the CSV file is to be fetched
}
For example,
ODC: {
type: "fluid.fetchUrlCSV",
url: "{ODCCoordinates}.data.downloadURL"
}
The top-level member data
of this element will resolve to the loaded ProvenancedTable
referenced by the URL.
The provenance key will be derived from the element's name in the pipeline, and the provenance structure will be
filled in from the URL and its access time.
fluid.csvFileOutput
Outputs CSV and provenance data to CSV and JSON files -
Accepts options:
input: {ProvenancedTable} The data to be written
filePath: {String} The filename of the data to be written
writeProvenance: {Boolean} [optional] - If `true`, companion files will be written alongside `filePath` with
related names, writing the provenance and provenance map structures held in the data
For example:
output: {
type: "fluid.csvFileOutput",
input: "{joined}.data",
filePath: "outputData/output.csv",
writeProvenance: true
}
fluid.dataPipe.commitMultipleFiles
Outputs a collection of CSV and provenance data as a single git commit to a GitHub repository for which suitable credentials has been supplied (see the next section).
Accepts options:
repoOwner: {String} The repository owner.
repoName: {String} The repository name.
branchName: {String} [optional] The name of the remote branch to operate.
commitMessage: {String} The message to be attached to the GitHub commit
files: {TransformableFileEntry[]} An array of files to be written
octokit: {fluid.octokit} The octokit instance holding GitHub credentials - this option is supplied automatically
by the pipeline
A TransformableFileEntry
structure holds the following:
filePath: {String} The path of the file within its repository
writeProvenance: {Boolean} [optional] If `true`, additional files will be written adjacent to the file encoding
its provenance
encoder: {String} The name of a function which will encode the file's contents.
Currently this should be "fluid.data.encodeCSV"
convertEntry: {String|String[]} [optional] The name of one or more functions accepting this `FileEntry`
structure and returning one or more others representing additional data to be written
input: {ProvenancedTable} The data to be written
Data processing pipeline elements
Pipeline elements transforming and processing data are as follows:
fluid.dataPipe.filter
Filters the rows of a CSV dataset using a filter callback function with the same signature as the callback accepted by the builtin JavaScript Array.filter operation.
Accepts options:
input: {ProvenancedTable} The dataset to filter
func: {String} Name of a function registered in the Infusion global namespace performing the filter
e.g. one may write in the pipeline configuration
filter: {
input: "{MyDataSource}.data",
func: "fluid.examples.removeThird",
}
with the following function implementation in JavaScript:
fluid.examples.removeThird = function (row, index) {
return index !== 2;
};
to filter the data by removing the third row.
fluid.forgivingJoin
Executes an inner or outer join given two CSV structures.
This join chooses a single best pair of columns to join on based on a simple value space intersection performed over all pairs of columns in the provided tables. An improvement to this algorithm, with increased computational cost, would be able to select a compound column for the join, as well as providing a ranked list of choices rather than just a single best choice - these improvements are ticketed at DATA-1.
Accepts options:
left: {ProvenancedTable} The left dataset to join
right: {ProvenancedTable} The right dataset to join
outerLeft: {Boolean} [optional] If `true`, a left outer join will be executed
outerRight: {Boolean} [optional] If `true`, a right outer join will be executed
outputColumns: {Object} A map of columns to be output in terms of the input columns. The keys of this map record
the names of the columns to be output, and the corresponding values record the corresponding input
column, in a two-part period-qualified format - before the period comes the provenance name of the
relevant dataset, and after the period comes the column name in that dataset
For example:
joined: {
type: "fluid.forgivingJoin",
left: "{WeCount}.data",
right: "{ODC}.data",
outerRight: true,
outputColumns: {
location_name: "ODC.location_name",
city: "ODC.city",
"Individual Service": "WeCount.Personalized or individual service is offered",
"Wait Accommodations": "WeCount.Queue accommodations"
}
}
Working with GitHub credentials
Requests to GitHub's HTTP API are operated via GitHub's standard Octokit library. An
instance of this library is created automatically by any pipeline which derives from the grade fluid.pipelines.withOctokit
,
which accepts a GitHub authentication token from an environment variable named GITHUB_TOKEN.
A suitable token may be generated via the instructions at the Github documentation.
You need only tick the public_repo
permission when generating this.
Alternatively, you can configure an Octokit instance without authentication using the grade fluid.pipelines.withUnauthOctokit
.
This is suitable for lightweight local testing but note that GitHub will limit the use of such requests to 60 per hour
as described at Resources in the REST API.
Either of fluid.pipelines.withOctokit
or fluid.pipelines.withUnauthOctokit
will construct a {fluid.octokit}
instance in the pipeline that can be resolved by pipeline elements such as fluid.fetchGitCSV
,
fluid.dataPipe.commitMultipleFiles
, etc.
Running a pipeline using a run configuration
The library supports a simple JSON structure which allows for the encoding of a particular pipeline run - that is,
which pipelines should be loaded and merged together. This can then be executed using the library's CLI driver named
runFluidPipeline
by giving it this file as an argument. A run configuration has the following structure:
{
loadDirectory: {String|String[]} [optional] - One or an array of module-qualified names of directories in which
all files are to be loaded as pipelines
loadPipeline: {String|String[]} [optional] One or an array of module-qualified names of individual pipeline
execPipeline: {String} A single pipeline grade names, loaded by the above directives, which will be executed
execMergedPipeline: {String[]} Multiple pipeline grade names which will be merged together and executed
For example, the embedded demo driver at demoDriver looks as follows:
{ // Run configuration file for demonstration pipeline
// Can be run directly from command line via runFluidPipeline %forgiving-data/runDemo.json5
loadDirectory: "%forgiving-data/demo/pipelines",
execMergedPipeline: ["fluid.pipelines.WeCount-ODC-synthetic", "fluid.pipelines.WeCount-ODC-fileOutput"]
}
Here are some possible ways of running this run configuration:
forgiving-data
is a dependency
As an npm script of a module for which "scripts": {
"runFluidPipeline": "runFluidPipeline %forgiving-data/demo/runDemo.json5",
}
From the hosting module you can then run
npm run runFluidPipeline
If forgiving-data is installed as a global module using npm install -g forgiving-data
runFluidPipeline %forgiving-data/demo/runDemo.json5
Via a local npm script as provided in this module
"scripts": {
"runFluidPipeline": "node src/cli.js"
}
From this module you can then run
npm run runFluidPipeline -- %forgiving-data/demo/runDemo.json5
Note that module-qualified references such as %forgiving-data/runDemo.json5
are described in the Infusion API
page on fluid.module.resolvePath.
Pipeline element provenance marker grades
There are currently two marker grades which can be applied to these elements, configuring the strategy to be used for
intepreting the provenance of data produced by the element. If one of these is not supplied, the element is assumed
to fully fill out the provenance
and provenanceMap
return values by itself (as does fluid.forgivingJoin
):
fluid.overlayProvenancePipe
A pipeline element implementing this grade does not fill in its provenance
or provenanceMap
elements in its return
value - instead it returns a partial data overlay of CSV values it wishes to edit in values
, and the pipeline performs
the merge, resolves the resulting provenance, and adds a fresh record into provenanceMap
indicating that the pipeline
element sourced the data from itself.
There is a sample pipeline element fluid.covidMap.inventAccessibilityData
implementing fluid.overlayProvenancePipe
that
synthesizes accessibility data as part of the sample demoDriver.js
pipeline.
fluid.selfProvenancePipe
A simpler variety of element for which the pipeline synthesizes provenance, that assumes that all values produced by the pipeline have the same provenance (the element itself). This is suitable, for example, for elements which load data from some persistent source which has not itself encoded any provenance information (e.g. a bare CSV file).
The builtin element fluid.fetchGitCSV
is of this kind.
Loading and running the pipeline via a "run configuration"
The pipeline system supports a minimal JSON/JSON5 format representing a run configuration of a pipeline, as described in the section above.
A run of the demo pipeline is encoded by file runDemo.json5, and can be executed by either
a function call fluid.dataPipeline.runCLI("%forgiving-data/demo/runDemo.json5")
as seen in the demo driver, or else
at the command line via
runFluidPipeline %forgiving-data/demo/runDemo.json5
if this package is installed as a local dependency.
Loading and running the pipeline manually
The pipeline structure is loaded by using fluid.dataPipeline.loadAll
and then executed via fluid.dataPipeline.build
as per the sample in demoDriver.js - e.g.
fluid.dataPipeline.loadAll("%forgiving-data/pipelines");
var pipeline = fluid.dataPipeline.build(["fluid.pipelines.WeCount-ODC-synthetic", "fluid.pipelines.WeCount-ODC-fileOutput"]);
pipeline.completionPromise.then(function (result) {
console.log("Pipeline executed successfully");
}, function (err) {
console.log("Pipeline execution error", err);
});
The tangled mat
The tangled mat is a low-level implementation structure used in the data pipeline's provenance system. It helps the work of keeping track of provenance changes to data structures as they are operated in code.
The basic operation of the tangled mat is illustrated in diagram Tangled Mat. Several layers of arbitrary JSON shape may be combined, with a resulting structure which is the result of a deep merge as if they had been combined using
jQuery/fluid.merge(true, []/{}, layer1, layer2 ...)
With the difference that
- The merging is performed in a lazy manner - a result in the merged structure is only evaluated at the point it is demanded,
- The provenance of each final object in the merged structure is available in a "provenance map" which is isomorphic to the merged value, recording which layer each resulting leaf value was drawn from
- As with traditional merge, the arguments are not modified - however, in the case where a portion of the mat is "uncontested" (drawn from a single argument), the argument will not be cloned, and treated as immutable by producer and consumer
The primary API of the mat consists of its creator function fluid.tangledMat
, and methods addLayer
, getRoot
,
readMember
, getProvenance
, evaluateFully
and setWritableLayer
- see JSDocs. An example session interacting with the
mat mirroring the example drawing:
var mat = fluid.tangledMat([{
value: {
a: 1
},
name: "Layer1"
}, {
value: {
b: 2
},
name: "Layer2"
}]);
mat.addLayer(
{
c: {
d: 4
}
}, "Layer3"
);
var provenanceMap = mat.getProvenance();
// returns {
// a: "Layer1",
// b: "Layer2",
// c: {
// d: "Layer3"
// }
// }
Forward-looking notes on the "Tangled Mat"
The implementation here is of a basic quality which is sufficient for operating on modest amounts of tabular data in pipelines as seen here. However, this provenance-tracking structure will be the basis of the reimplementation of Infusion (probably as "Infusion 5") of which the signature feature is the "Infusion Compiler" loosely written up as FLUID-5304. The key architecture will be that each grade definition ends up as a mat layer, which will then appear in multiple mats, one mat for each co-occurring set of grade definitions (and/or distributions). These mats will be then indexed in a Trie-like structure allowing for quick lookup of partially evaluated merged structures.
Key improvements that are required to bring this implementation to the quality needed for FLUID-5304:
- Ability to store mats with primitive elements at the root, or otherwise to more clearly distinguish where "mutable roots" occur in the structure (in terms of current semantics, where there "is a component", but in future these will be far smaller, lightweight structures)
- Ability to chain evaluation of nested mats and their provenance structures (improvement of current "compound provenance" model)
- Ability to invalidate previously evaluated parts of mats when their constituents change
- Ability to register "change listeners" responsive to changes of parts of mat structure, which are themselves encoded as mat contents, and can also specify their semantics for evaluation (that is, whether they require full evaluation as in FLUID-5891
- Where a region of the mat is "uncontested" (see above), no metadata entry will be allocated in the mat top. This implies that all access will be made through iterators which are capable of tracking which was the last visited part of the mat top before we fell off it.