blobby

An HTTP Proxy for Blob storage systems (such as S3) that automatically shards and replicates your data

No, not that Mr. Blobby.

Blobby is an HTTP Proxy for Blob storage systems (such as S3) that automatically shards and replicates your data. Useful for single and multi-datacenter architectures, blobby scales your storage and throughput by way of sharding, and enables fast local reads in multi-datacenter replication setups. Additionally, blobby provides a simple CLI for analyzing your complex data architectures by way of storage comparisons, repairs, stats, and more.

Installation

Blobby can be installed as a local dependency of your app:

npm i blobby --save
./node_modules/.bin/blobby

Or installed globally:

npm i blobby -g
blobby

Basic Usage

Start the HTTP Proxy Server:

blobby server

Copy between storage systems:

blobby copy myOldStorage myNewStorage

See help for a full list of commands:

blobby help

See also the Full Command List section.

Options

A number of configuration formats are supported, including JSON, JSON5, CommonJS, and Secure Configurations.

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| config | arrayOf(string) | [] | One or more configuration files. If none are provided, config-env will be used |
| config-dir | string | "config" | Directory of configuration files |
| config-env | string | "NODE_ENV" | Environment variable used to detect configuration |
| config-default | string | "local" | Default configuration to use if environment is not available |
| config-base | string | none | If specified, this configuration is used as the base (defaults) config into which the others are deep merged |
| config-exts | arrayOf(string) | ['.json', '.json5', '.js'] | Supported extensions to detect for configuration files |
| secure-config | string | none | Directory of secure configuration files |
| secure-secret | string | none | The secret required to decrypt secure configuration files |
| secure-file | string | none | File to load that holds the secret required to decrypt secure configuration files |
| mode | string | "headers" | Used when comparing files. For usage see Compare Modes |
| recursive | boolean | true | Enable deep query (recursive subdirectories) for operations that support it |
| removeGhosts | boolean | false | For repairs: if true, will remove missing file instances instead of copying to the missing storage |
| resume-key | string | none | If a previous command was stopped, you can resume from where you left off with this option |
| date-min | string | none | Minimum date required when processing records; all others are ignored |
| date-max | string | none | Maximum date required when processing records; all others are ignored |
| retry-min | number | 1000 | Minimum timeout (in ms) for first retry, where retries are applicable |
| retry-factor | number | 2 | Multiple in time applied to retry attempts, where retries are applicable |
| retry-attempts | number | 3 | Maximum retry attempts before failure is reported, where retries are applicable |
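
The retry options above read like a conventional exponential backoff. As a rough sketch, assuming each delay is retry-min multiplied by retry-factor once per earlier attempt (the exact formula Blobby applies is not stated here, and retryDelays is a hypothetical helper, not part of Blobby):

```javascript
// Hypothetical helper: lists the wait before each retry, ASSUMING
// delay = min * factor^attemptIndex (conventional exponential backoff).
function retryDelays({ min = 1000, factor = 2, attempts = 3 } = {}) {
  const delays = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(min * Math.pow(factor, i));
  }
  return delays;
}

console.log(retryDelays()); // [ 1000, 2000, 4000 ] under the stated assumption
```

Under that assumption the CLI defaults would wait 1, 2, and 4 seconds before reporting failure.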

Example using the default NODE_ENV environment variable to load config data:

blobby server --config-dir lib/config
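
To make the relationship between config-dir, config-env, config-default, and config-exts more concrete, the sketch below lists the files such a lookup might probe. It assumes the configuration name is read from the environment variable named by config-env, falls back to config-default, and that each extension is tried in order. candidateConfigFiles is a hypothetical helper and the real lookup order inside Blobby may differ.

```javascript
const path = require('path');

// Hypothetical helper, not part of Blobby: lists candidate configuration
// files, ASSUMING the name comes from the env variable named by config-env
// and falls back to config-default when that variable is unset.
function candidateConfigFiles(options = {}, env = process.env) {
  const {
    configDir = 'config',
    configEnv = 'NODE_ENV',
    configDefault = 'local',
    configExts = ['.json', '.json5', '.js']
  } = options;
  const name = env[configEnv] || configDefault;
  // path.posix keeps the output identical across platforms.
  return configExts.map((ext) => path.posix.join(configDir, name + ext));
}

console.log(candidateConfigFiles({ configDir: 'lib/config' }, { NODE_ENV: 'production' }));
console.log(candidateConfigFiles({ configDir: 'lib/config' }, {}));
```

With NODE_ENV=production this yields lib/config/production.json, .json5, and .js; with no NODE_ENV it yields the lib/config/local.* equivalents.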

Configuration

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| http | HttpBindings | { "default": { "port": 80 } } | Collection (hash for ease of merging) of HTTP bindings |
| http.{id} | HttpBinding | (required) | HTTP Binding Object |
| http.{id}.port | number | 80 | Port to bind to |
| http.{id}.host | string | undefined | Host to bind to, or nothing to use Node.js default |
| http.{id}.ssl | Object | (required if enabling SSL) | See Node.js TLS Options |
| http.{id}.ssl.key | Buffer or string | none | If string, will attempt to load private key from disk |
| http.{id}.ssl.cert | Buffer or string | none | If string, will attempt to load certificate from disk |
| httpAgent | Object or Boolean | Defaults | |
| httpHandler | string | undefined | If a path is provided to a module (Function(req, res)), the parent app can peek into incoming requests. If the handler returns false, Blobby will ignore the request altogether and assume the parent is handling the response |
| storage | StorageBindings | (required) | Collection of storage bindings |
| storage.{id} | StorageBinding | (required) | Storage Binding Object |
| storage.{id}.driver | string | (required) | Module name/path to use as storage client |
| storage.{id}.maxUploadSize | number | none | Size in bytes allowed by uploads |
| storage.{id}.cacheControl | string | "public,max-age=31536000" | Default cache control headers to apply for GETs and PUTs if the file does not provide them |
| storage.{id}.accessControl | string | "public-read" | Defaults to publicly readable. Full ACL List |
| storage.{id}.dirSplit | number | false | (future) If a number, auto-split paths every N characters to make listing of directories much faster |
| storage.{id}.auth | string | none | Required to support Uploads and Deletes, see Secure API Operations |
| storage.{id}.replicas | arrayOf(string) | [] | Required to support Replication, see File Replication |
| storage.{id}.options | Object | {} | Options provided to storage driver |
| retry | RetryOptions | (optional) | Retry options used by some HTTP Server operations |
| retry.min | number | 500 | Minimum timeout (in ms) for first retry |
| retry.factor | number | 2 | Multiple in time applied to retry attempts |
| retry.retries | number | 3 | Maximum retry attempts before failure is reported |
| cors | CorsOptions | (optional) | CORS access is enabled by default, for GETs only |
| cors.access-control-allow-credentials | string | true | Allow credentials |
| cors.access-control-allow-headers | string | * | Allow headers |
| cors.access-control-allow-methods | string | GET | Allow methods |
| cors.access-control-allow-origin | string | * | Allow origins |
| cors.access-control-max-age | string | 86400 | Cache duration of CORS headers |
| auth | AuthOptions | (optional) | Collection of named auth groups |
| auth.{id}.driver | string | (required) | Path of the driver to load, e.g. blobby-auth-header |
| auth.{id}.options | Object | (optional) | Any options to pass to the auth driver |
| auth.{id}.publicReads | Boolean | true | Set to false if GETs also require auth |
| log | LogOptions | (optional) | Options based on EventEmitter |
| log.warnings | bool | true | Log warnings to console.warn automatically. You can subscribe to client.on('warn') if you prefer |
| log.errors | bool | true | Log errors to console.error automatically. You can subscribe to client.on('error') if you prefer |

Storage Drivers

  • blobby-s3 - An S3 storage client for Blobby, powered by Knox.
  • blobby-fs - A File System storage client for Blobby.
  • blobby-gcp-storage - A Google Cloud storage client for Blobby.

Secure Configuration

An optional feature for sensitive credentials is to leverage the included Config Shield support. Any secure configuration objects will be merged into the parent configuration object. If the secure-config option is provided, every configuration file is expected to have a corresponding secure configuration file with the same file name under the secure-config directory.

blobby server --secure-config config/secure --secure-file config/secure/secret.txt

Example for creating a secure configuration:

npm i config-shield -g
cshield config/secure/local.json config/secure/secret.txt
set storage { app1: { options: { password: 'super secret!' } } }
save
exit

See Config Shield for more advanced usage.
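
The effect of merging a secure configuration object into its parent can be pictured with a plain deep merge. The sketch below is illustrative only: deepMerge is a hypothetical helper, someOption is a placeholder option name, and Blobby's own merge may treat edge cases such as arrays differently.

```javascript
// Hypothetical deep merge, shown only to illustrate the merge semantics;
// Blobby's own merge implementation may differ (e.g. in array handling).
function isPlainObject(value) {
  return value !== null && typeof value === 'object' && !Array.isArray(value);
}

function deepMerge(base, override) {
  const result = { ...base };
  for (const [key, value] of Object.entries(override)) {
    result[key] = isPlainObject(value) && isPlainObject(result[key])
      ? deepMerge(result[key], value) // merge nested objects key by key
      : value;                        // scalars and arrays replace the base value
  }
  return result;
}

// Regular config (placeholder values) and the decrypted secure config.
const plainConfig = { storage: { app1: { driver: 'blobby-s3', options: { someOption: 'value' } } } };
const secureConfig = { storage: { app1: { options: { password: 'super secret!' } } } };

console.log(JSON.stringify(deepMerge(plainConfig, secureConfig), null, 2));
```

The merged storage.app1.options ends up holding both the non-sensitive option and the password, while the driver setting is untouched.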

Server

Start HTTP Server using the provided Configuration.

blobby server

REST API

| Method | Route | Auth | Info |
| --- | --- | --- | --- |
| GET | /{storageId}/{filePath} | Public | Get a file from storage |
| HEAD | /{storageId}/{filePath} | Public | Get info for file from storage |
| PUT | /{storageId}/{filePath} | Secure | Create or overwrite file in storage |
| PUT (copy) | /{storageId}/{filePath} | Secure | Copy file via experimental header x-amz-copy-source: [optional-bucket:]/source/path |
| DELETE | /{storageId}/{filePath} | Secure | Delete file from storage |
| GET | /{storageId}/{directoryPath}/ | Secure | Get directory contents by postfixing the path with / |
| DELETE | /{storageId}/{filePath}/ | Secure | Delete directory (recursively) from storage |

Example Usage:

curl -XPUT -H "Authorization: ApiKey shhMySecret" --data-binary "@./some-file.jpg" http://localhost/myStorage/some/file.jpg
curl -I http://localhost/myStorage/some/file.jpg
curl http://localhost/myStorage/some/file.jpg
curl -H "Authorization: ApiKey shhMySecret" http://localhost/myStorage/some/
curl -XDELETE -H "Authorization: ApiKey shhMySecret" http://localhost/myStorage/some/file.jpg

Default permissions will be applied via storage.{id}.accessControl, but can be overridden via the x-amz-acl header, like so:

curl -XPUT -H "x-amz-acl: private" -H "Authorization: ApiKey shhMySecret" --data-binary "@./some-file.jpg" http://localhost/myStorage/some/file.jpg
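
The same requests can be issued from Node.js. The sketch below only builds the URL and request options so it runs without a server. blobbyRequest is a hypothetical helper, and the "ApiKey" authorization scheme simply mirrors the curl examples; what your server accepts depends on the auth driver you configure.

```javascript
// Hypothetical helper that builds fetch() arguments for Blobby's REST API.
// The "ApiKey" scheme mirrors the curl examples; your auth driver may differ.
function blobbyRequest(baseUrl, method, storageId, filePath, { apiKey, acl, body } = {}) {
  const headers = {};
  if (apiKey) headers.Authorization = `ApiKey ${apiKey}`;
  if (acl) headers['x-amz-acl'] = acl; // overrides storage.{id}.accessControl
  // Encode each path segment but keep the slashes (a trailing / lists a directory).
  const encodedPath = filePath.split('/').map(encodeURIComponent).join('/');
  return {
    url: `${baseUrl}/${encodeURIComponent(storageId)}/${encodedPath}`,
    init: { method, headers, body }
  };
}

const { url, init } = blobbyRequest('http://localhost', 'PUT', 'myStorage', 'some/file.jpg', {
  apiKey: 'shhMySecret',
  acl: 'private',
  body: Buffer.from('example bytes')
});
console.log(url); // http://localhost/myStorage/some/file.jpg
// With a server running (Node.js 18+): fetch(url, init).then((res) => console.log(res.status));
```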

The above examples are a perfect segue into Secure API Operations.

Secure API Operations

As indicated in Configuration, storage.{id}.auth is required to support uploads and deletes.

Example Config:

  auth: {
    mainAuth: {
      driver: './lib/my-jwt-handler',
      options: { /* options only my auth driver will understand */ }
    }
  },
  storage: {
    store1: {
      driver: '...',
      auth: 'mainAuth' // uploads to store1 require mainAuth
    }
  }

If you're creating your own Authorization handler, you can export a module with the following format:

module.exports = function (req, storageId, fileKey, authConfig, cb) {
  // doSomethingAsync stands in for your own asynchronous credential check
  doSomethingAsync(function (err) {
    if (err) return void cb(err); // fail authorization

    cb(); // authorization check passed, let them through
  });
};

Your handler can be synchronous or asynchronous, but cb must be invoked in either case.

Authorization Drivers

File Replication

As indicated in Configuration, storage.{id}.replicas is required to enable replication. An array of one or more replicas can be provided, each consisting of the storage identifier and, optionally, the configuration identifier if the desired storage exists in a different environment (such as replication across data centers).

Format is [ConfigId::]StorageId, where ConfigId only needs to be specified if from a different environment.

Example of two replicas, one from same environment, other from a different environment:

replicas: ['myOtherStorage', 'otherConfig::AnotherStorage']
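
To show how the [ConfigId::]StorageId format breaks down, here is a small parser. parseReplica is a hypothetical helper for illustration, not a function exported by Blobby.

```javascript
// Hypothetical parser for the [ConfigId::]StorageId replica format.
function parseReplica(replica) {
  const separator = replica.indexOf('::');
  if (separator === -1) {
    return { configId: null, storageId: replica }; // same environment
  }
  return {
    configId: replica.slice(0, separator),
    storageId: replica.slice(separator + 2)
  };
}

console.log(parseReplica('myOtherStorage'));
console.log(parseReplica('otherConfig::AnotherStorage'));
```

The first call reports no configId (same environment); the second splits into configId otherConfig and storageId AnotherStorage.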

Important: Successful uploads (PUTs) and deletes (DELETEs) are only confirmed once all replicas have been written to. This avoids data inconsistencies and race conditions (e.g. performing an action on an asset before it has been written in all locations). The downside of this approach is that high availability is expected of every replica, and uploads (or deletes) will fail if one of the replicas cannot be written to. In cases where speed is more important than consistency, the query string parameter waitForReplicas=0 can be set, in which case success is returned as soon as the local storage write succeeds. Replication still takes place; there is no way to turn it off without removing the replicas from configuration.
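
The difference between the default behaviour and waitForReplicas=0 can be sketched as follows. This illustrates the semantics only and is not Blobby's code: local and replicas are hypothetical objects exposing an async put(key, data) method.

```javascript
// Illustration only: `local` and `replicas` are hypothetical objects with an
// async put(key, data) method; this is not Blobby's actual implementation.
async function putWithReplicas(local, replicas, key, data, { waitForReplicas = true } = {}) {
  await local.put(key, data); // the local write must always succeed
  const replication = Promise.all(replicas.map((replica) => replica.put(key, data)));
  if (waitForReplicas) {
    await replication; // any replica failure fails the whole upload
  } else {
    // Replication still runs, but the caller is not held up by it.
    replication.catch((err) => console.warn('replication failed:', err.message));
  }
}
```

With waitForReplicas disabled the caller gets an early success, at the cost of a window in which replicas may lag or have silently failed.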

Full Command List

Commands:
  checkdir <dir> <storage..>  One-Way shallow directory compare between storage
                              bindings and/or environments
  check <storage..>           One-Way compare files between storage bindings
                              and/or environments
  compare <storage..>         Compare files between storage bindings and/or
                              environments
  copydir <dir> <storage..>   One-way shallow directory copy between storage
                              bindings and/or environments
  copy <storage..>            One-way copy of files between storage bindings
                              and/or environments
  shard <storage> <dir>       Look up the given shard for a given storage and
                              path
  initialize <storage..>      Perform any initialization tasks required by the
                              given storage (ex: pre-creating bucket shards in
                              S3)
  repair <storage..>          Repair files between storage bindings and/or
                              environments
  rmdir <dir> <storage..>     Delete files for the given directory and storage
                              bindings and/or environments
  server                      Start HTTP API Server
  acl <dir> <storage..>       Set ACL's for a given directory for the given
                              storage bindings and/or environments
  stats <storage..>           Compute stats for storage bindings and/or
                              environments

Compare

For comparing the difference between storage bindings and/or environments. This is a two-way comparison. Use check instead if you only want to do a one-way comparison.

blobby compare <storage..>

Example of comparing two bindings:

blobby compare old new

Example of comparing one binding across 2 datacenters:

blobby compare app --config dc1 dc2

Example of comparing two bindings across 2 datacenters:

blobby compare old new --config dc1 dc2

Compare Modes

blobby compare old new --mode deep

Available modes:

  • fast - A simple check of file existence. Only recommended when you're comparing stores configured for immutable data. Size check will also be performed, if the storage driver provides it.
  • headers (recommended) - Similar in speed to fast, but requires ETag or LastModified headers or comparison will fail. Should only be used between storage drivers that support at least one of these headers. NOTE: S3 should only be compared against other S3 storages in this mode due to their inability to overwrite these headers.
  • deep - Performs an ETag check if available, otherwise falls back to loading files and performing hash checks. This option can range from a little slower, to much slower, depending on ETag availability. Recommended for mutable storage comparisons where caching headers are not available (ex: comparing a file system with S3 or vice versa).
  • force - If you want to skip comparison for any reason, this will force the comparison to fail, resulting in an update of the destination for all source files. Also has the benefit of being the fastest option, since the destination does not need to be queried.
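
The four modes can be summarised as a single decision function. The sketch below is a reading of the descriptions above, not Blobby's implementation: filesMatch is hypothetical, and src and dst are assumed descriptors of the form { exists, size, etag, lastModified, body }.

```javascript
const crypto = require('crypto');

// Illustration only: src and dst are hypothetical descriptors of the form
// { exists, size, etag, lastModified, body }. Not Blobby's implementation.
function md5(buffer) {
  return crypto.createHash('md5').update(buffer).digest('hex');
}

function filesMatch(mode, src, dst) {
  if (mode === 'force') return false;            // always treated as different
  if (!dst || !dst.exists) return false;         // missing at the destination
  switch (mode) {
    case 'fast':                                 // existence, plus size when known
      return src.size === undefined || dst.size === undefined || src.size === dst.size;
    case 'headers':                              // needs ETag or LastModified
      if (src.etag && dst.etag) return src.etag === dst.etag;
      if (src.lastModified && dst.lastModified) return src.lastModified === dst.lastModified;
      return false;                              // no usable header: comparison fails
    case 'deep':                                 // ETag if available, else hash the bodies
      if (src.etag && dst.etag) return src.etag === dst.etag;
      return md5(src.body) === md5(dst.body);
    default:
      throw new Error(`Unknown mode: ${mode}`);
  }
}
```

Note that force never inspects the destination at all, which is why it is the cheapest mode.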

Repair

For repairing the difference between storage bindings and/or environments. This is a two-way repair. Use copy instead if you only want to do a one-way repair.

blobby repair <storage..>

Example of syncing data between old and new storage:

blobby repair old new

Example of syncing one storage across 2 datacenters:

blobby repair app --config dc1 dc2

Example of syncing two storages across 2 datacenters:

blobby repair old new --config dc1 dc2

For usage of mode, see Compare Modes.

Stats

Query statistics against your storage(s).

blobby stats <storage..>

Example of querying stats for a single storage:

blobby stats old

Initialize

Useful one-time initialization required by some storage drivers, such as pre-creating shard buckets in S3.

blobby initialize <storage..>

Example of initializing a single storage:

blobby initialize new

Shard

Useful for identifying the location of a given directory for storage drivers that support sharding.

blobby shard <storage> <dir>

Example:

blobby shard new 'some/path'
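
For readers unfamiliar with sharding, the general idea of mapping a directory to a shard can be sketched with a hash. This is a generic technique shown for intuition only: it is not the algorithm used by Blobby or any of its drivers, and shardFor, shardCount, and the bucket naming scheme are all hypothetical.

```javascript
const crypto = require('crypto');

// Generic hash-based sharding, NOT Blobby's or any driver's actual algorithm.
// shardCount and the bucket naming scheme are hypothetical.
function shardFor(dir, shardCount, bucketPrefix = 'my-bucket') {
  const digest = crypto.createHash('md5').update(dir).digest();
  const index = digest.readUInt32BE(0) % shardCount; // stable index in [0, shardCount)
  return `${bucketPrefix}-${index}`;
}

console.log(shardFor('some/path', 4));
```

Because the hash is deterministic, the same directory always maps to the same shard, which is what makes a lookup command like the one above possible.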