dbcp

Dump Mysql, Postgres, SQLServer, and ElasticSearch databases directly to AWS S3, Google Cloud Storage, or Azure.

Usage no npm install needed!

<script type="module">
  import dbcp from 'https://cdn.skypack.dev/dbcp';
</script>

README

dbcp image test Coverage

$ dbcp --help
cli.js [sourceFile] [targetFile]
$ dbcp data.parquet data.jsonl.gz
$ dbcp data.jsonl.gz s3://bucket/data.csv.gz

Copy from or to MySQL, PostgreSQL, SQLServer, LevelDB, MongoDB, and ElasticSearch directly to/from files on Amazon Web Services (AWS) S3, Google Cloud Storage (GCS), Microsoft Azure, SMB, HTTP, or another database.

Automatically converts between supported formats JSON, ND-JSON, CSV, SQL, Parquet, and TFRecord (with optional gzip compression).

CLI

Either --sourceType or --sourceFile and --targetType or --targetFile are required. Other options can be shortened, e.g --user instead of --sourceUser. Only a database-to-database copy requires both --sourceUser and --targetUser. The file format and compression is inferred from the filename. dbcp pipes Readable Node.JS streams to Writable streams. No intermediate storage is required.

API

  • The transformObject/transformObjectStream API can be used for streaming transforms of Big Data.
  • The CLI uses transformBytes/transformBytesStream to render progress updates.

Features

  • dbcp supports sharding. It can split or join groups of files.
  • dbcp can convert files from one format to another.
  • dbcp supports compound inserts, which can insert groups of associated rows from multiple tables.
  • dbcp can translate SQL dialects, e.g. dump a Postgres table to .sql file with SQLServer CREATE and INSERT syntax.

Credits

Modules

Setup

Global install

$ npm install -g dbcp
$ dbcp --help

Local setup

$ npm init
$ npm install dbcp
$ ./node_modules/.bin/dbcp --help

Examples

API

Write object stream to any source and format

  import { AnyFileSystem } from '@wholebuzz/fs/lib/fs'
  import { LocalFileSystem } from '@wholebuzz/fs/lib/local'
  import { S3FileSystem } from '@wholebuzz/fs/lib/s3'
  import { dbcp } from 'dbcp'
  import StreamTree from 'tree-stream'
  
  const fileSystem = new AnyFileSystem([
    { urlPrefix: 's3://', fs: new S3FileSystem() },
    { urlPrefix: '', fs: new LocalFileSystem() }
  ])
  
  await dbcp({
    fileSystem,
    targetFile: 's3://foo/bar.jsonl',
    // e.g. from level (https://www.npmjs.com/package/level)) database
    sourceStream: StreamTree.readable(levelIteratorStream(leveldb.iterator())),
  })

Read object stream from any source and format

  import { dbcp, openNullWritable } from 'dbcp'
  import { Transform } from 'stream'

  // Supply transformObject and a do-nothing Writable for targetStream.
  await dbcp({
    fileSystem,
    sourceFiles: [ { url: '/tmp/foobar.csv.gz' } ],
    targetStream: [ openNullWritable() ],
    transformObject: (x) => { console.log('test', x) },
  })

  // Or alternatively supply targetStream with targetFormat = object
  await dbcp({
    fileSystem,
    sourceFiles: [ { url: '/tmp/foobar.csv.gz' } ],
    // Without targetFormat = object, transform() would receive Buffer
    targetFormat: DatabaseCopyFormat.object,
    targetStream: [
      StreamTree.writable(new Transform({
        objectMode: true,
        transform(data, _, cb) {
          console.log('test', data)
          cb()
        },
      }))
    ],
  })

CLI

Tested

PASS src/index.test.ts (85.9 s)
  ✓ Should hash test data as string
  ✓ Should hash test data stream
  ✓ Should copy local file 
  ✓ Should read local directory
  ✓ Should convert to JSON from ND-JSON and back
  ✓ Should convert to sharded JSON from ND-JSON and back
  ✓ Should convert to Parquet from ND-JSON and back
  ✓ Should convert to TFRecord from ND-JSON and back
  ✓ Should load to level from ND-JSON and dump to JSON after external sort
  ✓ Should restore to and dump compound data
  ✓ Should restore to and dump from Elastic Search to ND-JSON
  ✓ Should restore to and dump from MongoDB to ND-JSON
  ✓ Should restore to and dump from Postgres to ND-JSON
  ✓ Should restore to and dump from Postgres to SQL
  ✓ Should not hang on error
  ✓ Should copy from Postgres to Mysql
  ✓ Should copy from Postgres to SQL Server
  ✓ Should dump from Postgres to Parquet file
  ✓ Should dump from MySQL to Parquet file
  ✓ Should dump from SQL Server to Parquet file

API Interface

export async function dbcp(args: DatabaseCopyOptions)

export interface DatabaseCopyOptions {
  batchSize?: number
  columnType?: Record<string, string>
  compoundInsert?: boolean
  contentType?: string
  copySchema?: DatabaseCopySchema
  engineOptions?: any
  externalSortBy?: string[]
  extra?: Record<string, any>
  extraOutput?: boolean
  fileSystem?: FileSystem
  group?: boolean
  groupLabels?: boolean
  limit?: number
  orderBy?: string[]
  query?: string
  shardBy?: string
  schema?: Column[]
  schemaFile?: string
  sourceConnection?: Record<string, any>
  sourceElasticSearch?: Client
  sourceFormat?: DatabaseCopyFormat
  sourceFiles?: DatabaseCopySourceFile[] | Record<string, DatabaseCopySourceFile>
  sourceHost?: string
  sourceLevel?: level.LevelDB | LevelUp
  sourceName?: string
  sourceKnex?: Knex
  sourcePassword?: string
  sourceShards?: number
  sourceStream?: ReadableStreamTree
  sourceTable?: string
  sourceType?: DatabaseCopySourceType
  sourcePort?: number
  sourceUser?: string
  targetConnection?: Record<string, any>
  targetElasticSearch?: Client
  targetFormat?: DatabaseCopyFormat
  targetFile?: string
  targetHost?: string
  targetKnex?: Knex
  targetLevel?: level.LevelDB | LevelUp
  targetName?: string
  targetPassword?: string
  targetShards?: number
  targetStream?: WritableStreamTree[]
  targetTable?: string
  targetType?: DatabaseCopyTargetType
  targetPort?: number
  targetUser?: string
  tempDirectory?: string
  transformObject?: (x: unknown) => unknown
  transformObjectStream?: () => Duplex
  transformBytes?: (x: string) => string
  transformBytesStream?: () => Duplex
  where?: Array<string | any[]>
}

CLI Options

$ dbcp --help
cli.js [sourceFile] [targetFile]

Options:
  --help            Show help                                          [boolean]
  --version         Show version number                                [boolean]
  --compoundInsert  Compound insert mode can insert associated rows from
                    multiple tables.                                   [boolean]
  --contentType     Content type                                        [string]
  --dataOnly        Dump only the data, not the schema (data definitions).
                                                                       [boolean]
  --dbname          Database                                            [string]
  --externalSortBy  Sort data by property(s) with external-sorting       [array]
  --format
    [choices: "csv", "json", "jsonl", "ndjson", "object", "parquet", "tfrecord",
                                                                          "sql"]
  --group           Group inputs with equinvalent orderBy              [boolean]
  --host            Database host                                       [string]
  --limit           Database query LIMIT                                [number]
  --orderBy         Database query ORDER BY                              [array]
  --password        Database password                                   [string]
  --port            Database port                                       [string]
  --query           Query                                               [string]
  --schemaFile      Use schema file if required, instead of schema inspection.
                                                                        [string]
  --schemaOnly      Dump only the object definitions (schema), not data.
                                                                       [boolean]
  --shardBy         Shard (or split) the data based on key              [string]
  --shards          The number of shards to split or join the data      [number]
  --sourceFile      Source file                                          [array]
  --sourceFormat
    [choices: "csv", "json", "jsonl", "ndjson", "object", "parquet", "tfrecord",
                                                                          "sql"]
  --sourceHost      Source host                                         [string]
  --sourceName      Source database                                     [string]
  --sourcePassword  Source database password                            [string]
  --sourcePort      Source database port                                [string]
  --sourceShards    Source shards                                       [number]
  --sourceTable     Source database table                               [string]
  --sourceType      Source database type
     [string] [choices: "athena", "es", "level", "mssql", "mysql", "postgresql",
                                                                       "sqlite"]
  --sourceUser      Source database user                                [string]
  --table           Database table                                      [string]
  --targetFile      Target file                                         [string]
  --targetFormat
    [choices: "csv", "json", "jsonl", "ndjson", "object", "parquet", "tfrecord",
                                                                          "sql"]
  --targetHost      Target host                                         [string]
  --targetName      Target database                                     [string]
  --targetPassword  Target database password                            [string]
  --targetPort      Target database port                                [string]
  --targetShards    Target shards                                       [number]
  --targetTable     Target database table                               [string]
  --targetType      Target database type
     [string] [choices: "athena", "es", "level", "mssql", "mysql", "postgresql",
                                                                       "sqlite"]
  --targetUser      Target database user                                [string]
  --user            Database user                                       [string]
  --where           Database query WHERE                                 [array]

CLI Examples

Copy PostgreSQL table to Google Cloud Storage gzipped JSON file

$ dbcp \
  --sourceType postgresql \
  --host localhost \
  --dbname postgres \
  --port 5433 \
  --user postgres \
  --password postgres \
  --table foobar \
  --targetFile gs://bucket/file.json.gz

Copy MySQL table to Amazon Web Services S3 gzipped JSON-Lines file

$ dbcp \
  --sourceType mysql \
  --host localhost \
  --dbname mydb \
  --port 8083 \
  --user root \
  --password wp \
  --table foobar \
  --format jsonl \
  --targetFile s3://bucket/object.jsonl.gz

Copy Amazon Web Services S3 gzipped JSON-Lines to MySQL table

$ dbcp \
  --targetType mysql \
  --host localhost \
  --dbname mydb \
  --port 8083 \
  --user root \
  --password wp \
  --table foobar \
  --sourceFile s3://bucket/object.jsonl.gz

Copy SQLServer table to stdout

$ dbcp \
  --sourceType mssql \
  --host localhost \
  --dbname mymsdb \
  --port 1433 \
  --user SA \
  --password "MyP@ssw0rd#" \
  --table foobar \
  --targetFile=-

Copy MongoDB table to four gzipped JSON-Lines shards

$ dbcp \
  --sourceType mongodb \
  --host localhost \
  --port 27017 \
  --user root \
  --password example \
  --dbname test_db \
  --table dbcptest \
  --targetFile output-SSSS-of-NNNN.jsonl.gz \
  --targetShards 4 \
  --shardBy id

$ ls output*
-rw-r--r--    1 user    staff    782701 Feb  4 10:59 output-0001-of-0004.jsonl.gz
-rw-r--r--    1 user    staff    771980 Feb  4 10:59 output-0003-of-0004.jsonl.gz
-rw-r--r--    1 user    staff    794959 Feb  4 10:59 output-0000-of-0004.jsonl.gz
-rw-r--r--    1 user    staff    788720 Feb  4 10:59 output-0002-of-0004.jsonl.gz

Output a file to stdout

$ dbcp gs://bucket/archive.csv.gz | jq . | less

Copy a file from AWS to GCP

$ dbcp s3://bucket/object.json.gz gs://bucket/file.json.gz

Convert file from ND-JSON to JSON

$ dbcp foobar.jsonl bazbat.json

Download a file

$ dbcp "https://www.w3.org/People/mimasa/test/imgformat/img/w3c_home.png" foo.png

Post a file to HTTP endpoint

$ dbcp "./foo.png" "http://my.api/upload" --contentType "image/png"