Code Monkey home page Code Monkey logo

couchwarehouse's Introduction

couchwarehouse

couchwarehouse is a command-line tool that turns your Apache CouchDB database(s) into a local data warehouse. The target database can be either be SQLite, PostgreSQL, MySQL or Elasticsearch.

It works by:

  • discovering the "schema" of your CouchDB database (for the relational databases).
  • creating a new SQLite, PostgreSQL or MySQL table to match the schema, or in the case of Elasticsearch simply moving the JSON over.
  • downloading all the documents (except design documents) and inserting one row per document into the target database.
  • continuously monitoring CouchDB for new documents, updates to existing documents and deletions.

Once downloaded your database can be queried using SQL or the target database's API.

Installation

Node.js version 8 or above and npm are required:

npm install -g couchwarehouse

Usage with SQLite

By default, your CouchDB installation is expected to be on "http://localhost:5984". Override this with the --url/-u parameter and specify the database name with --database/-db:

$ couchwarehouse --url https://U:[email protected] --db mydb
Run the following command to query your data warehouse:

  $ sqlite3 couchwarehouse.sqlite

Then in sqlite3, you can run queries e.g.:

  sqlite3> SELECT * FROM cities LIMIT 10;

Have fun!
p.s Press ctrl-C to stop monitoring for further changes
downloading mydb [======------------------------] 20% 27.7s

After downloading is complete, couchwarehouse will continuously poll the source database for any changes and update the local database accordingly.

Press "Ctrl-C" to exit.

Accessing the SQLite data warehouse

In another terminal, simply run the sqlite3 command-line tool (which may be pre-installed on your computer, otherwise download here).

$ sqlite3 couchwarehouse.sqlite
sqlite3> SELECT name,latitude,longitude,country,population FROM mydb LIMIT 10;
name                    latitude    longitude   country     population
----------------------  ----------  ----------  ----------  ----------
Brejo da Madre de Deus  -8.14583    -36.37111   BR          27369.0   
Pindaré Mirim           -3.60833    -45.34333   BR          22933.0   
Moju                    -1.88389    -48.76889   BR          21510.0   
Matriz de Camaragibe    -9.15167    -35.53333   BR          18705.0   
Fatikchari              22.68768    91.78123    BD          33200.0   
Picos                   -7.07694    -41.46694   BR          57495.0   
Balsas                  -7.5325     -46.03556   BR          68056.0   
Jaguaruana              -4.83389    -37.78111   BR          21790.0   
Pilar                   -9.59722    -35.95667   BR          30617.0   
Patos                   -7.02444    -37.28      BR          92575.0 

SQLite has an extensive query language including aggregations, joins and much more. You may create warehouses from multiple CouchDB databases to create multiple SQLite tables and join them with queries!

N.B if your database name has a - character in it, it will be removed from the subsequent SQL table e.g "month-54" becomes "month54".

Using with PostgreSQL as the target database

The PostgreSQL connection details are gleaned from environment variables. If you're running PostgreSQL locally without password protection, you need only worry about the PGDATABASE environment variable which defines the name of the database the couchwarehouse tables will be created. If left undefined, a database matching your current username will be assumed (e.g. glynnb). I had to create this database first:

$ createdb glynnb

before running couchwarehouse specifyinhg the --databaseType parameter:

$ couchwarehouse --url https://U:[email protected] --db mydb --databaseType postgresql

You may then run psql locally to query your data:

$ psql
glynnb=# select * from mydb limit 5;
    name    | latitude | longitude | country | population |         timezone          |   id    |                rev                 
------------+----------+-----------+---------+------------+---------------------------+---------+------------------------------------
 Fatikchari |  22.6877 |   91.7812 | BD      |      33200 | Asia/Dhaka                | 6414184 | 1-b463b22510476d1f5a9286654eab306b
 Pilar      | -9.59722 |  -35.9567 | BR      |      30617 | America/Maceio            | 3392126 | 1-249183b8148fa14c2b203d101dbe19be
 Jaguaruana | -4.83389 |  -37.7811 | BR      |      21790 | America/Fortaleza         | 3397665 | 1-93783cc6d4a421f65cc6238275640803
 Patos      | -7.02444 |    -37.28 | BR      |      92575 | America/Fortaleza         | 3392887 | 1-629bf77b67fa9173670008dabceb178f
 Pirané     | -25.7324 |  -59.1088 | AR      |      19124 | America/Argentina/Cordoba | 3429949 | 1-19b66e5364fb1292823e4f9a6c53571d
(5 rows)

Using with MySQL as the target database

The MySQL connection string is taken from the MYSQLCONFIG environment variable, or if absent mysql://root:@localhost:3306/couchwarehouse is used. connection details are gleaned from [environment variables]. You will need to create the couchwarehouse database first:

$ mysql -u root
mysql> CREATE DATABASE couchwarehouse;
Query OK, 1 row affected (0.00 sec)

before running couchwarehouse specifyinhg the --databaseType parameter:

$ couchwarehouse --url https://U:[email protected] --db mydb --databaseType mysql

You can then access your datawarehouse from the mysql console:

$ mysql -u root
mysql> select * from mydb limit 5;
+---------------+----------+-----------+---------+------------+---------------------+---------+------------------------------------+
| name          | latitude | longitude | country | population | timezone            | id      | rev                                |
+---------------+----------+-----------+---------+------------+---------------------+---------+------------------------------------+
| Grahamstown   | -33.3042 |   26.5328 | ZA      |      91548 | Africa/Johannesburg | 1000501 | 1-d8d38173981fe25cc8592b14c34aa262 |
| Graaff-Reinet | -32.2522 |   24.5308 | ZA      |      62896 | Africa/Johannesburg | 1000543 | 1-3256046064953e2f0fdb376211fe78ab |
| Abū Ghurayb   |  33.3056 |   44.1848 | IQ      |     900000 | Asia/Baghdad        | 100077  | 1-101bff1251d4bd75beb6d3c232d05a5c |
| Giyani        | -23.3025 |   30.7187 | ZA      |      37024 | Africa/Johannesburg | 1001860 | 1-cb3cd8dd58cef68b9e2cebc66eedcc10 |
| Ga-Rankuwa    | -25.6169 |   27.9947 | ZA      |      68767 | Africa/Johannesburg | 1002851 | 1-685b969148a5534b9cd85689996c52f0 |
+---------------+----------+-----------+---------+------------+---------------------+---------+------------------------------------+
5 rows in set (0.00 sec)

Using with Elasticsearch as the target database

The MySQL connection string is taken from the ESCONFIG environment variable, or if absent http://localhost:9200 is used.

Run couchwarehouse specifyinhg the --databaseType parameter:

$ couchwarehouse --url https://U:[email protected] --db mydb --databaseType elasticsearch

You can then access your datawarehouse using the Elasticsearch API:

$  curl 'http://localhost:9200/couchwarehouse/_search?q=name:"York"' 
{"took":3,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":6,"max_score":7.998925,"hits":[{"_index":"couchwarehouse","_type":"default","_id":"4562407","_score":7.998925,"_source":{"name":"York","latitude":39.9626,"longitude":-76.72774,"country":"US","population":43718,"timezone":"America/New_York"}},{"_index":"couchwarehouse","_type":"default","_id":"2633352","_score":7.998925,"_source":{"name":"York","latitude":53.95763,"longitude":-1.08271,"country":"GB","population":144202,"timezone":"Europe/London"}},{"_index":"couchwarehouse","_type":"default","_id":"6091104","_score":5.9267497,"_source":{"name":"North York","latitude":43.76681,"longitude":-79.4163,"country":"CA","population":636000,"timezone":"America/Toronto"}},{"_index":"couchwarehouse","_type":"default","_id":"7870925","_score":5.9267497,"_source":{"name":"East York","latitude":43.69053,"longitude":-79.32794,"country":"CA","population":115365,"timezone":"America/Toronto"}},{"_index":"couchwarehouse","_type":"default","_id":"5128581","_score":5.283532,"_source":{"name":"New York City","latitude":40.71427,"longitude":-74.00597,"country":"US","population":8175133,"timezone":"America/New_York"}},{"_index":"couchwarehouse","_type":"default","_id":"5106292","_score":4.734778,"_source":{"name":"West New York","latitude":40.78788,"longitude":-74.01431,"country":"US","population":49708,"timezone":"America/New_York"}}]}

Command-line parameter reference

  • --url/-u - the URL of the CouchDB instance e.g. http://localhost:5984
  • --database/--db/-d - the name of the CouchDB database to work with = --databaseType/-dt - the type of database - sqlite, mysql,postgresql or elasticsearch (default: sqlite)
  • --verbose - whether to show progress on the terminal (default: true)
  • --reset/-r - reset the data. Delete existing data and start from scratch (default: false)
  • --transform/-t - transform each document with a supplied JavaScript function (default: null)
  • --split/-s - split a database into multiple tables on this field (default: null)
  • --version - show version number
  • --help - show help

The CouchDB URL can also be specified with the COUCH_URL environment variable e.g.

export COUCH_URL="https://USER:[email protected]"
couchwarehouse --db mydb

Transforming documents

If you need to format the data prior to it being stored in the SQLite database, you may optionally supply a JavaScript transformation function with the --transform/-t parameter.

Create a JavaScript file, in this case called transform.js:

const f = (doc) => {
  // remove the basket array
  delete doc.basket

  // trim whitespace from the category
  doc.category = doc.category.trim()
  
  // combine the title/firstname/surname into one field
  doc.name = [doc.title, doc.firstname, doc.surname].join(' ') 
  delete doc.title
  delete doc.firstname
  delete doc.surname

  // return the transformed document
  return doc
}

// export the function
module.exports = f

Then instruct couchwarehouse to use the function:

couchwarehouse --db mydb --transform './transform.js' 

Splitting one CouchDB database into multiple tables

A common CouchDB design pattern is to use a top-level field in the the JSON document to identify the "type" of the document (e.g. type: "person" or type: "order") and two have multiple document "types" in the same CouchDB database. If that's the case, you'll need the --split/-s option which allows you to specify the field you are using - couchwarehouse will create a new table for each type e.g. mydb_person, mydb_order.

Simply specify the top-level field name used to differentiate your document types with the --split/-s parameter:

# instruct couchwarehouse to split on the 'type' field
couchwarehouse --db mydb --split type

Once the data is imported, you can then query the tables separately or use JOIN syntax to query across tables e.g.

SELECT * FROM mydb_order 
  LEFT JOIN join mydb_user 
  ON mydb_order.customerId = mixed_user.id 
  LIMIT 10

Schema discovery

CouchDB is a JSON document store and as such, the database does not have a fixed schema. The couchwarehouse utility takes a look at a handful of documents and infers a schema from what it sees. This is clearly only of use if your CouchDB documents that have similar documents.

Let's take a typical document that looks like this:

{
  "_id": "afcc37fbe6ff4dd35ecf06be51e45724",
  "_rev": "1-d076609f1a507282af4e4eb52da6f4f1",
  "name": "Bob",
  "dob": "2000-05-02",
  "employed": true,
  "grade": 5.6,
  "address": {
    "street": "19 Front Street, Durham",
    "zip": "88512",
    "map": {
      "latitude": 54.2,
      "longitude": -1.5
    }
  },
  "tags": [
    "dev",
    "front end"
  ]
}

couchwarehouse will infer the following schema:

{
  "id": "string",
  "rev": "string",
  "name": "string",
  "dob": "string",
  "employed": "boolean",
  "grade": "number",
  "address_street": "string",
  "address_zip": "string",
  "address_map_latitude": "number",
  "address_map_longitude": "number",
  "tags": "string"
}

Notice how:

  • the sub-objects are "flattened" e.g. address.map.latitude --> address_map_latitude
  • arrays are turned into strings
  • _id/_rev become id/rev

The keys of the schema become the column names of the SQLite table.

Removing unwanted SQLite tables

Unwanted tables can be easily removed using the sqlite3 prompt:

sqlite> DROP TABLE mydb;

The whole SQLite database can be removed by deleting the couchwarehouse.sqlite file from your file system.

What's the catch?

  • you need enough memory and hard disk space to store the entire database on your machine
  • conflicted document bodies are ignored
  • objects are flattened
  • arrays are stored as their JSON representation
  • your data needs to be relativelyconsistent. The SQL schema is created from the first document of that type that couchwarehouse sees. If you have documents of the same type whose schema varies slightly across the database, then this may not work. You can, however, use a "transform" function to fill in missing fields and tidy up the data a bit. As of version 1.3, there won't be errors from changes in schema, but couchwarehouse doesn't magically migrate your schema as it changes over time.

Using programmatically

This library can be used programmatically too:

const couchwarehouse = require('couchwarehouse')

// configuration
const opts = {
  url: 'https://USER:[email protected]',
  database: 'mydb'
}

const main = async () => {
  // start downloading data - wait until changes feed is complete
  await couchwarehouse.start(opts)

  // query the database
  couchwarehouse.query('SELECT * FROM mydb').then((data) => {
    console.log(data)
  })
}
main()

The opts object passed to couchwarehouse.start can contain:

  • url - the URL of the CouchDB instance e.g. http://localhost:5984
  • database - the name of the CouchDB database to work with
  • verbose - whether to show progress on the terminal (default: true)
  • reset - reset the data. Delete existing data and start from scratch (default: false)
  • split - the attribute to use to split documents into separate tables (default: splitting disabled)

Debugging

Starting couchwarehouse with the DEBUG environment variable set will produce extra output e.g.

DEBUG=* couchwarehouse --db mydb

couchwarehouse's People

Contributors

dependabot[bot] avatar glynnbird avatar thyarles avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

couchwarehouse's Issues

Allow data to be transformed

The couchwarehouse user should be allowed to optionally provide the path of a JavaScript file which exports a single function that will be called during the data download process.

e.g.

couchwarehouse --db mydb --transform ./transform.js

where transform.js contains something like:

const f = (doc) => {

  delete doc.unwantedField
  if (!doc.missingField) {
     doc.missingField = 0
  }
  if (typeof doc.wrongTypeField === 'string') {
      doc.wrongTypeField = parseInt(doc.wrongTypeField)
  }
  return doc
}

module.exports = f
  • the transform function will be called after fetching a document from Cloudant
  • the transform operation will occur before schema discovery and before the data is written to the database

Allow evolving schemas

The schema that is created when couchwarehouse is first run is fixed, but an INSERT/REPLACE statement breaks if

  • a field in the schema is missing in some docs
  • a field is present in later docs which isn't in the schema

To make this more reliable, couchwarehouse should use the schema as a guide and use the fields in the first doc it sees as a pattern for the rest. Nulls should be placed in missing fields and extra fields should be ignored.

Doesn't work with databases containing hyphen characters

IF I have a database called "orders-master", the couchwarehouse will not work. (the same database without the hyphen works).

This is definitely the case. It comes down to SQLite objecting to

CREATE TABLE IF NOT EXISTS orders-master....

hint: the table name needs to be escaped.

Allow transformations using jq

jq is a very capable and fast json transformation tool. It wold be nice if couchwarehouse could apply transformations using jq in addition to using js..

Couchwarehouse doesn't detect schema changes

Couchwarehouse detects the schema when it sees the first document in a database (or the first document of that type in split mode). It doesn't migrate the schema of the SQL table if future documents contain rows that don't exist.

Database with column name starting with number

  • Throws an error.
  • With SQLite
p.s Press ctrl-C to stop monitoring for further changes
events.js:291
      throw er; // Unhandled 'error' event
      ^

[Error: SQLITE_ERROR: unrecognized token: "1stdose"
Emitted 'error' event on Statement instance at:
] {
  errno: 1,
  code: 'SQLITE_ERROR'
}

Shall use this npm for my app?

I have used relational npm for create a doc in CouchDB. CouchDB have following data formate.

{
  "_id": "pfm2653_2_0224E977-A319-6E19-A065-4EE1E3AABDE7",
  "_rev": "1-293de9bf186747f3b696d106650de688",
  "data": {
    "student_no": "1",
    "student_name": "Mohammed Safeer Zaheer",
    "type": "pfm2653"
  }
}

Shall I use this npm for moving this formate data into postgres?

Installation using npm fails

Hello,

FreeBSD-11.3.

sudo npm install -g couchwarehouse fails:

Error: EACCES: permission denied, mkdir '/usr/local/lib/node_modules/couchwarehouse/node_modules/sqlite3/.node-gyp'

Got the same error building as root. When I try as my user (who has write permissions for the directory in which installation is being done):

npm WARN checkPermissions Missing write access to /usr/home/rmason/Software/Test/Couchdb3

and the install fails b/c of the permissions problem.

Cheers.

Does not work with postgresql.

postgresql is version 12.

DEBUG=* couchwarehouse -r --url http://donaldduck:[email protected]:5984 --db tripe --databaseType postgresql
couchwarehouse Getting last change from CouchDB +0ms
follow-redirects options {
protocol: 'http:',
maxRedirects: 21,
maxBodyLength: 10485760,
path: '/tripe/_changes?since=now&limit=1',
method: 'GET',
headers: {
Accept: 'application/json, text/plain, /',
'User-Agent': 'axios/0.19.2'
},
agent: undefined,
agents: { http: undefined, https: undefined },
auth: 'rmason:B103SCaCC',
hostname: '127.0.0.1',
port: '5984',
nativeProtocols: {
'http:': {
_connectionListener: [Function: connectionListener],
METHODS: [Array],
STATUS_CODES: [Object],
Agent: [Function],
ClientRequest: [Function: ClientRequest],
IncomingMessage: [Function: IncomingMessage],
OutgoingMessage: [Function: OutgoingMessage],
Server: [Function: Server],
ServerResponse: [Function: ServerResponse],
createServer: [Function: createServer],
validateHeaderName: [Function: hidden],
validateHeaderValue: [Function: hidden],
get: [Function: get],
request: [Function: request],
maxHeaderSize: [Getter],
globalAgent: [Getter/Setter]
},
'https:': {
Agent: [Function: Agent],
globalAgent: [Agent],
Server: [Function: Server],
createServer: [Function: createServer],
get: [Function: get],
request: [Function: request]
}
}
} +0ms
couchwarehouse Initalise database +33ms

Nothing gets entered into the postgresql database. The postgresql log does not even register a connection. I have tried instrumenting couchwarehouse using console.log() to no avail (no output). I know little of JavaScript, so I'm stuck.

Split mode

Allow a CouchDB database that contains multiple "types" to be split into multiple tables e.g. if a single database contains documents of type person/order/product, they should each have their own schema and be placed into their own tables.

Wider DB support

Also supporting other databases than sqlite3 would widen the appeal of this.

Couchwarehouse unable to migrate database from database with hyphen

Couchwarehouse is not able to migrate database from couchdb to postgresql when the database name in couchdb contains hyphen. It'd be great if you cold fix it and be able to migrate database with hyphen. Otherwise, you can handle this condition.

image

Platform: Windows 10 64-bit
Postgresql Version: 11.1
Couchdb : 2.3.1
Couchwarehouse: 1.4.0

Database with special characters in name

We run Hyperledger Fabric which uses CouchDB to store data. It has $ and _ characters in DB name. I'm unable to make couchwarehouse work with these DBs.
I have tried both urlencoding the characters as well as just passing the dbname as is with special characters.

$ couchwarehouse --version
1.6.4

$ couchwarehouse -u http://user:pass@localhost:5984 --db 'my_db$$name'
Run the following command to query your data warehouse:

  $ sqlite3 couchwarehouse.sqlite

Then in sqlite3, you can run queries e.g.:

  sqlite3> SELECT * FROM my_db$$name LIMIT 10;

Have fun!
p.s Press ctrl-C to stop monitoring for further changes
events.js:174
      throw er; // Unhandled 'error' event
      ^

Error: SQLITE_ERROR: near "~": syntax error
Emitted 'error' event at:



$ couchwarehouse -u http://user:pass@localhost:5984 --db my_db%24%24name
Run the following command to query your data warehouse:

  $ sqlite3 couchwarehouse.sqlite

Then in sqlite3, you can run queries e.g.:

  sqlite3> SELECT * FROM my_db%24%24name LIMIT 10;

Have fun!
p.s Press ctrl-C to stop monitoring for further changes
(node:168047) UnhandledPromiseRejectionWarning: Error: Request failed with status code 404
    at createError (/home/ahmed/node_modules/axios/lib/core/createError.js:16:15)
    at settle (/home/ahmed/node_modules/axios/lib/core/settle.js:17:12)
    at RedirectableRequest.handleResponse (/home/ahmed/node_modules/axios/lib/adapters/http.js:231:9)
    at RedirectableRequest.emit (events.js:198:13)
    at RedirectableRequest._processResponse (/home/ahmed/node_modules/follow-redirects/index.js:399:10)
    at ClientRequest.RedirectableRequest._onNativeResponse (/home/ahmed/node_modules/follow-redirects/index.js:57:10)
    at Object.onceWrapper (events.js:286:20)
    at ClientRequest.emit (events.js:198:13)
    at HTTPParser.parserOnIncomingClient [as onIncoming] (_http_client.js:565:21)
    at HTTPParser.parserOnHeadersComplete (_http_common.js:111:17)
(node:168047) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2)
(node:168047) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

Curl works with urlencoded DB name. Example curl command,

curl -s --header "Content-Type: application/json" -d "$query_str" http://user:pass@localhost:5984/my_db%24%24name/_find

Any thoughts on making this work?

couchwarehouse can not handle non-ASCII characters

User comment: "Characters in languages other than English throw an error and will not be written to SQLite3"

Cannot reproduce. Here's an example:

{
  "_id": "5289a87e82d54d0f2d8f51c87f59cade",
  "_rev": "1-9076c3fd5d6533dc1503341d929b5867",
  "emojii": "🤔",
  "name": "al-Jumhūrīyah al-ʻArabīyah as-Sūrīyah",
  "arabic": "الجمهورية العربية السورية"
}

and in couchwarehouse:

sqlite> select * from nonascii;
emojii      name                                   arabic                     id                                rev                               
----------  -------------------------------------  -------------------------  --------------------------------  ----------------------------------
🤔           al-Jumhūrīyah al-ʻArabīyah as-Sūrīyah  الجمهورية العربية السورية  5289a87e82d54d0f2d8f51c87f59cade  1-9076c3fd5d6533dc1503341d929b5867

Allow couchwarehouse to filter the changes feed

By default, couchwarehouse consumes the whole database changes feed firehose. An optional enhancement to this would be to supply a Cloudant Query selector to allow a subset of the data to be moved to the target SQL database e.g.

couchwarehouse --db master --query '{"status":"complete"}'

where --query/-q is a valid Cloudant Query selector which is used server-side to filter the changes feed.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.