mallocator / elasticsearch-exporter
A small script to export data from one Elasticsearch cluster into another.
License: Apache License 2.0
Add an option to use the create call instead of the index call, so that documents that already exist in the index don't get overwritten if they have already been modified.
The option is exported, but I don't think it's enabled/used in the code (in particular in the es driver).
Line 471 of es.js:
if (result.statusCode < 200 && result.statusCode > 299)
The status code can never be both less than 200 and greater than 299, so this condition can never be true.
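A minimal sketch of the presumably intended check (the surrounding es.js code is not shown here, so the function wrapper is hypothetical):

```javascript
// A status code cannot be both below 200 and above 299, so the original
// condition is always false. The intent was presumably to flag any
// status *outside* the 2xx success range, which needs || instead of &&:
function isErrorStatus(statusCode) {
    return statusCode < 200 || statusCode > 299;
}

isErrorStatus(200); // false: success
isErrorStatus(404); // true: outside the 2xx range
```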
Hello,
Firstly, thanks for your module: it's very simple and useful!
I get an error when I try to duplicate an index; the mapping is not correct.
node exporter.js -a -i -j <new_index>
In the console log I get this message:
<...>
Waiting for mapping on target host to be ready, queue length 400
Waiting for mapping on target host to be ready, queue length 450
Waiting for mapping on target host to be ready, queue length 500
Host phmbusllogb01:9200 responded to PUT request on endpoint /lanceur_bkp with an error
Mapping is now ready. Starting with 500 queued hits.
Host phmbusllogb01:9200 responded to PUT request on endpoint /lanceur_bkp with an error
Mapping is now ready. Starting with 0 queued hits.
Processed 100 of 2268 entries (4%)
Processed 700 of 2268 entries (31%)
<...>
When I go to http://:9200/<new_index>/_mapping I don't get the original mapping but the dynamic mapping.
In the ES log:
<...>creating index, cause [api], shards [5]/[0], mappings [mappings]
<...>update_mapping XXXXX
Edit :
In DEBUG mode on the ES side I get an exception:
[2014-06-25 14:15:41,229][DEBUG][cluster.service ] [XXXXX] processing [routing-table-updater]: execute
[2014-06-25 14:15:41,230][DEBUG][cluster.service ] [XXXXX] processing [routing-table-updater]: no change in cluster_state
[2014-06-25 14:15:44,458][DEBUG][cluster.service ] [XXXXX] processing [create-index [lanceur_bkp], cause [api]]: execute
[2014-06-25 14:15:45,399][DEBUG][http.netty ] [XXXXX] Caught exception while handling client http traffic, closing connection [id: 0x1ddde1db, /XXXXX:56038 => /XXXXX:9200]
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
[2014-06-25 14:15:45,399][DEBUG][http.netty ] [XXXXX] Caught exception while handling client http traffic, closing connection [id: 0x85b4aa05, /XXXXX:56134 => /XXXXX:9200]
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
On the master branch I get this error:
Elasticsearch Exporter - Version 1.3.2
Caught exception in Main process: TypeError: Cannot read property 'maxSockets' of undefined
TypeError: Cannot read property 'maxSockets' of undefined
at Object.exports.reset (/home1/Elasticsearch-Exporter-1.3.3/drivers/es.js:41:39)
at Object.exports.export (/home1/Elasticsearch-Exporter-1.3.3/exporter.js:263:26)
at Object.<anonymous> (/home1/Elasticsearch-Exporter-1.3.3/exporter.js:283:19)
at Module._compile (module.js:456:26)
at Object.Module._extensions..js (module.js:474:10)
at Module.load (module.js:356:32)
at Function.Module._load (module.js:312:12)
at Function.Module.runMain (module.js:497:10)
at startup (node.js:119:16)
at node.js:906:3
To work around this issue, I force the maxSockets value in es.js:
vi drivers/es.js (line 41)
...
//http.globalAgent.maxSockets = opts.maxSockets;
http.globalAgent.maxSockets = 30;
...
While running exporter.js to export individual indices, I've found that sometimes, it'll exit with the following error:
Caught exception in Main process: TypeError: Cannot read property 'hits' of undefined
TypeError: Cannot read property 'hits' of undefined
at IncomingMessage.<anonymous> (/data1/opt/elasticsearch-exporter/node_modules/elasticsearch-exporter/drivers/es.js:272:31)
at IncomingMessage.EventEmitter.emit (events.js:117:20)
at _stream_readable.js:920:16
at process._tickCallback (node.js:415:13)
If I re-run it on the index it failed to dump, it usually works, though there's a chance it'll throw the same error.
Caught exception in Main process: Error: ENOENT, no such file or directory 'undefined.data'
Here's a fix:
@@ -145,6 +145,10 @@
     exports.lineCount = Math.ceil(count/2);
     callback(exports.lineCount);
 });
+exports.lineCount = 0;
+callback(exports.lineCount);
 function getNewlineMatches(buffer) {
Add a call to fetch index statistics before actually running the exporter. This way we can get the number of total documents, documents per index/type/etc. This might simplify some calls as well as make it possible in the future to act more intelligently based on the layout of the source database (and the target as well).
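As a sketch, the per-index totals could come from a single _stats call up front (the response shape below follows the ES stats API, though field names vary between versions; the helper name is made up):

```javascript
// Summarize the parsed body of GET /_stats into per-index document counts.
function docCountsPerIndex(statsResponse) {
    var counts = {};
    for (var index in statsResponse.indices) {
        counts[index] = statsResponse.indices[index].primaries.docs.count;
    }
    return counts;
}

// Example with a trimmed-down response body:
var sample = {
    indices: {
        logs_a: { primaries: { docs: { count: 1200 } } },
        logs_b: { primaries: { docs: { count: 340 } } }
    }
};
// docCountsPerIndex(sample) -> { logs_a: 1200, logs_b: 340 }
```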
Change the option flags to represent the new support for multiple database types.
Option to specify basic authentication against ES
I was referring to the README for taking a backup of an Elasticsearch index as a file. It seems the documentation has a typo: it refers to exporter.js as exports.js.
Regards,
Arun
When the request is Unauthorized, you get a message like:
SyntaxError: Unexpected token U
This is because it tries to parse the data object, which is not valid JSON but a plain-text response like:
401 Unauthorized /_status
You could print this unauthorized message so the user knows why it fails.
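A sketch of a guard that surfaces the real reason instead of the parse error (the function and message wording are hypothetical):

```javascript
// Raise a readable error for auth failures and other non-JSON bodies
// instead of letting JSON.parse throw "Unexpected token U".
function parseResponse(statusCode, body) {
    if (statusCode === 401) {
        throw new Error('Request was not authorized: ' + body.trim());
    }
    try {
        return JSON.parse(body);
    } catch (e) {
        throw new Error('Non-JSON response (status ' + statusCode + '): ' + body.trim());
    }
}
```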
I have a single node (node1) and one cluster (clus1). Both of them are protected by basic auth. I tried running this:
node exporter.js -a node1 -b clust1
I got the following output, which did not imply any failed transfer, and an exit code of 0.
But the indices actually didn't get transferred.
I had to run the following to get it working:
node exporter.js -a node1 -b clust1 -A superadmin:password -B superadmin:password
It seems like auth: opts.sourceAuth or auth: opts.targetAuth is missing each time http.request is performed, so it's not possible to use the exporter with basic auth without fixing it manually.
http.request({
    host: opts.sourceHost,
    port: opts.sourcePort,
    auth: opts.sourceAuth,
When using
$ node exporter.js -i nodes_1 -j nodes_1_export_test
I get a
Processed 144400 of 671344 entries (22%)
{ [Error: connect EADDRNOTAVAIL]
code: 'EADDRNOTAVAIL',
errno: 'EADDRNOTAVAIL',
syscall: 'connect' }
Number of calls: 14450
Fetched Entries: 144490
Processed Entries: 144490
Source DB Size: 671344
or
Caught exception in Main process: Error: connect EADDRNOTAVAIL
Error: connect EADDRNOTAVAIL
at errnoException (net.js:646:11)
at connect (net.js:525:18)
at net.js:584:9
at asyncCallback (dns.js:84:16)
at Object.onanswer [as oncomplete]
always around the same time (21%-22%). Elasticsearch is still alive. The index to be copied has a size of 1GB.
$ node -v
v0.6.12
$ npm list
├── [email protected]
└─┬ [email protected]
├── [email protected]
└── [email protected]
Transferring an index from one server to another by query failed after 300k docs were received.
Thanks for the great tool.
When the script is running to copy test_v1 to test_v2, test_v1 is constantly updated with new entries.
Is there a way to copy only the new entries from test_v1 to test_v2 after the bulk copy is done?
Thanks for your help!
npm install elasticsearch-exporter
...
node node_modules/elasticsearch-exporter/exporter.js -a 10.223.240.225:9200 -g data -r true
Reading mapping from ElasticSearch
{ [Error: getaddrinfo ENOTFOUND] code: 'ENOTFOUND', errno: 'ENOTFOUND', syscall: 'getaddrinfo' }
{ [Error: getaddrinfo ENOTFOUND] code: 'ENOTFOUND', errno: 'ENOTFOUND', syscall: 'getaddrinfo' }
{ [Error: getaddrinfo ENOTFOUND] code: 'ENOTFOUND', errno: 'ENOTFOUND', syscall: 'getaddrinfo' }
{ [Error: getaddrinfo ENOTFOUND] code: 'ENOTFOUND', errno: 'ENOTFOUND', syscall: 'getaddrinfo' }
[deployer@el3 migrate]$ curl http://10.223.240.225:9200/
{
"ok" : true,
"status" : 200,
"name" : "CTM EL3",
"version" : {
"number" : "0.90.7",
"build_hash" : "36897d07dadcb70886db7f149e645ed3d44eb5f2",
"build_timestamp" : "2013-11-13T12:06:54Z",
"build_snapshot" : false,
"lucene_version" : "4.5.1"
},
"tagline" : "You Know, for Search"
I exported a file with Node.js v0.10.9 using the script from the master source.
The export works, but the file cannot be imported because the starting text is null.
I saw that null is written to test whether the file exists, so can you change it to an empty string?
We experience memory issues when we try to export lots of data. Around 4 Million hits.
The problem seems to be related to these lines:
https://github.com/mallocator/Elasticsearch-Exporter/blob/master/drivers/es.js#L173-182
We tried to throttle the requests to 10 at a time, but no luck. Any ideas?
Hi guys,
This looks like a great project for our needs as we increasingly rely on elasticsearch and manually updating locally is a bit of a pain. I'd love to get involved in supporting this (once I tune up my familiarity with node).
It took me a little while to figure out how best to pass parameters, but now I think it's working. Can someone help me confirm? Does this look right? Do you just get timeouts from time-to-time or is this abnormal? Should I be seeing some other logging in the console if it's working successfully?
node exporter.js -a site:####obscured#####@api.searchbox.io -b localhost
Elasticsearch Exporter - Version 1.3.0
Reading source statistics from ElasticSearch
Reading mapping from ElasticSearch
connect ETIMEDOUT
connect ETIMEDOUT
connect ETIMEDOUT
connect ETIMEDOUT
connect ETIMEDOUT
connect ETIMEDOUT
connect ETIMEDOUT
connect ETIMEDOUT
connect ETIMEDOUT
connect ETIMEDOUT
connect ETIMEDOUT
connect ETIMEDOUT
connect ETIMEDOUT
connect ETIMEDOUT
connect ETIMEDOUT
connect ETIMEDOUT
connect ETIMEDOUT
connect ETIMEDOUT
connect ETIMEDOUT
connect ETIMEDOUT
connect ETIMEDOUT
There are a few other URL variations I can hit searchbox with (for example, below), but it doesn't seem to change the results.
api.searchbox.io/api-key/####obscured#####
FWIW, our elasticsearch index is roughly 500K documents, 350MB.
Thanks for the help!
I was trying to move a logstash index from an Elasticsearch instance that is getting data from logstash to a bare ES. The data moves cleanly, but the mapping created on the target is not correct: it does not contain the .raw fields that exist in the original ES.
Without those, a lot of Kibana dashboards are just failing.
Right now, this tool is very slow.
Are there any plans to allow forking/threading so that a large cluster export can be split into separate simultaneous export tasks that glue the data back together at the end?
Making this work automatically with some sensible defaults (CPU cores, whatever) would be great too.
Add an option where you can apply changes to a target database based on timestamps (which need to be active in the index mapping).
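A sketch of what the delta request could look like (this assumes _timestamp is enabled in the mapping, as the request says; the helper name is made up):

```javascript
// Build a query that only matches docs indexed/updated after a given
// point in time, for applying incremental changes to the target database.
function deltaQuery(sinceMillis) {
    return {
        query: {
            range: {
                _timestamp: { gt: sinceMillis }
            }
        }
    };
}
```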
Hi,
When trying to import, Elasticsearch-Exporter always seems to stall out around 30-40% or so. I'm also seeing some MaxListeners errors; is this related?
Weirdly, even if I truncate the JSON .data file it stalls out around the same percentage, which makes me think something weird is going on with queueing. After trying a few different indexes/mappings it still hangs.
Output:
https://gist.github.com/oceanplexian/5961335
Tried node v0.11.3 & v0.10.5, same results...
Any ideas?
Hi,
I am trying to export/import from ES 0.9 to ES 1.0 (different live clusters).
When I run it in simulate mode on the donor cluster, I get:
Elasticsearch Exporter - Version 1.3.0
Reading source statistics from ElasticSearch
Reading mapping from ElasticSearch
Stopping further execution, since this is only a test run. No operations have been executed on the target database.
Number of calls: 0
Fetched Entries: 0 documents
Processed Entries: 0 documents
Source DB Size: 623066 documents
After running the actual export/import I get:
Number of calls: 15582
Fetched Entries: 623066 documents
Processed Entries: 623066 documents
Source DB Size: 623066 documents
But the receiving cluster (ES 1.0) only reports:
Elasticsearch Exporter - Version 1.3.0
Reading source statistics from ElasticSearch
Reading mapping from ElasticSearch
Reading mapping from ElasticSearch
Stopping further execution, since this is only a test run. No operations have been executed on the target database.
Number of calls: 0
Fetched Entries: 0 documents
Processed Entries: 0 documents
Source DB Size: 170515 documents
I don't get any errors in the logs, but the number of exported/imported articles doesn't match.
Is there any way to find out why this happens?
I'm trying to do a simple export and I get this error... any clue?
node exporter.js -a localhost -i index1 -g /es/dump -l true
Elasticsearch Exporter - Version 1.3.1
Reading source statistics from ElasticSearch
Caught exception in Main process: Error: ENOENT, open 'undefined.data'
Error: ENOENT, open 'undefined.data'
Number of calls: 0
Fetched Entries: 0 documents
Processed Entries: 0 documents
Source DB Size: 0 documents
ElasticSearch returns more docs than there are unique docs stored in the cluster if it's set up with more than one node. This leads to the exported-docs counter being higher than the actual number of exported docs (some docs seem to be returned more than once).
To enhance the report keep track of doc IDs and report a count at the end.
This can also be used to skip docs in the bulk import and improve performance there.
This option should be optional, as it could take up a large amount of memory in a big export. Hopefully duplicate docs are exported around the same time as the original ones, so that after a short while the IDs don't need to be stored anymore.
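The optional ID tracking could look roughly like this (a sketch; the hit shape follows the ES search response, the names are made up):

```javascript
// Track seen document IDs so duplicates from multi-node scans can be
// counted for the report and optionally skipped during the bulk import.
function makeDuplicateTracker() {
    var seen = {};
    var duplicates = 0;
    return {
        // Returns true the first time an ID is seen, false on repeats.
        firstSighting: function (hit) {
            if (seen[hit._id]) {
                duplicates++;
                return false;
            }
            seen[hit._id] = true;
            return true;
        },
        duplicateCount: function () {
            return duplicates;
        }
    };
}
```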
When I export data from my development server, the settings (analyzers, specifically) and mappings on my index are not exported.
How can I transfer data asynchronously or multithreaded?
I saw the note about max sockets, but I can't find how to speed up the transfer in the config or driver.
Would be nice to have an additional config file, holding some source field manipulation.
For example:
Source:
_source :{ field_a: "value a", field_b: "value b", field_to_delete: "this field I want to be removed", field_to_be_replaced: "this field will have another value" }
Config:
_config: { field_to_delete: delete _source["field_to_delete"], field_to_be_replaced: "this field has a new value", field_to_be_inserted: "this field is completely new", field_with_filter: _source["field_a"].length, field_with_filter_2: _source["field_b"].replace("b", "c") }
Result:
_source: { field_a: "value a", field_b: "value b", field_to_be_replaced: "this field has a new value", field_to_be_inserted: "this field is completely new", field_with_filter: 7, field_with_filter_2: "value c" }
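One way such a transform could be implemented is as a per-field rule map applied to each _source before re-indexing (a sketch; the rule format here is an interpretation of the request, not an existing feature):

```javascript
// Apply per-field transforms to a _source object: null drops a field,
// a function computes a value from the source, anything else replaces it.
function transformSource(source, transforms) {
    var result = {};
    for (var key in source) {
        result[key] = source[key];
    }
    for (var field in transforms) {
        var rule = transforms[field];
        if (rule === null) {
            delete result[field];          // drop the field entirely
        } else if (typeof rule === 'function') {
            result[field] = rule(result);  // computed from other fields
        } else {
            result[field] = rule;          // static replacement / insertion
        }
    }
    return result;
}

var out = transformSource(
    { field_a: 'value a', field_b: 'value b', field_to_delete: 'x' },
    {
        field_to_delete: null,
        field_to_be_inserted: 'this field is completely new',
        field_with_filter: function (src) { return src.field_a.length; }
    }
);
// out.field_with_filter is 7 ('value a'.length); out.field_to_delete is gone
```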
Since the script supports retries, we can add a stat that tells us how many times the script had to retry a call and how often a call succeeded on the first try.
possible fix:
@@ -69,7 +69,7 @@ function createParentDir(opts) {
     var dir = '';
     path.dirname(opts.targetFile).split(path.sep).forEach(function(dirPart){
         dir += dirPart + path.sep;
-        if (!fs.existsSync(dir)) {
+        if (typeof(fs.existsSync) != "undefined" && !fs.existsSync(dir)) {
             fs.mkdirSync(dir);
         }
I work with Elasticsearch 1.0.0-1.
I used Elasticsearch-Exporter to export all indices to a local file. The meta of the local file looks like this:
{
    "test": {
        "mappings": {
            "mappings": {
                "key_sum": {
                    "properties": {
below is my patch:
--- node_modules/elasticsearch-exporter/drivers/es.js 2014-03-15 18:25:24.114289938 +0800
+++ node_modules/elasticsearch-exporter/drivers/es.js 2014-03-15 18:25:56.246290802 +0800
@@ -34,13 +34,11 @@
if (opts.sourceType) {
getSettings(opts, data, callback);
} else if (opts.sourceIndex) {
- getSettings(opts, { mappings: data[opts.sourceIndex] }, callback);
+ getSettings(opts, data[opts.sourceIndex] , callback);
} else {
var metadata = {};
for (var index in data) {
- metadata[index] = {
- mappings: data[index]
- };
+ metadata[index] = data[index];
}
getSettings(opts, metadata, callback);
I exported to a file successfully, but importing it to a different ES server is now failing with the following error:
$ node exporter.js -b localhost -j myindex -f reuters
Reading mapping from meta file reuters.meta
Creating index mapping in target ElasticSearch instance
Mapping is now ready. Starting with 0 queued hits.
Caught exception in Main process: TypeError: Cannot read property 'length' of null
TypeError: Cannot read property 'length' of null
at ReadStream.<anonymous> (/home/eric/Elasticsearch-Exporter-master/drivers/file.js:75:50)
at ReadStream.EventEmitter.emit (events.js:100:17)
at emitReadable_ (_stream_readable.js:418:10)
at emitReadable (_stream_readable.js:412:7)
at onEofChunk (_stream_readable.js:395:3)
at readableAddChunk (_stream_readable.js:139:7)
at ReadStream.Readable.push (_stream_readable.js:123:10)
at onread (fs.js:1532:12)
at Object.wrapper [as oncomplete]
Number of calls: 0
Fetched Entries: 0 documents
Processed Entries: 0 documents
Source DB Size: 0 documents
Peak Memory Used: 0 bytes (0%)
Total Memory: 26057216 bytes
My run fails:
node exporter.js -a XXXXXXX -b 127.0.0.1 -t logstash-adm-log4j-2013.08.06
Reading mapping from ElasticSearch
Creating type mapping in target ElasticSearch instance
Caught exception in Main process: TypeError: Cannot read property 'total' of undefined
TypeError: Cannot read property 'total' of undefined
at IncomingMessage.<anonymous> (/home/rtoma/elasticsearchExporter/drivers/es.js:149:64)
at IncomingMessage.EventEmitter.emit (events.js:117:20)
at _stream_readable.js:910:16
at process._tickCallback (node.js:415:13)
Number of calls: 0
Fetched Entries: 0 documents
Processed Entries: 0 documents
Source DB Size: 0 documents
Peak Memory Used: 0 bytes
Total Memory: 7195904 bytes
Sniffing the traffic, this failure is the result of the 2nd ES call:
$ curl -i -s -d "{\"fields\":[\"_source\",\"_timestamp\",\"_version\",\"_routing\",\"_percolate\",\"_parent\",\"_ttl\"],\"query\":{\"match_all\":{}}}" 'http://XXXXX:9200/_search?search_type=scan&scroll=5m'
HTTP/1.1 503 Service Unavailable
Content-Type: application/json; charset=UTF-8
Content-Length: 159
{"error":"EsRejectedExecutionException[rejected execution of [org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$2]]","status":503}
Any clue what's wrong here?
Under the import/export examples area, the file options are wrong:
I think you switched -f with -g.
Apparently ES doesn't support scan requests when using aliases for the index name.
A way to support this would be to make an initial request to ES to find out if the job is actually running against an alias.
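For the lookup itself, the body of GET /_alias/&lt;name&gt; maps each concrete index to its alias metadata, so resolving an alias to real index names could be sketched as (names are illustrative; the response shape may vary by ES version):

```javascript
// Given the parsed JSON body of GET /_alias/<name>, return the concrete
// index names behind the alias so the scan can target them directly.
function indicesForAlias(aliasResponse) {
    return Object.keys(aliasResponse).sort();
}

// Example body for an alias "logs" spanning two daily indices:
var sample = {
    'logstash-2014.06.24': { aliases: { logs: {} } },
    'logstash-2014.06.25': { aliases: { logs: {} } }
};
```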
λ parabellum Elasticsearch-Exporter → λ git master → node exporter.js -a 10.251.76.43 -b 10.251.76.42
Warning: compression has been set for target file, but no target file is being used!
Number of calls: 0
Fetched Entries: 0 documents
Processed Entries: 0 documents
Source DB Size: 0 documents