
Comments (15)

iansrobinson commented on September 4, 2024

Hi @alxcnt

You need to run the tool with credentials that can access the Neptune Management API for your Neptune cluster. If you're running it in a different account to your Neptune cluster, the tool won't be able to access any metadata about the cluster. At the moment there isn't a way to supply credentials from another account, though we can consider adding that as a feature.

Access to the Management API for inferring the clusterId has been necessary since the end of Nov 2022, when the clusterId inferencing was improved. The clone cluster feature, which is necessary for ensuring a static view of the data for databases undergoing a write workload, has needed access to the Management API for a couple of years.

The Management API client uses the default credentials provider chain. These are the locations in which it will look for the relevant credentials:

public DefaultAWSCredentialsProviderChain() {
        super(new EnvironmentVariableCredentialsProvider(),   // AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY (and AWS_SESSION_TOKEN) environment variables
              new SystemPropertiesCredentialsProvider(),      // aws.accessKeyId / aws.secretKey Java system properties
              WebIdentityTokenCredentialsProvider.create(),   // web identity token file (e.g. IAM roles for service accounts)
              new ProfileCredentialsProvider(),               // profiles in ~/.aws/credentials
              new EC2ContainerCredentialsProviderWrapper());  // ECS container credentials or EC2 instance profile metadata
    }
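
If you want to check whether the credentials the chain resolves can actually reach the Management API, one quick way is to call DescribeDBClusters yourself. A minimal sketch using Python and boto3 (an illustration only, not part of the tool; the region is an example):

# Minimal sketch: verify that the resolved credentials can reach the
# Neptune Management API, which neptune-export uses to infer the
# cluster ID. Assumes boto3 is installed; the region is an example.
import boto3

# boto3 resolves credentials from a chain analogous to the Java SDK's:
# env vars, then profiles, then container/instance metadata.
neptune = boto3.client("neptune", region_name="us-east-1")

for cluster in neptune.describe_db_clusters()["DBClusters"]:
    print(cluster["DBClusterIdentifier"], cluster["Endpoint"])

If this call fails with an authorisation error, neptune-export's cluster ID inference will fail for the same reason.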


iansrobinson commented on September 4, 2024

@alxcnt Thanks. I've opened an enhancement issue for this: aws/neptune-export#12

Please comment there if you've anything else to add. Thanks.


iansrobinson commented on September 4, 2024

Hi @alxcnt

Thanks for reporting this. I've been able to reproduce, and have a fix.

However, as of 28th February, neptune-export has moved to a new repository, https://github.com/aws/neptune-export, to better facilitate its ongoing development independently of the other projects in amazon-neptune-tools. We'll be posting a deprecation notice in this project shortly.

The fix, therefore, is currently a PR in the new repository:

aws/neptune-export#11

Regarding the file-system writes: it's not possible to turn off file-system writes completely. Results coming from Neptune to the Java Gremlin client, which the export tool uses, must be handled immediately: any delay in handling can cause the client-side Netty buffer to fill, and the Java Gremlin client to crash. Writing to Kinesis exerts some back-pressure, which can delay the handling of results and crash the client. Hence the tool writes immediately to disk, and from there to Kinesis (using the KinesisProducer library) at a rate Kinesis can sustain.
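
As an illustration of that write path (a sketch only, not the tool's actual Java implementation), the pattern is: append each result to a local spool file the moment it arrives, and drain the spool to Kinesis separately at whatever rate the stream accepts. In Python terms, with placeholder names:

# Illustrative sketch of the spool-to-disk pattern described above.
# This is NOT neptune-export's code; the tool is Java and uses the
# KinesisProducer library. Stream and file names are placeholders.
import json
import boto3

SPOOL_PATH = "export.spool.jsonl"
STREAM_NAME = "your-export-stream-name"

def handle_result(result, spool_file):
    # Handle each result immediately by writing it to disk, so the
    # Gremlin client's receive buffer never blocks on back-pressure.
    spool_file.write(json.dumps(result) + "\n")
    spool_file.flush()

def drain_spool():
    # Drain the spool to Kinesis at a rate the stream can sustain;
    # retries and batching (PutRecords) are omitted for brevity.
    kinesis = boto3.client("kinesis")
    with open(SPOOL_PATH, encoding="utf-8") as spool:
        for line in spool:
            kinesis.put_record(StreamName=STREAM_NAME,
                               Data=line.encode("utf-8"),
                               PartitionKey="export")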


alxcnt commented on September 4, 2024

Hello @iansrobinson, I appreciate the prompt response, and thank you for the fix; I'll try it ASAP and let you know!
Regarding the file output, thanks for confirming that it can't be turned off. While looking at the code myself I noticed that files are used as a buffering solution, but I had hoped it could be avoided somehow.


alxcnt commented on September 4, 2024

Hello @iansrobinson. When trying the build made from your PR branch I get the following error, which seems to be unrelated to the problem. It didn't appear in v1.11, so I assume it's related to changes introduced after that release.

java.lang.IllegalStateException: Unable to identify cluster ID from endpoints: [neptune endpoint-redacted]
        at com.amazonaws.services.neptune.cluster.NeptuneClusterMetadata.createFromEndpoints(NeptuneClusterMetadata.java:84)
        at com.amazonaws.services.neptune.cli.CommonConnectionModule.clusterMetadata(CommonConnectionModule.java:96)
        at com.amazonaws.services.neptune.ExportPropertyGraphFromGremlinQueries.lambda$run$0(ExportPropertyGraphFromGremlinQueries.java:90)
        at com.amazonaws.services.neptune.util.Timer.timedActivity(Timer.java:41)
        at com.amazonaws.services.neptune.util.Timer.timedActivity(Timer.java:34)
        at com.amazonaws.services.neptune.ExportPropertyGraphFromGremlinQueries.run(ExportPropertyGraphFromGremlinQueries.java:88)
        at com.amazonaws.services.neptune.export.NeptuneExportRunner.run(NeptuneExportRunner.java:72)
        at com.amazonaws.services.neptune.export.NeptuneExportService.execute(NeptuneExportService.java:235)
        at com.amazonaws.services.neptune.export.NeptuneExportLambda.handleRequest(NeptuneExportLambda.java:173)
        at com.amazonaws.services.neptune.RunNeptuneExportSvc.run(RunNeptuneExportSvc.java:60)
        at com.amazonaws.services.neptune.export.NeptuneExportRunner.run(NeptuneExportRunner.java:72)
        at com.amazonaws.services.neptune.NeptuneExportCli.main(NeptuneExportCli.java:48)
[main] WARN com.amazonaws.services.neptune.export.ExportToS3NeptuneExportEventHandler - Uploading results of failed export to S3
[main] WARN com.amazonaws.services.neptune.export.ExportToS3NeptuneExportEventHandler - S3 output path is empty

In our case, the Neptune cluster is accessible over VPN via the cluster (and other) endpoints, while we attempt the export with streaming to a Kinesis stream created under a different AWS account (than the Neptune cluster), using credentials for that account.
How could we solve this using the latest version of the neptune-export tool with the fix applied?


alxcnt commented on September 4, 2024

Hello @iansrobinson thanks for the clarification!
My understanding is that if you don't have writes then you wouldn't need to clone the cluster to have a consistent view of the data for the export (cloning the cluster was/is an optional param, AFAIR).
It'd be great if we could specify two sets of credentials: one for accessing the cluster and one for writing to the stream / S3 bucket. This may be needed when one operates under the same account but with separate roles for service access, or when using two different accounts.
The above is our current case; could you please consider adding it as a feature?


iansrobinson commented on September 4, 2024

If you're running this from the command line, how would you expect to pass those credentials?

  • As the names of different IAM profiles (see the sketch after this list)
  • Explicitly, as two sets of access key and secret key
  • Via two different sets of environment variables
  • Via two different sets of system properties
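
For illustration, the first of those options might look like this from a script (a hypothetical sketch; both profile names are invented):

# Hypothetical sketch of the "two named profiles" option.
# Both profile names are invented for illustration.
import boto3

# Credentials that can reach the Neptune Management API.
cluster_session = boto3.Session(profile_name="neptune-cluster-account")
neptune = cluster_session.client("neptune")

# Separate credentials for the export targets (Kinesis stream / S3).
target_session = boto3.Session(profile_name="export-target-account")
kinesis = target_session.client("kinesis")
s3 = target_session.client("s3")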


alxcnt commented on September 4, 2024

Perhaps the ability to configure a prefix for env vars providing alternative configuration for accessing the Neptune cluster could be a solution.

{
  "command": "export-pg-from-queries",
  "params": {
    "authVarsPrefix": "AWS_NEPTUNE_",
    ...
  }
}

so you could specify an alternative profile, session token, etc.
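
For illustration, the lookup could fall back from the prefixed variable to the standard one (a hypothetical sketch of the proposed behaviour; authVarsPrefix is the suggested parameter, not an existing one):

# Hypothetical sketch of the proposed authVarsPrefix lookup:
# prefer AWS_NEPTUNE_-prefixed variables for cluster access and
# fall back to the standard AWS_* variables when no override is set.
import os

def resolve(name, prefix="AWS_NEPTUNE_"):
    # e.g. resolve("AWS_PROFILE") checks AWS_NEPTUNE_PROFILE first.
    override = prefix + name[len("AWS_"):]
    return os.environ.get(override, os.environ.get(name))

cluster_profile = resolve("AWS_PROFILE")
cluster_session_token = resolve("AWS_SESSION_TOKEN")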


alxcnt commented on September 4, 2024

Hello @iansrobinson, I've tried your fix by applying the patch to the version tagged amazon-neptune-tools-1.12, and I still observe the behaviour reported in the issue's description. Namely:

  • While reading from the stream (using a Python consumer implemented with boto3), the following error occurs (same as before):
    UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-2: invalid continuation byte
  • The files found in exports/../*.json have an incomplete line at the end
  • When all queries are executed the tool doesn't terminate and produces the following output:
    [kpl-daemon-0003] INFO com.amazonaws.services.kinesis.producer.LogInputStreamReader - [2023-03-01 19:17:47.170641] [0x00010d31][0x00007000099e7000] [info] [processing_statistics_logger.cc:114] Stage 2 Triggers: { stream: 'your-export-stream-name', manual: 0, count: 0, size: 0, matches: 0, timed: 0, KinesisRecords: 0, PutRecords: 0 }


iansrobinson commented on September 4, 2024

Hi @alxcnt

There was an additional fix after the version tagged 1.12: 2d281ea

Export files should be deleted once the export completes: the fact that the tool doesn't terminate and you still have export files suggests that the issue is at least partly a result of this fix not having been applied.


alxcnt commented on September 4, 2024

Thanks, I'll use this commit for testing your fix and will let you know the results.


iansrobinson commented on September 4, 2024

OK - testing with a highly concurrent export, I see the unicode errors again. This needs a larger piece of diagnostic work, which we'll have to schedule.


alxcnt commented on September 4, 2024

Confirmed: after applying those 2 patches to 1.12, the files get deleted and the tool terminates on completion, but the unicode decoding issue remains, as you say.
Would it be possible to share an ETA for when the fix will be available?


iansrobinson commented on September 4, 2024

Hi @alxcnt

So, it turns out the unicode issue is not a bug, but a feature of the way in which data is written to a Kinesis Data Stream.

neptune-export uses the Kinesis Producer Library to write to Kinesis, and by default, records are aggregated before being written to the stream. This means that the client must deaggregate the records before processing them.

In Python you can use the record deaggregation module from the Kinesis Aggregation/Deaggregation Modules for Python to deaggregate records.
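
For example, a boto3 consumer might deaggregate like this (a sketch assuming the aws_kinesis_agg package from that repository, and that its deaggregate_records function accepts records as returned by a boto3 GetRecords call; the stream name and shard ID are placeholders):

# Sketch of a consumer that deaggregates KPL-aggregated records.
# Assumes the aws_kinesis_agg package; names below are placeholders.
import json
import boto3
from aws_kinesis_agg.deaggregator import deaggregate_records

kinesis = boto3.client("kinesis")
shard_iterator = kinesis.get_shard_iterator(
    StreamName="your-export-stream-name",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON")["ShardIterator"]

response = kinesis.get_records(ShardIterator=shard_iterator)

# Each aggregated record expands into its constituent user records;
# 'Data' on each deaggregated record holds the original payload bytes,
# so the UTF-8 decode now succeeds.
for record in deaggregate_records(response["Records"]):
    print(json.loads(record["Data"].decode("utf-8")))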

I've also submitted a PR that adds a "disableStreamAggregation":true option to the command parameters: aws/neptune-export#15
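
With that option (modelled on the params example earlier in this thread; the option name comes from the PR), the command would look something like:

{
  "command": "export-pg-from-queries",
  "params": {
    "disableStreamAggregation": true,
    ...
  }
}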


alxcnt commented on September 4, 2024

Hello @iansrobinson, thanks for tracking this down!
With 3 patches on top of 1.12 I was able to stream the records and read them back using the simple Python consumer (I used the disableStreamAggregation option and didn't change the reading code).
This is awesome news; I'm looking forward to the PR being merged and to the separate cluster authorisation credential option being supported. Thank you again for this.

