Comments (33)

ryanrupp avatar ryanrupp commented on August 30, 2024 1

@houmark I believe the Athena query history is showing the engine time (the time actually spent executing the query), but there's also the concept of queue time, which may explain the discrepancy you're seeing (with the lower concurrency limits noted below contributing to that). As of November 2019 Athena exposes metrics for this; see here and the announcement here. This was an ask I had via support, so I was excited to see it :) Previously we would take client wall time minus Athena engine time (a stat returned in the API response) to get a rough estimate of "queue time".

Anyway, you can either publish those stats to CloudWatch (linked article) or, if you're using the AWS SDK directly, read the query stats from the response and do something with them on your end (log heavily queued queries, etc.). If you're using an abstraction like the JDBC driver, I'm not sure offhand, but you could possibly downcast/unwrap the result set to read the Athena-specific details (not positive about that, or whether it would be recommended).
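
For anyone following along, here's a minimal sketch (Python/boto3) of the approach described above: read the queue/engine timings from the GetQueryExecution response and optionally push them to CloudWatch. The `Custom/Athena` namespace and the logging threshold are illustrative choices, not anything prescribed by the linked article.

```python
"""Sketch: pull per-query stats from Athena and surface queue time."""
import boto3

athena = boto3.client("athena")
cloudwatch = boto3.client("cloudwatch")

def report_queue_time(query_execution_id: str) -> dict:
    execution = athena.get_query_execution(QueryExecutionId=query_execution_id)
    stats = execution["QueryExecution"]["Statistics"]

    queue_ms = stats.get("QueryQueueTimeInMillis", 0)
    engine_ms = stats.get("EngineExecutionTimeInMillis", 0)
    total_ms = stats.get("TotalExecutionTimeInMillis", 0)

    # Optionally publish to CloudWatch so high queue times can be alarmed on.
    cloudwatch.put_metric_data(
        Namespace="Custom/Athena",  # assumption: any custom namespace you own
        MetricData=[{
            "MetricName": "QueryQueueTime",
            "Value": queue_ms,
            "Unit": "Milliseconds",
        }],
    )

    if queue_ms > engine_ms:
        print(f"{query_execution_id}: queued {queue_ms} ms vs engine {engine_ms} ms")
    return {"queue_ms": queue_ms, "engine_ms": engine_ms, "total_ms": total_ms}
```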

avirtuos avatar avirtuos commented on August 30, 2024 1

avirtuos avatar avirtuos commented on August 30, 2024

This seems related to #81 as well. @atennak1 I wonder if we should have all connectors (since this issue won't be unique to DDB) ignore fields they don't support. That would allow folks to use tables that have unsupported columns. Either that, or we'd have to use some generic coercion for those fields... thoughts?

avirtuos avatar avirtuos commented on August 30, 2024

@houmark I also wanted to comment on a couple of things you mentioned above. Firstly, we are thrilled you are finding success with this connector, and thank you for working with us through some of these rough edges. Please keep in mind that these features are currently in 'Preview' and, as such, account limits are being managed a bit differently (we are less likely to grant limit increases for preview features). We advise against using these preview features in production workloads until they are made generally available. As we gather feedback from customers such as yourself, we are increasing the limits as well as refining the feature set and programming model. I want to ensure you and your product have the appropriate level of support from the Athena team when your feature goes live. Let me know if you have questions; happy to connect on a call to better explain the preview vs. general availability classification.

houmark avatar houmark commented on August 30, 2024

Yeah, one of them seems related to #81 or the empty-struct bug, and the other one just seems like it's an unsupported field type (date). I guess it could fall back to ignoring the field or trying to stringify the value, which may work in some cases.
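
To make the two options concrete, here's a rough illustration (in Python for brevity; the actual connectors are Java) of the ignore-vs-stringify fallback being discussed. The type list and helper names are invented for the example.

```python
# Illustration only: shows the "ignore vs. stringify" fallback idea for
# attribute values the connector doesn't natively support.
SUPPORTED_TYPES = (str, int, float, bool)  # assumption: whatever maps cleanly to the engine

def coerce_value(value, ignore_unsupported=False):
    """Return a value the engine can handle, or None to drop the field."""
    if value is None or isinstance(value, SUPPORTED_TYPES):
        return value
    if ignore_unsupported:
        return None        # option 1: silently skip the unsupported column
    return str(value)      # option 2: generic coercion to a string

def coerce_row(row: dict, ignore_unsupported=False) -> dict:
    coerced = {k: coerce_value(v, ignore_unsupported) for k, v in row.items()}
    return {k: v for k, v in coerced.items() if v is not None}
```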

Thanks for your feedback @avirtuos. I appreciate the clarification, and I'm fully aware that this is not "production stable" and is in public preview, but we really did not have another option; the alternatives were either no more stable or far too complex to implement. Raising the Athena DML limit seems to have helped somewhat, but we are still seeing random failing queries without any corresponding log event in the Lambda for the failed query. It simply seems to be Athena failing, which is really annoying, as our UI then does not load the chart. We are using Amplify, so the entire setup is still a bit complex: we query through GraphQL/AppSync and then rely on subscriptions to get the data back to the UI.

I'd love to show our case to you guys (I think that if things stabilize, the federated Athena setup would be an A+ case to add to Amplify for statistics querying and the like, and it could be wired up pretty easily by the developer). I'd also like to pass on some of the Query IDs if that enables you or the Athena team to look into why Athena is randomly failing.

atennak1 avatar atennak1 commented on August 30, 2024

@houmark please go ahead and post some query IDs here and we'll take a look.

houmark avatar houmark commented on August 30, 2024

Here's a bunch of Query IDs collected at random times. I have many more than these, but the picture is the same: the query fails, and not due to errors on my end; the query has not changed for days:

bc6f1fe3-a931-49a2-a647-5b4f015a5378
d4abf18b-4307-4363-a914-1b3c92e815c9
a71a1dae-f196-48c9-afcb-964e7a22ca76
f688dd7e-139c-44f0-8f54-1ad9fb7ce103
aa6f53d2-bf30-4bf7-ad22-779c68a310aa
42a71a46-d95a-418c-9c08-5a3ab2ee4e7e
30f452d0-a6d4-4556-a131-889bbff0d79d
79584ef6-59cc-48c2-afaa-d74d5dd61491
4fd76338-3c82-40c5-913e-8c2fea6137a4

atennak1 avatar atennak1 commented on August 30, 2024

@houmark looks like these all hit the same unhealthy compute resource in Athena. I've terminated it and opened an internal issue to root-cause why it wasn't automatically detected. Let me know if you're still seeing failed queries going forward.

houmark avatar houmark commented on August 30, 2024

That’s great. I will for sure keep an eye on it and report further issues back. Thanks!

houmark avatar houmark commented on August 30, 2024

Early feedback: no failed queries so far, so it may have been just that one unhealthy instance. Hoping the Athena team can make sure such instances are removed automatically in the future, so users don't have to stand guard over that.

One thing that's weird: I can see the query time in the Athena query history, and for many queries it's around 3 seconds, which is good, but I don't get the result back to the UI until 12-15 seconds later (the UI is listening on a socket that is pushed based on an S3 trigger). Do you have any tips on how I can debug where the slowdown may be? I know this is unrelated to the original issue and some parts of this are outside the Athena Federated area, but I thought I'd ask anyway.
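
One way to start narrowing this down (a sketch, assuming the standard `s3://bucket/prefix/<query-id>.csv` result layout): compare Athena's own submission/completion timestamps with the LastModified time of the result object in S3. Anything that happens after that point is downstream of Athena.

```python
"""Sketch: where does the wall-clock time go for a single query?"""
from urllib.parse import urlparse
import boto3

athena = boto3.client("athena")
s3 = boto3.client("s3")

def timeline(query_execution_id: str) -> None:
    qe = athena.get_query_execution(QueryExecutionId=query_execution_id)["QueryExecution"]
    status = qe["Status"]

    # Where Athena wrote the result CSV.
    output = urlparse(qe["ResultConfiguration"]["OutputLocation"])
    head = s3.head_object(Bucket=output.netloc, Key=output.path.lstrip("/"))

    print("submitted:        ", status["SubmissionDateTime"])
    print("athena completed: ", status["CompletionDateTime"])
    print("result in S3:     ", head["LastModified"])
    # Anything the UI sees after LastModified is downstream of Athena:
    # S3 trigger -> Lambda -> DynamoDB -> AppSync subscription -> browser.
```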

houmark avatar houmark commented on August 30, 2024

And just as I wrote that, I had the first failed query, and it also points to an unhealthy instance: 4c9ab3eb-8ee0-40e3-af98-4720698995ca.

Edit: A few more queries that failed:

b99d37c2-3c3d-406f-8cfd-da97603fc39d
5c512cf7-b655-42fd-8eb9-56d7bba76a21
fe9a5421-83be-4087-b699-92f43b625d7d
f1ca9368-bf98-4002-a787-f7c452a24d9c (resource not ready error)
5c512cf7-b655-42fd-8eb9-56d7bba76a21
bdce6c63-481d-4c0d-8ffc-fd12fe08193b

avirtuos avatar avirtuos commented on August 30, 2024

@atennak1 can we take a deeper look at the logs on these instances, both Presto and OS? I want to make sure we don't have a memory leak or some other issue sneaking up on us. The last round of stress testing seemed to catch the last leak / memory-fragmentation issue from how we were using Apache Arrow, but it might be best to double-check. Let's chat about priority on Monday with Michael, as this could also be upgrade-related.

houmark avatar houmark commented on August 30, 2024

I'd like to add that overall, after the removal of that one unhealthy instance, queries have been loading more stably. The Query IDs I sent were all the ones I could find in the query history out of approximately 1,000 queries. Up until the removal of the unhealthy instance yesterday, I would easily have 5-10% failed queries.

My main concern right now is that there seems to be quite a lot of latency between the time the query should take according to the query history UI and the time it takes before I have the result back in the UI. For that, I am not sure where the slowdown is, or whether the query history UI in Athena is even reliable regarding the time it shows a query took. For example, right now the history UI shows "Data scanned: 0 KB" for all queries, which it normally only does when a query has an error. That may just be an Athena UI error, though.

I'd be happy to share more information about my case and jump on a call if you'd like. My setup is fairly simple, so it may be a good candidate for debugging.

houmark avatar houmark commented on August 30, 2024

Here are a few more queries to investigate.

Super slow to return results (around 2 minutes), even though the Athena UI says the query only took a few seconds:

41db43aa-db9c-4540-95ea-aabeb034aaf4 (but Athena UI says 8.91 seconds)
66395995-2e46-4e8d-9e76-cfdb68d45632 (but Athena UI says 3.24 seconds)

Failed, not due to unhealthy instance (ErrorCode: INTERNAL_ERROR_QUERY_ENGINE):

d271d9d8-ff88-45a1-8cf5-5b18f321c6cf
0a8f7ad3-31f0-4a2c-a314-5170e850be96
cb1c9a4b-4914-451c-9cc1-677dd43176b2
7b7e930a-32bc-40a7-bbe6-38db29963967
ca51bfc1-1b1c-4353-8699-87aa27e8b538
1b54008e-eede-44e0-b462-c3f4eaa29b56
6b6d8cea-7d3b-4dba-970e-b72729efc0bf
3b92f99d-3b9a-499c-b1eb-6e645fafad17 (42.92 seconds execution time and then failed, when it normally would take 3-4 seconds)

I'm seeing a lot of these INTERNAL_ERROR_QUERY_ENGINE errors today, whereas over the last 48-72 hours I don't recall seeing a single one. Very weird. The error rate is around 10%.

houmark avatar houmark commented on August 30, 2024

This is how it's looking right now: a high failure rate, at random, with no queries changed (they all work), but Athena is failing at will. This was 21 queries sent at once through the API; 6 failed:
[Screenshot: Athena query history showing the failed queries]

avirtuos avatar avirtuos commented on August 30, 2024

houmark avatar houmark commented on August 30, 2024

I'm aware of the supposed 20-query DML concurrency limit, and I saw a lot of errors in the past which I now attribute to the unhealthy instances. But I requested that the limit be raised to 100, which the support team confirmed was done, and after that I saw better stability; maybe that limit doesn't apply to preview features, though? I had a few really stable days in terms of queries not actually failing, but in most cases they still returned far more slowly than the time Athena says they took in the UI.
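
For anyone else bumping into the DML concurrency limit, here is a rough client-side sketch of capping how many queries are in flight at once. `MAX_CONCURRENT`, the output location, and the 1-second poll interval are placeholder choices, not recommendations.

```python
"""Sketch: keep StartQueryExecution calls under the account's DML concurrency limit."""
import time
from concurrent.futures import ThreadPoolExecutor

import boto3

athena = boto3.client("athena")
MAX_CONCURRENT = 10  # assumption: comfortably below the account's DML limit

def run_query(sql: str, output_location: str) -> str:
    qid = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    while True:  # simple polling loop; a real client would back off / time out
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return f"{qid}: {state}"
        time.sleep(1)

def run_all(queries, output_location):
    # The thread pool itself acts as the concurrency cap.
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
        return list(pool.map(lambda q: run_query(q, output_location), queries))
```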

I have a plan to combine more queries (and to add caching of results as well), and I have one query that does a lot of subqueries, but as I add more, the query time increases, which makes the UI feel even slower than if each chart loads little by little.

As is probably obvious by now, I'd really like to contribute whatever feedback helps you improve this, as I see the federated solution as the one with the most potential for an Amplify statistics setup. Please do reach out so we can find a way to make this work! :)

If this stabilizes, I'd consider writing a post for the Amplify blog on how to wire up Amplify with Athena Query Federation, for other Amplify users to consider. During this process I did extensive research and discussed with other Amplify users the options for creating a real-time stats dashboard, and all the other options were even more complex and error-prone than this one. And as I mentioned, having the CLI wire all this up in a few minutes seems trivial, so I'm sure the Amplify team would be interested in adding that when this is released to the public.

atennak1 avatar atennak1 commented on August 30, 2024

Thanks for your patience, @houmark. I'm testing a potential fix for the INTERNAL_ERROR_QUERY_ENGINE failures. I will update here when it gets pushed out to the Preview stack.

houmark avatar houmark commented on August 30, 2024

Brilliant, happy to see my feedback is helpful :) As I understand it, this is pushed on your end, so no need for me to re-deploy the connector?

atennak1 avatar atennak1 commented on August 30, 2024

Correct

atennak1 avatar atennak1 commented on August 30, 2024

We've deployed the stability fix. Let us know if you still receive a high rate of generic failures.

houmark avatar houmark commented on August 30, 2024

Thanks @atennak1! So far so good: no failed queries. I'll keep monitoring and report back here.

The main thing now is that there is still a fairly large delay between the time Athena says a query took and my UI receiving the data. That may be a slowdown in Athena itself, in S3 writing the data (it's a very small dataset), the S3 trigger, the DynamoDB write/trigger (the Lambda that receives the S3 trigger updates a DynamoDB record, which triggers the subscription/socket update to the UI), or the socket connection. I have confirmed my receiving Lambda trigger is not slow in any way. Any recommendations on how to trace that?
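
One low-tech way to trace it (a sketch; the handler internals and field names are placeholders, not the real Lambda): log a timestamp plus the S3 event time at the top of the S3-triggered Lambda, and again right after the DynamoDB write, so each hop's lag shows up in CloudWatch Logs.

```python
"""Sketch: instrument the S3-triggered Lambda so per-hop lag is visible in logs."""
import datetime
import json

def handler(event, context):
    now = datetime.datetime.now(datetime.timezone.utc)
    for record in event["Records"]:
        # S3 includes the object-created time in the event itself.
        event_time = datetime.datetime.fromisoformat(
            record["eventTime"].replace("Z", "+00:00")
        )
        print(json.dumps({
            "key": record["s3"]["object"]["key"],
            "s3_event_time": record["eventTime"],
            "lambda_start": now.isoformat(),
            "s3_to_lambda_seconds": (now - event_time).total_seconds(),
        }))
        # ... existing logic: update the DynamoDB record that drives the
        # AppSync subscription, and log another timestamp right after that write.
```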

Also, Athena is constantly showing 0 KB scanned for all queries. It wasn't always like that (only on failed queries), but it's been like that for a week or so now.

avirtuos avatar avirtuos commented on August 30, 2024

houmark avatar houmark commented on August 30, 2024

OK, I did not know that, and it used to show the amount of data scanned even in the preview. I was under the impression that the pricing was $5 per TB scanned with a minimum charge of 10 MB per query, which made this a fairly good pricing model (effectively giving us 100k queries for $5, since we never hit 10 MB in one query). I certainly hope you are not coming up with a more expensive pricing model than that, because that would make me look bad internally. I don't recall reading anything about a custom pricing model being in the works in the (otherwise very comprehensive) GitHub notes / wiki.
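
For reference, a quick back-of-the-envelope check of the pricing model described above (treating 1 TB as 1024^4 bytes; exact billing rounding may differ):

```python
# Rough check of "$5 per TB scanned, 10 MB minimum billed per query".
price_per_tb = 5.00
min_bytes_billed = 10 * 1024**2          # 10 MB minimum per query
queries = 100_000

bytes_billed = queries * min_bytes_billed
cost = bytes_billed / 1024**4 * price_per_tb
print(f"{queries:,} minimum-sized queries ~= ${cost:.2f}")   # about $4.77, i.e. roughly $5
```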

I'll share more queries tonight, but would we somehow be able to get it raised for our account to test whether that improves things? I think we may have a mutual interest in verifying that :)

Our setup feels a bit complex, but it's no more complex than what other AWS devs have written about for non-federated queries, so this has the potential to work.

avirtuos avatar avirtuos commented on August 30, 2024

houmark avatar houmark commented on August 30, 2024

"If you email me [email protected], Akila and I will connect you with our product folks so they can discuss your testing limits and perhaps get your feedback regarding pricing."

Done!

houmark avatar houmark commented on August 30, 2024

Very good info there, @ryanrupp. And yes, it looks like QueryQueueTime is in fact, in many cases, longer than the actual execution time for us, even for the very first queries after a period with no activity. And as soon as we kick off too many queries in a short time period, the QueryQueueTime increases aggressively.

There seem to be two potential issues: the QueryQueueTime above, and then a delay in getting results back in some situations. The first one is confirmed and may be due to limits on this preview functionality, as @avirtuos mentioned before; the second one I have a hard time pinning down well enough to provide Query IDs, but I will keep trying and report back.

The most recent fix by @atennak1 seems to work. I have not seen any failed queries besides one due to an unhealthy instance, with the Query ID edba4462-4bd9-4724-86e6-0b44c6fd091b. You may want to check whether that instance is healthy again or whether it was removed automatically.

I'd be very interested in trying raised limits for our account, to see if we can bring down the queue time and get our UI to load faster. We do plan to cache results so we can short-circuit the query when we have a valid cached result, but due to the extensive extra work in making this "just" work, that feature is still in progress.
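
A minimal sketch of that kind of result cache (table name, key schema, TTL, and serialization are all assumptions for illustration): hash the SQL, look it up in DynamoDB, and only hit Athena on a miss.

```python
"""Sketch: short-circuit a chart's query when a recent result for the same SQL exists."""
import hashlib
import json
import time

import boto3

table = boto3.resource("dynamodb").Table("athena-result-cache")  # placeholder table
CACHE_TTL_SECONDS = 300  # assumption: 5-minute freshness is acceptable for the charts

def cached_result(sql: str):
    key = hashlib.sha256(sql.encode()).hexdigest()
    item = table.get_item(Key={"query_hash": key}).get("Item")
    if item and item["expires_at"] > time.time():
        return json.loads(item["payload"])      # cache hit: skip Athena entirely
    return None

def store_result(sql: str, rows) -> None:
    key = hashlib.sha256(sql.encode()).hexdigest()
    table.put_item(Item={
        "query_hash": key,
        "payload": json.dumps(rows),
        "expires_at": int(time.time()) + CACHE_TTL_SECONDS,  # pair with a DynamoDB TTL attribute
    })
```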

houmark avatar houmark commented on August 30, 2024

Hi guys,

Another update from me. It's been a few days and I have more observations from actively using the DynamoDB connector.

Positive:

No failed queries. It seems @atennak1's most recent fix is working smoothly. I'm also not seeing unhealthy instances, so our UI is loading consistently, apart from a bug in AppSync subscriptions which I have yet to fully identify. When that happens, nothing loads, as no results come back on the socket.

Not so positive:

Looking at CloudWatch metrics, the performance numbers are quite concerning:

  1. There's always a high QueryQueueTime. Often the queue time is longer than the query itself, even with no recent load/use.

See this chart:

[Screenshot: CloudWatch metrics chart, 2020-01-15 7:12 PM]

As you can see, the QueryQueueTime is nearly double the execution time. We're working with a small dataset and the queries are simple, so they generally finish in 3-5 seconds. A few take longer than that, but in general we don't have queries that take longer than 10 seconds. We have tried combining many queries, but that also added 2-3 seconds per subquery, and that's not desirable from a UI perspective, as the UI will show nothing for a long time and then load all charts at once, instead of at least loading chart by chart so the user sees data coming back.

Now, as soon as we put a bit of load on, by loading our UI a few times in rapid succession, we see the QueryQueueTime rise aggressively:

[Screenshot: CloudWatch metrics chart, 2020-01-15 6:38 PM]

This may be due to the preview's very low concurrency limits, but my worry is that raising the limits and raising traffic will still keep the queue time high.

And yes, one can argue we are attacking this wrong and should pre-compute or cache queries; we are considering both, but I wanted to report back here in case there are tweaks on your end. Another angle is to show far fewer charts (we are currently rendering 20 charts in one go), and that is also on our list, but loading 4-5 charts will not remove the queue time, so we are trying to find the sweet spot for how many to load at a time without the UI feeling insanely slow or the user having to click too many times to see the data they want.

Another thing we have not pursued yet is creating DynamoDB indexes on certain fields we query on from the Athena side. First of all, I am not sure those indexes would even be used when the connector does a scan, and the query/engine time itself is actually okay for us; the queue time is what is slowing things down.

I have not been able to verify any significant delay after the query has actually finished in Athena, so the elephant in the room seems to be the QueryQueueTime.
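
To put numbers on that across more than one query, here is a sketch that pulls the same per-query statistics the CloudWatch charts are built from (Python/boto3; last 50 query executions only, pagination omitted).

```python
"""Sketch: quantify how often recent queries spend longer queued than executing."""
import boto3

athena = boto3.client("athena")

def recent_queue_times(max_queries: int = 50):
    ids = athena.list_query_executions(MaxResults=max_queries)["QueryExecutionIds"]
    rows = []
    for execution in athena.batch_get_query_execution(QueryExecutionIds=ids)["QueryExecutions"]:
        stats = execution.get("Statistics", {})
        rows.append((
            execution["QueryExecutionId"],
            stats.get("QueryQueueTimeInMillis", 0),
            stats.get("EngineExecutionTimeInMillis", 0),
        ))
    queued_longer = [r for r in rows if r[1] > r[2]]
    print(f"{len(queued_longer)}/{len(rows)} recent queries spent longer queued than executing")
    return rows
```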

avirtuos avatar avirtuos commented on August 30, 2024

houmark avatar houmark commented on August 30, 2024

Yeah, I haven't heard from your PM yet.

I fully understand the preview limits, and if those are increased we would be less likely to see the QueryQueueTime climb as fast and as high as it currently does, but I'm still concerned about the QueryQueueTime after a quiet period. Queue times consistently up in the 5-10 second range would be reason enough for many not to consider this setup a valid option at all, I think. Most can live with ~1 second, but I think most would expect it to be much less.

One question while we wait for the PM to join: is it technically possible for you to raise the limit on one account while this is in preview? As I mentioned before, this would help everyone understand the total execution time of common use cases, and could provide some reference metrics for potential new users on what to expect from a setup like this one (in the README, for example).

avirtuos avatar avirtuos commented on August 30, 2024

houmark avatar houmark commented on August 30, 2024

I mean, for example, an hour or so with no queries to Athena from our account at all (I can verify this in the Athena Query History UI and also in the metrics charts). But maybe the limits are global for all users in the preview, which means I can't know whether other testers have been querying?

I've considered running the queries one by one, but by my best estimates that would end up being even slower in the UI, and unfortunately it's not trivial for us to change these charts to run one by one. I may revisit this and give it a shot just to have tried it.

avirtuos avatar avirtuos commented on August 30, 2024

I'm marking this issue as resolved since it likely relates to account limits, now that the stability issues have been identified. The final stability fix is rolling out in the next couple of days, but the remaining issue doesn't seem to be impacting customer queries. Please reopen if you feel more investigation is needed.
