Code Monkey home page Code Monkey logo

Comments (9)

saumehta9 avatar saumehta9 commented on August 20, 2024

@IrvingMg , Thanks for reaching out !

Can you share some more details, we would want to see whats happening inside insert and update ? Would you mind sharing the code ?

Also are you getting an OCC during insert or update ?

from amazon-qldb-driver-python.

IrvingMg avatar IrvingMg commented on August 20, 2024

@saumehta9 Thanks for your reply. Here are the functions to insert and update:

MAX_DOCS_IN_TRANSACTION = 40

def chunks(docs, size):
    for i in range(0, len(docs), size):
        yield docs[i:i+size]

def insert(session, table_name, docs):
    for chunk in chunks(docs, MAX_DOCS_IN_TRANSACTION):
        session.execute_lambda(lambda executor: insert_documents(executor, table_name, chunk),
        lambda retry_attempt: print('Retrying due to OCC conflict...'))
        
def insert_documents(transaction_executor, table_name, documents):
    statement = 'INSERT INTO {} ?'.format(table_name)
    transaction_executor.execute_statement(statement, loads(dumps(documents)))

def update(session, table_name, docs):
    for doc in docs:
        session.execute_lambda(lambda executor: update_document(executor, table_name, doc),
        lambda retry_attempt: print('Retrying due to OCC conflict...'))

def update_document(transaction_executor, table_name, doc):
    statement = "UPDATE {} AS e SET e = ? WHERE e.ID = ?".format(table_name)
    transaction_executor.execute_statement(statement, loads(dumps(doc)), loads(dumps(doc["ID"])))

Actually those are all of the function that I have, my lambda function is going to be used only to insert and update into QLDB tables. In my handler, I have the code that I posted at the beginning.

In regards to the OCC, the error occurs during the update. I tested only the insertion with the 500 docs and there was no problems. I also tested with shorter lists and It works correctly but it seems that error starts to happen when the list has around 250 docs to be updated. I'm thinking that I need an efficient way to update the docs.

from amazon-qldb-driver-python.

marcbowes avatar marcbowes commented on August 20, 2024

@IrvingMg,

OCC conflicts mean that you read data during your transaction that was changed by the time you committed the transaction. Typically this is because you're not using indexes. So the first question I'd ask is: did you add an index on ID? If not, your statements are doing scans. Specifically, my bet is that your UPDATE is causing a table scan:

"UPDATE {} AS e SET e = ? WHERE e.ID = ?"

QLDB updates always read the current version of the document before writing the update. Most customers are doing specific attribute updates, not wholesale replaces like you are. So you can imagine needing the previous copy of the document such that the new revision can be generated (we need a full revision to compute a hash).

To add an index, you can refer to the docs, but the gist of it is you need to run CREATE INDEX ON MyTable (ID).

Another way you can get OCC conflicts is if there is contention. Imagine a million threads all trying to replace a specific ID. This is a classic hot key problem. Again, my bet is that this is not what you're running into.

All that said, your overall transaction pattern is broken from a concurrency point of a view. Specifically, you must include your reads and writes in a single transaction to be correct. The way you're doing it is not idempotent and will result in duplicate records and other forms of data corruption. Note that this isn't unique to QLDB - any database will require you to leverage its concurrency controls to get correct results.

In this case, what I recommend doing is starting a single transaction for each chunk of 40 documents. In that transaction, you would read the current values in QLDB (you can use the WHERE ID IN (?,?,?,?...) predicate for this) and then use the results to partition the chunk into inserts or updates or no-ops based on the results you got back. In this way, your Lambda function can be re-invoked with the same inputs and it will correctly mutate the data in QLDB.

from amazon-qldb-driver-python.

IrvingMg avatar IrvingMg commented on August 20, 2024

@marcbowes Thanks for your explanation!

Currently I don't have an index on ID and actually I only need update some attributes. So, could that be the cause of the OCC issue?

I also noticed that when the lambda finishes and the OCC issue occurs, what you said is happening, the QLDB table has duplicate results. That's because the lambda is not idempotent, right? So, in this case I think that I don't understand. Would I need to update the docs in a single transaction to avoid an OCC conflict which could generate duplicate results? How can I update multiple documents in a single transaction?

from amazon-qldb-driver-python.

marcbowes avatar marcbowes commented on August 20, 2024

Currently I don't have an index on ID and actually I only need update some attributes. So, could that be the cause of the OCC issue?

Yes. The index is needed to efficiently look up the document (based on your WHERE clause). Without the index, you need to scan. Scanning is slower because it is O(n) and not O(1). But that's not the issue! The issue is what you read. If you "scan", you're "reading everything". So if you "read everything" then if "anything" changes your commit will fail. With an indexed lookup, you're only reading one document, so your commit will only fail if that one document fails.

I also noticed that when the lambda finishes and the OCC issue occurs, what you said is happening, the QLDB table has duplicate results. That's because the lambda is not idempotent, right? So, in this case I think that I don't understand. Would I need to update the docs in a single transaction to avoid an OCC conflict which could generate duplicate results? How can I update multiple documents in a single transaction?

Yes.

Transactions can have multiple statements. So just send your INSERT/UPDATE statements (as you have) in one transaction.

from amazon-qldb-driver-python.

IrvingMg avatar IrvingMg commented on August 20, 2024

@marcbowes Thanks! I'm updating only the fields required with the index on the ID and now the OCC issue doesn't occur.

However, I still have the duplicate results so I'm going to use the transactions to update the docs. But I'm not sure if I have to use SessionClient or Transaction. In the first case, I think I'd need to compute the commit digest, does the driver provide a way to compute it?

from amazon-qldb-driver-python.

marcbowes avatar marcbowes commented on August 20, 2024

@marcbowes Thanks! I'm updating only the fields required with the index on the ID and now the OCC issue doesn't occur.

That's good to hear.

However, I still have the duplicate results so I'm going to use the transactions to update the docs. But I'm not sure if I have to use SessionClient or Transaction. In the first case, I think I'd need to compute the commit digest, does the driver provide a way to compute it?

No, don't do that! :) The docs for SessionClient say:

This class is not meant to be used directly by developers

and I'd echo that. You really should only need to use the top level method on the driver. That is, you should use execute_lambda on https://amazon-qldb-driver-python.readthedocs.io/en/latest/reference/driver/pooled_qldb_driver.html.

The lambda you're passing is essentially code running in a transaction. So you can execute as many statements as you like. The lambda is invoked with an instance of a Executor. So, you're really just meant to be calling execute_statement on it as many times as you need.

At the end of that lambda, your transaction will be committed. If there is a failure (network, OCC, etc.), the entire lambda will be retried. This means your code must be written idempotently. If you follow this pattern, you will have correct code that makes progress even under failure or high load.

from amazon-qldb-driver-python.

IrvingMg avatar IrvingMg commented on August 20, 2024

@marcbowes Thanks, now I get it!

Since the OCC issue was solved, I'm going to close the issue.

Thanks a lot for your time and explanation!

from amazon-qldb-driver-python.

marcbowes avatar marcbowes commented on August 20, 2024

Thanks a lot for your time and explanation!

My pleasure. Let me know if you still need help with duplicates.

from amazon-qldb-driver-python.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.