
Comments (18)

porsager commented on May 18, 2024

This is great! It should be fairly straightforward to get in, so I'll look at it right after I'm done with benchmarks ;)

porsager commented on May 18, 2024

> First, really nice interfaces on the library all around... I came to mostly say the same regarding async streams... If you do open an object stream interface, that would allow for the for await...of[1] syntax with current node releases, which I find very intuitive to use. Maybe an asReadStream() method?
>
> I also feel that awaiting any promise returned from the per-row function would be very useful, as mentioned in the issue; imho that's the only real shortcoming in this library. It's something I've come across syncing data to DynamoDB and similar services (grouping records), though slightly more complex in terms of sizing.
>
> [1] https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/for-await...of

Thanks a lot @tracker1 .. Did you check out the branch with the first stab at an implementation of cursors? https://github.com/porsager/postgres/tree/cursor#cursor-sql-cursorrows--1-fn---promise

I suppose sql` `.cursor() without a callback function could return an async iterable..

So usage would look like this:

const cursor = sql`select * from generate_series(1, 1000)`.cursor()

for await (const row of cursor) {
  // do something with row
}

Is it something like that you imagined?

porsager commented on May 18, 2024

Thanks a lot for the kind words, and for the help thinking this through. It has been a really nice project to do so far.

I'm leaning towards adding support for async iterables too now, but I realize it isn't identical to the current implementation in that the query might not run immediately; it has to wait until iteration starts. That's currently implicit with the callback implementation. A hack would be to run the first query and keep the rows in memory until iteration starts, but I don't want to do that since it's basically wrong :P

It also has to return an end method on the iterator to support ending early, but that should be fine.
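
Something along these lines could work (just a rough sketch, not the actual implementation; sendQuery, fetchNext and closeCursor stand in for hooks into the internal query flow):

// Rough sketch only: a lazy async iterator for the cursor. The query is not
// sent until the first call to next(), and breaking out of a for await loop
// calls return(), which can close the cursor and release the connection.
// sendQuery, fetchNext and closeCursor are hypothetical internals.
function cursorIterable(sendQuery, fetchNext, closeCursor) {
  let started = false
  return {
    [Symbol.asyncIterator]() {
      return {
        async next() {
          if (!started) {
            started = true
            await sendQuery() // deferred until iteration actually starts
          }
          const row = await fetchNext() // assumed to resolve undefined when exhausted
          return row === undefined
            ? { done: true, value: undefined }
            : { done: false, value: row }
        },
        async return(value) { // invoked on break / early exit
          await closeCursor()
          return { done: true, value }
        }
      }
    }
  }
}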

I think, since adding async iterator support later won't be a breaking change, I might just split this up into two releases.

> My biggest complaints about some packages relate to excessive unnecessary and bloated dependencies consuming memory and depleting performance as well as convoluted interfaces that needlessly require a lot more code to accomplish an otherwise simple task. I see none of that here.

Yeah, that was also one of the reasons I got started with this in the first place, so it's really nice to hear you see it that way! I'm almost done with the benchmarks too, so it's gonna be interesting to see everything side by side.. Currently I've got pg, pg-native, pg-promise, pg-promise-native and slonik. Do you know of others that would be interesting to compare?

I know pg-promise and slonik are using pg under the hood currently, but it was interesting to find that pg-promise is faster than raw pg and slonik is like 50x - 100x slower than mine 🤨

Blitzsturm commented on May 18, 2024

You could also look at using streams in object mode, which afford some cool tricks like piping through a stream transform, and I believe Node's stream objects naturally allow for "for await" loops. A package for mssql can do this, however it doesn't respect high water marks set by users, so even if you pause the incoming stream the stream buffer will just fill up. I've had it consume gigabytes of RAM during some testing with hundreds of thousands of rows when I specified a maximum of 200.

However, if this looks like a fast/easy fix then I'd deem it a much higher priority.

porsager commented on May 18, 2024

I started my reply and realized we're off topic, so I thought it was better in its own issue, which could double as an improvement for the documentation. Let's continue here: #4 😉

porsager commented on May 18, 2024

I've added cursor support with the callback style in 4156c3f.

I'll work on adding for await of support later. The reason I'm not adding it now is that it requires some changes to the inner workings, since it's lazy by default, and I'd like to do it properly, so some more thinking around it is needed.

porsager commented on May 18, 2024

I know how an object stream would work, and there's currently no upside to using that over simply implementing and returning { [Symbol.asyncIterator]: ... }.

The reason I'm not doing it right away is that I would need to defer sending the query until the first call to next. I don't want to hack it in now, because that would mean fetching the first chunk before next is ever called, and worse, it would reserve a connection to the db unnecessarily. Making it work requires a change to the current query flow which I'd like to do properly :)

porsager commented on May 18, 2024

Hi @Blitzsturm ..

I'm not sure I see the benefit of using streams in object mode and for await, but that might come down to a taste/coding preference thing.

I've added support for this by adding a new .cursor() method. I think that is more in line with postgres terminology too, and it won't change / complicate the .stream() surface api.

Would you like to check it out and let me know what you think? You can do npm install porsager/postgres#cursor.

Docs are here: https://github.com/porsager/postgres/tree/cursor#cursor-sql-cursorrows--1-fn---promise

tracker1 commented on May 18, 2024

First, really nice interfaces on the library all around... I came to mostly say the same regarding async streams... If you do open an object stream interface, that would allow for the for await...of[1] syntax with current node releases, which I find very intuitive to use. Maybe an asReadStream() method?

I also feel that awaiting any promise returned from the per-row function would be very useful, as mentioned in the issue; imho that's the only real shortcoming in this library. It's something I've come across syncing data to DynamoDB and similar services (grouping records), though slightly more complex in terms of sizing.

[1] https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/for-await...of

Blitzsturm commented on May 18, 2024

Brilliant, it works as advertised. Though I do have one last suggestion. The test below demonstrates that the cursor flow does throttle the flow and prevents asynchronous row handling from stepping on each other; however, there is a pretty significant performance hit between the two. That would make sense if it's reading one row at a time. To counter this you may want to look at buffering some rows: perhaps the initial request retrieves 1000 rows, then on each await, if the buffer is below 900, get 100 more until the query is out of rows, then let the async handler drain the buffer until it's empty and resolve the overall query-cursor promise. That would likely give it a significant performance jump.
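
Roughly, the buffering I have in mind would look something like this (just a sketch; fetchRows(n) stands in for pulling the next n rows from the server, returning fewer than n only when the query is exhausted):

// Sketch of the buffering strategy described above; fetchRows(n) is a
// hypothetical helper that pulls the next n rows from the server.
async function bufferedCursor(fetchRows, onRow, initial = 1000, low = 900, refill = 100)
{
	var buffer = await fetchRows(initial);
	var exhausted = buffer.length < initial;

	while (buffer.length > 0)
	{
		// Top the buffer back up whenever it runs low, until the query is out of rows.
		if (!exhausted && buffer.length < low)
		{
			var more = await fetchRows(refill);
			exhausted = more.length < refill;
			buffer.push(...more);
		}
		await onRow(buffer.shift());
	}
}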

Ideally, the semantics of how postgres handles the connection and cursor mechanics should be invisible to the end user, so doing something like this would make it look and feel the same but much faster; and while it does indeed work for its intended purpose, poor streaming performance could dissuade power users. pg-cursor allows more control over cursor behavior but can be somewhat cumbersome in code. With that written, I could easily write a wrapper for the "for await" behavior if I really want it, or maybe add it in later if there is demand.

As previously mentioned, implementing a well-tuned object stream would manage this buffering process for you and allow users to "for await" loop to iterate asynchronously directly from it, even allowing power users to adjust the buffering behavior by manually setting watermarks.

Overall, it's looking great. Some fine tuning for performance and I'd call it ready for production use.

Edit: also, the documentation is missing the "async" in the callback: ".cursor((row, cancel)" should be ".cursor(async (row, cancel)".

"use strict";
const postgres = require("postgres");
Main().catch(console.error);

async function Main()
{
	var streamResults = await StreamTest();

	console.log("waiting 10 seconds between connections in tests...");
	await new Promise(r => setTimeout(r, 10000));

	var cursorResults = await CursorTest();

	console.log(`Stream Order (${streamResults.order.length}): ${streamResults.order.join(",")}`);
	console.log(`Cursor Order (${cursorResults.order.length}): ${cursorResults.order.join(",")}`);
	console.log(`Stream Speed: ${streamResults.speed} ms`);
	console.log(`Cursor Speed: ${cursorResults.speed} ms`);
}

async function StreamTest()
{
	const sql = postgres(process.env.DATABASE_URL);

	console.log("================================\r\nStream Ordering");
	var streamOrder = [];
	await sql`SELECT * FROM generate_series(1,20)`.stream(async (row) =>
	{
		await new Promise(r => setTimeout(r, Math.floor(Math.random()*1000)));
		streamOrder.push(row.generate_series);
		console.log(`Stream Row: ${row.generate_series}`);
	});

	await new Promise(r => setTimeout(r, 1000)); // (Ensure all have completed)

	console.log("================================\r\nStream Speed");
	var streamStart = new Date().getTime();
	await sql`SELECT * FROM generate_series(1,1000)`.stream(async (row) =>
	{
		console.log(`Stream Row: ${row.generate_series}`);
	});
	var streamEnd = new Date().getTime();

	await sql.end();

	return {order: streamOrder, speed: streamEnd - streamStart};
}


async function CursorTest()
{
	const sql = postgres(process.env.DATABASE_URL);
	
	console.log("================================\r\nCursor Ordering");
	var cursorOrder = [];
	await sql`SELECT * FROM generate_series(1,20)`.cursor(async (row, cancel) =>
	{
		await new Promise(r => setTimeout(r, Math.floor(Math.random()*1000)));
		cursorOrder.push(row.generate_series);
		console.log(`Cursor Row: ${row.generate_series}`);
	});

	await new Promise(r => setTimeout(r, 1000)); // (Ensure all have completed)

	console.log("================================\r\nCursor Speed");
	var cursorStart = new Date().getTime();
	await sql`SELECT * FROM generate_series(1,1000)`.cursor(async (row, cancel) =>
	{
		console.log(`Cursor Row: ${row.generate_series}`);
	});
	var cursorEnd = new Date().getTime();

	await sql.end();

	return {order: cursorOrder, speed: cursorEnd - cursorStart};
}

porsager commented on May 18, 2024

@Blitzsturm sorry afk until later, but did you see the option of returning more than one row per iteration?

What if there was also an option to control the number of rows to fetch next by returning a special token with a rows value from the callback? Eg.

sql` `.cursor(() => sql.ROWS(75))

Also see my iterable suggestion above.

Blitzsturm commented on May 18, 2024

@porsager I apologize, I somehow completely missed that. Using that makes it MUCH faster, though slightly less clean in use. Ideally I'd like it if it would still iterate only one row at a time but pull blocks of a specified amount, something like the following under the hood. However, its current state is still cleaner than pg-cursor. So, well done.

await sql`SELECT * FROM generate_series(1,1000)`.cursor(100, async (rows, cancel) =>
{
	for (let row of rows)
	{
		console.log(`Cursor Row: ${row.generate_series}`);
	}
});

Alternatively, I suppose if you need to send blocks of x rows to an external service this is indeed more ideal. My original use case would be better off, as it wouldn't have to group rows itself.

await sql`SELECT * FROM generate_series(1,1000)`.cursor(100, async (rows, cancel) =>
{
	await transformAndSend(rows);
});

Perhaps this could be handled as mentioned to @tracker1; I can't imagine a cleaner, faster syntax for it.

for await (let row of sql`SELECT * FROM generate_series(1,1000)`.cursor(100))
{
	console.log(`Cursor Row: ${row.generate_series}`);
}

porsager commented on May 18, 2024

Ah good to hear :)

I'm not sure the expanded API surface of opening up for singular iteration while fetching large blocks is worth it compared to simply handling it in userland. It's not that big of an addition for the user, and it is clearer to see what's actually going on. There are also concerns like cancellation that might be non-obvious (eg. if you cancel halfway through a group of rows (iterated behind the scenes per row), it's not obvious the last half was still fetched from the db).

I'm still thinking the async iterable could be ok, but I don't see what the benefit is over the callback.

await sql`
  SELECT * FROM generate_series(1,1000)
`.cursor(async row => {
  // await do something with row
})

vs

for await (let row of sql`
  SELECT * FROM generate_series(1,1000)
`.cursor()) {
  // await do something with row
}

Blitzsturm commented on May 18, 2024

From a use perspective, not a ton of difference, other than it's "the new hotness" in JavaScript and considered a standard means of asynchronous iteration. Worst case scenario, 5 years from now someone looks at it and says "why'd they do it that way? I mean it works, but it's non-standard", in the same way many legacy libraries don't make use of async/await.

It's really a matter of individual preference; but if it's easy to build in I don't see a reason to not sprinkle in some syntax sugar. If it was somehow monstrously difficult I think it can be skipped.

Ultimately, in its current state with cursor support, it's an outstanding package that I prefer over pg+pg-cursor; it should satisfy any major need I can foresee, and I can't complain about anything outside of minor nitpicks. But if you find the time to throw in an extra feature I'd be slightly happier.

Take that for what it's worth; but overall I think you've done a great job. I'll likely be using it and will let you know if I encounter any bugs or performance issues in the future.

My biggest complaints about some packages relate to excessive unnecessary and bloated dependencies consuming memory and depleting performance as well as convoluted interfaces that needlessly require a lot more code to accomplish an otherwise simple task. I see none of that here.

Blitzsturm commented on May 18, 2024

The only other popular database interface I can think of is knex, which just sits on top of pg, but it does have some nice quality-of-life features. On that topic, I'd love to see the inclusion of "insert" and "upsert" methods that handle common CRUD operations with ease.

I'd imagine something like this:

var rows =
[
    {name: "John Doe", email: "[email protected]", age: 46},
    {name: "Jane Doe", email: "[email protected]", age: 44},
    {name: "Herp Derp", email: "[email protected]", age: 23}
];

var res = await sql.insert("users", ["name", "email", "age"], rows);
/*
    INSERT INTO users
        (name, email, age)
    VALUES
        ('John Doe', '[email protected]', 46),
        ('Jane Doe', '[email protected]', 44),
        ('Herp Derp', '[email protected]', 23)
    ON CONFLICT DO NOTHING;
*/

var res = await sql.upsert("users", ["name", "email", "age"], ["email"], rows);
/*
    INSERT INTO users
        (name, email, age)
    VALUES
        ('John Doe', '[email protected]', 46),
        ('Jane Doe', '[email protected]', 44),
        ('Herp Derp', '[email protected]', 23)
    ON CONFLICT (email) DO UPDATE SET
        name = EXCLUDED.name,
        email = EXCLUDED.email,
        age = EXCLUDED.age
*/

In addition, it would be nice to make column names optional, so that when omitted they are retrieved from the data provided, either by just checking the first row or something more akin to:

function UniqueKeys(rows)
{
	var K = {};
	rows.forEach(r=>Object.keys(r).forEach(c=>K[c]=null));
	return Object.keys(K);
	// this could probably be done better with a set or indexOf()
}

(Where columns missing from a row are set to null.)
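
Or, done with a Set instead (just a sketch of the same helper):

function UniqueKeys(rows)
{
	// Collect every distinct column name that appears in any row.
	var keys = new Set();
	rows.forEach(r => Object.keys(r).forEach(c => keys.add(c)));
	return [...keys];
}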

This would allow for extremely rapid/easy data updates. Imagine a use case where you have a table of user addresses and you want to update them to store latitude and longitude using Google Maps' geocoding API. The code to do so would look something like this:

await sql`
    SELECT
        id,
        address,
        lat,
        lng
    FROM
        users
    WHERE
        address IS NOT NULL
        AND lat IS NULL
        AND lng IS NULL
`.cursor(100, async (rows, cancel) =>
{
    for (let row of rows)
    {
        var geo = await googleMapsGeoCode(row.address);
        row.lat = geo.lat;
        row.lng = geo.lng;
    }

    var upsRes = await sql.upsert("users", ["id"], rows);
    if (!upsRes.success) throw new Error(upsRes.errorMessage); 
    // do thrown exceptions automatically call cancel?
});

Inserts/updates are especially common yet cumbersome, so something like this baked in would be super nice. I was planning on doing something similar in my own code; it makes sense to be able to put data back in the same way you get it out.

porsager commented on May 18, 2024

Did you check out the helpers section of the docs? 😊 https://github.com/porsager/postgres

Your insert example would look like this ;)

await sql`insert into users ${ sql(rows, 'name', 'email', 'age') }`

It's a bit harder for the update case. I wouldn't go all the way and make it as short as yours, as I feel that is going too far. Currently I don't want to include something that "builds" sql with special keywords the user doesn't know about. (Eg. in this case I think it's important users understand the implications of on conflict.)

Maybe there would be a way to do something like this (note: this doesn't work currently):

await sql`
  insert into users ${ sql(rows, 'name', 'email', 'age') }
  on conflict(email) do update set
  ${ sql({ excluded: ['name', 'email', 'age'] }) }
`

Blitzsturm commented on May 18, 2024

The helpers go a long way but still require writing out some extra code manually to get what you want. What I was building was something that works a bit like knex but allows for some more niche behavior. Virtually every database flavor has a different way to "upsert", which I often have need of. So, I was working on a means to make it as easy as providing json rows to a handler.

I suppose if I want to be clever about it I could just make a function that effectively does this

await sql`
  insert into users ${sql.apply(sql, [rows].concat(Object.keys(rows[0])))}
  on conflict(email) do update set
  ${Object.keys(rows[0]).map(injectionProtection).map(o => `${o} = EXCLUDED.${o}`).join(', ')}
`; // Not so sure this last bit would work...

Maybe have some form of "on conflict" helper based on field names?
query text ${sql.conflictIgnore}
query text ${sql.conflictUpdate}

Is there a way to quickly insert programmatically generated components such as dynamic field name selections into a query or would that just be using "unsafe"?

Blitzsturm commented on May 18, 2024

Thanks! This change makes reading (very) large data sets much more scalable. An object stream would handle buffering automatically to optimize speed while still working with "for await" syntax. Also, I can provide an example of how to implement it in something of a wrapper if that would help.

For now I'll see if I can use it in some personal/work projects and raise a ticket if anything goes wrong.
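
For reference, here's roughly what I mean by a wrapper (untested sketch; it assumes the .cursor() callback only fetches the next row once the promise it returns resolves, as discussed above):

const { Readable } = require("stream");

// Rough sketch of a userland wrapper: exposes the callback-style cursor as an
// object-mode Readable so it can be piped or consumed with for await, with the
// buffer size controlled by highWaterMark.
function CursorStream(query, highWaterMark = 200)
{
	var resume = null;
	var stream = new Readable({
		objectMode: true,
		highWaterMark,
		read()
		{
			// Downstream wants more rows; let the paused cursor callback continue.
			if (resume) { var r = resume; resume = null; r(); }
		}
	});

	query.cursor(async (row) =>
	{
		// push() returns false when the internal buffer is full, so hold the
		// callback's promise until the stream asks for more rows.
		if (!stream.push(row))
		{
			await new Promise(r => resume = r);
		}
	}).then(() => stream.push(null), err => stream.destroy(err));

	return stream;
}

// Usage:
// for await (const row of CursorStream(sql`SELECT * FROM generate_series(1,1000)`))
// {
// 	console.log(row.generate_series);
// }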
