It's probably too late to change the API now, so this is more of a question about why

<a class="user-mention notranslate" data-hovercard-type="user" data-hover

Commit <a class="commit-link" data-hovercard-type="commit" data-hovercard-url="https:/

Commit <a href="https://github.com/mholt/certmagic/commit/fe722057f2654b3

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

<a class="user-mention notranslate" data-hovercard-type="user" data-hover

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Simpler Storage API about certmagic HOT 13 CLOSED

caddyserver commented on May 19, 2024

Simpler Storage API

from certmagic.

Comments (13)

DisposaBoy commented on May 19, 2024 1

This is a good question, but I am still convinced the two interfaces are jointly required for a workable Storage implementation.

It's not the Storage methods that need to be made safe for concurrent use -- they already need to be, of course -- it's that the ACME operations like the obtaining of a certificate need to be synchronized. This means not only one operation at a time, but one in total. TryLock is the easiest way I could think of that allows other processes to wait while another finishes obtaining the certificate it needs, then simply loading that certificate without obtaining it again. This idempotency is crucial for running correctly in a cluster, and it's why other libraries such as autocert don't do well sharing storage imementations even though it has a Cache interface.

Look closer at CertMagic's code and you'll see that it doesn't always wait -- it only waits if the lock is already held, and then it does some other flow when the waiting is over, whereas with a regular mutex you just wait and then continue, but here we have to alter the program flow.

Going by the 2 uses of TryLock that I saw (Obtain, Renew): IIUC, the flow is TryLock then Wait if it returns a Waiter.

To me this the same as a Lock except the Locker implementation needs to figure out if the lock is held, then figure out how to wait for it to be released without blocking the caller.

Without the need to be asynchronous the implementation can just use a sync.Mutex or database transaction since they already implement this behavior correctly.

e.g. Renew using Lock would be:

lock()

if we still need to renew:
    renew()

unlock()

If I understand the code correctly, it currently does:

trylock()

if someone else has the lock:
    wait()
    # assume they succeeded
else:
    renew()
    unlock()

Again, it's entirely possible that I just don't understand the code but it seems to me that if 2 processes happen to try to renew at the same time but don't block each other, i.e. p1 calls Unlock before p2 calls TryLock then p2 will go on to try and renew as well.

For database backed storage you can use a transaction to "obtain" a lock or, if already obtained, to know that and simply return a waiter instead.

How do you implement said waiter?

I hope that makes sense. They used to be separate but I really could not justify totally separating the two because it was not safe to use one but not the other.

To be clear: I originally thought an external lock might not be needed and it could be simpler to remove Locker, but the main source of complexity is TryLock + Wait.

FileStorage does need to handle crashes better, and it's on my list to enable expiring the locks, but it's also safe to assume that in a clean shutdown scenario, Unlock() will be called, cleaning up the resources. (Also a WIP, I haven't hit either of these edge cases in the years I've been using it though.)

Did you not hit not hit them? or are you just not aware that you hit them? :p

from certmagic.

DisposaBoy commented on May 19, 2024 1

@DisposaBoy Thanks for your comments.

Also thanks for GoSublime! blush

No problem 😄

How do you implement said waiter?

For FileStorage, we do a simple poll every so often.

That's a common solution, but it's the one I was afraid of because you run the risk of polling too quickly, thus overloading the system or not quickly enough, thus introducing more latency than necessary.

I don't think this works. If the process crashed, there'd be no chance to clean up the locks at shutdown, otherwise it could just unlock them.

It does work shutdowns that can cleanup (e.g. SIGTERM), but crashes have to be handled by ignoring/removing stale locks, which that commit also does.

Reading the code: stale is defined as being 2 hours old which means it only works if you try the lock after 2 hours. If you try before this time, it will block forever (or until whatever lock timeout might be in place).

The code below demonstrates the issue for me. I ran it with ctrl.,ctrl+r which is roughly equivalent to sigint existing-instance; go run....

package main

import (
	"log"
	"time"

	"github.com/mholt/certmagic"
)

func main() {
	fs := certmagic.FileStorage{}
	k := "hello/world"
	log.Println("TryLock")
	w, err := fs.TryLock(k)
	if err != nil {
		log.Fatalln("TryLock:", err)
	}
	defer fs.Unlock(k)
	if w != nil {
		log.Println("waiting...")
		w.Wait()
	}
	log.Println("sleeping...")
	// give us some time to kill the process
	time.Sleep(10 * time.Second)
}

from certmagic.

DisposaBoy commented on May 19, 2024 1

I'll see if I can convert it tomorrow or Friday.

from certmagic.

mholt commented on May 19, 2024

This is a good question, but I am still convinced the two interfaces are jointly required for a workable Storage implementation.

It's not the Storage methods that need to be made safe for concurrent use -- they already need to be, of course -- it's that the ACME operations like the obtaining of a certificate need to be synchronized. This means not only one operation at a time, but one in total. TryLock is the easiest way I could think of that allows other processes to wait while another finishes obtaining the certificate it needs, then simply loading that certificate without obtaining it again. This idempotency is crucial for running correctly in a cluster, and it's why other libraries such as autocert don't do well sharing storage imementations even though it has a Cache interface.

Look closer at CertMagic's code and you'll see that it doesn't always wait -- it only waits if the lock is already held, and then it does some other flow when the waiting is over, whereas with a regular mutex you just wait and then continue, but here we have to alter the program flow.

For database backed storage you can use a transaction to "obtain" a lock or, if already obtained, to know that and simply return a waiter instead.

I hope that makes sense. They used to be separate but I really could not justify totally separating the two because it was not safe to use one but not the other.

FileStorage does need to handle crashes better, and it's on my list to enable expiring the locks, but it's also safe to assume that in a clean shutdown scenario, Unlock() will be called, cleaning up the resources. (Also a WIP, I haven't hit either of these edge cases in the years I've been using it though.)

from certmagic.

mholt commented on May 19, 2024

Commit fe72205 handles stale locks in FileStorage, and adds a method to the Locker interface to help callers remove all locks before the process exits. (Sorry, it's the opposite of what you wanted -- a simple interface -- but I think it is necessary, since you brought up the idea of cleanup.)

Let me know what you think!

from certmagic.

DisposaBoy commented on May 19, 2024

Commit fe72205 handles stale locks in FileStorage, and adds a method to the Locker interface to help callers remove all locks before the process exits. (Sorry, it's the opposite of what you wanted -- a simple interface -- but I think it is necessary, since you brought up the idea of cleanup.)

Let me know what you think!

I don't think this works. If the process crashed, there'd be no chance to clean up the locks at shutdown, otherwise it could just unlock them.

It's the same problem you have with pid files, disk-based locks, (non-abstract) domain sockets, etc.: you have to take care to clean up at program start.

from certmagic.

mholt commented on May 19, 2024

@DisposaBoy Thanks for your comments.

Also thanks for GoSublime! 😊

e.g. Renew using Lock would be:

lock()

if we still need to renew:
renew()

unlock()

This is probably a better pattern. I'll push a branch soon that changes this, and I would like you to try it if that's okay with you!

How do you implement said waiter?

For FileStorage, we do a simple poll every so often.

To be clear: I originally thought an external lock might not be needed and it could be simpler to remove Locker, but the main source of complexity is TryLock + Wait.

I understand now -- I bet we can remove both.

I don't think this works. If the process crashed, there'd be no chance to clean up the locks at shutdown, otherwise it could just unlock them.

It does work shutdowns that can cleanup (e.g. SIGTERM), but crashes have to be handled by ignoring/removing stale locks, which that commit also does.

from certmagic.

mholt commented on May 19, 2024

@DisposaBoy Please see and review PR #7 when you have a chance. Hopefully soon. :) I'd like to get this pinned down.

from certmagic.

DisposaBoy commented on May 19, 2024

@DisposaBoy Please see and review PR #7 when you have a chance. Hopefully soon. :) I'd like to get this pinned down.

It's later here, so I'll add it to my list of things to do tomorrow.

from certmagic.

mholt commented on May 19, 2024

@DisposaBoy

That's a common solution, but it's the one I was afraid of because you run the risk of polling too quickly, thus overloading the system or not quickly enough, thus introducing more latency than necessary.

What better solution would you propose?

Thanks for your reproduce case, I'd be interested what you think of the new design in #7 after you wake up 👍

from certmagic.

mholt commented on May 19, 2024

Also:

Reading the code: stale is defined as being 2 hours old which means it only works if you try the lock after 2 hours. If you try before this time, it will block forever (or until whatever lock timeout might be in place).

New in #7, the Wait() will only block until the lock is stale or released, whichever comes first. No more arbitrary timeout.

from certmagic.

DisposaBoy commented on May 19, 2024

@DisposaBoy

That's a common solution, but it's the one I was afraid of because you run the risk of polling too quickly, thus overloading the system or not quickly enough, thus introducing more latency than necessary.

What better solution would you propose?

I don't know of a better solution than to simply delegate the locking to someone else who know how to do it better, where possible i.e. by using *Flock(), db transactions, mutexes, etc.

from certmagic.

mholt commented on May 19, 2024

Woohoo! Thanks for your critiques!

At some point, can we get your test program added to the repo as a test case?

from certmagic.

Simpler Storage API about certmagic HOT 13 CLOSED

Comments (13)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent