
temporal-verifier's People

Contributors

alex-fischman, dranov, edenfrenkel, odedp, oralmer, tchajed, vmwghbot, wilcoxjay


Forkers

dranov

temporal-verifier's Issues

Record timing and solver statistics

We want to track time spent in the solver vs. other parts of the codebase.

The simplest version of this would just track time in the solver, in a global struct (manipulated atomically), and then output that at the end. It would be nice to extend this slightly and also track the distribution of query times, perhaps using an HdrHistogram.
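A minimal sketch of that global-struct approach, assuming for now we only want total wall-clock time spent in the solver (the names SOLVER_NANOS, timed, and report_solver_time are hypothetical):

use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

// Global nanosecond counter for time spent inside the solver.
static SOLVER_NANOS: AtomicU64 = AtomicU64::new(0);

// Wrap a solver call and record its duration atomically.
fn timed<T>(f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = f();
    SOLVER_NANOS.fetch_add(start.elapsed().as_nanos() as u64, Ordering::Relaxed);
    result
}

// Report total solver time at the end of a run.
fn report_solver_time() {
    let nanos = SOLVER_NANOS.load(Ordering::Relaxed);
    println!("solver time: {:.3}s", nanos as f64 / 1e9);
}

Tracking the distribution of query times would replace the single counter with a histogram (e.g. the hdrhistogram crate), recorded at the same point.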

For tracking how we spend time in the Rust code itself, I should figure out how to get this easily from perf record.

Benchmarking infrastructure improvements

As suggested by @edenfrenkel:

Saving outputs: Currently the stdout of each run is discarded, but it might be interesting to look at, and it can be used to extract relevant statistics. Even for verify, the output associated with a failure might be of interest. For this we could have:

  • Some dedicated folder (like we have .flyvy-log for .smt2 files) where the output of benchmark runs is saved to. We can also have a flag that determines whether the output should be saved or discarded.
  • Once the above is done, we might want some control over the level of logging in the output, e.g. using a --log level which would set RUST_LOG=level (or not set any logging if not provided).
  • Finally, there should be a way to access this output from the benchmark code, and to define per-benchmark statistics to present in the printed table. Maybe this should be done by providing a path to the saved file. For example, when I write the benchmarking for qalpha I might want to print the size of the fixpoint found, which I would extract from the output and then add to the table.
    Note that we probably want the saved/processed output to contain both stdout and stderr, because I think the logging is sent to stderr.
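A minimal sketch of the output-saving idea above, assuming we shell out to the flyvy binary via cargo (the .flyvy-benchmark-log directory name and the run_and_save helper are hypothetical, mirroring the existing .flyvy-log convention):

use std::fs;
use std::process::Command;

// Run one benchmark configuration and save its combined output.
fn run_and_save(args: &[&str], name: &str) -> std::io::Result<()> {
    let out = Command::new("cargo")
        .args(["run", "--release", "--"])
        .args(args)
        .env("RUST_LOG", "info") // stand-in for a future --log flag
        .output()?;
    fs::create_dir_all(".flyvy-benchmark-log")?;
    // Keep both streams, since the logging goes to stderr.
    let mut log = out.stdout;
    log.extend_from_slice(&out.stderr);
    fs::write(format!(".flyvy-benchmark-log/{name}.out"), log)
}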

More flexibility in configuring benchmarks: Currently this uses all examples in the examples folder and runs the benchmark with the same arguments for all of them, but we might want to determine a subset of these files to run a certain benchmark on, with file-specific arguments, or even multiple argument configurations per file. I think this can be done in several ways, for example using some config file that lists the benchmark configurations to run on each file, or by specifying inside the file which benchmark configurations should be run on it, similar to the way snapshot tests work. (Not all benchmarks must use these, just those where it's necessary.)
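As one illustration of the config-file route, the benchmark runner could deserialize a list of per-file entries; the shape below is purely hypothetical (and assumes serde with the derive feature):

use serde::Deserialize;

#[derive(Deserialize)]
struct BenchmarkConfig {
    benchmarks: Vec<BenchmarkEntry>,
}

#[derive(Deserialize)]
struct BenchmarkEntry {
    // e.g. "examples/lockserver.fly"
    file: String,
    // one or more argument lists to run on this file
    configs: Vec<Vec<String>>,
}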

Bubble up solver error messages

Currently, when the solver reports an error, we don't seem to get a proper error message back. We should make sure these messages get printed, even in panics, to make debugging easier.

Add scripts to produce performance benchmarks across all examples

We should have a way to run performance benchmarks across all the examples, which encodes the configuration for algorithms like inference and bounded model checking and gathers timing and memory usage.

I envision this being written in Rust, with a small library of functions for running the flyvy binary and gathering statistics. We'll use the same mechanism as time, but perhaps implemented directly in Rust, calling getrusage rather than wrapping the system binary. This will also make it possible to gather memory usage over time by sampling getrusage.

Using getrusage isn't completely trivial because it only gets statistics for the calling process (and children), not for an arbitrary child pid. I may implement this by starting out using time and then replacing it with our own version that reports the results in JSON.
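A minimal sketch of reading usage via the libc crate; note that RUSAGE_CHILDREN only covers children that have already been waited for, which is the limitation mentioned above:

// Returns (user CPU seconds, max RSS) for waited-for children, or None on error.
fn child_rusage() -> Option<(f64, i64)> {
    use std::mem::MaybeUninit;
    let mut usage = MaybeUninit::<libc::rusage>::uninit();
    let rc = unsafe { libc::getrusage(libc::RUSAGE_CHILDREN, usage.as_mut_ptr()) };
    if rc != 0 {
        return None;
    }
    let usage = unsafe { usage.assume_init() };
    let user = usage.ru_utime.tv_sec as f64 + usage.ru_utime.tv_usec as f64 / 1e6;
    Some((user, usage.ru_maxrss as i64)) // ru_maxrss is in kilobytes on Linux
}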

Z3 "incomplete quantifiers" error

On the qalpha branch, z3 raises an error when running
env RUST_LOG=info cargo run -- infer qalpha examples/fol/hybrid_reliable_broadcast_cisa.fly --quantifier "F node 1" --quantifier "E node 1" --qf-body pdnf --cubes 2 --cube-size 1 --non-unit 0 --smt

The error message is thread '<unnamed>' panicked at 'sat solver returned unknown: |(incomplete quantifiers)|', src/inference/basics.rs:160:45, and an example SMT file can be found here.

The same command appears to work when using cvc5, although this takes a long time to run.

Running SAT solvers with indicator variables produces different results

Seen on the branch unsat_core_bug by running cargo run -- updr-verify ./examples/lockserver.fly. (Sorry for the messy execution; I had to set up a lot of context to get this to run.) The added body of code is at the bottom of src/command.rs.

When checking satisfiability of a group of terms by asserting them and then calling check_sat on an empty hashmap, we get the correct result. However, when checking satisfiability using indicator variables, we get a different, incorrect result; the incorrectness shows up in the calls to eval on the results.

In the branch, I took a term of the form exists n1, n2. f(n1) & g(n1) & h(n2) & ... - to add indicators I converted it to the term exists n1, n2. (ind1 = f(n1)) & (ind2 = g(n1)) & (ind3 = h(n2)) & ... and then called check_sat on a hashmap containing {ind: true} for all indicators. I also tried ind1 <-> f(n1) and got the same results.
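For reference, the transformation in question looks like the following sketch over a toy term type (hypothetical; not flyvy's actual AST):

enum Term {
    Var(String),
    Iff(Box<Term>, Box<Term>),
    And(Vec<Term>),
}

// Wrap each conjunct c_i as (ind_i <-> c_i); the caller then asserts the
// result and calls check_sat with {ind_i: true} for every indicator.
fn add_indicators(conjuncts: Vec<Term>) -> (Term, Vec<String>) {
    let mut indicators = Vec::new();
    let wrapped = conjuncts
        .into_iter()
        .enumerate()
        .map(|(i, c)| {
            let name = format!("ind{i}");
            indicators.push(name.clone());
            Term::Iff(Box::new(Term::Var(name)), Box::new(c))
        })
        .collect();
    (Term::And(wrapped), indicators)
}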

Support for more examples

After adding more examples converted from mypyvy (currently on the qalpha branch), I've encountered the following issues:

  • There appears to be no support for if-then-else parsing, even though such terms are implemented. This is relevant for the following examples, which currently cannot be parsed: bosco_3t_safety, hybrid_reliable_broadcast_cisa, raft, ring_id_not_dead, ring_id, ticket, as well as all stoppable_paxos variants.
  • Definitions do not behave correctly, at least when appearing in proofs. Specifically, the defined relations aren't substituted when calling the solver, yielding an unknown constant error when running verify. This can be seen in all the choosable variants of paxos, where derived relations are currently defined via def.

Useless wrapper in verify/error?

In verify/error.rs, there's a struct called SolveError that just wraps a Vec without providing or removing any functionality. Can this wrapper be removed?

BDD checker should detect convergence

Similar to the set checker, it should be possible for the BDD checker to detect when no further depth is necessary. This can be implemented by keeping the BDD from the previous depth and checking whether extending it by one more step changes the BDD.
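A minimal sketch of such a convergence loop, generic over the state-set representation (names hypothetical):

// `step` computes the states reachable in one more transition; `union`
// merges state sets. Stop as soon as one more step adds nothing new.
fn reachable<B: Clone + PartialEq>(
    init: B,
    step: impl Fn(&B) -> B,
    union: impl Fn(&B, &B) -> B,
) -> B {
    let mut current = init;
    loop {
        let next = union(&current, &step(&current));
        if next == current {
            return current; // converged: no further depth is necessary
        }
        current = next;
    }
}

With BDDs the equality test is cheap, since equivalent BDDs have a canonical representation.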

Z3 timeout during invariant inference

On the qalpha branch, Z3 consistently times out during the execution of
cargo run -r -- infer examples/consensus_epr.fly --quantifier "* quorum 1" --quantifier "* node 3" --quantifier "* value 1" --max-clause-len 3 --max-clauses 1 --time --smt
The printed error message is
thread 'main' panicked at 'sat solver returned unknown: timeout', src/inference/basics.rs:129:45

The same execution finishes without errors when using cvc5. (This takes a few minutes to run, though.)

An example of the resulting SMT file can be found at https://github.com/vmware-research/temporal-verifier/files/11517521/consensus_epr.txt .

Fix substitution in the presence of shadowing global names

Consider this program

sort s

mutable x: bool

def y() -> bool {
    x
}

assume forall x:s. y

If we call Module::inline_defs on this module it will produce

sort s

mutable x: bool

assume forall x:s. x

which is incorrect (it's not even well sorted!). It should instead produce

sort s

mutable x: bool

assume forall x0:s. x

which is well sorted and intuitively the right answer. Notice that the bound variable had to be renamed to allow the body to refer to the global x after substitution.


I'm filing this issue just so we don't lose track of it; I'll assign it to myself. I plan to fix it via capture-avoiding substitution, but probably not until next week.
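For concreteness, here is a sketch of capture-avoiding substitution over a toy term type (hypothetical; flyvy's real AST is richer, and a real implementation needs a proper fresh-name generator):

#[derive(Clone, PartialEq)]
enum T {
    Var(String),
    Forall(String, Box<T>),
}

fn free_in(x: &str, t: &T) -> bool {
    match t {
        T::Var(y) => x == y,
        T::Forall(y, body) => x != y && free_in(x, body),
    }
}

// Substitute `repl` for free occurrences of `x` in `t`, renaming binders
// that would capture a free variable of `repl` (the x -> x0 step above).
fn subst(t: &T, x: &str, repl: &T) -> T {
    match t {
        T::Var(y) if y == x => repl.clone(),
        T::Var(_) => t.clone(),
        T::Forall(y, body) => {
            if y == x {
                t.clone() // `x` is shadowed; nothing to substitute below
            } else if free_in(y, repl) {
                let fresh = format!("{y}0"); // assume this doesn't collide
                let renamed = subst(body, y, &T::Var(fresh.clone()));
                T::Forall(fresh, Box::new(subst(&renamed, x, repl)))
            } else {
                T::Forall(y.clone(), Box::new(subst(body, x, repl)))
            }
        }
    }
}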

Inference is broken using cvc5 with --strict-parsing

The inference example from the README currently fails with CVC5 due to the --strict-parsing flag:

echo -e "1\nF node n1 n2\n0\n3" | cargo run -- infer examples/lockserver.fly --solver cvc5 --smt

Then running cvc5 --strict-parsing examples/lockserver.smt2 gives the actual error message:

(error "Parse Error: examples/lockserver.smt2:19.65: Invalid kind 'OR', expected Terms with kind OR must have at least 2 children and at most 67108863 children (the one under construction has 1)")

Add clippy to CI

Should we consider running cargo clippy in CI and making sure it prints no lints?

Improve model interpretation performance

Right now, converting a parsed sexpr to a Model can be very slow (e.g., for the model from #5); because of this, we currently ignore test_issue_5_parse_model_with_auxilliary_defs. When we try invariant inference on more complex examples, such models are expected to become common. We should fix the performance issue, probably by interpreting every symbol (including internal auxiliary symbols) only once and caching the result. Once performance is improved, we should include test_issue_5_parse_model_with_auxilliary_defs in CI and also add performance benchmarks like in #18.
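A minimal memoization sketch of that caching idea (the types and eval_uncached below are hypothetical placeholders, not flyvy's actual model code):

use std::collections::HashMap;

struct Interp<'a> {
    defs: &'a HashMap<String, String>, // symbol -> defining s-expression
    cache: HashMap<String, u64>,       // symbol -> interpreted value
}

impl<'a> Interp<'a> {
    // Interpret each symbol at most once, so shared auxiliary definitions
    // are not re-evaluated on every reference.
    fn eval(&mut self, symbol: &str) -> u64 {
        if let Some(&v) = self.cache.get(symbol) {
            return v;
        }
        let v = self.eval_uncached(symbol);
        self.cache.insert(symbol.to_string(), v);
        v
    }

    fn eval_uncached(&mut self, symbol: &str) -> u64 {
        // Placeholder: a real version would interpret defs[symbol],
        // recursively calling self.eval for nested symbols.
        self.defs.get(symbol).map_or(0, |s| s.len() as u64)
    }
}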

Z3 model parsing bug

Program panics when executing
echo -e "3\nE quorum q\nF node n1 n2 n3\nF value v\n0\n3" | cargo run --release -- infer examples/consensus_epr.fly --solver z3

Error message: thread 'main' panicked at 'could not find relation k!1058', src/fly/syntax.rs:218:32

Running with RUST_BACKTRACE=1 yields the following trace

   0: rust_begin_unwind
             at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/panicking.rs:142:14
   2: temporal_verifier::fly::syntax::Signature::relation_decl
   3: <&temporal_verifier::solver::backends::GenericBackend as temporal_verifier::solver::imp::Backend>::parse
   4: temporal_verifier::solver::imp::Solver<B>::get_fo_model
   5: temporal_verifier::solver::imp::Solver<B>::get_minimal_model
   6: temporal_verifier::inference::basics::FOModule::trans_cex
   7: temporal_verifier::inference::basics::Frame<T>::get_cex_trans
   8: temporal_verifier::inference::fixpoint::run_fixpoint
   9: temporal_verifier::command::App::exec
  10: temporal_verifier::main

Move houdini flag to infer subcommand

We plan to have several invariant inference algorithms. We should design some UI for choosing between these; currently the most likely design is to have the infer command handle all of them, perhaps with another layer of subcommand. As a first step we should move Houdini to the infer command rather than verify.
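As a sketch of the "another layer of subcommand" idea using clap's derive API (the exact command and flag names below are hypothetical):

use clap::{Parser, Subcommand};

#[derive(Parser)]
struct App {
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Verify a file directly.
    Verify { file: String },
    /// Infer invariants using a chosen algorithm.
    Infer {
        #[command(subcommand)]
        algorithm: InferCommand,
    },
}

#[derive(Subcommand)]
enum InferCommand {
    Houdini { file: String },
    Qalpha { file: String },
}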

Missing documentation for inference and temporal-verifier crates

See commit 5cce4ab, which was added so that we could get the multiple crates branch merged as quickly as possible and unblock everyone. However, we do want documentation for all our files.

To resolve this issue, uncomment the lines in that commit, and fix all resulting warnings.

It would be great if the original authors of the files in the inference crate could add documentation for them, so that the docs are as accurate as possible.

Refactor out code to get the inits, trs, safes, and axioms from a module

Currently, there are at least five places in the code where this is done (all three bounded model checkers, verify/module.rs, and inference/basics.rs), and each does it slightly differently. There should be a single function on Module that can be called to get this information.

Additionally, this function can be smarter than the current implementations are, by (for example) recursing into Ands.
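A minimal sketch of that smarter part, over a toy term type (hypothetical; not flyvy's AST): one shared helper that flattens nested conjunctions, so every caller gets the same decomposition.

enum Term {
    And(Vec<Term>),
    Atom(String),
}

// Recurse into Ands so that `p & (q & r)` yields [p, q, r].
fn flatten_ands(t: Term, out: &mut Vec<Term>) {
    match t {
        Term::And(parts) => {
            for p in parts {
                flatten_ands(p, out);
            }
        }
        other => out.push(other),
    }
}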

Proposal for cleaning up Term::Id versus Term::App and the concrete syntax of function calls with zero arguments

The current situation

Declaring functions with zero arguments versus declaring constants

Consider the following two similar global declarations:

mutable f(): bool
# or
mutable f: bool

Currently these two different concrete syntaxes for a function declaration parse to the same declaration AST, namely RelationDecl {mutable: true, name: "f", args: vec![], sort: Sort::Bool}.

Defining a macro with zero arguments

Currently, definitions must use parentheses, even if there are zero arguments.

# this is allowed
def f() -> bool {
    true 
}

# this is a syntax error -- expected '(' after 'f'
def f -> bool {
    true
}

Calling a function of zero arguments versus referring to a constant

There is a similar situation at the term level. Consider these two terms:

f()
# or
f

But this time, the current implementation handles them differently. The first term parses to the AST Term::App("f", vec![]), while the second parses to the AST Term::Id("f"); these two are not equal and indeed are never converted into each other.

The current sort checker forbids the App AST with zero arguments, so while the first term above parses, it does not sort check.

Proposals

Overall, the examples above show that the current situation is messy, both for end users (who may be confused by the multiple syntaxes, some of which are only allowed in some places and not others) and for flyvy developers (who have to deal with multiple AST nodes that should intuitively be equivalent but are not).

I have had a few discussions with all of you about this, and there are a few ideas for how to improve things.

Eliminate redundancy in the AST for terms

I think everyone I've talked to agrees that we should choose just one of Term::App(f, vec![]) and Term::Id to support. Different people prefer different versions of this.

  • Idea A1: Eliminate Term::Id entirely. Parse all constants and variables to Term::App(x, vec![]).
    • Downside: it may feel strange to flyvy developers who are used to a more traditional AST representation of variables that a logical variable is represented as a function call of zero arguments internally.
    • Downside: variables really are different from function calls. For example, you can substitute for a variable but not for a function call. Variables go in and out of scope, but functions cannot be introduced in any local scopes. So some code would need to branch on the length of arguments anyway.
  • Idea A2: Make it an invariant of Term::App that the argument list is non-empty. Parse function calls of zero arguments to Term::Id(f).
    • Downside: if we allow the end user to write f() then it might be surprising to flyvy developers that this does not parse to a function call.
    • Downside: sometimes we want to handle variables just like function calls of zero arguments. Having separate AST nodes makes the code have to repeat itself. For example, if you want to traverse the AST and check whether there are any temporal operators used, then you will treat variables exactly like a zero-argument function call, but this design will force you to distinguish the two cases anyways.
  • Idea A3: Keep both. Use Term::Id only for bound variables (quantified variables and definition arguments) and use Term::App(f, vec![]) for global constants (i.e., functions of zero arguments) (and macros of zero arguments?).
    • Downside: This distinction feels weird.
    • Downside: This distinction cannot be easily implemented by the parser, which lacks any scope information. So we would have to parse to some intermediate representation and then resolve names to get into an AST like this.

Overall I prefer Idea A1. I'm also ok with A2, but I am slightly opposed to A3. What do other people think?
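For reference, Idea A1 amounts to a normalization pass like the following sketch over a toy AST (hypothetical; not flyvy's actual types):

enum Term {
    Id(String),
    App(String, Vec<Term>),
}

// Normalize every Id into a zero-argument App, so later passes
// only ever see one representation.
fn normalize(t: Term) -> Term {
    match t {
        Term::Id(name) => Term::App(name, vec![]),
        Term::App(f, args) => Term::App(f, args.into_iter().map(normalize).collect()),
    }
}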

Eliminate redundancy in the concrete syntax of terms

Should we allow the user to write both f() and f and parse them to the same thing?

  • Idea B1: Allow both and parse them to the same thing
  • Idea B2: Allow only f
  • Idea B3: Allow only f()
    • Downside: end users may feel that it's strange that "there are no global variables, only global functions"

My preference is B1. I like B2 almost as much. I don't like B3 as much, but I'm not opposed to it.

Consider eliminating concrete syntax redundancy in function declarations

Should we allow the user to declare a function of zero arguments with parentheses or not or both?

mutable f(): bool
mutable f: bool
  • Idea C1: Allow both and parse them to the same thing (this is what we do today)
  • Idea C2: Allow only f
  • Idea C3: Allow only f()

My preference is to match our choice at the term level. So if we choose B1, we should also choose C1, etc.

Bring macro definition syntax into agreement with function syntax

Should we allow declaring macros of zero arguments without parentheses?

def f() -> bool {
    true 
}
def f -> bool {
    true
}
  • Idea D1: Allow both and parse them to the same thing
  • Idea D2: Allow only f
  • Idea D3: Allow only f() (this is what we do today)

My preference is again to mirror the previous answer: if we choose C1, then we should choose D1, etc.

Uniformize the return type annotation of definitions

While we're at it, how about replacing -> with : in the definition return type annotation?

def f(): bool {
    true 
}
  • Idea E1: Use -> for return type annotations (this is what we do today)
  • Idea E2: Use : for return type annotations

I prefer E2 because we use : everywhere else.

Replace multi_cartesian_product with a correct version

We currently use itertools::multi_cartesian_product in many places. Unfortunately, it doesn't handle an empty collection of iterators correctly: it produces no results when it should produce a single, empty result; see rust-itertools/itertools#337. In several places we handle this ourselves, but it would be better to just use a correct version everywhere.

We can implement the correct version by wrapping itertools::multi_cartesian_product.
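A minimal sketch of such a wrapper (the function name is hypothetical), special-casing the empty product, whose correct result is a single empty tuple:

use itertools::Itertools;

fn multi_cartesian_product_fixed<T: Clone>(parts: Vec<Vec<T>>) -> Vec<Vec<T>> {
    if parts.is_empty() {
        vec![vec![]] // the product of zero sets has exactly one element
    } else {
        parts.into_iter().multi_cartesian_product().collect()
    }
}

A lazier version would return an iterator instead of collecting, e.g. by boxing the two branches.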

Minimize models without evaluating them

It occurred to me while fixing #26 that one reason evaluation is slow is that the models are large, and this is because they aren't yet minimized. However, minimization only requires knowing the universe cardinalities and not constructing the full Interpretations, and we can extract this very efficiently with some basic sexp parsing.

After minimization, model evaluation should generally be much faster. For example, the slow test we currently have involves a node sort with a universe of 78 elements and evaluates a function of two nodes on roughly 78² ≈ 6,000 inputs; after minimization, that universe could probably be of size 3 or 4.
