typelevel / cats-parse

A parsing library for the cats ecosystem

License: MIT License

Scala 100.00%

cats-parse's People

regadas scala-steward stephenjudkins goseign rossabaker gitter-badger lvitaly oguzhanunlu satabin martijnhoekstra lbqds andimiller flowi lorandszakacs djspiewak optrak vlachjosef ghostdogpr vasilmkd nightscape mpilquist tabdulradi drruisseau wearexteam archive-tsao-chi-forks zsluedem miles-garnsey denisnovac systemfw cmeade sd-yip lenguyenthanh armanbilge ktedon odomontois i10416 dufrannea satorg keynmol masseguillaume yisraelu christiankjaer daddykotex clarzte colin-m-davis xuwei-k m-combinator gregor-i

cats-parse's Issues

1's everywhere are cumbersome

This issue may be a bit premature/pretentious to file, but I figured it would be good to get it out early. The safety that Parser1 and its combinators provide is great. So great that, while trying out the library, I find I use them pretty much all the time. Using a Parser or a method that creates a Parser is the exception.

Has it been considered doing the other way around, renaming Parser to Parser0, and Parser1 to Parser, along with all 1 methods on object Parser? That would make the potentially non-consuming Parsers the exception not only in practice but also in naming.

add a FAQ entry on using fix to parse recursive structures

see:

https://matrix.to/#/!qrcPEbYoUyqhEvxImO:gitter.im/$AoSsgoWnlpnJXPAnAOyk9j49y2CFbAhUB4Ed9PmxCwo?via=gitter.im&via=matrix.org

The main issue is that if you use this pattern:

Defer[P1].fix[Ast] { self =>
  ...

  P.oneOf1(a :: b :: c ...
}

then you have to make sure that all of a, b, c, etc... can make some progress without first using self.

For instance, you need to put all your constants first in that list. Parsing operators is its own item in the FAQ. Generally what I like to do is parse a list of (Operator, Item), so you do: (parseAtom ~ postOp.rep).map { case (a, fs) => fs.foldLeft(a) { case (a0, (op, a1)) => addOp(a0, op, a1) } }

keeping in mind operator precedence, etc... that's a whole other faq...

Provide RFC5234 core rules?

There are many RFCs that reference the core rules of RFC5234. Is there any interest in an object that provides those? http4s has already implemented those imported by RFC7230.

test cats laws

I think we are currently testing all the laws these typeclasses require, but we aren't using the cats laws package.

It might be nice to use those just to be 100% sure everything is fully lawful.

Different way of combining Parser0 and Parser

Hi all!
I'm just trying to port scala-uri from parboiled2 to cats-parse.
One thing I find rather confusing is the with1 method to make a Parser0 behave like a Parser.
I'm wondering if a different way to encode this could work, e.g.

trait LowPriorityImplicits {

  implicit class RichParser0(parser: Parser0) {
    def ~(other: Parser0): Parser0 = ???
  }
}

object Parser0 extends LowPriorityImplicits {
  implicit class EvenRicherParser(parser: Parser0) {
    def ~(other: Parser): Parser = ???
  }
}

val foo0: Parser0 = null
val foo: Parser = null
val concatted: Parser = (foo0 ~ foo)

Consistency of return types

Spun off from an http4s issue.

char returns a Parser1[Unit]. We know what it captured, so we don't return it.
charIn returns a Parser1[Char]. It captured one of a set of characters, and we want to know which.
ignoreChar returns a Parser1[Unit]. Arguably like charIn in that we don't know what it captured, but the docs tell us to call .string if we need the result.
string1 returns a Parser1[Unit]. Seems consistent with char.

Did we get this right, or should everything return what it captured?

oneOf doesn't match although one parser in the list matches

Hi!

I encountered a problem when using oneOf.
To make it easier to communicate I created a test that fails. You can find it here:
https://github.com/FloWi/cats-parse/blob/oneOfError/core/shared/src/test/scala/cats/parse/OneOfTest.scala

When you run the failing test, you see that the last parser in the list succeeds, but the oneOf-parser, that uses those parsers, still fails.
sbt "testOnly *OneOfTest*"
I tried to write a parser that simplifies an expression of a grammar.

    // (a|b)a --> aa|ba
    // a(a|b) --> aa|ab
    //  (a|b) -->  a|b

I'm quite new to parsers and have no idea whether this is a bug or whether I did something wrong. AdventOfCode brought me to this rabbit hole :)

Native Scala YAML parser

Is it possible to work with YAML files using this library? So far all solutions I found for Scala are using Java libs for that.

See scala-native/scala-native#2174

investigate mima failure on dotty

https://github.com/typelevel/cats-parse/runs/1392179131?check_suite_focus=true

looks like the dotty version of 0.1.0 didn't publish... maybe I have to rerun some ci job....

ugh

Add a way to make a fully general parser

users should be able to give us (String, Int) => Either[Error, (Int, A)] for cases where they can't express their parsing in terms of the core combinators.

Then we would check at runtime whether the returned Int is >= the input offset. If it is, we return the result; otherwise we report an InvariantViolationError or something similar for the parser (which I guess would be an epsilon error), or potentially throw an exception, since users should not be able to recover from errors like that...

An alternative is just let all bets be off, and not check that the returned Index makes sense, and just let users live with the consequences.

This is a can of worms, and maybe we should avoid adding such a function.

Add a fluent API for repeating with various options.

There are various ways to repeat stuff. With or without a separator, with a minimum number of repetitions, allowing 0 repetitions or not, gathering into some accumulator. There is also a ticket open to add a maximum: #97

That gives rise to a lot of different combinations for repeating parser constructors, with opportunity for inconsistency and not having some specific combinations of concerns.

What do you think about adding a fluent API starting from rep or rep0, and then adding combinators for min, max, separator and accumulator?

Problem with Maven RC2 release?

I got the following message trying to open Rfc5234 in IntelliJ

Error reading TASTy file: /Users/hjs/Library/Caches/Coursier/v1/https/repo1.maven.org/maven2/org/typelevel/cats-parse_3.0.0-RC2/0.3-3-d801d0a/cats-parse_3.0.0-RC2-0.3-3-d801d0a.jar!/cats/parse/Rfc5234.tasty

I tried other versions of cats-parse for RC2 from maven central with the same result.

As this is a bit bleeding edge it could be IntelliJ or the deployed library. Just thought I'd ping you here.

Parser0[Option[_]] ~ Parser[_] could not get (Option(_), _)

  import cats.parse.{Parser0, Parser => P, Numbers}
  private def name:P[String] = P.charIn(('a' to 'z')).rep.string
  private def alias: P[String] = name <* P.char(':')
  (alias.? ~ name).parse("abc")

The code returns Left(Error(3,NonEmptyList(InRange(3,:,:)))), but it should be Right(_, (None,abc)).
I tried (alias.?.backtrack ~ name) and (alias.?.soft ~ name); neither works.
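
One possible reading of this failure, offered as a sketch and not as the maintainers' answer: alias consumes "abc" before it fails on the missing ':', and a failure after consuming input is not recovered by `?`. Calling `backtrack` on alias itself (inside the `?`, not outside it) turns that failure into a non-consuming one, so the optional part can yield None. The name `aliased` below is introduced only for this illustration.

```scala
import cats.parse.{Parser0, Parser => P}

val name: P[String] = P.charIn('a' to 'z').rep.string
val alias: P[String] = name <* P.char(':')

// backtrack is applied to alias BEFORE making it optional, so a failed
// alias rewinds to where it started and `?` can fall back to None.
val aliased: Parser0[(Option[String], String)] = alias.backtrack.? ~ name

val noAlias = aliased.parseAll("abc")
val withAlias = aliased.parseAll("ab:cd")
```

Here `noAlias` should be `Right((None, "abc"))` and `withAlias` should be `Right((Some("ab"), "cd"))`.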

Using cats-parse with Scala string interpolation

How would I best use cats-parse with Scala's string interpolation feature, where the input I am parsing is not just a plain string but also contains arbitrary interpolated values?

So how would I best write a parser for something like:

json"{ name: $name, id: $id }"

from https://docs.scala-lang.org/overviews/core/string-interpolation.html#advanced-usage

Bug with length restrictions on `rep`?

I am implementing RFC 8941 and noticed testing my code that length restrictions don't quite seem to work as I expected.

val signedDecIntegral: P[String] = 
   (P.char('-').?.with1 ~ digits.rep(1,12)).map { 
    case (min, i) =>
        min.map(_ => "-").getOrElse("")+i.toList.mkString
  }
val decFraction: P[String] = digits.rep(1,3).string
val sfDecimal: P[(String,String)] =
   (signedDecIntegral ~ (P.char('.') *> decFraction)).map { 
      case (dec: String,frac: String) =>
        (dec,frac.toList.mkString) //todo: keep the non-empty list?
     }

It seems like the length restrictions don't cause the expected errors

scala> decFraction.parse("12312323")
val res17: Either[cats.parse.Parser.Error,(String, String)] = Right((,12312323))
scala> sfDecimal.parseAll("12345678901234567890.22222")
val res18: Either[cats.parse.Parser.Error,(String, String)] = Right((12345678901234567890,22222))
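
A possible explanation, assuming `digits` in the snippet above is `cats.parse.Numbers.digits`: that parser already consumes a whole run of digits, so `digits.rep(1, 3)` limits the number of runs, not the number of characters. A minimal sketch that bounds the character count uses the single-character `Numbers.digit` instead:

```scala
import cats.parse.{Numbers, Parser => P}

// One to three individual digit characters.
val decFraction: P[String] = Numbers.digit.rep(1, 3).string

// parse stops after three digits and returns the unconsumed remainder first.
val partial = decFraction.parse("12312323")

// parseAll requires the whole input to be consumed, so eight digits fail.
val strict = decFraction.parseAll("12312323")
```

With this definition `partial` should be `Right(("12323", "123"))` and `strict` should be a `Left`.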

Bug? - surroundedBy with optional whitespace fails to parse

/**
 * Demonstrates a possible bug with cats-parse.
 *
 * Parses a parenthesized word list (foo, bar, x, y),
 * but fails to parse if a space precedes the final ')'.
 *
 * cats-parse version 0.3.2
 * Scala version      3.0.0-RC2
 */
import cats.parse.{Parser => P}

@main def main(): Unit =

  // Specific types of characters.
  val whitespace = P.charIn(" \r\t\n")
  val letter = P.charIn('a' to 'z')
  val comma = P.char(',')
  val lParen = P.char('(')
  val rParen = P.char(')')

  // For testing, a lowercase word.
  val word = letter.rep.string

  // Allow optional spaces around the list characters -  ( , )
  val whitespaces0 = whitespace.rep0.void
  val listStart = lParen.surroundedBy(whitespaces0).void
  val listEnd = rParen.surroundedBy(whitespaces0).void
  val listSeparator = comma.surroundedBy(whitespaces0).void

  // Define a parenthesized list of words ... eg. (foo, bar, x, y)
  val wordList = listStart ~ word.repSep0(listSeparator) ~ listEnd

  // This wordlist parses fine.
  val result1 = wordList.parseAll("(foo, bar, x, y)")
  assert(result1.isRight)

  // PROBLEM: If a space precedes the final ')', then it fails.
  val result2 = wordList.parseAll("(foo, bar, x, y )")
  assert(result2.isRight)
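
A likely cause, stated as an assumption and not as a confirmed diagnosis: after the last word, listSeparator consumes the space and then fails on the missing comma, and a failure after consuming input stops repSep0 with an error instead of ending the list. One common workaround is to let every token consume only the whitespace that follows it. The helper `token` below is a hypothetical name used for this sketch.

```scala
import cats.parse.{Parser0, Parser => P}

val ws0: Parser0[Unit] = P.charIn(' ', '\r', '\t', '\n').rep0.void

// Hypothetical helper: run p, then skip any trailing whitespace.
def token[A](p: P[A]): P[A] = p <* ws0

val word: P[String] = token(P.charIn('a' to 'z').rep.string)

// Leading whitespace is skipped once; after that each token cleans up
// behind itself, so the separator never consumes input before failing.
val wordList: P[List[String]] =
  ws0.with1 *> token(P.char('(')) *> word.repSep0(token(P.char(','))) <* token(P.char(')'))

val tight = wordList.parseAll("(foo, bar, x, y)")
val spaced = wordList.parseAll("(foo, bar, x, y )")
```

Both `tight` and `spaced` should be `Right(List("foo", "bar", "x", "y"))`.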

set up automatic publishing of tags to main

rep with min and max

Related to #52, a question was raised about parsers that repeat between a min and a max number of times. Here's a real-world case from RFC 7231, where qvalue has a precision to the thousandths:

     qvalue = ( "0" [ "." 0*3DIGIT ] )
            / ( "1" [ "." 0*3("0") ] )

It's not real common, but they're out there.

add a benchmark suite to compare against fastparse v1

We should be able to match fastparse v1, matching v2 would be tricky due to their use of macros there.

Why is Parser0#repSep not possible?

Hi all,

I'm trying to port scala-uri to cats-parse.
One difficulty I'm facing is with parsing path parts including empty ones:

a/b/c
/a/b/c
a//c
/a//c

What I would like to do is something like this (simplified)

  def _path_segment: Parser0[String] = Parser.until0(charIn("/?#"))
  def _path: Parser[String] = (Parser.char('/').? ~ _path_segment.repSep0(char('/'))).string

but this doesn't work, as Parser0 doesn't have repSep or repSep0 defined.
I understand that Parser0#rep is problematic, as one could easily run the empty parser infinitely, but in my naive thinking this problem shouldn't exist with repSep, right?

Or is there a nicer way to do this in cats-parse?
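
One way this could be expressed without a repSep on Parser0, as a minimal sketch: the separator always consumes a '/', so "separator followed by a possibly empty segment" is a consuming parser and can be repeated safely. The names `segment` and `path` are made up for this example, and query and fragment handling is left out.

```scala
import cats.parse.{Parser0, Parser => P}

// A possibly empty run of characters up to the next '/', '?' or '#'.
val segment: Parser0[String] = P.until0(P.charIn('/', '?', '#'))

// First segment, then zero or more "/segment" pairs. Each repetition
// consumes at least the '/', so rep0 cannot loop on empty input.
val path: Parser0[List[String]] =
  (segment ~ (P.char('/') *> segment).rep0).map { case (head, tail) => head :: tail }

val relative = path.parseAll("a//c")
val absolute = path.parseAll("/a//c")
```

Here `relative` should be `Right(List("a", "", "c"))` and `absolute` should be `Right(List("", "a", "", "c"))`; a leading empty segment marks a path that starts with '/'.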

publish for Scala 3.0.0-RC3

needed for downstream http4s

Upgrade GitHub Actions

Some of our GitHub Actions are a bit old, and we are getting deprecation warnings. My hypothesis is that at least one of our actions is making a deprecated call.

something broken in the build

both #152 and #153 have failed with the same error.

It looks like something transitively is breaking the use of the name params in the build.sbt.

add a short note about how to do a release

each scala repo has a somewhat bespoke way to do a release.

Since many of us contribute on many repos, it is easy to lose track of how each works. Let's add a short md doc that explains the steps in a list.

There is the #16 plugin that drafts releases, there is the auto publishing, and then there is the question of setting version numbers. Lastly, some repos need the MiMa versions to compare against updated. I'm not 100% sure what we need in this repo.

cc @mpilquist

implement some standard utility parsers

things like BigInt, Int, Long, Float, etc...

Also things like standard whitespace, bracketed lists, etc...

It should be possible to write a JSON parser with a pretty minimal combination of these. This helps people learn patterns but also lets us really optimize some basic blocks that people will almost always need.

Add some common cats methods directly

.as from Functor and maybe .replicateA from Applicative are useful enough that adding those methods on Parser and Parser1 might help discovery.

Since users with IDEs often rely on autocomplete, this can be useful for them.

design for scoping parsers for error reporting

Currently, when a parser fails you get a nonempty list of offsets and failed expectations.

We could also add a "scope" wrapper, like: number.scope("number").orElse(str.scope("string")). How this would work is that the mutable State would have a stack of these scopes, and we would have:

case class ScopeParser[A](parser: Parser[A], scope: String) extends Parser[A]

in parseMut we would push the current scope onto the stack, then parse with parser, then pop it off the stack.

When we error, we take a snapshot of the current scope stack.

So, if we do this, users can have an easier time labeling parts of their parsers and seeing where things went wrong.

Fastparse does something similar.

What do you think of this design @mpilquist @non @rossabaker and really anyone who cares to comment.

set up scalafmt

make the CI fail the build if the code isn't formatted.

add a prePR alias to run formatting and github CI checker

We are going to waste a lot of CI time on failures that users may not know about.

Cats uses a prePR alias to run everything required to pass CI. We can add this note to the readme as well as in the template to make a PR.

No default min value on rep(1)sep

Repsep is 0 or more and rep1sep is 1 or more, but you still have to pass in a minimum, which feels unnatural, especially for repsep where the minimum is implied to be 0 anyway. Default values of 0 or 1, or overloads to the same effect would be useful.

Detect start of line

I'm trying to parse a format where different parts are separated by <start_of_line>#### fragment and so, I would like to be able to detect the <start_of_line>.

IMHO the logic should be similar to P.start | <prev_char = '\n'>.

I'm not sure if that matters, but I'm trying to parse Intellij HTTP client file format with an explicit requirement of supporting ### in the first line, so for example

###
// A basic request
http://example.com/a/

###

// A second request using the GET method
http://example.com:8080/api/html/get?id=123&value=content

Parser.oneOf could use expanded documentation of when order matters

It appears that order matters for Parser.oneOf, in cases where one parser accepts a subset of another parser.

This makes sense, however it might be good to explicitly mention in the docs the implication this has for generating parsers for a set of String values - specifically that they should be reverse sorted according to length, because it's very easy to create inconsistent parsers if the input isn't correctly prepared.

For example, parsing a truthy value for true will consistently work (or fail) depending on the order of the parsers:

import cats.parse.{Parser => P}

val buggy: P[Boolean] =
  P.oneOf(List("1", "t", "tru", "yes", "true").map(P.string(_)))
    .void
    .as[Boolean](true)

val works: P[Boolean] =
  P.oneOf(List("true", "yes", "tru", "t", "1").map(P.string(_)))
    .void
    .as[Boolean](true)

def sort(strings: List[String]): List[String] =
  strings
    .map(str => (str.length, str))
    .sorted
    .reverse
    .map(_._2)

val sorted: P[Boolean] =
  P.oneOf(sort(List("1", "t", "tru", "yes", "true")).map(P.string(_)))
    .void
    .as[Boolean](true)

List("1", "t", "tru", "true", "y")
  .foreach { input =>
    println {
      """|%-6s => %6s => %s
         |%-6s => %6s => %s
         |%-6s => %6s => %s
         |""".stripMargin.format(
           s"<$input>", "buggy", buggy.parseAll(input),
           "", "works", works.parseAll(input),
           "", "sorted", sorted.parseAll(input)
         )
    }
  }

This is particularly troublesome because the error looks like an unexpected end of string, rather than an expected end of string that didn't happen:

Left(Error(1,NonEmptyList(EndOfString(1,3))))
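
As a sketch of an alternative to sorting by hand: `Parser.stringIn` is documented to match the longest of its alternatives regardless of the order in which they are given, so the same set of strings can be passed unsorted. The name `truthy` is introduced here only for illustration.

```scala
import cats.parse.{Parser => P}

// stringIn picks the longest matching alternative, so list order is irrelevant.
val truthy: P[Boolean] =
  P.stringIn(List("1", "t", "tru", "yes", "true")).as(true)

val results = List("1", "t", "tru", "yes", "true").map(truthy.parseAll(_))
```

Every element of `results` should be `Right(true)`, including the entry for "true", which the unsorted oneOf version rejects.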

Add a way to modify error messages

Some of the error messages produce results that are either hard to render, or not particularly clear. This could be improved by providing the ability to replace or map over the error.

For example, if we have this parser (which is equivalent to -\s-):

import cats.parse.{Parser => P}
val parser = P.charWhere(_.isWhitespace).surroundedBy(P.char('-'))

List(
  "- -",
  "-t-",
  "--"
).foreach { input =>
  println("%-10s \t => %s".format(s""""$input"""", parser.parseAll(input)))
}

We get errors that aren't terribly readable:

"- -"      	 => Right( )
"-t-"      	 => Left(Error(1,NonEmptyList(InRange(1,	,
), InRange(1,, ), InRange(1, , ), InRange(1,᠎,᠎), InRange(1, , ), InRange(1, , ), InRange(1, , ), InRange(1, , ), InRange(1,　,　))))
"--"       	 => Left(Error(1,NonEmptyList(InRange(1,	,
), InRange(1,, ), InRange(1, , ), InRange(1,᠎,᠎), InRange(1, , ), InRange(1, , ), InRange(1, , ), InRange(1, , ), InRange(1,　,　))))

It would be handy to do something like fastparse's opaque:

parser.opaque("whitespace")

Or a lower level map over the errors:

parser.leftMap {
  case InRange(index, _, _) => FailWith(index, "whitespace")
  case unexpected => unexpected
}

Which could produce results like this:

"- -"      	 => Right( )
"-t-"      	 => Left(Error(1,NonEmptyList(FailWith(1,whitespace))))
"--"       	 => Left(Error(1,NonEmptyList(FailWith(1,whitespace))))

Failing test cases in main

I'm seeing two tests failing:

cats.parse.ParserTest.voided only changes the result
cats.parse.ParserTest.with1 *> and with1 <* work as expected

The errors have the form:

values are not the same
=> Diff (- obtained, + expected)
           upper = 'ﷷ'
+        ),
+        Fail(
+          offset = 0
         )

To reproduce add the following to ParserTest:

override val scalaCheckInitialSeed = "SDzb3fKPxR67aeO2sgq4BlvTm5NphF9OM4j-dSIS9RD="

I haven't dug into this -- just figured I should report it ASAP.

It's hard to run targeted tests

Because ParserTest contains so many tests, making isolated changes and quickly testing with testOnly or testQuick is slower than you'd ideally want it to be.

Splitting up ParserTest would enable a tighter test loop. Are you OK with that?

Set up codecov.io

set up mdoc

make sure any examples are typechecked.

How to write recursive parsers?

How do I write recursive parsers? The following fails with a StackOverflowError. I assume somewhere I don't have something tail-recursive, but I'm not sure how to write this any differently. (Both op and condexp fail similarly below.)

package foo

import cats.parse.{Parser0, Parser, Numbers}
import cats.syntax.all._
import scala.language.postfixOps

sealed class Expr
case class Lit(x: Int) extends Expr
case class Op(left: Expr, op: String, right: Expr) extends Expr
case class Cond(cond: Expr, tr: Expr, fl: Expr) extends Expr

object testrecurse {
    import Parser._

    def expr: Parser[Expr] = recursive[Expr] { recurse =>
        def subexpr = recurse.between(char('('), char(')'))
        def lit = Numbers.digits.map(_.toInt).map(Lit(_))
//        def condexp = ((recurse <* char('?')) ~ recurse ~ (char(':') *> recurse))
//            .map { case ((cond, tr), fl) => Cond(cond, tr, fl) }           
        def op = (recurse, stringIn(List("+", "-", "*", "/")), recurse)
            .mapN(Op(_, _, _))

        oneOf(subexpr :: op :: lit :: Nil)
    }

    def main(args: Array[String]): Unit = {
        //val expr = "1?(5):2"
        val expr = "(5+3)/2"
        println(testrecurse.expr.parse(expr))
    }
}
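
A plausible cause of the StackOverflowError is left recursion: op starts by calling recurse without consuming any input first, so the parser re-enters itself at the same offset forever. The following sketch restructures the grammar so that every branch consumes input before recursing. It folds operators left to right and ignores precedence, which is a simplification chosen for this example, and it uses a sealed trait in place of the sealed class from the snippet above.

```scala
import cats.parse.{Numbers, Parser}

sealed trait Expr
final case class Lit(x: Int) extends Expr
final case class Op(left: Expr, op: String, right: Expr) extends Expr

val expr: Parser[Expr] = Parser.recursive[Expr] { recurse =>
  // Both atoms consume a character ('(' or a digit) before recursing.
  val subexpr: Parser[Expr] = recurse.between(Parser.char('('), Parser.char(')'))
  val lit: Parser[Expr] = Numbers.digits.map(s => Lit(s.toInt))
  val atom: Parser[Expr] = Parser.oneOf(subexpr :: lit :: Nil)
  val operator: Parser[String] = Parser.stringIn(List("+", "-", "*", "/"))

  // An atom followed by zero or more (operator, atom) pairs, folded leftwards.
  (atom ~ (operator ~ atom).rep0).map { case (first, rest) =>
    rest.foldLeft(first) { case (left, (op, right)) => Op(left, op, right) }
  }
}

val parsed = expr.parseAll("(5+3)/2")
```

Under these assumptions `parsed` should be `Right(Op(Op(Lit(5), "+", Lit(3)), "/", Lit(2)))`.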

Make a real readme

make sure no overrides are using flatMap

https://github.com/typelevel/cats/blob/master/core/src/main/scala/cats/FlatMap.scala

For some reason, FlatMap overrides some functions that Apply implements in terms of product and map, using flatMap instead. Since flatMap is more expensive for a parser, we don't want those overrides.

We should go through all the overrides in FlatMap and if we can implement in terms of product and map do so.

A | B != B | A

maybe, remove this | method. just use orElse

Enable `gh-pages`

Hi @johnynek!

Can you enable gh-pages in the repo settings? The site seems to have been published successfully and I believe that's the only thing missing for this to work https://typelevel.github.io/cats-parse.

Thanks!

CPU time spent in hashcode computation

Hi,
We are currently trying to replace fastparse by cats-parse and ran into the issue that parsing became 3 to 30x slower than it was with fastparse. Using a profiler, I saw that most of the CPU is actually spent on calculating hashcode, triggered by the use of .distinct in oneOf. See the following screenshots:

Do you think this is caused by a wrong use of cats-parse, or is it something that can be improved in cats-parse itself? Parser code can be found here if it helps.

add ability to parse substrings

Currently, we can only parse from entire strings. It would be nice to be able to parse a string at a given offset and only up to a given length.

This would allow you to parse the inside of a string that might be provided by another process without having to copy.

Should be as simple as updating State.

set up github actions CI

add a dotty build

make repo public

set up a documentation site

have a doc site that publishes and looks similar to the cats documentation (logos etc...)

publish 0.3.0

There is conceivably one last item to consider: #128

Any others?

cc @regadas @martijnhoekstra

Is it possible to implement a non-greedy repeat?

I'm trying to implement a parser that repeats parser p1 until the rest of the string matches parser p2.
My current solution is this, but it's not really elegant and needs the input string to work.

  def repeatUntil2ndParserMatches(input: String, p1: P[String], p2: P[String], maxRepetitions: Int = 100): P[String] = {
    import cats.syntax.applicative._
    LazyList
      .range(1, maxRepetitions)
      .flatMap { i =>
        println(s"trying p1 $i times")
        val newP1 = p1.backtrack.replicateA(i)
        val p1Result = newP1.parse(input)
        p1Result match {
          case Left(_) =>
            List(P.fail)

          case Right((rest, _)) =>
            val p2Result = p2.backtrack.parse(rest)
            p2Result match {
              case Left(_) => List.empty
              case Right(_) =>
                List((newP1 ~ p2).map { case (list, res2) => list.appended(res2) }.map(_.mkString))
            }
        }
      }
      .headOption match {
        case Some(value) => value
        case None        => P.fail
    }
  }

Can this be done more elegantly?
I don't know if that is a common use case in the parser world, but I do know that regex groups can be made non-greedy.
I saw #128 where repetition is being discussed - maybe that is something others might find useful.

Maybe it'd be helpful if there was a combinator similar to flatMap that provides the tuple of (remainder, matched).
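
One way to approximate a non-greedy repeat without access to the input string, sketched with the `not` lookahead: before each repetition of p1, check that "p2 followed by end of input" does not match at the current position. The function name `repeatUntil` is hypothetical, and the lookahead is re-run at every step, so this can be quadratic in the worst case.

```scala
import cats.parse.{Parser0, Parser => P}

def repeatUntil(p1: P[String], p2: P[String]): Parser0[(String, String)] = {
  // "The rest of the string matches p2".
  val rest: P[String] = p2 <* P.end
  // not(rest) consumes nothing; p1 runs only while rest does not match here.
  val body: Parser0[String] = (P.not(rest).with1 *> p1).rep0.string
  body ~ rest
}

val anyOne: P[String] = P.anyChar.string
val terminator: P[String] = P.string("END").string

val split = repeatUntil(anyOne, terminator).parseAll("abENDcdEND")
```

Here `split` should be `Right(("abENDcd", "END"))`: the first "END" is skipped because it is not followed by the end of the input.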