haskell / attoparsec

A fast Haskell library for parsing ByteStrings

Home Page: http://hackage.haskell.org/package/attoparsec

License: Other

Haskell 99.95% Makefile 0.05%

attoparsec's Issues

Function request: findAll

Is it possible to add a function that finds all matches inside a string, like:

import Data.Either (rights)
import Data.Attoparsec.Text (Parser, parseOnly)
import Data.Text (pack)

findAll :: Parser a -> String -> [a]
findAll parser = rights . map (parseOnly parser . pack) . oneLess where
  oneLess [] = []
  oneLess (whole@(_:xs)) = whole : oneLess xs
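
The suffix walk done by oneLess is what Data.Text.tails already provides, so the same idea can be sketched directly on Text. This is only an illustration of the request, not a proposed library API; the name findAllText is made up here.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text (Parser, decimal, parseOnly)
import Data.Either (rights)
import Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical helper: run the parser at every suffix of the input and
-- keep the successes. Matches may overlap, and the cost is quadratic in
-- the worst case.
findAllText :: Parser a -> Text -> [a]
findAllText parser = rights . map (parseOnly parser) . T.tails

example :: [Int]
example = findAllText decimal "a12b3"   -- [12,2,3]
```

Note that the suffix "2b3" also matches, which is why 2 appears in the result; a non-overlapping variant would need to track how much input each match consumed.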

Generalise scan to scanEither

The scan parser has type

scan :: s -> (s -> Word8 -> Maybe s) -> Parser ByteString

Sometimes it can be useful to know the state machine's final state, but scan doesn't expose this. I would like to see a generalisation of scan to a function

scanEither :: s -> (s -> Word8 -> Either s r) -> Parser (ByteString, Either s r)

which returns the state machine's final state. Then scan could be implemented in terms of scanEither as

scan s0 p = fst <$> scanEither s0 p'
  where
    p' s w = case p s w of
      Just s' -> Left s'
      Nothing -> Right ()
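
For comparison, later attoparsec releases export runScanner from Data.Attoparsec.ByteString, which returns the last state alongside the consumed bytes. It is not the Either-based scanEither proposed above (the step function still returns Maybe), but it covers the "final state" part of the request. A small sketch, with digitsWithCount as a made-up name:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.ByteString (Parser, parseOnly, runScanner)
import Data.ByteString (ByteString)
import Data.Word (Word8)

-- Scan a run of ASCII digits, using the scanner state to count them.
digitsWithCount :: Parser (ByteString, Int)
digitsWithCount = runScanner 0 step
  where
    step :: Int -> Word8 -> Maybe Int
    step n w
      | w >= 48 && w <= 57 = Just (n + 1)  -- '0'..'9'
      | otherwise          = Nothing       -- stop scanning

example :: Either String (ByteString, Int)
example = parseOnly digitsWithCount "123abc"   -- Right ("123",3)
```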

Initially feeding a parser with empty string results in non-obvious parse

While writing a parser Fold for @Gabriel439's foldl package I stumbled upon this interesting inconsistency (full example here),

A.eitherResult
$ flip A.feed ""
$ flip A.feed "123"
$ A.parse A.double "1.3"
== 1.3123

A.eitherResult
$ flip A.feed ""
$ flip A.feed "123"
$ flip A.feed "1.3"
$ A.parse A.double ""
== 1.3

Given that empty strings are a bit special, as they indicate the end of the stream, it's not entirely clear that this is a bug, but it is quite surprising. Why does the initial empty string affect the resulting parse? Is this considered buggy behavior?

Potential memory regression with 7.8

Hi,

I believe I'm seeing a regression with GHC 7.8 as compared to 7.6; it's documented here with an example file.

Could you check if you see the same? Thanks!

Extend Attoparsec with ByteString functions equivalent to .Text

Currently there are only a few functions in Data.Attoparsec.ByteString, making it a far inferior choice to Data.Attoparsec.Text. Unfortunately, in projects where a dependency on Text is not acceptable, this means that we're stuck with a very limited number of possibilities.

This issue asks for an extension of the ByteString part of the parser so that it matches its Text cousin more closely.

Negative length

I accidentally tried to "take" a negative number of bytes. Surprisingly, it worked:

let Right bs = A.parseOnly (A.take (-1)) B.empty in B.length bs

returns -1, which kinda makes sense (in a weird way). Trying to print the result itself, however, doesn't work that well:

A.parseOnly (A.take (-1)) B.empty

and ghci crashes. I'd suggest checking for negative length in "take".

Lazy parser fails

Hi Bryan,

I've written a small attoparsec parser that when run on a strict text succeeds but when run on a lazy text errors.

The program parses a textual Hoogle database as generated by cabal haddock --hoogle. I attached a modified hoogle DB of the base library to the gist.

Running the program using a strict text succeeds:

$ ./attoparsec-hoogle ./base.txt strict
OK

Running the program using a lazy text errors:

$ ./attoparsec-hoogle ./base.txt lazy
Error:
"-- | aaaaaaaaaa"

Now when I remove one of the a's in the input file the lazy parser succeeds. So I guess this has something to do with the chunking of a lazy text.

Backtracking failure

I encountered a backtracking error in one of my programs and narrowed the error down to the following example:

$ ghci
GHCi, version 7.8.3: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
> :set -XOverloadedStrings
> import Data.Attoparsec.Text.Lazy
> import Control.Applicative
> parseTest ((skipSpace *> char ' ') <* endOfInput) " "
Loading package array-0.5.0.0 ... linking ... done.
Loading package deepseq-1.3.0.2 ... linking ... done.
Loading package bytestring-0.10.4.0 ... linking ... done.
Loading package text-1.1.0.0 ... linking ... done.
Loading package containers-0.5.5.1 ... linking ... done.
Loading package attoparsec-0.10.4.0 ... linking ... done.
Fail "" ["' '","demandInput"] "not enough input"

`isEndOfLine` only works on `Word8`s

It seems odd that Data.Attoparsec.Char8's isEndOfLine function only works on Word8s. I thought it would be consistent with the other functions in the module and have two forms, one for Chars and one for Word8s, like:

isEndOfLine_w8 :: Word8 -> Bool
isEndOfLine    :: Char -> Bool

It doesn't have to have both, but it seems like the Char8 module should have at least the latter one so that it is compatible with the functions in the same module.

My main use case for it is just skipping over an entire line using:

skipLine = skipWhile (not . isEndOfLine) >> endOfLine

... but in order to do that I have to hide the skipWhile from Data.Attoparsec.Char8 and import the one from Data.Attoparsec instead, like so:

import Data.Attoparsec (skipWhile)
import Data.Attoparsec.Char8 hiding (skipWhile)
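
One way to avoid the import juggling is to spell the predicate out on Char. The sketch below assumes the current module name Data.Attoparsec.ByteString.Char8 (the successor of Data.Attoparsec.Char8); isEndOfLineChar is a hypothetical helper, not a library function.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.ByteString.Char8
  (Parser, endOfLine, parseOnly, skipWhile, takeByteString)
import Data.ByteString (ByteString)

-- Char-based counterpart of isEndOfLine (hypothetical name).
isEndOfLineChar :: Char -> Bool
isEndOfLineChar c = c == '\n' || c == '\r'

-- Skip everything up to and including the next line terminator.
skipLine :: Parser ()
skipLine = skipWhile (not . isEndOfLineChar) *> endOfLine

example :: Either String ByteString
example = parseOnly (skipLine *> takeByteString) "first\r\nsecond"
-- Right "second"
```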

Need INLINEABLE for hexadecimal, decimal, etc

In order to specialize for custom Integral & Bits types, the attoparsec parse functions need an inlinable pragma. I assume specialization is worthwhile seeing as every built-in type has a SPECIALISE pragma.

stringCI broken in 0.12.1.5 ?

Small test:

Prelude Data.Attoparsec.ByteString.Char8> parse (stringCI "FooBar") "foobar"
Done "" "foobar"
Prelude Data.Attoparsec.ByteString.Char8> parse (stringCI "FooBar") "FooBar"
Fail "FooBar" [] "string"
Prelude Data.Attoparsec.ByteString.Char8> parse (stringCI "foobar") "FooBar"
Fail "FooBar" [] "string"
Prelude Data.Attoparsec.ByteString.Char8> parse (stringCI "foobar") "foobar"
Done "" "foobar"

Feels like stringCI "internally lowers" the case of pattern string (ie its argument) but "does not" lower case of input before matching.

Reproduced this on ghc-7.10.1 with attoparsec-0.12.1.5 and attoparsec-0.12.1.4. Previously this worked as expected (all 4 of above examples result in Done _ _) with attoparsec-0.11.x (tested with attoparsec-0.11.3.4) and attoparsec-0.12.1.3. Could it be that the problem was introduced by commit dcc5e1f ?

Incorrect input state with partial parsing and backtracking

Using partial parsing with backtracking may lead to incorrect input state. I've put a test case here.

We have a parser myParser (which internally uses some nested backtracking) and we feed it input: byte by byte and whole input at once. And we've got these results:

sample = "hello "

partial = foldl feed (parse myParser "") $ map B.singleton sample

full = parse myParser $ B.pack sample

*Atto> partial
Done "ello " ()
*Atto> full
Done "hello " ()

So, the same parser on the same input yields different results depending on the way you feed it.

Tested with attoparsec-0.9.1.2 and attoparsec-10.0.1.1.

parseOnly docs neglect to mention that it succeeds even with leftover input

I got bitten by this in cassava. Might be worth adding a sentence to the docs.
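
A short illustration of the behaviour, and of the usual way to reject leftover input by sequencing endOfInput after the parser (shown here with the Text API; the names lenient and strict are only for the example):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text (decimal, endOfInput, parseOnly)

-- parseOnly succeeds and silently drops the trailing "abc".
lenient :: Either String Int
lenient = parseOnly decimal "123abc"

-- Requiring endOfInput turns the leftover input into a parse failure.
strict :: Either String Int
strict = parseOnly (decimal <* endOfInput) "123abc"
```

Here lenient is Right 123, while strict is a Left value.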

takeTill acting weird

parser :: Parser Text
parser = takeTill (== 'a')

main :: IO ()
main = parseTest parser "𝟘a" >>= print

The code should result in Done "a" "\120792", a clean cut.
But I get Done "\57304a" "\120792"

With the predicate negated, takeWhile also presents the same issue.

The issue can be reproduced with this gist
I'm using attoparsec-0.12.1.2 with text-1.2.0.0

Thanks!

Wrong "takeTill" exported with Data.Attoparsec.Text.Lazy

If you import Data.Attoparsec.Text.Lazy only, and use the takeTill function, it creates a parser for strict Text.... This has caused confusion and delay :) (see http://stackoverflow.com/questions/20460771/compile-error-with-attoparsec-text-lazy and the comments to the question)

Are we missing something?

Parse doubles like ".1"

I am writing an SVG parser for diagrams: diagrams-input. Some images have attributes like opacity=".1" instead of "0.1". It would be nice if the double parser accepted this.
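
Until something like this is supported, a user-side workaround can be sketched as below. It is only one possible approach, under the assumption that double rejects input starting with '.'; the name lenientDouble is made up, and the sketch does not handle a sign or an exponent on the leading-dot form.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import Data.Attoparsec.Text
  (Parser, char, double, endOfInput, parseOnly, takeWhile1)
import Data.Char (isDigit)

-- Try the stock parser first; otherwise accept ".digits" by
-- re-parsing it as "0.digits".
lenientDouble :: Parser Double
lenientDouble = double <|> leadingDot
  where
    leadingDot = do
      _      <- char '.'
      digits <- takeWhile1 isDigit
      either fail pure (parseOnly (double <* endOfInput) ("0." <> digits))

example :: Either String Double
example = parseOnly lenientDouble ".1"   -- Right 0.1
```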

the result type in Text.Lazy refers to ByteString inputs rather than Lazy Text inputs

https://github.com/bos/attoparsec/blob/master/Data/Attoparsec/Text/Lazy.hs#L47-L56

The documentation for the Fail and Done cases says that the 'ByteString' is the input, where it should refer to lazy Text.

Large space usage increase between 0.10.3.0 and 0.10.4.0

There's a large increase in space usage between 0.10.3.0 and 0.10.4.0. The increase can most easily be seen using tibbe/cassava@76451b4 or later and this test program. Earlier versions of cassava show the same problem, but it's easier to see using the current master. You'll need the n32_results.txt input file.

To compile and run the test do:

ghc -O2 Test.hs -rtsopts
./Test n32_results.txt +RTS -h -i0.01

(Make sure cassava is compiled against either 0.10.3.0 or 0.10.4.0.)

Using 0.10.3.0 I get this heap profile:

And with 0.10.4.0 I get this:

I suspect c707514 might be to blame.

New derived combinator

I found the optional combinator in Parsec tremendously useful, so I thought why not add it to attoparsec? Its name should make it obvious what it's used for. Here's the implementation I use currently:

optional :: Parser a -> Parser (Maybe a)
optional = option Nothing . fmap Just . try

Would this be useful to have in the library or should it be defined by the user?
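
For reference, Control.Applicative in base already provides optional :: Alternative f => f a -> f (Maybe a), and attoparsec's Parser is an Alternative, so the combinator can be used without any library change. A minimal sketch (signedInt is just an example parser):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (optional)
import Data.Attoparsec.Text (Parser, char, decimal, parseOnly)

-- An optional minus sign followed by digits.
signedInt :: Parser (Maybe Char, Int)
signedInt = (,) <$> optional (char '-') <*> decimal

withSign, withoutSign :: Either String (Maybe Char, Int)
withSign    = parseOnly signedInt "-5"   -- Right (Just '-',5)
withoutSign = parseOnly signedInt "5"    -- Right (Nothing,5)
```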

Bug in stringCI from Data.Attoparsec.Text?

Text.toCaseFold "daß" gives "dass", so should then parseOnly (stringCI "daß") "dass" give Right "dass"?
Currently it does not, because stringTransform only considers strings of the same length. Do we consider this a bug?

Performance regression in 7.8

{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative
import qualified Data.Attoparsec.Text as A
import Data.Text (Text)
import qualified Data.Text as T

testParser :: Text -> Either String Int
testParser f = fmap length
    . A.parseOnly (many (A.char 'b' <|> A.anyChar))
    $ f

main :: IO ()
main = print . testParser $ T.replicate 50000 "a"

Compiled using -O2 -threaded:

On GHC 7.6.3:

real    0m0.062s
user    0m0.022s
sys     0m0.007s

On GHC 7.8 tip:

real    0m12.700s
user    0m12.504s
sys     0m0.165s

Data.Attoparsec.Combinators.many1 and Text

I'm not used to posting issues on bugtrackers, so bear with me, and I'm not entirely sure this belongs here.

Data.Attoparsec.Combinators.many1 used with a Parser Char results in the type Parser Char -> Parser [Char], naturally. However, when using Text, there is obviously no implicit conversion. Am I supposed to use fromString directly that OverloadedStrings uses under the hood, or am I missing a combinator somewhere?

I've looked through the documentation, and I can't find anything that would work. Would a specialized combinator many1 :: Parser Char -> Parser Text just use fromString?
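
Two common ways to get a Text result are sketched below: packing the list produced by many1, or skipping the list entirely with takeWhile1, which returns a slice of the input. The parser names are only for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text (Parser, letter, many1, parseOnly, takeWhile1)
import Data.Char (isAlpha)
import Data.Text (Text)
import qualified Data.Text as T

-- Build a [Char] with many1, then convert it with T.pack.
wordViaMany1 :: Parser Text
wordViaMany1 = T.pack <$> many1 letter

-- Take the matching prefix of the input directly as Text.
wordViaTakeWhile1 :: Parser Text
wordViaTakeWhile1 = takeWhile1 isAlpha
```

Both parsers yield Right "abc" on the input "abc1"; the takeWhile1 version avoids the intermediate list.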

Incorrect isHexDigit function

In this code: http://hackage.haskell.org/package/attoparsec-0.9.0.0/docs/src/Data-Attoparsec-Char8.html#hexadecimal

The isHexDigit function is given as:

isHexDigit w = (w >= 48 && w <= 57) ||
               (w >= 97 && w <= 102) ||
               (w >= 65 && w <= 90)

I believe the (w >= 65 && w <= 90) should be (w >= 65 && w <= 70) to only cover A-F.

no space or skipSpace in the Text merge

I am trying to upgrade away from attoparsec-text. I needed to use many from Control.Applicative, but I am also missing space and skipSpace. Were these removed intentionally or accidentally?

Report position information upon failure

To simplify the development/debugging of attoparsec parsers it would help if parse failures that bubble up to the top level were annotated with their position in the input stream. This blog post gives a few concrete examples where this would help and claims that

adding position information to parse failures is not possible without a severe performance loss.

However, I don't think that reporting the position of a top-level failure is that expensive to compute generically for any attoparsec parser. My reasoning is the following.

Every parser delivers the remainder of the input upon failure. Hence, if you keep track of the number of bytes/chars fed to an attoparsec parser, then you can compute the actual position of an error using that information. The effort of managing this information is proportional to the number of chunk boundaries, which should be rather few. Hence, it should be feasible to compute precise error positions of top-level errors for any attoparsec parser.

I created this issue to ensure that this idea is not forgotten. I currently do not have time to implement it by myself, but perhaps somebody else does.

Note that if one is interested in line-column information for text files, then one can compute that information from the character-position in the input stream. Depending on the memory requirements one either pays for this computation (one additional pass through each chunk of input) at every chunk boundary (if one updates the actual position for every chunk) or in case of a top-level failure only (if one stores all input chunks).
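
The idea can be sketched for the simplest case, a single strict chunk: the failure offset is the input length minus the length of the remainder reported by Fail. This is a user-side sketch, not library code; parseWithOffset is a hypothetical name, and a multi-chunk version would have to add up the lengths of all chunks fed so far.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.ByteString (IResult (..), Parser, feed, parse, string)
import Data.ByteString (ByteString)
import qualified Data.ByteString as B

-- Run a parser on one chunk and report the byte offset of a failure.
parseWithOffset :: Parser a -> ByteString -> Either (Int, String) a
parseWithOffset p input =
  case feed (parse p input) B.empty of      -- empty chunk signals end of input
    Done _ a        -> Right a
    Fail rest _ msg -> Left (B.length input - B.length rest, msg)
    Partial _       -> Left (B.length input, "parser still wants input")

example :: Either (Int, String) ByteString
example = parseWithOffset (string "ab" *> string "cd") "abXX"
-- fails with offset 2, where "cd" was expected
```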

How to get (slightly) better errors?

I know that if I want "really good" errors I should switch to parsec, but right now I have the issue where, say:

parser = (char '=' >> option Nothing (Just <$> piece)) <?> "Path"
piece = takeWhile1 (\x -> x /= '/' && x /= '*' && not (isSpace x))

parseOnly (parser <* endOfLine <* endOfInput)

when fed with, say, just an "x" gives an error message of "satisfy". Would be nice if I could at least say something like "Syntax error in Path".

Document the space usage

attoparsec holds on to all input until we're done parsing, using O(input) memory. It would probably be worth mentioning under "Performance considerations" in Data.Attoparsec.ByteString.

Improve performance of numeric parsers

Hi Bryan,

I improved the performance of the numeric parsers (especially for numbers with exponents, e.g. 123.456e3).

The following is the benchmark report of the extended benchmarks. (The prime versions are using the new scientific parser)

The code can be found in the scientific branch of my fork of attoparsec.

Explanation of the changes:

I introduced a numeric type (in Data.Attoparsec.Scientific) which represents numbers using their scientific notation:

data Scientific = Scientific
                  { coefficient ::                !Integer
                  , exponent    :: {-# UNPACK #-} !Int
                  }

I introduced a scientifically parser which parses scientific numbers.

{-# INLINE scientifically #-}
scientifically :: (Scientific -> a) -> Parser a
scientifically h = do
  !positive <- ((== '+') <$> I.satisfy (\c -> c == '-' || c == '+')) <|>
               pure True

  let step a c = a * 10 + fromIntegral (ord c - 48)

  n <- T.foldl' step 0 <$> I.takeWhile isDigit

  let f fracDigits = Scientific (T.foldl' step n fracDigits)
                                (negate $ T.length fracDigits)

  Scientific coeff expnt <-
      (I.satisfy (=='.') *> (f <$> I.takeWhile isDigit)) <|>
      pure (Scientific n 0)

  let !signedCoeff | positive  =        coeff
                   | otherwise = negate coeff

  (I.satisfy (\c -> c == 'e' || c == 'E') *>
      fmap (h . Scientific signedCoeff . (expnt +)) (signed decimal)) <|>
    return (h $  Scientific signedCoeff   expnt)

I rewrote the numeric parsers in terms of scientifically:

scientific :: Parser Scientific
scientific = scientifically id

rational' :: Fractional a => Parser a
rational' = scientifically realToFrac

double' :: Parser Double
double' = scientifically realToFrac

number' :: Parser Number
number' = scientifically $ \s@(Scientific c e) ->
    if e >= 0
    then I (c * 10 ^ e)
    else D (realToFrac s)

Note that these parsers could be expressed using a normal fmap like:

double' = realToFrac <$> scientific

However, GHC optimizes the former better.

Note that I haven't created a pull request yet because I obviously want to replace the normal numeric parsers with their faster scientific-based alternatives.

Let me know what you think and if I should turn this into a clean pull request.

Bas

Excessive inlining leads to GHC consuming GBs of RAM while compiling

@kolmodin reports:

Try to add a few more "AB.anyWord8" to word32LE in benchmarks/Benchmarks.hs and GHC will consume gigabytes when it's slowly slowly compiling. Seems it's too aggressive inlining.

I suspect that we might need to roll back the call to the magic inline function I added.

Text's peekChar doesn't always play nice with parseOnly

While I love that you added peekChar to attoparsec (thanks, BTW), I am just letting you know that there are some conditions in which it doesn't play well with parseOnly.

Prelude> :m +Data.Attoparsec.Text
Prelude Data.Attoparsec.Text> :m +Data.Text
Prelude Data.Attoparsec.Text Data.Text> parseOnly peekChar (pack "")
*** Exception: parseOnly: impossible error! <---- HERE
Prelude Data.Attoparsec.Text Data.Text> parse peekChar (pack "")
Partial _
Data.Attoparsec.Text Data.Text> feed (parse peekChar (pack "")) Data.Text.empty
Done "" Nothing

I traced the problem to the definition of peekChar, line 417 of Data.Attoparsec.Text.Internal.hs

peekChar :: Parser (Maybe Char)
peekChar = T.Parser $ \i0 a0 m0 _kf ks ->
         let ks' i a m = let w = unsafeHead (unI i)
                         in w `seq` ks i a m (Just w)
             kf' i a m = ks i a m Nothing
         in if T.null (unI i0)           -- <----- HERE
            then prompt i0 a0 m0 kf' ks'
            else ks' i0 a0 m0

This test succeeds, so prompt is called, but prompt yields a Partial continuation, which parseOnly treats as an impossible error.

I don't consider this a bug, more a technical decision (how peekChar, partial continuations and runParser interact), but I guess it should be added to the documentation of the function to avoid surprises.

Pardon my English. Thanks in advance; I am new to GitHub (and git) but will try to document the behavior in the next few days.

GHC HEAD compile error

[12 of 21] Compiling Data.Attoparsec.ByteString.Internal ( Data/Attoparsec/ByteString/Internal.hs, dist/build/Data/Attoparsec/ByteString/Internal.o )

Data/Attoparsec/ByteString/Internal.hs:519:7:
    Illegal equational constraint a_ataf ~ (ByteString, t)
    (Use GADTs or TypeFamilies to permit this)
    In the context: (a_ataf ~ (ByteString, t))
    While checking the inferred type for ‘succ'’
    In the expression:
      let
        succ' t' pos' more' a
          = succ t' pos' more' (substring pos (pos' - pos) t', a)
      in runParser p t pos more lose succ'
    In the second argument of ‘($)’, namely
      ‘\ t pos more lose succ
         -> let succ' t' pos' more' a = ...
            in runParser p t pos more lose succ'’
Failed to install attoparsec-0.12.0.0

This is with today's GHC-7.9 from HEAD.

Speed up range checking

isDigit, digit, etc., are less smart than isDigit_w8. Specifically, they check ranges using code like

'0' <= c && c <= '9'

This involves two comparisons and two branches. It's generally faster to use these:

-- Requires, but does not check, that the third argument is at least
-- as large as the first.
betweenI :: Integral n => n -> n -> n -> Bool
betweenI a b c = (fromIntegral (b - a) :: Word) <= fromIntegral (c - a)
{-# INLINE betweenI #-}

betweenE :: Enum n => n -> n -> n -> Bool
betweenE a b c = betweenI (fromEnum a) (fromEnum b) (fromEnum c)
{-# INLINE betweenE #-}

The c-a will be calculated at compile time in typical uses, so this requires just one subtraction, one comparison, and one branch. The improvement is greatest when values are mostly not in the specified range (as branch prediction may be substantially improved), but even when branches are well-predicted, this way is a bit better.

wowo w wo = w *> wo <|> wo

Careless use could be expensive, but nevertheless I found this combinator useful. Should I submit a patch to Data.Attoparsec.Combinator?

wowo :: Alternative f => f a -> f b -> f b
wowo w wo = w *> wo <|> wo

Otherwise I end up having to name the wo parser, which can be a bit ugly indentation-wise inside a do-block. Also you can use NondecreasingIndentation after wowo (…) $ do…

Not happy about the name though. I thought of calling it bono, but I can't live with (or without) having named it bono. :-/

Getting the current position explicitly?

So I found myself pulling in Data.Attoparsec.Internal.Types and writing:

getPos :: Parser Int
getPos = AT.Parser $ \t pos more _ succ' -> succ' t pos more (AT.fromPos pos)

and it didn't seem like terribly unreasonable code. (My use case is that I'm scanning git packfiles and I want to capture the file offset for each object I encounter). Is this actually a terrible idea that can never work, or is it something you'd consider adding to the public attoparsec interface so I can feel slightly better about myself for pulling an Internal module?

I'm not trying to resurrect #16 or #19 (or anything else that's about automatically doing anything in particular on failure). Apologies if I missed the point somewhere and this actually is a duplicate. Thanks!

Monadic parseOnly

Often it's necessary to run a parser inside another parser. It's usually done with something like this:

captureAndParse = do
    raw <- capture
    case parseOnly parse raw of
        Left err -> fail err
        Right res -> return res

It would be nice to have a function parseOnlyM which runs a parser and calls fail on Left result. It can be implemented in this way:

parseOnlyM :: Monad m => Parser a -> Text -> m a
parseOnlyM parser input = either fail return $ parseOnly parser input

This would simplify the first example to the following:

captureAndParse = capture >>= parseOnlyM parse

What do you think about such an addition? I would be happy to make a PR.
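
A self-contained version of the idea, adapted for compilers where fail lives in MonadFail rather than Monad (so the constraint differs from the signature above). The capture parser and the bracket syntax are made up for the example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text
  (Parser, char, decimal, endOfInput, parseOnly, takeWhile1)
import Data.Text (Text)

-- Run a parser on a complete input, reporting failure through MonadFail.
parseOnlyM :: MonadFail m => Parser a -> Text -> m a
parseOnlyM parser input = either fail pure (parseOnly parser input)

-- Hypothetical outer parser: grab the text between square brackets.
capture :: Parser Text
capture = char '[' *> takeWhile1 (/= ']') <* char ']'

-- Parse the captured text with an inner parser.
captureAndParse :: Parser Int
captureAndParse = capture >>= parseOnlyM (decimal <* endOfInput)
```

With these definitions, parseOnly captureAndParse "[42]" gives Right 42, and because the result monad is polymorphic, parseOnlyM decimal "x" :: Maybe Int gives Nothing.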

documentation for Data.Attoparsec.Text.rational is trimmed on hackage

Part of documentation on hackage is trimmed
http://hackage.haskell.org/packages/archive/attoparsec/latest/doc/html/Data-Attoparsec-Text.html#v:rational
http://hackage.haskell.org/packages/archive/attoparsec/latest/doc/html/Data-Attoparsec-ByteString-Char8.html#v:rational

Perhaps because of the empty line.

The important part being the statement that it doesn't parse NaN and Infinity.

-- | Parse a rational number.
--
-- This parser accepts an optional leading sign character, followed by
-- at least one decimal digit.  The syntax similar to that accepted by
-- the 'read' function, with the exception that a trailing @\'.\'@ or
-- @\'e\'@ /not/ followed by a number is not consumed.
--
-- Examples with behaviour identical to 'read', if you feed an empty
-- continuation to the first result:
--
-- >rational "3"     == Done 3.0 ""
-- >rational "3.1"   == Done 3.1 ""
-- >rational "3e4"   == Done 30000.0 ""
-- >rational "3.1e4" == Done 31000.0, ""

-- Examples with behaviour identical to 'read':                                <<-- this is trimmed
--
-- >rational ".3"    == Fail "input does not start with a digit"
-- >rational "e3"    == Fail "input does not start with a digit"
--
-- Examples of differences from 'read':
--
-- >rational "3.foo" == Done 3.0 ".foo"
-- >rational "3e"    == Done 3.0 "e"
--
-- This function does not accept string representations of \"NaN\" or
-- \"Infinity\".

Position information

Would it be possible for the library to report, for each parsed entity, at which position it was found in the input stream ?

It would be helpful to get positions in terms of lines and columns but perhaps also in terms of offset from the beginning of the input stream.

The report is a followup to a discussion with @snoyberg

Types in Data.Attoparsec.Char8 shouldn't be Word8, should they?

Data.Attoparsec.Char8 generally has predicates of the type (Char -> Bool), for example isSpace, and functions like takeTill and takeWhile take such predicates, having type (Char -> Bool) -> Parser ByteString.

Shouldn't the isEndOfLine and isHorizontalSpace predicate functions similarly be specialized to (Char -> Bool)? Right now they are (Word8 -> Bool) which make them difficult to use with takeTill.

AfC

Odd behaviour of many'

TL;DR
I had a parser that was failing in odd ways and fixed it by replacing

 P.many' pSomething

with

P.option [] (P.many1 pSomething)

where pSomething is a non-trivial parser.

Q1: Why does the second version work correctly when the first doesn't?
Q2: Is there any performance penalty for implementing it the second way?

Long version

I had a parser (used in a Conduit via Data.Conduit.Attoparsec) that, somewhere around the time 0.12 was released, started intermittently throwing a ParseError (defined in D.C.Attoparsec). However, since it was intermittent and didn't always happen in the same place on the same input (which was ok for my application), I wasn't in a rush to debug it.

When I finally decided to invest some time in debugging this, it was really hard going. First I managed to make it trigger more frequently by adding a conduit chunker before the conduitParser. Normally ByteString data flows through conduits in chunks of about 30. My chunker allowed me to make these chunks much smaller, which increased the frequency of the parse error. Once that was happening I spent considerable time digging around in both D.C.Attoparsec and attoparsec itself. I got nowhere.

After reading the attoparsec documentation for about the 10th time I decided to remove all the attoparsec combinators that had warnings about them. I removed peekWord8 and was more careful about takeWhile etc. In the end, it was removing many' that fixed it.

Add skipTill?

I found the following gadget useful in many of my scripts. It's the "opposite" of manyTill.

skipTill :: Parser a -> Parser b -> Parser b
skipTill junk end = end <|> (junk *> skipTill junk end)

I am wondering if people find this useful too. If so, I'd like to propose adding this one (or an optimized one) to the library.
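
A small usage sketch of the combinator as defined above, using the Text API: skip arbitrary characters until a marker parses. The marker "END" and the input are made up for the example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import Data.Attoparsec.Text (Parser, anyChar, parseOnly, string)
import Data.Text (Text)

-- Skip junk until end succeeds, returning end's result.
skipTill :: Parser a -> Parser b -> Parser b
skipTill junk end = end <|> (junk *> skipTill junk end)

example :: Either String Text
example = parseOnly (skipTill anyChar (string "END")) "noise noise END"
-- Right "END"
```

Since end is retried at every position, this unoptimized form can be slow when end is expensive and the junk is long.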

parseWith does one redundant request

TLDR version (small test-case for attoparsec):

{-# LANGUAGE OverloadedStrings #-}

import Prelude hiding (take)
import           Data.Attoparsec.ByteString
import           Data.Attoparsec.ByteString.Char8 (signed, decimal)
import qualified Data.ByteString as B
import Data.IORef
import Control.Applicative

data D = S B.ByteString
  deriving Show

parser :: Parser D
parser = do
  len <- signed decimal
  S <$> (take len)

-- | Takes a string and produces it little by little
ioProducerSplitted :: B.ByteString -> Int -> IO (IO B.ByteString)
ioProducerSplitted s n = do
  state <- newIORef s
  return $ do
    putStrLn "more data was requested"
    val <- readIORef state
    let (rv, leftOver) = B.splitAt n val
    writeIORef state leftOver
    putStrLn $ "returning " ++ show rv
    return rv

main :: IO ()
main = do
  prod <- ioProducerSplitted "9foobarbaz" 3
  res <- parseWith prod parser ""
  putStrLn $ "res: " ++ show res

Output for 0.11.3.4:

more data was requested
returning "9fo"
more data was requested
returning "oba"
more data was requested
returning "rba"
more data was requested
returning "z"
res: Done "" S "foobarbaz"

Output for 0.12.1.0:

more data was requested
returning "9fo"
more data was requested
returning "oba"
more data was requested
returning "rba"
more data was requested
returning "z"
more data was requested
returning ""
res: Fail "" [] "not enough input"

More info on bug

There's this crazy bug I'm getting: informatikr/hedis#15

It turns out (if I'm not mistaken) that because hedis reads responses from redis like this: https://github.com/informatikr/hedis/blob/master/src/Database/Redis/ProtocolPipelining.hs#L116 it relies on attoparsec calling the socket-reading function exactly as many times as needed. Here's a piece of code (it should be compiled inside the hedis package so that its hidden modules are visible):

{-# LANGUAGE OverloadedStrings #-}

module Main where

import           Data.Attoparsec.ByteString
import Database.Redis.Protocol (reply)
import Database.Redis
import System.Random
import qualified Data.ByteString.Char8 as BC8
import           Network
import qualified Data.ByteString as B
import GHC.IO.Handle (hSetBinaryMode, hFlush)
-- import Network.Socket.Types (PortNumber(..))
import Data.IORef

debug s = putStrLn $ "MAIN>> " ++ s

-- | Takes a string and produces it little by little
ioProducerSplitted :: B.ByteString -> Int -> IO (IO B.ByteString)
ioProducerSplitted s n = do
  state <- newIORef s
  return $ do
    debug "more data was requested"
    val <- readIORef state
    let (rv, leftOver) = B.splitAt n val
    writeIORef state leftOver
    debug $ "returning " ++ show rv
    return rv

main :: IO ()
main = do
  prod <- ioProducerSplitted
             "*2\r\n$47\r\nvalue-value-value-value-value-56.27533380617696\r\n$47\r\nvalue-value-value-value-value-64.54343491843917\r\n"
            64
  res <- parseWith prod reply ""
  debug $ "res: " ++ show res

For attoparsec version 0.11.3.4 it produces this output:

MAIN>> more data was requested
MAIN>> returning "*2\r\n$47\r\nvalue-value-value-value-value-56.27533380617696\r\n$47\r\nv"
MAIN>> more data was requested
MAIN>> returning "alue-value-value-value-value-64.54343491843917\r\n"
MAIN>> res: Done "" MultiBulk (Just [Bulk (Just "value-value-value-value-value-56.27533380617696"),Bulk (Just "value-value-value-value-value-64.54343491843917")])

While for 0.12.1.0 it produces:

MAIN>> more data was requested
MAIN>> returning "*2\r\n$47\r\nvalue-value-value-value-value-56.27533380617696\r\n$47\r\nv"
MAIN>> more data was requested
MAIN>> returning "alue-value-value-value-value-64.54343491843917\r\n"
MAIN>> more data was requested
MAIN>> returning ""
MAIN>> res: Fail "alue-value-value-value-value-64.54343491843917\r\n" [] "Failed reading: empty"

Why does lengthAtLeast check for half of the length of a Text?

I'm trying to understand the code of Attoparsec.Text (unfortunately non-exported functions are not nearly as well commented as the exported ones) and came across this:

lengthAtLeast :: T.Text -> Int -> Bool
lengthAtLeast t@(T.Text _ _ len) n = (len `div` 2) >= n || T.length t >= n

Why the div by 2? Is this some internal knowledge about how Text works? (If yes, it might be that this doesn't work so well any more in case that changes.)

Also, if that len is always positive (my guess), might not a quot be faster? (There are two more occurrences of this, in the FastSet modules; it could apply there as well.)

Thanks!

Some support for Safe Haskell.

This is not really an issue; I'm just wondering. As far as I can see, attoparsec provides a completely safe interface, and although it uses unsafe internal modules of ByteString and Text, I can't imagine how this package could be used in an unsafe way. So the question arises: is it reasonable to mark every exposed module as safe? If I'm right, it would only require enabling the Trustworthy extension in every exposed module.

Tests fail to compile with GHC 7.10.1-rc1

[ 1 of 25] Compiling QC.Rechunked     ( tests/QC/Rechunked.hs, dist/build/tests/tests-tmp/QC/Rechunked.dyn_o )

tests/QC/Rechunked.hs:52:9:
    Non type-variable argument in the constraint: G.Vector v Int
    (Use FlexibleContexts to permit this)
    When checking that ‘swapAll’ has the inferred type
      swapAll :: forall (m :: * -> *) (t :: * -> *) (v :: * -> *).
                 (Foldable t, primitive-0.5.4.0:Control.Monad.Primitive.PrimMonad m,
                  G.Vector v Int, G.Mutable v ~ V.MVector) =>
                 t (Int, Int) -> m (v Int)
    In an equation for ‘swapIndices’:
        swapIndices n0
          = do { swaps <- forM [0 .. n] $ \ i -> ((,) i) `fmap` choose ...;
                 return (runST (swapAll swaps)) }
          where
              n = n0 - 1
              swapAll ijs
                = do { mv <- G.unsafeThaw (G.enumFromTo 0 n :: V.Vector Int);
                       .... }
    In an equation for ‘fisherYates’:
        fisherYates xs
          = (V.toList . V.backpermute v) `fmap` swapIndices (G.length v)
          where
              v = V.fromList xs
              swapIndices n0
                = do { swaps <- forM ... $ ...;
                       .... }
                where
                    n = n0 - 1
                    swapAll ijs = ...

tests/QC/Rechunked.hs:53:46:
    Couldn't match expected type ‘Int’ with actual type ‘b1’
      ‘b1’ is untouchable
        inside the constraints (Foldable t,
                                primitive-0.5.4.0:Control.Monad.Primitive.PrimMonad m,
                                G.Vector v Int,
                                G.Mutable v ~ V.MVector)
        bound by the inferred type of
                 swapAll :: (Foldable t,
                             primitive-0.5.4.0:Control.Monad.Primitive.PrimMonad m,
                             G.Vector v Int, G.Mutable v ~ V.MVector) =>
                            t (Int, Int) -> m (v Int)
        at tests/QC/Rechunked.hs:(52,9)-(55,27)
      ‘b1’ is a rigid type variable bound by
           the inferred type of
           swapIndices :: (Enum b1, Num b1,
                           random-1.1:System.Random.Random b1) =>
                          b1 -> Gen b
           at tests/QC/Rechunked.hs:47:5
    Possible fix: add a type signature for ‘swapIndices’
    Relevant bindings include
      n :: b1 (bound at tests/QC/Rechunked.hs:51:9)
      n0 :: b1 (bound at tests/QC/Rechunked.hs:47:17)
      swapIndices :: b1 -> Gen b (bound at tests/QC/Rechunked.hs:47:5)
    In the second argument of ‘G.enumFromTo’, namely ‘n’
    In the first argument of ‘G.unsafeThaw’, namely
      ‘(G.enumFromTo 0 n :: V.Vector Int)’

Parse failure when crossing chunk boundaries

I believe I've found a bug in attoparsec and I need some help writing a failing test. Some background:

I have a tool that processes CSV files using Cassava. After upgrading attoparsec I encountered a regression in my tool while processing production files. The problem only happens in some of the CSV files, and only when the files are big enough to be fed into attoparsec in chunks. Feeding only whole lines into attoparsec works fine.

After some experimenting and some help from git bisect, I've tracked the issue down to commit 791e046c526710dce7b87e308ee48f2fb6811d7b. Before that commit all my regression tests pass. After that commit they fail every time.

I haven't yet been able to shrink this down to a small, reproducible test case. The tool reads chunks of ByteString.Char8 values using io-streams and feeds them into Cassava, which feeds them into attoparsec. The smallest CSV file that produces the problem is about 2MB.

Due to time constraints I don't have the spare cycles to track this down further right now. Since I can lock cabal down to the last working version of attoparsec it's not a show stopper for me. That said, if someone else wants to chase this down I'd be happy to try out patches or provide more detail.
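A possible starting point for a failing test is to compare a whole-input parse with the same input fed in fixed-size chunks. The sketch below is only an illustration: `chunksOf`, `parseChunked`, `agreesWhenChunked` and the comma-separated `fields` parser are hypothetical names, and `fields` is a stand-in for the real Cassava parser. It uses only `parse`, `feed`, `eitherResult` and `parseOnly` from Data.Attoparsec.ByteString.

```haskell
import qualified Data.Attoparsec.ByteString as A
import qualified Data.ByteString as B
import Data.List (foldl')

-- | Split a ByteString into pieces of at most n bytes (n must be positive).
chunksOf :: Int -> B.ByteString -> [B.ByteString]
chunksOf n bs
  | B.null bs = []
  | otherwise = let (h, t) = B.splitAt n bs in h : chunksOf n t

-- | Feed the chunks one at a time, then feed an empty chunk to signal
-- end of input. Empty chunks are dropped first, because feeding an empty
-- string to a Partial result means "no more input".
parseChunked :: A.Parser a -> [B.ByteString] -> Either String a
parseChunked p chunks = A.eitherResult (A.feed fed B.empty)
  where
    fed = foldl' A.feed (A.parse p B.empty) (filter (not . B.null) chunks)

-- | True when chunked and whole-input parsing agree (error text is ignored).
agreesWhenChunked :: Eq a => A.Parser a -> B.ByteString -> Int -> Bool
agreesWhenChunked p input n =
  toMaybe (parseChunked p (chunksOf n input)) == toMaybe (A.parseOnly p input)
  where
    toMaybe = either (const Nothing) Just

-- | Hypothetical stand-in parser: non-empty fields separated by commas (byte 44).
fields :: A.Parser [B.ByteString]
fields = A.takeWhile1 (/= 44) `A.sepBy1` A.word8 44
```

Running `agreesWhenChunked` over a range of chunk sizes on a problematic file might help shrink the 2MB input to a chunk boundary that triggers the failure.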

New combinator "fixed :: Int -> Parser a -> Parser a"

I frequently find myself wanting/needing to parse complex fields of a fixed size, which is currently rather tricky using attoparsec. For example a 20 byte field consisting of a zero padded ASCII string. This field can be easily described as manyTill anyWord8 (word8 0) <* many (word8 0), but taking care of the size limiting is non-trivial.

I would like to have a combinator like the one proposed in the issue title, with the semantics that it tries to match its argument parser completely against the next N bytes. An example implementation (although probably rather inefficient) would be:

fixed :: Int -> Parser a -> Parser a
fixed i p = do
    intermediate <- take i
    case parseOnly (p <* endOfInput) intermediate of
        Left _ -> empty
        Right x -> return x
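To make the proposal concrete, here is a self-contained version of the same idea applied to the zero-padded field from the description. It is a sketch, not a tuned implementation: `paddedField` and `example` are hypothetical names, and the field width of 8 is chosen only to keep the example short.

```haskell
import Prelude hiding (take)
import Control.Applicative (empty, many)
import Data.Attoparsec.ByteString
  (Parser, anyWord8, endOfInput, manyTill, parseOnly, take, word8)
import qualified Data.ByteString as B
import Data.Word (Word8)

-- | Run a parser against exactly the next n bytes; it must consume them all.
fixed :: Int -> Parser a -> Parser a
fixed n p = do
  window <- take n
  case parseOnly (p <* endOfInput) window of
    Left _  -> empty
    Right x -> return x

-- | Bytes up to the first zero, followed by zero padding only.
paddedField :: Parser [Word8]
paddedField = manyTill anyWord8 (word8 0) <* many (word8 0)

-- | An 8-byte field holding "abc" plus padding, then one unrelated byte.
example :: Either String ([Word8], Word8)
example =
  parseOnly ((,) <$> fixed 8 paddedField <*> anyWord8)
            (B.pack [97, 98, 99, 0, 0, 0, 0, 0, 33])
```

The byte after the field is left for the outer parser, which is the size-limiting behaviour the request describes. A field with non-zero bytes after the padding makes the inner `endOfInput` fail, so `fixed` fails as a whole.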

I want `get` and `put` to be exposed, so I can implement my own primitive parsers.

I want get and put to be exposed, so I can implement my own primitive parsers.
For example, I want a variation of takeWhile that also returns the character that breaks the predicate. I can't implement that without get and put.
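For the specific takeWhile example, a workaround using only exported functions may be enough. The sketch below assumes the Text flavour of the API and combines `takeWhile` with `peekChar`, which looks at the next character without consuming it. `takeWhileWithBreak` and the two example values are hypothetical names, and this does not replace access to get and put for other primitives.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Prelude hiding (takeWhile)
import Data.Attoparsec.Text (Parser, parseOnly, peekChar, takeWhile)
import Data.Text (Text)

-- | Like takeWhile, but also reports the character that stopped the scan
-- (Nothing at end of input). The stopping character is not consumed.
takeWhileWithBreak :: (Char -> Bool) -> Parser (Text, Maybe Char)
takeWhileWithBreak p = do
  prefix <- takeWhile p
  next   <- peekChar
  return (prefix, next)

-- | Stops at the comma and reports it.
exampleComma :: Either String (Text, Maybe Char)
exampleComma = parseOnly (takeWhileWithBreak (/= ',')) "abc,def"

-- | Reaches end of input, so there is no breaking character.
exampleEnd :: Either String (Text, Maybe Char)
exampleEnd = parseOnly (takeWhileWithBreak (/= ',')) "abc"
```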

safe `string` parsing alternative

As you might know, there is some confusion around the behavior of the string parser, specifically in the following situation:

parse (string "wombat" <|> string "foo") "foo"

returns a Partial. This is due to the implementation of string:

string :: ByteString -> Parser ByteString
string s = takeWith (B.length s) (==s)

The solution proposed by various people on StackOverflow is to simply signal to Attoparsec when the end of the input has been reached, so it knows that the first branch of string "wombat" will never complete. This is undesirable in certain situations where you simply do not know if you reach the end of the input (specifically in network programming), and thus a safe alternative is desired.

I propose the following safe alternative of string, string':

string' :: ByteString -> Parser ByteString
string' = fmap pack . mapM (satisfy . (==)) . unpack

This will fail with the first character that makes the first branch impossible, and gives the desired result in many situations. The only situation that this will still return a partial is when you do (string "foo" <|> string "foobar"), but at least that one is more obvious.

Any comments?
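A small self-contained check of the `string'` proposal, written as a sketch. It repacks the matched bytes so that the result has the declared ByteString type, and the helpers `wombatOrFoo`, `wombatPrefix` and `finished` are hypothetical names used only to inspect the result.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import Data.Attoparsec.ByteString (IResult (..), Parser, Result, parse, satisfy)
import Data.ByteString (ByteString)
import qualified Data.ByteString as B

-- | Match a literal byte by byte, failing at the first mismatching byte.
string' :: ByteString -> Parser ByteString
string' = fmap B.pack . mapM (satisfy . (==)) . B.unpack

-- | The example from the issue, with string' in place of string.
wombatOrFoo :: Result ByteString
wombatOrFoo = parse (string' "wombat" <|> string' "foo") "foo"

-- | A proper prefix of the literal still needs more input.
wombatPrefix :: Result ByteString
wombatPrefix = parse (string' "wombat") "wom"

-- | Leftover input and value for a finished parse, Nothing otherwise.
finished :: Result a -> Maybe (ByteString, a)
finished (Done rest x) = Just (rest, x)
finished _             = Nothing
```

Because the first branch fails on the very first byte, the alternative is tried without waiting for more input, so `finished wombatOrFoo` is a `Just`. `wombatPrefix` stays Partial, which seems like the expected behaviour for network input.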

more error information

For most of our Yesod related usage of attoparsec & parsec a failure to parse indicates something very bad. For example, in some cases we would be parsing at compile time and a failure to parse would be a compilation error. At this point we don't care about performance at all, just reporting the error as easily as possible. So I am wondering if we can make an alternative parsing mode that is slow but as detailed as possible in parse failure information. For our cases we could do a fast parse first, and then if there is a parse failure, re-parse with the slower version to give the best possible error information.

If this makes sense for attoparsec, could you point out an implementation strategy?
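The fast-then-detailed idea can be sketched independently of attoparsec. In the toy below, `twoPass`, `fastInt` and `detailedInt` are hypothetical names: the fast pass only reports success or failure, and the detailed pass runs only after a failure to produce a message with an offset. A real version would presumably plug an attoparsec parser into the first slot and a slower, position-tracking parser into the second.

```haskell
import Data.Char (isDigit)
import Text.Read (readMaybe)

-- | Try the fast parser first; re-parse with the detailed one only on failure.
twoPass :: (i -> Maybe a) -> (i -> Either String a) -> i -> Either String a
twoPass fast detailed input =
  case fast input of
    Just x  -> Right x
    Nothing -> detailed input

-- | Toy fast pass: no error information at all.
fastInt :: String -> Maybe Int
fastInt = readMaybe

-- | Toy detailed pass: reports the first offending character and its offset.
detailedInt :: String -> Either String Int
detailedInt s =
  case span isDigit s of
    ("", _)     -> Left "expected a digit at offset 0"
    (ds, "")    -> Right (read ds)
    (ds, c : _) -> Left ("unexpected " ++ show c ++ " at offset " ++ show (length ds))
```

The cost of the detailed pass is paid only on the failure path, so the common case keeps the speed of the fast parser.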

Add type-specialized versions of Applicative operators

The attoparsec-text package provides the type-specialized versions of (*>) and (<*). These make it possible to use the IsString instances of Parser to write parsers using nice syntax like:

 "Shoe size: " .*> decimal

Currently, code using attoparsec-text with that technique cannot be ported to attoparsec without serious breakage.

In Data.Attoparsec.Text:

-- | Type-specialized version of '*>' for 'Text'.
(.*>) :: Applicative f => f Text -> f a -> f a
(.*>) = (*>)

-- | Type-specialized version of '<*' for 'Text'.
(<*.) :: Applicative f => f a -> f Text -> f a
(<*.) = (<*)

In Data.Attoparsec.ByteString:

-- | Type-specialized version of '*>' for 'ByteString'.
(.*>) :: Applicative f => f ByteString -> f a -> f a
(.*>) = (*>)

-- | Type-specialized version of '<*' for 'ByteString'.
(<*.) :: Applicative f => f a -> f ByteString -> f a
(<*.) = (<*)
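For illustration, here is how the operators might be used once defined. This sketch defines them locally, with the literal argument typed as `f Text` so that `(*>)` and `(<*)` can be reused directly, and relies on the IsString instance of the Text parser together with OverloadedStrings. `shoeSize`, `example` and the " EU" suffix are hypothetical and exist only for the example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text (Parser, decimal, parseOnly)
import Data.Text (Text)

-- | Type-specialized '*>': the left-hand side must produce Text.
(.*>) :: Applicative f => f Text -> f a -> f a
(.*>) = (*>)

-- | Type-specialized '<*': the right-hand side must produce Text.
(<*.) :: Applicative f => f a -> f Text -> f a
(<*.) = (<*)

-- | The string literals become parsers through the IsString instance.
shoeSize :: Parser Int
shoeSize = "Shoe size: " .*> decimal <*. " EU"

example :: Either String Int
example = parseOnly shoeSize "Shoe size: 42 EU"
```

Fixing the literal side to Text is what lets the compiler pick the parser's IsString instance without a type annotation on each string.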