haskell / attoparsec
A fast Haskell library for parsing ByteStrings
Home Page: http://hackage.haskell.org/package/attoparsec
License: Other
Is it possible to add a function that finds all matches inside a string, like:
import Data.Either (rights)
import Data.Attoparsec.Text (Parser, parseOnly)
import Data.Text (pack)
findAll :: Parser a -> String -> [a]
findAll parser = rights . map (parseOnly parser . pack) . oneLess
  where
    oneLess [] = []
    oneLess (whole@(_:xs)) = whole : oneLess xs
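As a usage sketch of the idea above (not a proposed library implementation): the variant below uses Data.List.tails instead of the hand-written oneLess. Note one difference, which is an assumption of this sketch: tails also yields the empty suffix, so a parser that succeeds on empty input would report one extra match.

```haskell
import Data.Either (rights)
import Data.List (tails)
import Data.Attoparsec.Text (Parser, parseOnly, decimal)
import Data.Text (pack)

-- Try the parser at every suffix of the input and keep the successes.
findAll :: Parser a -> String -> [a]
findAll parser = rights . map (parseOnly parser . pack) . tails

-- Overlapping matches are all reported: the suffix "12b3" yields 12
-- and the suffix "2b3" yields 2.
example :: [Int]
example = findAll decimal "a12b3"   -- [12,2,3]
```

Because every suffix is packed and parsed separately, the cost is quadratic in the worst case, so this is only reasonable for short inputs.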
The scan parser has type
scan :: s -> (s -> Word8 -> Maybe s) -> Parser ByteString
Sometimes it can be useful to know the state machine's final state, but scan doesn't expose this. I would like to see a generalisation of scan to a function
scanEither :: s -> (s -> Word8 -> Either s r) -> Parser (ByteString, Either s r)
which returns the state machine's final state. Then scan could be implemented in terms of scanEither as
scan s0 p = fst <$> scanEither s0 p'
  where
    p' s w = case p s w of
      Just s' -> Left s'
      Nothing -> Right ()
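For readers with the same need: to my knowledge, later attoparsec releases export runScanner from Data.Attoparsec.ByteString, which returns the final scanner state alongside the consumed bytes (it is not the scanEither proposed above, and it does not return an Either). A minimal sketch, assuming your attoparsec version provides it; digitsWithCount is a hypothetical name:

```haskell
import Data.Attoparsec.ByteString (Parser, parseOnly, runScanner)
import qualified Data.ByteString.Char8 as BC
import Data.Word (Word8)

-- Consume ASCII digits and count them in the scanner state.
digitsWithCount :: Parser (BC.ByteString, Int)
digitsWithCount = runScanner 0 step
  where
    step :: Int -> Word8 -> Maybe Int
    step n w
      | w >= 48 && w <= 57 = Just (n + 1)  -- byte is in '0'..'9'
      | otherwise          = Nothing       -- stop scanning here

example :: Either String (BC.ByteString, Int)
example = parseOnly digitsWithCount (BC.pack "123abc")
```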
While writing a parser Fold for @Gabriel439's foldl package I stumbled upon this interesting inconsistency (full example here):
A.eitherResult
$ flip A.feed ""
$ flip A.feed "123"
$ A.parse A.double "1.3"
== 1.3123
A.eitherResult
$ flip A.feed ""
$ flip A.feed "123"
$ flip A.feed "1.3"
$ A.parse A.double ""
== 1.3
Given that empty strings are a bit special, as they indicate the end of the stream, it's not entirely clear that this is a bug, but it is quite surprising. Why does the initial empty string affect the resulting parse? Is this considered buggy behavior?
Hi,
I believe I'm seeing a regression with GHC 7.8 as compared to 7.6; it's documented here with an example file.
Could you check if you see the same? Thanks!
Currently there are only a few functions in Data.Attoparsec.ByteString, making it a far inferior choice compared to the Data.Attoparsec.Text functions. Unfortunately, in projects where a dependency on Text is not acceptable, this means that we're stuck with a very limited number of possibilities.
This issue asks for an extension of the ByteString part of the parser so that it matches its Text cousins more closely.
I've accidentally tried to "take" a negative number of bytes. Surprisingly, it worked:
let Right bs = A.parseOnly (A.take (-1)) B.empty in B.length bs
returns -1, which kinda makes sense (in a weird way). Trying to print the result itself, however, doesn't work that well:
A.parseOnly (A.take (-1)) B.empty
and ghci crashes. I'd suggest checking for negative length in "take".
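Until the library checks this itself, a user-side guard is easy to write. A minimal sketch; takeChecked is a hypothetical name, not part of attoparsec:

```haskell
import qualified Data.Attoparsec.ByteString as A
import qualified Data.ByteString as B

-- Hypothetical wrapper: reject negative lengths before calling A.take.
takeChecked :: Int -> A.Parser B.ByteString
takeChecked n
  | n < 0     = fail ("takeChecked: negative length " ++ show n)
  | otherwise = A.take n
```

With this wrapper, a negative length becomes an ordinary parse failure instead of a malformed ByteString.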
Hi Bryan,
I've written a small attoparsec parser that when run on a strict text succeeds but when run on a lazy text errors.
The program parses a textual Hoogle database as generated by cabal haddock --hoogle. I attached a modified hoogle DB of the base library to the gist.
Running the program using a strict text succeeds:
$ ./attoparsec-hoogle ./base.txt strict
OK
Running the program using a lazy text errors:
$ ./attoparsec-hoogle ./base.txt lazy
Error:
"-- | aaaaaaaaaa"
Now when I remove one of the a's in the input file the lazy parser succeeds. So I guess this has something to do with the chunking of a lazy text.
I encountered a backtracking error in one of my programs and narrowed the error down to the following example:
$ ghci
GHCi, version 7.8.3: http://www.haskell.org/ghc/ :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
> :set -XOverloadedStrings
> import Data.Attoparsec.Text.Lazy
> import Control.Applicative
> parseTest ((skipSpace *> char ' ') <* endOfInput) " "
Loading package array-0.5.0.0 ... linking ... done.
Loading package deepseq-1.3.0.2 ... linking ... done.
Loading package bytestring-0.10.4.0 ... linking ... done.
Loading package text-1.1.0.0 ... linking ... done.
Loading package containers-0.5.5.1 ... linking ... done.
Loading package attoparsec-0.10.4.0 ... linking ... done.
Fail "" ["' '","demandInput"] "not enough input"
It seems odd that Data.Attoparsec.Char8's isEndOfLine function only works on Word8s. I thought it would be consistent with the other functions in the module and have two forms, one for Chars and one for Word8s, like:
isEndOfLine_w8 :: Word8 -> Bool
isEndOfLine :: Char -> Bool
It doesn't have to have both, but it seems like the Char8 module should have at least the latter one so that it is compatible with the functions in the same module.
My main use case for it is just skipping over an entire line using:
skipLine = skipWhile (not . isEndOfLine) >> endOfLine
... but in order to do that I have to hide the skipWhile from Data.Attoparsec.Char8 and import the one from Data.Attoparsec instead, like so:
import Data.Attoparsec (skipWhile)
import Data.Attoparsec.Char8 hiding (skipWhile)
In order to specialize for custom Integral & Bits types, the attoparsec parse functions need an inlinable pragma. I assume specialization is worthwhile seeing as every built-in type has a SPECIALISE pragma.
Small test:
Prelude Data.Attoparsec.ByteString.Char8> parse (stringCI "FooBar") "foobar"
Done "" "foobar"
Prelude Data.Attoparsec.ByteString.Char8> parse (stringCI "FooBar") "FooBar"
Fail "FooBar" [] "string"
Prelude Data.Attoparsec.ByteString.Char8> parse (stringCI "foobar") "FooBar"
Fail "FooBar" [] "string"
Prelude Data.Attoparsec.ByteString.Char8> parse (stringCI "foobar") "foobar"
Done "" "foobar"
Feels like stringCI "internally lowers" the case of the pattern string (i.e. its argument) but "does not" lower the case of the input before matching.
Reproduced this on ghc-7.10.1 with attoparsec-0.12.1.5 and attoparsec-0.12.1.4. Previously this worked as expected (all 4 of above examples result in Done _ _) with attoparsec-0.11.x (tested with attoparsec-0.11.3.4) and attoparsec-0.12.1.3. Could it be that the problem was introduced by commit dcc5e1f ?
Using partial parsing with backtracking may lead to incorrect input state. I've put a test case here.
We have a parser myParser (which internally uses some nested backtracking) and we feed it input in two ways: byte by byte, and the whole input at once. We get these results:
sample = "hello "
partial = foldl feed (parse myParser "") $ map B.singleton sample
full = parse myParser $ B.pack sample
*Atto> partial
Done "ello " ()
*Atto> full
Done "hello " ()
So, the same parser on the same input yields different results depending on the way you feed it.
Tested with attoparsec-0.9.1.2 and attoparsec-10.0.1.1.
I got bitten by this in cassava. Might be worth adding a sentence to the docs.
parser :: Parser Text
parser = takeTill ((==) 'a')
main :: IO ()
main = parseTest parser "𝟘a" >>= print
The code should result in Done "a" "\120792", a clean cut.
But I get Done "\57304a" "\120792"
With the predicate negated, takeWhile also presents the same issue.
The issue can be reproduced with this gist
I'm using attoparsec-0.12.1.2 with text-1.2.0.0
Thanks!
If you import Data.Attoparsec.Text.Lazy only, and use the takeTill function, it creates a parser for strict Text.... This has caused confusion and delay :) (see http://stackoverflow.com/questions/20460771/compile-error-with-attoparsec-text-lazy and the comments to the question)
Are we missing something?
I am writing an SVG parser for diagrams: diagrams-input. Some images have attributes like opacity=".1" instead of "0.1". It would be nice if the double parser accepted this.
https://github.com/bos/attoparsec/blob/master/Data/Attoparsec/Text/Lazy.hs#L47-L56
it says The 'ByteString' is the input
and The 'ByteString' is the
for the Fail and Done cases
There's a large increase in space usage between 0.10.3.0 and 0.10.4.0. The increase can most easily be seen using tibbe/cassava@76451b4 or later and this test program. Earlier versions of cassava show the same problem, but it's easier to see using the current master. You'll need the n32_results.txt input file.
To compile and run the test do:
ghc -O2 Test.hs -rtsopts
./Test n32_results.txt +RTS -h -i0.01
(Make sure cassava is compiled against either 0.10.3.0 or 0.10.4.0.)
Using 0.10.3.0 I get this heap profile:
And with 0.10.4.0 I get this:
I suspect c707514 might be to blame.
I found the optional combinator in Parsec tremendously useful, so I thought why not add it to attoparsec? Its name should make it obvious what it's used for. Here's the implementation I use currently:
optional :: Parser a -> Parser (Maybe a)
optional = option Nothing . fmap Just . try
Would this be useful to have in the library or should it be defined by the user?
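Worth noting for anyone landing here: Control.Applicative already provides optional for any Alternative, and attoparsec's Parser is an Alternative, so no library change is strictly needed. Since attoparsec parsers always backtrack on failure, the explicit try is not required either. A small sketch; signedInt is a hypothetical example parser:

```haskell
import Control.Applicative (optional)
import Data.Attoparsec.Text (Parser, parseOnly, char, decimal)
import Data.Text (pack)

-- An optional leading minus sign followed by a decimal number.
signedInt :: Parser Int
signedInt = do
  sign <- optional (char '-')   -- Just '-' if present, Nothing otherwise
  n    <- decimal
  pure (maybe n (const (negate n)) sign)
```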
Text.toCaseFold "daß" gives "dass", so should then parseOnly (stringCI "daß") "dass" give Right "dass"?
Currently it does not, because stringTransform only considers strings of the same length. Do we consider this a bug?
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative
import qualified Data.Attoparsec.Text as A
import Data.Text (Text)
import qualified Data.Text as T
testParser :: Text -> Either String Int
testParser f = fmap length
             . A.parseOnly (many (A.char 'b' <|> A.anyChar))
             $ f
main :: IO ()
main = print . testParser $ T.replicate 50000 "a"
Compiled using -O2 -threaded:
On GHC 7.6.3:
real 0m0.062s
user 0m0.022s
sys 0m0.007s
On GHC 7.8 tip:
real 0m12.700s
user 0m12.504s
sys 0m0.165s
I'm not used to posting issues on bugtrackers, so bear with me, and I'm not entirely sure this belongs here.
Data.Attoparsec.Combinator.many1 used with a Parser Char results in the type Parser Char -> Parser [Char], naturally. However, when using Text, there is obviously no implicit conversion. Am I supposed to directly use the fromString that OverloadedStrings uses under the hood, or am I missing a combinator somewhere?
I've looked through the documentation, and I can't find anything that would work. Would a specialized combinator many1 :: Parser Char -> Parser Text just use fromString?
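Two common ways to get a Text result, sketched under the assumption that the parser is a simple character class (word1 and word2 are hypothetical names): pack the list that many1 returns, or skip the list entirely with takeWhile1.

```haskell
import Data.Attoparsec.Text (Parser, parseOnly, many1, letter, takeWhile1)
import Data.Char (isAlpha)
import qualified Data.Text as T

-- Pack the list of characters produced by many1.
word1 :: Parser T.Text
word1 = T.pack <$> many1 letter

-- Take the matching slice of the input directly, without building a list.
word2 :: Parser T.Text
word2 = takeWhile1 isAlpha
```

When the element parser is just a predicate on characters, the takeWhile1 form avoids the intermediate [Char].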
In this code: http://hackage.haskell.org/package/attoparsec-0.9.0.0/docs/src/Data-Attoparsec-Char8.html#hexadecimal
The isHexDigit function is given as:
isHexDigit w = (w >= 48 && w <= 57) ||
               (w >= 97 && w <= 102) ||
               (w >= 65 && w <= 90)
I believe the (w >= 65 && w <= 90) should be (w >= 65 && w <= 70) to only cover A-F.
I am trying to upgrade away from attoparsec-text. I needed to use many from Control.Applicative, but I am also missing space and skipSpace. Were these removed intentionally or accidentally?
To simplify the development/debugging of attoparsec parsers it would help if parse failures that bubble up to the top level were annotated with their position in the input stream. This blog post gives a few concrete examples where this would help and claims that adding position information to parse failures is not possible without a severe performance loss.
However, I don't think that reporting the position of a top-level failure is that expensive to compute generically for any attoparsec parser. My reasoning is the following.
Every parser delivers the remainder of the input upon failure. Hence, if you keep track of the number of bytes/chars fed to an attoparsec parser, then you can compute the actual position of an error using that information. The effort of managing this information is proportional to the number of chunk boundaries, which should be rather few. Hence, it should be feasible to compute precise error positions of top-level errors for any attoparsec parser.
I created this issue to ensure that this idea is not forgotten. I currently do not have time to implement it by myself, but perhaps somebody else does.
Note that if one is interested in line-column information for text files, then one can compute that information from the character-position in the input stream. Depending on the memory requirements one either pays for this computation (one additional pass through each chunk of input) at every chunk boundary (if one updates the actual position for every chunk) or in case of a top-level failure only (if one stores all input chunks).
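The reasoning above can be sketched in a few lines for the simplest case, where the whole input is a single strict chunk. Both failureOffset and lineCol are hypothetical helpers, not attoparsec functions; a chunked version would additionally have to sum the lengths of the chunks fed so far.

```haskell
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.ByteString.Char8 as BC
import Data.List (foldl')

-- Offset of a top-level failure: total length minus the length of the
-- remainder that the Fail result hands back.
failureOffset :: A.Parser a -> BC.ByteString -> Maybe Int
failureOffset p input =
  case A.feed (A.parse p input) BC.empty of   -- empty chunk signals end of input
    A.Fail rest _ _ -> Just (BC.length input - BC.length rest)
    _               -> Nothing

-- 1-based (line, column) of the character at a 0-based offset, computed
-- by one pass over the characters that precede it.
lineCol :: Int -> String -> (Int, Int)
lineCol offset = foldl' step (1, 1) . take offset
  where
    step (l, _) '\n' = (l + 1, 1)
    step (l, c) _    = (l, c + 1)
```

The line/column pass only runs when a failure is actually reported, which matches the "pay only on top-level failure" option described above.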
I know that if I want "really good" errors I should switch to parsec, but right now I have the issue where, say:
parser = (char '=' >> option Nothing (fmap Just piece)) <?> "Path"
piece = takeWhile1 (\x -> x /= '/' && x /= '*' && not (isSpace x))
parseOnly (parser <* endOfLine <* endOfInput)
when fed with, say, just an "x" gives an error message of "satisfy". Would be nice if I could at least say something like "Syntax error in Path".
attoparsec holds on to all input until we're done parsing, using O(input) memory. It would probably be worth mentioning under "Performance considerations" in Data.Attoparsec.ByteString.
Hi Bryan,
I improved the performance of the numeric parsers (especially for numbers with exponents, e.g. 123.456e3).
The following is the benchmark report of the extended benchmarks. (The prime versions are using the new scientific parser)
The code can be found in the scientific branch of my fork of attoparsec.
Explanation of the changes:
data Scientific = Scientific
    { coefficient :: !Integer
    , exponent    :: {-# UNPACK #-} !Int
    }
A scientifically parser which parses scientific numbers:
{-# INLINE scientifically #-}
scientifically :: (Scientific -> a) -> Parser a
scientifically h = do
  !positive <- ((== '+') <$> I.satisfy (\c -> c == '-' || c == '+')) <|>
               pure True
  let step a c = a * 10 + fromIntegral (ord c - 48)
  n <- T.foldl' step 0 <$> I.takeWhile isDigit
  let f fracDigits = Scientific (T.foldl' step n fracDigits)
                                (negate $ T.length fracDigits)
  Scientific coeff expnt <-
    (I.satisfy (=='.') *> (f <$> I.takeWhile isDigit)) <|>
    pure (Scientific n 0)
  let !signedCoeff | positive  = coeff
                   | otherwise = negate coeff
  (I.satisfy (\c -> c == 'e' || c == 'E') *>
      fmap (h . Scientific signedCoeff . (expnt +)) (signed decimal)) <|>
    return (h $ Scientific signedCoeff expnt)
Parsers defined in terms of scientifically:
scientific :: Parser Scientific
scientific = scientifically id
rational' :: Fractional a => Parser a
rational' = scientifically realToFrac
double' :: Parser Double
double' = scientifically realToFrac
number' :: Parser Number
number' = scientifically $ \s@(Scientific c e) ->
    if e >= 0
    then I (c * 10 ^ e)
    else D (realToFrac s)
Note that these parsers could be expressed using a normal fmap like:
double' = realToFrac <$> scientific
However, GHC optimizes the former better.
Note that I haven't created a pull request yet because I obviously want to replace the normal numeric parsers with their faster scientific-based alternatives.
Let me know what you think and if I should turn this into a clean pull request.
Bas
@kolmodin reports:
Try to add a few more "AB.anyWord8" to word32LE in benchmarks/Benchmarks.hs and GHC will consume gigabytes when it's slowly slowly compiling. Seems it's too aggressive inlining.
I suspect that we might need to roll back the call to the magic inline function I added.
While I love that you added peekChar in Attoparsec (thanks BTW), I am just letting you know that there are some conditions in which it doesn't play well with parseOnly.
Prelude> :m +Data.Attoparsec.Text
Prelude Data.Attoparsec.Text> :m +Data.Text
Prelude Data.Attoparsec.Text Data.Text> parseOnly peekChar (pack "")
*** Exception: parseOnly: impossible error! <---- HERE
Prelude Data.Attoparsec.Text Data.Text> parse peekChar (pack "")
Partial _
Data.Attoparsec.Text Data.Text> feed (parse peekChar (pack "")) Data.Text.empty
Done "" Nothing
I traced the problem to the definition of peekChar, line 417 of Data.Attoparsec.Text.Internal.hs
peekChar :: Parser (Maybe Char)
peekChar = T.Parser $ \i0 a0 m0 _kf ks ->
  let ks' i a m = let w = unsafeHead (unI i)
                  in w `seq` ks i a m (Just w)
      kf' i a m = ks i a m Nothing
  in if T.null (unI i0) <----- HERE
     then prompt i0 a0 m0 kf' ks'
     else ks' i0 a0 m0
This check succeeds, so prompt is called, but prompt yields a Partial continuation, which parseOnly describes as an impossible error.
I don't consider this a bug, more a technical decision (how to play with peekChar, partial continuations and runparser), but I guess it should be added to the documentation of the function to avoid surprises.
Pardon my English. Thanks in advance, I am new to github (and git) but will try to document the behavior in the next days.
[12 of 21] Compiling Data.Attoparsec.ByteString.Internal ( Data/Attoparsec/ByteString/Internal.hs, dist/build/Data/Attoparsec/ByteString/Internal.o )
Data/Attoparsec/ByteString/Internal.hs:519:7:
    Illegal equational constraint a_ataf ~ (ByteString, t)
    (Use GADTs or TypeFamilies to permit this)
    In the context: (a_ataf ~ (ByteString, t))
    While checking the inferred type for ‘succ'’
    In the expression:
      let
        succ' t' pos' more' a
          = succ t' pos' more' (substring pos (pos' - pos) t', a)
      in runParser p t pos more lose succ'
    In the second argument of ‘($)’, namely
      ‘\ t pos more lose succ
         -> let succ' t' pos' more' a = ...
            in runParser p t pos more lose succ'’
Failed to install attoparsec-0.12.0.0
This is with today's GHC-7.9 from HEAD.
isDigit, digit, etc., are less smart than isDigit_w8. Specifically, they check ranges using code like
'0' <= c && c <= '9'
This involves two comparisons and two branches. It's generally faster to use these:
-- Requires, but does not check, that the third argument is at least
-- as large as the first.
betweenI :: Integral n => n -> n -> n -> Bool
betweenI a b c = (fromIntegral (b - a) :: Word) <= fromIntegral (c - a)
{-# INLINE betweenI #-}
betweenE :: Enum n => n -> n -> n -> Bool
betweenE a b c = betweenI (fromEnum a) (fromEnum b) (fromEnum c)
{-# INLINE betweenE #-}
The c-a will be calculated at compile time in typical uses, so this requires just one subtraction, one comparison, and one branch. The improvement is greatest when values are mostly not in the specified range (as branch prediction may be substantially improved), but even when branches are well-predicted, this way is a bit better.
Careless use could be expensive, but nevertheless I found this combinator useful. Should I submit a patch to Data.Attoparsec.Combinator?
wowo :: Alternative f => f a -> f b -> f b
wowo w wo = w *> wo <|> wo
Otherwise I end up having to name the wo parser, which can be a bit ugly indentation-wise inside a do-block. Also you can use NondecreasingIndentation after wowo (…) $ do …
Not happy about the name though. I thought of calling it bono, but I can't live with (or without) having named it bono. :-/
So I found myself pulling in Data.Attoparsec.Internal.Types and writing:
getPos :: Parser Int
getPos = AT.Parser $ \t pos more _ succ' -> succ' t pos more (AT.fromPos pos)
and it didn't seem like terribly unreasonable code. (My use case is that I'm scanning git packfiles and I want to capture the file offset for each object I encounter). Is this actually a terrible idea that can never work, or is it something you'd consider adding to the public attoparsec interface so I can feel slightly better about myself for pulling an Internal module?
I'm not trying to resurrect #16 or #19 (or anything else that's about automatically doing anything in particular on failure). Apologies if I missed the point somewhere and this actually is a duplicate. Thanks!
Often it's necessary to run a parser inside another parser. It's usually done with something like this:
captureAndParse = do
  raw <- capture
  case parseOnly parse raw of
    Left err -> fail err
    Right res -> return res
It would be nice to have a function parseOnlyM which runs a parser and calls fail on a Left result. It can be implemented in this way:
parseOnlyM :: Monad m => Parser a -> Text -> m a
parseOnlyM parser input = either fail return $ parseOnly parser input
This would simplify the first example to the following:
captureAndParse = capture >>= parseOnlyM parse
What do you think about such addition? I would be happy to make a PR.
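A self-contained sketch of the proposal, with two assumptions of mine: on recent GHC/base versions fail lives in MonadFail rather than Monad, so the constraint is adjusted accordingly, and the capture step is replaced by a concrete takeWhile1 isDigit for illustration.

```haskell
import Data.Attoparsec.Text (Parser, parseOnly, decimal, takeWhile1)
import Data.Char (isDigit)
import Data.Text (Text, pack)

-- Run a parser on a Text and report a Left result through 'fail'.
parseOnlyM :: MonadFail m => Parser a -> Text -> m a
parseOnlyM parser input = either fail return (parseOnly parser input)

-- Capture a run of digits, then run a second parser on the captured text.
captureAndParse :: Parser Int
captureAndParse = takeWhile1 isDigit >>= parseOnlyM decimal
```

Because the result type is any MonadFail, the same helper also works outside a parser, for example at type Maybe.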
Part of documentation on hackage is trimmed
http://hackage.haskell.org/packages/archive/attoparsec/latest/doc/html/Data-Attoparsec-Text.html#v:rational
http://hackage.haskell.org/packages/archive/attoparsec/latest/doc/html/Data-Attoparsec-ByteString-Char8.html#v:rational
Perhaps because of the empty line.
The important part being the statement that it doesn't parse NaN and Infinity.
-- | Parse a rational number.
--
-- This parser accepts an optional leading sign character, followed by
-- at least one decimal digit. The syntax similar to that accepted by
-- the 'read' function, with the exception that a trailing @\'.\'@ or
-- @\'e\'@ /not/ followed by a number is not consumed.
--
-- Examples with behaviour identical to 'read', if you feed an empty
-- continuation to the first result:
--
-- >rational "3" == Done 3.0 ""
-- >rational "3.1" == Done 3.1 ""
-- >rational "3e4" == Done 30000.0 ""
-- >rational "3.1e4" == Done 31000.0, ""
-- Examples with behaviour identical to 'read': <<-- this is trimmed
--
-- >rational ".3" == Fail "input does not start with a digit"
-- >rational "e3" == Fail "input does not start with a digit"
--
-- Examples of differences from 'read':
--
-- >rational "3.foo" == Done 3.0 ".foo"
-- >rational "3e" == Done 3.0 "e"
--
-- This function does not accept string representations of \"NaN\" or
-- \"Infinity\".
Would it be possible for the library to report, for each parsed entity, at which position it was found in the input stream?
It would be helpful to get positions in terms of lines and columns but perhaps also in terms of offset from the beginning of the input stream.
The report is a followup to a discussion with @snoyberg
Data.Attoparsec.Char8 generally has predicates of the type (Char -> Bool), for example isSpace, and functions like takeTill and takeWhile take such predicates, having type (Char -> Bool) -> Parser ByteString.
Shouldn't the isEndOfLine and isHorizontalSpace predicate functions similarly be specialized to (Char -> Bool)? Right now they are (Word8 -> Bool), which makes them difficult to use with takeTill.
AfC
TL;DR
I had a parser that was failing in odd ways and fixed it by replacing
P.many' pSomething
with
P.option [] (P.many1 pSomething)
where pSomething is a non-trivial parser.
Q1: Why does the second version work correctly when the first doesn't?
Q2: Is there any performance penalty for implementing it the second way?
Long version
I had a parser (used in a Conduit via Data.Conduit.Attoparsec) that somewhere around the time 0.12 was released started intermittently throwing a ParseError (defined in D.C.Attoparsec). However, since it was intermittent and didn't always happen in the same place on the same input (which was ok for my application) I wasn't in a rush to debug it.
When I finally decided to invest some time in debugging this it was really hard going. First I managed to make it trigger more frequently by adding a conduit chunker before the conduitParser. Normally ByteString data flows through conduits in chunks of about 30. My chunker allowed me to make these chunks much smaller, which increased the frequency of the parse error. Once that was happening I spent considerable time digging around in both D.C.Attoparsec and attoparsec itself. I got nowhere.
After reading the attoparsec documentation for about the 10th time I decided to remove all the attoparsec combinators that had warnings about them. I removed peekWord8 and was more careful about takeWhile etc. In the end, it was removing many' that fixed it.
I found the following gadget useful in many of my scripts. It's the "opposite" of manyTill.
skipTill :: Parser a -> Parser b -> Parser b
skipTill junk end = end <|> (junk *> skipTill junk end)
I am wondering if people find this useful too. If so, I'd like to propose adding this one (or an optimized one) to the library.
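A usage sketch for the combinator above, using Data.Attoparsec.Text; findEnd and the "END" marker are hypothetical, chosen only for illustration:

```haskell
import Control.Applicative ((<|>))
import Data.Attoparsec.Text (Parser, parseOnly, anyChar, string)
import Data.Text (Text, pack)

-- Same definition as in the proposal above.
skipTill :: Parser a -> Parser b -> Parser b
skipTill junk end = end <|> (junk *> skipTill junk end)

-- Skip arbitrary characters until the literal "END" matches,
-- and return the match.
findEnd :: Parser Text
findEnd = skipTill anyChar (string (pack "END"))
```

Each step first tries end and only consumes one junk item when end fails, so the cost depends on how expensive a failed end attempt is.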
{-# LANGUAGE OverloadedStrings #-}
import Prelude hiding (take)
import Data.Attoparsec.ByteString
import Data.Attoparsec.ByteString.Char8 (signed, decimal)
import qualified Data.ByteString as B
import Data.IORef
import Control.Applicative
data D = S B.ByteString
  deriving Show
parser :: Parser D
parser = do
  len <- signed decimal
  S <$> (take len)
-- | Takes a string and produces it little by little
ioProducerSplitted :: B.ByteString -> Int -> IO (IO B.ByteString)
ioProducerSplitted s n = do
  state <- newIORef s
  return $ do
    putStrLn "more data was requested"
    val <- readIORef state
    let (rv, leftOver) = B.splitAt n val
    writeIORef state leftOver
    putStrLn $ "returning " ++ show rv
    return rv
main :: IO ()
main = do
  prod <- ioProducerSplitted "9foobarbaz" 3
  res <- parseWith prod parser ""
  putStrLn $ "res: " ++ show res
Output for 0.11.3.4:
more data was requested
returning "9fo"
more data was requested
returning "oba"
more data was requested
returning "rba"
more data was requested
returning "z"
res: Done "" S "foobarbaz"
Output for 0.12.1.0:
more data was requested
returning "9fo"
more data was requested
returning "oba"
more data was requested
returning "rba"
more data was requested
returning "z"
more data was requested
returning ""
res: Fail "" [] "not enough input"
There's this crazy bug I'm getting: informatikr/hedis#15
It turned out that (if I'm not mistaken) because hedis reads responses from redis like this https://github.com/informatikr/hedis/blob/master/src/Database/Redis/ProtocolPipelining.hs#L116 -- it relies on attoparsec's behaviour of calling the socket-reading function exactly the needed number of times. Here's a piece of code (it should be compiled inside the hedis package so that it can see its hidden modules):
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Data.Attoparsec.ByteString
import Database.Redis.Protocol (reply)
import Database.Redis
import System.Random
import qualified Data.ByteString.Char8 as BC8
import Network
import qualified Data.ByteString as B
import GHC.IO.Handle (hSetBinaryMode, hFlush)
-- import Network.Socket.Types (PortNumber(..))
import Data.IORef
debug s = putStrLn $ "MAIN>> " ++ s
-- | Takes a string and produces it little by little
ioProducerSplitted :: B.ByteString -> Int -> IO (IO B.ByteString)
ioProducerSplitted s n = do
  state <- newIORef s
  return $ do
    debug "more data was requested"
    val <- readIORef state
    let (rv, leftOver) = B.splitAt n val
    writeIORef state leftOver
    debug $ "returning " ++ show rv
    return rv
main :: IO ()
main = do
  prod <- ioProducerSplitted
    "*2\r\n$47\r\nvalue-value-value-value-value-56.27533380617696\r\n$47\r\nvalue-value-value-value-value-64.54343491843917\r\n"
    64
  res <- parseWith prod reply ""
  debug $ "res: " ++ show res
For attoparsec version 0.11.3.4 it produces this output:
MAIN>> more data was requested
MAIN>> returning "*2\r\n$47\r\nvalue-value-value-value-value-56.27533380617696\r\n$47\r\nv"
MAIN>> more data was requested
MAIN>> returning "alue-value-value-value-value-64.54343491843917\r\n"
MAIN>> res: Done "" MultiBulk (Just [Bulk (Just "value-value-value-value-value-56.27533380617696"),Bulk (Just "value-value-value-value-value-64.54343491843917")])
While for 0.12.1.0 it produces:
MAIN>> more data was requested
MAIN>> returning "*2\r\n$47\r\nvalue-value-value-value-value-56.27533380617696\r\n$47\r\nv"
MAIN>> more data was requested
MAIN>> returning "alue-value-value-value-value-64.54343491843917\r\n"
MAIN>> more data was requested
MAIN>> returning ""
MAIN>> res: Fail "alue-value-value-value-value-64.54343491843917\r\n" [] "Failed reading: empty"
I'm trying to understand the code of Attoparsec.Text (unfortunately non-exported functions are not nearly as well-commented as the exported ones) and came across this:
lengthAtLeast :: T.Text -> Int -> Bool
lengthAtLeast t@(T.Text _ _ len) n = (len `div` 2) >= n || T.length t >= n
Why the div by 2? Is this some internal knowledge about how Text works? (If yes, it might be that this doesn't work so well any more in case that changes.)
Also, if that len is always positive (my guess), might not a quot be faster? (There are two more occurrences of this, in the FastSet modules; it could apply there as well.)
Thanks!
This is not really an issue; I'm just wondering. As far as I can see, attoparsec provides a completely safe interface, and although it uses the unsafe internal modules of ByteString and Text, I can't imagine how this package could be used in an unsafe way. So the question arises: is it reasonable to make every exposed module safe? If I'm right, it would only require enabling the Trustworthy extension in every exposed module.
[ 1 of 25] Compiling QC.Rechunked ( tests/QC/Rechunked.hs, dist/build/tests/tests-tmp/QC/Rechunked.dyn_o )
tests/QC/Rechunked.hs:52:9:
    Non type-variable argument in the constraint: G.Vector v Int
    (Use FlexibleContexts to permit this)
    When checking that ‘swapAll’ has the inferred type
      swapAll :: forall (m :: * -> *) (t :: * -> *) (v :: * -> *).
                 (Foldable t, primitive-0.5.4.0:Control.Monad.Primitive.PrimMonad m, G.Vector v Int, G.Mutable v ~ V.MVector) =>
                 t (Int, Int) -> m (v Int)
    In an equation for ‘swapIndices’:
        swapIndices n0
          = do { swaps <- forM [0 .. n] $ \ i -> ((,) i) `fmap` choose ...;
                 return (runST (swapAll swaps)) }
          where
              n = n0 - 1
              swapAll ijs
                = do { mv <- G.unsafeThaw (G.enumFromTo 0 n :: V.Vector Int);
                       .... }
    In an equation for ‘fisherYates’:
        fisherYates xs
          = (V.toList . V.backpermute v) `fmap` swapIndices (G.length v)
          where
              v = V.fromList xs
              swapIndices n0
                = do { swaps <- forM ... $ ...;
                       .... }
                where
                    n = n0 - 1
                    swapAll ijs = ...
tests/QC/Rechunked.hs:53:46:
    Couldn't match expected type ‘Int’ with actual type ‘b1’
      ‘b1’ is untouchable
        inside the constraints (Foldable t, primitive-0.5.4.0:Control.Monad.Primitive.PrimMonad m, G.Vector v Int, G.Mutable v ~ V.MVector)
        bound by the inferred type of
                 swapAll :: (Foldable t, primitive-0.5.4.0:Control.Monad.Primitive.PrimMonad m, G.Vector v Int, G.Mutable v ~ V.MVector) =>
                            t (Int, Int) -> m (v Int)
        at tests/QC/Rechunked.hs:(52,9)-(55,27)
      ‘b1’ is a rigid type variable bound by
           the inferred type of
           swapIndices :: (Enum b1, Num b1, random-1.1:System.Random.Random b1) => b1 -> Gen b
           at tests/QC/Rechunked.hs:47:5
    Possible fix: add a type signature for ‘swapIndices’
    Relevant bindings include
      n :: b1 (bound at tests/QC/Rechunked.hs:51:9)
      n0 :: b1 (bound at tests/QC/Rechunked.hs:47:17)
      swapIndices :: b1 -> Gen b (bound at tests/QC/Rechunked.hs:47:5)
    In the second argument of ‘G.enumFromTo’, namely ‘n’
    In the first argument of ‘G.unsafeThaw’, namely
      ‘(G.enumFromTo 0 n :: V.Vector Int)’
I believe I've found a bug in attoparsec and I need some help writing a failing test. Some background:
I have a tool that processes CSV files using Cassava. After upgrading attoparsec I encountered a regression in my tool while processing production files. The problem only happens in some of the CSV files, and only when the files are big enough to be fed into attoparsec in chunks. Feeding only whole lines into attoparsec works fine.
After some experimenting and some help from git bisect, I've tracked the issue down to commit 791e046c526710dce7b87e308ee48f2fb6811d7b. Before that commit all my regression tests pass. After that commit they fail every time.
I haven't yet been able to shrink this down to a small, reproducible test case. The tool reads chunks of ByteString.Char8 values using io-streams and feeds them into Cassava, which feeds them into attoparsec. The smallest CSV file that produces the problem is about 2MB.
Due to time constraints I don't have the spare cycles to track this down further right now. Since I can lock cabal down to the last working version of attoparsec it's not a show stopper for me. That said, if someone else wants to chase this down I'd be happy to try out patches or provide more detail.
I frequently find myself wanting/needing to parse complex fields of a fixed size, which is currently rather tricky using attoparsec. For example a 20 byte field consisting of a zero padded ASCII string. This field can be easily described as manyTill anyWord8 (word8 0) <* many (word8 0), but taking care of the size limiting is non-trivial.
I would like to have a combinator like the one proposed in the issue name, with the semantics that it tries to match its argument parser completely against the next N bytes. An example implementation (although probably rather inefficient) would be:
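import Prelude hiding (take)
import Control.Applicative (empty)
import Data.Attoparsec.ByteString (Parser, endOfInput, parseOnly, take)

-- Run p against exactly the next i bytes; fail unless p consumes all of them.
fixed :: Int -> Parser a -> Parser a
fixed i p = do
  intermediate <- take i
  case parseOnly (p <* endOfInput) intermediate of
    Left _  -> empty
    Right x -> return x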
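A usage sketch for the 20-byte zero-padded field described above. It assumes the proposed fixed combinator (which is not part of attoparsec) and repeats its definition so the example compiles on its own; paddedField and sampleInput are hypothetical names introduced only for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (empty, many)
import qualified Data.Attoparsec.ByteString as A
import qualified Data.ByteString as B
import Data.Word (Word8)

-- Proposed combinator (not part of attoparsec): run p on exactly the next n bytes.
fixed :: Int -> A.Parser a -> A.Parser a
fixed n p = do
  chunk <- A.take n
  case A.parseOnly (p <* A.endOfInput) chunk of
    Left _  -> empty
    Right x -> return x

-- Hypothetical field parser: bytes up to the first NUL, then the NUL padding.
paddedField :: A.Parser [Word8]
paddedField = A.manyTill A.anyWord8 (A.word8 0) <* many (A.word8 0)

-- Hypothetical input: "hello" padded with NULs to 20 bytes, then trailing data.
sampleInput :: B.ByteString
sampleInput = B.concat ["hello", B.replicate 15 0, "rest"]

main :: IO ()
main = do
  -- The field parser only ever sees the first 20 bytes; "rest" is left alone.
  print (A.parseOnly (fixed 20 paddedField) sampleInput)
  -- Fewer than 20 bytes available: A.take fails, so fixed fails as well.
  print (A.parseOnly (fixed 20 paddedField) "short")
```

Because the inner parser runs through parseOnly on a strict slice, it cannot ask for more input or read past the field boundary, which is the size limiting the issue asks for. The cost is a second parser run per field.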
I would like get and put to be exposed, so I can implement my own primitive parsers.
For example, I want a variation of takeWhile which also returns the character that breaks the predicate. I can't implement that without get and put.
As you might know, there is some confusion around the behavior of the string parser, specifically in the following situation:
parse (string "wombat" <|> string "foo") "foo"
returns a Partial. This is due to the implementation of string:
string :: ByteString -> Parser ByteString
string s = takeWith (B.length s) (==s)
The solution proposed by various people on StackOverflow is to simply signal to Attoparsec when the end of the input has been reached, so it knows that the first branch, string "wombat", will never complete. This is undesirable in situations where you simply do not know whether you have reached the end of the input (specifically in network programming), and thus a safe alternative is desired.
I propose the following safe alternative of string, string':
string' :: ByteString -> Parser ByteString
string' = fmap pack . mapM (satisfy . (==)) . unpack
This will fail at the first character that makes the first branch impossible, and gives the desired result in many situations. The only situation in which this will still return a Partial is when you do (string "foo" <|> string "foobar"), but at least that one is more obvious.
Any comments?
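A minimal sketch of how the byte-at-a-time variant behaves under incremental parsing, assuming Data.Attoparsec.ByteString; describe is a hypothetical helper added only to print results. The second line of main puts the longer literal first, which is an arrangement where this variant still has to wait for more input.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import qualified Data.Attoparsec.ByteString as A
import qualified Data.ByteString as B

-- Byte-at-a-time variant from the proposal, repacked into a ByteString.
string' :: B.ByteString -> A.Parser B.ByteString
string' = fmap B.pack . mapM (A.satisfy . (==)) . B.unpack

-- Hypothetical helper: summarise a result as a plain string.
describe :: A.Result B.ByteString -> String
describe (A.Done rest r)  = "Done " ++ show r ++ ", leftover " ++ show rest
describe (A.Partial _)    = "Partial"
describe (A.Fail _ _ err) = "Fail: " ++ err

main :: IO ()
main = do
  -- 'w' /= 'f' rejects the first branch at once, so "foo" matches without more input.
  putStrLn (describe (A.parse (string' "wombat" <|> string' "foo") "foo"))
  -- "foo" is a proper prefix of "foobar", so the first branch asks for more input.
  putStrLn (describe (A.parse (string' "foobar" <|> string' "foo") "foo"))
```

Matching one byte at a time trades the speed of a single length-based comparison for the ability to reject a branch as soon as a byte disagrees.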
For most of our Yesod related usage of attoparsec & parsec a failure to parse indicates something very bad. For example, in some cases we would be parsing at compile time and a failure to parse would be a compilation error. At this point we don't care about performance at all, just reporting the error as easily as possible. So I am wondering if we can make an alternative parsing mode that is slow but as detailed as possible in parse failure information. For our cases we could do a fast parse first, and then if there is a parse failure, re-parse with the slower version to give the best possible error information.
If this makes sense for attoparsec, could you point out an implementation strategy?
The attoparsec-text package provides type-specialized versions of (*>) and (<*). These make it possible to use the IsString instances of Parser to write parsers using nice syntax like:
"Shoe size: " .*> decimal
Currently, code using attoparsec-text with that technique cannot be ported to attoparsec without serious breakage.
In Data.Attoparsec.Text:
-- | Type-specialized version of '*>' for 'Text'.
(.*>) :: Applicative f => f Text -> f a -> f a
(.*>) = (*>)
-- | Type-specialized version of '<*' for 'Text'.
(<*.) :: Applicative f => f a -> f Text -> f a
(<*.) = (<*)
In Data.Attoparsec.ByteString:
-- | Type-specialized version of '*>' for 'ByteString'.
(.*>) :: Applicative f => f ByteString -> f a -> f a
(.*>) = (*>)
-- | Type-specialized version of '<*' for 'ByteString'.
(<*.) :: Applicative f => f a -> f ByteString -> f a
(<*.) = (<*)
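A small usage sketch of the requested syntax for Text, under the assumption that the operator is defined locally rather than exported by the library. The local (.*>) fixes its left operand to f Text so that the string literal can resolve through the IsString instance of Parser; shoeSize is a hypothetical name.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Attoparsec.Text as A
import Data.Text (Text)

-- Local stand-in for the requested operator (an assumption, not the library's export).
-- Pinning the left operand to 'f Text' lets the literal be read as a Parser Text.
(.*>) :: Applicative f => f Text -> f a -> f a
(.*>) = (*>)
infixl 4 .*>

-- Skip the literal prefix, then read a decimal number.
shoeSize :: A.Parser Int
shoeSize = "Shoe size: " .*> A.decimal

main :: IO ()
main = print (A.parseOnly shoeSize "Shoe size: 42")
```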