Comments (5)
Memory management of what's being parsed is left up to the range being parsed and is not the concern of dxml. The parser will operate on any forward range of char, wchar, or dchar. I thought that the documentation was clear about that. As such, if you have a forward range of characters over a file which does not read in the entire file at once, you can parse the file without reading it all into memory (though obviously, any parts you keep around will then stay in memory).
However, if you're going to use parseDOM, then any portion of the document that it parses is going to result in memory allocations in order to build the DOM regardless of the underlying range being parsed. That's going to be true of any DOM parser since the whole point of a DOM parser is to build the document tree in memory.
from dxml.
I see, but do you have any preferred solution? Phobos does not seem to provide any means to represent a file as a forward range without loading the whole file.
A DOM of course needs to allocate some structs that represent the tree, but it still uses slices of the original range to hold the stored data. The point is to minimize allocations, not to eliminate them.
Upd: I've found that MmFile can remap portions of the file on demand when the window argument is given. It still needs a little trickery to make it a forward range, but it might work.
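For what it's worth, here is a minimal sketch of that kind of trickery, using only Phobos. The wrapper name MmFileChars, the demo file name, and the 64 KiB window are my own assumptions, not anything from dxml or from the comments above; the window has to be a multiple of the system's allocation granularity, and 64 KiB happens to satisfy that on the common platforms. Every element access goes through MmFile's opIndex, which remaps the window when needed, so this favors low memory use over speed.

```d
import std.mmfile : MmFile;
import std.range.primitives : isForwardRange;

/// Hypothetical wrapper: a forward range of char code units over an MmFile.
/// Copies of the struct share the same MmFile and differ only in position,
/// which is what makes save cheap.
struct MmFileChars
{
    private MmFile file;
    private ulong pos;
    private ulong end;

    this(MmFile file)
    {
        this.file = file;
        this.end = file.length;
    }

    @property bool empty() { return pos >= end; }

    // opIndex remaps the mapped window on demand when a window size was given.
    @property char front() { return cast(char) file[pos]; }

    void popFront() { ++pos; }

    @property MmFileChars save() { return this; }
}

static assert(isForwardRange!MmFileChars);

void main()
{
    import std.algorithm.searching : count;
    import std.file : remove, tempDir, write;
    import std.path : buildPath;

    // Write a small demo file so the example is self-contained.
    auto path = buildPath(tempDir, "mmfile_range_demo.xml");
    write(path, "<root><a/><a/></root>");
    scope(exit) remove(path);

    // Read-only mapping; size 0 means "use the file's size".
    auto mmf = new MmFile(path, MmFile.Mode.read, 0, null, 64 * 1024);
    scope(exit) destroy(mmf); // unmap before the file is removed

    auto chars = MmFileChars(mmf);
    auto saved = chars.save;
    chars.popFront();

    // The saved copy is unaffected by advancing the original.
    assert(saved.front == '<');
    assert(chars.front == 'r');
    assert(saved.count('<') == 4);
}
```

Since the element type is char, a range like this should in principle be usable with parseXML, but I have not tried that here, so treat it as an assumption.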
from dxml.
As dxml uses the standard range mechanism, it bypasses the issue in the sense that it doesn't provide the range that reads in the file efficiently. It assumes that it already exists. Unfortunately, reading from a file efficiently is kind of the Achilles' heel of ranges in that every time you call save, you then need that range to always be valid for as long as it exists, so in order to buffer file access, you could need to have an arbitrarily large number of buffers, and it gets complicated. It's something that needs to be solved, but Phobos has largely ignored the issue (probably because it's complicated and no one really wants to take the time to write it). It does have stuff like std.stdio's byLine or byChunk to read lines or chunks efficiently, but that translates to a range of bytes or characters only awkwardly, because what you're really getting then is ranges of lines or chunks (and since those algorithms reuse their buffers, it gets even more complicated). Actually, properly buffering chunks of a file and reference-counting that with the range API to properly support save gets complicated fast, and in a lot of cases, simply reading the file in pieces rather than as a forward range, or reading it in all at once, avoids the whole issue (though that obviously isn't always an option).
Personally, if I had to read a file in as a forward range and couldn't read it all in at once, I'd probably just use std.mmfile rather than trying to deal with buffering everything, since that gets really complicated. There's always Steven's https://github.com/schveiguy/iopipe, but it's still a work in progress, and as I understand it, he's had to work around the range API on some level precisely because it's so poorly suited to reading in a file efficiently, so I don't know exactly how that's going to work with the range API. I'm aware of iopipe but have to spend time studying it.
I wrote dxml the way I did so that it could work with a range that read over a file without reading it all in but without trying to actually solve that problem. By just operating on ranges, it pushes that entire problem off to ranges, which doesn't entirely solve the problem, but it does mean that as long as the problem is solved with ranges in general, it's solved for dxml.
from dxml.
Yeah, and byLine/splitLines aren't good for parsers because they lose vital line ending information. XML CDATA sections are most likely the problem for XML.
I did a range over mmap files a while back.
https://github.com/JesseKPhillips/libosm/blob/master/source/util/filerange.d
from dxml.