Part of the discussion for Helm 3 at the moment is over the introduction of several approaches to make indexing-related tasks more efficient (hierarchical index files, optional server-side search, etc.). My current focus is optimizing the index file parsing step. To this end, I've benchmarked a number of parsers, following existing work by @mattfarina. My current results are available here.
I've created this issue to share with the community my progress and my intended path forward, and to ask for feedback. I'd like to make sure my approach is on the right track, and that I'm not unknowingly duplicating someone else's work, or vice versa.
## Approach
My approach is derived from three core expectations:
- Indexes should be saved as JSON files, not YAML files
  - YAML is complicated, and parsing it is inherently expensive.
  - JSON is a very simple format, yet it's still more than sufficient to properly encode indexing data, with no change to the structure of the data.
  - YAML has no advantage over JSON in this case. The primary reason to use YAML over JSON is that it's easier for humans to read and edit, but humans don't directly read or edit index files.
  - Since JSON is a subset of YAML, a newer JSON index could technically remain backwards-compatible with older YAML-based versions of Helm.
- Index files should be streamed from disk
  - Currently, index files are loaded into memory in their entirety before being parsed. Streaming the text data into the parser a block at a time is much more space-efficient.
  - More often than not, index files are parsed simply to look up an attribute or check whether a particular chart exists. With streaming, these simple lookups can run in O(1) space.
  - Because YAML supports anchors and references, this kind of efficient streaming isn't possible. If JSON is used instead, that problem disappears.
- All else being equal, allocations should be minimized where possible
  - Parsers tend to create a lot of temporary intermediate data during conversion. If this data is allocated on the heap, it can generate a lot of garbage.
  - Go's GC is non-generational, meaning it's less likely to immediately collect short-lived allocations. These allocations can build up quickly, hurting both time and space efficiency.
  - Minimizing allocations (for example, by keeping intermediate data on the stack instead of the heap) can reduce the GC workload, increasing efficiency.
At this point, my goal was to compare and judge a number of existing parsers based on these three expectations.
## Results
This is a breakdown of the benchmarks, which can be found here. Index files contain 500,000 charts, and are 200 - 350 MB in size.
`ghodss/yaml` and `yaml.v2` are the only YAML parsers being tested, and they have the worst performance by far, with 20 - 40 seconds and 4 - 8 GB to parse an index.
`encoding/json` (the standard library JSON parser) fares a little better, with 6 seconds and 1 GB to parse an index. It's interesting to note that the streaming version has worse time and space efficiency than the non-streaming version.
`json-iterator/go` and `a8m/djson` improve time, but not space, with 2 seconds and 1 GB for an index.
`buger/jsonparser` seems very promising, with 1 - 1.5 seconds and 300 - 800 MB for an index. However, it doesn't support streaming.
`darthfennec/jsonparser` is a customized fork of `buger/jsonparser`, modified to support streaming. I hacked this together as a PoC for this benchmark, because I wanted to see what kind of practical improvements streaming could make to space efficiency. The results are 2 - 3 seconds and 120 MB for an index. This is twice as slow as the original `buger/jsonparser`, but it has the best space efficiency by far.
Allocation count appears to be roughly proportional to CPU and RAM usage, as expected. `buger/jsonparser` claims that zero allocations are made in the common case, although this depends on the lengths of the JSON strings being parsed. In practice, it does appear to allocate less than the other parsers.
## Next Steps
`darthfennec/jsonparser` is unstable and will not function as-is outside the benchmark environment. I suppose that's what happens when I try to retrofit streaming into a library that really isn't designed to support it. I expect the drop in speed could have been mostly avoided if I had been more careful with the modifications; I don't think it's inherent to the addition of streaming capabilities. Even so, I think the clear space advantage of this approach makes it worth looking into further.
I don't think fixing `darthfennec/jsonparser` is the right move. Streaming is very much in conflict with the design of `buger/jsonparser`, so I don't believe this hack could ever be particularly effective. Instead, I've started writing a new JSON parser from scratch, which I expect will benchmark at least as well as `darthfennec/jsonparser`, without sacrificing flexibility and consistency.
When the new parser is far enough in development, I'll add it to the benchmark to see where it stands.
EDIT: The new parser is up