
parquet-go's People

Contributors

alex, alrs, antewall, aohua, arnezsng, byxiangfei, daidokoro, davigust, garugaru, hangxie, hoodie-huuuuu, idubinskiy, jancona, jinzhu, jkevlin, klauspost, mrpowers, nohomey, nrwiersma, pkf, poopoothegorilla, rajiv-k, shsing2000, stephane-moreau, visheratin, wxlarg, xanderflood, xitongsys, zaneli, zolstein-clumio


parquet-go's Issues

Reading all columns

What is a good way of iterating over all rows, while grabbing all column values?

My code needs to work with all parquet files, regardless of structure. That is, I do not have a struct type that I can pass into ParquetReader.Read(). I noticed that even gleam fetches the data column by column.

Thanks in advance!
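
For reference, a minimal sketch of that column-by-column approach, based on the NewParquetColumnReader / ReadColumnByIndex calls that appear in the other issues on this page; treat the exact signatures as assumptions for your revision of the library. No struct type is needed:

package main

import (
	"fmt"
	"log"

	"github.com/xitongsys/parquet-go/ParquetFile"
	"github.com/xitongsys/parquet-go/ParquetReader"
)

func main() {
	fr, err := ParquetFile.NewLocalFileReader("any.parquet")
	if err != nil {
		log.Fatal("open failed: ", err)
	}
	defer fr.Close()

	pr, err := ParquetReader.NewParquetColumnReader(fr, 4)
	if err != nil {
		log.Fatal("new column reader failed: ", err)
	}
	defer pr.ReadStop()

	numRows := int(pr.GetNumRows())
	numCols := int(pr.SchemaHandler.GetColumnNum())
	for col := 0; col < numCols; col++ {
		// Read the whole column into a generic slice; no Go struct is required.
		data := make([]interface{}, numRows)
		pr.ReadColumnByIndex(col, &data)
		fmt.Println("column", col, "values read:", len(data))
	}
}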

Possible bug on reading some parquet file

Hey, first, thanks for putting effort into developing a Go-native parquet tool. It is extremely helpful if we can use this cool library.
When I try some sample parquet files, I seem to get the following error (see the code and test files below; hopefully it is a user error).

panic: runtime error: makeslice: len out of range

goroutine 1 [running]:
github.com/xitongsys/parquet-go/Layout.ReadPage(0xc4200f6a60, 0xc420112000, 0xc4200968f0, 0x12cd2a4, 0x8, 0xf)
	/Users/zhouyang/test/go/src/github.com/xitongsys/parquet-go/Layout/Page.go:426 +0x25f4
github.com/xitongsys/parquet-go/ParquetReader.(*ColumnBufferType).ReadPage(0xc420118070, 0xc420057e50, 0xc420080a90)
	/Users/zhouyang/test/go/src/github.com/xitongsys/parquet-go/ParquetReader/ColumnBuffer.go:89 +0x6b
github.com/xitongsys/parquet-go/ParquetReader.(*ColumnBufferType).ReadRows(0xc420118070, 0x1, 0xc420057e50, 0xf)
	/Users/zhouyang/test/go/src/github.com/xitongsys/parquet-go/ParquetReader/ColumnBuffer.go:120 +0x2e
github.com/xitongsys/parquet-go/ParquetReader.(*ParquetReader).ReadColumnByPath(0xc42007fb60, 0x12cd2a4, 0x8, 0xc4200f7520)
	/Users/zhouyang/test/go/src/github.com/xitongsys/parquet-go/ParquetReader/ParquetReader.go:209 +0x13a
main.main()
	/Users/zhouyang/test/go/src/github.com/xitongsys/parquet-go/example/my_column_read.go:37 +0x2aa
exit status 2

Test code:

fr, _ := ParquetFile.NewLocalFileReader("alltypes_plain.snappy.parquet")
pr, err := ParquetReader.NewParquetColumnReader(fr, 1)
if err != nil {
	log.Println("Failed new reader: ", err)
}
fmt.Println(pr.SchemaHandler)
num := pr.GetNumRows()
fmt.Println("number of row: ", num)

data := make([]interface{}, 1)
pr.ReadColumnByIndex(0, &data)
//pr.ReadColumnByPath("bool_col", &data)
log.Println("data: ", data)

Links to test files (which are simple "official" parquet files):
https://s3-us-west-1.amazonaws.com/ascend-io-dev-zzhang/alltypes_plain.snappy.parquet
https://s3-us-west-1.amazonaws.com/ascend-io-dev-zzhang/nation.gzip.parquet

Thanks for the reply 👍

Production Use

Hi @xitongsys, it seems like this is still under active development; do you think it's ready for production use? We'd like to replace some of our Spark pipelines :)

Issues with String/UTF8 type with AWS Athena (Presto)

It seems the parquet file produced has issues with strings being queried directly in AWS Athena.

The issue seems to be identical to that of the Python library fastparquet dask/fastparquet#150 - which seemingly was fixed in this pull request (https://github.com/dask/fastparquet/pull/179/files)

Basically, querying the string directly, i.e.

SELECT * FROM "default"."parquet_go" where name = 'Student Name'

will produce no results. However, the following queries will:

SELECT * FROM "default"."parquet_go" where name like 'Student Name%'
SELECT * FROM "default"."parquet_go" where varchar(name) = 'Student Name'

It seems to be a character encoding issue?

I produced the Athena table based on the example parquet file produced by csv_write.go, with one modification: the addition of ph.NameToLower(), as Athena cannot cope with uppercase characters in column names.

The table definition in Athena is:

create external table if not exists `default.parquet_go` (
  `name` STRING,
  `age` INT,
  `id` BIGINT,
  `weight` DOUBLE,
  `sex` BOOLEAN
  )
STORED AS PARQUET
LOCATION 's3://<bucket>/parquet_go/'

The parquet file is stored within the S3 directory i.e. s3://<bucket>/parquet_go/csv.parquet

Hardcoded RowGroupSize

Provide a way to choose the row group size instead of hardcoding it in ParquetWriter and JSONWriter (currently hardcoded to 128MB).
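
As a possible interim workaround, the flat-file example further down this page sets an exported pw.RowGroupSize field right after constructing the writer; assuming your revision exposes that field, a sketch would be:

// Sketch assuming the exported RowGroupSize field used in the flat-file
// example elsewhere on this page; Student and np=4 are illustrative.
pw, err := ParquetWriter.NewParquetWriter(fw, new(Student), 4)
if err != nil {
	log.Println("Can't create parquet writer", err)
	return
}
pw.RowGroupSize = 32 * 1024 * 1024 // 32MB instead of the hardcoded 128MB default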

A JSONWriter with OPTIONAL field got unexpected output

Hey, @xitongsys
I am doing more tests on the JSON writer, and found a bug: the output doesn't match what is expected. To be more specific, let's have 20 columns, col_0, col_1, col_2, col_3, ... All are of type optional UTF8. And we have two records, rec1 and rec2: rec1 = {nil, hello, nil, hello, ...} and rec2 = {hello, nil, hello, nil, ...}. Basically, rec1 and rec2 alternate (see the code for details).
But the output doesn't match what is expected. For example, row 1 should have col_0, col_2, ..., col_16, col_18, but it turns out it only has col_0, col_2, ..., col_8.
Thanks for the help! I actually found another bug, but I'm working on constructing an easy way to reproduce it. I will create a separate ticket for that.
:)

(screenshot attached, 2018-02-13)

package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/xitongsys/parquet-go/ParquetFile"
	"github.com/xitongsys/parquet-go/ParquetWriter"
)

func main() {
	var err error
	numCols := 20
	numRows := 1000
	fields := []interface{}{}
	for i := 0; i < numCols; i++ {
		fields = append(fields, map[string]string{"Tag": fmt.Sprintf("name=col_%d, type=UTF8, repetitiontype=OPTIONAL", i)})
	}
	m := map[string]interface{}{}
	m["Fields"] = fields
	m["Tag"] = "name=parquet-go-root"
	bs, err := json.Marshal(m)
	if err != nil {
		log.Println("json marshal error: ", err)
		return
	}
	md := string(bs)

	fmt.Println("\n\n\n", md, numRows)

	//write
	fw, err := ParquetFile.NewLocalFileWriter("json_big111.parquet")
	if err != nil {
		log.Println("Can't create file", err)
		return
	}
	pw, err := ParquetWriter.NewJSONWriter(md, fw, 4)
	if err != nil {
		log.Println("Can't create json writer", err)
		return
	}
	r1 := map[string]string{}
	r2 := map[string]string{}
	for i := 0; i < numCols; i++ {
		if i%2 == 0 {
			r1[fmt.Sprintf("col_%d", i)] = "helloWorldhelloWorldhelloWorldhelloWorldhelloWorldhelloWorldhelloWorldhelloWorldhelloWorldhelloWorld"
		} else {
			r2[fmt.Sprintf("col_%d", i)] = "helloWorldhelloWorldhelloWorldhelloWorldhelloWorldhelloWorldhelloWorldhelloWorldhelloWorldhelloWorld"
		}
	}

	rr1, err := json.Marshal(r1)
	if err != nil {
		log.Println("json marshal error: ", err)
		return
	}

	rr2, err := json.Marshal(r2)
	if err != nil {
		log.Println("json marshal error: ", err)
		return
	}

	rec1 := string(rr1)
	rec2 := string(rr2)

	fmt.Println(rec1)
	fmt.Println(rec2)

	for i := 0; i < numRows; i++ {
		if i%100000 == 0 {
			fmt.Println("writing row: ", i*2)
		}
		if err = pw.Write(rec1); err != nil {
			log.Println("Write error", err)
		}
		if err = pw.Write(rec2); err != nil {
			log.Println("Write error", err)
		}
	}

	if err = pw.WriteStop(); err != nil {
		log.Println("WriteStop error", err)
	}
	log.Println("Write Finished")
	fw.Close()

}

UNKNOWN FieldRepetitionType is not allowed

Firstly, I want to thank you for this awesome library; it saves me a lot of time. But I get one problem when I upload the generated parquet file to Google Cloud BigQuery; the error message is "UNKNOWN FieldRepetitionType is not allowed". I'm using the CSVWriter to write the file.

func prepareParquetWriter(md *[]string, parquetType string, columnNames []string, index int) {
	if len(*md) < len(columnNames) {
		*md = append(*md, fmt.Sprintf("name=%s, type=%s, repetitontype=OPTIONAL, encoding=%s", columnNames[index], parquetType, "PLAIN"))
	}
}

Do you have any idea about this issue? Any suggestions on how I can verify if the parquet file that I generated is correct? Thank you.

Parquet column reader to datatable

Hi,

Is there a way I can select a particular column from a parquet file and create a temp table in Go code, like a datatable?

Note: I tried the column reader and am able to get the column values, but I need to form a table with specific column values from the parquet file.

Let parquet writer ignore a struct field

I am not sure if we have a method for this already.
It would be nice if we could ignore a certain field, like we can for JSON encoding with json:"-".
Suppose my source data is something like {"timestamp":1.485672999E9}, and my output parquet requires the timestamp field to be int64.
Since parquet-go does not accept custom types during write, a custom type unmarshaller is out of the question. To get around this issue I need to use something like the following:

type myTime struct {
	TS        float64 `json:"timestamp"`
	ParquetTS int64   `parquet:"name=timestamp, type=INT64"`
}

func (t *myTime) convert() {
	t.ParquetTS = int64(t.TS)
}

The problem with this approach is that parquet-go does not offer any way for me to ignore fields. Therefore, when I try to initialize the parquet writer using the struct above, the following error is generated:
Can't create parquet writer runtime error: index out of range
Currently the only way to circumvent this issue is to create a second copy of the myTime struct without the TS field, which is not very scalable or maintainable (sketched below for reference).
Can we implement some functionality that allows users to ignore fields when writing parquet files?
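
For reference, a minimal sketch of the two-struct workaround described above (all names are illustrative):

// Source-side struct: matches the incoming JSON.
type sourceTime struct {
	TS float64 `json:"timestamp"`
}

// Write-side struct: contains only the fields parquet-go should see.
type parquetTime struct {
	TS int64 `parquet:"name=timestamp, type=INT64"`
}

// toParquet converts a decoded source record into the write-side shape.
func toParquet(s sourceTime) parquetTime {
	return parquetTime{TS: int64(s.TS)}
}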

Provide format agnostic interface of writer

Hello, it would be great if you could provide a format-agnostic interface that could be used in the application regardless of the target format (parquet, csv, json), something like:

type Writer interface {
    Write(interface{})
    Flush(bool)
    WriteStop()
}

In our use case we decide the target format (csv/parquet/json) with application parameters, and having a facade that returns a Writer to the application would be great (see the facade sketch after the snippet below).

Interface compliance (e.g. for ParquetWriter) would look like:

type ParquetWriter struct {
    Writer
    ...
}

type CSVWriter struct {
    Writer
    ...
}

type JSONWriter struct {
    Writer
    ...
}
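
A hypothetical facade sketch along those lines, assuming the proposed Writer interface above; none of this exists in parquet-go today, and each factory would wrap the real NewParquetWriter / NewCSVWriter / NewJSONWriter call once those writers satisfy the interface:

package writerfacade

import "fmt"

// Writer is the format-agnostic interface proposed above.
type Writer interface {
	Write(interface{})
	Flush(bool)
	WriteStop()
}

// writerFactory is a hypothetical constructor for one concrete format.
type writerFactory func() (Writer, error)

// NewWriter selects a concrete writer from an application parameter and
// returns only the format-agnostic interface to the caller.
func NewWriter(format string, factories map[string]writerFactory) (Writer, error) {
	f, ok := factories[format]
	if !ok {
		return nil, fmt.Errorf("unsupported format: %s", format)
	}
	return f()
}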

Amazon S3 support?

I need to read and write Parquet files to Amazon S3. I have file writes working by creating a MemFile and streaming the results to S3. I might be able to use a similar approach for reads, except that there's currently no NewMemFileReader() (a possible in-memory reader sketch follows below).

It might also be possible to write an S3File implementation (similar to HdfsFile). Is that something you might consider including? If not, I'll work on solving my problem inside my app.
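
For the read side, a minimal sketch of an in-memory, read-only ParquetFile backed by bytes.Reader, written against the interface excerpt quoted in the "ParquetFile not possible to implement for S3" issue further down (Seek(offset int, pos int), Open, Create, ...); the type and constructor names are illustrative and the exact interface may differ between revisions:

package memfile

import (
	"bytes"
	"errors"

	"github.com/xitongsys/parquet-go/ParquetFile"
)

// memFile is a read-only, in-memory ParquetFile over a byte slice.
type memFile struct {
	b []byte
	r *bytes.Reader
}

// NewMemFileReader wraps an already-downloaded byte slice (e.g. an S3 object body).
func NewMemFileReader(b []byte) ParquetFile.ParquetFile {
	return &memFile{b: b, r: bytes.NewReader(b)}
}

func (m *memFile) Read(p []byte) (int, error) { return m.r.Read(p) }

func (m *memFile) Seek(offset int, pos int) (int64, error) {
	return m.r.Seek(int64(offset), pos)
}

func (m *memFile) Write(p []byte) (int, error) {
	return 0, errors.New("memFile is read-only")
}

func (m *memFile) Close() {}

// Open may be called again internally (e.g. one handle per column reader), so
// return an independent reader over the same bytes to avoid sharing an offset.
func (m *memFile) Open(name string) (ParquetFile.ParquetFile, error) {
	return NewMemFileReader(m.b), nil
}

func (m *memFile) Create(name string) (ParquetFile.ParquetFile, error) {
	return nil, errors.New("memFile is read-only")
}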

WriteString truncates string on first space character

I'm using the ph.WriteString function and came across an issue with strings that contain spaces. Effectively, the string is truncated when written into the parquet file: only the part of the string before the first space (" ") character ends up in the parquet file.

This can easily be replicated by modifying the example file csv_write.go with this code:

for i := 0; i < num; i++ {
		data := []string{
			"WriteString StudentName ",
			fmt.Sprintf("%d", 20+i%5),
			fmt.Sprintf("%d", i),
			fmt.Sprintf("%f", 50.0+float32(i)*0.1),
			fmt.Sprintf("%t", i%2 == 0),
		}
		rec := make([]*string, len(data))
		for j := 0; j < len(data); j++ {
			rec[j] = &data[j]
		}
		ph.WriteString(rec)

		data2 := []interface{}{
			UTF8("Write StudentName"),
			INT32(20 + i*5),
			INT64(i),
			FLOAT(50.0 + float32(i)*0.1),
			BOOLEAN(i%2 == 0),
		}
		ph.Write(data2)
	}

produces the following output on parquet-tools head csv.parquet

Name = WriteString
Age = 20
Id = 0
Weight = 50.0
Sex = true

Name = Write StudentName
Age = 20
Id = 0
Weight = 50.0
Sex = true

Name = WriteString
Age = 21
Id = 1
Weight = 50.1
Sex = false

Name = Write StudentName
Age = 25
Id = 1
Weight = 50.1
Sex = false

Name = WriteString
Age = 22
Id = 2
Weight = 50.2
Sex = true

As you can see, "StudentName" is missing from the "WriteString" output.

More performance improvements?

hi @xitongsys
I found one more performance improvement.

https://github.com/xitongsys/parquet-go/blob/master/Marshal/Marshal.go#L78

func (p *ParquetStruct) Marshal(node *Node, nodeBuf *NodeBufType) []*Node {
        numField := node.Val.Type().NumField()
        nodes := make([]*Node, numField)
        for j := 0; j < numField; j++ {
                tf := node.Val.Type().Field(j)
                name := tf.Name
                newNode := nodeBuf.GetNode()
                newNode.PathMap = node.PathMap.Children[name]
                //newNode.Val = node.Val.FieldByName(name)
                id := []int{j}         // <-----
                newNode.Val = node.Val.FieldByIndex(id)         // <-----
                newNode.RL = node.RL
                newNode.DL = node.DL
                nodes[j] = newNode
        }
        return nodes
}

It seems FieldByName takes time.
In my case, CPU usage and time went down by about 5%.

What do you think?
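
For reference, a standalone micro-benchmark sketch (standard library only, not parquet-go code) comparing the two reflect lookups; run it with go test -bench .:

package marshal_test

import (
	"reflect"
	"testing"
)

type row struct{ A, B, C int }

// BenchmarkFieldByName looks the field up by its name on every iteration.
func BenchmarkFieldByName(b *testing.B) {
	v := reflect.ValueOf(row{})
	for i := 0; i < b.N; i++ {
		_ = v.FieldByName("C")
	}
}

// BenchmarkFieldByIndex uses a precomputed index path instead.
func BenchmarkFieldByIndex(b *testing.B) {
	v := reflect.ValueOf(row{})
	idx := []int{2}
	for i := 0; i < b.N; i++ {
		_ = v.FieldByIndex(idx)
	}
}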

Possible bug : panic when closing empty parquet file

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x6b5751]

goroutine 6 [running]:
testing.tRunner.func1(0xc4201001e0)
	/usr/lib/go/src/testing/testing.go:711 +0x2d2
panic(0x71c9e0, 0x947660)
	/usr/lib/go/src/runtime/panic.go:491 +0x283
github.com/xitongsys/parquet-go/ParquetWriter.(*ParquetWriter).Flush(0xc420110090, 0xc420144201)
github.com/xitongsys/parquet-go/ParquetWriter/ParquetWriter.go:226 +0xe11

steps to replicate:

import (
        "github.com/xitongsys/parquet-go/ParquetFile"
        "github.com/xitongsys/parquet-go/ParquetWriter"
)

type Schema struct {
        Stub *int32 `parquet:"name=stub, type=INT32, repetitiontype=optional"`
}

func crashMe() {
        fw, _ := ParquetFile.NewLocalFileWriter("/tmp/stub.snappy.parquet")
        pw, _ := ParquetWriter.NewParquetWriter(fw, new(Schema), 1)
        pw.Flush(true)
        pw.WriteStop()
}

Recommendation for reading a nested parquet file

Hey!
A few questions: I have a use case where users want to read a nested parquet file (with a map and a list in it). Is it possible to read this file without defining the Go structure a priori? (I looked at the example example/local_nested.go, but that shows reading a file with the Go structure pre-defined.)

I am thinking of using ReadColumnByIndex or ReadColumnByPath, but a naive way of doing that seems to raise an error (see the attached test file at the bottom):

for i := 0; i < int(pr.SchemaHandler.GetColumnNum()); i++ {
		data := make([]interface{}, 1)
		pr.ReadColumnByIndex(i, &data)
}

For ReadColumnByPath: this seems promising but still needs some tweaking. For example, I probably need to use pr.SchemaHandler.PathMap.Children["foo"].Children["bar"] to iterate through if "foo" is a map type.
I am wondering if there is any simple way to directly read the whole map object into an interface{}, and similarly for a list. Or could a couple of examples be added on the "best practice" for iterating over a map or list?
Maybe some examples to help understand Path would also be useful (in flat parquet data it "coincides" with schemaElement.GetName(), which can confuse users).
Thanks for all the help!

https://s3-us-west-1.amazonaws.com/ascend-io-dev-zzhang/composite.parquet

location of performance testing scripts

hi,

I am looking at your parquet-go repository (awesome work!) and was wondering how you ran your performance tests.

Specific questions are:

  • where can i find the schema for the test data (or generator file)?
  • were the tests run in multi-threaded or single thread mode?
  • performance of loading a single column (int, string, etc) vs. all columns

thanks!

GCSFile

Hi @aohua, thanks for your contributions! I opened a new issue for the GCS file.
The seek function is only used when reading parquet files. I have no experience with GCS and don't know if it supports seeking.
Some other cloud file systems, like S3, don't support it, so other projects like gleam just download the file from S3 to local disk.

Memory leak or inefficient use on ParquetColumnReader

Hey, @xitongsys
I believe there could be a more severe memory issue. When I try to read a ~100MB parquet file, I get OOM on a VM which has 5GB of memory. I dug in further and found there can be some memory leak, or at least inefficient use of memory.

The script (adapted from the original example/column_read.go) writes and reads a 100MB parquet file. The writing works well and takes 200MB of memory at its peak. But the reader takes 9000MB of memory at its peak (about 50x the file size). That may be serious: since in practice a parquet file can be a few GB or more, it is impossible for a machine to have 50 * 2GB = 100GB of memory to read it (I know the math is not accurate; it is just a qualitative metaphor). Hope we can ensure there is no memory leak.

For your convenience, I also profiled the (total) memory allocation; it may help give you some clues about where the memory goes (41GB is the cumulative size without counting GC, I believe).

Thanks for your help again!!! If you need me to run some experiments or help in any way, please don't hesitate to let me know :)

Code snippet:

package main

import (
	"flag"
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"

	"github.com/xitongsys/parquet-go/ParquetFile"
	"github.com/xitongsys/parquet-go/ParquetReader"
	"github.com/xitongsys/parquet-go/ParquetWriter"
)

type Student struct {
	Name   string           `parquet:"name=name, type=UTF8"`
	Age    int32            `parquet:"name=age, type=INT32"`
	Id     int64            `parquet:"name=id, type=INT64"`
	Weight float32          `parquet:"name=weight, type=FLOAT"`
	Sex    bool             `parquet:"name=sex, type=BOOLEAN"`
	Day    int32            `parquet:"name=day, type=DATE"`
	Class  []string         `parquet:"name=class, type=SLICE, valuetype=UTF8"`
	Score  map[string]int32 `parquet:"name=score, type=MAP, keytype=UTF8, valuetype=INT32"`
}

var memprofile = flag.String("memprofile", "", "write memory profile to this file")

func main() {
	flag.Parse()

	go func() {
		for {
			var m runtime.MemStats
			runtime.ReadMemStats(&m)
			log.Printf("zzzz Alloc = %v NumGC = %v\n", m.HeapAlloc/1024/1024, m.NumGC)
			time.Sleep(time.Millisecond * 500)
		}
	}()

	//write
	fw, _ := ParquetFile.NewLocalFileWriter("column.parquet")
	pw, _ := ParquetWriter.NewParquetWriter(fw, new(Student), 1)
	num := 5000000
	for i := 0; i < num; i++ {
		stu := Student{
			Name:   "StudentName",
			Age:    int32(20 + i%5),
			Id:     int64(i),
			Weight: float32(50.0 + float32(i)*0.1),
			Sex:    bool(i%2 == 0),
			Day:    int32(time.Now().Unix() / 3600 / 24),
			Class:  []string{"Math", "Physics", "Algorithm"},
			Score:  map[string]int32{"Math": int32(100 - i), "Physics": int32(100 - i), "Algorithm": int32(100 - i)},
		}
		pw.Write(stu)
	}
	pw.Flush(true)
	pw.WriteStop()
	log.Println("Write Finished")
	fw.Close()

	///read
	fr, _ := ParquetFile.NewLocalFileReader("column.parquet")
	pr, err := ParquetReader.NewParquetColumnReader(fr, 4)
	if err != nil {
		log.Println("Failed new reader", err)
	}
	num = int(pr.GetNumRows())
	ids := make([]interface{}, num)
	for i := 0; i < int(pr.SchemaHandler.GetColumnNum()); i++ {
		pr.ReadColumnByIndex(i, &ids)
		//fmt.Println(ids)
	}

	pr.ReadStop()
	fr.Close()

	if *memprofile != "" {
		f, err := os.Create(*memprofile)
		if err != nil {
			log.Fatal(err)
		}
		pprof.WriteHeapProfile(f)
		f.Close()
		return
	}

}

Snapshot for memprint log:

...
2018/01/17 23:03:12 zzzz Alloc = 6280 NumGC = 916
2018/01/17 23:03:13 zzzz Alloc = 6618 NumGC = 916
2018/01/17 23:03:13 zzzz Alloc = 6891 NumGC = 916
2018/01/17 23:03:14 zzzz Alloc = 7330 NumGC = 916
2018/01/17 23:03:14 zzzz Alloc = 7657 NumGC = 916
2018/01/17 23:03:15 zzzz Alloc = 8069 NumGC = 916
2018/01/17 23:03:15 zzzz Alloc = 8357 NumGC = 916
2018/01/17 23:03:16 zzzz Alloc = 8694 NumGC = 916
2018/01/17 23:03:16 zzzz Alloc = 8903 NumGC = 916
2018/01/17 23:03:18 zzzz Alloc = 9090 NumGC = 917
2018/01/17 23:03:19 zzzz Alloc = 5490 NumGC = 917
2018/01/17 23:03:19 zzzz Alloc = 5894 NumGC = 917
...

Memory profiling graph (you may need to decompress it first, since GitHub doesn't support SVG preview directly, sorry about that):

memprint.svg.zip

Buffered reading

I'm working on a situation where I'm memory restricted. Would you be willing to accept a PR to add another condition to ParquetReader.go at line 210 (link) so that we can read the file in batches?

Any recommendations on how to do this? (A batched-read sketch follows at the end of this issue.)

BTW, an example of using this to deal with Spark files would be great. In case anybody finds this: you need to add "spark_schema" to the path. But I recommend stepping through a program in a debugger to find your schema!

Thanks!!
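
For what it's worth, a minimal batched-read sketch that continues from the pr and Student used in the flat-file examples elsewhere on this page; it assumes pr.Read fills at most len(slice) rows per call, which is worth verifying on your revision:

// Read the file in fixed-size batches instead of one row (or the whole
// file) per slice; batchSize is an illustrative value.
const batchSize = 1000

total := int(pr.GetNumRows())
for remaining := total; remaining > 0; {
	n := batchSize
	if remaining < n {
		n = remaining
	}
	stus := make([]Student, n)
	if err := pr.Read(&stus); err != nil {
		log.Println("Read error", err)
		break
	}
	// process this batch of stus here
	remaining -= n
}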

Does this library have any sql-like syntax for querying?

Hi! I accidentally stumbled upon this library when I was looking for a data storage format for use in my pet-project. Before that, I did not work with parquet or apache spark. I'm trying to understand whether I can use this library as an embedded column oriented database, but I do not see a query interface similar to SQL. Or does this library provide only the ability to read and write?

maybe there is memory leak

I use this package to read some JSON files and convert them to parquet files in a loop.
But the memory of my program keeps growing.

Please wait for me to write a demo and put it here.

Generate structs from the schema stored in parquet file

Hey, thanks for the work on this lib. I'm new to Go. I'm thinking it might be useful to have a CLI with this project (or maybe external to it) to generate the Go code for the structs used to Read/Write a parquet file that I already have (and hence can extract the schema from). Is there any reason not to do this? Does that make sense within the scope of this project?

Best practice on writting large parquet file using JSONWriter

When I try to write a large parquet file (for example, ~1 billion rows), what is the best practice?

  1. Keep calling pw.Write(rec) for each record, a billion times, and call
pw.Flush(true)
pw.WriteStop()
log.Println("Write Finished")
fw.Close()

once when finished

  2. Or, for a certain pre-defined row group size, call Flush:
const rowGroupSize = N // N: some pre-defined number of rows per flush
var buffSize = 0

func AddRecord(rec string) error {
	pw.Write(rec)
	buffSize++
	if buffSize >= rowGroupSize {
		pw.Flush(true)
		buffSize = 0
	}
	// return some possible error
	return err
}

My questions: for 1), will JSONWriter respect RowGroupSize properly? The Flush function has a boolean flag; what is it used for (under what circumstances should a user call pw.Flush(true) vs. pw.Flush(false))?

Invalid Footer is being generated

Revision af926639f3143d273ffc59a1358f9bfc2b003a4d worked fine.
Revision d1071f96e033512f7638f6fd726f9fc8af7ab8ca generates files that are unreadable, with the following exception:

parquet.hadoop.ParquetFileReader.readAllFootersInParallel (parquet/hadoop/ParquetFileReader.java) failed to read the footer:
     Java::JavaIo::IOException:
       Could not read footer: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0

encoding=PLAIN_DICTIONARY + repetitiontype=OPTIONAL nil panic

When using an optional type with encoding PLAIN_DICTIONARY, the ParquetReader panics:

github.com/xitongsys/parquet-go/Layout.NewTableFromTable(...)
github.com/xitongsys/parquet-go/Layout/Table.go:9
github.com/xitongsys/parquet-go/Layout.(*Table).Pop(0x0, 0xffffffffffffffff, 0xc420154c70)
github.com/xitongsys/parquet-go/Layout/Table.go:68 +0x51
github.com/xitongsys/parquet-go/ParquetReader.(*ColumnBufferType).ReadRows(0xc42015a540, 0xffffffffffffffff, 0xc420138560, 0x18)
github.com/xitongsys/parquet-go/ParquetReader/ColumnBuffer.go:130 +0x67
github.com/xitongsys/parquet-go/ParquetReader.(*ParquetReader).Read.func1(0xc42001a900, 0xc42005b140, 0xc42013e180, 0x2, 0xc42012e6f0, 0xc42000e1d8, 0xc42000e1e0)
github.com/xitongsys/parquet-go/ParquetReader/ParquetReader.go:140 +0x1ad
created by github.com/xitongsys/parquet-go/ParquetReader.(*ParquetReader).Read
github.com/xitongsys/parquet-go/ParquetReader/ParquetReader.go:133 +0x2c5

Examples of schemas that fail:

With a tagged optional:

type Schema struct {
  Id      int64  `parquet:"name=id, type=INT64"`
  Count   int32  `parquet:"name=count, type=INT32, encoding=PLAIN_DICTIONARY"`
  Size    int32  `parquet:"name=size, type=INT32, repetitiontype=OPTIONAL, encoding=PLAIN_DICTIONARY"`
}

With an implicit optional (pointer type):

type Schema struct {
  Id      int64  `parquet:"name=id, type=INT64"`
  Count   int32  `parquet:"name=count, type=INT32, encoding=PLAIN_DICTIONARY"`
  Size    *int32 `parquet:"name=size, type=INT32, encoding=PLAIN_DICTIONARY"`
}

ParquetFile.ParquetFile should implement io.Seeker

ParquetFile.Seek() has signature Seek(offset int, pos int) (int64, error) while io.Seeker is Seek(offset int64, whence int) (int64, error). So ParquetFile can't be used when a Seeker or ReadSeeker is needed. If I do a PR, would you want to maintain the old signature for compatibility?
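
Until a signature change lands, a thin wrapper can bridge the gap. A minimal sketch (the adapter type is illustrative, not part of the library), with the caveat that the int conversion can truncate large offsets on 32-bit platforms:

package pfadapter

import (
	"io"

	"github.com/xitongsys/parquet-go/ParquetFile"
)

// seekerAdapter wraps a ParquetFile so it satisfies io.ReadSeeker: Read is
// promoted from the embedded interface, and the Seek below shadows the
// embedded Seek(int, int) with the io.Seeker signature.
type seekerAdapter struct {
	ParquetFile.ParquetFile
}

func (s seekerAdapter) Seek(offset int64, whence int) (int64, error) {
	// Caution: int(offset) may truncate on 32-bit platforms.
	return s.ParquetFile.Seek(int(offset), whence)
}

// Compile-time check that the adapter really is an io.ReadSeeker.
var _ io.ReadSeeker = seekerAdapter{}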

JsonWrite bug on int32/int64

Hey, @xitongsys
I found that the JSONWriter sometimes doesn't write the correct data into the file. For example:

package main

import (
	"fmt"
	"log"

	"github.com/xitongsys/parquet-go/ParquetFile"
	"github.com/xitongsys/parquet-go/ParquetWriter"
)

func main() {
	var err error
	md := `
    {
        "Tag":"name=parquet-go-root",
        "Fields":[
		    {"Tag":"name=name, type=UTF8, repetitiontype=OPTIONAL"},
		    {"Tag":"name=age, type=UINT_32"},
		    {"Tag":"name=id, type=INT64"}
        ]
	}
`

	//write
	fw, err := ParquetFile.NewLocalFileWriter("json.parquet")
	if err != nil {
		log.Println("Can't create file", err)
		return
	}
	pw, err := ParquetWriter.NewJSONWriter(md, fw, 4)
	if err != nil {
		log.Println("Can't create json writer", err)
		return
	}

	num := 10
	for i := 0; i < num; i++ {
		rec := `
            {
                "name":"%s",
                "age":12949613,
                "id":12949633
            }
        `

		rec = fmt.Sprintf(rec, "Student Name")
		fmt.Println(rec)
		if err = pw.Write(rec); err != nil {
			log.Println("Write error", err)
		}

	}
	if err = pw.WriteStop(); err != nil {
		log.Println("WriteStop error", err)
	}
	log.Println("Write Finished")
	fw.Close()

}
The data read back from the file is:

name = Student Name
age = 1
id = 1

name = Student Name
age = 1
id = 1

name = Student Name
age = 1
id = 1

name = Student Name
age = 1
id = 1

name = Student Name
age = 1
id = 1

This bug may be related to the marshaller, since local_flat.go does not have this problem. But it seems that local_flat.go doesn't handle UINT_32 very well either; for example, 4294967290 is in the valid range of UINT_32, but the output data is -6 instead of 4294967290.

package main

import (
	"log"
	"time"

	"github.com/xitongsys/parquet-go/ParquetFile"
	"github.com/xitongsys/parquet-go/ParquetReader"
	"github.com/xitongsys/parquet-go/ParquetWriter"
	"github.com/xitongsys/parquet-go/parquet"
)

type Student struct {
	Name   string  `parquet:"name=name, type=UTF8, encoding=PLAIN_DICTIONARY"`
	Age    uint32  `parquet:"name=age, type=UINT_32"`
	Id     int64   `parquet:"name=id, type=INT64"`
	Weight float32 `parquet:"name=weight, type=FLOAT"`
	Sex    bool    `parquet:"name=sex, type=BOOLEAN"`
	Day    int32   `parquet:"name=day, type=DATE"`
}

func main() {
	var err error
	fw, err := ParquetFile.NewLocalFileWriter("flat.parquet")
	if err != nil {
		log.Println("Can't create local file", err)
		return
	}

	//write
	pw, err := ParquetWriter.NewParquetWriter(fw, new(Student), 4)
	if err != nil {
		log.Println("Can't create parquet writer", err)
		return
	}

	pw.RowGroupSize = 128 * 1024 * 1024 //128M
	pw.CompressionType = parquet.CompressionCodec_SNAPPY
	num := 10
	for i := 0; i < num; i++ {
		stu := Student{
			Name:   "StudentName",
			Age:    uint32(4294967290),
			Id:     int64(123456789000000),
			Weight: float32(50.0 + float32(i)*0.1),
			Sex:    bool(i%2 == 0),
			Day:    int32(time.Now().Unix() / 3600 / 24),
		}
		if err = pw.Write(stu); err != nil {
			log.Println("Write error", err)
		}
	}
	if err = pw.WriteStop(); err != nil {
		log.Println("WriteStop error", err)
		return
	}
	log.Println("Write Finished")
	fw.Close()

	///read
	fr, err := ParquetFile.NewLocalFileReader("flat.parquet")
	if err != nil {
		log.Println("Can't open file")
		return
	}

	pr, err := ParquetReader.NewParquetReader(fr, new(Student), 4)
	if err != nil {
		log.Println("Can't create parquet reader", err)
		return
	}
	num = int(pr.GetNumRows())
	for i := 0; i < num; i++ {
		stus := make([]Student, 1)
		if err = pr.Read(&stus); err != nil {
			log.Println("Read error", err)
		}
		log.Println(stus)
	}

	pr.ReadStop()
	fr.Close()

}

Please check that and thanks for the help

Possible support to have offset in `ColumnReader`

Hey, @xitongsys
I did some experiments today to compare performance between cpp-parquet and go-parquet. The Go implementation is much faster.
(screenshot of the benchmark comparison attached, 2018-02-14)
(Basically, it compares the standard Java implementation, cpp-parquet (called cgo here), and go-parquet. Hope I can fill in the rest in one or two days.)
The Go implementation is 10x+ faster, but also uses 20x+ more memory. The reason for the trade-off, I guess, is that Go keeps multiple buffers for each column and reads them using multiple goroutines. Two suggestions:

  1. In some cases I really want to control my memory use (e.g. on a machine with less memory). The most obvious way is to set np, but that doesn't really work: I tried tuning the parameter and got the same performance result, even when setting it to 0. So I would prefer that the np parameter really enforce the maximum number of live goroutines (maybe using a counter channel; see the sketch after this list).
  2. Could we have a parameter in ColumnReader called offset (like a standard seek), so that we can start reading from a given position? This would give much more flexibility: I could read each column for a given number of rows, record the current position, and close the reader to release memory. That would help avoid keeping too many column buffers alive at the same time.
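
A minimal, standalone sketch of the counter-channel idea from suggestion 1 (nothing here is parquet-go code; readColumn stands in for the per-column work):

package main

import (
	"fmt"
	"sync"
)

// readColumn stands in for whatever per-column work a reader goroutine does.
func readColumn(name string) {
	fmt.Println("reading", name)
}

func main() {
	np := 2 // maximum number of column readers allowed to run at once
	columns := []string{"name", "age", "id", "weight", "sex", "day"}

	sem := make(chan struct{}, np) // counter channel acting as a semaphore
	var wg sync.WaitGroup
	for _, col := range columns {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot; blocks while np readers are running
		go func(c string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			readColumn(c)
		}(col)
	}
	wg.Wait()
}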

BTW, I also found a case where, when I use parquet-tools (i.e. the tool provided in parquet-mr, https://github.com/apache/parquet-mr/tree/master/parquet-tools) to get the head, i.e. the top 10 rows, of a not-very-big parquet file written by go-parquet (around 50MB), it sometimes takes a very long time and even causes parquet-tools to OOM. I think there is a bug there, but I haven't found a simple way to reproduce it. What do you think could be a possible reason for this issue? (All the fields are optional UTF8, PLAIN_DICTIONARY.)

Thanks for the help!

ParquetFile not possible to implement for S3

Hey!

I have a suggestion to update the README.md, as it says:

read/write a parquet file need a ParquetFile interface implemented

type ParquetFile interface {
  Seek(offset int, pos int) (int64, error)
  Read(b []byte) (n int, err error)
  Write(b []byte) (n int, err error)
  Close()
  Open(name string) (ParquetFile, error)
  Create(name string) (ParquetFile, error)
}

Using this interface, parquet-go can read/write parquet 
file on any platform(local/hdfs/s3...)

But with S3, random access is not possible and files are immutable; therefore Seek, and as a result random Read and Write, are not implementable. An alternative would be writing to a local file and uploading it afterwards.

As for HDFS, the ticket to support in-place write is still open.

I understand that it's the columnar nature of Parquet that dictates this; so perhaps we should live without the luxury of direct writes to immutable file systems.

Possible Bug in Parquet Writer

Hey, @xitongsys
Sorry to bother you again (am I opening too many issues? ...but I really want to help the great parquet-go get to production on a bug-free basis).
I have the following code snippet for the JSONWriter (it is copied from the original example, but with repetitiontype=OPTIONAL added to the name field):

md := `
    {
        "Tag":"name=parquet-go-root",
        "Fields":[
		    {"Tag":"name=name, type=UTF8, encoding=PLAIN_DICTIONARY, repetitiontype=OPTIONAL"},
		    {"Tag":"name=age, type=INT32"},
		    {"Tag":"name=time, type=TIMESTAMP_MICROS"},
		    {"Tag":"name=weight, type=FLOAT"},
		    {"Tag":"name=sex, type=BOOLEAN"},
            {"Tag":"name=classes, type=LIST",
             "Fields":[
                  {"Tag":"name=element, type=UTF8"}
              ]
            },
            {"Tag":"name=scores, type=MAP",
             "Fields":[
                 {"Tag":"name=key, type=UTF8"},
                 {"Tag":"name=value, type=LIST",
                  "Fields":[{"Tag":"name=element, type=FLOAT"}]
                 }
             ]
            },
            {"Tag":"name=friends, type=UTF8, repetitiontype=REPEATED"}
        ]
	}
`

	//write
	fw, _ := ParquetFile.NewLocalFileWriter("json.parquet")
	pw, _ := JSONWriter.NewJSONWriter(md, fw, 1)

	num := 10
	for i := 0; i < num; i++ {
		// ,
		// ,
		rec := `
            {
                "age":%d,
                "time":%d,
                "weight":%f,
                "sex":%t,
                "classes":["Math", "Computer", "English"],
                "scores":{
                            "Math":[99.5, 98.5, 97],
                            "Computer":[98,97.5],
                            "English":[100]
                         },
                "friends":["aa","bb"]
            }
        `

		rec = fmt.Sprintf(rec, 20+i%5, i, 50.0+float32(i)*0.1, i%2 == 0)

		pw.Write(rec)

	}
	pw.Flush(true)
	pw.WriteStop()
	log.Println("Write Finished")
	fw.Close()

It throws panic: runtime error: index out of range. The only change I made to the original example is to add repetitiontype=OPTIONAL and delete the name field from rec (if we set that field to optional, I believe we should naturally allow the key name to be missing).
And if I keep the name field optional but set it to non-empty data, like:

rec := `
            {
		"name":"%s",
                "age":%d,
                "time":%d,
                "weight":%f,
                "sex":%t,
                "classes":["Math", "Computer", "English"],
                "scores":{
                            "Math":[99.5, 98.5, 97],
                            "Computer":[98,97.5],
                            "English":[100]
                         },
                "friends":["aa","bb"]
            }
        `

It will write the parquet file successfully, but when I try to read that field with a tool like parquet-tools, I get an error:

Zhouyangs-MacBook-Pro:parquet-go zhouyang$ parquet-tools head json.parquet
Can not read value at 0 in block -1 in file file:/Users/zhouyang/test/go/src/github.com/xitongsys/parquet-go/json.parquet

My guess is that this bug may not actually be related to the JSONWriter but may have something to do with repetitiontype=OPTIONAL. Please check that.
Thanks again for your help!

Why does read require a slice of struct, but write doesn't?

In local_flat.go:

stus := make([]Student, 1)
		if err = pr.Read(&stus); err != nil {
...
stu := Student{
if err = pw.Write(stu); err != nil {

In the most common use cases, parquet is going to be used for very large files (otherwise why use parquet?). Without any knowledge of your implementation, and considering that parquet metadata tells you the number of rows in the file and per row group, I'd expect something like:

var items = make([]Item, numrows)
for i := 0; i<numrows; i++ {
   pr.Read(&items[i])
}

Having to create a fresh slice per row seems inefficient? Please tell me if I'm way off.

Thank you for this great piece of work! Are you looking for collaborators? I'm interested in using your library for a project, and I suspect sooner or later I'll have to extend some part or give up. Some way to ease into the internals would be greatly appreciated. At the moment it looks a little steep to get comfortable with.

Add example for offset seek

At the moment I'm falling back to reading the initial number of rows and not using them.

I understand that ParquetFile exposes a Seek method, but that seems to be a per-byte file seek.
What I think is needed is a way to read rows from a given starting offset.

There may not be a way to implement this more efficiently than just doing the Read fallback, in which case this is a no-op. It may also be very easy to use Seek, in which case I just need some clarification (hence the request for an example).

There are no examples that call seek.

For me it's pretty bad, as my worst-case useless read might be a 2GB read. I expect most people with a Seek requirement to have the same issue, given the nature of parquet files being big.

Let me know if I can help with the code :) Also be sure to reach me on Twitter DM @marianogappa or via email spinetta[at]gmail if you want more fluid communication. Happy to help.

CSV SchemaHandler should support TIMESTAMP_MILLIS

Currently it's missing in the NewSchemaHandlerFromMetadata function.
Suggest changing from
name == "TIME_MICROS" || name == "TIMESTAMP_MICROS"
to
name == "TIME_MICROS" || name == "TIMESTAMP_MILLIS" || name == "TIMESTAMP_MICROS"

newbie issue

Hey @xitongsys thanks for this cool project!
I am a newbie with Go, but I wanted to give the project a try. However, when I do as per the README:

go get github.com/xitongsys/parquet-go

I get:

package github.com/xitongsys/parquet-go: no Go files in ~/src/github.com/xitongsys/parquet-go

I have used go get with other libs and that works fine. Could you tell me what is the obvious thing I am missing? Thanks @xitongsys!

ColumnReader bug: returned data array is not in correct length

Hey @xitongsys,
Long time no see! :)
Sorry, I just found another bug in the column reader. Sometimes data, _, _ := ReadColumnByPath(path, num) doesn't return the correct result. More precisely, if num is 5, it can return data with length 4, etc. I mean the length of data should always be equal to num, even if it contains some nil values in the data array.

Digging a bit more deeply, in ParquetReader/ColumnBuffer.go, func (self *ColumnBufferType) ReadRows(num int64) (*Layout.Table, int64) doesn't return the right table.
The following code snippet should help you debug:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/xitongsys/parquet-go/ParquetFile"
	"github.com/xitongsys/parquet-go/ParquetReader"
	"github.com/xitongsys/parquet-go/ParquetWriter"
)

type Student struct {
	Name   string  `parquet:"name=name, type=UTF8"`
	Age    int32   `parquet:"name=age, type=INT32"`
	Id     int64   `parquet:"name=id, type=INT64"`
	Weight float32 `parquet:"name=weight, type=FLOAT"`
	Sex    bool    `parquet:"name=sex, type=BOOLEAN"`
	Day    int32   `parquet:"name=day, type=DATE"`
}

func main() {
	var err error
	//write
	fw, err := ParquetFile.NewLocalFileWriter("column.parquet")
	if err != nil {
		log.Println("Can't create file", err)
		return
	}
	pw, err := ParquetWriter.NewParquetWriter(fw, new(Student), 4)
	if err != nil {
		log.Println("Can't create parquet writer")
		return
	}
	num := 100000000
	for i := 0; i < num; i++ {
		stu := Student{
			Name:   "StudentName",
			Age:    int32(20 + i%5),
			Id:     int64(i),
			Weight: float32(50.0 + float32(i)*0.1),
			Sex:    bool(i%2 == 0),
			Day:    int32(time.Now().Unix() / 3600 / 24),
		}
		if err = pw.Write(stu); err != nil {
			log.Println("Write error", err)
		}
	}
	if err = pw.WriteStop(); err != nil {
		log.Println("WriteStop error", err)
	}
	log.Println("Write Finished")
	fw.Close()

	fr, err := ParquetFile.NewLocalFileReader("column.parquet")
	if err != nil {
		log.Println("Can't open file", err)
		return
	}
	pr, err := ParquetReader.NewParquetColumnReader(fr, 4)
	if err != nil {
		log.Println("Can't create column reader", err)
		return
	}
	num = int(pr.GetNumRows())
	fmt.Println("number of row: ", num)

	bufferSize := 10
	for num > 0 {
		rowRead := bufferSize
		if num < bufferSize {
			rowRead = num
		}
		d, _, _ := pr.ReadColumnByPath("name", rowRead)
		if len(d) != rowRead {
			fmt.Println("Baddddddd1", len(d), rowRead, d, num)
			return
		}
		d, _, _ = pr.ReadColumnByPath("age", rowRead)
		if len(d) != rowRead {
			fmt.Println("Baddddddd2", len(d), rowRead, num)
			fmt.Println(d)
			return
		}
		d, _, _ = pr.ReadColumnByPath("id", rowRead)
		if len(d) != rowRead {
			fmt.Println("Baddddddd3", len(d), rowRead)
			return
		}
		d, _, _ = pr.ReadColumnByPath("weight", rowRead)
		if len(d) != rowRead {
			fmt.Println("Baddddddd4", len(d), rowRead)
			return
		}
		d, _, _ = pr.ReadColumnByPath("sex", rowRead)
		if len(d) != rowRead {
			fmt.Println("Baddddddd4", len(d), rowRead)
			return
		}
		d, _, _ = pr.ReadColumnByPath("day", rowRead)
		if len(d) != rowRead {
			fmt.Println("Baddddddd4", len(d), rowRead)
			return
		}

		num -= rowRead
	}

	pr.ReadStop()
	fr.Close()

}

The particularly interesting thing is that when I set num := 100000000, the problem happens at row 67183770. When I set num := 200000000, the bug happens at 167183770; with num := 500000000, at 467183770; etc.

Have fun and good luck on debugging~
Thanks for the help

Consider updating NewParquetWriter and NewParquetReader to take *Writer and *Reader respectively

Currently these functions require NewLocalFileReader/NewLocalFileWriter to be called, and only accept *ParquetFile.

This breaks the established Reader/Writer model in Go, and means that things like compression can't be transparently added.

For example, consider the following reader model with bzip2-compressed JSON:


fh, _ := os.OpenFile(op, os.O_APPEND|os.O_RDWR, 0644)
bz, _ := bzip2.NewReader(fh, &bzrc)
if err = json.NewDecoder(bz).Decode(&data); err != nil {
    bz.Close()
    fh.Close()
    return
}

// do stuff with json

Ideally one could layer gzip on top of parquet using:

fh, _ := os.OpenFile(op, os.O_APPEND|os.O_RDWR, 0644)
gz, _ := gzip.NewReader(fh)
pr, err := ParquetReader.NewParquetReader(gz, new(Student), 2)
r := int(pr.GetNumRows())
for i := 0; i < r; i++ {
	s := make([]Student, 1)
	pr.Read(&s)
	ss = append(ss, s[0])
}
pr.ReadStop()
fh.Close()

// do something with students

Apache Thrift dependency broken

It looks like Apache has moved their Thrift repo to a different server. Seems like it's now available on GitHub and their GitBox service. According to their mailing list, the intention is for the GitHub version to become the main one.

Our builds broke today because of this, and it seems like the fix might be to change the import in parquet-go to pull from GitHub.

Too many arguments in call to ts.Write

../../vendor/github.com/xitongsys/parquet-go/Layout/DictPage.go:67:30: too many arguments in call to ts.Write
	have (context.Context, *parquet.PageHeader)
	want (thrift.TStruct)
../../vendor/github.com/xitongsys/parquet-go/Layout/DictPage.go:194:30: too many arguments in call to ts.Write
	have (context.Context, *parquet.PageHeader)
	want (thrift.TStruct)
../../vendor/github.com/xitongsys/parquet-go/Layout/Page.go:251:30: too many arguments in call to ts.Write
	have (context.Context, *parquet.PageHeader)
	want (thrift.TStruct)
../../vendor/github.com/xitongsys/parquet-go/Layout/Page.go:346:30: too many arguments in call to ts.Write
	have (context.Context, *parquet.PageHeader)
	want (thrift.TStruct)

NewLocalFileReader leaking file handles (5 per use)

Machine ran out of file handles when processing files. Tracked it down to NewLocalFileReader.

Repro the issue using the code below (notice that there are file fds in /proc/<pid>/fd/ on Linux after the use of NewLocalFileReader; there are 5 lingering handles to the file).


package main

import (
	"github.com/xitongsys/parquet-go/ParquetFile"
	"github.com/xitongsys/parquet-go/ParquetReader"
	"github.com/xitongsys/parquet-go/ParquetType"
	"github.com/xitongsys/parquet-go/ParquetWriter"
	"log"
	"time"
)

type Student struct {
	Name   ParquetType.UTF8
	Age    ParquetType.INT32
	Id     ParquetType.INT64
	Weight ParquetType.FLOAT
	Sex    ParquetType.BOOLEAN
	Day    ParquetType.DATE
}

func main() {

	fw, _ := ParquetFile.NewLocalFileWriter("/tmp/flat.parquet")
	//write flat
	pw, _ := ParquetWriter.NewParquetWriter(fw, new(Student), 4)
	num := 10
	for i := 0; i < num; i++ {
		stu := Student{
			Name:   ParquetType.UTF8("StudentName"),
			Age:    ParquetType.INT32(20 + i%5),
			Id:     ParquetType.INT64(i),
			Weight: ParquetType.FLOAT(50.0 + float32(i)*0.1),
			Sex:    ParquetType.BOOLEAN(i%2 == 0),
			Day:    ParquetType.DATE(time.Now().Unix() / 3600 / 24),
		}
		pw.Write(stu)
	}
	pw.Flush(true)
	//pw.NameToLower()// convert the field name to lowercase
	pw.WriteStop()
	//log.Println("Write Finished")
	fw.Close()

	log.Println("check file handles after writer (should be none)")
	time.Sleep(time.Second * 30)

	///read flat
	fr, _ := ParquetFile.NewLocalFileReader("/tmp/flat.parquet")
	pr, err := ParquetReader.NewParquetReader(fr, 4)
	if err != nil {
		log.Fatal("Failed new reader", err.Error())

	}
	num = int(pr.GetNumRows())
	for i := 0; i < num; i++ {
		stus := make([]Student, 1)
		pr.Read(&stus)
	}
	fr.Close()

	log.Println("check file after reader handles (should also be none)")
	time.Sleep(time.Second * 1000)
}


Optional values not being honored

It seems that optional values that will not always exist, defined as pointer types, still cause a nil pointer dereference:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0xc0 pc=0x64c055]

goroutine 1 [running]:
github.com/xitongsys/parquet-go/Layout.PagesToChunk(0xc42000e0a8, 0x1, 0x1, 0x1e)
	src/github.com/xitongsys/parquet-go/Layout/Chunk.go:59 +0x4f5
github.com/xitongsys/parquet-go/ParquetWriter.(*ParquetWriter).Flush(0xc420118000, 0x6e1901)
	src/github.com/xitongsys/parquet-go/ParquetWriter/ParquetWriter.go:171 +0x84e
main.main()

Some repro code


type Nested struct {
	Three ParquetType.UTF8
	Four ParquetType.UTF8
}

type MyStruct struct {
	One ParquetType.INT64
	Two ParquetType.INT64
	Nest Nested
	Optional *ParquetType.UTF8	// pointer type because not always present
	Has_bool ParquetType.BOOLEAN
}
func main() {

	j := []byte(`{"one":1502210866,"two":0, "nest":{"three":"aaaa","four":"bbbb"}, "has_bool":true}`)
	n := MyStruct{}
	op := "/tmp/p.test"

	if err := json.Unmarshal(j, &n); err != nil {
		log.Fatal("Can't unmarshal json into new struct", err.Error())
		return
	}

        fmt.Printf("struct: %v\n", n)

	fw, err := ParquetFile.NewLocalFileWriter(op)
	if err != nil {
		log.Println("ERROR: Cannot open file ", op)
		return
	}
	pw, err := ParquetWriter.NewParquetWriter(fw, new(MyStruct), 1)
	if err != nil {
		log.Println("ERROR: Cannot create writer")
		return
	}

	pw.Write(n)
	pw.Flush(true)		// crashes in here
	pw.NameToLower()
	pw.WriteStop()
	fw.Close()

}
